India is the second-largest market in the world for AI tools such as OpenAI's ChatGPT and Anthropic's Claude, yet the country's vast linguistic diversity — 22 official languages and over 100 dialects spoken by more than a billion people — remains a critical barrier to democratising artificial intelligence. As AI systems move from English-dominated benchmarks to real-world deployment in voice-driven economies, the challenge of making models that truly understand Indic languages has become the defining test of India's AI ambitions.
The Scale of India's Language Challenge
Over one billion people speak Indic languages, yet the training data available for these languages remains minuscule. Bengali, spoken by more than 280 million people, constitutes less than 0.1 per cent of all text available on the web. GPT-5, OpenAI's most advanced model, achieves only about 45 per cent accuracy on a human-curated benchmark covering 11 Indic languages, among them Gujarati, the mother tongue of Prime Minister Narendra Modi. This performance gap between English and Indic languages is not merely a technical inconvenience; it determines whether AI becomes a tool for inclusive growth or another technology that divides the English-speaking elite from the rest of the population.
Voice: The Primary Interface in South Asia
The challenge is amplified by the fact that voice, not text, is the primary mode of interaction for hundreds of millions of Indians. Phone calls, WhatsApp voice memos, speech-based payment confirmations, and voice-enabled coding tools dominate everyday digital life. Sandeep Chinchali, co-founder of Andreessen Horowitz-backed startup Poseidon, notes that South Asia "uses voice for everything." AI systems that cannot comprehend Bengali voice notes, Gujarati payment queries, or code-switched Hindi-English business calls are "useless" for automation and potentially "dangerous" when deployed in public services such as healthcare and legal aid. Indian AI startups like Sarvam AI have been vocal about India needing models that "understand our voices" and "read our documents", warning that standard AI evaluation metrics were not designed for India's linguistic complexity.
The Data and Benchmarking Gap
Building AI that works across Indic languages faces a twin challenge: quantity and quality of data. While the first generation of AI models was trained almost exclusively on English internet text, newer models have improved in non-English and low-resource languages. However, the lack of proper benchmarks for non-English models remains a critical gap. Leading AI systems cannot even agree on what constitutes proper Bengali, a language spoken by more people than most European languages combined. Sarvam AI launched a new evaluation benchmark for Indic speech recognition in April 2026, arguing that existing metrics distort performance assessment. IBM's MILU (Multi-task Indic Language Understanding) benchmark and OpenAI's IndQA framework represent early attempts to address this gap, but the ecosystem of Indic-language evaluation remains fragmented. The projection that AI could add $1 trillion to India's GDP by 2035 hinges critically on whether these language barriers can be overcome.
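To make the point about metrics concrete, here is a minimal sketch of word error rate (WER), a standard speech recognition metric. The example sentences are hypothetical, and the scenario (an English loanword written in Latin script in the reference but in Devanagari in the transcript) is an illustrative assumption about one way a string-matching metric can mislead on code-switched speech. It is not a description of Sarvam AI's benchmark or of its specific criticisms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched Hindi-English utterance ("there is a meeting tomorrow").
reference = "कल meeting है"      # annotator kept the English word in Latin script
hypothesis = "कल मीटिंग है"       # model wrote the same spoken word in Devanagari

# The two strings represent the same speech, but exact string matching
# counts the script difference as a full word error.
print(word_error_rate(reference, hypothesis))
```

Under these assumptions the transcript is penalised for one of its three words even though a listener would judge it correct, which is the kind of effect a benchmark designed for Indic speech would need to account for.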
The Safety and Ethical Dimension
Researchers have found that safety alignment tends to deteriorate when people engage with AI in low-resource languages. This creates a troubling dynamic: the populations most likely to be left behind by the AI revolution — those who communicate primarily in regional languages — are also the least protected from its risks. As AI systems move into schools, hospitals, courts, and public services, the language gap becomes a safety gap. There is also a data ethics dimension: crowdsourcing language data has a history of poor pay and exploitation. A Stanford research paper warned that quality and ethical sourcing are key challenges when scaling such efforts. Poseidon's Chinchali is exploring blockchain tools to give data contributors more control over how their voice data is deployed, aiming to prevent data from being "leaked out to train foreign AI tools" without consent or compensation.
Government and Industry Response
Prime Minister Modi has positioned AI as a tool for "democratised" access and "a medium of inclusion and empowerment, especially across the Global South." The Indian government's Bhashini platform, launched to collect spoken data and improve multilingual models, represents a significant policy push. The IndiaAI Mission, with its Rs 10,372 crore outlay, includes compute infrastructure, startup funding, and skill development. Research labs such as AI4Bharat at IIT Madras have pioneered open-source models including IndicBERT, IndicBART, and the Airavata LLM family, trained on diverse datasets like IndicCorpora. The lab's nationwide initiative aims to gather 15,000 hours of transcribed speech data from over 400 districts, covering all 22 scheduled languages. Meanwhile, OpenAI released an evaluation framework for Indian culture and language, and IBM's Granite 4.0 model showed promising performance on Indic benchmarks after training on nearly 100 billion tokens of Indian-language data.
The Path Forward for Indic AI
India's language divide is not an insurmountable barrier — it is a solvable engineering and data problem. The country has unique advantages: a massive pool of engineering talent, strong digital public infrastructure (DPI) including Aadhaar and UPI, government commitment to indigenous AI development, and a vibrant startup ecosystem building Indic-language AI products. The missing pieces are sustained investment in high-quality multilingual training data, robust and standardised benchmarks for Indic language performance, policy frameworks that mandate language inclusivity in AI systems deployed for public services, and ethical data sourcing mechanisms that fairly compensate contributors. If India can bridge the language divide, it will not only unlock AI for its own billion-plus citizens but create a blueprint for multilingual AI deployment that can be replicated across the Global South. If it fails, AI will remain a technology that serves the English-speaking few, while the voices of hundreds of millions go unheard.
Frequently Asked Questions
Why is India's language diversity a problem for AI?
AI models are trained primarily on English internet text, while Indic languages, spoken by over a billion people, are barely represented in that data. Bengali alone has more than 280 million speakers yet makes up less than 0.1 per cent of all text on the web. The result is poor accuracy on tasks in these languages.
How accurate are current AI models on Indian languages?
GPT-5 achieves only about 45 per cent accuracy on a human-curated benchmark covering 11 Indic languages. Performance degrades further for low-resource languages with less available training data.
What is the Indian government doing about this?
The government has launched the Bhashini platform to collect spoken data and improve multilingual models, allocated Rs 10,372 crore under the IndiaAI Mission, and mandated a three-language policy in schools. AI4Bharat at IIT Madras is leading open-source Indic language AI development.
Which Indian startups are working on Indic language AI?
Sarvam AI, Poseidon, Gnani.ai, and AI4Bharat are among the leading startups and research labs building Indic-language AI models, with focus areas spanning speech recognition, translation, and domain-specific LLMs.
Why does voice matter more for India than text?
Voice is the primary interface in South Asia — phone calls, WhatsApp voice memos, speech-based payments, and voice-enabled tools dominate daily digital life. Text-based AI models exclude hundreds of millions who primarily communicate via voice in regional languages.
What benchmarks exist for Indic language AI?
Key benchmarks include OpenAI's IndQA (12 languages, 10 domains), IBM's MILU (11 languages, 41 subjects), Sarvam AI's Indic speech recognition benchmark, and Google's IndicGenBench (29 languages, 4 language families).




