Why We Need Arabic Language Models
24 August 2025
Published online 24 August 2025
Building strong Arabic language models is a strategic step to ensure the Arab world’s active role in shaping the future of artificial intelligence.
In the global race to develop generative AI models, attention tends to center on major companies and research institutes in the West and China. Flagship models, including OpenAI’s GPT-4 and Google’s Gemini, are trained on vast amounts of data, predominantly in English and other Western languages, and therefore tend to reflect the cultural assumptions and values of the context in which they were developed.
The growing reliance on language models that do not necessarily reflect the richness and diversity of the Arabic language poses a significant challenge. It’s not simply a matter of technical preference, but one that raises questions of cultural sovereignty, technological independence, and national identity.
Widely used models such as ChatGPT have the potential to shape perceptions and ideas. When trained on data from different cultural contexts, these models can generate responses that sideline core Arab values or remain vague on critical issues.
Clear examples emerge when global language models address culturally sensitive issues, such as social relationships or political debates. They often adopt ambiguous positions that overlook the Arab cultural context, creating a gap between these digital tools and the values and lived experiences of Arab users.
The lack of robust and competitive Arabic language models forces researchers and developers across the region to rely on tools that fail to capture the linguistic complexity of Arabic, its dialects, or its cultural contexts. This dependence constrains the ability to design AI applications and services tailored to local needs, while also weakening the Arab world’s contribution to global AI advancement. In many ways, language models serve as a mirror of our research and innovation capacity.
In response to this challenge, promising initiatives have emerged across the Arab world. These include the UAE’s ‘Jais,’ Saudi Arabia’s ‘ALLaM,’ and Qatar’s ‘Fanar,’ the last of which was developed by the Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU) in collaboration with government partners. These initiatives are part of broader strategic efforts to localize technology, safeguard cultural identity, and build technological self-reliance.
Developing such models, however, comes with significant challenges. One of the most persistent is the scarcity of high-quality Arabic content online compared with English. While Fanar was trained on more than half a trillion Arabic words, this remains modest when compared to global models trained on trillions of tokens. The quality of available Arabic data also varies widely in accuracy and linguistic style, and the considerable differences between Modern Standard Arabic and regional dialects make data collection and representation more complex.
Another major challenge is the high cost of training large language models. For example, training a 7-billion-parameter model on a trillion words requires more than 220 H100 GPUs running continuously for over a month, resources that are beyond the reach of most research institutions in the Arab world. This led the Fanar team to focus on developing smaller models with seven and nine billion parameters, prioritizing improvements in data quality and optimization techniques to deliver the best possible performance with the resources available.
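To give a sense of where an estimate of this scale can come from, the sketch below uses a widely cited rule of thumb for dense transformer models: training compute is roughly 6 × parameters × tokens. It is not the Fanar team's own calculation. The tokens-per-word ratio, the per-GPU peak throughput of about 1e15 FLOP/s, and the utilization values are all illustrative assumptions, and the function name `training_days` is hypothetical.

```python
# Back-of-envelope training-time estimate.
# Rule of thumb (an approximation, not an exact law):
#   total training FLOPs ≈ 6 × parameters × tokens
# All numeric inputs used below are illustrative assumptions.

SECONDS_PER_DAY = 86_400


def training_days(params, words, tokens_per_word, gpus,
                  peak_flops_per_gpu, utilization):
    """Estimate wall-clock days needed to train a dense model."""
    tokens = words * tokens_per_word            # words -> tokenizer tokens
    total_flops = 6 * params * tokens           # 6ND approximation
    cluster_flops_per_sec = gpus * peak_flops_per_gpu * utilization
    return total_flops / cluster_flops_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    # 7B parameters, one trillion words, 220 GPUs, as in the text.
    # Assumed: ~1e15 FLOP/s peak per GPU; tokens-per-word and
    # utilization are swept because both are uncertain.
    for tokens_per_word in (1.0, 2.0, 3.0):
        for utilization in (0.2, 0.4):
            days = training_days(7e9, 1e12, tokens_per_word,
                                 220, 1e15, utilization)
            print(f"tokens/word={tokens_per_word:.0f}  "
                  f"utilization={utilization:.0%}  ->  {days:.1f} days")
```

The result is highly sensitive to the assumptions. Arabic text often splits into several tokens per word, and real training runs rarely sustain peak hardware throughput, so estimates under this rule can range from under a week to over a month on the same 220 GPUs.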
Addressing the challenges of cultural and technological dependency requires collaboration across multiple sectors. Academic and research institutions need to invest in Arabic language processing and build international partnerships to maximize resources and expertise. Governments and policymakers, in turn, should provide sustained funding; support data infrastructure; promote policies that facilitate the collection and organization of high-quality Arabic datasets; and foster collaboration between the public and private sectors, which is essential for building a supportive ecosystem for technological innovation.
Startups and developers in the region also have a role to play. Both should adopt Arabic language models to build applications that respond to local needs, from AI-personalized education platforms to voice assistants in regional dialects. Cultural, educational, and media institutions, meanwhile, can contribute by generating diverse, high-quality Arabic digital content that can be used to train these models.
Building robust Arabic language models is not a technological luxury, but a strategic necessity to ensure that the Arab world has a voice in shaping the future of AI. While significant progress has been made, the path ahead requires sustained investment and collective effort from stakeholders across the region.
This is a translation of the Arabic article published on 3 August 2025.
doi:10.1038/nmiddleeast.2025.142