A Human-Powered Approach to Punjabi Language Preservation in AI

Introduction

When I first learned about Cohere For AI's ambitious Project Aya, aiming to build a state-of-the-art multilingual language model, I was thrilled. But my excitement quickly turned to dismay when I realized my mother tongue, Punjabi, was severely underrepresented. As a Punjabi speaker and a software engineer at AWS, I knew I had to do something about it. My familiarity with Dr. Sara Hooker's groundbreaking work on model evaluation and interpretability in AI, particularly her insightful paper, "Moving Beyond 'Algorithmic Bias is a Data Problem,'" was a significant motivator. This paper, which argues for a more holistic approach to evaluating machine learning models beyond just focusing on data, resonated deeply with the challenges I faced in ensuring fair representation for Punjabi in Project Aya. Dr. Hooker's leadership at Cohere For AI and her dedication to responsible AI development are nothing short of inspirational, making it an honor to contribute to an initiative she spearheaded. This article chronicles my journey with Project Aya, the challenges I faced, the lessons I learned, and the broader implications of this work for the future of AI and underrepresented languages.

Discovering Project Aya and the Punjabi Gap

I joined Cohere's community to explore new challenges and learn from leading experts in the field of Natural Language Processing (NLP). However, I was shocked to discover that Punjabi, a language spoken by over 100 million people globally, lacked substantial representation in Project Aya. Although I was joining six months after the program had begun, I felt a deep sense of responsibility and a personal connection to the cause. I immediately introduced myself to the team and volunteered as a language ambassador for Punjabi.

Securing Approval and Facing Initial Challenges

My first hurdle was securing approval from my employer, AWS, to officially contribute to this open-source project. After a month-long process involving discussions with my manager and senior leadership (L7), I finally gained the green light. Yet, as I soon discovered, the real challenges lay ahead. I initially believed that my experience in persuading people to undertake various tasks would make recruiting contributors a breeze. I could not have been more wrong.

Reconnecting with My Mother Tongue

Though Punjabi is my mother tongue and I read and spoke it regularly, I faced an unexpected hurdle: writing it fluently. Years of primarily using English for education and work had taken their toll. Writing in Punjabi, especially on a computer, felt unfamiliar. English keyboard layouts do not map naturally onto the Gurmukhi script, which led to frequent errors. This experience highlighted a disconnect many in my generation face: an emotional and cultural distance from our linguistic roots. I realized that even though I was comfortable speaking and reading the language, actively writing Punjabi, especially typing it, had become a challenge.

Early Outreach Efforts: Friends, Family, and Gurdwaras

To gather contributors, I began reaching out to friends and family. It was a unique opportunity to reconnect with school and college friends, but persuading them to participate wasn't easy. The process also revealed my own need to relearn and embrace Punjabi writing. It had been a while since any of us had used Punjabi regularly, especially in written form. Typing in the Gurmukhi script felt unfamiliar, and we found ourselves struggling to recall the correct spellings and grammatical structures. We often made mistakes while typing, especially with matras (the vowel signs). I had not expected this at all.

For the next four months, I dedicated myself to recruiting contributors and generating data for Project Aya. Armed with posters and a commitment to educating people about the project's importance, I visited local Gurdwaras and Sikh community centers, mostly in and around the Seattle area, with one visit each to California and my hometown, Chandigarh. I stood there, explaining the significance of Project Aya and the need for Punjabi representation in AI. My family and friends helped me spread the word to other states and countries too, including Canada and the UK. Despite engaging with hundreds of people, the conversion rate remained disappointingly low.

Observations on the Community and the Challenge of Distrust

This journey gave me a deeper understanding of my community. I observed a generational drift from our linguistic and cultural heritage, fueled by an education system dominated by English and various socio-political factors. The higher education system, primarily conducted in English, further reinforces this disconnect. Many young Punjabis, while fluent in spoken Punjabi, lack proficiency in reading and writing Gurmukhi.

However, amidst this reality, I also met people dedicated to preserving Gurmukhi and Sikh heritage. I encountered a recurring challenge: a sense of distrust towards AI initiatives, particularly among those dedicated to preserving Punjabi language and culture. This hesitancy stemmed from a complex history that I could only briefly acknowledge in my outreach efforts. While I chose not to delve into the specifics during my work on Project Aya, I recognized that this distrust is a significant factor that warrants further exploration. I plan to address this issue in a separate article, examining the historical context and its impact on community engagement with technology. For now, it was important to acknowledge its presence and focus on the immediate task of data collection.

Expanding the Search Online

While engaging with the local Sikh community was crucial, I knew that to truly bolster Punjabi's representation in Project Aya, I needed to expand my search online. I reached out to several academics and researchers through email and professional networking platforms. However, I was met with a recurring challenge: many of the individuals I contacted, despite their impressive backgrounds and PhDs in related fields, had transitioned to more general AI roles. Some were no longer actively working with Punjabi, having shifted their focus to languages with more readily available data and research opportunities.

I found that many NLP professors in India, even those who were once deeply involved with Punjabi language research, had eventually pivoted to English. This was particularly striking because these were the individuals who truly understood the intricacies of the language. This situation highlighted a broader challenge: the dominance of English in the tech world and the resulting brain drain from less-represented languages. The efforts of organizations that supervise such projects, with their expert teams and actual Punjabi literature experts gathering data and training models, are commendable in this regard. However, even these supervised approaches can be susceptible to the personal biases of the individuals involved, a challenge that is arguably even more difficult to address in a fully crowdsourced model.

Accelerating Data Creation with a Human-Powered, NLP-Assisted Workflow

With time running out and limited resources, I turned to innovative methods to accelerate progress on Project Aya. Recognizing the critical need for high-quality, human-verified data, I developed a system that leveraged the power of NLP, coupled with the irreplaceable nuance of human understanding. This human-in-the-loop approach became the cornerstone of my contribution.

The core of this system was a workflow designed to generate Punjabi question-answer pairs, a crucial dataset for training the Aya model. Here's a breakdown of the process:

1. Content Acquisition: The process began with ethically sourced Punjabi news articles, generously provided by a fellow researcher. This ensured that the foundation of our data was grounded in real-world language usage and current events.
2. Language Choice Flexibility: Recognizing that contributors might be more comfortable working in either English or Punjabi, the system was designed to be bilingual. We employed Google's Translator API to convert the Punjabi text into English, allowing the user to initiate the process in their preferred language.
3. Automated Question Generation (English): A set of English questions was automatically generated from the news articles using existing question generation code (question_generator). Full credit goes to the original developers of that code. This automated step significantly reduced the time and effort required to create a diverse set of questions.
4. Human-in-the-Loop Refinement (English): This is where the human element became paramount. The automatically generated questions were presented to a human operator (myself, initially, and later collaborators). This crucial step allowed for:

   - Quality Control: Reviewing the questions for relevance, clarity, and grammatical correctness.
   - Contextual Adaptation: Modifying or entirely rewriting questions to better reflect the nuances of the news article and the intended learning outcomes for the AI model.
   - Creative Expansion: Creating entirely new questions that the automated system might have missed, ensuring a comprehensive exploration of the subject matter.

5. Answer Generation and Verification (English): To further assist the human operator, we integrated open-source QA models available through Hugging Face, such as BERT-based models, to suggest answers based on the refined questions. However, recognizing the limitations of even the most advanced models, this step was not fully automated. The generated answers served as a guide, helping to pinpoint the relevant information within the news article. Human intervention was critical to:

   - Accuracy Verification: Cross-referencing the suggested answer with the original text to ensure its factual correctness.
   - Contextual Nuance: Refining the answer to capture the full meaning and context of the information presented in the article.
   - Error Correction: Addressing any inaccuracies or misinterpretations made by the QA model.

6. Translation to Punjabi and Final Refinement: Once the English question-answer pair was meticulously crafted and verified, the human-provided answer was translated back into Punjabi using the Translator API. This is where our fluency in Punjabi became invaluable. We meticulously reviewed the translated answer, correcting any errors introduced during the automated translation process, with particular attention to matras (vowel signs), which were often rendered incorrectly. This iterative process of review and refinement ensured the highest possible quality of the Punjabi data.
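
For readers who want to picture how these steps fit together, here is a minimal Python sketch. It is not the code we actually ran: every name in it (build_qa_pairs, QAPair, suspect_matras, and the four callables) is hypothetical, and the translation, question generation, and QA services are passed in as plain functions rather than tied to any real API. The sketch also assumes that both the question and the answer are translated to Punjabi at the end. The matra check is a deliberately simplified heuristic that only flags vowel signs not directly following a consonant, nukta, or vowel bearer; it can flag legitimate spellings, so it is meant to direct a human reviewer's attention, not to replace the review.

```python
from dataclasses import dataclass
from typing import Callable, List

# Gurmukhi dependent vowel signs (matras) assigned in Unicode.
MATRAS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")


def is_matra_base(ch: str) -> bool:
    """True if a matra may directly follow ch (simplified, assumed rule)."""
    cp = ord(ch)
    return (
        0x0A15 <= cp <= 0x0A39       # consonants ka..ha
        or 0x0A59 <= cp <= 0x0A5E    # additional consonants
        or cp == 0x0A3C              # nukta
        or cp in (0x0A72, 0x0A73)    # vowel bearers iri and ura
    )


def suspect_matras(text: str) -> List[int]:
    """Indexes of matras that do not directly follow a valid base character."""
    return [
        i for i, ch in enumerate(text)
        if ch in MATRAS and (i == 0 or not is_matra_base(text[i - 1]))
    ]


@dataclass
class QAPair:
    question_pa: str
    answer_pa: str
    needs_matra_check: bool  # heuristic hint for the human reviewer


def build_qa_pairs(
    article_pa: str,
    translate: Callable[[str, str], str],            # (text, target language) -> text
    generate_questions: Callable[[str], List[str]],  # English article -> draft questions
    suggest_answer: Callable[[str, str], str],       # (question, article) -> draft answer
    review: Callable[[str, str], str],               # (stage, draft) -> approved text, "" rejects
) -> List[QAPair]:
    """Walk one Punjabi article through the six steps described above."""
    article_en = translate(article_pa, "en")
    pairs = []
    for draft_question in generate_questions(article_en):
        question_en = review("question", draft_question)
        if not question_en:
            continue  # the human reviewer rejected this generated question
        # The model's answer is only a suggestion; the reviewer has the final say.
        answer_en = review("answer", suggest_answer(question_en, article_en))
        # "pa" is the ISO 639-1 code for Punjabi (assumed to be what translate expects).
        question_pa = review("question_pa", translate(question_en, "pa"))
        answer_pa = review("answer_pa", translate(answer_en, "pa"))
        flagged = bool(suspect_matras(question_pa) or suspect_matras(answer_pa))
        pairs.append(QAPair(question_pa, answer_pa, flagged))
    return pairs
```

Passing the services in as functions keeps the human review step explicit at every stage, and it means a different translator or QA model can be swapped in without touching the loop.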

Efficiency Gains

This human-powered, NLP-assisted workflow dramatically accelerated the data creation process. By automating the initial question generation and providing AI-powered assistance for answer retrieval, we achieved a tenfold increase in efficiency compared to a purely manual approach. My friends and I spent countless hours leveraging this system, meticulously crafting and refining the question-answer pairs.

Impact

This innovative approach not only addressed the immediate challenges faced by Project Aya but also contributed to the preservation and revitalization of the Punjabi language through effective human-machine collaboration. Our method empowered us to generate a substantial volume of high-quality training data, significantly boosting Punjabi's representation within the Aya model.

Ethical Considerations

While focusing on data collection, I remained acutely aware of the ethical implications, particularly regarding data sourcing and the potential for bias. I made sure that the news articles used in my project were sourced ethically, which, in this context, meant they were publicly available and did not violate any copyright or privacy concerns.

However, I also recognized that these constraints, while necessary, had limitations. The reliance on a single source, even a diverse one like a collection of news articles, inevitably introduced a certain degree of bias into our dataset. The initial model performed poorly on Punjabi compared with the other languages in the project, which may reflect the inherent biases in the source. News articles, by their nature, tend to focus on specific types of events and may not fully represent the richness and diversity of everyday language use.

This experience underscored a fundamental challenge in training AI models for underrepresented languages. While striving for ethical data sourcing is paramount, it can sometimes be at odds with the need for large and diverse datasets that accurately reflect the nuances of a language. I was truly disappointed with the initial performance of the model when it came to Punjabi. It was clear that the limitations of our source material had impacted the model's ability to generalize and perform well on a wider range of tasks. Despite the careful data curation and the innovative workflow we developed, the model's performance was still hampered by the inherent biases and limitations of our ethically sourced dataset.

This realization, though disappointing, served as a valuable lesson. It highlighted the importance of ongoing research into methods for mitigating bias in AI models, even when working with limited data. It also reinforced my commitment to advocating for more resources and attention to be directed towards underrepresented languages in AI research. While I know I did the right thing from my end by adhering to ethical sourcing principles, the experience highlighted the systemic challenges that need to be addressed to achieve true equity in AI. I plan to further explore the broader ethical dimensions of crowdsourced data collection and bias mitigation in future work, particularly focusing on strategies for creating more representative datasets for underrepresented languages. This also raises a larger question about the evaluation metrics used in multilingual models, and whether they adequately capture the nuances of languages with limited data representation.

Results and Reflections

Through our efforts, my friends and I became some of the top contributors to Project Aya, with me personally ranking in the top 10 globally. While this was a significant achievement, it was also evident that Punjabi remained far behind other languages in terms of overall contributions. The project was not just a technical endeavor but a deeply personal journey of reconnection with my roots and a mission to preserve Punjabi for future generations.

While the road was fraught with challenges—from persuading contributors to navigating the complexities of language representation in AI—the experience was incredibly rewarding. It underscored the importance of responsible AI development and the role of community-driven initiatives in preserving linguistic diversity.

Looking Ahead

Projects like Aya are vital in ensuring that underrepresented languages find their place in the AI landscape. They highlight the need for collaboration, ethical data practices, and inclusivity in technological advancement. For me, this journey was a testament to the power of persistence and the impact of collective efforts.

As AI continues to shape the future, initiatives like Project Aya remind us of the importance of inclusivity, heritage, and responsibility. It's a lesson I'll carry forward in my work, and I hope it inspires others to contribute to preserving their linguistic and cultural identities through technology. The journey to build truly representative AI is a long and complex one, but it is a journey worth undertaking. It is a journey that requires us to work together to create a future where technology serves all of humanity, not just a select few.