BeitTigreAI

📚 Overview

Tigre is a language of significant demographic and linguistic importance in the Horn of Africa:

It serves as the native tongue for the Tigre ethnic group, who constitute a sizable portion of Eritrea's population.
It is used as a second language by other communities living in close proximity to Tigre-speaking areas.
It is also the primary language of the Tigre people in Sudan, who share strong cultural and linguistic ties with their counterparts in Eritrea.
From a linguistic standpoint, Tigre is often described as a "linguistic fossil" because it has preserved many archaic features of Geʽez, an ancient Semitic language. This makes it a valuable resource for scholars studying the historical evolution of, and relationships within, the Semitic language family.

Despite its historical and cultural significance, the Tigre language remains significantly underrepresented in Natural Language Processing (NLP). To address this gap, Tigre speakers in Eritrea and the diaspora have launched this community-led project. We are building a comprehensive set of open resources and tools to support NLP research and development for the language. We believe advancements in Tigre NLP can pave the way for similar progress in Geʽez, its closest relative and a historically significant Semitic language.

Both Tigre and Geʽez are in the process of being added to the Meta/UNESCO Language Technology Partner Program and to Google Translate. Through the Meta/UNESCO program, low-resource languages gain vital digital tools such as translation, speech, and text technologies, ensuring their inclusion on global platforms. At the same time, the community is translating the open-source SMOL (Set of Maximal Overall Leverage) dataset from English into Tigre and Geʽez, which is a required step before Google Translate can add support for a new language. Once this work is finished, Google Translate will proceed to include Tigre and Geʽez in its upcoming release cycle.

🧩 Data, Models & Resources

Category	Type	Size	Description	URI
Data	Speech	800 hours	Native speaker audio recordings in various Tigre dialects	available soon
Data	Speech/Text (Aligned)	40 hours	Time-aligned speech and transcription pairs	available soon
Data	Monolingual Text	15 million tokens	Unlabeled text corpus drawn from books, articles, and online publications	tigre-monolingual-text
Data	Parallel Text	250k sentence pairs	Multilingual parallel sentences	available soon
Trained Models	Neural Machine Translation	5.17 GB	Translation (Tigre ↔ others)	tigre-nllb-200-distilled-600M
Trained Models	ASR (Speech-to-text)	2.45 GB	Automatic Speech Recognition	tigre-asr-Wav2Vec2Bert
Trained Models	TTS (Text-to-speech)	—	Synthesizes speech from text	available soon
Trained Models	LLM pretrained	10.6 GB	Meta Llama 3.2 1B large language model	tigre-llm-Llama3.2-1B
Trained Models	XLM-RoBERTa-base	1.14 GB	Encoder-only, multilingual language model	tigre-xlm-roberta-base
Other Resources	Dictionary	~6,200 entries	Trilingual dictionary (Tigre, Tigrinya, and English) by Memher Mussie Bekheit	available soon
Other Resources	Language Model (kenLM)	5-gram model (~1.2 GB)	Statistical n-gram language model trained on monolingual Tigre text, used in ASR and NLP pipelines.	tigre-data-kenLM
Other Resources	Word Embeddings (fastText)	4.04 GB	300-dimensional fastText vectors for Tigre, supporting subword modeling and morphological understanding.	tigre-data-fasttext
Other Resources	Lexicon (Segmented)	~6,200 entries	Segmented lexicon for NLP tasks: text normalization, grapheme-to-phoneme (G2P), morphological parsing, phonetic analysis, and speech processing (ASR/TTS).	tigre-data-lexicon
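
As a rough illustration of how the identifiers in the URI column might be used, the Python sketch below maps the published rows to Hugging Face Hub repository IDs and lazily loads the translation model. It rests on assumptions that the table does not confirm: that the resources are hosted on the Hugging Face Hub under a namespace named `BeitTigreAI`, and that the translation checkpoint loads through the standard `transformers` Auto classes. The names `HUB_NAMESPACE`, `Resource`, `PUBLISHED`, `repo_id`, `find`, and `load_translation_model` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass

# Assumption: resources are published under this Hugging Face namespace.
# The table lists only short names, so change this if the real owner differs.
HUB_NAMESPACE = "BeitTigreAI"


@dataclass(frozen=True)
class Resource:
    category: str
    kind: str
    name: str  # short identifier copied from the URI column


# Rows from the table that already list an identifier (not "available soon").
PUBLISHED = [
    Resource("Data", "Monolingual Text", "tigre-monolingual-text"),
    Resource("Trained Models", "Neural Machine Translation", "tigre-nllb-200-distilled-600M"),
    Resource("Trained Models", "ASR (Speech-to-text)", "tigre-asr-Wav2Vec2Bert"),
    Resource("Trained Models", "LLM pretrained", "tigre-llm-Llama3.2-1B"),
    Resource("Trained Models", "XLM-RoBERTa-base", "tigre-xlm-roberta-base"),
    Resource("Other Resources", "Language Model (kenLM)", "tigre-data-kenLM"),
    Resource("Other Resources", "Word Embeddings (fastText)", "tigre-data-fasttext"),
    Resource("Other Resources", "Lexicon (Segmented)", "tigre-data-lexicon"),
]


def repo_id(resource: Resource, namespace: str = HUB_NAMESPACE) -> str:
    """Build a '<namespace>/<name>' repository ID for a table row."""
    return f"{namespace}/{resource.name}"


def find(kind: str) -> Resource:
    """Return the single published row whose type contains `kind` (case-insensitive)."""
    matches = [r for r in PUBLISHED if kind.lower() in r.kind.lower()]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match for {kind!r}, found {len(matches)}")
    return matches[0]


def load_translation_model(namespace: str = HUB_NAMESPACE):
    """Download and return (tokenizer, model) for the translation checkpoint.

    transformers is imported lazily so the catalogue helpers above work
    without it installed. Calling this function needs network access.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    rid = repo_id(find("Machine Translation"), namespace)
    tokenizer = AutoTokenizer.from_pretrained(rid)
    model = AutoModelForSeq2SeqLM.from_pretrained(rid)
    return tokenizer, model
```

Calling `repo_id(find("asr"))` yields the repository ID for the speech recognition model under the assumed namespace. The language codes that the translation model expects for Tigre are not stated in the table, so check the model's own card before generating translations.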

Reference

📚 Cite This Work

⚠️ Disclaimer

📄 License