BeitTigreAI

Home of AI resources for the Tigre language

📚 Overview

Tigre is a language of significant demographic and linguistic importance in the Horn of Africa.

Despite its historical and cultural significance, Tigre remains underrepresented in Natural Language Processing (NLP). To address this gap, Tigre speakers in Eritrea and the diaspora have launched this community-led project to build a comprehensive set of open resources and tools for NLP research and development. We believe advances in Tigre NLP can pave the way for similar progress in Geʽez, its closest relative and a historically significant Semitic language.

Both Tigre and Geʽez are in the process of being added to the Meta/UNESCO Language Technology Partner Program and to Google Translate. Through the Meta/UNESCO program, low-resource languages gain vital digital tools such as translation, speech, and text technologies, which supports their inclusion on global platforms. In parallel, the community is translating the open-source SMOL (Set of Maximal Overall Leverage) dataset from English into Tigre and Geʽez, a required step before Google Translate can add support for a new language. Once this work is finished, Google Translate will proceed to include Tigre and Geʽez in its upcoming release cycle.

🧩 Data, Models & Resources

| Category | Type | Size | Description | URI |
| --- | --- | --- | --- | --- |
| Data | Speech | 800 hours | Native speaker audio recordings in various Tigre dialects | available here soon |
| Data | Speech/Text (Aligned) | 40 hours | Time-aligned speech and transcription pairs | available here soon |
| Data | Monolingual Text | 15 million tokens | Unsupervised text corpus including books, articles, and online publications | tigre-monolingual-text |
| Data | Parallel Text | 250k sentence pairs | Multilingual parallel sentences | available here soon |
| Trained Models | Neural Machine Translation | 5.17 GB | Translation (Tigre ↔ others) | tigre-nllb-200-distilled-600M |
| Trained Models | ASR (Speech-to-text) | 2.45 GB | Automatic Speech Recognition | tigre-asr-Wav2Vec2Bert |
| Trained Models | TTS (Text-to-speech) | | Synthesizes speech from text | available soon |
| Trained Models | LLM pretrained | 10.6 GB | Meta Llama Large Language Model | tigre-llm-Llama3.2-1B |
| Trained Models | XLM-RoBERTa-base | 1.14 GB | Encoder-only, multilingual language model | tigre-xlm-roberta-base |
| Other Resources | Dictionary | ~6,200 entries | Trilingual dictionary in Tigre, Tigrinya, and English, by Memher Mussie Bekheit | available soon |
| Other Resources | Language Model (KenLM) | 5-gram model (~1.2 GB) | Statistical n-gram language model trained on monolingual Tigre text, used in ASR and NLP pipelines | tigre-data-kenLM |
| Other Resources | Word Embeddings (fastText) | 4.04 GB | 300-dimensional fastText vectors for Tigre, supporting subword modeling and morphological understanding | tigre-data-fasttext |
| Other Resources | Lexicon (Segmented) | ~6,200 entries | Segmented lexicon for NLP tasks: text normalization, grapheme-to-phoneme (G2P), morphological parsing, phonetic analysis, and speech processing (ASR/TTS) | tigre-data-lexicon |
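
The fastText entry above mentions subword modeling. As a rough illustration of what that means, the sketch below extracts fastText-style character n-grams from a word: the word is wrapped in `<` and `>` boundary markers and every substring of length `min_n` to `max_n` is collected. The function name `char_ngrams` is hypothetical, and the 3 to 6 range is fastText's usual default, assumed here rather than confirmed as the setting used for the Tigre vectors. It is a standard-library sketch only and does not load the released embeddings.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return fastText-style character n-grams for a single word.

    Assumption: min_n=3 and max_n=6 mirror fastText's common defaults;
    the actual values used for the Tigre vectors are not stated here.
    """
    # Boundary markers let the model tell prefixes and suffixes apart.
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters (Unicode code points) over the word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


if __name__ == "__main__":
    # Each Geʽez-script letter is one code point, so slicing works per letter.
    for gram in char_ngrams("ትግረ"):
        print(gram)
```

Because a word's vector is built from the vectors of its n-grams, inflected or previously unseen forms that share character sequences with known words can still receive a usable representation, which is helpful for a morphologically rich language with limited text.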

Reference

📚 Cite This Work

@misc{beittigreai2025,
  author = {Beshir Ibrahim},
  title = {BeitTigreAI: NLP for the Tigre Language},
  year = {2025},
  howpublished = {\url{https://beshir-a-ibrahim.github.io/BeitTigreAI/}},
  note = {Accessed: 2025-MM-DD}
}

⚠️ Disclaimer

Research project. Not affiliated. Use outputs with care.

📄 License

Licensed under CC BY-SA 4.0. Share and adapt with attribution.