Home of AI resources for the Tigre language
Tigre is a language of significant demographic and linguistic importance in the Horn of Africa.
Despite its historical and cultural significance, Tigre remains underrepresented in Natural Language Processing (NLP). To address this gap, Tigre speakers in Eritrea and the diaspora have launched this community-led project. We are building a comprehensive set of open resources and tools to support NLP research and development for the language. We believe advances in Tigre NLP can pave the way for similar progress in Geʽez, its closest relative and a historically significant Semitic language.
Both Tigre and Geʽez are in the process of being added to the Meta/UNESCO Language Technology Partner Program and to Google Translate. Through the Meta/UNESCO program, low-resource languages gain vital digital tools such as translation, speech, and text technologies, ensuring their inclusion on global platforms. In parallel, the community is translating the open-source SMOL (Set of Maximal Overall Leverage) dataset from English into Tigre and Geʽez, a required step before Google Translate can add support for a new language. Once this work is finished, Google Translate will proceed to include Tigre and Geʽez in its upcoming release cycle.
Category | Type | Size | Description | URI |
---|---|---|---|---|
Data | Speech | 800 hours | Native speaker audio recordings in various Tigre dialects | available soon |
Data | Speech/Text (Aligned) | 40 hours | Time-aligned speech and transcription pairs | available soon |
Data | Monolingual Text | 15 million tokens | Unsupervised text corpus including books, articles, and online publications | tigre-monolingual-text |
Data | Parallel Text | 250k sentence pairs | Multilingual parallel sentences | available soon |
Trained Models | Neural Machine Translation | 5.17 GB | Translation (Tigre ↔ others) | tigre-nllb-200-distilled-600M |
Trained Models | ASR (Speech-to-text) | 2.45 GB | Automatic Speech Recognition | tigre-asr-Wav2Vec2Bert |
Trained Models | TTS (Text-to-speech) | — | Synthesizes speech from text | available soon |
Trained Models | LLM (pretrained) | 10.6 GB | Pretrained Meta Llama large language model | tigre-llm-Llama3.2-1B |
Trained Models | XLM-RoBERTa-base | 1.14 GB | Encoder-only, multilingual language model | tigre-xlm-roberta-base |
Other Resources | Dictionary | ~6,200 entries | A trilingual dictionary (Tigre, Tigrinya, and English) by Memher Mussie Bekheit | available soon |
Other Resources | Language Model (KenLM) | ~1.2 GB | Statistical 5-gram language model trained on monolingual Tigre text, used in ASR and NLP pipelines. | tigre-data-kenLM |
Other Resources | Word Embeddings (fastText) | 4.04 GB | 300-dimensional fastText vectors for Tigre, supporting subword modeling and morphological understanding. | tigre-data-fasttext |
Other Resources | Lexicon (Segmented) | ~6,200 entries | Segmented lexicon for NLP tasks: text normalization, grapheme-to-phoneme (G2P), morphological parsing, phonetic analysis, and speech processing (ASR/TTS). | tigre-data-lexicon |
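
As a rough illustration of how the model names in the table might be used, the sketch below builds Hub-style identifiers and loads the ASR model with the `transformers` `pipeline` helper. The `namespace` argument is a placeholder: the organization or user name that hosts these repositories is not stated here and must be supplied by the reader. The helper names `repo_id` and `load_asr` are hypothetical and exist only for this example, and it is an assumption that the ASR checkpoint loads through the standard `automatic-speech-recognition` pipeline.

```python
# Model names copied from the URI column of the table above.
MODEL_NAMES = {
    "translation": "tigre-nllb-200-distilled-600M",
    "asr": "tigre-asr-Wav2Vec2Bert",
    "llm": "tigre-llm-Llama3.2-1B",
    "encoder": "tigre-xlm-roberta-base",
}


def repo_id(namespace: str, task: str) -> str:
    """Build a '<namespace>/<model-name>' identifier.

    `namespace` is a placeholder for the hosting organization, which
    this document does not specify.
    """
    if not namespace or "/" in namespace:
        raise ValueError("namespace must be a non-empty name without '/'")
    if task not in MODEL_NAMES:
        raise KeyError(f"unknown task {task!r}; expected one of {sorted(MODEL_NAMES)}")
    return f"{namespace}/{MODEL_NAMES[task]}"


def load_asr(namespace: str):
    """Return a speech-recognition pipeline for the Tigre ASR model.

    Assumes the checkpoint is compatible with the standard
    'automatic-speech-recognition' pipeline; downloads weights on first use.
    """
    # Imported lazily so repo_id() works without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=repo_id(namespace, "asr"))
```

With the real namespace filled in, `load_asr("<namespace>")("sample.wav")` would return a dictionary containing the transcription under the `"text"` key, assuming the audio file exists and the checkpoint loads as expected.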