We’re Open-Sourcing UgannA Siyabasa V2 — Our Most Advanced Sinhala FastText Embedding Model

January 5, 2026

Today, we’re releasing UgannA Siyabasa V2, our most advanced FastText embedding model for the Sinhala language. This release marks an important step in our effort to strengthen open language infrastructure for low-resource languages and to make high-quality Sinhala NLP accessible to researchers, developers, and organizations worldwide.

UgannA Siyabasa V2 was trained on a large, carefully processed Sinhala corpus and is designed for real-world natural language processing tasks, including semantic search, text similarity, clustering, and downstream machine learning applications. The model reflects our belief that language technology should be open, usable, and built with long-term ecosystems in mind.

About UgannA Siyabasa V2

UgannA Siyabasa V2 is a FastText-based word embedding model that produces 300-dimensional vector representations optimized for Sinhala. Compared to its predecessor, it offers improved vocabulary coverage and stronger semantic consistency, along with efficient performance suitable for both offline and real-time use.

The model is now publicly available on Hugging Face: https://huggingface.co/Remeinium/UgannA_SiyabasaV2
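
If you want to try the model locally, the sketch below shows one way to download and query it using the `huggingface_hub` and `fasttext` packages. The filename passed to `hf_hub_download` is an assumption; check the repository's file listing for the actual artifact name.

```python
# Minimal usage sketch. The filename "model.bin" is an assumption; check the
# repository's file listing on Hugging Face for the actual artifact name.
from huggingface_hub import hf_hub_download
import fasttext

model_path = hf_hub_download(
    repo_id="Remeinium/UgannA_SiyabasaV2",
    filename="model.bin",  # assumed filename
)
model = fasttext.load_model(model_path)

# 300-dimensional vector for a Sinhala word.
vector = model.get_word_vector("කොළඹ")
print(vector.shape)  # (300,)

# Vocabulary neighbours ranked by cosine similarity.
print(model.get_nearest_neighbors("පාසල", k=5))
```

Because FastText composes vectors from subword n-grams, `get_word_vector` also returns embeddings for words that never appeared in the training corpus, which helps with Sinhala's rich inflectional morphology.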

To make exploration easier, we’ve also released an interactive demo where users can test embeddings and similarity queries directly in the browser: https://huggingface.co/spaces/Remeinium/Embedding_Siyabasa

For developers building applications and services, the model can be accessed programmatically through our embedding API: https://esdocs.ai.remeinium.com
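
As a rough illustration of what programmatic access might look like, here is a hypothetical request sketch. The endpoint path, payload shape, and response format shown are assumptions, not the documented contract; refer to the API documentation at the link above for the actual details.

```python
# Hypothetical request sketch only: the endpoint path, payload shape, and
# response format are assumptions, not the documented API contract.
import requests

API_BASE = "https://esdocs.ai.remeinium.com"

response = requests.post(
    f"{API_BASE}/embed",           # assumed endpoint path
    json={"text": "ශ්‍රී ලංකාව"},  # assumed payload shape
    timeout=30,
)
response.raise_for_status()
print(response.json())  # assumed to contain a 300-dimensional vector
```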

Training Dataset

UgannA Siyabasa V2 was trained using the Clean Sinhala Text Corpus, a large-scale dataset curated and released by us to support Sinhala language research.

The dataset consists of approximately 17 GB of processed Sinhala text, compiled from diverse sources and refined through multiple preprocessing stages, including normalization, tokenization, deduplication, and noise removal. It is designed to balance scale with linguistic cleanliness, making it suitable for training embeddings and other foundational NLP components.
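
To make those stages concrete, the sketch below illustrates the kind of steps involved. It is not the actual pipeline used to build the corpus; the specific rules (NFC normalization, the Sinhala Unicode block filter, hash-based deduplication) are assumptions chosen for demonstration.

```python
# Illustrative preprocessing sketch, not the actual corpus pipeline.
# Demonstrates Unicode normalization, simple noise removal, and
# hash-based deduplication on a handful of example lines.
import hashlib
import re
import unicodedata

SINHALA_RE = re.compile(r"[\u0D80-\u0DFF]")  # Sinhala Unicode block

def clean_line(line):
    # Normalize so visually identical text shares a single encoding.
    line = unicodedata.normalize("NFC", line)
    # Collapse runs of whitespace.
    line = re.sub(r"\s+", " ", line).strip()
    # Drop lines containing no Sinhala characters (a crude noise filter).
    return line if SINHALA_RE.search(line) else None

def deduplicate(lines):
    seen = set()
    for line in lines:
        key = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield line

raw = ["ශ්‍රී  ලංකාව", "ශ්‍රී ලංකාව", "12345 noise only"]
cleaned = [c for c in map(clean_line, raw) if c]
print(list(deduplicate(cleaned)))  # -> ['ශ්‍රී ලංකාව']
```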

Dataset: https://huggingface.co/Remeinium/CleanSinhalaTextCorpus

Remeinium Open Model License

We’re releasing UgannA Siyabasa V2 under the Remeinium Open Model License (ROML), which permits both research and commercial use with attribution. This approach is intended to encourage broad adoption while preserving transparency around the model’s origin.

We see open models as shared infrastructure — tools that grow more valuable as more people build with them.

Looking Ahead

Sinhala remains a low-resource language in global AI systems, but it shouldn’t be low-ambition. By open-sourcing UgannA Siyabasa V2, we aim to support a growing ecosystem of tools, research, and applications that treat the language with the depth and seriousness it deserves.

This release is one step in a longer journey. We’ll continue improving language models, datasets, and APIs — and expanding toward broader multilingual and multimodal systems in the future.