འབྲུག་ཡུལ།

A shared space for
Dzongkha and English.

Dzosem is a bilingual sentence-embedding model that places Dzongkha and English meaning side by side — opening Bhutan's written heritage to anyone who can phrase the question.

"medicinal herbs" English
Same Point
སྨན་རྩི། Dzongkha
English query: "Where is the nearest medical facility?"
Retrieved: སྨན་ཁང་ཉེ་ཤོས་ག་ཏེ་ཡོད?
The Challenge

A language at the edge of digital legibility.

Six quiet pressures shape the moment Dzongkha is in — a language carrying centuries of Himalayan knowledge, asking to be read by machines.

A low-resource language

Unlike English or Mandarin, Dzongkha lacks large digitised corpora, translation tooling, and core NLP infrastructure.

Roughly 600,000 speakers

A uniquely Bhutanese problem — too small for global model coverage to solve, too vital to leave unattended.

Written meets spoken

A wide gap separates classical written orthography from the modern spoken form, complicating computational modelling.

Economic gravity

The difficulty of Dzongkha and the pull of English in high-paying work nudge a generation away from the mother tongue.

Heritage at risk

Most manuscripts, court records, and Himalayan medical knowledge live in Dzongkha — and quietly grow more distant.

Digitised, yet unreachable

His Majesty's digital initiative has scanned vast archives, but they remain hard to search and harder to read.

Our Intent

To graduate Dzongkha from a low-resource language into a stable one — aligned with His Majesty's vision for cultural preservation.

600K Speakers
700K Pairs
1,024 Dimensions
Impact

Three places this becomes useful.

A shared embedding space is quiet infrastructure — the value shows up where people meet the archive.

Gelephu Mindfulness City

Seamless bilingual search across legal, civic, and investment documents — for the foreigners helping build GMC.

Sowa Rigpa & medicine

Mapping topics and surfacing patterns inside centuries of Himalayan medical knowledge currently locked in script.

A door for the youth

An English-language entry into Bhutan's digitised Dzongkha record — keeping the door to heritage gently open.

Interactive

Search in English. Read in Dzongkha.

Type an English idea — the model returns the closest Dzongkha documents by meaning, not by keyword.

Architecture

How a sentence becomes a point.

Phase I: Inference
I. Tokenise

A Dzongkha or English sentence is broken into token IDs by a Unigram language-model tokeniser.

II. Encode

Token IDs flow through the transformer's deep encoder layers, accumulating context.

III. Pool

Final-layer activations are mean-pooled into a single representation of the sentence.

IV. Project

The pooled vector lands as one point in a shared 1,024-dimensional space.
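The Pool and Project steps can be illustrated with a minimal sketch. It assumes the encoder has already produced one vector per token; the padding mask, the L2 normalisation, and the function names `mean_pool` and `l2_normalise` are illustrative assumptions, not details confirmed for Dzosem. Toy 4-dimensional vectors stand in for the 1,024-dimensional space.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the final-layer vectors of real (non-padding) tokens.

    Assumption: padding positions are marked with 0 in attention_mask.
    """
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [s / count for s in summed]


def l2_normalise(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (an assumed, common final step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Toy activations: three token positions, the last one is padding.
activations = [
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, ignored by the mask
]
mask = [1, 1, 0]

sentence_vector = mean_pool(activations, mask)
point = l2_normalise(sentence_vector)
```

Masking matters because padded positions would otherwise drag the average toward values that carry no meaning for the sentence.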

Phase II: Retrieval
1. Embed the query
An English query, say "medicinal herbs", becomes a point in the space.
2. Embed the documents
Each Dzongkha document is embedded into the same shared space.
3. Measure distance
Documents are ranked by cosine similarity to the query point.
4. Return the ranking
An ordered list of Dzongkha documents most semantically aligned with the English query.
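The four retrieval steps above reduce to a similarity ranking, sketched here under stated assumptions: the query and document vectors are made-up 3-dimensional stand-ins for real embeddings, and `cosine_similarity` and `rank_documents` are hypothetical helper names rather than part of any published Dzosem API.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: List[float], doc_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs, most similar to the query first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors: in practice these would come from the encoder.
query = [1.0, 0.0, 0.0]
documents = {
    "doc_a": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_b": [2.0, 0.0, 0.0],  # same direction, different length
    "doc_c": [1.0, 1.0, 0.0],  # partly aligned
}

ranking = rank_documents(query, documents)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because cosine similarity ignores vector length, `doc_b` scores highest even though it is twice as long as the query; only direction in the shared space counts.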
Roadmap

From corpus to deployment.

Eight movements that take raw parallel sentences and end with an encoder ready for document-level retrieval.

1. Gathering the corpus
Collect and filter roughly 700,000 Dzongkha–English pairs.
2. Setting the architecture
An encoder-only model projecting into a 1,024-dimensional space.
3. Pre-training
Masked-language modelling on large corpora to learn basic semantics.
4. Fine-tuning
Contrastive learning on the 700K pairs, optimising for retrieval.
5. Post-training
Aligning the model with deployment shapes.
6. Matryoshka pruning
Matryoshka representation learning, so that truncated, lower-dimensional embeddings stay useful.
7. Evaluation
Cross-lingual recall and qualitative review.
8. Document tuning
Final weights tuned for long-context retrieval.
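Steps 4 and 6 of the roadmap can be illustrated together. The sketch below assumes an in-batch contrastive (InfoNCE-style) objective and a Matryoshka-style sum over nested embedding prefixes. The roadmap names contrastive learning and Matryoshka representations but does not specify the loss, so the temperature, the prefix sizes, and the function names are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def info_nce(english, dzongkha, temperature: float = 0.05) -> float:
    """Mean in-batch contrastive loss.

    Pair i (english[i], dzongkha[i]) is the positive; every other Dzongkha
    sentence in the batch acts as a negative. Temperature is an assumed value.
    """
    en = [_unit(v) for v in english]
    dz = [_unit(v) for v in dzongkha]
    total = 0.0
    for i, query in enumerate(en):
        logits = [sum(a * b for a, b in zip(query, doc)) / temperature for doc in dz]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(en)


def matryoshka_loss(english, dzongkha, dims=(2, 4), temperature: float = 0.05) -> float:
    """Sum the contrastive loss over nested prefixes of each embedding.

    Training on prefixes encourages the first few dimensions to be useful
    on their own. The prefix sizes here are toy values, not Dzosem's.
    """
    return sum(
        info_nce([v[:d] for v in english], [v[:d] for v in dzongkha], temperature)
        for d in dims
    )


# Toy batch of two translation pairs in a 4-dimensional space.
aligned_en = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned_dz = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
swapped_dz = [aligned_dz[1], aligned_dz[0]]  # deliberately mismatched pairs

loss_aligned = matryoshka_loss(aligned_en, aligned_dz)
loss_swapped = matryoshka_loss(aligned_en, swapped_dz)
```

Correctly paired translations yield a lower loss than the mismatched batch, which is the signal a contrastive fine-tuning stage would push the encoder toward.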