Dzosem is a bilingual sentence-embedding model that places Dzongkha and English meaning side by side — opening Bhutan's written heritage to anyone who can phrase the question.
Six quiet pressures shape the moment Dzongkha is in — a language carrying centuries of Himalayan knowledge, asking to be read by machines.
Unlike English or Mandarin, Dzongkha lacks large digitised corpora, translation tooling, and core NLP infrastructure.
A uniquely Bhutanese problem — too small for global model coverage to solve, too vital to leave unattended.
A wide gap separates classical written orthography from the modern spoken form, complicating computational modelling.
The difficulty of Dzongkha and the pull of English in high-paying work nudge a generation away from the mother tongue.
Most manuscripts, court records, and Himalayan medical knowledge live in Dzongkha — and quietly grow more distant.
His Majesty's digital initiative has scanned vast archives, but they remain hard to search and harder to read.
A shared embedding space is quiet infrastructure — the value shows up where people meet the archive.
Seamless bilingual search across legal, civic, and investment documents, for the foreigners helping build Gelephu Mindfulness City (GMC).
Mapping topics and surfacing patterns inside centuries of Himalayan medical knowledge currently locked in script.
An English-language entry into Bhutan's digitised Dzongkha record — keeping the door to heritage gently open.
Type an English idea — the model returns the closest Dzongkha documents by meaning, not by keyword.
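Searching "by meaning, not by keyword" usually comes down to comparing vectors. The sketch below is a minimal illustration, not Dzosem's actual retrieval code: it assumes the query and the documents have already been embedded, uses cosine similarity as the closeness measure (an assumption), and stands in toy 3-dimensional vectors and hypothetical document IDs for real 1,024-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_meaning(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical document IDs and toy vectors, for illustration only.
doc_vecs = {
    "dz_doc_land_law": [0.9, 0.1, 0.0],
    "dz_doc_herbal_remedy": [0.0, 0.2, 0.9],
    "dz_doc_court_record": [0.7, 0.6, 0.1],
}

# Pretend embedding of an English query about land law.
query_vec = [1.0, 0.2, 0.0]

results = rank_by_meaning(query_vec, doc_vecs)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

No keyword is shared between the query and the documents here. The ranking depends only on how close the vectors sit in the shared space, which is why an English query can surface a Dzongkha document.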
A Dzongkha or English sentence is split into subword tokens and mapped to token IDs by a Unigram language-model tokeniser.
Token IDs are embedded and passed through the transformer's deep encoder layers, accumulating context at each layer.
Final-layer activations are mean-pooled into a single representation of the sentence.
The pooled vector lands as one point in a shared 1,024-dimensional space.
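The pooling step in the walk-through above can be sketched in a few lines. This is an illustration under stated assumptions rather than Dzosem's implementation: the function name `mean_pool` is hypothetical, the attention mask that excludes padding tokens is an assumed (though common) detail, and the toy activations are 4-dimensional instead of 1,024-dimensional.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the final-layer vectors of real tokens, skipping padding.

    token_vectors: list of per-token vectors (one list of floats per token).
    attention_mask: list of 1/0 flags, 1 for a real token, 0 for padding.
    """
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("no real tokens to pool")
    return [t / count for t in total]


# Toy final-layer activations: two real tokens followed by one padding token.
final_layer = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
mask = [1, 1, 0]

sentence_vec = mean_pool(final_layer, mask)
print(sentence_vec)  # [2.0, 3.0, 4.0, 5.0]
```

Whatever the sentence length, the output has one fixed size, which is what lets sentences in either language be compared as points in the same space.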
Eight movements that take raw parallel sentences and end with an encoder ready for document-level retrieval.