PAN Localization project

PAN Localization project releases its research and outputs on 11th Mother Language Day

A Giant Leap for Multilingual Cyberspace

 “In the field of IT, all language communities are entitled to have at their disposal equipment adapted to their linguistic system and tools and products in their language, so as to derive full advantage from the potential offered by such technologies for self-expression, education, communication, publication, translation and information processing and the dissemination of culture in general” [1].

PAN Localization project has been a regional initiative addressing these challenges and promoting the use of language technology across Asia. The project, initiated in 2003, has developed and disseminated computing solutions for Bahasa Indonesia, Bangla, Dzongkha, Khmer, Lao, Mongolian, Nepali, Pashto, Sinhala, Tamil, Tibetan and Urdu. These languages represent a population of nearly one billion people across developing Asia and globally.

On the occasion of the eleventh International Mother Language Day, 21st February 2010, PAN Localization project is pleased to release its research, technology and resources through its website.

This project has been carried out in collaboration with the Pan Asia Networking (PAN) program of IDRC, Canada, the Centre for Research in Urdu Language Processing at the National University of Computer and Emerging Sciences, Pakistan, and the following partner organizations:

[1] Universal Declaration on Linguistic Rights, UNESCO, 1996.


Salient Research Outputs

  • Bahasa Indonesia

Statistical Machine Translation (Awarded), English-Bahasa Parallel Corpus (1 Million words), POS Tagged Bahasa Corpus (500,000 words), Part of Speech Tagset and Tagger…

  •  Bangla

Text to Speech System (Awarded), Optical Character Recognition System (Shortlisted for Award), Bangla Pad, Spell Checker, Lexicon, Language Table for IDNs, Part of Speech Tagset and Tagger, Wordnet (1,000 words), Tagged Corpus (5 Million words), English-Bangla Parallel Corpus, Training on Content Development using infomediaries, Online Legal Content for Farmers in Bangla…

  •  Dzongkha

DzongkhaLinux, Optical Character Recognition System, Language Table for IDNs, Part of Speech Tagset, Corpus (600,000 words), Lexicon (23,000 words), Text to Speech System (prototype), Dzongkha Terminology, Collation, Locale, Fonts and Keyboard, Training on DzongkhaLinux…

  •  Khmer

Optical Character Recognition System, Java Applications and Plug-ins for Collation, Encoding Conversion, Word Segmentation, Locale, Mobile SMS, Language Table for IDNs, Part of Speech Tagset and Tagger, Lexicon, Text to Speech System (prototype), Tagged Corpus (150,000 words), Online Khmer Content, Training of Government Officials on Khmer Open Source Software…

  •  Lao

Optical Character Recognition System, and MS Office Plug-in for Word Segmentation, Collation, Spell Checker, Lao Pad, Fonts, Keyboard, Language Table for IDNs, Part of Speech Tagset, POS Tagged Corpus, Parallel Corpus (37,000 words), Online Lao Content…

  • Mongolian

Part of Speech Tagset and Tagger, Spell Checker, Corpus (1,000,000 words), Tagged Corpus (100,000 words), Lexicon (10,000 words), Automatic Speech Recognition, Localization of Pidgin and SeaMonkey…

  • Nepali

NepaLinux (Awarded), Spell Checker, Grammar Checker, Parallel Corpus (100,000 words), Tagged Corpus (80,000 words), Lexicon (37,000 words), Optical Character Recognition System (prototype), Language Table for IDNs, Training Material on NepaLinux, Training of Rural Centers on Nepali Open Source Software…

  • Pashto

Localized SeaMonkey (Awarded), Keyboard, Fonts, Language Table for IDNs…

  • Sinhala & Tamil

Sinhala Optical Recognition System, Sinhala Text to Speech System (Awarded), Screen Reader for Sinhala for the Blind, Language Learning Tool for Tamil in Sinhala and English, Sinhala Wordnet, Localized OpenTM, Language Table for IDNs, Collation Standard, Encoding Conversion Tool, Training Students for Development of Online Sinhala Content…

  • Tibetan

Collation, Online Tibetan Content, Farmer Training on using Online Tibetan Content…

  •  Urdu

Parallel Corpus (100,000 words), Stemmer, Collation, Optical Character Recognition, Localization of SeaMonkey, Web Composer and Psi, Terminology Glossary, Gendered Outcome Mapping Tool (Awarded), Part of Speech Tagset and Tagger, Tagged Corpus (200,000 words), Language Table for IDNs, Training Material on Localized Applications, Training on Localized Software to Rural School Children, Content Generated by Rural School Children and Teachers…


And much more … on the project website.


Last Updated ( Monday, 22 February 2010 )