Partners - BRAC University, Bangladesh

Bangladesh

Center for Research on Bangla Language Processing (CRBLP)

The Center for Research on Bangla Language Processing (CRBLP) of BRAC University is currently conducting research projects that deal with Bangla language processing. At present the research team is working on Bangla Document authoring, Information Retrieval (Spelling checker, Search Engine), Optical Character Recognition, Pronunciation Generator, Speech Processing, Morphological Analysis, Parts of Speech Tagging, Syntax, Grammar Checker, Text Categorization, Language Modeling and many more interesting research areas.

Team's Achievements

OCR: BanglaOCR is the Optical Character Recognizer for Bangla Script. It takes scanned images of a printed page or document as input and converts them into editable Unicode text. BanglaOCR allows users to train the data set from any document and observe the recognition performance.

CRBLPConverter: CRBLPConverter is a software package that converts Bangla documents written in various TTF font encodings to Unicode. It includes converters for SutonnyMJ, Bangsee Alpona, Prothoma, and Alo. The software is free and open source, released under the GNU General Public License (GPL) version 2.

BanglaPad: BanglaPad is an open source, full-featured, cross-platform Unicode rich text editor capable of editing Bangla. Because it is written in Java, it runs on different operating systems such as Windows and Linux/Unix. Users can type Bangla text without external helper applications such as keyboard drivers, and can check the spelling of both Bangla and English documents.

Bangla Speller Sandbox: The Bangla phonetic spelling checker (Puspa) suggests corrections for misspelled words based on similarities in pronunciation. Its implementation is based on Double Metaphone phonetic encoding.
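The idea of suggesting words by pronunciation can be sketched as follows. This is a minimal illustration, not Puspa's code: it uses a much simpler, hypothetical sound-class key instead of Double Metaphone, and it works on romanized spellings only for readability. The names `SOUND_CLASS`, `phonetic_key`, `build_index`, and `suggest` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sound classes for romanized text. This is NOT Double Metaphone;
# it only groups letters that are often confused in phonetic spelling.
SOUND_CLASS = {
    "b": "P", "p": "P",
    "f": "F", "v": "F",
    "c": "K", "k": "K", "q": "K", "g": "K",
    "s": "S", "z": "S", "j": "S",
    "d": "T", "t": "T",
    "m": "N", "n": "N",
    "l": "L",
    "r": "R",
}


def phonetic_key(word):
    """Reduce a word to a key of sound classes.

    Vowels, 'h', and any unlisted character are skipped, and adjacent
    repeats of the same class are collapsed.
    """
    key = []
    for ch in word.lower():
        code = SOUND_CLASS.get(ch)
        if code and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)


def build_index(lexicon):
    """Map each phonetic key to the lexicon words that share it."""
    index = defaultdict(list)
    for word in lexicon:
        index[phonetic_key(word)].append(word)
    return index


def suggest(word, index):
    """Return lexicon words whose phonetic key equals that of `word`."""
    return index.get(phonetic_key(word), [])


# Illustrative romanized lexicon (an assumption for this example).
index = build_index(["bangla", "kotha", "bhasha"])
print(suggest("bengla", index))
print(suggest("kota", index))
```

A real checker would rank candidates (for example by edit distance) rather than return every word that shares a key.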

JKimmo: JKimmo is a Java interface for PC-Kimmo, a command-line morphological analyser provided by SIL.

English to Bangla Transliteration (Pata): It can be used for Bangla text input with an English keyboard. When you type phonetically in English, you get a similar-sounding dictionary word.

Automated Pronunciation Generator: When you input any Bangla word, this application will give the pronunciation of that word in IPA (International Phonetic Alphabet).

Goals during Second Phase

Optical Character Recognition: OCR systems convert printed text to digital text that can be edited and reformatted for other uses. We are developing a Bangla Character Recognizer that can recognize printed Bangla documents and convert them to editable text.

Speech Recognition: Speech recognition is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. We are currently working on Bangla speech recognition using Hidden Markov Models (HMMs) as the technique and HTK as the toolkit.

Speech Synthesis: Speech synthesis is the artificial production of human speech. A text-to-speech system converts normal text to speech. We are working on generating speech signals from Bangla text.

TTF to Unicode Font Converter: There are a number of ASCII-based Bangla fonts in use. The trouble with these fonts is that if the host machine doesn’t have the font installed, the text is displayed as jumbled characters. We are working on a TTF to Unicode font converter that converts such ASCII text to Unicode text. That way, only a Unicode Bangla font needs to be installed for the text to display properly.
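A conversion of this kind can be sketched as a table lookup plus a reordering step. The sketch below is an illustration under stated assumptions, not the CRBLP converter: the `LEGACY_TO_UNICODE` table is a tiny illustrative placeholder rather than the verified table of any font, and conjuncts and multi-character sequences are ignored. The reordering reflects the fact that Unicode stores a vowel sign after its consonant even when the sign is drawn to the left of it.

```python
# Illustrative legacy-code -> Unicode table (an assumption for this sketch;
# a real converter needs the complete, verified table for each font).
LEGACY_TO_UNICODE = {
    "K": "\u0995",  # BENGALI LETTER KA
    "L": "\u0996",  # BENGALI LETTER KHA
    "v": "\u09BE",  # BENGALI VOWEL SIGN AA (drawn after the consonant)
    "w": "\u09BF",  # BENGALI VOWEL SIGN I (drawn before the consonant)
}

# Vowel signs drawn to the left of the consonant. Here the legacy text is
# assumed to store them in visual order, before the consonant.
PRE_BASE_SIGNS = {"\u09BF", "\u09C7", "\u09C8"}


def convert(legacy_text):
    """Convert legacy-encoded text to Unicode, one character at a time.

    Unmapped characters pass through unchanged. A pre-base vowel sign is
    held back and emitted after the next character, so the output is in
    Unicode's logical order.
    """
    out = []
    pending = None  # pre-base sign waiting for its consonant
    for ch in legacy_text:
        uni = LEGACY_TO_UNICODE.get(ch, ch)
        if uni in PRE_BASE_SIGNS:
            pending = uni
            continue
        out.append(uni)
        if pending is not None:
            out.append(pending)
            pending = None
    if pending is not None:  # trailing sign with no consonant after it
        out.append(pending)
    return "".join(out)


print(convert("wK"))  # consonant first, then the vowel sign, in Unicode order
```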

Corpus Analysis: We have developed a tool for extensive corpus analysis of word frequency distribution. Our corpus currently consists of one year of Prothom-Alo newspaper text, along with Charjapad and Baru Chandi Das Er Kabbo. We have analyzed the corpus for regularities and anomalies in Bangla word usage.
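A word frequency count of the kind described above can be sketched in a few lines. This is a minimal illustration, not the CRBLP tool: the tokenizer simply splits on whitespace and a small, assumed set of punctuation marks (including the Bangla danda), and the sample sentence is made up for the example.

```python
import re
from collections import Counter

# Punctuation treated as token boundaries; U+0964 is the Bangla danda.
# This set is an assumption for the sketch, not an exhaustive list.
PUNCTUATION = "\u0964,.;:!?\"'()-"
TOKEN = re.compile("[^\\s" + re.escape(PUNCTUATION) + "]+")


def word_frequencies(text):
    """Count how often each token occurs in `text`."""
    return Counter(TOKEN.findall(text))


sample = "আমি ভাত খাই। আমি বই পড়ি।"
freq = word_frequencies(sample)
for word, count in freq.most_common(3):
    print(word, count)
```

Sorting the counts by rank, as `most_common` does, is the usual starting point for inspecting a frequency distribution.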

Lexicon: Any kind of Bangla language processing needs a rich and informative lexicon. We have developed a wordlist of 160 thousand words with first-step parts of speech tagging.

Wordnet: We are developing a semantic lexicon for Bangla that divides Bangla words into sets of synonyms and maps various semantic relations between these word sets.
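The structure described above, with words grouped into synonym sets and relations between those sets, can be sketched as a small data structure. This is an illustration only, not the CRBLP design: the class `MiniWordnet`, the synset identifiers, and the sample entries are assumptions made for the example, and only one relation (hypernymy, "is a kind of") is modeled.

```python
from collections import defaultdict


class MiniWordnet:
    """Toy semantic lexicon: synonym sets plus hypernym links between them."""

    def __init__(self):
        self.synsets = {}                    # synset id -> frozenset of words
        self.word_index = defaultdict(set)   # word -> ids of its synsets
        self.hypernyms = defaultdict(set)    # synset id -> parent synset ids

    def add_synset(self, synset_id, words):
        self.synsets[synset_id] = frozenset(words)
        for word in words:
            self.word_index[word].add(synset_id)

    def add_hypernym(self, child_id, parent_id):
        self.hypernyms[child_id].add(parent_id)

    def synonyms(self, word):
        """Other words that share at least one synset with `word`."""
        result = set()
        for synset_id in self.word_index.get(word, ()):
            result |= self.synsets[synset_id]
        result.discard(word)
        return result

    def hypernym_words(self, word):
        """Words in the parent synsets of every synset containing `word`."""
        result = set()
        for synset_id in self.word_index.get(word, ()):
            for parent_id in self.hypernyms.get(synset_id, ()):
                result |= self.synsets[parent_id]
        return result


# Illustrative entries (assumed for this example).
wn = MiniWordnet()
wn.add_synset("water.n", ["জল", "পানি"])
wn.add_synset("liquid.n", ["তরল"])
wn.add_hypernym("water.n", "liquid.n")
print(wn.synonyms("জল"))
print(wn.hypernym_words("পানি"))
```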

CPI Team Members and their Designations

Mumit Khan, Head, Center for Research on Bangla Language Processing, Associate Professor, CSE Department

Matin Saad Abdullah, Program Manager, Center for Research on Bangla Language Processing, Senior Lecturer, CSE Department

Naira Khan, Linguist, Lecturer - English and Humanities

Zahurul Islam, Research Programmer

Naushhad Uzzaman, Research Programmer (on leave)

Md. Abul Hasnat, Research Programmer

S.M. Murtaza Habib, Research Programmer

Firoj Alam, Research Programmer

Fahim Tawfique Chowdhury, Research Programmer