Blogs

Cross Language Approximate Search on Indic Languages- A demo

Posted on April 3, 2011 | Santhosh Thottingal

A demo of cross-language approximate search on Indic text: the Malayalam word സാമ്പാര്‍ is compared against a paragraph from http://ml.wikipedia.org/wiki/Sambar. In the bottom half, the words highlighted in yellow are the search results. Note that the Kannada word ಸಾಂಬಾರ್‍ is matched for the Malayalam word, which is why the search is called cross-language. Inflected forms of സാമ്പാര്‍ such as സാമ്പാറും and സാമ്പാറു are also found as results. [Read More]
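One way to picture such a match is sketched below. This is not the algorithm used in the demo; it is a minimal illustration that assumes two things: the major Indic Unicode blocks (Devanagari through Malayalam) place equivalent letters at the same offset within their 128-code-point blocks, and a plain Levenshtein edit distance is a good enough measure of closeness. The function names `normalize`, `edit_distance` and `similarity` are made up for this sketch.

```python
# Joiners only affect rendering, so they are dropped before comparison.
JOINERS = {0x200C, 0x200D}  # ZWNJ, ZWJ


def normalize(word):
    """Map each Indic character to its offset inside its Unicode block.

    Assumption: blocks from Devanagari (U+0900) to Malayalam (U+0D7F)
    are 128 code points wide and share a common letter layout.
    """
    out = []
    for ch in word:
        cp = ord(ch)
        if cp in JOINERS:
            continue
        if 0x0900 <= cp <= 0x0D7F:
            out.append(("indic", cp & 0x7F))
        else:
            out.append(("other", cp))
    return out


def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(w1, w2):
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(w1), normalize(w2)
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


# The Malayalam and Kannada spellings of "sambar", written as escapes.
malayalam = "\u0d38\u0d3e\u0d2e\u0d4d\u0d2a\u0d3e\u0d30\u0d4d\u200d"
kannada = "\u0cb8\u0cbe\u0c82\u0cac\u0cbe\u0cb0\u0ccd\u200d"
print(similarity(malayalam, kannada))
```

A search would then report every word in the paragraph whose score against the query is above some threshold; the threshold itself is a tuning choice and is not specified here.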

Tamil Collation in GLIBC

Posted on February 26, 2011 | Santhosh Thottingal

A few months back, we started fixing the collation rules of Indian languages in the GNU C library. Pravin Satpute prepared patches for many languages, and I prepared patches for Malayalam and Tamil. Later Pravin enhanced the Tamil patch. You can read the rules used for Malayalam collation here [PDF document]. The Tamil patch was applied upstream, but the bug is still open since there is some confusion about the results. Before reading the discussion below, please read the discussion in the bug report: [ta_IN] Tamil collation rules are not working in other locales [Read More]

Bugs Collation glibc Tamil

Identifiers In Indic Languages

Posted on January 8, 2011 | Santhosh Thottingal

Recently, while preparing a critique of the IDN policy for the Malayalam language prepared by CDAC, I noticed that ICANN does not allow control characters in domain names. Some time back I noticed that Python 3 identifiers also do not allow control characters. This blog post attempts to analyze the issue by looking at the Unicode and ICANN specifications about these special characters. Apart from the existing characters in Indic languages, the Zero Width Joiner and Zero Width Non-Joiner are widely used in Indic languages to control how ligatures are formed. [Read More]
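A quick way to look at the question from the Python side is sketched below. It uses only the standard `unicodedata` module and `str.isidentifier()`; the sample identifier is a made-up string for illustration. The result printed for the identifier containing a joiner depends on the Unicode data bundled with the interpreter, so no particular outcome is claimed here.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def describe(ch):
    """Return the Unicode name and general category of a character."""
    return unicodedata.name(ch), unicodedata.category(ch)


# Both joiners are format characters (general category Cf).
for ch in (ZWNJ, ZWJ):
    print(describe(ch))

# A made-up Malayalam identifier: two letters followed by a virama.
plain = "\u0d35\u0d28\u0d4d"
# The same string with a ZWJ appended, as used to request a chillu form.
with_joiner = plain + ZWJ

print(plain.isidentifier())
# Whether this is accepted depends on the interpreter's Unicode version.
print(with_joiner.isidentifier())
```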

CDAC icann idn Malayalam python standards zwj zwnj

Dictionary Jabber Buddy Bots

Posted on November 20, 2010 | Santhosh Thottingal

Recently we released two Jabber buddy bots for dictionary lookup. By adding eng.mal.dict@gmail.com as a chat contact, one can ask for the Malayalam meaning of an English word just by sending a chat message. Similarly, for the English-Hindi or Hindi-English dictionary, we have another bot, eng.hin.dict@jabber.org. Both dictionaries use Dict databases based on the DICT protocol. Both bots were well received by users. We have 8000+ users for the English-Malayalam dictionary. [Read More]

bots dictionary python xmpp

Indic Language Computing Workout, Pune

Posted on August 23, 2010 | Santhosh Thottingal

On 22nd August, I conducted a workout session with Praveen on Indic language computing at the Red Hat office, Pune. The plan was to solve some of the issues in Devanagari support for the encoding converter Payyans, but most of the time was spent introducing the concepts of Indic language computing to the participants. Project Silpa was also introduced and demonstrated. Students from the College of Engineering, Pune and other colleges attended. Red Hat sponsored the venue at their office. [Read More]

talks workshops

Wikimania 2010, Poland

Posted on July 17, 2010 | Santhosh Thottingal

I left Chennai on Wednesday (8th) and reached Frankfurt airport on Thursday morning. The rest of the people from India attending Wikimania (Shiju Alex, Tinu Cherian, Srinivas Gunta and Arjun Rao) had already reached the airport, and I joined them. We reached Gdansk Airport by 12.30 PM. Our accommodation was at a student hostel of Gdansk University. Language was a big issue, since most people there do not understand English and know only Polish. [Read More]

wikipedia

Attending Wikimania 2010

Posted on July 6, 2010 | Santhosh Thottingal

I will be attending Wikimania 2010 in Gdansk, Poland. This annual international conference of the Wikimedia community runs from July 9 to July 11. I will be presenting wik2cd, the tool I wrote for Malayalam Wikipedia version 1.0, in a joint workshop with Wikipedia offline developers. I will be joining Manuel Schneider, Shiju Alex and Martin Walker in the workshop titled: Creating offline version of Wiki content – Solutions and Challenges. [Read More]

conference wikipedia

Malayalam Wikipedia releases selected articles on CD

Posted on April 17, 2010 | Santhosh Thottingal

As part of Malayalam Wikipedia Meetup 2010, Malayalam Wikipedia today releases 500 selected articles on a CD-ROM. This is the first time in India that a local-language Wikipedia has released its articles for offline use. I handled the technology part of the project. The idea was to get the selected articles onto the CD in static form. But this is not as easy as one might imagine: it is not like saving each page from the browser to the local machine. [Read More]

wikipedia

Predictive text entry with ibus

Posted on March 12, 2010 | Santhosh Thottingal

A few days back I came to know about this project: Text Prediction on GNOME based on GTK+ Input Method context. Basically, it is an input method with a text prediction feature. I had a similar project idea in May 2009 and had done some coding for it. The project was to build an IBus input method that can do letter prediction as well as word prediction. The prediction is based on n-grams. [Read More]
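To give a rough idea of what n-gram based word prediction means, here is a minimal bigram sketch. It is not the code from either project; the function names `train_bigrams` and `predict_next` and the toy corpus are assumptions made for illustration, and a real input method would need a much larger corpus plus smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the corpus."""
    model = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def predict_next(model, word, k=3):
    """Return up to k most frequent followers of the given word."""
    followers = model.get(word, Counter())
    return [w for w, _ in followers.most_common(k)]


# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat the cat ran"
model = train_bigrams(corpus)
print(predict_next(model, "the", k=1))
```

Letter prediction works the same way if the corpus is split into characters instead of words.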

ibus predictive text entry

Conferences : FOSS.IN and NCIDEEE

Posted on November 28, 2009 | Santhosh Thottingal

FOSS.IN 2009 starts on 1st December. I wanted to attend all 5 days, but I have another conference from Dec 1st to 3rd in Chennai: I am attending the National Conference on ICTs for the differently-abled/underprivileged communities in Education, Employment and Entrepreneurship 2009 (NCIDEEE 2009) at Loyola College, Chennai. So I will miss the first 3 days of FOSS.IN. We have a workout on Project Silpa during FOSS.IN. I am also planning a workout with Debayan and Jinesh to get his tesseract-indic OCR working with Malayalam. [Read More]

conference dhvani silpa