Cross-Language Approximate Search on Indic Languages: A Demo

A demo of cross-language approximate search in Indic text:
The Malayalam word സാമ്പാര്‍ is compared against a paragraph from http://ml.wikipedia.org/wiki/Sambar.
In the bottom half, the words highlighted in yellow are the search results.
You can see that the Kannada word ಸಾಂಬಾರ್‍ is matched for the Malayalam word, which is why this is called cross-language search.
Inflected forms of സാമ്പാര്‍, such as സാമ്പാറും and സാമ്പാറു, are also returned as results.
This is the kind of search we need in Indic languages, not just the letter-by-letter comparison we do for English.

Another example shows all the inflected forms of the noun പാലക്കാട്, along with the same word written in Tamil, Telugu, and Hindi. The search returns the results in those languages too.

You can try it here: http://silpa.org.in/ApproxSearch

This is a fuzzy string search application. It illustrates the combined use of edit distance and the Indic Soundex algorithm.

By mixing both "written like" (edit distance) and "sounds like" (Soundex) comparisons, we achieve efficient approximate string searching. This application is capable of cross-language string search too. That means you can search for Hindi words in Malayalam text. If there is any Malayalam word that is an approximate transliteration of the Hindi word, or sounds like the Hindi word, it will be returned as an approximate match. The “written like” algorithm used here is a bigram average algorithm: the number of bigrams common to the two strings, divided by the average number of bigrams in the two strings, gives a factor between 0 and 1. Similarly, the Soundex algorithm also gives a weight. By selecting the words whose comparison weight is more than the threshold weight (which is 0.6), we get the search results.
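As a rough illustration of the "written like" half, here is a minimal Python sketch of a bigram average comparison. It is not the code Silpa actually runs: the function names (bigrams, bigram_average, approximate_matches) are made up for this example, the sample words are in Latin script for readability, and the Soundex weight is left out entirely.

```python
from collections import Counter

THRESHOLD = 0.6  # the threshold weight mentioned in the post


def bigrams(word):
    """Return the list of adjacent character pairs (by code point) in word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def bigram_average(a, b):
    """Common bigrams divided by the average bigram count of the two words."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        # Words shorter than two characters have no bigrams;
        # fall back to exact comparison (an assumption of this sketch).
        return 1.0 if a == b else 0.0
    # Counter intersection counts each shared bigram as often as it
    # appears in both words.
    common = sum((Counter(ba) & Counter(bb)).values())
    average = (len(ba) + len(bb)) / 2
    return common / average


def approximate_matches(query, text):
    """Return the words of text whose weight against query exceeds THRESHOLD."""
    return [w for w in text.split() if bigram_average(query, w) > THRESHOLD]


if __name__ == "__main__":
    print(bigram_average("night", "nacht"))
    print(approximate_matches("sambar", "rice sambaar and sambaru dosa"))
```

With these inputs, "night" and "nacht" share only one bigram out of an average of four, so they fall well below the threshold, while "sambaar" and "sambaru" both pass as matches for "sambar".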

Conferences : FOSS.IN and NCIDEEE

FOSS.IN 2009 starts on 1st December. I wanted to attend all 5 days, but I have another conference from Dec 1st to 3rd in Chennai. I am attending the National Conference on ICTs for the differently-abled/underprivileged communities in Education, Employment and Entrepreneurship 2009 (NCIDEEE 2009) at Loyola College, Chennai. So I will miss the first 3 days of foss.in.
We have a workout on Project Silpa during foss.in. I am also planning to have a workout with Debayan and Jinesh to get his tesseract-indic OCR working with Malayalam.

See you at foss.in!

Announcing Project Silpa

Many of my friends already know about a project I am working on; this is the public announcement of it.

The project is named Silpa, which may be read as an acronym of Swathanthra (Mukth, Free as in Freedom) Indian Language Processing Applications. It is a web framework and a set of applications for processing Indian languages in many ways. In other words, it is a platform for porting existing and upcoming language processing applications to the web.

Before going into the details, you can have a quick preview of the application here: http://smc.org.in/silpa

The project is designed so that applications and utilities can be added as plugins. The framework is written from scratch in Python. As you can see in the development version, a number of modules are already written. Most of them require some more work to be _complete_. The application is free software, and there is a link to the source code at the bottom of the application.

As it is meant to cover all the languages of India, all modules should be capable of handling all scripts from India (and sometimes English too). At the same time, the language of the input data is transparent, meaning the user need not mention that _this_ is the language in which she is entering the data. Unlike desktop applications, which ask the user to specify the language along with the input data (for example, a spell checker), the modules should try to detect the language themselves. And where possible, modules try to process the data even if the input is in multiple Indic scripts.
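One simple way to guess the language without asking the user is to look at which Unicode block each character falls in. The sketch below is only an illustration of that idea, not the detection code used in Silpa. The names detect_script and SCRIPTS are hypothetical, and it identifies the script rather than the language (Hindi and Marathi would both come out as Devanagari).

```python
# The nine major Indic blocks sit back to back in Unicode, 128 code points
# each, from U+0900 (Devanagari) to U+0D7F (end of Malayalam).
SCRIPTS = [
    "Devanagari", "Bengali", "Gurmukhi", "Gujarati", "Oriya",
    "Tamil", "Telugu", "Kannada", "Malayalam",
]


def detect_script(word):
    """Return the most frequent script in word, or None if none is recognised."""
    counts = {}
    for ch in word:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x0D7F:
            name = SCRIPTS[(cp - 0x0900) // 0x80]
        elif cp < 128 and ch.isalpha():
            name = "Latin"
        else:
            # Digits, punctuation, joiners such as ZWJ/ZWNJ, etc. are ignored.
            continue
        counts[name] = counts.get(name, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


if __name__ == "__main__":
    for w in ["\u0d38\u0d3e", "\u0cb8\u0cbe", "\u0939\u093f", "sambar"]:
        print(detect_script(w))
```

Running the detection per word, rather than once for the whole input, is what would let a module handle text that mixes several Indic scripts.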

The modules may be general purpose (e.g. Dictionary, Spellcheck, Sort, Transliteration, Font conversion) or meant to demonstrate a technology or algorithm (e.g. Hyphenation, Stemmer, Search algorithms).

Some of the modules are usable as of now, while others are still in development. Feel free to try them out. User data will not be logged except when a crash occurs (in that case, the user data and the exception trace will be logged for later debugging).

This is also a call for contributors. You may propose ideas for new modules, suggest features, and so on. A few students showed interest in the project. Unfortunately, Python is not a language in their college syllabus. So if you are good at Python and are interested in contributing to the project, drop me a mail :). There is no separate development version apart from the one at http://smc.org.in/silpa . All development happens there itself, and any change in the code is immediately available for use! (Or immediately starts crashing on user data.)

I will write later about some of the interesting algorithms I used in some modules. If you are curious about them, read the code!