

Indic Language Computing Workout, Pune

On 22nd August, I conducted a workout session with Praveen on Indic language computing at the Red Hat office, Pune. The plan was to solve some of the issues in Devanagari support for the encoding converter Payyans, but most of the time was spent introducing the concepts of Indic language computing to the participants. Project Silpa was also introduced and demonstrated. Students from College of Engineering, Pune and other colleges attended. Red Hat sponsored the venue at their office. It was very interesting to interact with energetic and enthusiastic students.

Posted in Community, Indic.



Wikimania 2010, Poland

I left Chennai on Wednesday (8th) and reached Frankfurt airport on Thursday morning. The rest of the people from India attending Wikimania (Shiju Alex, Tinu Cherian, Srinivas Gunta and Arjun Rao) had already reached the airport, and I joined them. We reached Gdansk Airport by 12.30 PM. Our accommodation was at a students' hostel of Gdansk University. Language was a big issue, since most people there do not understand English and know only Polish. The lady at the reception of our hostel used the Google Translate tool to communicate with us. Gdansk is a very beautiful city, with streets of tall brick buildings.

The conference started on Friday morning. Sue Gardner, Executive Director of the Wikimedia Foundation, talked about the strategies of the foundation, followed by a Q&A with Wikimedia board members. We presented the Malayalam CD to Sue Gardner. She remembered the international free software conference she attended at Trivandrum in December 2008.
Our workshop on offline Wikipedia versions started at 2.30. Martin Walker introduced the workshop to the participants. Manuel Schneider from German Wikipedia explained the openZIM format for offline compressed storage of Wikipedia and the readers available for desktop computers and mobile phones. Shiju Alex introduced the Malayalam Wikipedia offline version 1.0. I talked about the issues and solutions in providing an offline version, particularly of a non-Latin, complex-script wiki, to users on CD-ROM or DVD. I demonstrated sample offline wikis in Hebrew, Tamil, Polish, English and Japanese built with the wiki2cd tool. There were a couple of questions on wiki2cd and openZIM. Kul Takanao Wadhwa and Tomasz Finc from the Wikimedia Foundation, who are focusing on offline wiki projects, attended the workshop, and we had a discussion after the talk.

Offline wikipedia people: myself, Shiju Alex, Martin Walker, Manuel Schneider

The offline wiki workshop continued with the PediaPress team. They are the people behind the recently added book/PDF export feature of Wikipedia. Unfortunately, this feature requires lots of improvements to work with Indian scripts.
We met Gerard M, who focuses on language support of wikis. He has an amazing passion for our local language Wikipedias. We discussed the technical issues of local language wikis. Siebrand joined the discussion and pointed out some improvements for the wiki2cd software.

Discussion with Siebrand on wiki2cd improvements. From left to right: Tinu Cherian, myself, Gerard M, Siebrand

On the second day I met Volker Haas, the developer of the PDF export/books feature of Wikipedia. The library the extension uses to create PDFs is ReportLab, which does not support complex scripts such as Indic or Arabic. We had a long discussion on possible solutions, going through the ReportLab code, the mwlib code, and the software I am currently writing to provide complex-script PDF rendering APIs. We will continue to try out some of the options we have to solve this issue soon.

Martin Walker, who presented the article selection process of English Wikipedia along with us in the workshop, invited Shiju and me for dinner with his family, and we went.

The third day started with a plenary session by Jimmy Wales. He talked about small-language Wikipedias and the issues they face. He emphasized the need for offline versions of Wikipedia to reach more people and talked about the Malayalam Wikipedia offline version.

Jimmy Wales with Malayalam wikipedia CD

Mayooranathan from Tamil Wikipedia presented the issues and statistics of the Tamil Wikipedia community. On Monday and Tuesday, we spent time roaming around the Old Town of Gdansk. We visited St. Mary's Church, the biggest brick church in the world, and climbed the 400 steps of its tower. From the top of the church, one can see the entire city. We went by boat to the Baltic Sea beach at Westerplatte and visited Wisłoujście Fortress.

Related posts:

* Creating Malayalam Wikipedia CD: http://shijualex.wordpress.com/2010/04/24/creating-malayalam-wikipedia-cd/
* Wiki2cd: http://thottingal.in/blog/2010/04/17/mlwikioncd/
* Wikipedia Sign post: http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2010-04-19/News_and_notes#Briefly
* Gerard’s Blog: http://ultimategerardm.blogspot.com/2010/04/best-of-malayalam-wikipedia.html
* Gerard’s Blog: http://ultimategerardm.blogspot.com/2010/04/cd-dowloaded-4390-in-10-days.html
* Gerard’s Blog: http://ultimategerardm.blogspot.com/2010/07/malayalamwikipedia-success-story.html

Posted in Community, Projects.



Attending Wikimania 2010

I will be attending Wikimania 2010 in Gdansk, Poland. This annual international conference of the Wikimedia community runs from July 9 to July 11.

I will be presenting wiki2cd, the tool I wrote for Malayalam Wikipedia version 1.0, in a joint workshop with Wikipedia offline developers. I will be joining Manuel Schneider, Shiju Alex and Martin Walker in the workshop titled: Creating offline version of Wiki content – Solutions and Challenges. Apart from this, I will be meeting the PediaPress team, the team behind Wikipedia's latest PDF/book export feature. This tool has some issues with Indic languages, mainly because its PDF rendering engine is not capable of rendering complex scripts.

Thanks to the Wikimedia Foundation for granting me a scholarship to cover travel expenses.

Posted in Community, Indic, Projects.



Malayalam Wikipedia releases selected articles on CD

As part of Malayalam Wikipedia Meetup 2010, Malayalam Wikipedia today releases 500 selected articles on a CD-ROM. This is the first time in India that a local-language Wikipedia is releasing its articles for offline use. I handled the technology part of the project.

The idea was to get the selected articles onto the CD in static form. But this is not as easy as one might imagine; it is not like saving each page from the browser to the local machine. The challenges were:

  • Automate the process of getting each page and the images in it. Wikipedia articles change frequently, so the program needs to fetch the latest version of an article from the wiki whenever it is executed.
  • Fix all the links and the CSS, JavaScript and image references so that everything resolves within the CD itself.
  • Provide a categorized index of the articles so that an article can be located easily.
  • Provide a search over the article titles.
  • The ISO 9660 filesystem of CD/DVD has lots of limitations. There are restrictions on Unicode file names, the length of file names, directory depth, special characters in file names etc. Wikipedia article and image names contain Unicode and special characters, and most of the time they exceed the file name length limit. To avoid all these problems, we should rename most of the files and then fix the cross references in all files.
  • It should work on all operating systems. All the content should be presented with HTML, JavaScript and CSS. Since the content is in Malayalam, there should not be any problem reading it even if the user does not have the required fonts on her/his machine (font embedding is required).
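The renaming step can be sketched roughly as follows. This is only an illustration of the idea, not the actual wiki2cd code: the function names, the MD5-based naming scheme and the 31-character limit are all assumptions made for the example.

```python
import hashlib
import os


def safe_name(original, max_len=31):
    """Map an arbitrary (possibly Unicode) file name to a short ASCII one.

    max_len=31 is an assumed limit for this sketch; adjust it to the
    filesystem options actually used when mastering the disc.
    """
    stem, ext = os.path.splitext(original)
    # Keep the extension only if it is plain ASCII letters/digits.
    ext = ext.lower() if ext.isascii() and ext[1:].isalnum() else ""
    digest = hashlib.md5(original.encode("utf-8")).hexdigest()
    return digest[: max_len - len(ext)] + ext


def build_rename_map(names):
    """Return {original name: safe name} for every file that goes on the CD."""
    return {name: safe_name(name) for name in names}


def fix_references(html, rename_map):
    """Rewrite references to renamed files inside one HTML page.

    Longer names are replaced first so that a name which is a substring of
    another one does not corrupt it. Real pages usually percent-encode
    links; that detail is ignored here.
    """
    for old in sorted(rename_map, key=len, reverse=True):
        html = html.replace(old, rename_map[old])
    return html
```

A hash-based name is stable across runs, so pages fetched at different times still point to the same file.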

Solving all these challenges manually is not the way to go. So I wrote a program that takes just the article titles, does all the above tasks and finally creates a repository ready for burning to CD-ROM.

Wget disappointed me in fetching the content from the wiki. There is an open bug in wget which makes downloading non-Latin URLs impossible.
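One possible way around this in Python is to percent-encode the UTF-8 bytes of the title before requesting the page. The sketch below uses only the standard library; the helper names and the default base URL are assumptions for illustration, and it is not necessarily how wiki2cd fetches pages.

```python
from urllib.parse import quote
from urllib.request import urlopen


def article_url(title, base="http://ml.wikipedia.org/wiki/"):
    """Build an ASCII-only URL for a title in any script."""
    # MediaWiki uses underscores for spaces; quote() percent-encodes
    # the UTF-8 bytes of everything that is not a plain ASCII letter/digit.
    return base + quote(title.replace(" ", "_"), safe="")


def fetch(title):
    """Download the current HTML of an article (needs network access)."""
    with urlopen(article_url(title)) as response:
        return response.read()
```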

Have a look at the CD content we created: Malayalam Wikipedia Selected 500 Articles. Hiran helped me with the artwork.

The CD cover image designed by Hiran

Since the entire process is automated, the program can be used for any other language. I am releasing the program for the benefit of everybody. You can get the program from here. It is written in Python, and jQuery is used for the UI. For details on usage, customization etc., read the wiki page of the project.

For those who can't read Malayalam, here is a sample wiki created by the wiki2cd program from 10 selected articles of English Wikipedia.

The Malayalam Wikipedia community hopes that this is a big step towards reaching the majority of people, who do not have internet access. If printed, these 500 articles would run to at least 5000 pages. The CD-ROM also includes information about commonly used free software tools for Malayalam computing. Some writing tools and fonts are distributed on the same CD-ROM.

Thanks to Malayalam Wikipedia for giving me this great opportunity to work on this project.

The ISO image of the CD is available here for download.

Posted in Malayalam, Projects.



Predictive text entry with ibus

A few days back I came to know about this project: Text Prediction on GNOME based on GTK+ Input Method context. Basically, it is an input method with a text prediction feature.

I had a similar project idea in May 2009 and had done some amount of coding for it. The project was to have an IBus input method which can do letter prediction as well as word prediction. The prediction is based on n-grams. Since it is based on IBus, it works in all desktop applications. You can see the screenshots of the prototype from here, here and here
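To give a rough idea of what n-gram based word prediction means, here is a minimal bigram sketch. It is not the ibus-sulekha code; the function names and the tiny training text are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(words):
    """Count, for every word, which words were seen right after it."""
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def predict_next(model, word, k=3):
    """Return up to k most frequent followers of `word`."""
    followers = model.get(word, Counter())
    return [w for w, _ in followers.most_common(k)]


corpus = "the cat sat on the mat the cat ran".split()
model = train_bigrams(corpus)
suggestions = predict_next(model, "the")
```

Letter prediction works the same way with characters instead of words, and longer n-grams simply use the previous two or more items as the key.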

The core code was ready. It is written in Python and uses ibus-python. Unfortunately, I have not had time to spend on this project for a long time, and currently it is not among my top priorities. Since I see many people interested in auto-completion or predictive text entry, I uploaded the code here: http://github.com/santhoshtr/ibus-sulekha . It is not in a working state as of now, but I would be happy if anybody is interested in taking it forward. I wrote some short documentation on the algorithm here; feel free to contact me if any help is required.

Posted in Projects.



Conferences : FOSS.IN and NCIDEEE

FOSS.IN 2009 starts on 1st December. I wanted to attend all 5 days, but I have another conference from Dec 1st to 3rd at Chennai: I am attending the National Conference on ICTs for the differently-abled/under privileged communities in Education, Employment and Entrepreneurship 2009 (NCIDEEE 2009) at Loyola College, Chennai. So I will miss the first 3 days of foss.in.
We have a workout on Project Silpa during foss.in. I am also planning to have a workout with Debayan and Jinesh to get his tesseract-indic OCR to work with Malayalam.

See you at foss.in!

Posted in Community.



Inkscape hyphenation extension

A year ago I wrote about how to use Inkscape as a workaround for DTP in Indic scripts. We still don't have any DTP software which supports Indic scripts in Unicode. Scribus still does not have Indic support.

One issue with Inkscape when used for DTP in Indic scripts is that some Indic scripts really need hyphenation when text is justified. For example, Malayalam has lengthy words, and space is often wasted in lines if the text is not automatically hyphenated. But this feature was not available in Inkscape. There is a wishlist bug for adding this feature to Inkscape. I tried to develop an extension for Inkscape to achieve this.

It is built on top of the Python hyphenation code written by Wilbert Berendsen. The hyphenation rules, also called patterns, are the same ones used by TeX or OpenOffice. So I can support any language which has TeX hyphenation rules. But since the hyphenation rules are language specific, we first need a language selection mechanism for the text. Only then can we select the rules and do the hyphenation. This is very tricky to implement. Asking for the language of the text every time it is justified is not a good idea. Setting a language for the document is another choice, but what if the text contains multiple languages? For Indian languages, however, it is very easy: we can automatically detect the script using Unicode code points and load the rules accordingly. So for the time being, my extension supports only English and all Indian languages.
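Script detection from code points can be sketched as below. The block ranges are the standard Unicode blocks for these scripts; the function name and the returned labels are assumptions for this example and not taken from the extension. Note that a script is not always a language (Devanagari is shared by Hindi and Marathi, for instance), so a real implementation still needs a rule for choosing the pattern file.

```python
# Unicode block ranges of the major Indic scripts.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def detect_script(word):
    """Return the script of the first recognizable letter, or None."""
    for ch in word:
        cp = ord(ch)
        for script, (start, end) in SCRIPT_RANGES.items():
            if start <= cp <= end:
                return script
        if ch.isascii() and ch.isalpha():
            return "Latin"
    return None
```

Detection is done per word here, which sidesteps the mixed-language problem: each word can be hyphenated with the patterns of its own script.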

Download the extension from http://thottingal.in/projects/inkscape_hyphenation/inkscape-hyphenation.zip . On GNU/Linux machines, extract the zip file and copy the contents to the /usr/share/inkscape/extensions folder. On Windows, extract to the [inkscape installation directory]\extensions folder. After this, close and reopen Inkscape. You will see a menu item named Hyphenate in the Effects->Text menu. In the document, add a text field and enter text in any Indian language. Select the text and apply hyphenation with Effects->Text->Hyphenate. Then change the alignment of the text to justify. You will see the text get hyphenated and occupy the maximum possible space in the text field.

I got satisfactory results with Malayalam and Tamil. I did not test other languages. The following images illustrate a hyphenated, justified, two-column layout of text done in Inkscape.

Malayalam Hyphenation In inkscape
Tamil Hyphenation in Inkscape

We had a discussion about this on the Inkscape mailing list. Some developers suggested having this feature built in, not as an extension. There are a few issues to be solved for that. One is language selection, as I explained. The other is the hyphenation character to be used. The Unicode standard insists on using the soft hyphen, U+00AD, as the hyphenation character. This is an invisible character. For Malayalam, visible hyphens are not required. But some other languages require a hyphen sign where the word is broken at the end of the line. The rules for when the soft hyphen should be visible are not clear in the Unicode specification. Pango never displays the soft hyphen. There is criticism of this specification of the soft hyphen.
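In code, marking the break opportunities with soft hyphens is simple once the positions are known. The sketch below assumes the positions have already been computed by a hyphenation library; the function name and the example positions are made up for illustration.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible unless the renderer breaks the line here


def insert_soft_hyphens(word, positions):
    """Insert U+00AD at the given character offsets of `word`."""
    parts = []
    last = 0
    for pos in sorted(positions):
        parts.append(word[last:pos])
        last = pos
    parts.append(word[last:])
    return SOFT_HYPHEN.join(parts)


# Break points 2 and 6 are assumed here, not produced by real patterns.
marked = insert_soft_hyphens("hyphenation", [2, 6])
```

Whether a visible hyphen appears at a chosen break is then entirely up to the rendering engine, which is exactly where the behaviour differs.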

So I think something needs to be done in the rendering engines, or Unicode needs to clear up the confusion. OpenOffice and HTML rendering engines, on the other hand, always show the soft hyphen as a visible hyphen at the end of the line, which is not desired for some languages.

Try this extension and let me know your comments. For small-scale DTP work, such as pamphlets, notices and brochures, Inkscape is enough. But since Inkscape is not primarily DTP software and does not have paging support, it may not work well for books and large-scale DTP work.

Posted in Indic, Projects.



New Hyphenation Pattern Extensions for Openoffice

Openoffice Indic Natural Language group announces the availability of the following Openoffice hyphenation dictionary extensions.

  1. Malayalam Hyphenation Rules version 1.2
  2. Kannada Hyphenation Rules version 1.1
  3. Bengali Hyphenation Rules version 1.1
  4. Hindi Hyphenation Rules version 1.1
  5. Telugu Hyphenation Rules version 1.0
  6. Tamil Hyphenation Rules version 1.0
  7. Gujarati Hyphenation Rules version 1.0
  8. Panjabi Hyphenation Rules version 1.0
  9. Oriya Hyphenation Rules version 1.0
  10. Marathi Hyphenation Rules version 1.0

A spellchecker extension for Malayalam is also ready.

A complete list of writing aids for OpenOffice in Indic languages is available here

Hyphenation rules for languages other than Marathi are already packaged in Fedora 11. This release contains updates and bug fixes. Fedora 12 will contain these updates. These extensions are yet to be packaged for Debian/Ubuntu.

More details about hyphenation rules are available here

Posted in Indic, Projects.



Project Silpa Updates

[Please read the Silpa project announcement before reading this blogpost]

Project silpa is getting ready for a 0.1 version.

  1. The web framework got many changes to support JSON based RPC calls from external applications. That means web/desktop applications can use the APIs of Silpa through RPC calls.
  2. Page rendering logic is moved from the server to the client. The web interface uses JavaScript based synchronous JSON RPC calls to get the results from the server. jQuery is used to render the page.
  3. Uses the PyMeld templating engine for modules having a web interface (not all modules will have a web interface).
  4. The framework is now a Python WSGI application. Initially it was plain CGI. WSGI reduces the response time and allows the server to be run as a daemon.
  5. Many new modules are getting added. Spellchecker: it is not based on aspell or hunspell, and I am going to try out some algorithms to get optimal suggestions. Not completed.
  6. Soundex algorithm: web based demo and APIs, as I explained in my previous blog post.
  7. An inexact search algorithm, and its implementation, based on the visual and phonetic distance between two words is getting ready. I will explain it in another blogpost.
  8. Hyphenation: online tool as well as APIs.
  9. N-gram for Indic languages: API and web interface.
  10. API documentation is in progress, but not completed. I have plans to make Silpa available as a Python library for offline use too.
  11. Moved from SMC's git repo to a separate git repo. After the 0.1 baseline, I will create branches for stable and development.
  12. The application is running on a git controlled deployment workflow. Thanks to Joe Maller for the nice documentation on this.
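To illustrate what a JSON based RPC endpoint inside a WSGI application can look like, here is a minimal, self-contained sketch. It is not Silpa's code: the method name `demo.reverse`, the request/response field names and the helper `call_locally` are all hypothetical, chosen only to show the general shape.

```python
import io
import json


def reverse_text(text):
    """Stand-in for a real module function exposed over RPC."""
    return text[::-1]


# Hypothetical registry mapping RPC method names to Python callables.
METHODS = {"demo.reverse": reverse_text}


def application(environ, start_response):
    """WSGI entry point: read a JSON request, dispatch, reply with JSON."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    request = json.loads(environ["wsgi.input"].read(size))
    func = METHODS.get(request.get("method"))
    if func is None:
        reply = {"result": None, "error": "unknown method", "id": request.get("id")}
    else:
        result = func(*request.get("params", []))
        reply = {"result": result, "error": None, "id": request.get("id")}
    body = json.dumps(reply).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_locally(method, params):
    """Exercise the application without a web server (for experiments)."""
    payload = json.dumps({"method": method, "params": params, "id": 1}).encode("utf-8")
    environ = {"CONTENT_LENGTH": str(len(payload)), "wsgi.input": io.BytesIO(payload)}
    chunks = application(environ, lambda status, headers: None)
    return json.loads(b"".join(chunks))
```

Because the application object is a plain callable, the same code can be served by any WSGI server or called directly, as `call_locally` does.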

That’s all for now! There are still many things to be done. Some of the modules do not support all languages as of now. If anybody is interested in contributing to the project, please contact me. Try out the application, read the code and let me know your comments.

Posted in Indic, Projects.


Phonetic Comparison Algorithm for Indian Languages

Soundex is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described in Donald Knuth’s The Art of Computer Programming. The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government.

The soundex code for a word is an English letter followed by a number of digits. The algorithm is explained with examples here

By this algorithm, if my name is written as Santhosh, Santosh, Santhos or Santos, the soundex code remains the same: S532.
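For readers who want to check the code themselves, here is a small sketch of American Soundex as it is commonly described (letters grouped into six digit classes, vowels dropped, H and W not separating equal codes). The function name and structure are my own for this example.

```python
# Letter groups of American Soundex and their digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name, length=4):
    """Return the Soundex code of an English name, e.g. one letter + 3 digits."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:length]


variants = ["Santhosh", "Santosh", "Santhos", "Santos"]
codes = {soundex(v) for v in variants}
```

All four spellings end up in the single-element set `codes`.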

Soundex has many limitations and sometimes produces false positives or false negatives. There are variants of Soundex; one important variant is the Metaphone algorithm by Lawrence Philips. Metaphone has an improved version called Double Metaphone.

Well, it works fine, but only for English. Just like English, our languages also have varying spellings. But more than spelling, in India we have another issue to address: words (often nouns) getting transliterated among Indian languages. Let me give an example: in a railway reservation chart, your name will be written in English as well as in Hindi. You are from Kerala (or some other state) and your name is transliterated to some other language. The only thing that remains the same is its pronunciation. It would be great if we could search this data based on pronunciation, right?

We see a lot of discussion nowadays around e-governance and other computerization projects such as the national UID. Projects that handle Indic text heavily will require efficient search and string processing algorithms.

Suppose you have a list of names in Bengali, and you don't know Bengali but you know Malayalam. Obviously you can’t search Bengali text using Malayalam. But can’t we develop an algorithm that makes this possible?

So our requirement is this:

  1. Language Independent Search on Indic text.
  2. Comparison should be based on pronunciation
  3. Should be tolerant to spelling variation

The Soundex algorithm discussed above fits this requirement well. The only issue is that the algorithm is defined only for English. So it is time to design a soundex for our languages!

The original Soundex algorithm is not multilingual, and it was not designed for language-independent indexing. The six digits it uses for phonetic categories are not enough for Indian languages; we have more phonetic families. So our algorithm will not be an exact “localization” of English Soundex, but we will use the concept. Let us call it IndicSoundex :) . In fact, the Metaphone algorithm uses 16 families for English.

One of the characteristics of Indian languages is that they all share the same phonetic features. We have the vowels a, aa, i, ii, u…, then the consonant families ka, cha, ta, tha, pa… etc. What we need to do is map all these sets to a common representation which is independent of language. We call that representation the soundex of an Indic word.

While grouping and mapping Indic letters to phonetic codes, the following points are taken into consideration.

  1. Group short and long vowels into a single code. e and ee are considered equal.
  2. Consider half consonants as full consonants. For this, ignore halants.
  3. Group consonant families. ka, kha, ga, gha, nga become a single family. The same goes for the cha, ta, tha and pa families.
  4. Group ra, Ra
  5. Group la, La, zha
  6. Group sa, Sa, sha

When I grouped the letters like this, I got 20 groups. I have prepared a table of all Indic letters and their corresponding codes here: http://thottingal.in/soundex/soundex.html

Algorithm:

  1. For each letter in the word except the first letter, get the corresponding soundex digit from the character map, which is nothing but a table like this.
  2. If the letter is not found in the character map, the soundex digit for that letter is 0.
  3. Duplicate consecutive soundex codes are skipped, i.e. effectively क्क will be considered as क.
  4. Replace the first digit with the first letter of the word.
  5. Remove all 0s from the soundex code.
  6. Return the soundex code padded to the required length (i.e. if the required length of the code is 5 and the soundex is സBCD, then the soundex returned will be സBCD0).
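The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Silpa implementation: the code letters in `OFFSET_CODES` are invented for the example and do not match the real table, only a handful of letter groups are covered, and halants are simply dropped before lookup. The sketch relies on the fact that the Unicode blocks of the major Indic scripts place corresponding letters at the same offset within each 128-code-point block.

```python
INDIC_START, INDIC_END = 0x0900, 0x0D7F  # Devanagari .. Malayalam
VIRAMA_OFFSET = 0x4D                     # halant/virama in every block

# Hypothetical codes keyed by offset inside a script block.
OFFSET_CODES = {}
for code, offsets in (
    ("A", (0x3E,)),            # vowel sign aa
    ("I", (0x3F, 0x40)),       # vowel signs i, ii
    ("U", (0x41, 0x42)),       # vowel signs u, uu
    ("E", (0x46, 0x47)),       # vowel signs e, ee
    ("O", (0x4A, 0x4B)),       # vowel signs o, oo
    ("K", range(0x15, 0x1A)),  # ka kha ga gha nga
    ("C", range(0x1A, 0x1F)),  # cha family
    ("T", range(0x1F, 0x24)),  # retroflex ta family
    ("D", range(0x24, 0x29)),  # dental ta family
    ("P", range(0x2A, 0x2F)),  # pa family
    ("S", range(0x36, 0x39)),  # sha, ssa, sa
):
    for off in offsets:
        OFFSET_CODES[off] = code


def letter_code(ch):
    """Language-independent code of one letter; '0' if it is not in the map."""
    cp = ord(ch)
    if INDIC_START <= cp <= INDIC_END:
        return OFFSET_CODES.get(cp % 0x80, "0")
    return "0"


def is_virama(ch):
    cp = ord(ch)
    return INDIC_START <= cp <= INDIC_END and cp % 0x80 == VIRAMA_OFFSET


def indic_soundex(word, length=8):
    letters = [ch for ch in word if not is_virama(ch)]  # ignore halants
    if not letters:
        return ""
    codes = [letter_code(ch) for ch in letters]
    deduped = [codes[0]]
    for code in codes[1:]:
        if code != deduped[-1]:          # skip consecutive duplicates
            deduped.append(code)
    tail = "".join(c for c in deduped[1:] if c != "0")  # drop unknown letters
    # The first letter itself replaces its code; pad with 0 to the length.
    return (letters[0] + tail).ljust(length, "0")[:length]
```

With this toy map, a word written in Malayalam and its Devanagari transliteration get codes that differ only in the first character, which is what the comparison step builds on.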

Of course, we need an implementation. Get the Python code for this from here. If you don’t care about the code, try the online soundex converter here: http://smc.org.in/silpa/Soundex

With the above algorithm, the soundex code for സന്തോഷ് is സLKES000, and for सन्तौष it is सLKES000. So if we need to compare सन्तौष and സന്തോഷ് and get a positive result, the comparison logic has to do something more, thereby making the comparison language independent.

An example :
കാര്‍തിക്, കാര്‍ത്തിക്,  കാര്‍തിഗ് = കAPKBF00
கார்திக்= கAPKBF00

Comparison

  1. Compare the two strings without calculating the soundex codes. If they are the same, return 0.
  2. Calculate the soundex codes for both strings. If they match, return 1, i.e. both strings are from the same language and sound alike.
  3. If the soundex codes of the strings have different first letters and the rest is the same, check whether the first letters have the same soundex digit. If so, the words are from different languages but sound alike. The return code for this is 2.
  4. If none of the above conditions match, return -1, indicating that the strings are completely different.
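The comparison steps translate almost directly into code. In this sketch the soundex function and the per-letter code lookup are passed in as parameters, so the logic can be tried with any encoder; the parameter names are assumptions made for the example.

```python
def compare(first, second, soundex, letter_code):
    """Return 0, 1, 2 or -1 following the four comparison steps.

    soundex(word)   -> soundex code whose first character is the first letter
    letter_code(ch) -> language-independent code of a single letter
    """
    if first == second:
        return 0                      # identical strings
    code_a, code_b = soundex(first), soundex(second)
    if code_a == code_b:
        return 1                      # same language, sounds alike
    if (code_a and code_b
            and code_a[1:] == code_b[1:]
            and letter_code(code_a[0]) == letter_code(code_b[0])):
        return 2                      # different languages, sounds alike
    return -1                         # no phonetic match
```

Keeping the first letter in its original script inside the code is what lets step 2 distinguish a same-language match from the cross-language match of step 3.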

Use http://smc.org.in/silpa/Soundex to experiment with this algorithm. The following screenshot shows how കാര്‍ത്തിക് and કાર્તિક્ are found to be phonetically similar, or “sounds alike”.

soundexcomparison

OK, we have the soundex code. The soundex code by itself is not useful; we need to implement a search utility based on it: as I explained above, an intra-Indic search program. We will see that later.

Feel free to comment on the algorithm, and please suggest any improvements we can make.

Posted in Indic.
