Malayalam morphology analyser – First release

I am happy to announce the first version of the Malayalam morphology analyser. After two years of development, I tagged version 1.0.0. In this release, mlmorph can analyse and generate Malayalam words using its defined morpho-phonotactical rules and a lexicon. We have a test corpus of fifty thousand words, and 82% of the words in it are recognized by the analyser. A Python interface is released to make the library easy for developers to use. [Read More]

Identifiers In Indic Languages

Recently, while preparing a critique of the IDN policy for the Malayalam language prepared by CDAC, I noticed that ICANN does not allow control characters in domain names. Some time back I noticed that Python 3 identifiers also do not allow control characters. This blog post attempts to analyze the issue by looking at the Unicode and ICANN specifications for these special characters. Apart from the regular characters of Indic scripts, the Zero Width Joiner and Zero Width Non-Joiner are widely used in Indic languages to control how ligatures are formed. [Read More]
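As a rough illustration of the identifier question, the following Python 3 sketch inspects the two joiners and asks the interpreter whether a word with and without a joiner is a valid identifier. The helper name `joiner_report` and the sample word are hypothetical choices made for this example. The result for the word containing the joiner may depend on the Unicode version bundled with the interpreter, so no expected output is shown.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def joiner_report(word):
    """Return (name, general category) for each joiner found in word.

    Hypothetical helper, written only for this illustration.
    """
    return [(unicodedata.name(ch), unicodedata.category(ch))
            for ch in word if ch in (ZWNJ, ZWJ)]


# Sample Malayalam word written with escapes so the joiner is visible.
plain = "\u0d28\u0d28\u0d4d\u0d2e"                  # no joiner
with_zwnj = "\u0d28\u0d28\u0d4d" + ZWNJ + "\u0d2e"  # same word with a ZWNJ

for word in (plain, with_zwnj):
    # str.isidentifier() reports what this interpreter accepts as a name.
    print(repr(word), joiner_report(word), word.isidentifier())
```

Both joiners belong to the Unicode general category Cf (format), which is why they tend to be treated separately from letters and combining marks.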

Dictionary Jabber Buddy Bots

Recently we released two Jabber buddy bots for dictionary lookup. By adding eng.mal.dict@gmail.com as a chat contact, one can ask for the meaning of an English word in Malayalam by just sending a chat message. Similarly, for the English-Hindi or Hindi-English dictionary, we have another bot, eng.hin.dict@jabber.org. Both dictionaries use Dict databases based on the DICT protocol. Both bots were well received by users. We have 8000+ users for the English-Malayalam dictionary. [Read More]

Python isalpha is buggy

This code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Print every character of the string that isalpha() accepts.
ml_string = u"സന്തോഷ്  हिन्दी"
for ch in ml_string:
    if ch.isalpha():
        print(ch)

gives this output

സ
ന
ത
ഷ
ह
न
द

That is, isalpha() fails for all the mathra (vowel) signs of Indian languages. This is a known bug in glibc.

Does anybody know whether Python internally uses glibc functions for these basic string operations, or uses a separate character database like Qt does?
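One way to look into this from within Python is to print the Unicode general category of each character using the standard unicodedata module. The Python 3 documentation describes str.isalpha() as true for characters whose general category is one of Lm, Lt, Lu, Ll or Lo, so the sketch below compares each character against that set. The helper name `describe` is a hypothetical choice for this example.

```python
import unicodedata

# General categories that the Python 3 docs list for str.isalpha().
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def describe(text):
    """Return (char, category, isalpha) for every non-space character.

    Hypothetical helper, written only for this illustration.
    """
    return [(ch, unicodedata.category(ch), ch.isalpha())
            for ch in text if not ch.isspace()]


sample = "സന്തോഷ്  हिन्दी"
for ch, category, alpha in describe(sample):
    # Show the code point so combining signs are easy to identify.
    print("U+%04X %s %s" % (ord(ch), category, alpha))
```

In this sample the vowel signs and the virama carry mark categories (Mc or Mn) rather than a letter category, which matches the characters missing from the output above.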