KDE spellchecker not working for Indian Languages

As I mentioned in my blog post on language detection, the Sonnet spellchecker of KDE does not work for Indian languages. I read the code of Sonnet and found that it fails to determine the word boundaries in a sentence (or string buffer) and passes partial words to backend spellcheckers like aspell or hunspell. Eventually every word gets marked wrong. This is the logic used in Sonnet to recognize word boundaries:

Loop through the characters of the string until the current character is not a letter anymore.

And for this, it uses the QChar::isLetter() function. This function fails for the matra signs (the dependent vowel signs) of our languages.

A screenshot from a text area in Konqueror:

For example

#include <QtCore/QChar>
#include <stdio.h>

int main() {
	// U+0B85 TAMIL LETTER A (அ): an independent letter
	QChar letter(0x0B85);
	fprintf(stdout, "%d\n", letter.isLetter());
	// U+0940 DEVANAGARI VOWEL SIGN II (ी): a matra, i.e. a combining mark
	letter = QChar(0x0940);
	fprintf(stdout, "%d\n", letter.isLetter());
	return 0;
}


In this program, you will get true (1) as output for அ and false (0) for ी.
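
The direction of the fix seems clear: a word-boundary check for Indic text has to accept combining marks, not just letters. Here is a minimal Python sketch of such a check (the helper name is mine, not Sonnet's):

import unicodedata

def is_word_char(ch):
    # Letters (categories L*) and combining marks (categories M*, where
    # the matras fall) should both be treated as part of a word.
    return unicodedata.category(ch)[0] in ("L", "M")

for ch in ("அ", "ी"):
    print(ch, unicodedata.category(ch), is_word_char(ch))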

When I showed this to Sayamindu during foss.in, he showed me a bug in glibc. Even though the bug is about Bengali, it is applicable to all Indian languages. It is assigned to Pravin Satpute, and he told me that he has a solution and will submit it to glibc soon.

But I am wondering why this bug in KDE has gone unnoticed so far. Has nobody used spellcheck for Indian languages in KDE?!

Let me explain why this does not happen in the GNOME spellchecker if this is a glibc bug. In GNOME, the word splitting is done in the application itself using the gtk_text_iter_* functions, and the iteration through words is done by Pango's word boundary detection algorithms.
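
As an illustration, here is a minimal sketch (assuming PyGTK; the sample text and loop are mine) of how an application can walk word boundaries with the text-iter API, which uses Pango's word-break algorithm underneath:

# -*- coding: utf-8 -*-
import gtk

buf = gtk.TextBuffer()
buf.set_text(u"ഇത് ഒരു ചെറിയ പരീക്ഷണം ആണ്. ")   # sample Malayalam sentence

it = buf.get_start_iter()
while it.forward_word_end():        # jump to the end of the next word
    start = it.copy()
    start.backward_word_start()     # back up to the start of that word
    print(buf.get_text(start, it))  # each word, as Pango sees it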

Filed a bug in KDE to track it.

Language Detection and Spellcheckers

A few weeks back there was a discussion on the #indlinux IRC channel about automatic language detection. The idea is that spellcheckers, or any language tools, should not ask the users to select a language; instead, they should detect the language automatically. The idea is not new. There is a KDE bug here, and Ubuntu has this as a brainstorm idea. It seems M$ Word already has this.

A sample use case can be this: “While preparing a document in OpenOffice, I want to write in English as well as in Hindi. For doing spellcheck, I need to manually change the language rather than the application detecting it automatically.”

Regarding the algorithm behind automatic language detection, there are many approaches. Statistical approaches are effective for languages sharing the same script (for example, languages which use the Latin script, or Hindi and Marathi). N-gram based methods are used in the statistical approach. Here is a ‘patented’ idea, and this page explains a character trigram approach. Google has a language detection service (http://www.google.com/uds/samples/language/detect.html) and it seems it is still in development, or in a ‘learning stage’.

Here is an example of statistical language detection: languid. (It did not work for me when I tried, but you can download the source code and check.)

Sonnet is the spellchecker framework of KDE, written by J. Rideout. It is also trying to provide the language detection feature. Here is an old article on linux.com about that. It is based on n-gram text categorization and is a port of languid. From the article:

A gram is a segment of text made of N number of characters. Sonnet uses trigrams, made from three characters. By analyzing the popularity of any given trigram within a text, one may make assumptions about the language the text is written in. Rideout gives an example: “The top trigram for our English model is ‘_th’ and for Spanish ‘_de’. Therefore, if the text contains many words that start with ‘th’ and no words that start with ‘de,’ it is more likely the text is in English [than Spanish]. Additionally, there are several optimizations which include only checking the language against languages with similar scripts and some heuristics that use the language of neighboring text as a hint.”

(I tried Sonnet and could not get it working for ml_IN. Instead of words, it was iterating through letters. Anyway, I will check this problem later.)
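
To make the trigram idea concrete, here is a minimal Python sketch of n-gram text categorization using the "out of place" distance from Cavnar and Trenkle's paper; the tiny inline models are only placeholders, real ones are built from large corpora:

from collections import Counter

def trigram_profile(text, top=300):
    # '_' marks word boundaries, as in the '_th' / '_de' example above
    padded = "_" + "_".join(text.lower().split()) + "_"
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference):
    # sum of rank differences; trigrams missing from the reference model
    # get the maximum penalty
    ranks = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - ranks.get(g, len(reference)))
               for i, g in enumerate(profile))

models = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "es": trigram_profile("el rapido zorro marron salta sobre el perro perezoso"),
}
profile = trigram_profile("the language of this text should be detected as English")
print(min(models, key=lambda lang: out_of_place(profile, models[lang])))   # -> en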

As far as Indian languages are concerned, Unicode code range based language detection will work for most cases. Most of the languages have their own script and Unicode code point range. For example, detecting Malayalam is a matter of checking whether the letters are in the Malayalam Unicode range. But for the Devanagari script it is not straightforward: Hindi, Marathi, etc. use the Devanagari script. Dhvani, the text-to-speech system for Indian languages, uses a simple algorithm for language detection (http://dhvani.sourceforge.net/doc/language-detection.html). There, Hindi and Marathi are distinguished by giving priority to the LANG environment variable. But that will fail if somebody tries to use Marathi on an English desktop. (Users can specify the language to be used; in that case language detection will not be done.)
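
Here is a minimal sketch of that Unicode-range check in Python. The ranges are the standard Unicode blocks; mapping Devanagari straight to "hi" shows exactly the ambiguity mentioned above, since Hindi and Marathi share the script:

SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),   # Devanagari (Hindi, Marathi, ...)
    "bn": (0x0980, 0x09FF),   # Bengali
    "ta": (0x0B80, 0x0BFF),   # Tamil
    "ml": (0x0D00, 0x0D7F),   # Malayalam
}

def detect_language(word):
    for lang, (low, high) in SCRIPT_RANGES.items():
        if any(low <= ord(ch) <= high for ch in word):
            return lang
    return "en"   # fall back for Latin and anything else

print(detect_language("മലയാളം"))   # ml
print(detect_language("हिन्दी"))    # hi
print(detect_language("hello"))    # en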

In the case of spellcheckers, there are other options besides the LANG environment variable. When you type in gedit or any text editor, detecting the keyboard layout would be one way of detecting the language. But it depends on which IME the user uses; it can be xkb or SCIM, or even a copy-paste.

Anyway, it is pretty clear that the current natural language features in the free desktops require more improvements. Based on a discussion we had on #indlinux IRC, we have set up a wiki page here to discuss this.

As a proof of concept, I tried to write a spellchecker for the gedit text editor with language detection for Indian languages. Basically it uses the Unicode character range. It is a gedit plugin written in Python, and it uses the pyenchant spellcheck wrapper library. Install python-enchant using your package manager if it is not already installed. Download the plugin and the Python module to the ~/.gnome2/gedit/plugins folder and restart gedit. Enable External Tools and the new Spellchecker plugin in Edit->Preferences->Plugins. It does not have the Pango error-style underline or suggestions in the context menu as of now; it just prints the results and suggestions to the console of gedit. And 'Add to Dictionary' etc. are not there yet.
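
For reference, the pyenchant calls the plugin relies on are quite small. A minimal sketch (the helper function and the sample word are mine; it needs the corresponding enchant dictionaries installed):

import enchant

def check_word(word, lang):
    try:
        d = enchant.Dict(lang)          # e.g. "ml_IN", "hi_IN", "en_US"
    except enchant.DictNotFoundError:
        print("no dictionary installed for", lang)
        return
    if not d.check(word):
        print(word, "->", d.suggest(word))   # suggestions go to the console

check_word("speling", "en_US")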

I would like to request interested developers to come forward and make this feature ready to use in free desktops. Suggestions are welcome. We need good algorithms for detecting the language too.
A sample use case: “The system locale is English, and I am typing a document in Hindi and want to write some Marathi sentences in between. Without me manually changing the language, the system detects the language of each word and checks the spelling against the corresponding dictionary.”

PS: Because of the inflectional and agglutinative nature of some of the Indian languages, the spellchecking is not at all effective. I will write about that later.

Bug in Firefox Spellcheck

There is a bug in the spellcheck functionality of Firefox that affects many Indian languages which use Zero Width (Non) Joiners in words. Firefox uses hunspell as the spelling checker; OpenOffice also uses hunspell, but the bug is not there in OpenOffice. The problem in Firefox is with the tokenization of words in editable text fields before doing the spellcheck: Firefox splits a word if there is a ZWJ/ZWNJ in it, and because of this the input given to the spellchecker is not the actual word.
I have filed a bug against the spellchecker of Firefox and you can see it here (bug #434044).
I have given some sample words in Malayalam and Bengali (thanks to Runa) with ZWJ/ZWNJ. If your language uses ZWJ/ZWNJ, please comment/vote in the Mozilla Bugzilla.
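
To see the tokenization problem in isolation, here is a minimal Python sketch (the sample word and helper names are mine). A tokenizer that accepts only letters and combining marks as word characters breaks the word at the invisible joiner, while one that also accepts ZWJ/ZWNJ keeps it whole; that is roughly the difference between the buggy and the expected behaviour:

import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"          # both are in category Cf (format)

def tokenize(text, is_word_char):
    words, current = [], ""
    for ch in text:
        if is_word_char(ch):
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

buggy = lambda ch: unicodedata.category(ch)[0] in ("L", "M")
fixed = lambda ch: unicodedata.category(ch)[0] in ("L", "M") or ch in (ZWNJ, ZWJ)

word = "സോഫ്റ്റ്" + ZWNJ + "വെയർ"       # a Malayalam word with a ZWNJ inside
print(tokenize(word, buggy))   # two fragments: not the actual word
print(tokenize(word, fixed))   # the whole word reaches the spellchecker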

I found this when I was trying to prepare a Malayalam spellcheck extension (a hunspell wordlist) for Firefox. Many languages still do not have affix rules in place for aspell/hunspell, and that makes the spellcheck less effective, particularly for highly inflected/agglutinative languages like Malayalam.
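
For what it is worth, an affix rule lets one dictionary entry stand for many surface forms. A toy hunspell example (English, just to show the .aff/.dic mechanics; real Malayalam rules would be far more involved):

toy.aff:
SET UTF-8
SFX S Y 1
SFX S 0 s .

toy.dic:
2
cat/S
walk/S

With the single S flag, "cats" and "walks" are accepted without being listed; for an agglutinative language the rule set has to cover a much larger space of suffixes and sandhi changes.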

Thanks to Németh László, the Hunspell developer, for helping me figure out the problem.

Only Aspell, no space for others…

It seems that our work on our own spellchecker does not have any importance other than learning; Aspell is light years ahead of us. There are ispell and myspell too. But we learned a lot about approximate string comparison, fast search on a big wordlist, candidate list generation, etc. Gora Mohanty gave me valuable insights on Aspell and how to create the Aspell word list for Malayalam. But there are still problems with the compound words of Malayalam: “Sandhi & Samasam” and the infinite number of words that can be created by them in Malayalam are a big hurdle for us.
Can we create a dictionary with all those words?
Can we code that large set of rules?!!
Wait and See 😉
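
As a rough illustration of the candidate list generation we experimented with, here is a minimal Python sketch: rank dictionary words by edit distance to the misspelled word. The toy word list stands in for a real Malayalam dictionary, and Aspell's actual suggestion logic is considerably smarter:

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def candidates(word, wordlist, limit=5):
    return sorted(wordlist, key=lambda w: edit_distance(word, w))[:limit]

wordlist = ["മരം", "മല", "മഴ", "മലയാളം", "മനസ്സ്"]
print(candidates("മലയളം", wordlist))   # "മലയാളം" should rank first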

Spell checker and Late night coding..

It was a wonderful weekend. Benzi and I were working on the spellchecker for Malayalam. In April we had done a lot of research on this. We did the coding for the dictionary representation in a trie (retrieval tree). On Saturday night we did the candidate list generation coding. It is a wonderful experience to code late at night, with one laptop and two people coding! Everything worked fine. When we finished the coding, we realized that the application could be tuned into a universal(?) spellchecker. So on Sunday we tested it with 3 lakh English words. It worked fine! We compared the spelling suggestions generated with Aspell's; ours was giving more options, since our dictionary is bigger than Aspell's.
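
A minimal sketch of the kind of trie (retrieval tree) dictionary described above; the node layout and the tiny word list are illustrative only:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

dictionary = Trie(["മരം", "മഴ", "മല", "മലയാളം"])
print(dictionary.contains("മല"))       # True
print(dictionary.contains("മലയളം"))    # False; candidates come from walking
                                       # nearby nodes of the trie
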
We want it to be called bspell :-). But questions remain…
Why bspell? What extra (or fewer) features does bspell have compared to Aspell?
How to make it language independent?
How to rank the spelling suggestions?
How to make it work with office suites and editors?

The answer is: “study Aspell”.