Python isalpha is buggy

This code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Print only the characters that isalpha() accepts.
ml_string = u"സന്തോഷ്  हिन्दी"
for ch in ml_string:
    if ch.isalpha():
        print(ch)

gives this output

സ
ന
ത
ഷ
ह
न
द

isalpha() fails for all the matra (vowel sign) characters of Indian languages. This is a known bug in glibc.
Does anybody know whether Python internally uses glibc functions for these basic string operations, or whether it uses a separate character database like Qt does?
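One way to look into this is to print the Unicode general category of each character next to the isalpha() result. The sketch below assumes Python 3, whose documentation defines str.isalpha() in terms of the Unicode "Letter" categories (Lm, Lt, Lu, Ll, Lo); the helper name describe is only illustrative.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, isalpha) for each non-space char."""
    return [(ord(ch), unicodedata.category(ch), ch.isalpha())
            for ch in text if not ch.isspace()]

sample = "സന്തോഷ്  हिन्दी"
for codepoint, category, alpha in describe(sample):
    # Only ASCII is printed, so the result is readable on any terminal.
    print("U+%04X category=%s isalpha=%s" % (codepoint, category, alpha))
```

The characters that were dropped from the output above are the ones whose category starts with M (combining marks) rather than L (letters).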

Yahoo search bug

None of the search engines can handle Indian languages very well. Google removes the zero width joiners and non-joiners that are used in many languages. Yahoo does not remove them, but a UI bug in its results page makes the results look wrong.
See the image below:

The bottom half of the image is the page source. We can clearly see that the closing bold tag is placed in the middle of the word instead of at the end of the word. As a result, the word is rendered wrongly in the page.
This happens for all languages which use ZWJ, ZWNJ, ZWS etc. The highlighter breaks the word just before the ZWNJ/ZWJ and puts the closing bold tag there when highlighting the search result.
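As a rough illustration of the expected behaviour (this is not Yahoo's code), here is a minimal Python sketch of a highlighter that treats ZWJ and ZWNJ as part of the word, so the closing bold tag lands after the whole word. The function names and the sample word are assumptions made for the example.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ

def is_word_char(ch):
    """Letters, combining marks and joiners all belong to a word."""
    return ch in JOINERS or unicodedata.category(ch)[0] in ("L", "M")

def highlight(text, query):
    """Wrap every whole word equal to query in <b>...</b>."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if not is_word_char(text[i]):
            out.append(text[i])
            i += 1
            continue
        j = i
        while j < n and is_word_char(text[j]):
            j += 1
        word = text[i:j]
        out.append(("<b>" + word + "</b>") if word == query else word)
        i = j
    return "".join(out)

# Sample Malayalam word with a ZWJ in the middle (illustrative input).
word = "\u0d28\u0d28\u0d4d\u200d\u0d2e"
# ascii() makes the invisible joiner visible as \u200d in the output.
print(ascii(highlight("x " + word + " y", word)))
```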

I showed this to Gopal, and he told me that he has filed a bug for it.

KDE spellchecker not working for Indian Languages

As I mentioned in my blog post on language detection, the Sonnet spellchecker of KDE is not working for Indian languages. I read the Sonnet code and found that it fails to determine the word boundaries in a sentence (or string buffer) and passes fragments of words to backend spellcheckers like aspell or hunspell. As a result, every word is reported as wrong. This is the logic Sonnet uses to recognize word boundaries:

Loop through the chars of the word, until the current char is not a letter anymore.

For this, it uses the QChar::isLetter() function. This function fails for the matra signs of our languages.
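The effect of that loop can be sketched in Python. This is only an illustration of the logic described above, not Sonnet's code: str.isalpha stands in for QChar::isLetter(), and scan_words and letter_or_mark are hypothetical names.

```python
import unicodedata

def scan_words(text, is_letter):
    """Collect maximal runs of characters accepted by is_letter."""
    words, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            # A rejected character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

def letter_or_mark(ch):
    """Accept letters (L*) and combining marks (M*), which covers matras."""
    return unicodedata.category(ch)[0] in ("L", "M")

text = "हिन्दी"
print("letters only:      %d fragments" % len(scan_words(text, str.isalpha)))
print("letters and marks: %d word(s)" % len(scan_words(text, letter_or_mark)))
```

With a letters-only test, each matra ends the current word, so the backend spellchecker receives bare consonants instead of the real word.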

A screenshot from a text area in Konqueror:

For example

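#include <QtCore/QChar>
#include <stdio.h>

int main() {
	// Tamil letter A (U+0B85): a base letter.
	QChar letter(0x0B85);
	fprintf(stdout, "%d\n", letter.isLetter());
	// Devanagari vowel sign II (U+0940): a matra.
	letter = QChar(0x0940);
	fprintf(stdout, "%d\n", letter.isLetter());
	return 0;
}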


This program prints 1 (true) for அ and 0 (false) for ी.

When I showed this to Sayamindu during foss.in, he pointed me to a bug in glibc. Even though the bug is about Bengali, it applies to all Indian languages. It is assigned to Pravin Satpute, who told me that he has a solution and will submit it to glibc soon.

But I wonder why this bug in KDE has gone unnoticed so far. Has nobody used spellcheck for Indian languages in KDE?!

Let me explain why this does not happen in the GNOME spellchecker even though it is a glibc bug. In GNOME, the word splitting is done in the application itself using gtk_text_iter_*, and that iteration over words relies on Pango's word boundary detection algorithms.

I have filed a bug in KDE to track it.

Firefox spellcheck bugs…

The Firefox spellcheck feature needs some volunteers to fix the
tokenization issue. There are two bugs related to tokenization:

  1. Bug 434044 – The tokenization of words for spellcheck is wrong when there is a ZWJ/ZWNJ/ZWS in the word. – Reported: 2008-05-16 07:49 PDT by Santhosh Thottingal
  2. Bug 318040 – Spell checker flags words containing full stops (periods) – Reported: 2005-11-28 12:45 PDT by Joseph Wright

10 GB /var/log/messages file

Again fedora! 🙂
After installing the Linux kernel and operating system, I installed some libraries and a few small applications that I usually use. I have a 14 GB partition for Fedora 9. After installing all that software, when I rebooted the system today, GDM would not start. It kept restarting, and I could not get a console session by pressing Ctrl + Alt + F1. Hmm… So I added single to the kernel arguments in GRUB and got a shell.
To my surprise, df -a said the partition was 100% full! I had installed only a few applications, nothing close to 14 GB.
So I tried to figure out what was taking up all the disk space, and I caught the culprit:
/var/log/messages 🙂
Yes!
$ls -l messages
-rw-------+ 1 root root 10450239682 2008-05-27 20:39 messages

OK, 9.7 GB. So who is writing to messages?
$tail -n 100 messages
This gave me some hint. Some sample lines from messages file:
May 27 20:39:23 thottingal gdm-simple-slave[2523]: DEBUG: GdmSignalHandler: Adding handler 5: signum=8 0x804c520
May 27 20:39:23 thottingal gdm-simple-slave[2523]: DEBUG: GdmSignalHandler: Registering for 8 signals
May 27 20:39:23 thottingal gdm-simple-slave[2523]: DEBUG: GdmSignalHandler: Adding handler 6: signum=1 0x804c520
May 27 20:39:23 thottingal gdm-simple-slave[2523]: DEBUG: GdmSignalHandler: Registering for 1 signals
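For a file this large, tail shows only the most recent writer. A rough Python sketch for tallying lines per program over the whole file is below; it assumes the classic "Mon DD HH:MM:SS host program[pid]:" syslog layout seen in the sample lines, and count_by_program is an illustrative name.

```python
import re
from collections import Counter

# Captures the program name that follows the timestamp and host name.
TAG = re.compile(r"^\w{3}\s+\d+\s[\d:]{8}\s\S+\s([^\s\[:]+)")

def count_by_program(lines):
    """Count log lines per program name; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = TAG.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "May 27 20:39:23 thottingal gdm-simple-slave[2523]: DEBUG: GdmSignalHandler: Registering for 8 signals",
    "May 27 20:39:23 thottingal gdm-simple-slave[2523]: DEBUG: GdmSignalHandler: Registering for 1 signals",
]
for program, total in count_by_program(sample).most_common(5):
    print(program, total)
```

Since a file object yields one line at a time, passing open("/var/log/messages", errors="replace") to count_by_program (as root) avoids loading the 10 GB file into memory.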

GDM was writing all its debug messages to /var/log/messages. Can somebody help me figure out what is wrong with my GDM?
(the [debug] section of /etc/gdm/custom.conf is empty)

Bug in Firefox Spellcheck

There is a bug in the spell check functionality of Firefox that affects many Indian languages which use Zero Width [Non] Joiners inside words. Firefox uses Hunspell as its spelling checker, and so does OpenOffice. The bug is not present in OpenOffice; the problem in Firefox lies in how words in editable text fields are tokenized before spellchecking. Firefox splits a word wherever it contains a ZWJ/ZWNJ. Because of this, the input given to the spellchecker is a fragment and not the actual word.
I have filed a bug against the Firefox spellchecker and you can see it here (bug #434044).
I have given some sample words with ZWJ/ZWNJ in Malayalam and Bengali (thanks to Runa). If your language uses ZWJ/ZWNJ, please comment or vote in Mozilla Bugzilla.

I found this while I was preparing a Malayalam spellcheck extension for Firefox (a Hunspell word list). Many languages still do not have affix rules in place for aspell/hunspell, which makes spellchecking less efficient, particularly for highly inflected or agglutinative languages like Malayalam.

Thanks to Németh László, the Hunspell developer, for helping me figure out the problem.