KDE Indic Screensavers

I ported all of the Matrix screensavers with Indian language glyphs to KDE4. For details about the screensavers please read:

Download the binary packages: Deb package, and RPM package

There are 6 screensavers in that package: Malayalam, Hindi, Oriya, Bengali, Tamil and Gujarati. After installation, go to KDE System Settings->Desktop->Screensaver and select any of them.

Screenshots:


KDE Screensaver configuration for Hindi:


Enjoy…!

Hyphenation of Indian Languages in Webpages

In my last blog post I explained hyphenation of Indian language text in OpenOffice. In this post I will explain how hyphenation can be done in web pages.

As I explained there, hyphenation becomes important when we justify text. The length of the lines is controlled by the parent tags. Unicode defines a special character for hyphenation called the soft hyphen (U+00AD), denoted by &shy; in HTML. In HTML, the plain hyphen is represented by the “-” character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;).

User agents (browsers) can break the line wherever a soft hyphen is found. So if we have a JavaScript-based implementation which inserts soft hyphens inside words based on language-specific rules, we can achieve hyphenation in web pages too.
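To make the idea concrete, here is a minimal sketch in Python (not the JavaScript that Hyphenator uses) of what inserting soft hyphens means. The function names are made up for illustration, and the split points are supplied by hand rather than computed from language rules.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, written &shy; in HTML


def add_soft_hyphens(pieces):
    """Join the pieces of one word with invisible soft hyphens."""
    return SOFT_HYPHEN.join(pieces)


def to_html(text):
    """Spell each soft hyphen out as the &shy; entity so it is visible in source."""
    return text.replace(SOFT_HYPHEN, "&shy;")


def strip_soft_hyphens(text):
    """Recover the original word, e.g. before searching or spell checking."""
    return text.replace(SOFT_HYPHEN, "")


word = add_soft_hyphens(["hy", "phen", "ation"])
print(to_html(word))             # hy&shy;phen&shy;ation
print(strip_soft_hyphens(word))  # hyphenation
```

The browser only shows a hyphen at the position where it actually breaks the line; the other soft hyphens stay invisible.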

Hyphenator is a project which does exactly that. “Hyphenator.js brings client-side hyphenation of HTML-Documents on to every browser by inserting soft hyphens using hyphenation patterns and Frank M. Liangs hyphenation algorithm commonly known from LaTeX and Openoffice.”

Hyphenator had not been tested with any non-Latin languages so far. I tried adding support for Indian languages and the result was satisfactory. I used the same rules I defined for OpenOffice. Unlike Latin-script languages, Indian languages need very few hyphenation patterns, and performance is good because of that.

I have added Malayalam, Tamil, Hindi, Oriya, Kannada, Telugu, Bengali, Gujarati and Panjabi support to it. You can see a working example here. (I wanted to embed an example here, but LiveJournal does not allow JavaScript inside the blog body.) The column layout is done with CSS. Try resizing the browser window, and try a print preview too.

Don’t forget to read the source code of that page. It is very simple. If you want hyphenation in your web page, all you need to do is include the JavaScript as done in the example. We need to provide lang attributes for the nodes so that the required patterns for that language can be loaded. I have placed the new language patterns temporarily in the download area of SMC. I will ask the author of Hyphenator to include them upstream. Code is available here.


Update (18-Dec-2008): Thanks to Mathias Nater, author of Hyphenator, the patterns were added upstream.

Hyphenation of Indian Languages and Openoffice

What is Hyphenation?

Hyphenation is the process of inserting hyphens between the syllables of a word so that, when the text is justified, the available space is used as fully as possible.

Hyphenation is an important feature that DTP software provides. For Indian languages there is no good DTP software available. XeTeX is the only choice for working with Unicode and professional-quality page layout, but XeTeX and DTP are not exactly the same thing. Inkscape can be used as a temporary solution, but only for small-scale work. There is a project going on to add a HarfBuzz backend to Scribus, the freedomware DTP package.

Hyphenation is also required in many other places. In fact it is required wherever we ‘justify’ a block of text in OpenOffice or any other word processor. The same is true of web pages. Let us see what happens now if we justify a block of text in ml_IN:

Note the long gaps between words. This is a screenshot taken from Firefox. By default, lines are broken only at space characters, and there is no doubt that this makes pages ugly. The problem becomes worse when the words are long and the columns are narrow.

So what is the solution?

Ideal solution: applications should be aware of the language and its hyphenation rules, and should do the hyphenation wherever required.

OpenOffice can take hyphenation dictionaries just as it takes spell-checking dictionaries. But for Indian languages, we are yet to prepare hyphenation dictionaries (more on that later). The W3C CSS3 draft has a provision for hyphenation, but it is still in the draft stage.

Algorithm for Hyphenation

The basis of these hyphenation algorithms is the one designed by Frank Liang in 1983 and adopted in TeX. The Wikipedia article on TeX explains it with a very simple example:

If TeX must find the acceptable hyphenation positions in the word encyclopedia, for example, it will consider all the subwords of the extended word .encyclopedia., where . is a special marker to indicate the beginning or end of the word. The list of subwords include all the subwords of length 1 (., e, n, c, y, etc), of length 2 (.e, en, nc, etc), etc, up to the subword of length 14, which is the word itself, including the markers. TeX will then look into its list of hyphenation patterns, and find subwords for which it has calculated the desirability of hyphenation at each position. In the case of our word, 11 such patterns can be matched, namely 1c4l4, 1cy, 1d4i3a, 4edi, e3dia, 2i1a, ope5d, 2p2ed, 3pedi, pedia4, y1c. For each position in the word, TeX will calculate the maximum value obtained among all matching pattern, yielding en1cy1c4l4o3p4e5d4i3a4. Finally, the acceptable positions are those indicated by an odd number, yielding the acceptable hyphenations en-cy-clo-pe-di-a. This system based on subwords allows the definition of very general patterns (such as 2i1a), with low indicative numbers (either odd or even), which can then be superseded by more specific patterns (such as 1d4i3a) if necessary. These patterns find about 90% of the hyphens in the original dictionary; more importantly, they do not insert any spurious hyphen. In addition, a list of exceptions (words for which the patterns do not predict the correct hyphenation) are included with the Plain TeX format; additional ones can be specified by the user.
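The walk-through quoted above can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: it uses just the 11 patterns listed in the quote, searches them with plain substring matching instead of the trie that real implementations use, and ignores the minimum prefix and suffix lengths that TeX enforces. The function names are my own.

```python
DIGITS = "0123456789"

# The 11 patterns quoted above for the word "encyclopedia".
PATTERNS = ["1c4l4", "1cy", "1d4i3a", "4edi", "e3dia", "2i1a",
            "ope5d", "2p2ed", "3pedi", "pedia4", "y1c"]


def parse_pattern(pattern):
    """Split '1d4i3a' into its letters 'dia' and the gap scores [1, 4, 3, 0]."""
    letters = ""
    scores = [0]
    for ch in pattern:
        if ch in DIGITS:
            scores[-1] = int(ch)   # score of the gap before the next letter
        else:
            letters += ch
            scores.append(0)       # default score of the gap after this letter
    return letters, scores


def hyphenate(word, patterns):
    """Return the pieces of word, split at every gap whose best score is odd."""
    extended = "." + word + "."
    # levels[i] is the highest score seen for the gap before extended[i]
    levels = [0] * (len(extended) + 1)
    for pattern in patterns:
        letters, scores = parse_pattern(pattern)
        start = extended.find(letters)
        while start != -1:
            for offset, score in enumerate(scores):
                levels[start + offset] = max(levels[start + offset], score)
            start = extended.find(letters, start + 1)
    pieces = []
    current = ""
    for i, ch in enumerate(word):
        # word[i] is extended[i + 1], so its preceding gap is levels[i + 1]
        if i > 0 and levels[i + 1] % 2 == 1:
            pieces.append(current)
            current = ""
        current += ch
    pieces.append(current)
    return pieces


print("-".join(hyphenate("encyclopedia", PATTERNS)))  # en-cy-clo-pe-di-a
```

Taking the maximum at each gap is what lets a specific pattern such as 1d4i3a override a general one such as 2i1a.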

For more details about the algorithm used in OpenOffice, read this paper by Nemeth Laszlo.

Hyphenation in Indian languages

Unlike in English and many other languages, hyphenation in Indian languages is not very complex. In general, the rules are as follows:

  • [consonant][vowel][consonant] can be hyphenated as [consonant][vowel] – [consonant] if the vowel sign is not a virama (halant)
  • Don’t split a word after ZWJ
  • We can split a word after ZWNJ
  • plus any language-specific rules. For example, in ml_IN a line should not start with a chillu letter.

Hyphenation Dictionaries for Indian languages

Based on the above rules, let us try to create hyphenation dictionaries for Indian languages. I will explain this with the help of a Hindi example word: अनुपल्ब्ध.
We have to define the following rules in the dictionary for this:
अ1 -> 1 is an odd number, i.e. the word can be split after अ
ु1 -> 1 is an odd number, i.e. the word can be split after ु
1ल -> 1 is an odd number, i.e. the word can be split before ल
1प -> 1 is an odd number, i.e. the word can be split before प
1ब -> 1 is an odd number, i.e. the word can be split before ब
्2 -> 2 is an even number, i.e. the word can NOT be split after ्
1ध -> 1 is an odd number, i.e. the word can be split before ध
So the end result is अ+नु+प+ल्ब्ध
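The same result can be checked mechanically. The sketch below is an illustration only, not the OpenOffice implementation: it assumes every pattern is exactly one character plus one digit, as in the seven rules listed for this word, and the function names are made up.

```python
DIGITS = "0123456789"

WORD = "अनुपल्ब्ध"
PATTERNS = ["अ1", "ु1", "1ल", "1प", "1ब", "्2", "1ध"]


def gap_scores(word, patterns):
    """scores[i] is the highest score for the gap just before word[i]."""
    scores = [0] * (len(word) + 1)
    for pattern in patterns:
        digit_first = pattern[0] in DIGITS
        letter = pattern[1] if digit_first else pattern[0]
        value = int(pattern[0] if digit_first else pattern[1])
        for i, ch in enumerate(word):
            if ch == letter:
                # digit first: gap before the letter; digit last: gap after it
                gap = i if digit_first else i + 1
                scores[gap] = max(scores[gap], value)
    return scores


def split_word(word, patterns):
    """Split the word at every inner gap whose highest score is odd."""
    scores = gap_scores(word, patterns)
    pieces = []
    start = 0
    for gap in range(1, len(word)):
        if scores[gap] % 2 == 1:
            pieces.append(word[start:gap])
            start = gap
    pieces.append(word[start:])
    return pieces


print("+".join(split_word(WORD, PATTERNS)))
```

The gaps before ब and ध receive both a 1 (from 1ब and 1ध) and a 2 (from ्2); the higher, even value wins, which is why the conjunct ल्ब्ध stays in one piece.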

In the same way we can create hyphenation dictionaries for all the other languages. I have prepared hyphenation dictionaries for 8 Indian languages. Download them from the git repo of SMC.

How to Install an xx_IN hyphenation dictionary

  • Copy the hyphenation dictionary hyph_xx_IN to the /usr/share/myspell/dicts folder.
  • Create a file named openoffice.org-hyphenation-xx in the /usr/share/myspell/infos/ooo/ folder, with this one line as its content:
    HYPH xx IN hyph_xx_IN
  • Run this command: sudo update-openoffice-dicts

Open OpenOffice Writer, then open a file in your language or type some text. Justify the text. Set the language of the selection using the Tools->Language menu, then hyphenate it using the Tools->Language->Hyphenation menu.

Hope it works :). I have tested only Hindi and Malayalam. For other languages, let me know if you see any problems or if it does not work. Here is the hyphenated Malayalam paragraph. Compare it with the image I showed at the beginning of this blog post:

OK. So once these hyphenation dictionaries have been tested, submitted upstream and packaged, the hyphenation problem in OpenOffice is solved. 🙂

But how do we solve this problem in web pages? We will discuss that in the next blog post!

PS: Thanks to Nemeth Laszlo, author of Hunspell and OpenOffice hyphenation, for helping me to prepare the hyphenation tables.


Update (Apr 16, 2009): The hyphenation dictionaries were packaged for Fedora and will be part of Fedora 11.

Gedit plugin for showing unicode codepoints

While working with Unicode text, we often need the Unicode code points of the text for debugging. Using Python, it is very easy to get them. The following examples illustrate it.

>>> "സന്തോഷ്".decode("utf-8")
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'

or

>>> str=u"സന്തോഷ്"
>>> print repr(str)
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
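The examples above use Python 2 syntax. For readers on Python 3, where every str is already Unicode, a rough equivalent might look like the sketch below; the helper name codepoints is made up for illustration.

```python
def codepoints(text):
    """Return the code points of text in the usual U+XXXX notation."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


# The same Malayalam word as above, written with escapes so the
# snippet does not depend on the encoding of the source file.
text = "\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d"

print(codepoints(text))
# U+0D38 U+0D28 U+0D4D U+0D24 U+0D4B U+0D37 U+0D4D

# The \uXXXX form shown by the Python 2 repr is still available:
print(text.encode("unicode_escape").decode("ascii"))
# \u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d
```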

Well, but then we need to open a Python console, type or paste the text, and so on. How can we make it easier? What if pressing the F12 key after selecting some text gave the code points?
So I wrote a plugin for gedit. I never knew that writing a gedit plugin was so easy. This tutorial gives all the required information.
Download the plugin file and the Python module and place them in the .gnome2/gedit/plugins folder inside your home folder, then restart gedit. Enable the plugin from the Edit->Preferences->Plugins menu. Note that you need to enable the External Tools plugin too.

Select some text and press F12. If text is not selected, entire content of the document will be used.

Screensavers in your language

I had written a blog post about hacking the glmatrix screensaver with the glyphs of our languages.

Now I have those screensavers in the following languages:

Hindi : Deb Package , RPM

Gujarati : Deb Package , RPM

Bengali : Deb Package , RPM

Oriya: Deb Package , RPM

Tamil : Deb Package , RPM

Malayalam: Deb Package , RPM

Try it and enjoy !!
ps: I used the default fonts of Fedora 9 for these. If you have any specific font to be used please let me know. I used Dyuthi calligraphic font for Malayalam.

say_namaskaar.c

/* say_namaskaar.c
 *  This is a sample C code using dhvani text to speech API which I am 
 *  developing now and planning to release soon. New version of dhvani 
 *  will provide a shared library libdhvani and it allows other C or C++
 *  applications to use dhvani synthesizer. Tamil and Marathi modules, pitch, tempo 
 *  control etc are the features for the coming release.
 *  I need to prepare documentation, fix many bugs, test, commit the files in cvs ...
 *  Looking for some free time for all these...
 *  Visit http://dhvani.sourceforge.net
 */

/* compile with: gcc -o namaskaar say_namaskaar.c -ldhvani */
#include <dhvani/dhvani_lib.h>
int main(int argc, char *argv[]) {
    dhvani_options options;
    /* Set the pitch and tempo of the speech */
    options.tempo = -10.0; /* reduce the speed by 10% */
    options.pitch = 2.0;   /* increase the pitch by 2 semitones */
    options.rate = 16000;  /* 16 kHz sampling rate */
    /* Initialize dhvani */
    dhvani_init(&options);
    /* Say Namaskar */
    dhvani_say("नमस्कार",  &options);
    /* close the synthesizer */
    dhvani_close();
    return 0;
}
 
/*  We can write a blog post in C too :P . Syntax highlighted by Code2HTML */

Can’t Speak? Dhvani will speak for you!

Dhvani can help not only blind users but also users who cannot speak. I will explain how Dhvani can act as your mouth using KMouth.
KMouth is a KDE accessibility application that acts as a text-to-speech front end. It enables people who cannot speak to let their computers speak for them. It includes a history of spoken sentences from which the user can select sentences to be re-spoken. It learns the words the user writes and offers autocompletion. It also includes a phrasebook, in which you can store commonly used phrases for quick access.
Let us see how Dhvani can be used with KMouth.
Open KMouth: KMenu->Utilities->Accessibility->KMouth. Install it if it is not already installed.
You will get a configuration window; give the “Command to speak text” as dhvani %f

Done. Now you can type some text in KMouth and ask it to speak.

To avoid typing the words that are used often, create a phrasebook. Refer to the KMouth help document for that. You can also add a word list so that you get autocompletion while typing. Refer to the KMouth Handbook for that too. It is easy, just a matter of giving it some text file to learn from.
I hope it will be helpful for users who cannot speak, even though there are some practical problems, like having to keep the computer with them…

For more information about Dhvani, how to install it and so on, see the documentation.

Dhvani – KDE Integration.

It is possible to integrate the Dhvani Indian language TTS into the KDE desktop through KDE's TTS system, KTTS. With this, Dhvani can read out the text in Kate, KEdit, KWrite and Konqueror. You can even listen to the text of web pages in Konqueror.
Dhvani can be integrated into KTTS using its Command plugin feature. To do this, go to Control Center–>Regional and Accessibility–>Text-to-speech–>Talker tab and add a new synthesizer.


Select the synthesizer type as Command and the language as Other. You can select any language, since Dhvani does not need a language parameter; it detects the language automatically.
Give the synthesizer command as dhvani %f

Move this synthesizer to the top of the list of synthesizers and click Apply. Done.
Now open a UTF-8 text in any of the editors mentioned above, or open a web page in any of the supported languages. From the Tools menu choose Speak Text and listen!
For more information about Dhvani, how to install it and so on, see the documentation.

Creating audio books using Dhvani

Dhvani can be used for creating audiobooks in any of the supported languages (Hindi, Malayalam, Telugu, Kannada, Oriya, Bengali, Gujarati, Panjabi).
First of all, get the latest Dhvani source code from CVS on SourceForge. Compile and install it.
To create an audiobook, follow these steps.
You need the text in UTF-8 format. There is no need to specify the language; Dhvani will detect it automatically.

dhvani -o audiobook.wav textfile
oggenc -B 16 -C 1 -R 16000 audiobook.wav

Now you have a file called audiobook.ogg. If you prefer Ogg, your audiobook is ready. If you want the file in MP3 format:

oggdec audiobook.ogg

(This will create a file named audiobook.ogg.wav )
lame --preset 192 -ms -h audiobook.ogg.wav

(install lame if it is not present using your package manager)

Now your mp3 file is ready. Transfer it to your music player and enjoy!
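If you convert many books, the steps can be scripted. The sketch below only builds the same command lines shown above and prints them; the function names and the basename parameter are my own, and run_all assumes that dhvani, oggenc, oggdec and lame are installed and on the PATH.

```python
import shlex
import subprocess


def audiobook_commands(text_file, basename="audiobook", mp3=False):
    """Return the argument lists for the commands used in this post."""
    wav = basename + ".wav"
    ogg = basename + ".ogg"
    commands = [
        ["dhvani", "-o", wav, text_file],
        ["oggenc", "-B", "16", "-C", "1", "-R", "16000", wav],
    ]
    if mp3:
        # oggdec writes audiobook.ogg.wav, which lame then encodes
        commands.append(["oggdec", ogg])
        commands.append(["lame", "--preset", "192", "-ms", "-h", ogg + ".wav"])
    return commands


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Print the commands for review; call run_all(...) to actually execute them.
for argv in audiobook_commands("textfile", mp3=True):
    print(shlex.join(argv))
```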

I have a sample Malayalam Audio book here

Note: The speech produced for languages other than Hindi and Malayalam may not follow their pronunciation rules. There are two solutions for this:
a) Teach me that language 😉 or
b) Submit a patch to fix that language module

You can find the Dhvani documentation here

Hackers or Crackers?

When will these journalists understand the difference between a _Hacker_ and a Cracker?
See these news items:
1. ADMK website hacked again
2. Hacker who stole bank details held
3. Goa govt web site hacked by Turkish hacker

Dear journalists, Could you please find time to read these?
http://fci.wikia.com/wiki/IfYouAre#journalist
http://en.wikipedia.org/wiki/Hacker
http://fci.wikia.com/wiki/Hackers
Hackers are not Crackers