In my last blog post I explained hyphenation of Indian language text in OpenOffice. In this blog post I will explain how hyphenation can be done in web pages.
As I explained there, hyphenation becomes important when we justify the text. The length of the lines is controlled by the parent tags…. Unicode defines a special character for hyphenation called the soft hyphen (U+00AD), denoted in HTML by `&shy;`. In HTML, the plain hyphen is represented by the “-” character (`&#45;` or `&#x2D;`).
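As a rough sketch of the idea, the snippet below inserts soft hyphens into a word at given positions and then writes them out as the HTML entity. The function names and the break positions are made up for illustration; real break positions would come from a hyphenation dictionary for the language.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, written &shy; in HTML


def insert_soft_hyphens(word, break_positions):
    """Return word with a soft hyphen inserted before each given index."""
    pieces = []
    previous = 0
    for position in sorted(break_positions):
        pieces.append(word[previous:position])
        previous = position
    pieces.append(word[previous:])
    return SOFT_HYPHEN.join(pieces)


def to_html(text):
    """Replace the invisible soft hyphen with its visible HTML entity."""
    return text.replace(SOFT_HYPHEN, "&shy;")


# Hypothetical break positions, chosen by hand for this example.
print(to_html(insert_soft_hyphens("hyphenation", [2, 6])))
# hy&shy;phen&shy;ation
```

The browser shows a hyphen at one of these points only when it actually breaks the line there.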
[Read More]
Hyphenation of Indian Languages and Openoffice
What is Hyphenation?
Hyphenation is the process of inserting hyphens between the syllables of a word so that when the text is justified, maximum space is utilized.
Hyphenation is an important feature that DTP software provides. For Indian languages there is no good DTP software available. XeTeX is the only choice for working with Unicode and getting professional quality page layout. But XeTeX and DTP are not exactly the same. Inkscape can be used as a temporary solution.
[Read More]
Yahoo search bug
None of the search engines can handle Indian languages very well. Google removes the zero width joiners and non-joiners that are used in many languages. Yahoo does not remove them. But a UI bug in the web page makes the results wrong.
See the image below:
The bottom half of the image is the source code. We can clearly see that the closing bold tag is placed in the middle of the word instead of at the end of the word.
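To make the joiner problem concrete, here is a small sketch showing that a string with a zero width joiner and the same string with the joiner stripped are different text. The function name is my own, and this is not how any search engine is implemented; it only illustrates the effect of dropping these characters.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def strip_joiners(text):
    """Remove zero width joiners and non-joiners from text."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


# Malayalam NA + VIRAMA + ZWJ: the sequence used for the chillu "n"
# before Unicode 5.1 added atomic chillu characters.
with_joiner = "\u0d28\u0d4d" + ZWJ
without_joiner = strip_joiners(with_joiner)

print(len(with_joiner), len(without_joiner))  # 3 2
print(with_joiner == without_joiner)          # False
```

A query that loses its joiners no longer matches the word the user typed, and the two forms are rendered differently.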
[Read More]
KDE spellchecker not working for Indian Languages
As I mentioned in my blog post on language detection, KDE's Sonnet spellchecker is not working. I read the Sonnet code and found that it fails to determine the word boundaries in a sentence (or string buffer) and passes parts of words to backend spellcheckers like aspell or hunspell. As a result, every word is reported as wrong. This is the logic used in Sonnet to recognize the word boundaries:
[Read More]
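The Sonnet code itself is not reproduced in this excerpt. As an illustration of this kind of failure, and not a reconstruction of Sonnet's logic, the sketch below splits text into words in two ways: treating only letters as word characters, and treating letters, combining marks and joiners as word characters. The function names are hypothetical.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def is_word_char(ch):
    """Letters (L*) and combining marks (M*) belong to a word, as do joiners."""
    return unicodedata.category(ch)[0] in ("L", "M") or ch in JOINERS


def split_words(text, keep=is_word_char):
    """Split text into runs of characters for which keep(ch) is true."""
    words, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sample = "\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d"  # the word സന്തോഷ്

# Letters only: the virama and the vowel sign act as separators,
# so the word is cut into three pieces.
print(split_words(sample, keep=str.isalpha))

# Marks and joiners kept: one whole word.
print(split_words(sample))
```

Passing the pieces from the first call to a backend spellchecker would flag a correctly spelled word, which matches the symptom described above.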
Youtube to MPEG or Ogg video conversion
Here is a two-line method to convert a YouTube video to an Ogg Theora video.
Locate clive and ffmpeg2theora in your package manager and install them.
$clive http://in.youtube.com/watch?v=6JeZ5oeAEyU (replace this with the YouTube address you want)
It will create a flv file.
Convert to mpeg video file
$ffmpeg -i AmericaAmerica.flv AmericaAmerica.mpg
Convert to ogg video file
$ffmpeg2theora AmericaAmerica.mpg (replace the file name with that of the file the previous command created)
Done. You can see the .
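If you do this often, the two conversion steps can be wrapped in a small script. This is only a sketch: the function names are my own, it assumes ffmpeg and ffmpeg2theora are on your PATH, and it uses the same arguments as the commands above.

```python
import subprocess
from pathlib import Path


def conversion_commands(flv_path):
    """Build the ffmpeg and ffmpeg2theora command lines for one flv file."""
    flv = Path(flv_path)
    mpg = flv.with_suffix(".mpg")
    return [
        ["ffmpeg", "-i", str(flv), str(mpg)],  # flv -> mpeg
        ["ffmpeg2theora", str(mpg)],           # mpeg -> ogg
    ]


def convert(flv_path):
    """Run both conversion steps, stopping if either tool fails."""
    for command in conversion_commands(flv_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call convert() to run.
    for command in conversion_commands("AmericaAmerica.flv"):
        print(" ".join(command))
```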
[Read More]
Dhvani 0.94 Released
A new version of Dhvani, the Indian Language Text to Speech System, is available now. The new version comes with the following improvements/features:
Support for 11 languages: Hindi, Panjabi, Gujarati, Marathi, Bengali, Oriya, Telugu, Kannada, Tamil, Malayalam and Pashto (Afghanistan)
Pitch and Tempo modification for speech
Direct ogg-vorbis speech output and optional wav output format
C/C++ APIs for applications to use dhvani as a shared library
Generic driver for Speech-dispatcher and integration with Orca through speech dispatcher
Python binding through speech dispatcher
Improved language detection algorithm
Dhvani documentation is available here.
[Read More]
Language Detection and Spellcheckers
A few weeks back there was a discussion on the #indlinux IRC channel about automatic language detection. The idea is that spellcheckers or any other language tools should not ask the users to select a language. Instead, they should detect the language automatically. The idea is not new. There is a KDE bug here and Ubuntu has this as a brainstorm idea. It seems M$ Word already has this.
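For Indian languages, a cheap first step is to look at which Unicode block the characters of a word come from. The sketch below does only that; it is an assumption of mine about how one might start, not the algorithm of any existing tool. Note that it detects the script, not the language: Devanagari, for example, is shared by Hindi and Marathi.

```python
# Unicode block ranges (inclusive) for a few Indic scripts.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def detect_script(word):
    """Return the script most characters of word belong to, or None."""
    counts = {}
    for ch in word:
        if ch.isascii() and ch.isalpha():
            counts["Latin"] = counts.get("Latin", 0) + 1
            continue
        code = ord(ch)
        for script, (start, end) in SCRIPT_RANGES.items():
            if start <= code <= end:
                counts[script] = counts.get(script, 0) + 1
                break
    if not counts:
        return None
    return max(counts, key=counts.get)


# "hello" followed by the Hindi word namaste, written with escapes.
words = ["hello", "\u0928\u092e\u0938\u094d\u0924\u0947"]
print([detect_script(word) for word in words])
# ['Latin', 'Devanagari']
```

A spellchecker could use the result per word to pick a dictionary, so a document mixing English and Hindi needs no manual language switch.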
A sample use case can be this: “While preparing a document in OpenOffice, I want to write in English as well as in Hindi.
[Read More]
Gedit plugin for showing unicode codepoints
While working with Unicode text, it is often necessary to get the Unicode code points of the text for debugging. Using Python, it is very easy to get the code points. The following examples illustrate it.
`
>>> "സന്തോഷ്".decode("utf-8")  # Python 2: decode a UTF-8 byte string
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
`
or
`
>>> text = u"സന്തോഷ്"  # Python 2; named text to avoid shadowing the built-in str
>>> print repr(text)
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
`
But for this we need to open a Python console, type or paste the text, and so on. How can we make it easier? What if pressing the F12 key after selecting some text gave the code points?
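Whatever the editor integration looks like, its core can be a small function that turns the selected text into a list of code points. Here is a minimal sketch in Python 3; the function name is my own and the gedit wiring is not shown.

```python
def codepoints(text):
    """Return the code points of text in U+XXXX notation, space separated."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


# The same word as in the examples above, written with escapes.
print(codepoints("\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d"))
# U+0D38 U+0D28 U+0D4D U+0D24 U+0D4B U+0D37 U+0D4D
```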
[Read More]
Screensavers in your language
I had written a blog post about hacking the glmatrix screensaver with the glyphs of our languages.
Now I have those screensavers in the following languages:
Hindi : Deb Package , RPM
Gujarati : Deb Package , RPM
Bengali : Deb Package , RPM
Oriya: Deb Package , RPM
Tamil : Deb Package , RPM
Malayalam: Deb Package , RPM
Try it and enjoy !!
ps: I used the default fonts of Fedora 9 for these.
[Read More]
Swanalekha M17N based Input Method for 11 Languages
Swanalekha is an input method originally designed for Malayalam. It works with SCIM as well as m17n. The input method scheme is transliteration based and it has a unique feature, the candidate list menu (which I will explain shortly). Now I have extended it to 10 other Indian languages.
Before explaining how Swanalekha is different from other phonetic/transliteration based input methods, let me explain some of the characteristics of transliteration. Transliteration based input methods have followed a strict one-to-one mapping from English letters to the letters of an Indian language.
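A strict one-to-one scheme of this kind can be sketched as a lookup table with greedy longest-match. The table below is a tiny hypothetical one made up for this example; it is not Swanalekha's scheme or that of any real input method.

```python
# Hypothetical mapping table for illustration only.
MAPPING = {
    "a": "\u0d05",    # അ
    "aa": "\u0d06",   # ആ
    "ka": "\u0d15",   # ക
    "kha": "\u0d16",  # ഖ
    "ma": "\u0d2e",   # മ
    "la": "\u0d32",   # ല
}
MAX_KEY = max(len(key) for key in MAPPING)


def transliterate(latin):
    """Map Latin key sequences to output letters, longest match first."""
    output = []
    i = 0
    while i < len(latin):
        for size in range(min(MAX_KEY, len(latin) - i), 0, -1):
            chunk = latin[i:i + size]
            if chunk in MAPPING:
                output.append(MAPPING[chunk])
                i += size
                break
        else:
            output.append(latin[i])  # pass unmapped characters through
            i += 1
    return "".join(output)


print(transliterate("mala"))  # മല
print(transliterate("aama"))  # ആമ
```

Each key sequence has exactly one output, so the user must remember the exact spelling the table expects; there is no way to choose between alternatives.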
[Read More]