GSOC 2014 – Mentoring for SMC
Posted on May 25, 2014
| Santhosh Thottingal
I am a mentor for Google Summer of Code 2014 for SMC. I will be helping Praveen Sridhar port input methods from jquery.ime to Firefox OS.
We have started the project, and Praveen already has a proof of concept ready.
Tim Chien and Rudy Lu from Mozilla are co-mentoring the same project.
Meera Tamil font in Ubuntu Trusty Tahr
Posted on April 13, 2014
| Santhosh Thottingal
Ubuntu Trusty Tahr is going to be released on April 17th 2014.
Meera Tamil, a freely licensed Unicode font for Tamil, will be available in this release.
The font is already available in Debian. In both Ubuntu and Debian you can install the font with:
sudo apt-get install fonts-meera-taml
Thanks to Vasudev for packaging it for Debian.
Collaboratively edited documentation for Indic font developers
Posted on January 11, 2014
| Santhosh Thottingal
Fonts are one of the integral building blocks for providing multilingual support for digital content. Today, OpenType fonts are the preferred choice. With the increasing need to support languages beyond the Latin script, the TrueType font specification was extended to include elements for more elaborate writing systems. This effort was jointly undertaken in the 1990s by Microsoft and Adobe. The outcome was the OpenType Specification, a successor to the TrueType font specification.
[Read More]
Hyphenation in web
Posted on March 17, 2013
| Santhosh Thottingal
This is a follow-up to a four-year-old blog post about hyphenation. Hyphenation allows the controlled splitting of words to improve the layout of paragraphs, typically splitting words at syllabic or morphemic boundaries and visually indicating the split (usually with a hyphen).
In that post I wrote about how a webpage can use the Hyphenator JavaScript library to hyphenate text styled with 'justify'. Combined with the hyphenation rules I wrote for many Indian languages, this solution works, and some websites already use it.
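As a rough illustration of what hyphenating text for the web amounts to, the sketch below inserts soft hyphens (U+00AD) at given positions in a word, so that a browser may break the word there when laying out justified text. The function name and the hand-picked break points are assumptions made for this example; a real library derives the break points from language-specific hyphenation patterns.

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless the line is actually broken here


def insert_soft_hyphens(word, break_points):
    """Return `word` with a soft hyphen inserted before each index in `break_points`.

    `break_points` is a hypothetical, hand-supplied list of character
    offsets; it stands in for the output of a pattern-based hyphenator.
    """
    parts = []
    previous = 0
    for point in sorted(break_points):
        parts.append(word[previous:point])
        previous = point
    parts.append(word[previous:])
    return SOFT_HYPHEN.join(parts)


# "hyphenation" split as hy-phen-a-tion
hyphenated = insert_soft_hyphens("hyphenation", [2, 6, 7])
print(hyphenated.replace(SOFT_HYPHEN, "-"))
```

The soft hyphens stay invisible in the rendered page until the browser chooses one of them as a line-break opportunity.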
[Read More]
Malayalam Wikisource Offline version
Posted on June 11, 2011
| Santhosh Thottingal
The Malayalam Wikisource community today released the first offline version of Malayalam Wikisource during the 4th annual wiki meetup of Malayalam Wikimedians. To the best of our knowledge, this is the first time a Wikisource project has released an offline version. The Malayalam wiki community had released the first version of Malayalam Wikipedia one year earlier.
Releasing the offline version of a Wikisource is a challenging project. I designed and implemented the technical aspects of the project.
[Read More]
Mediawiki Berlin hackathon
Posted on May 17, 2011
| Santhosh Thottingal
I am just back from the MediaWiki Berlin Hackathon. From May 13 to 15, MediaWiki developers attended the hackathon, squashed many bugs, and discussed many features. Members of the language committee had their first real-life meeting in parallel with the hackathon. It was a nice event; I learned a lot and talked to many awesome hackers and linguists.
Milos Rancic has written a summary of the discussions that took place during the language committee meeting here: http://lists.
[Read More]
Creating a new Language ecosystem- Sourashtra as example
Posted on May 7, 2011
| Santhosh Thottingal
Sourashtra is a language spoken by the Sourashtra people living in South Tamil Nadu and Gujarat in India. Originating from Brahmi and then Grandha, this language is the mother tongue of half a million people. But most of them are not familiar with the script of this language, and very few people can read and write the Sourashtra script. Sourashtra has an ISO 639-3 language code, saz, and the Unicode range U+A880 – U+A8DF.
Recently, a Sourashtra Wikipedia project was started in the Wikimedia Incubator: http://incubator.
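To make the Unicode range concrete, here is a small sketch that tests whether a character falls in U+A880 to U+A8DF and lists the assigned characters of that block using Python's standard `unicodedata` module. The helper names are assumptions made for this example. Note that the Unicode Standard spells the script name "Saurashtra".

```python
import unicodedata

# Block boundaries as quoted in the text above.
SAURASHTRA_START = 0xA880
SAURASHTRA_END = 0xA8DF


def is_saurashtra(ch):
    """Return True if the single character `ch` lies in the Saurashtra block."""
    return SAURASHTRA_START <= ord(ch) <= SAURASHTRA_END


def saurashtra_characters():
    """Return the assigned characters of the block (unassigned code points have no name)."""
    return [
        chr(cp)
        for cp in range(SAURASHTRA_START, SAURASHTRA_END + 1)
        if unicodedata.name(chr(cp), "")
    ]


for ch in saurashtra_characters()[:5]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

A check like this is one way a text tool could detect Sourashtra script content before choosing a font or input method.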
[Read More]
Cross Language Approximate Search on Indic Languages- A demo
Posted on April 3, 2011
| Santhosh Thottingal
A demo of cross language approximate search in Indic text:
The Malayalam word സാമ്പാര് is compared against a paragraph from http://ml.wikipedia.org/wiki/Sambar.
In the bottom half, the words marked in yellow are the search results.
You can see that the Kannada word ಸಾಂಬಾರ್ is matched for the Malayalam word. That is why this is called cross-language.
The inflections of the words സാമ്പാര് – സാമ്പാറും, സാമ്പാറു etc are also found as results.
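The following is only a minimal sketch of how a cross-script approximate match could work. It is not the algorithm behind the demo. It assumes that the Unicode blocks for the major Indic scripts (U+0900 to U+0D7F) share a parallel layout, so that a character's offset within its 128-code-point block can serve as a script-neutral key, and it uses `difflib.SequenceMatcher` from the standard library for the fuzzy comparison. The function names and the 0.6 threshold are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def script_neutral(text):
    """Map each character to a key that ignores which Indic block it comes from."""
    keys = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x0D7F:
            # Offset inside the 128-code-point block (Devanagari .. Malayalam).
            keys.append(("indic", (cp - 0x0900) % 0x80))
        else:
            keys.append(("other", cp))
    return keys


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means identical script-neutral sequences."""
    return SequenceMatcher(None, script_neutral(a), script_neutral(b)).ratio()


def approximate_search(query, text, threshold=0.6):
    """Return the whitespace-separated words of `text` that resemble `query`."""
    return [word for word in text.split() if similarity(query, word) >= threshold]


# The Malayalam and Kannada spellings of "sambar", written with escapes.
malayalam = "\u0d38\u0d3e\u0d2e\u0d4d\u0d2a\u0d3e\u0d30\u0d4d"
kannada = "\u0cb8\u0cbe\u0c82\u0cac\u0cbe\u0cb0\u0ccd"
print(round(similarity(malayalam, kannada), 2))
print(approximate_search(malayalam, "rice and " + kannada))
```

Because the comparison is approximate, inflected forms that share most of their characters with the query can also score above the threshold.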
[Read More]
Tamil Collation in GLIBC
Posted on February 26, 2011
| Santhosh Thottingal
A few months back, we started fixing the collation rules for Indian languages in the GNU C Library. Pravin Satpute prepared patches for many languages, and I prepared patches for Malayalam and Tamil. Later, Pravin enhanced the Tamil patch.
You can read the rules used for Malayalam collation here [PDF document]. The Tamil patch was applied upstream, but the bug is still open since there is some confusion about the results.
Before reading the discussion below, please read the discussion in the bug report: [ta_IN] Tamil collation rules are not working in other locales
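One way to observe the effect of a locale's collation rules is to sort a word list through the C library from Python. The sketch below uses `locale.setlocale` and `locale.strxfrm` from the standard library. The function name is an assumption for this example, and the `ta_IN.UTF-8` locale must be installed on the system; when it is not, the sketch falls back to plain code point order.

```python
import locale


def sort_with_locale(words, locale_name="ta_IN.UTF-8"):
    """Sort `words` using the collation rules of `locale_name`.

    Falls back to code point order when the locale is not installed.
    Changing the locale affects the whole process, so the previous
    setting is restored before returning.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# Three Tamil letters: KA, A, MA (written with escapes).
print(sort_with_locale(["\u0b95", "\u0b85", "\u0bae"]))
```

Comparing the output under different locale names is a quick way to see whether a given system's collation data orders the letters as expected.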
[Read More]
Identifiers In Indic Languages
Posted on January 8, 2011
| Santhosh Thottingal
Recently, while preparing a critique of the IDN policy for the Malayalam language prepared by CDAC, I noticed that ICANN does not allow control characters in domain names. Some time back, I noticed that Python 3 also does not allow control characters in identifiers. This blog post attempts to analyze the issue by looking at the Unicode and ICANN specifications for these special characters.
Apart from the regular characters, the Zero Width Joiner and Zero Width Non-Joiner are widely used in Indic languages to control how ligatures are formed.
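The behaviour can be checked directly in Python. This sketch prints the Unicode name and general category of the two joiners and asks `str.isidentifier()` whether a name containing each one is accepted. The helper names are assumptions for this example, and no expected output is shown for the identifier check because the answer depends on the Unicode data bundled with the Python version in use.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def describe(ch):
    """Return the Unicode name and general category of a character."""
    return unicodedata.name(ch), unicodedata.category(ch)


def identifier_report(base="ab"):
    """Report whether `base` with a joiner after its first character is a valid identifier."""
    results = {}
    for label, ch in (("ZWNJ", ZWNJ), ("ZWJ", ZWJ)):
        candidate = base[0] + ch + base[1:]
        results[label] = candidate.isidentifier()
    return results


for ch in (ZWNJ, ZWJ):
    print(describe(ch))  # both are format characters, category "Cf"
print(identifier_report())
```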
[Read More]