<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Santhosh Thottingal &#187; Indic</title>
	<atom:link href="http://thottingal.in/blog/category/indic/feed/" rel="self" type="application/rss+xml" />
	<link>http://thottingal.in/blog</link>
	<description>/home/santhosh</description>
	<lastBuildDate>Mon, 14 Nov 2011 06:06:25 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Malayalam Wikisource Offline version</title>
		<link>http://thottingal.in/blog/2011/06/11/malayalam-wikisource-offline-version/</link>
		<comments>http://thottingal.in/blog/2011/06/11/malayalam-wikisource-offline-version/#comments</comments>
		<pubDate>Sat, 11 Jun 2011 09:11:38 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Indic]]></category>
		<category><![CDATA[Malayalam]]></category>
		<category><![CDATA[Misc]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=358</guid>
		<description><![CDATA[Malayalam Wikisource community today released the first offline version of Malayalam wikisource during the 4th annual wiki meetup of Malayalam wikimedians. To the  best of our knowledge, this is the first time a wikisource project release its offline version. Malayalam wiki community had released the first version of Malayalam wikipedia one year back. Releasing the [...]]]></description>
			<content:encoded><![CDATA[<p>The Malayalam Wikisource community today released the first offline version of <a href="http://ml.wikisource.org" target="_blank">Malayalam Wikisource</a> during the 4th annual wiki meetup of Malayalam Wikimedians. To the best of our knowledge, this is the first time a Wikisource project has released an offline version. The Malayalam wiki community <a href="http://thottingal.in/blog/2010/04/17/mlwikioncd/" target="_blank">had released</a> the first offline version of Malayalam Wikipedia one year ago.</p>
<p>Releasing the offline version of a Wikisource is a challenging project. I designed and implemented the technical aspects of the project, so let me share the details.</p>
<p>A Wikisource contains a lot of books. Each book varies in size and is divided into chapters or sections, and there is no common pattern: each book has its own structure. The presentation of a novel is different from that of a collection of poems by a poet. Wikisource also has religious books like the Bible, Quran, Bhagavat Geeta and Ramayana. Since books are read continuously for a long time, readability and the way we present lengthy chapters on screen also matter. Offline Wikipedia tools such as <a href="http://www.kiwix.org/" target="_blank">Kiwix</a> do not modify the layout of the content and present it as it is shown in Wikipedia or Wikisource. <a href="https://github.com/santhoshtr/wiki2cd" target="_blank">The tool</a> we wrote last year for the Malayalam Wikipedia offline version also presents vertically scrollable content on the screen. Neither can be configured to give different presentation styles depending on the nature of the book.</p>
<p>What we wanted was a book-reader kind of application interface. Readers should be able to navigate easily between books and chapters. The chapter content can be very lengthy, and for reading it over a long time, a long vertically scrolled text is not a good idea. We also need to take care of the line width. If each line spans 80-90% of the screen, especially on a widescreen monitor, it strains the neck and eyes.</p>
<div id="attachment_361" class="wp-caption aligncenter" style="width: 405px"><a href="http://thottingal.in/blog/wp-content/uploads/2011/06/2011-06-09-19-29-211.png"><img class="size-large wp-image-361" title="2011-06-09-19-29-21" src="http://thottingal.in/blog/wp-content/uploads/2011/06/2011-06-09-19-29-211-1024x455.png" alt="" width="395" height="175" /></a><p class="wp-caption-text">Screenshot of Offline version. Click to enlarge</p></div>
<p>The books for the offline version were selected by the active Wikimedians at Wikisource. Some of the selected books were proofread by many volunteers within the last 2 weeks.</p>
<p>The tools used for extracting the HTML were ad hoc and adapted to give a good presentation of each book, so there is not much to reuse here. Basically, our scripts fetched the HTML, took the content part alone using pyquery and removed some unwanted sections from it. The content was then added to predefined HTML templates with proper CSS for the UI. The CSS3 multi-column feature was used for the book-like interface. Since IE did not implement this standard even in IE9, the book-like interface was not provided for that browser. Chrome versions older than 12 could not be supported because of these bugs: <a href="http://code.google.com/p/chromium/issues/detail?id=45840">http://code.google.com/p/chromium/issues/detail?id=45840</a> and <a href="http://code.google.com/p/chromium/issues/detail?id=78155">http://code.google.com/p/chromium/issues/detail?id=78155</a>. For easy navigation, mouse wheel support and page navigation buttons are provided. To solve the non-availability of required fonts, webfonts were integrated, with a selection box to choose a favorite font. Readers can also select the font size to make reading comfortable.</p>
<p>Why static HTML? The variety of platforms and versions we need to support, the need for webfonts, complex script rendering, the effort needed to develop and customize a UI, the relatively small size of the data, and the wish to avoid installing any software on the user&#8217;s system made us choose static HTML + jQuery + CSS as the technology. The downside is that we could not provide full text search.</p>
<p>Apart from the Wikisource content, we also included a collection of copyleft images from Wikimedia Commons. Thanks to <a href="http://nishan-naseer.blogspot.com/" target="_blank">Nishan Naseer</a> for preparing a gallery application using jQuery. We selected 4 categories from Commons that are related to Kerala. We hope everybody will like the pictures and that they will give a small introduction to Wikimedia Commons.<br />
<a href="http://thottingal.in/blog/wp-content/uploads/2011/06/2011-06-11-09-22-06.png"><img class="aligncenter size-large wp-image-364" title="2011-06-11 09-22-06" src="http://thottingal.in/blog/wp-content/uploads/2011/06/2011-06-11-09-22-06-1024x474.png" alt="" width="453" height="209" /></a><br />
Even though the Python scripts are not ready for reuse in other projects, if anybody wants to have a look at them, please mail me. I am not making them public since the scripts do not make sense outside the context of each book and its existing presentation in Malayalam Wikisource.</p>
<p>The CD image is available for download <a href="http://www.mlwiki.in/cdimage/mlwikisource.iso" target="_blank">here</a> and one can also browse the CD content <a href="http://www.mlwiki.in/wikisrccd" target="_blank">here</a>.</p>
<p>Thanks to Shiju Alex for coordinating this project, and thanks to all the Malayalam Wikisource volunteers for making this happen. We have included poems, folk songs, devotional songs, novel, grammar book, tales, and books on Hinduism, Islam, Christianity, Communism and philosophy. With this release, it becomes the biggest offline digital archive of Malayalam books.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2011/06/11/malayalam-wikisource-offline-version/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Mediawiki Berlin hackathon</title>
		<link>http://thottingal.in/blog/2011/05/17/mediawiki-berlin-hackathon/</link>
		<comments>http://thottingal.in/blog/2011/05/17/mediawiki-berlin-hackathon/#comments</comments>
		<pubDate>Tue, 17 May 2011 16:16:36 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=353</guid>
		<description><![CDATA[I am just back from Mediawiki Berlin Hackathon. On May 13 to 15, Mediawiki developers attended the hackathon and squashed many bugs and discussed many features. Members of language committee had its first real-life meeting in parallel with hackathon. It was a nice event, learned a lot, talked to many awesome hackers and linguists. Milos [...]]]></description>
			<content:encoded><![CDATA[<p>I am just back from the <a href="http://www.mediawiki.org/wiki/Berlin_Hackathon_2011">MediaWiki Berlin Hackathon</a>. <a href="http://commons.wikimedia.org/wiki/File:Wikimedia_Hackathon_Berlin_2011_group_photo.jpg"><img class="size-medium wp-image-4184 alignright" title="Group photo at the Berlin Hackathon 2011" src="http://blog.wikimedia.org/wp-content/uploads/2011/05/Wikimedia_Hackathon_Berlin_2011_group_photo-300x143.jpg" alt="" width="300" height="143" /></a>From May 13 to 15, MediaWiki developers attended the hackathon, squashed many bugs and discussed many features. The members of the <a href="http://meta.wikimedia.org/wiki/Language_committee">language committee</a> had their first real-life meeting in parallel with the hackathon. It was a nice event: I learned a lot and talked to many awesome hackers and linguists.</p>
<ul>
<li><a title="User:Millosh" href="http://meta.wikimedia.org/wiki/User:Millosh">Milos Rancic</a> has written a summary of the discussions that took place during the language committee meeting: <a href="http://lists.wikimedia.org/pipermail/foundation-l/2011-May/065537.html">http://lists.wikimedia.org/pipermail/foundation-l/2011-May/065537.html</a></li>
<li><a href="http://translatewiki.net/wiki/User:Nike">Niklas Laxström</a> and <a href="http://en.wikipedia.org/wiki/User:Siebrand">Siebrand</a> reviewed the <a href="http://www.mediawiki.org/wiki/Extension:WebFonts">WebFonts extension</a> and enabled it at <a href="http://translatewiki.net/">translatewiki.net</a>. I fixed a few bugs that Niklas reported in the extension.</li>
<li><a href="http://en.wikipedia.org/wiki/User:Purodha">Purodha Blissenbach</a> was very much interested in the WebFonts and Narayam extensions, and we discussed some of the features we need to add. They are tracked here: <a href="https://bugzilla.wikimedia.org/show_bug.cgi?id=28900">Bug 28900</a>, <a href="https://bugzilla.wikimedia.org/show_bug.cgi?id=28999">Bug 28999</a> and <a href="https://bugzilla.wikimedia.org/show_bug.cgi?id=29000">Bug 29000</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2011/05/17/mediawiki-berlin-hackathon/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating a new Language ecosystem- Sourashtra as example</title>
		<link>http://thottingal.in/blog/2011/05/07/language-ecosystem-sourashtra/</link>
		<comments>http://thottingal.in/blog/2011/05/07/language-ecosystem-sourashtra/#comments</comments>
		<pubDate>Sat, 07 May 2011 06:31:39 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[fonts]]></category>
		<category><![CDATA[glibc]]></category>
		<category><![CDATA[sourashtra]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=347</guid>
		<description><![CDATA[Sourashtra is a language spoken by Sourashtra  people living in South Tamilnadu and Gujarat of India. Originated from Brahmi and then Grandha, this language is mother tongue for half a million people. But most of them are not familiar with the script of this language. Very few people knows reading and writing on Sourashtra script. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Saurashtra_language" target="_blank">Sourashtra</a> is a language spoken by the Sourashtra people living in South Tamilnadu and Gujarat in India. Written in a script that originated from Brahmi and then Grantha, it is the mother tongue of half a million people. But most of them are not familiar with <a href="http://en.wikipedia.org/wiki/Saurashtra_script" target="_blank">the script</a> of this language. Very few people can read and write the Sourashtra script. Sourashtra has the ISO 639-3 language code saz and the Unicode range U+A880 &#8211; U+A8DF.</p>
<p>Recently a Sourashtra Wikipedia project was started in the Wikimedia Incubator: <a href="http://incubator.wikimedia.org/wiki/Wp/saz" target="_blank">http://incubator.wikimedia.org/wiki/Wp/saz</a>, and MediaWiki localization <a href="http://ultimategerardm.blogspot.com/2011/03/saurashtra-language-from-india-new-to.html" target="_blank">started in translatewiki</a>. Since the language did not have any proper fonts or input tools, this was not going well.</p>
<p>When we add support for a new language in MediaWiki or start a Wikipedia in a new language, we need to develop the language technology ecosystem to support its growth. This ecosystem comprises Unicode code points for the script, proper fonts, rendering support, input tools, and the availability of these fonts and input tools in operating systems, or alternate ways to get them working there.</p>
<p>The Sourashtra language had a Unicode font developed by <a href="http://www.khenikeri.com/" target="_blank">Prabu M Rengachari</a>, itself named &#8216;Sourashtra&#8217;. The font <a href="http://khenikeri.blogspot.com/2011/01/test-sourashtra-unicode-font-versions.html" target="_blank">had problems</a> with browsers and operating systems. We fixed it so that it works properly. The font was also not licensed properly. Prabu agreed to release it under the <a href="http://www.gnu.org/licenses/gpl-3.0.txt" target="_blank">GNU GPLv3</a> license with the <a href="http://www.gnu.org/licenses/gpl-faq.html#FontException" target="_blank">font exception</a>. He also agreed to rename the font to something other than the script name itself.</p>
<p>The font was <a href="http://khenikeri.blogspot.com/2011/04/pagul-web-font.html" target="_blank">renamed to Pagul</a>, meaning &#8216;Footstep&#8217; in Sourashtra, and <a href="https://sourceforge.net/projects/pagul/" target="_blank">hosted on SourceForge</a>.</p>
<p>Once we had a font with a proper license, we wanted it to be available in operating systems. I filed a <a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=623944" target="_blank">packaging request</a> in Debian. <a href="http://blog.copyninja.info/" target="_blank">Vasudev Kamath</a> of the Debian India Team packaged it and it is now available in <a href="http://packages.debian.org/sid/fonts-pagul" target="_blank">Debian unstable</a> (sid). Parag Nemade of Fedora India <a href="https://bugzilla.redhat.com/show_bug.cgi?id=699587" target="_blank">packaged the font for Fedora</a> and it will be available in the upcoming Fedora 15.</p>
<p>To add support for a new language in an operating system, we need <a href="http://en.wikipedia.org/wiki/Locale" target="_blank">a locale definition</a>. In GNU/Linux this is the glibc locale definition. With the help of Prabu, I prepared the saz_IN locale file for glibc and filed it as a <a href="https://bugzilla.redhat.com/show_bug.cgi?id=698346" target="_blank">bug report to add to glibc</a>. I hope it will soon be part of glibc.</p>
<p>All of this was possible because it was GNU/Linux, or free software. Things are more difficult on the other side, in the world of proprietary operating systems. There is nothing we can do with those operating systems. Since there is no &#8216;market&#8217; for these minority languages, adding support for them won&#8217;t become a priority for those companies. Users will see squares or question marks when they visit the Sourashtra Wikipedia.</p>
<p>We are working on a solution for this, not only for Sourashtra but as a common solution for all languages. We are developing a WebFonts extension for MediaWiki that embeds fonts in wiki pages, so that users do not need to have the fonts installed on their computers. The extension is <a href="http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WebFonts" target="_blank">in development</a> and one can preview it in <a href="http://thottingal.in/wiki/" target="_blank">my test wiki</a>. For Sourashtra, we added webfonts support (<a href="http://thottingal.in/wiki/index.php?title=Sourashtra&amp;setlang=saz" target="_blank">preview</a>).</p>
<p>Input tools need to be developed and packaged. For MediaWiki, we can easily add this support with the help of the Narayam extension.</p>
<p>In the <a href="http://silpa.org.in" target="_blank">Silpa project</a>, I also added server-side PDF/PNG/SVG <a href="http://silpa.org.in/Render" target="_blank">rendering support</a> for Sourashtra.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2011/05/07/language-ecosystem-sourashtra/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Cross Language Approximate Search on Indic Languages- A demo</title>
		<link>http://thottingal.in/blog/2011/04/03/cross-language-approximate-search-on-indic-languages-a-demo/</link>
		<comments>http://thottingal.in/blog/2011/04/03/cross-language-approximate-search-on-indic-languages-a-demo/#comments</comments>
		<pubDate>Sun, 03 Apr 2011 11:27:39 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[silpa]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=335</guid>
		<description><![CDATA[A demo of cross language approximate search in Indic text: The Malayalam word സാമ്പാര്‍ is compared against a paragraph from http://ml.wikipedia.org/wiki/Sambar. In the bottom half, words marked in yellow color are search results. You can see that a Kannada word ಸಾಂಬಾರ್‍ is matched for Malayalam word. And that is why this is called cross-language. The [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">A demo of cross language approximate search in Indic text:<br />
<a href="http://thottingal.in/images/silpaappoximatesearch-demo-1.png"><img class="aligncenter" src="http://thottingal.in/images/silpaappoximatesearch-demo-1.png" alt="click to enlarge" width="NaN" height="NaN" /></a><br />
The Malayalam word സാമ്പാര്‍ is compared against a paragraph from <a href="http://ml.wikipedia.org/wiki/Sambar">http://ml.wikipedia.org/wiki/Sambar</a>.<br />
In the bottom half, the words marked in yellow are the search results.<br />
You can see that the Kannada word ಸಾಂಬಾರ್‍ is matched for the Malayalam word, and that is why this is called cross-language.<br />
The inflected forms of the word സാമ്പാര്‍, such as സാമ്പാറും and സാമ്പാറു, are also found as results.<br />
This is the kind of search we need for Indic languages, not just the letter-by-letter comparison we do for English.</p>
<p style="text-align: center;">Another example shows all the inflected forms of the noun പാലക്കാട്, along with the same word written in Tamil, Telugu and Hindi. The search returns the results in those languages too. <a href="http://thottingal.in/images/silpaappoximatesearch-demo-2.png"><img class="aligncenter" src="http://thottingal.in/images/silpaappoximatesearch-demo-2.png" alt="click to enlarge" width="NaN" height="NaN" /></a></p>
<p>You can try it here: <a href="http://silpa.org.in/ApproxSearch">http://silpa.org.in/ApproxSearch</a></p>
<p>This is a <a href="http://en.wikipedia.org/wiki/Fuzzy_string_searching">fuzzy string search</a> application. It illustrates the combined use of the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">edit distance</a> and <a href="http://silpa.org.in/Soundex">Indic Soundex</a> algorithms.</p>
<p>By mixing both &#8220;written like&#8221; (edit distance) and &#8220;sounds like&#8221; (soundex) comparisons, we achieve an efficient approximate string search. This application is capable of cross-language string search too. That means you can search for Hindi words in Malayalam text. If there is any Malayalam word that is an approximate transliteration of the Hindi word, or sounds like the Hindi word, it will be returned as an approximate match. The &#8220;written like&#8221; algorithm used here is a bigram average algorithm: the ratio of the number of bigrams common to the two strings to the average number of bigrams in them gives a factor between 0 and 1. Similarly, the soundex algorithm also gives a weight. By selecting the words whose comparison weight is more than the threshold weight (which is 0.6), we get the search results.</p>
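<p>To make the bigram average idea concrete, here is a minimal Python sketch. It is not the actual Silpa code: the function names are made up for illustration, it covers only the &#8220;written like&#8221; half (no soundex), and it compares raw code points, which is a simplification for Indic text.</p>

```python
from collections import Counter


def bigrams(word):
    """Return the list of adjacent character pairs in word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def bigram_average(first, second):
    """Ratio of shared bigrams to the average bigram count of both words."""
    a, b = bigrams(first), bigrams(second)
    if not a or not b:
        # Words shorter than two characters have no bigrams to compare.
        return 1.0 if first == second else 0.0
    # Counter intersection counts each shared bigram as often as it
    # occurs in both words.
    common = sum((Counter(a) & Counter(b)).values())
    return common / ((len(a) + len(b)) / 2)


THRESHOLD = 0.6  # the threshold weight mentioned in the text


def approximate_matches(query, words, threshold=THRESHOLD):
    """Keep the words whose written-like weight exceeds the threshold."""
    return [w for w in words if bigram_average(query, w) > threshold]
```

<p>For example, approximate_matches("sambar", ["sambaar", "rasam"]) keeps only "sambaar", since "rasam" shares just two bigrams with the query.</p>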
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2011/04/03/cross-language-approximate-search-on-indic-languages-a-demo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tamil Collation in GLIBC</title>
		<link>http://thottingal.in/blog/2011/02/26/tamil-collation-in-glibc/</link>
		<comments>http://thottingal.in/blog/2011/02/26/tamil-collation-in-glibc/#comments</comments>
		<pubDate>Sat, 26 Feb 2011 12:29:51 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[Collation]]></category>
		<category><![CDATA[glibc]]></category>
		<category><![CDATA[Tamil]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=328</guid>
		<description><![CDATA[A  few months back, we started fixing the collation rules of Indian languages in GNU C library. Pravin Satpute prepared patches for many languages and I prepared patches for Malayalam and Tamil. Later Pravin enhanced the Tamil patch. You can read the rules used for Malayalam collation here[PDF document]. Tamil patch was applied to upstream, [...]]]></description>
			<content:encoded><![CDATA[<p>A few months back, we started fixing the <a href="http://en.wikipedia.org/wiki/Collation" target="_blank">collation</a> rules of Indian languages in the GNU C library. Pravin Satpute prepared patches for many languages and I prepared patches for Malayalam and Tamil. Later Pravin enhanced the Tamil patch.</p>
<p>You can read the rules used for Malayalam collation <a href="http://smc.org.in/doc/malayalam-collation.pdf">here [PDF document]</a>. The Tamil patch was applied upstream, but the bug is still open since there is some confusion about the results.</p>
<p>Before reading the discussion below, please read the discussion that took place in the bug report: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=514110">[ta_IN] Tamil collation rules are not working in other locales</a></p>
<p>Since many Tamil friends can give valuable comments on this, I am giving an explanation of my patch here. K Sethu gave some interesting <a href="https://bugzilla.redhat.com/show_bug.cgi?id=514110#c9">comments</a> on the patch and I would like to hear from others also. Since collation is a very important component of Tamil support, I feel that an open discussion should happen among language speakers outside bug trackers, leading to a consensus.</p>
<p>This is the logic currently used in the Tamil collation rules. The Malayalam collation rules follow the same logic.</p>
<ol>
<li>Consider each consonant as a pure consonant + the implicit vowel அ, i.e. க = க் + அ and த = த் + அ</li>
<li>Similarly கா = க்+ ஆ,  தி = த்+ இ</li>
<li>From #1 and #2, க் &lt; க and த் &lt; த. We get this output, for example:<br />
அ<br />
அக்<br />
அகம்<br />
அகால<br />
அக்கம்<br />
அக்கு<br />
But K Sethu questions this order in <a href="https://bugzilla.redhat.com/show_bug.cgi?id=514110#c9" target="_blank">his comment here</a>. According to him,<br />
<strong> ( consonant1 + virama + consonant2 ) &lt;  ( consonant1 + vowel + [consonant2] )</strong><br />
so the correct sequence should be அ, அக், அக்கம், அக்கு, அகம், அகால<br />
But as per my patch,<br />
<strong> ( consonant1 + virama + consonant2 ) &gt;  ( consonant1 + vowel + [consonant2] )</strong><br />
i.e. all conjuncts for consonant1 come after all consonant1 + vowel + * sequences.<br />
So let me try to explain this behaviour.</li>
<li>Let us take க்த and கத:<br />
க்த = க் + த் + அ<br />
கத = க் + அ + த் + அ<br />
Considering the weight comparison logic (decreasing weight from left to right), the comparison is between<br />
க் + த் + அ and க் + அ + த் + அ<br />
Since க் is common as the first weight, we remove it, so it becomes<br />
த் + அ and அ + த் + அ<br />
Since a pure consonant sorts after a vowel, த் &gt; அ, so<br />
த் + அ &gt; அ + த் + அ<br />
and thereby<br />
க்த &gt; கத<br />
So conjuncts come after the consonant + vowel pairs, hence the result given in #3.</li>
</ol>
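<p>The steps above can be sketched in a few lines of Python. This is only an illustration of the decomposition logic, not the glibc locale definition: the function name and the (class, letter) weights are made up here, and the sketch assumes precomposed (NFC) Tamil text with only vowels, consonants, vowel signs and the virama.</p>

```python
VIRAMA = "\u0BCD"       # Tamil sign virama (pulli)
IMPLICIT_A = "\u0B85"   # Tamil letter A, the implicit vowel
VOWELS = {chr(cp) for cp in range(0x0B85, 0x0B95)}       # independent vowels
CONSONANTS = {chr(cp) for cp in range(0x0B95, 0x0BBA)}   # consonant letters

# Dependent vowel sign -> independent vowel (rule #2 in the list)
SIGN_TO_VOWEL = {
    "\u0BBE": "\u0B86", "\u0BBF": "\u0B87", "\u0BC0": "\u0B88",
    "\u0BC1": "\u0B89", "\u0BC2": "\u0B8A", "\u0BC6": "\u0B8E",
    "\u0BC7": "\u0B8F", "\u0BC8": "\u0B90", "\u0BCA": "\u0B92",
    "\u0BCB": "\u0B93", "\u0BCC": "\u0B94",
}


def collation_key(text):
    """Expand text into (class, letter) units.

    Vowels get class 0 and pure consonants class 1, so every pure
    consonant sorts after every vowel, as in step 4 of the list.
    """
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            key.append((1, ch))                  # the pure consonant
            if nxt == VIRAMA:
                i += 1                           # explicit virama, no vowel
            elif nxt in SIGN_TO_VOWEL:
                key.append((0, SIGN_TO_VOWEL[nxt]))
                i += 1
            else:
                key.append((0, IMPLICIT_A))      # implicit vowel (rule #1)
        elif ch in VOWELS:
            key.append((0, ch))
        else:
            key.append((2, ch))                  # anything else sorts last
        i += 1
    return key
```

<p>Sorting the six words from #3 with sorted(words, key=collation_key) gives the order produced by my patch, with the conjunct forms after the consonant + vowel forms.</p>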
<p>Apart from these, equal weights are assigned for  ோ (0BCB), ௌ (0BCC), and their canonical equivalent forms.</p>
<p>If anybody is interested in testing the patch, get the ta_IN and iso14651_t1_common files from <a href="http://sourceware.org/git/?p=glibc.git;a=tree;f=localedata/locales;h=97c84a37822446bba3d52c9d5001a420e1aebb85;hb=refs/heads/release/2.13/master">here</a>, back up the existing copies of those files in /usr/share/i18n/locales, and place the two new files there. Reconfigure your locale using &#8220;sudo dpkg-reconfigure locales&#8221;. Sort some random file using &#8220;LANG=ta_IN sort yourfile&#8221;. If your distro is not Debian based, follow the instructions from <a href="http://pravin-s.blogspot.com/2008/08/shortcut-method-for-testing-new-locale.html">here</a>.</p>
<p>There is also an easier way to test this. The Silpa project provides an online application for Indic language collation. You can <a href="http://silpa.smc.org.in/Sort" target="_blank">try it from here</a>. It is an implementation of the Unicode Collation Algorithm. The Unicode collation definition has many mistakes, but we have a patched version. You can compare the results of the original and patched versions.</p>
<p>Feel free to share this discussion with anybody interested in Tamil computing. I would be happy to help with the <strong>implementation</strong> if we reach a consensus.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2011/02/26/tamil-collation-in-glibc/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Identifiers In Indic Languages</title>
		<link>http://thottingal.in/blog/2011/01/08/identifiers-in-indic-languages/</link>
		<comments>http://thottingal.in/blog/2011/01/08/identifiers-in-indic-languages/#comments</comments>
		<pubDate>Sat, 08 Jan 2011 11:27:00 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Malayalam]]></category>
		<category><![CDATA[CDAC]]></category>
		<category><![CDATA[icann]]></category>
		<category><![CDATA[idn]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[standards]]></category>
		<category><![CDATA[zwj]]></category>
		<category><![CDATA[zwnj]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=314</guid>
		<description><![CDATA[Recently, while preparing a critique for  IDN Policy for Malayalam language prepared by CDAC,  I noticed that ICANN does not allow control characters in the domain names.  Sometime back I noticed Python 3 identifiers also does not allow control characters in the Identifiers. This blog post attempts to analyze the issue by looking at the [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, while preparing a critique of the <a href="http://wiki.smc.org.in/CDAC-IDN-Critique" target="_blank">IDN Policy for Malayalam</a> language prepared by CDAC, I noticed that ICANN does not allow control characters in domain names. Some time back I noticed that Python 3 also does not allow control characters in identifiers. This blog post attempts to analyze the issue by looking at the Unicode and ICANN specifications about these special characters.</p>
<p>Apart from the regular characters of the scripts, the <a href="http://en.wikipedia.org/wiki/Zero-width_joiner" target="_blank">Zero Width Joiner</a> (ZWJ) and <a href="http://en.wikipedia.org/wiki/Zero-width_non-joiner" target="_blank">Zero Width Non-Joiner</a> (ZWNJ) are widely used in Indic languages to control how ligatures are formed. For some samples of how they are used, refer to the Wikipedia links. Being invisible control characters, they are often removed during normalization, particularly before doing a string comparison or collation (sort).</p>
<p>Identifiers, the strings that uniquely represent some data, often have a policy on what kind of characters they can contain. For example, an email address is an identifier that unambiguously identifies somebody&#8217;s mailbox, and it does not allow &#8216;space&#8217; characters in between. Some examples of this kind of identifier are email ids, web domain addresses and variables in programming languages.</p>
<p>Gone are the days when identifiers could be represented only using English characters. Python 3.0+ allows you to define a variable in a program using any word that can be represented in Unicode. For more details on this Python feature, read <a href="http://www.python.org/dev/peps/pep-3131/" target="_blank">PEP 3131 &#8211; Supporting Non-ASCII Identifiers</a>. Some samples: <a href="http://wiki.python.org/moin/MalayalamLanguage" target="_blank">a program written in Malayalam</a>, <a href="http://wiki.python.org/moin/TamilLanguage" target="_blank">in Tamil</a>, and <a href="http://wiki.python.org/moin/HindiLanguage" target="_blank">in Hindi</a>.</p>
<p>The same is the case with web addresses. With the advent of Internationalized Domain Names (IDN), which allow you to register web addresses in your own language, the English-only web address scene is changing.</p>
<p>But this change raises a question about the definition of &#8216;identifiers&#8217;: just as for English, which characters are allowed in a domain name or a programming language identifier? Standards and specifications are being drafted on this for each language. For Internationalized Domain Names in Indian languages, <a href="http://cdac.in/" target="_blank">CDAC</a> is drafting the policy. For Python, PEP 3131 has the specification.</p>
<p>As a general rule, the Unicode standard and the standards based on it do not allow Unicode control characters such as ZWJ and ZWNJ in identifiers. Based on that, <a href="http://icann.org/" target="_blank">The <em>Internet Corporation for Assigned Names and Numbers</em> (<em>ICANN</em>)</a> follows <a href="http://tools.ietf.org/html/rfc3454#page-12" target="_blank">RFC 3454</a>, which prohibits a list of control characters. RFC 3454 is used as the specification when a Unicode-encoded domain name is converted to its <a href="http://en.wikipedia.org/wiki/Punycode" target="_blank">Punycode</a> version for validation. For example, Thottingal in Malayalam, തോട്ടിങ്ങല്‍ (0D24 0D4B 0D1F 0D4D 0D1F 0D3F 0D19 0D4D 0D19 0D32 0D4D 200D), becomes xn--fwcaqax2g2d7dtadc when converted to Punycode. This conversion drops the ZWJ at the end of the word. If I do a reverse conversion from xn--fwcaqax2g2d7dtadc to Unicode, what I get is തോട്ടിങ്ങല് (0D24 0D4B 0D1F 0D4D 0D1F 0D3F 0D19 0D4D 0D19 0D32 0D4D). Note that code point 200D, the ZWJ, has been removed. That means I cannot register my domain thottingal.in in Malayalam properly. You can verify this using the <a href="http://demo.icu-project.org/icu-bin/idnbrowser?t=%E0%B4%A4%E0%B5%8B%E0%B4%9F%E0%B5%8D%E0%B4%9F%E0%B4%BF%E0%B4%99%E0%B5%8D%E0%B4%99%E0%B4%B2%E0%B5%8D%E2%80%8D" target="_blank">ICU online converter</a>. As another example, Tamilnadu in Malayalam, തമിഴ്‌നാട് (0D24 0D2E 0D3F 0D34 0D4D 200C 0D28 0D3E 0D1F 0D4D), becomes xn--lwcjmx4a2de7id. When I do a reverse conversion, I get തമിഴ്നാട് (0D24 0D2E 0D3F 0D34 0D4D 0D28 0D3E 0D1F 0D4D). This time the ZWNJ (200C) is lost. Try it yourself using <a href="http://demo.icu-project.org/icu-bin/idnbrowser?t=%E0%B4%A4%E0%B4%AE%E0%B4%BF%E0%B4%B4%E0%B5%8D%E2%80%8C%E0%B4%A8%E0%B4%BE%E0%B4%9F%E0%B5%8D" target="_blank">the converter</a>. This means one cannot register a website with Tamilnadu written in Malayalam properly. The IDN policies for Indic languages are based on these exclusion rules for ZWJ and ZWNJ.</p>
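<p>The round trip described above can be reproduced with a short Python sketch. It assumes Python&#8217;s built-in <code>idna</code> codec, which implements the older IDNA 2003 rules built on RFC 3454, and the standard <code>stringprep</code> module. The code points are the ones listed above for Thottingal; the variable names are only illustrative.</p>

```python
import stringprep

ZWNJ, ZWJ = "\u200c", "\u200d"

# Thottingal in Malayalam, ending in ZWJ (code points as listed above).
label = "\u0d24\u0d4b\u0d1f\u0d4d\u0d1f\u0d3f\u0d19\u0d4d\u0d19\u0d32\u0d4d\u200d"

# RFC 3454 table B.1 ("commonly mapped to nothing") contains both joiners.
print(stringprep.in_table_b1(ZWJ), stringprep.in_table_b1(ZWNJ))

ascii_form = label.encode("idna")        # Unicode label to its xn-- form
round_trip = ascii_form.decode("idna")   # and back to Unicode

print(ascii_form)
print([format(ord(ch), "04X") for ch in round_trip])
print(ZWJ in round_trip)   # False: the joiner does not survive the round trip
```

<p>The joiner is dropped during the mapping step, before the Punycode encoding itself, so no information about it is left in the xn-- form.</p>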
<p>In Python 3.0+, an identifier cannot contain ZWJ, ZWNJ or any other control character. See this bug report for more details: <a href="http://bugs.python.org/issue5358" target="_blank">Issue 5358</a>.</p>
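<p>A quick way to see how a given interpreter treats such a name is sketched below. It only uses <code>unicodedata.category</code> and <code>str.isidentifier</code>; the sample word is an illustrative Malayalam string written with and without ZWJ. The answer for the ZWJ spelling comes from the Unicode data bundled with the interpreter, so treat the printed value as an observation about your Python version rather than a fixed fact.</p>

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"

with_zwj = "\u0d28\u0d28\u0d4d\u200d\u0d2e"   # Malayalam word spelled with ZWJ
without_zwj = "\u0d28\u0d28\u0d4d\u0d2e"      # same letters, no ZWJ

# Both joiners are format characters (general category Cf).
print(unicodedata.category(ZWJ), unicodedata.category(ZWNJ))

# str.isidentifier() applies the interpreter's own identifier rules, so the
# first result depends on the Python and Unicode version in use.
print(with_zwj.isidentifier())
print(without_zwj.isidentifier())
```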
<p>All of the above issues stem from the assumption that ZWJ and ZWNJ are prohibited in identifiers in all cases. But that is not true. Look at <a href="http://unicode.org/reports/tr31" target="_blank">Unicode Standard Annex 31</a> &#8211; &#8220;Unicode Identifier and Pattern Syntax&#8221; (TR31). TR31 is based on <a href="http://unicode.org/review/pr-96.html" target="_blank">Public Review 96</a> &#8211; &#8220;Allowing Special Characters in Identifiers&#8221;.</p>
<blockquote><p><em>This annex describes specifications for recommended defaults for the        use of Unicode in the definitions of identifiers and in pattern-based syntax.        It also supplies guidelines for use of normalization with identifiers. [...]</em></p>
<p><em>default-ignorable characters are normally        excluded from Unicode identifiers. However, visible distinctions created        by certain format characters (particularly the </em><em>Join_Control characters)        are necessary  in certain languages. A        blanket exclusion of these characters makes it impossible to create        identifiers with the correct visual appearance for common words or phrases in those languages.        Identifier systems that attempt to provide more natural representations        of terms in modern, customary use should allow these        characters in input and display, but limit them to contexts in which they are necessary. [...]</em></p></blockquote>
<p>But since the characters are invisible, security considerations require a clear definition of where they can be used. What if a domain is registered with five consecutive ZWNJs in it? It will look the same as a string with four ZWNJs. So TR31 defines three valid cases where ZWNJ and ZWJ can be used in an identifier.</p>
<ul>
<li>Allow ZWNJ in breaking a cursive connection</li>
<li>Allow ZWNJ in a conjunct context (example:  തമിഴ്‌നാട് , ദൃക്‌സാക്ഷി)</li>
<li>Allow ZWJ in a conjunct context (examples:  ന + ് + zwj -&gt; ന്‍ , <big> क+  ् +  zwj -&gt; </big> <big>क्‍</big> )</li>
</ul>
<p>These three cases cover all ZWJ and ZWNJ usage patterns in our languages.</p>
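<p>The two conjunct cases can be sketched as a context check in Python. This is a simplified illustration, not the full TR31 rule: it assumes a joiner must sit directly after a letter plus virama (canonical combining class 9), that ZWNJ must also be followed by a letter, and it leaves out the cursive-connection case, which needs joining-type data that the standard library does not expose. The function names are hypothetical.</p>

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"

def is_virama(ch):
    # Viramas have canonical combining class 9 in the Unicode database.
    return unicodedata.combining(ch) == 9

def is_letter(ch):
    return unicodedata.category(ch).startswith("L")

def joiners_in_valid_context(text):
    """Return True if every ZWJ/ZWNJ directly follows letter + virama."""
    for i, ch in enumerate(text):
        if ch not in (ZWNJ, ZWJ):
            continue
        before = text[max(i - 2, 0):i]
        after = text[i + 1:i + 2]
        if not (len(before) == 2 and is_letter(before[0]) and is_virama(before[1])):
            return False
        # Simplification: ZWNJ must also be followed by a letter.
        if ch == ZWNJ and not (after and is_letter(after)):
            return False
    return True

tamilnadu = "\u0d24\u0d2e\u0d3f\u0d34\u0d4d\u200c\u0d28\u0d3e\u0d1f\u0d4d"
five_zwnj = "\u0d34\u0d4d" + ZWNJ * 5 + "\u0d28"

print(joiners_in_valid_context(tamilnadu))   # True
print(joiners_in_valid_context(five_zwnj))   # False
```

<p>Because each joiner has to be anchored to a virama, a run of repeated invisible characters like the five-ZWNJ string is rejected.</p>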
<p>So it is now clear that the Unicode standard allows them in identifiers. In that case, there should be no conflict between the Unicode identifier policy and the ICANN policy or any other identifier policy such as PEP 3131. Blanket exclusion of these characters is not acceptable. So RFC 3454 should be made compatible with TR31. The IDN policy for Indic languages should be based on that new specification and not on the existing RFC 3454. Since CDAC is responsible for the Indic domain policy, they should take responsibility for bringing about this change.</p>
<p>To get PEP 3131 changed, <a href="http://www.muthukadan.net/" target="_blank">Baiju M</a> and I started a wiki page explaining what needs to change. <a href="http://wiki.python.org/moin/ZwjAndZwnjAsIdentifiers" target="_blank">Read it here</a>.</p>
<p>Having said that, is it desirable to have two domains, one with a valid ZWJ/ZWNJ usage and another without? Of course, they will be visually different, which rules out spoofing. The question then is whether those two strings represent two different words in the language.</p>
<p>As far as Malayalam is concerned there are three cases here:</p>
<ol>
<li>A missing ZWNJ is considered a spelling mistake. The pair തമിഴ്‌നാട് (correct) and തമിഴ്നാട് (incorrect) is an example. Should we allow both domains? I don&#8217;t know of any case where a missing ZWNJ forms another valid word with a different meaning.</li>
<li>A missing ZWJ makes a different word with a different meaning. This is very rare. The pair വന്‍യവനിക, വന്യവനിക is often cited as an example, but many people argue that this is not a valid case.</li>
<li>A missing ZWJ is not a spelling mistake, just a writing style. There are many examples of this; നന്‍മ-നന്മ is an obvious one.</li>
</ol>
<p>So the question is whether a domain that differs from an already registered domain only by a valid use of ZWJ/ZWNJ should be allowed. I would suggest using the existing policy for domain comparison here: if the collation weights of the existing domain and the domain to be registered are the same, don&#8217;t register the new one. ZWJ and ZWNJ are characters with zero collation weight, and they are ignored in collation and string comparison.</p>
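<p>A rough sketch of that registration check follows. A real registry would compare proper collation keys (for example through ICU); here, as a stated simplification, the key is just the label with the two joiners removed, which mimics their zero collation weight. The function names and the sample data are hypothetical.</p>

```python
ZWNJ, ZWJ = "\u200c", "\u200d"

def comparison_key(label):
    # Stand-in for a collation key: drop the joiners, which carry no weight.
    return label.replace(ZWNJ, "").replace(ZWJ, "")

def conflicts_with_registered(candidate, registered):
    """True if the candidate differs from a registered label only by joiners."""
    key = comparison_key(candidate)
    return any(comparison_key(existing) == key for existing in registered)

# An already registered label spelled with ZWJ, and two new requests.
registered = ["\u0d28\u0d28\u0d4d\u200d\u0d2e"]
same_without_zwj = "\u0d28\u0d28\u0d4d\u0d2e"
unrelated = "\u0d24\u0d2e\u0d3f\u0d34\u0d4d"

print(conflicts_with_registered(same_without_zwj, registered))  # True
print(conflicts_with_registered(unrelated, registered))         # False
```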
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2011/01/08/identifiers-in-indic-languages/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Dictionary Jabber Buddy Bots</title>
		<link>http://thottingal.in/blog/2010/11/20/dictionary-jabber-buddy-bots/</link>
		<comments>http://thottingal.in/blog/2010/11/20/dictionary-jabber-buddy-bots/#comments</comments>
		<pubDate>Sat, 20 Nov 2010 17:24:05 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Malayalam]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[SMC]]></category>
		<category><![CDATA[bots]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[xmpp]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=308</guid>
		<description><![CDATA[Recently we released two Jabber buddy bots for dictionary lookup. By adding eng.mal.dict@gmail.com as a chat contact one can ask for the meaning of an English word in Malayalam by just sending a chat message. Similarly for English-Hindi or Hindi-English dictionary, we have another bot eng.hin.dict@jabber.org. Both of these dictionaries use Dict databases based on  [...]]]></description>
			<content:encoded><![CDATA[<p>Recently we released two Jabber buddy bots for dictionary lookup. By adding eng.mal.dict@gmail.com as a chat contact one can ask for the meaning of an English word in Malayalam by just sending a chat message. Similarly for English-Hindi or Hindi-English dictionary, we have another bot eng.hin.dict@jabber.org. Both of these dictionaries use Dict databases based on  <a title="DICT" href="http://en.wikipedia.org/wiki/DICT" target="_blank">DICT protocol</a>.</p>
<p>Both of these bots were well received by users. We have 8000+ users for the English-Malayalam dictionary. Online blogs and media also gave good publicity. Thanks a lot!</p>
<p><a title="Swathanthra Malayalam Computing" href="http://smc.org.in" target="_blank">SMC</a> developers Rajeesh Nambiar, Ershad, Ragsagar, and Sarath Lakshman helped improve the program. You can get the source code from <a href="http://git.savannah.gnu.org/cgit/smc.git/tree/bots" target="_blank">here</a>. It is a small program written using a Python XMPP library.</p>
<p>We wrote these programs a year ago, in December 2009. We could not launch them for the public since we did not have a server to host them. Web hosting providers usually won&#8217;t allow programs like this to run on their servers. Recently <a href="netdotnet.com" target="_blank">netdotnet.com</a> provided a VPS for SMC, and we could launch them from that server.</p>
<p>The English-Hindi dictionary is reasonably big, but the English-Malayalam one is very small, with only ~10k words. So we added a Malayalam Wiktionary backend to the bot.</p>
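<p>The idea of backing a small dictionary with a second source can be sketched as below. This is not the bot&#8217;s actual code: the two backends are toy in-memory dictionaries standing in for the DICT database and Wiktionary, and <code>make_lookup</code> is a hypothetical helper.</p>

```python
def make_lookup(primary, fallback):
    """Return a function that tries the primary dictionary, then the fallback."""
    def lookup(word):
        key = word.strip().lower()
        for source in (primary, fallback):
            if key in source:
                return source[key]
        return "No definition found for '%s'" % key
    return lookup

# Toy data standing in for the real DICT and Wiktionary backends.
dict_db = {"book": "pusthakam"}
wiktionary = {"river": "puzha"}

reply = make_lookup(dict_db, wiktionary)
print(reply(" Book "))   # answered by the primary dictionary
print(reply("river"))    # answered by the fallback
print(reply("xyzzy"))    # found in neither
```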
<p>Here is a video on how to use the English-Hindi bot, prepared by <a href="http://varunverma.org/blogs/translate-inside-your-google-chat-window/" target="_blank">Varun Verma</a>:</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="350" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="src" value="http://www.youtube.com/v/1znJAHisf5M&amp;feature" /><embed type="application/x-shockwave-flash" width="425" height="350" src="http://www.youtube.com/v/1znJAHisf5M&amp;feature"></embed></object></p>
<ul>
<li>An article about English Malayalam bot in Epathram.com <a href="http://epathram.com/column-itsit/11/03/225654-english-malayalam-dictionary-in-google-chat.html" target="_blank">here. </a></li>
<li>A blog post by Sailesh in Hindi <a href="http://emadad.hindyugm.com/2010/10/know-hindi-meanings-while-chatting.html" target="_blank">http://emadad.hindyugm.com/2010/10/know-hindi-meanings-while-chatting.html</a></li>
</ul>
<p>We can start bots like this for other languages too, if dictionaries with free-software-compatible licenses are available. If interested, please contact me.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2010/11/20/dictionary-jabber-buddy-bots/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Indic Language Computing Workout, Pune</title>
		<link>http://thottingal.in/blog/2010/08/23/indic-language-computing-workout-pune/</link>
		<comments>http://thottingal.in/blog/2010/08/23/indic-language-computing-workout-pune/#comments</comments>
		<pubDate>Mon, 23 Aug 2010 11:29:15 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Indic]]></category>
		<category><![CDATA[talks]]></category>
		<category><![CDATA[workshops]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=297</guid>
		<description><![CDATA[On 22nd August, I conducted a workout session with Praveen on Indic Language Computing at Red Hat Office, Pune. The plan was to solve some of the issues in Devanagari support for the encoding converter Payyans. But most of the time was spent on Introducing the concepts of Indic language computing to participants.  Project Silpa [...]]]></description>
			<content:encoded><![CDATA[<p>On 22nd August, I conducted a workout session with <a href="http://j4v4m4n.in">Praveen</a> on Indic Language Computing at the Red Hat office, Pune. The plan was to solve some of the issues in Devanagari support in the encoding converter <a href="http://wiki.smc.org.in/Payyans">Payyans</a>. But most of the time was spent introducing the concepts of Indic language computing to the participants. <a href="http://smc.org.in/silpa/">Project Silpa</a> was also introduced and demonstrated. Students from College of Engineering, Pune and other colleges attended. Red Hat sponsored the venue at their office. It was very interesting to interact with energetic and enthusiastic students.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2010/08/23/indic-language-computing-workout-pune/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Attending Wikimania 2010</title>
		<link>http://thottingal.in/blog/2010/07/06/wikimania-2010/</link>
		<comments>http://thottingal.in/blog/2010/07/06/wikimania-2010/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:32:45 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=273</guid>
		<description><![CDATA[I will be attending  Wikimania 2010,  Gdansk, Poland.  This annual international conference of the Wikimedia community is from July 9 to July 11. I will be presenting wik2cd, the tool I wrote for Malayalam wikipedia version 1.0 there in a joint workshop with wikipedia offline developers.  I will be joining with Manuel Schneider,  Shiju Alex, [...]]]></description>
			<content:encoded><![CDATA[<p>I will be attending  <a href="http://wikimania2010.wikimedia.org" target="_blank">Wikimania 2010,  Gdansk, Poland</a>.  This annual international conference of the Wikimedia community is from July 9 to July 11.</p>
<p>I will be presenting wik2cd, the tool I wrote for Malayalam Wikipedia version 1.0, in a joint workshop with Wikipedia offline developers. I will be joining Manuel Schneider, Shiju Alex and Martin Walker in the workshop titled <a title="Submissions/Creating offline version of Wiki content - Solutions  and Challenges" href="http://wikimania2010.wikimedia.org/wiki/Submissions/Creating_offline_version_of_Wiki_content_-_Solutions_and_Challenges">Creating offline version of Wiki content – Solutions and Challenges</a>. Apart from this, I will be meeting the <a href="http://code.pediapress.com/" target="_blank">pediapress team</a>, the team behind Wikipedia&#8217;s latest <a href="http://en.wikipedia.org/wiki/Help:Books" target="_blank">PDF/Book export feature</a>. This tool has some issues with Indic languages, mainly because the PDF rendering engine cannot render complex scripts.</p>
<p>Thanks to <a href="http://www.wikimediafoundation.org/" target="_blank">Wikimedia foundation</a> for granting me a scholarship to cover travel expenses.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2010/07/06/wikimania-2010/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Inkscape hyphenation extension</title>
		<link>http://thottingal.in/blog/2009/10/03/inkscape-hyphenation-extension/</link>
		<comments>http://thottingal.in/blog/2009/10/03/inkscape-hyphenation-extension/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 14:33:03 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[extensions]]></category>
		<category><![CDATA[hyphenation]]></category>
		<category><![CDATA[inkscape]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=231</guid>
		<description><![CDATA[One year back I wrote about how to use Inkscape as a workaround solution for DTP in indic scripts. Still we don&#8217;t have any DTP software which supports Indic scripts in Unicode. Scribus still does not have the Indic support. One issue with inkscape when used as DTP for indic script was, a few indic [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">One year back I wrote about <a href="http://thottingal.in/blog/2008/04/10/using-inkscape-for-dtp-in-indic-scripts/">how to use Inkscape as a workaround solution for DTP in indic scripts</a>. Still we don&#8217;t have any DTP software which supports Indic scripts in Unicode. <a href="http://www.scribus.net/">Scribus</a> still does not have the Indic support.</p>
<p style="text-align: justify;">One issue with using Inkscape as DTP software for Indic scripts is that some Indic scripts need hyphenation whenever text is justified. For example, Malayalam has lengthy words, and space is often wasted in lines if the text is not automatically hyphenated. But this feature was not available in Inkscape. There is a <a href="https://bugs.launchpad.net/inkscape/+bug/171140">wishlist bug</a> for adding this feature to Inkscape. I tried to develop an Inkscape extension to achieve this.</p>
<p style="text-align: justify;">It is built on top of the Python hyphenation code written by Wilbert Berendsen. The hyphenation rules, also called patterns, are those of TeX or OpenOffice. So I can support any language which has TeX hyphenation rules. But since the hyphenation rules are language specific, we first need a way to select the language of the text. Only then can we select the rules and do the hyphenation. This is tricky to implement. Asking for the language of the text every time it is justified is not a good idea. Setting a language for the document is another choice, but what if the text contains multiple languages? For Indian languages, however, it is easy: we can automatically detect the script using Unicode code points and load the rules accordingly. So for the time being, my extension supports only English and all Indian languages.</p>
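<p style="text-align: justify;">Script detection from code points can be sketched as follows. This is an illustration rather than the extension&#8217;s own code: it relies on the fact that the Unicode blocks for the major Indian scripts are each 0x80 code points wide and laid out consecutively from U+0900 to U+0D7F. The function name and the returned labels are hypothetical.</p>

```python
# Unicode blocks for the major Indian scripts are 0x80 code points wide and
# laid out back to back from U+0900 (Devanagari) to U+0D7F (Malayalam).
INDIC_SCRIPTS = ["Devanagari", "Bengali", "Gurmukhi", "Gujarati", "Oriya",
                 "Tamil", "Telugu", "Kannada", "Malayalam"]

def detect_script(word):
    """Return the script of the first recognisable letter, or None."""
    for ch in word:
        cp = ord(ch)
        if cp in range(0x0900, 0x0D80):
            return INDIC_SCRIPTS[(cp - 0x0900) // 0x80]
        if ch.isascii() and ch.isalpha():
            return "Latin"
    return None

print(detect_script("\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"))  # Malayalam
print(detect_script("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))        # Tamil
print(detect_script("hyphenation"))                            # Latin
```

<p style="text-align: justify;">The detected script name can then be used to pick the matching set of hyphenation patterns for each word, so a mixed-language text needs no manual language setting.</p>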
<p style="text-align: justify;">Download the extension from <a href="http://thottingal.in/projects/inkscape_hyphenation/inkscape-hyphenation.zip">http://thottingal.in/projects/inkscape_hyphenation/inkscape-hyphenation.zip</a>. On GNU/Linux machines, extract the zip file and copy its contents to the /usr/share/inkscape/extensions folder. On Windows, extract it to the [inkscape installation directory]\extensions folder. Then close and reopen Inkscape. You will see a menu item named Hyphenate in the Effects-&gt;Text menu. In the document, add a text field and enter text in any Indian language. Select the text and apply hyphenation with Effects-&gt;Text-&gt;Hyphenate. Then change the alignment of the text to justify. You will see the text hyphenated and occupying the maximum possible space in the text field.</p>
<p style="text-align: justify;">I got satisfactory results with Malayalam and Tamil. I did not test other languages. The following images illustrate a hyphenated, justified, two-column layout of text done in Inkscape.</p>
<div class="mceTemp" style="text-align: justify;">
<dl class="wp-caption alignnone" style="width: 417px;">
<dt class="wp-caption-dt"><a href="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape.png"><img title="Malayalam Hyphenation In inkscape " src="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape.png" alt="Malayalam Hyphenation In inkscape " width="407" height="574" /></a></dt>
<dd class="wp-caption-dd">Malayalam Hyphenation In inkscape </dd>
</dl>
</div>
<div class="mceTemp" style="text-align: justify;">
<dl class="wp-caption alignnone" style="width: 420px;">
<dt class="wp-caption-dt"><a href="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape-tamil.png"><img title="Tamil Hyphenation in Inkscape" src="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape-tamil.png" alt="Tamil Hyphenation in Inkscape" width="410" height="577" /></a></dt>
<dd class="wp-caption-dd">Tamil Hyphenation in Inkscape</dd>
</dl>
</div>
<p style="text-align: justify;">We had a discussion about this on the <a href="http://sourceforge.net/mailarchive/forum.php?thread_name=20090924155717.GC4250%40bowman.infotech.monash.edu.au&amp;forum_name=inkscape-devel">inkscape mailing list</a>. Some developers suggested having this feature built in rather than as an extension. There are a few issues to be solved for that. One is language selection, as I explained. The other concerns the hyphenation character to be used. <a href="http://www.unicode.org/unicode/reports/tr14/#SoftHyphen">The Unicode standard insists on using the soft hyphen</a>, U+00AD, as the hyphenation character. This is an invisible character. For Malayalam, visible hyphens are not required. But some other languages require a hyphen sign where the word is broken at the end of the line. The rules for when the soft hyphen should be visible are not clear in Unicode&#8217;s specification. Pango never displays the soft hyphen. This specification of the soft hyphen has drawn criticism:</p>
<ul style="text-align: justify;">
<li>Jukka Korpela, Soft hyphen (SHY) &#8211; a hard problem?  <a href="http://www.cs.tut.fi/%7Ejkorpela/shy.html" target="_blank">http://www.cs.tut.fi/~jkorpela/shy.html</a></li>
<li> Markus Kuhn, Unicode interpretation of SOFT HYPHEN breaks ISO 8859-1   compatibility. Unicode Technical Committee document L2/03-155R, June 2003. <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/L2/03155r-kuhn-soft-hyphen.pdf" target="_blank">http://www.cl.cam.ac.uk/~mgk25/ucs/L2/03155r-kuhn-soft-hyphen.pdf</a></li>
</ul>
<p style="text-align: justify;">So I think either the rendering engines need to change or Unicode needs to clear up the confusion. OpenOffice and HTML rendering engines, however, always show the soft hyphen at the end of the line, which is not desired for some languages.</p>
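<p style="text-align: justify;">For illustration, marking break points with the soft hyphen is a one-line operation once a word has been split. The sketch below assumes the split has already been produced by some hyphenation routine; the helper names are hypothetical and the English word is only a readable example.</p>

```python
SOFT_HYPHEN = "\u00ad"

def add_soft_hyphens(parts):
    """Join the pieces of a word with U+00AD at each permitted break."""
    return SOFT_HYPHEN.join(parts)

def strip_soft_hyphens(text):
    """Recover the plain word, for example before searching or comparing."""
    return text.replace(SOFT_HYPHEN, "")

word = add_soft_hyphens(["hy", "phen", "ation"])
print(len(word))                 # 13: eleven letters plus two soft hyphens
print(strip_soft_hyphens(word))  # hyphenation
```

<p style="text-align: justify;">Whether a hyphen glyph is drawn at a chosen break is then left entirely to the rendering engine, which is where the behaviour differs between toolkits.</p>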
<p style="text-align: justify;">Try this extension and let me know your comments. For small-scale DTP work such as pamphlets, notices and brochures, Inkscape is enough. But since Inkscape is not primarily DTP software and does not have paging support, it may not work well for books and large-scale DTP work.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/10/03/inkscape-hyphenation-extension/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
