<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Santhosh Thottingal</title>
	<atom:link href="http://thottingal.in/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://thottingal.in/blog</link>
	<description>/home/santhosh</description>
	<lastBuildDate>Sat, 28 Nov 2009 08:43:56 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Conferences : FOSS.IN  and  NCIDEEE</title>
		<link>http://thottingal.in/blog/2009/11/28/conferences-foss-in-and-ncideee/</link>
		<comments>http://thottingal.in/blog/2009/11/28/conferences-foss-in-and-ncideee/#comments</comments>
		<pubDate>Sat, 28 Nov 2009 05:14:16 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[dhvani]]></category>
		<category><![CDATA[silpa]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=240</guid>
		<description><![CDATA[FOSS.IN 2009 starts on 1st December. I wanted to attend all 5 days but I have another conference on Dec 1st to 3rd at Chennai. I am attending National Conference on ICTs for the differently- abled/under privileged communities in Education, Employment and Entrepreneurship 2009 &#8211; (NCIDEEE 2009) at Loyola College, Chennai. So I will miss [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://foss.in">FOSS.IN 2009</a> starts on 1st December. I wanted to attend all 5 days, but I have another conference from Dec 1st to 3rd in Chennai. I am attending the <a href="http://cis-india.org/events/ncideee-2009">National Conference on ICTs for the differently- abled/under privileged communities in Education, Employment and Entrepreneurship 2009 &#8211; (NCIDEEE 2009)</a> at Loyola College, Chennai, so I will miss the first 3 days of foss.in.<br />
We have a <a href="http://workouts.foss.in/2009/index.php/Project_SILPA_workout">workout</a> on <a href="http://smc.org.in/silpa">Project Silpa</a> during foss.in. I am also planning to have a workout with <a href="http://debayan.wordpress.com">Debayan</a> and Jinesh to get his <a href="http://hacking-tesseract.blogspot.com">tesseract-indic OCR</a> working with Malayalam.</p>
<p>See you at foss.in!</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/11/28/conferences-foss-in-and-ncideee/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Inkscape hyphenation extension</title>
		<link>http://thottingal.in/blog/2009/10/03/inkscape-hyphenation-extension/</link>
		<comments>http://thottingal.in/blog/2009/10/03/inkscape-hyphenation-extension/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 14:33:03 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[extensions]]></category>
		<category><![CDATA[hyphenation]]></category>
		<category><![CDATA[inkscape]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=231</guid>
		<description><![CDATA[A year ago I wrote about how to use Inkscape as a workaround for DTP in Indic scripts. We still don&#8217;t have any DTP software that supports Indic scripts in Unicode. Scribus still lacks Indic support.
One issue with using Inkscape for DTP in Indic scripts is that a few of these scripts [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">A year ago I wrote about <a href="http://thottingal.in/blog/2008/04/10/using-inkscape-for-dtp-in-indic-scripts/">how to use Inkscape as a workaround for DTP in Indic scripts</a>. We still don&#8217;t have any DTP software that supports Indic scripts in Unicode. <a href="http://www.scribus.net/">Scribus</a> still lacks Indic support.</p>
<p style="text-align: justify;">One issue with using Inkscape for DTP in Indic scripts is that a few of these scripts need hyphenation whenever text is justified. For example, Malayalam has lengthy words, and space is often wasted in lines if the text is not automatically hyphenated. But this feature is not available in Inkscape. There is a <a href="https://bugs.launchpad.net/inkscape/+bug/171140">wishlist bug</a> for adding this feature to Inkscape. I tried to develop an extension for Inkscape to achieve this.</p>
<p style="text-align: justify;">It is built on top of the Python hyphenation code written by Wilbert Berendsen. The hyphenation rules, also called patterns, are those of TeX or Openoffice itself, so I can support any language that has TeX hyphenation rules. But since the hyphenation rules are language specific, we first need a language selection mechanism for the text. Only then can we select the rules and do the hyphenation. This is tricky to implement. Asking for the language of the text every time it is justified is not a good idea. Setting a language for the document is another choice, but what if the text contains multiple languages? For Indian languages it is very easy: we can automatically detect the script using Unicode code points and load the rules accordingly. So for the time being, my extension supports only English and the Indian languages.</p>
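<p style="text-align: justify;">The script detection described above can be sketched in Python roughly as follows. This is a minimal illustration and not the code shipped in the extension: the names <code>INDIC_BLOCKS</code> and <code>detect_language</code> are made up for this example, and it assumes one language per script, which is not true in general (Devanagari is used for both Hindi and Marathi).</p>

```python
# Start of the Unicode block for each major Indic script. Each of these
# blocks spans 128 code points. The language codes are illustrative labels,
# assuming one language per script.
INDIC_BLOCKS = {
    0x0900: "hi",  # Devanagari
    0x0980: "bn",  # Bengali
    0x0A00: "pa",  # Gurmukhi
    0x0A80: "gu",  # Gujarati
    0x0B00: "or",  # Oriya
    0x0B80: "ta",  # Tamil
    0x0C00: "te",  # Telugu
    0x0C80: "kn",  # Kannada
    0x0D00: "ml",  # Malayalam
}


def detect_language(word, default="en"):
    """Guess the hyphenation language from the first Indic character found."""
    for char in word:
        point = ord(char)
        block_start = point - (point % 0x80)
        if block_start in INDIC_BLOCKS:
            return INDIC_BLOCKS[block_start]
    return default


print(detect_language("മലയാളം"))
print(detect_language("தமிழ்"))
print(detect_language("hello"))
```

<p style="text-align: justify;">Anything outside the listed blocks falls back to English here, which matches the idea of supporting only English and Indian languages for now.</p>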
<p style="text-align: justify;">Download the extension from <a href="http://thottingal.in/projects/inkscape_hyphenation/inkscape-hyphenation.zip">http://thottingal.in/projects/inkscape_hyphenation/inkscape-hyphenation.zip</a>. On GNU/Linux machines, extract the zip file and copy its contents to the /usr/share/inkscape/extensions folder. On Windows, extract it to the [inkscape installation directory]\extensions folder. Then close and reopen Inkscape. You will see a menu named Hyphenate in the Effects-&gt;Text menu. In the document, add a text field and enter text in any Indian language. Select the text and apply hyphenation with Effects-&gt;Text-&gt;Hyphenate. Then change the alignment of the text to justify. You will see the text get hyphenated and occupy the maximum possible space in the text field.</p>
<p style="text-align: justify;">I got satisfactory results with Malayalam and Tamil. I did not test other languages. The following images illustrate a hyphenated, justified, two-column layout of text done in Inkscape.</p>
<div class="mceTemp" style="text-align: justify;">
<dl class="wp-caption alignnone" style="width: 417px;">
<dt class="wp-caption-dt"><a href="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape.png"><img title="Malayalam Hyphenation In inkscape " src="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape.png" alt="Malayalam Hyphenation In inkscape " width="407" height="574" /></a></dt>
<dd class="wp-caption-dd">Malayalam Hyphenation In inkscape </dd>
</dl>
</div>
<div class="mceTemp" style="text-align: justify;">
<dl class="wp-caption alignnone" style="width: 420px;">
<dt class="wp-caption-dt"><a href="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape-tamil.png"><img title="Tamil Hyphenation in Inkscape" src="http://thottingal.in/projects/inkscape_hyphenation/hyphenated-inkscape-tamil.png" alt="Tamil Hyphenation in Inkscape" width="410" height="577" /></a></dt>
<dd class="wp-caption-dd">Tamil Hyphenation in Inkscape</dd>
</dl>
</div>
<p style="text-align: justify;">We had a discussion about this on the <a href="http://sourceforge.net/mailarchive/forum.php?thread_name=20090924155717.GC4250%40bowman.infotech.monash.edu.au&amp;forum_name=inkscape-devel">inkscape mailing list</a>. Some developers suggested building this feature in rather than shipping it as an extension. A few issues have to be solved for that. One is language selection, as I explained. The other is the hyphenation character to be used. <a href="http://www.unicode.org/unicode/reports/tr14/#SoftHyphen">The Unicode standard insists on using the soft hyphen</a> &#8211; U+00AD &#8211; as the hyphenation character. This is an invisible character. For Malayalam, visible hyphens are not required. But some other languages require a hyphen sign where the word is broken at the end of the line. The rules for when the soft hyphen should be visible are not clear in Unicode&#8217;s specification. Pango never displays the soft hyphen. There is criticism of this specification of the soft hyphen:</p>
<ul style="text-align: justify;">
<li>Jukka Korpela, Soft hyphen (SHY) &#8211; a hard problem?  <a href="http://www.cs.tut.fi/%7Ejkorpela/shy.html" target="_blank">http://www.cs.tut.fi/~jkorpela/shy.html</a></li>
<li> Markus Kuhn, Unicode interpretation of SOFT HYPHEN breaks ISO 8859-1   compatibility. Unicode Technical Committee document L2/03-155R, June 2003. <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/L2/03155r-kuhn-soft-hyphen.pdf" target="_blank">http://www.cl.cam.ac.uk/~mgk25/ucs/L2/03155r-kuhn-soft-hyphen.pdf</a></li>
</ul>
<p style="text-align: justify;">So I think either the rendering engine has to do something about this, or Unicode needs to clarify the confusion. Openoffice and HTML rendering engines, on the other hand, always show a visible hyphen when a line breaks at a soft hyphen, which is not desired for some languages.</p>
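<p style="text-align: justify;">To make the role of the soft hyphen concrete, here is a small Python sketch of inserting U+00AD at given break positions. The break positions are hard-coded for illustration only. In the extension they would come from the hyphenation patterns, and the function names here are hypothetical.</p>

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, normally invisible unless a line breaks there


def insert_soft_hyphens(word, break_positions):
    """Return word with U+00AD inserted before each index in break_positions."""
    pieces = []
    previous = 0
    for position in sorted(break_positions):
        pieces.append(word[previous:position])
        previous = position
    pieces.append(word[previous:])
    return SOFT_HYPHEN.join(pieces)


def strip_soft_hyphens(text):
    """Recover the original text, for example before searching it."""
    return text.replace(SOFT_HYPHEN, "")


hyphenated = insert_soft_hyphens("hyphenation", [2, 6])
print(repr(hyphenated))
print(strip_soft_hyphens(hyphenated))
```

<p style="text-align: justify;">Whether the renderer then draws a visible hyphen at the break is exactly the open question discussed above. The text itself only carries the invisible break opportunities.</p>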
<p style="text-align: justify;">Try this extension and let me know your comments. For small-scale DTP work such as pamphlets, notices and brochures, Inkscape is enough. But since Inkscape is not primarily DTP software and does not have paging support, it may not work well for books and large-scale DTP work.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/10/03/inkscape-hyphenation-extension/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>New Hyphenation Pattern Extensions for Openoffice</title>
		<link>http://thottingal.in/blog/2009/08/15/ooo_hyphenation_extensions/</link>
		<comments>http://thottingal.in/blog/2009/08/15/ooo_hyphenation_extensions/#comments</comments>
		<pubDate>Sat, 15 Aug 2009 09:35:45 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[hyphenation]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=221</guid>
		<description><![CDATA[Openoffice Indic Natural Language group announces the availability of the following Openoffice hyphenation dictionary extensions.

Malayalam Hyphenation Rules version 1.2
 Kannada Hyphenation Rules version 1.1
 Bengali Hyphenation Rules version 1.1
 Hindi Hyphenation Rules version 1.1
 Telugu Hyphenation Rules version 1.0
 Tamil Hyphenation Rules version 1.0
 Gujarati Hyphenation Rules version 1.0
 Panjabi Hyphenation Rules version 1.0
 Oriya [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://wiki.services.openoffice.org/wiki/NLC/IndicGroup">Openoffice Indic Natural Language group</a> announces the availability of the following Openoffice hyphenation dictionary extensions.</p>
<ol>
<li><a title="http://extensions.services.openoffice.org/project/hyph_ml_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_ml_IN">Malayalam Hyphenation Rules version 1.2</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_kn_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_kn_IN">Kannada Hyphenation Rules version 1.1</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_bn_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_bn_IN">Bengali Hyphenation Rules version 1.1</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_hi_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_hi_IN">Hindi Hyphenation Rules version 1.1</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_te_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_te_IN">Telugu Hyphenation Rules version 1.0</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_ta_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_ta_IN">Tamil Hyphenation Rules version 1.0</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_gu_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_gu_IN">Gujarati Hyphenation Rules version 1.0</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_pa_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_pa_IN">Panjabi Hyphenation Rules version 1.0</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_or_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_or_IN">Oriya Hyphenation Rules version 1.0</a></li>
<li> <a title="http://extensions.services.openoffice.org/project/hyph_mr_IN" rel="nofollow" href="http://extensions.services.openoffice.org/project/hyph_mr_IN">Marathi Hyphenation Rules version 1.0</a></li>
</ol>
<p><a href="http://extensions.services.openoffice.org/project/dict_ml_IN">Spellchecker extension for Malayalam</a> is also ready.</p>
<p>A complete list of writing aids for Openoffice in Indic languages is available <a href="http://wiki.services.openoffice.org/wiki/NLC/IndicGroup">here</a>.</p>
<p>Hyphenation rules for languages other than Marathi are already packaged in Fedora 11. This release contains updates and bug fixes. Fedora 12 will contain these updates. These extensions are yet to be packaged for Debian/Ubuntu.</p>
<p>More details about hyphenation rules are <a href="http://thottingal.in/blog/tag/hyphenation/">available here </a></p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/08/15/ooo_hyphenation_extensions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Silpa Updates</title>
		<link>http://thottingal.in/blog/2009/08/11/project-silpa-updates/</link>
		<comments>http://thottingal.in/blog/2009/08/11/project-silpa-updates/#comments</comments>
		<pubDate>Tue, 11 Aug 2009 10:47:04 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=212</guid>
		<description><![CDATA[[Please read the Silpa project announcement before reading this blogpost]
Project silpa is getting ready for a 0.1 version.

The web framework got many changes to support JSON based RPC calls from external applications. That means,  web/desktop applications can use the APIs of Silpa through RPC calls.
Page rendering logic is moved from server to client. Web interface [...]]]></description>
			<content:encoded><![CDATA[<p>[Please read the <a href="http://thottingal.in/blog/2009/06/16/announcing-project-silpa/">Silpa project announcement</a> before reading this blogpost]</p>
<p><a href="http://smc.org.in/silpa">Project silpa</a> is getting ready for a 0.1 version.</p>
<ol>
<li>The web framework got many changes to support <a href="http://json-rpc.org/wiki/python-json-rpc">JSON based RPC </a>calls from external applications. That means web/desktop applications can use the APIs of Silpa through RPC calls.</li>
<li>Page rendering logic has moved from server to client. The web interface uses javascript based synchronous <a href="http://code.google.com/p/json-xml-rpc">JSON based RPC </a>calls to get the results from the server. jQuery is used to render the page.</li>
<li>Uses the <a href="http://entrian.com/PyMeld/">PyMeld </a>templating engine for modules that have a web interface (not all modules will have one).</li>
<li>The framework is now a Python <a href="http://wsgi.org">WSGI </a>application. Initially it was plain CGI. WSGI reduces the response time and allows the server to run as a daemon.</li>
<li>Many new modules are being added. <a href="http://smc.org.in/silpa/Spellcheck">Spellchecker </a>: it is not based on aspell or hunspell, and I am going to try out some algorithms to get optimal suggestions. Not completed yet.</li>
<li>Soundex Algorithm- webbased demo and APIs as I explained in my  <a href="http://thottingal.in/blog/2009/07/26/indicsoundex/">previous blog post</a></li>
<li><a href="http://smc.org.in/silpa/ApproxSearch">An Inexact search algorithm</a> and its implementation based on visual and phonetic distance between two words is getting ready. I will explain it in another blogpost</li>
<li>Hyphenation &#8211; <a href="http://smc.org.in/silpa/Hyphenate">Online tool </a>as well as APIs</li>
<li><a href="http://smc.org.in/silpa/NGram">N-gram for Indic languages</a>- API, web interface</li>
<li><a href="http://smc.org.in/silpa/apis.html">API documentation </a>is in progress, but not complete. I also plan to make silpa available as a python library for offline use.</li>
<li>Moved from <a href="http://smc.org.in">SMC</a>&#8217;s git repo to a <a href="http://smc.org.in/silpa/source.html">separate git repo</a>. After the 0.1 baseline, I will create branches for stable and development.</li>
<li>The application is running on a git controlled deployment workflow. Thanks to <a href="http://joemaller.com">Joe Maller </a> for the nice <a href="http://joemaller.com/2008/11/25/a-web-focused-git-workflow/">documentation on this</a>.</li>
</ol>
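<p>To give an idea of what a JSON based RPC call from an external application looks like, here is a rough Python sketch that only builds the request body and reads a response body, without any network code. The method name <code>modules.Soundex.soundex</code> is made up for illustration and is not taken from the Silpa API documentation. The <code>method</code>, <code>params</code> and <code>id</code> fields follow the JSON-RPC convention.</p>

```python
import json
from itertools import count

_request_ids = count(1)


def build_request(method, *params):
    """Serialize a JSON-RPC request body with method, params and id fields."""
    return json.dumps(
        {"method": method, "params": list(params), "id": next(_request_ids)},
        ensure_ascii=False,
    )


def read_response(body):
    """Return the result of a JSON-RPC response, or raise if it has an error."""
    response = json.loads(body)
    if response.get("error") is not None:
        raise RuntimeError(response["error"])
    return response["result"]


# Hypothetical method name, used only to show the shape of a call.
request_body = build_request("modules.Soundex.soundex", "സന്തോഷ്")
print(request_body)
```

<p>A client would send <code>request_body</code> to the server over HTTP and pass the reply to <code>read_response</code>.</p>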
<p>That&#8217;s all for now! There are many things still to be done. Some of the modules do not support all languages as of now. If anybody is interested in contributing to the project, please contact me. Try out the application, read the code and let me know your comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/08/11/project-silpa-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Phonetic Comparison Algorithm for Indian Languages</title>
		<link>http://thottingal.in/blog/2009/07/26/indicsoundex/</link>
		<comments>http://thottingal.in/blog/2009/07/26/indicsoundex/#comments</comments>
		<pubDate>Sun, 26 Jul 2009 11:14:08 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[soundex]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=193</guid>
		<description><![CDATA[Soundex is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;"><a href="http://en.wikipedia.org/wiki/Soundex">Soundex</a> is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described in <a title="Donald Knuth" href="http://en.wikipedia.org/wiki/Donald_Knuth">Donald Knuth&#8217;s</a> <em><a title="The Art of Computer Programming" href="http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer Programming</a></em>. The <a title="National Archives and Records Administration" href="http://en.wikipedia.org/wiki/National_Archives_and_Records_Administration">National Archives and Records Administration</a> (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government.</p>
<p>The soundex code for a word is an English letter followed by a number of digits. The algorithm is explained with examples <a href="http://www.archives.gov/genealogy/census/soundex.html">here</a>.</p>
<p style="text-align: justify;">By this algorithm, whether my name is written as Santhosh, Santosh, Santhos or Santos, the soundex code remains the same: <strong>S532</strong>.</p>
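<p style="text-align: justify;">For readers who want to check this, here is a short Python sketch of American Soundex as it is commonly described (letters grouped into six digit classes, H and W ignored between letters of the same class, vowels separating repeats). It is an illustration written for this post, not the official NARA implementation.</p>

```python
SOUNDEX_DIGITS = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_DIGITS[letter] = digit


def american_soundex(name, length=4):
    """Return the American Soundex code of a name written in ASCII letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_DIGITS.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = SOUNDEX_DIGITS.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets this, so a repeat after it counts
    return code[:length].ljust(length, "0")


for name in ("Santhosh", "Santosh", "Santhos", "Santos"):
    print(name, american_soundex(name))
```

<p style="text-align: justify;">All four spellings of the name give S532 with this sketch.</p>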
<p style="text-align: justify;">Soundex has many limitations and sometimes produces <a href="http://en.wikipedia.org/wiki/Type_I_and_type_II_errors">false positive </a>or false negative errors. There are variants of soundex, and one important variant is the <a href="http://en.wikipedia.org/wiki/Metaphone">Metaphone algorithm </a>by <a href="http://en.wikipedia.org/wiki/Lawrence_Philips">Lawrence Philips</a>. Metaphone has an improved version called <a href="http://en.wikipedia.org/wiki/Double_Metaphone">Double Metaphone</a>.</p>
<p style="text-align: justify;">Well, it works fine, but only for English. Our languages also have varying spellings, just like English. But more than spelling, in India we have another issue to address: words (often nouns) get transliterated among Indian languages. Let me give an example. In a railway reservation chart, your name is written in English as well as in Hindi. You are from Kerala (or some other state), and your name is transliterated into some other language. The only thing that remains the same is its pronunciation. It would be great if we could search this data based on pronunciation, right?</p>
<p style="text-align: justify;">We see a lot of discussion nowadays around e-governance and other computerization projects such as the national UID. Projects that handle Indic text heavily will require efficient search and string processing algorithms.</p>
<p style="text-align: justify;">Suppose you have a list of names in Bengali, and you don&#8217;t know Bengali but you know Malayalam. Obviously you can&#8217;t search Bengali text using Malayalam. But can&#8217;t we develop an algorithm that can?</p>
<p style="text-align: justify;">So our requirements are these:</p>
<ol>
<li>
<div style="text-align: justify;">Language Independent Search on Indic text.</div>
</li>
<li>
<div style="text-align: justify;">Comparison should be based on pronunciation</div>
</li>
<li>
<div style="text-align: justify;">Should be tolerant to spelling variation</div>
</li>
</ol>
<p style="text-align: justify;">The soundex algorithm discussed above fits these requirements well. The only issue is that the algorithm is defined only for English. So it is time to design soundex for our languages!</p>
<p style="text-align: justify;">The original Soundex algorithm is not multilingual, and it is not designed for language independent indexing. The digits used to represent phonetic categories for Indian languages will not fit into 6, because we have more phonetic families. So our algorithm will not be an exact &#8220;localization&#8221; of English soundex, but we will use the concept. Let us call it IndicSoundex <img src='http://thottingal.in/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  . In fact, the Metaphone algorithm uses 16 families for English.</p>
<p style="text-align: justify;">One of the characteristics of Indian languages is that they all share the same phonetic features. We have the vowels a, aa, i, ii, u&#8230; and then the consonant families ka, cha, ta, tha, pa etc. What we need to do is map all these sets to a common representation which is independent of language. We call that representation the soundex of an Indic word.</p>
<p style="text-align: justify;">While grouping and mapping Indic letters to phonetic codes, the following facts are taken into consideration.</p>
<ol>
<li>
<div style="text-align: justify;">Group short and long vowels under a single code. e and ee are considered equal.</div>
</li>
<li>
<div style="text-align: justify;">Consider half consonants as full consonants. For this ignore halants.</div>
</li>
<li>
<div style="text-align: justify;">Group consonant families: ka, kha, ga, gha, nga become a single family. The same applies to the cha, ta, tha and pa families.</div>
</li>
<li>
<div style="text-align: justify;">Group ra, Ra</div>
</li>
<li>
<div style="text-align: justify;">Group la, La, zha</div>
</li>
<li>
<div style="text-align: justify;">Group sa, Sa, sha</div>
</li>
</ol>
<p style="text-align: justify;">When I grouped like this I got 20 groups.  I have prepared a table for all Indic letters and corresponding code here <a href="http://thottingal.in/soundex/soundex.html">http://thottingal.in/soundex/soundex.html</a></p>
<p style="text-align: justify;"><strong>Algorithm:</strong></p>
<ol>
<li>For each letter in the word except the first letter, get the corresponding soundex digit from the character map, which is nothing but a table <a href="http://thottingal.in/soundex/soundex.html">like this.</a></li>
<li>If the letter is not found in the character map, the soundex digit for that letter is 0.</li>
<li>Duplicate consecutive soundex codes are skipped, i.e. क्क is effectively treated as क.</li>
<li>Replace the first digit with the first alpha character.</li>
<li>Remove all 0s from the soundex code.</li>
<li>Return the soundex code padded to the required length (i.e. if the required length of the code is 5 and the soundex is സBCD, the soundex returned will be സBCD0).</li>
</ol>
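<p style="text-align: justify;">The steps above can be sketched in Python as follows. This is not the downloadable implementation. The character map is a tiny subset inferred from the worked examples in this post, keyed by a letter&#8217;s offset inside its Unicode block (the Indic blocks share a common layout), and the names <code>OFFSET_CODES</code>, <code>letter_code</code> and <code>indic_soundex</code> are made up for this sketch. One more assumption: the 0 codes are dropped before duplicates are skipped, so that a halant between two identical consonants does not hide the repetition.</p>

```python
# Offset inside a 128-code-point Indic block mapped to a phonetic code.
# Illustrative subset only; the full table linked above is authoritative.
OFFSET_CODES = {
    0x15: "F",  # ka
    0x17: "F",  # ga
    0x24: "K",  # ta
    0x28: "L",  # na
    0x30: "P",  # ra
    0x37: "S",  # ssa
    0x3E: "A",  # vowel sign aa
    0x3F: "B",  # vowel sign i
    0x4B: "E",  # vowel sign oo
    0x4C: "E",  # vowel sign au
}
INDIC_RANGE = range(0x0900, 0x0D80)  # Devanagari through Malayalam


def letter_code(letter):
    """Return the phonetic code of one letter, or 0 if it is not in the map."""
    point = ord(letter)
    if point in INDIC_RANGE:
        return OFFSET_CODES.get(point % 0x80, "0")
    return "0"


def indic_soundex(word, length=8):
    """Return the first letter plus phonetic codes, padded with zeros."""
    if not word:
        return ""
    codes = []
    previous = letter_code(word[0])
    for letter in word[1:]:
        code = letter_code(letter)
        if code == "0" or code == previous:
            continue  # drop unmapped letters and consecutive duplicates
        codes.append(code)
        previous = code
    result = word[0] + "".join(codes)
    return result[:length].ljust(length, "0")


for word in ("കാര്‍തിക്", "കാര്‍ത്തിക്", "கார்திக்", "सन्तौष"):
    print(word, indic_soundex(word))
```

<p style="text-align: justify;">Because the codes depend only on the offset inside the block, the same small table serves Malayalam, Tamil, Devanagari and the other scripts in that range.</p>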
<p style="text-align: justify;">Of course, we need an implementation. Get the python code for this from <a href="http://thottingal.in/soundex/indicsoundex.tar.gz">here</a>. If you don&#8217;t care about the code, try the online soundex converter here: <a href="http://smc.org.in/silpa/Soundex" target="_blank">http://smc.org.in/silpa/Soundex</a></p>
<p style="text-align: justify;">From the above algorithm, the soundex code for സന്തോഷ് is സLKES000, and for सन्तौष it is सLKES000. So if I need to compare सन्तौष and സന്തോഷ് and get a positive result, we should do something in the comparison logic, thereby making the comparison language independent.</p>
<p>An example :<br />
കാര്‍തിക്, കാര്‍ത്തിക്,  കാര്‍തിഗ് = കAPKBF00<br />
கார்திக்= கAPKBF00</p>
<p style="text-align: justify;"><strong>Comparison</strong></p>
<ol>
<li>Compare the two strings without calculating the soundex codes. If they are the same, return 0.</li>
<li>Calculate the soundex codes for both strings. If they match, return 1, i.e. both strings are from the same language and sound alike.</li>
<li>If the soundex codes have different first letters and the rest is the same, check whether the first letters have the same soundex digit. If so, the words are from different languages but sound alike. The return code for this is 2.</li>
<li>If none of the above conditions match, return -1, indicating both strings are completely different.</li>
</ol>
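<p style="text-align: justify;">Here is a self-contained Python sketch of these comparison rules. As before, the character map is a small subset inferred from the examples in this post, and the function names are hypothetical. The entry for sa (offset 0x38) is an assumption based on the rule that sa, Sa and sha are grouped together.</p>

```python
# Illustrative subset of the character map, keyed by offset in the block.
OFFSET_CODES = {0x15: "F", 0x17: "F", 0x24: "K", 0x28: "L", 0x30: "P",
                0x37: "S", 0x38: "S", 0x3E: "A", 0x3F: "B", 0x4B: "E",
                0x4C: "E"}
INDIC_RANGE = range(0x0900, 0x0D80)  # Devanagari through Malayalam


def letter_code(letter):
    """Return the phonetic code of one letter, or 0 if it is not in the map."""
    point = ord(letter)
    if point in INDIC_RANGE:
        return OFFSET_CODES.get(point % 0x80, "0")
    return "0"


def indic_soundex(word, length=8):
    """Return the first letter plus phonetic codes, padded with zeros."""
    codes = []
    previous = letter_code(word[0])
    for letter in word[1:]:
        code = letter_code(letter)
        if code == "0" or code == previous:
            continue
        codes.append(code)
        previous = code
    return (word[0] + "".join(codes))[:length].ljust(length, "0")


def compare(first, second):
    """Return 0, 1, 2 or -1 following the four comparison rules."""
    if first == second:
        return 0  # identical strings
    if not first or not second:
        return -1
    code_a, code_b = indic_soundex(first), indic_soundex(second)
    if code_a == code_b:
        return 1  # same language, sounds alike
    first_a, first_b = letter_code(first[0]), letter_code(second[0])
    if code_a[1:] == code_b[1:] and first_a == first_b and first_a != "0":
        return 2  # different languages, sounds alike
    return -1


print(compare("കാര്‍തിക്", "കാര്‍തിഗ്"))
print(compare("കാര്‍ത്തിക്", "કાર્તિક્"))
print(compare("सन्तौष", "கார்திக்"))
```

<p style="text-align: justify;">The extra check that the first letter&#8217;s code is not 0 is my addition, so that two words starting with unmapped letters are not reported as sounding alike.</p>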
<p style="text-align: justify;">Use <a href="http://smc.org.in/silpa/Soundex" target="_blank">http://smc.org.in/silpa/Soundex</a> to experiment with this algorithm. The following screenshot shows how കാര്‍ത്തിക് and કાર્તિક્ are found to be phonetically similar, or &#8220;sound alike&#8221;.</p>
<p style="text-align: justify;"><img title="soundexcomparison" src="http://thottingal.in/blog/wp-content/uploads/2009/07/soundexcomparison.png" alt="soundexcomparison" width="337" height="235" /></p>
<p style="text-align: justify;">OK, we have the soundex code. The soundex code by itself is not useful. We need to implement a search utility based on it: as I explained above, an intra-Indic search program. We will see that later.</p>
<p style="text-align: justify;">Feel free to comment on the algorithm, and please suggest any improvements we can make.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/07/26/indicsoundex/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>On Machine Translation and God</title>
		<link>http://thottingal.in/blog/2009/06/27/on-machine-translation-and-god/</link>
		<comments>http://thottingal.in/blog/2009/06/27/on-machine-translation-and-god/#comments</comments>
		<pubDate>Sat, 27 Jun 2009 04:58:31 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Misc]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=174</guid>
		<description><![CDATA[I was reading an article named &#8220;Why Can&#8217;t a Computer Translate More Like a Person?&#8221; by Alan K. Melby. The article is about the challenges that machine translation technology faces in reaching an acceptable quality of translation. He explains the importance of the cultural sensitivity required of machine translation programs. The article lists a number [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">I was reading an article named <a href="http://www.ttt.org/theory/barker.html">&#8220;Why Can&#8217;t a Computer Translate More Like a Person?&#8221;</a> by Alan K. Melby. The article is about the challenges that machine translation technology faces in reaching an acceptable quality of translation. He explains the importance of the cultural sensitivity required of machine translation programs. The article lists a number of examples where MT can go wrong if context, culture etc. are not taken into consideration. There are very interesting arguments about how <a href="http://en.wikipedia.org/wiki/Reductionism">reductionism</a> becomes a wrong choice while designing MT. If you are interested in natural language processing or machine translation and wonder whether there is any limit to how far computer programs can reach human language capabilities, please read it.</p>
<p style="text-align: justify;">The article was written a long time ago, and machine translation technologies have improved a lot since. There are commercial as well as free translation products for many languages. There is research going on in intra-Indic as well as English-Indic translation. I am not sure how far these technologies have solved the challenges mentioned in the article, but I believe that the questions are still valid.</p>
<p>The question is whether programs can understand our culture, language usage, emotions etc. For translating limited-domain or dry content, machine translation may be effective, but for general purpose use, I don&#8217;t know <a href="http://en.wikipedia.org/wiki/Machine_translation#Major_issues">how effective it is.</a></p>
<p>Melby argues :</p>
<blockquote>
<p style="text-align: justify;">That key factor which is missing from current theories is agency. By agency, I mean the capacity to make real choices by exercising our will, ethical choices for which we are responsible. [...]. Any &#8216;choice&#8217; that is a rigid and unavoidable consequence of the circumstances is not a real choice that could have gone either way and is thus not an example of agency. A computer has no real choice in what it will do next. Its next action is an unavoidable consequence of the machine language it is executing and the values of data presented to it. I am proposing that any approach to meaning that discounts agency will amount to no more than the mechanical manipulation of symbols such as words, that is, moving words around and linking them together in various ways instead of understanding them. Computers can already manipulate symbols. In fact, that is what they mostly do. But manipulating symbols does not give them agency and it will not let them handle language like humans. Symbol manipulation works only within a specific domain, and any attempt to move beyond a domain through symbol manipulation is doomed, for manipulation of symbols involves no true surprises, only the strict application of rules. General vocabulary, as we have seen, involves true surprises that could not have been predicted.</p>
</blockquote>
<p style="text-align: justify;">With all these advanced technologies, can we develop a universal, any-to-any language translation program? We have seen many examples where human beings are <a href="http://www.ojohaven.com/fun/translation.funnies.html">failing</a> <a href="http://www.flickr.com/groups/chinglish/pool/">miserably</a> at sensible translation. If you want to see how effective Hindi-&gt;English translation is, <a href="http://translate.google.com/translate_t?text=%E0%A4%86%E0%A4%AA%20%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80%20%E0%A4%B8%E0%A4%AE%E0%A4%9D%E0%A4%A4%E0%A5%87%20%E0%A4%B9%E0%A5%88%20?#hi|en|%E0%A4%86%E0%A4%AA%20%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6%E0%A5%80%20%E0%A4%B8%E0%A4%AE%E0%A4%9D%E0%A4%A4%E0%A5%87%20%E0%A4%B9%E0%A5%88%20%3F">try this using Google Translate</a>:</p>
<blockquote><p>आप <span style="text-decoration: underline;">हिन्दी</span> समझते है ? ==&gt; You understand <span style="text-decoration: underline;">English</span>?</p></blockquote>
<p style="text-align: justify;">So do you think that such a <a href="http://en.wikipedia.org/wiki/Universal_translator">universal translation tool</a> is nearly impossible, and that &#8220;only God can create such a tool&#8221;? Have you heard about the <a href="http://en.wikipedia.org/wiki/Races_and_species_in_The_Hitchhiker%27s_Guide_to_the_Galaxy#Babel_fish">Babel fish</a> (of The Hitchhiker&#8217;s Guide to the Galaxy)? The Babel fish is small, yellow, leech-like, and is a <a title="Universal translator" href="http://en.wikipedia.org/wiki/Universal_translator">universal translator</a> which simultaneously translates from one spoken language to another. When inserted into the ear, its nutrition processes convert sound waves into brain waves, neatly crossing the language divide between any species you should happen to meet whilst travelling in space. According to the <a href="http://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy"><em>Hitchhiker&#8217;s Guide</em></a>, the Babel fish was put forth as an example for the <em>non</em>-existence of God:</p>
<blockquote><p><em>&#8220;I refuse to prove that I exist,&#8221; says God, &#8220;for proof denies faith, and without faith I am nothing.&#8221;</em></p>
<p><em>&#8220;But,&#8221; says Man, &#8220;the Babel fish is a dead giveaway isn&#8217;t it? It could not have evolved by chance. It proves that you exist, and so therefore, by your own arguments, you don&#8217;t. Q.E.D.&#8221;</em></p>
<p><em>&#8220;Oh dear,&#8221; says God, &#8220;I hadn&#8217;t thought of that,&#8221; and promptly vanishes in a puff of logic.</em></p></blockquote>
<p>Alan K Melby argues that Douglas Adams was saying that there can&#8217;t be any such fish.</p>
<blockquote>
<p style="text-align: justify;">The silliness of the above argument is intended, I believe, to show the futility of trying to prove the existence of God, through physics or any other route. Belief in God is a starting point, not a conclusion. If it were a conclusion, then that conclusion would have to be based on something else that is firmer than our belief in God. If that something else forces everyone to believe in God, then faith is denied. If that something else does not force us to believe in God, then it may not be a sufficiently solid foundation for our belief.</p>
<p style="text-align: justify;">Adams may also be saying something about translation and the nature of language. I can speculate on what Adams had in mind to say about translation when he dreamed up the Babel fish. My own bias would have him saying indirectly that there could be no such fish since there is no universal set of thought patterns underlying all languages. Even with direct brain to brain communication, we would still need shared concepts in order to communicate. Words do not really fail us. If two people share a concept, they can eventually agree on a word to express it. Ineffable experiences are those that are not shared by others.</p>
</blockquote>
<p style="text-align: justify;">I have some friends studying machine translation for Indian languages. They are evaluating the shallow-transfer method (statistical methods applied to the words surrounding an ambiguous word) using tools like <a href="http://www.apertium.org/">Apertium</a>. Let us hope that they succeed in their efforts.</p>
<p>Let me give one example translation between Tamil and Malayalam where context matters.</p>
<p style="text-align: justify;">In Malayalam, for &#8216;wait, wait&#8217;, we usually say &#8220;നില്ക്കു് നില്ക്ക്&#8221; (literal meaning: &#8216;stand, stand&#8217;). For the same purpose, I have noticed that my Tamil-speaking friends use &#8220;இரு இரு&#8221; (literal meaning: &#8216;sit, sit&#8217;). If the translation is done without knowing this usage, the result is going to be funny. Such translation tools can also be chained through intermediate languages. For example, if translation tools are available for a-&gt;b and b-&gt;c, then a-&gt;c is possible through a-&gt;b-&gt;c. I feel that preserving word meaning, context and common usage along such a chain is going to be a big challenge. Let us <strong>wait/sit/stand</strong> and see <img src='http://thottingal.in/blog/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
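<p style="text-align: justify;">To make the a-&gt;b-&gt;c idea concrete, here is a minimal Java sketch of chaining two word tables. Everything in it is an assumption for illustration: the class name <code>PivotTranslationSketch</code>, the method <code>viaPivot</code>, and the two one-entry tables, which hold only the phrases quoted above. It is a toy lookup and not how Apertium or any real translation system works.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the names and the one-entry tables are hypothetical.
class PivotTranslationSketch {

    static final String ML_WAIT = "നില്ക്ക്"; // Malayalam, used for "wait" (literally "stand")
    static final String TA_WAIT = "இரு";       // Tamil, used for "wait" (literally "sit")

    // Malayalam -> Tamil, paired by common usage: both phrases are used for "wait".
    static final Map<String, String> ML_TO_TA = new HashMap<>();
    // Tamil -> English, by literal meaning only.
    static final Map<String, String> TA_TO_EN = new HashMap<>();

    static {
        ML_TO_TA.put(ML_WAIT, TA_WAIT);
        TA_TO_EN.put(TA_WAIT, "sit");
    }

    // Translate a -> c by chaining a -> b and b -> c.
    // Returns null when either table has no entry for the word.
    static String viaPivot(String word, Map<String, String> aToB, Map<String, String> bToC) {
        String pivot = aToB.get(word);
        return pivot == null ? null : bToC.get(pivot);
    }

    public static void main(String[] args) {
        System.out.println(viaPivot(ML_WAIT, ML_TO_TA, TA_TO_EN));
    }
}
```

<p style="text-align: justify;">Chaining Malayalam-&gt;Tamil by usage and then Tamil-&gt;English by literal meaning turns &#8216;wait&#8217; into &#8216;sit&#8217;, which is the kind of drift described above. Each step is reasonable on its own; the meaning is lost because neither table knows the context.</p>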
<p>Since we saw &#8220;a nonexistence of God proof&#8221;, let me give another one that I read some time back.</p>
<ol>
<li>God is so powerful that he can do anything.</li>
<li>If #1 is true, God can create anything.</li>
<li>If #2 is true, he can create a stone so big that he cannot lift it!</li>
<li>If he cannot lift a stone, then #1 is wrong, and hence #2 is also wrong. So God does not exist!</li>
</ol>
<p>Looks very silly, right? Or &#8220;logical&#8221;? <img src='http://thottingal.in/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/06/27/on-machine-translation-and-god/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>PDFBox : Extract Text from PDF</title>
		<link>http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/</link>
		<comments>http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/#comments</comments>
		<pubDate>Wed, 24 Jun 2009 04:40:34 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Misc]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[pdf]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=168</guid>
		<description><![CDATA[Recently I had to extract text from PDF files for indexing the content using Apache Lucene. Apache PDFBox was the obvious choice of Java library.
Apache PDFBox is an open source Java library for working with PDF files. The PDFBox library allows creation of new PDF documents, manipulation of existing documents and the [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Recently I had to extract text from PDF files for indexing the content using Apache Lucene. Apache PDFBox was the obvious choice of Java library.</p>
<p style="text-align: justify;"><a href="http://incubator.apache.org/pdfbox/">Apache PDFBox</a> is an open source Java library for working with PDF files. The PDFBox library allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.</p>
<p style="text-align: justify;">There is <a href="http://incubator.apache.org/pdfbox/download.html">no recent build available</a> for PDFBox. SourceForge has only very <a href="http://sourceforge.net/project/showfiles.php?group_id=78314">old binaries</a>, and the old version <a href="https://issues.apache.org/jira/browse/PDFBOX-361">fails to work with the PDF 1.5 specification</a>. So one needs to compile the latest code from SVN.</p>
<p style="text-align: justify;">I am sharing the latest jar file built from SVN <a href="http://thottingal.in/download/pdfbox/">here</a>.</p>
<p style="text-align: justify;">The following example shows how to extract the text from a PDF file using PDFBox.</p>

<div class="wp_codebox"><table><tr id="p1682"><td class="code" id="p168code2"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.File</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.FileInputStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.pdfbox.cos.COSDocument</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.pdfbox.pdfparser.PDFParser</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.pdfbox.pdmodel.PDDocument</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.pdfbox.util.PDFTextStripper</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> PDFTextParser <span style="color: #009900;">&#123;</span>
&nbsp;
	<span style="color: #666666; font-style: italic;">// Extract text from PDF Document</span>
	<span style="color: #000000; font-weight: bold;">static</span> <a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Astring+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">String</span></a> pdftoText<span style="color: #009900;">&#40;</span><a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Astring+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">String</span></a> fileName<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		PDFParser parser<span style="color: #339933;">;</span>
		<a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Astring+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">String</span></a> parsedText <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
		PDFTextStripper pdfStripper <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
		PDDocument pdDoc <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
		COSDocument cosDoc <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
		<a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Afile+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">File</span></a> file <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Afile+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">File</span></a><span style="color: #009900;">&#40;</span>fileName<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>file.<span style="color: #006633;">isFile</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Asystem+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">System</span></a>.<span style="color: #006633;">err</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;File &quot;</span> <span style="color: #339933;">+</span> fileName <span style="color: #339933;">+</span> <span style="color: #0000ff;">&quot; does not exist.&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			parser <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> PDFParser<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Afileinputstream+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">FileInputStream</span></a><span style="color: #009900;">&#40;</span>file<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Aioexception+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">IOException</span></a> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Asystem+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">System</span></a>.<span style="color: #006633;">err</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Unable to open PDF Parser. &quot;</span> <span style="color: #339933;">+</span> e.<span style="color: #006633;">getMessage</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			parser.<span style="color: #006633;">parse</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			cosDoc <span style="color: #339933;">=</span> parser.<span style="color: #006633;">getDocument</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			pdfStripper <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> PDFTextStripper<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			pdDoc <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> PDDocument<span style="color: #009900;">&#40;</span>cosDoc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			pdfStripper.<span style="color: #006633;">setStartPage</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
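			pdfStripper.<span style="color: #006633;">setEndPage</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// only pages 1 to 5 are extracted; raise this to cover more pages</span>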
			parsedText <span style="color: #339933;">=</span> pdfStripper.<span style="color: #006633;">getText</span><span style="color: #009900;">&#40;</span>pdDoc<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Aexception+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">Exception</span></a> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Asystem+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">System</span></a>.<span style="color: #006633;">err</span>
					.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;An exception occurred in parsing the PDF Document. &quot;</span>
							<span style="color: #339933;">+</span> e.<span style="color: #006633;">getMessage</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">finally</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
				<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>cosDoc <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span>
					cosDoc.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>pdDoc <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span>
					pdDoc.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Aexception+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">Exception</span></a> e<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
				e.<span style="color: #006633;">printStackTrace</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #000000; font-weight: bold;">return</span> parsedText<span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Astring+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">String</span></a> args<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
		<a href="http://www.google.com/search?hl=en&amp;q=allinurl%3Asystem+java.sun.com&amp;btnI=I%27m%20Feeling%20Lucky"><span style="color: #003399;">System</span></a>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>pdftoText<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/home/santhosh/pdfbox/test.pdf&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #009900;">&#125;</span>
&nbsp;
 </pre></td></tr></table></div>

<p>More details on the APIs can be found <a href="http://incubator.apache.org/pdfbox/userguide/index.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Announcing Project Silpa</title>
		<link>http://thottingal.in/blog/2009/06/16/announcing-project-silpa/</link>
		<comments>http://thottingal.in/blog/2009/06/16/announcing-project-silpa/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 15:13:53 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Indic]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[silpa]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=150</guid>
		<description><![CDATA[Many of my friends already know about a project I am working on; this is the public announcement of it.
The project is named Silpa, which may be read as an acronym of Swathanthra (Mukth, Free as in Freedom) Indian Language Processing Applications. It is a web framework and a set of applications for processing Indian [...]]]></description>
			<content:encoded><![CDATA[<p>Many of my friends already know about a project I am working on; this is the public announcement of it.</p>
<p>The project is named Silpa, which may be read as an acronym of Swathanthra (Mukth, Free as in Freedom) Indian Language Processing Applications. It is a web framework and a set of applications for processing Indian languages in many ways. In other words, it is a platform for porting existing and upcoming language processing applications to the web.</p>
<p>Before going into the details, you can have a quick preview of the application here: <a href="http://smc.org.in/silpa" target="_blank">http://smc.org.in/silpa</a></p>
<p>The project is designed for adding applications/utilities as plugins. The framework is written from scratch in Python. As you can see in the development version, a number of modules are already written. Most of the modules require some more work to make them _complete_. The application is free software, and there is a link to the source code at the bottom of the application.</p>
<p>As it is meant to cover all languages of India, all modules should be capable of handling all scripts from India (and sometimes English too). At the same time, the language of the input data is transparent, meaning the user need not state which language she is entering the data in. Unlike desktop applications, which ask the user to specify the language along with the input data (for example, a spell checker), the modules should try to detect the language themselves. And where possible, modules try to process the data even if the input is in multiple Indic scripts.</p>
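<p>One simple way to approach the detection step is to look at which Unicode script the letters of the input belong to. The sketch below is written in Java only to match the other code on this blog (Silpa itself is written in Python), and it is not taken from the Silpa source; the class and method names are made up for illustration. Note that a script is not the same thing as a language (Devanagari, for instance, is shared by several languages), so this is at best a first step.</p>

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch only: not Silpa code; all names here are hypothetical.
class ScriptDetectionSketch {

    // Count the letters of each Unicode script found in the text.
    // Digits, punctuation, spaces and combining vowel signs are skipped.
    static Map<Character.UnicodeScript, Integer> scriptCounts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    // Return the script with the most letters, or null if the text has no letters.
    static Character.UnicodeScript dominantScript(String text) {
        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> entry : scriptCounts(text).entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(dominantScript("हिन्दी"));
        System.out.println(scriptCounts("हिन्दी and English"));
    }
}
```

<p>Keeping the per-script counts, rather than only the winner, leaves room for input that mixes several Indic scripts: a module could split the text by script and handle each part separately.</p>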
<p>The modules may be general purpose (e.g. Dictionary, Spellcheck, Sort, Transliteration, Font conversion) or meant to demonstrate a technology or algorithm (e.g. Hyphenation, Stemmer, Search algorithms).</p>
<p>Some of the modules are usable as of now, while some are still in development. You may try them out. User data will not be logged except when a crash occurs (at that time the user data and the exception trace will be logged for later debugging).</p>
<p>And this is also a call for contributors. You may propose ideas for new modules, suggest features, etc. A few students showed interest in the project. Unfortunately, Python is not a language in their college syllabus. So if you are good at Python and are interested in contributing to the project, drop me a mail <img src='http://thottingal.in/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . There is no separate development version; the one at http://smc.org.in/silpa is the only one. All development happens there itself, and any change in the code is immediately available for use (or immediately starts crashing on user data)!</p>
<p>I will write later about some interesting algorithms I used in some of the modules. If you are curious about them, read the code!</p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/06/16/announcing-project-silpa/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Openoffice Indic Regional Language group</title>
		<link>http://thottingal.in/blog/2009/05/26/openoffice-indic-regional-language-group/</link>
		<comments>http://thottingal.in/blog/2009/05/26/openoffice-indic-regional-language-group/#comments</comments>
		<pubDate>Wed, 27 May 2009 03:48:00 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Indic]]></category>
		<category><![CDATA[openoffice]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=67</guid>
		<description><![CDATA[We just formed the Indic Regional Language group for Openoffice, as per the Openoffice Native Language Consortium plans. The objectives of such groups are described here. Basically, the group is meant for better coordination among Indic languages to make the Openoffice experience in our languages better.
The announcement of this group is here.
Thanks to [...]]]></description>
			<content:encoded><![CDATA[<p>We just formed the Indic Regional Language group for Openoffice, as per the <a href="http://wiki.services.openoffice.org/wiki/NLC">Openoffice Native Language Consortium plans</a>. The objectives of such groups are described <a href="http://wiki.services.openoffice.org/wiki/Regional_Groups">here</a>. Basically, the group is meant for better coordination among Indic languages to make the Openoffice experience in our languages better.<br />
The announcement of this group is <a href="http://native-lang.openoffice.org/servlets/ReadMsg?list=dev&amp;msgNo=8769">here</a>.</p>
<p>Thanks to <a href="http://www.standardsandfreedom.net">Charles-H. Schulz</a>, we got a mailing list, <a href="http://native-lang.openoffice.org/servlets/SummarizeList?listName=indic">indic@native-lang.openoffice.org</a>. To subscribe, log in to http://native-lang.openoffice.org </p>
<p>We have just started, and I will soon set up a wiki page there. To start with, I will collect the list of issues pending for Indian languages from people working on various languages, and will find people from those languages to act as points of contact. Feel free to contact me about anything related to Openoffice in your language.</p>
<p><b>Update: June 3, 2009</b>: <a href="http://wiki.services.openoffice.org/wiki/NLC/IndicGroup">This is our wiki page</a></p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/05/26/openoffice-indic-regional-language-group/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>In solidarity</title>
		<link>http://thottingal.in/blog/2009/05/13/in-solidarity/</link>
		<comments>http://thottingal.in/blog/2009/05/13/in-solidarity/#comments</comments>
		<pubDate>Wed, 13 May 2009 22:34:00 +0000</pubDate>
		<dc:creator>Santhosh</dc:creator>
				<category><![CDATA[Misc]]></category>
		<category><![CDATA[politics]]></category>

		<guid isPermaLink="false">http://thottingal.in/blog/?p=66</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p><a href="http://binayaksen.net"><img src="http://binayaksen.net/wp-content/gallery/site-graphics/have-a-heart-1.gif"/></a></p>
]]></content:encoded>
			<wfw:commentRss>http://thottingal.in/blog/2009/05/13/in-solidarity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
