Inkscape hyphenation extension

A year ago I wrote about how to use Inkscape as a workaround for DTP in Indic scripts. We still don’t have any DTP software that supports Indic scripts in Unicode, and Scribus still lacks Indic support. One issue with using Inkscape for DTP in Indic scripts is that some of these scripts need hyphenation when text is justified. For example, Malayalam has long words, and space is often wasted in lines if the text is not automatically hyphenated. [Read More]

New Hyphenation Pattern Extensions for Openoffice

The Openoffice Indic Natural Language group announces the availability of the following Openoffice hyphenation dictionary extensions: Malayalam Hyphenation Rules version 1.2; Kannada Hyphenation Rules version 1.1; Bengali Hyphenation Rules version 1.1; Hindi Hyphenation Rules version 1.1; Telugu Hyphenation Rules version 1.0; Tamil Hyphenation Rules version 1.0; Gujarati Hyphenation Rules version 1.0; Panjabi Hyphenation Rules version 1.0; Oriya Hyphenation Rules version 1.0; Marathi Hyphenation Rules version 1.0. A spellchecker extension for Malayalam is also ready. [Read More]

Project Silpa Updates

[Please read the Silpa project announcement before reading this blog post] Project Silpa is getting ready for a 0.1 version. The web framework has changed considerably to support JSON-based RPC calls from external applications. That means web and desktop applications can use the Silpa APIs through RPC calls. Page rendering logic has moved from the server to the client. The web interface uses JavaScript-based synchronous JSON-based RPC calls to get results from the server. [Read More]
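As a rough illustration of what a JSON-based RPC exchange looks like from the client side, here is a minimal sketch in Python. It only builds a request body and decodes a reply; it does not contact any server. The envelope fields (`method`, `params`, `id`, `result`, `error`) follow the general JSON-RPC convention, and the method name used in the usage note is purely hypothetical; none of this is taken from the actual Silpa API.

```python
import json


def build_rpc_request(method, params, request_id=1):
    """Serialize a JSON-RPC style request body (assumed envelope, not Silpa's)."""
    return json.dumps({"method": method, "params": list(params), "id": request_id})


def parse_rpc_response(body):
    """Decode a JSON-RPC style reply and return its result, or raise on error."""
    reply = json.loads(body)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply.get("result")
```

A client would send the string returned by `build_rpc_request("modules.Example.process", ["some text"])` over HTTP and pass the response body to `parse_rpc_response`; both the method name and the transport are assumptions made for this example.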

Phonetic Comparison Algorithm for Indian Languages

Soundex is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described in Donald Knuth’s The Art of Computer Programming. The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U. [Read More]
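For readers unfamiliar with Soundex, here is a minimal Python sketch of the classic American Soundex coding for English names. It illustrates the general idea only and is not the Indian-language algorithm the post goes on to describe; the function name `american_soundex` is made up for this example.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def american_soundex(word):
    """Return the 4-character American Soundex code of an English word."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the previous code
    return (result + "000")[:4]         # pad with zeros, truncate to 4
```

With this sketch, `american_soundex("Robert")` and `american_soundex("Rupert")` both give `R163`, which is how words with similar pronunciation but different spelling end up under the same index key.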

On Machine Translation and God

I was reading an article titled “Why Can’t a Computer Translate More Like a Person?” by Alan K. Melby. The article is about the challenges machine translation technology faces in reaching an acceptable quality of translation. He explains the importance of the cultural sensitivity that machine translation programs require. The article lists a number of examples where MT can go wrong if context, culture, and similar factors are not taken into consideration. [Read More]

PDFBox : Extract Text from PDF

Recently I had to extract text from PDF files to index their content using Apache Lucene. Apache PDFBox was the obvious choice of Java library. Apache PDFBox is an open source Java library for working with PDF files. It allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. PDFBox also includes several command line utilities. There is no recent build available for PDFBox. [Read More]
java  pdf 

Announcing Project Silpa

Many of my friends already know about a project I am working on; this is its public announcement. The project is named Silpa, which may be read as an acronym for Swathanthra (Mukth, Free as in Freedom) Indian Language Processing Applications. It is a web framework and a set of applications for processing Indian languages in many ways. In other words, it is a platform for porting existing and upcoming language processing applications to the web. [Read More]
silpa 

Openoffice Indic Regional Language group

We have just formed the Indic Regional Language group for Openoffice. This is in line with the Openoffice Native Language Consortium plans. The objectives of such groups can be read here. Basically, the group is meant to improve coordination among Indic languages and make the Openoffice experience in our languages better. The announcement of this group is here. Thanks to Charles-H. Schulz, we have a mailing list, indic@native-lang.openoffice.org. To subscribe, log in to http://native-lang.openoffice.org [Read More]

“ക്ടാവ്” Slang converter is getting ready

Friends, you all know about Kerala's entertaining regional dialects, don't you? Thiruvananthapuram, Kottayam, Thrissur, Shoranur, Palakkad, Kozhikode, Kannur, Wayanad and so on: we have many distinct variants of Malayalam. They differ greatly from print Malayalam. Wouldn't it be fun to have software that takes text in print Malayalam and the name of a place, and converts the text into the style of Malayalam spoken in that region? The project named “ക്ടാവ്” Slang converter is one such attempt. Take a look at the accompanying screenshot; it shows the development version. The conversion is done on the basis of a few rules, using a technique called AMP (Ambiguous Language Processing), a new branch of Natural Language Processing. [Read More]