Attending Wikimania 2010

I will be attending Wikimania 2010 in Gdansk, Poland. This annual international conference of the Wikimedia community runs from July 9 to July 11. There I will be presenting wik2cd, the tool I wrote for Malayalam Wikipedia version 1.0, in a joint workshop with Wikipedia offline developers. I will be joining Manuel Schneider, Shiju Alex, and Martin Walker in the workshop titled: Creating offline version of Wiki content – Solutions and Challenges. [Read More]

Malayalam Wikipedia releases selected articles on CD

As part of Malayalam Wikipedia Meetup 2010, Malayalam Wikipedia today releases 500 selected articles on a CD-ROM. This is the first time in India that a local-language Wikipedia has released its articles for offline use. I handled the technology part of the project. The idea was to get the selected articles onto the CD in static form. But this is not as easy as one might imagine. It is not like saving each page from the browser to the local machine. [Read More]

Predictive text entry with ibus

A few days back I came to know about this project: Text Prediction on GNOME based on GTK+ Input Method context. Basically, it is an input method with a text prediction feature. I had a similar project idea in May 2009 and had done some coding for it. The project was to build an IBus input method that can do letter prediction as well as word prediction. The prediction is based on n-grams. [Read More]
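
To make the n-gram idea concrete, here is a minimal Python sketch. It is not the code from the IBus project; the function names `build_ngram_model` and `predict` are made up for illustration, and the model is a plain frequency count with no smoothing. The same two functions cover both cases: pass a list of words for word prediction, or a list of characters for letter prediction.

```python
from collections import Counter, defaultdict


def build_ngram_model(tokens, n=2):
    """Map each (n-1)-token context to a Counter of the tokens seen after it.

    Hypothetical helper for illustration; tokens may be words or characters.
    """
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        following = tokens[i + n - 1]
        model[context][following] += 1
    return model


def predict(model, context, k=3):
    """Return up to k continuations of `context`, most frequent first."""
    # .get() avoids creating an empty entry for contexts never seen in training.
    counts = model.get(tuple(context), Counter())
    return [token for token, _ in counts.most_common(k)]


if __name__ == "__main__":
    # Word prediction with bigrams: what usually follows "the"?
    words = "the cat sat on the mat and the cat ran".split()
    word_model = build_ngram_model(words, n=2)
    print(predict(word_model, ["the"]))

    # Letter prediction with trigrams: what usually follows "an"?
    letter_model = build_ngram_model(list("banana bandana"), n=3)
    print(predict(letter_model, "an"))
```

With this toy data the first call suggests "cat" before "mat", and the second suggests "a" before "d". A real input method would train on a large corpus and would probably need smoothing or back-off for contexts it has never seen.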

Conferences : FOSS.IN and NCIDEEE

FOSS.IN 2009 starts on 1st December. I wanted to attend all 5 days, but I have another conference from Dec 1st to 3rd in Chennai. I am attending the National Conference on ICTs for the differently-abled/under privileged communities in Education, Employment and Entrepreneurship 2009 – (NCIDEEE 2009) at Loyola College, Chennai. So I will miss the first 3 days of FOSS.IN. We have a workout on Project Silpa during FOSS.IN. I am also planning to have a workout with Debayan and Jinesh to get his tesseract-indic OCR working with Malayalam. [Read More]

Inkscape hyphenation extension

A year ago I wrote about how to use Inkscape as a workaround for DTP in Indic scripts. We still don’t have any DTP software that supports Indic scripts in Unicode, and Scribus still lacks Indic support. One issue with using Inkscape for DTP in Indic scripts is that some of these scripts need hyphenation whenever text is justified. For example, Malayalam has lengthy words, and space is often wasted in lines if the text is not automatically hyphenated. [Read More]

New Hyphenation Pattern Extensions for OpenOffice

The OpenOffice Indic Natural Language group announces the availability of the following OpenOffice hyphenation dictionary extensions: Malayalam Hyphenation Rules version 1.2, Kannada Hyphenation Rules version 1.1, Bengali Hyphenation Rules version 1.1, Hindi Hyphenation Rules version 1.1, Telugu Hyphenation Rules version 1.0, Tamil Hyphenation Rules version 1.0, Gujarati Hyphenation Rules version 1.0, Panjabi Hyphenation Rules version 1.0, Oriya Hyphenation Rules version 1.0, and Marathi Hyphenation Rules version 1.0. A spellchecker extension for Malayalam is also ready. [Read More]

Project Silpa Updates

[Please read the Silpa project announcement before reading this blog post] Project Silpa is getting ready for a 0.1 version. The web framework has changed considerably to support JSON-based RPC calls from external applications. That means web and desktop applications can use Silpa's APIs through RPC calls. Page rendering logic has moved from the server to the client. The web interface uses JavaScript-based synchronous JSON-RPC calls to get results from the server. [Read More]

Phonetic Comparison Algorithm for Indian Languages

Soundex is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described in Donald Knuth’s The Art of Computer Programming. The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U. [Read More]
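
For readers who have not seen Soundex before, here is a minimal Python sketch of the classical American Soundex rules for English. It is not the Indian-language algorithm this post goes on to discuss, and it assumes plain ASCII input; any letter outside the table is simply treated like a vowel.

```python
# Consonant groups that share a digit in American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character American Soundex code of `word` ('' if no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()        # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # its code still suppresses a duplicate
    for ch in letters[1:]:
        if ch in "hw":
            continue                   # h and w do not break a run of equal codes
        code = CODES.get(ch, "")       # vowels and y map to ""
        if code and code != prev:
            result += code
        prev = code                    # a vowel resets prev, so a repeat after it counts
    return (result + "000")[:4]        # pad with zeros, keep four characters


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

"Robert" and "Rupert" both come out as R163, which is exactly the point: names that sound alike share a code even though they are spelled differently.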

On Machine Translation and God

I was reading an article titled “Why Can’t a Computer Translate More Like a Person?” by Alan K. Melby. The article is about the challenges that machine translation technology faces in reaching an acceptable quality of translation. He explains the importance of the cultural sensitivity required of machine translation programs. The article lists a number of examples where MT can go wrong if context, culture, etc. are not taken into consideration. [Read More]

PDFBox : Extract Text from PDF

Recently I had to extract text from PDF files to index the content using Apache Lucene. Apache PDFBox was the obvious choice of Java library. Apache PDFBox is an open source Java library for working with PDF files. The PDFBox library allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. PDFBox also includes several command-line utilities. There is no recent build available for PDFBox. [Read More]