New Hyphenation Pattern Extensions for Openoffice

The Openoffice Indic Natural Language group announces the availability of the following Openoffice hyphenation dictionary extensions: Malayalam Hyphenation Rules version 1.2, Kannada Hyphenation Rules version 1.1, Bengali Hyphenation Rules version 1.1, Hindi Hyphenation Rules version 1.1, Telugu Hyphenation Rules version 1.0, Tamil Hyphenation Rules version 1.0, Gujarati Hyphenation Rules version 1.0, Panjabi Hyphenation Rules version 1.0, Oriya Hyphenation Rules version 1.0, and Marathi Hyphenation Rules version 1.0. A spellchecker extension for Malayalam is also ready. [Read More]

Project Silpa Updates

[Please read the Silpa project announcement before reading this blog post] Project Silpa is getting ready for a 0.1 version. The web framework has changed substantially to support JSON-based RPC calls from external applications. That means web and desktop applications can use the Silpa APIs through RPC calls. The page rendering logic has moved from the server to the client. The web interface uses JavaScript-based synchronous JSON RPC calls to get results from the server. [Read More]
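To illustrate what a JSON-based RPC call from an external application might look like, here is a minimal sketch. It assumes the JSON-RPC 2.0 message layout, which the post does not confirm, and the method name `demo.echo` is purely hypothetical and not part of Silpa's actual API. No network call is made; the sketch only builds a request body and reads a reply.

```python
import json


def build_rpc_request(method, params, request_id=1):
    """Serialize a request body (assumed JSON-RPC 2.0 layout)."""
    payload = {
        "jsonrpc": "2.0",
        "method": method,      # hypothetical method name, not a real Silpa API
        "params": params,
        "id": request_id,
    }
    # ensure_ascii=False keeps Indic text readable in the serialized body.
    return json.dumps(payload, ensure_ascii=False)


def parse_rpc_response(body):
    """Return the result of a reply, or raise if the server reported an error."""
    reply = json.loads(body)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply.get("result")


if __name__ == "__main__":
    print(build_rpc_request("demo.echo", ["സന്തോഷ്"]))
```

A client would send the returned string to the server over HTTP and pass the response body to `parse_rpc_response`.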

Phonetic Comparison Algorithm for Indian Languages

Soundex is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described in Donald Knuth’s The Art of Computer Programming. The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U. [Read More]
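For readers who have not seen Soundex, the following is a minimal sketch of American Soundex for English words, written from the commonly published rules. It is not the Indian-language algorithm this post goes on to describe, and the function name `soundex` is chosen here for illustration.

```python
def soundex(word):
    """Return the four-character American Soundex code for an English word."""
    groups = (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    )
    codes = {letter: digit for letters, digit in groups for letter in letters}

    # Keep only the letters A-Z, uppercased.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = codes.get(c, "")
        if code and code != prev:
            result += code
        # Vowels have no code, so they reset prev and allow a repeat.
        prev = code

    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

Words with similar pronunciation, such as "Robert" and "Rupert", map to the same code, which is what makes the code usable as a search key.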

On Machine Translation and God

I was reading an article titled “Why Can’t a Computer Translate More Like a Person?” by Alan K. Melby. The article is about the challenges that machine translation technology faces in reaching an acceptable quality of translation. He explains the importance of cultural sensitivity in machine translation programs. The article lists a number of examples where MT can go wrong if context, culture, etc. are not taken into consideration. [Read More]

PDFBox : Extract Text from PDF

Recently I had to extract text from PDF files to index their content using Apache Lucene. Apache PDFBox was the obvious choice of Java library. Apache PDFBox is an open source Java library for working with PDF files. The library allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. PDFBox also includes several command line utilities. No recent build of PDFBox is available. [Read More]
java  pdf 

Announcing Project Silpa

Many of my friends already know about a project I am working on; this is the public announcement of it. The project is named Silpa, which may be read as an acronym of Swathanthra (Mukth, Free as in Freedom) Indian Language Processing Applications. It is a web framework and a set of applications for processing Indian languages in many ways. In other words, it is a platform for porting existing and upcoming language processing applications to the web. [Read More]
silpa 

Openoffice Indic Regional Language group

We just formed the Indic Regional Language group for Openoffice, as per the Openoffice Native Language Consortium plans. The objectives of such groups can be read here. Basically, the group is meant for better coordination among Indic languages to improve the Openoffice experience in our languages. The announcement of this group is here. Thanks to Charles-H. Schulz, we got a mailing list: indic@native-lang.openoffice.org. To subscribe, log in to http://native-lang.openoffice.org [Read More]

“ക്ടാവ്” Slang Converter Is Getting Ready

Friends, you all know about the interesting regional dialects of Kerala, don't you? From Thiruvananthapuram, Kottayam, Thrissur, Shoranur, Palakkad, Kozhikode, Kannur, Wayanad and other places, we have many different variants of Malayalam. They differ greatly from printed (standard written) Malayalam. Wouldn't it be fun to have software that takes printed Malayalam and the name of a place, and converts the text into the style of Malayalam spoken in that region? The project named “ക്ടാവ്” Slang converter is such an attempt. Have a look at the accompanying screenshot. It shows the development version. The conversion is based on a few rules and uses a technique called AMP(Ambiguous Language Processing), a new branch of Natural Language Processing. [Read More]

Python isalpha is buggy

This code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Print only the characters that isalpha() reports as alphabetic.
ml_string = u"സന്തോഷ്  हिन्दी"
for ch in ml_string:
    if ch.isalpha():
        print(ch)

gives this output

സ
ന
ത
ഷ
ह
न
द

So isalpha() fails for all matra (dependent vowel) signs of Indian languages, and for the virama as well. This is a known bug in glibc.

Does anybody know whether Python internally uses glibc functions for these basic string operations, or whether it uses a separate character database like Qt does?
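One way to look into this is to print the Unicode general category of each character next to the isalpha() result. The sketch below assumes Python 3 and its standard `unicodedata` module; the helper names `describe` and `is_word_char` are made up for illustration. The Python 3 documentation defines alphabetic characters as those in the general categories Lm, Lt, Lu, Ll, and Lo, so characters in the mark categories (Mn, Mc) would not count as alphabetic under that definition.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, isalpha result) per non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        rows.append(("U+%04X" % ord(ch), unicodedata.category(ch), ch.isalpha()))
    return rows


def is_word_char(ch):
    """Hypothetical workaround: accept letters and combining marks (categories M*)."""
    return ch.isalpha() or unicodedata.category(ch).startswith("M")


if __name__ == "__main__":
    sample = "സന്തോഷ്  हिन्दी"
    for code_point, category, is_alpha in describe(sample):
        print(code_point, category, is_alpha)
    # Keep every character that belongs to a word, including the signs.
    print("".join(ch for ch in sample if is_word_char(ch) or ch.isspace()))
```

If the goal is to keep whole words intact, a check like `is_word_char` may be a more useful filter than isalpha() alone.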

python