Lexicon Curation for Mlmorph

One of the key components of Mlmorph is its lexicon. The lexicon contains the root words categorized as nouns, verbs, adjectives, adverbs etc. These are the components used with morphological rules to generate the vocabulary of Malayalam. I collected initial lexicon with about 100,000 words from various sources such as Wikipedia, CLDR and many targeted web crawls. One problem with such collected words is they often contains spelling mistakes. Secondly, classifying these words is not possible without the tedious task of a person going through each and every words. [Read More]

LibreOffice Malayalam spellchecker using mlmorph

A few months back, I wrote about the spellchecker based on Malayalam morphology analyser. I was also trying to intergrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help me in its further development, please read on.Malayalam spellchecker – a morphology analyser based approach Blog post on spellchecker approach and pla Current status The libreoffice spellchecker for Malayalam is available at https://gitlab. [Read More]

Malayalam Named Entity Recognition using morphology analyser

Named Entity Recognition, a task of identifying and classifying real world objects such as persons, places, organizations from a given text is a well known NLP problem. For Malayalam, there were several research papers published on this topic, but none are functional or reproducible research. The morphological characteristics of Malayalam has been always a challenge to solve this problem. When the named entities appear in an inflected or agglutinated complex word, the first step is to analyse such words and arrive at the root words. [Read More]

Malayalam morphology analyser – First release

I am happy to announce the first version of Malayalam morphology analyser. After two years of development, I tagged version 1.0.0 . In this release In this release, mlmorph can analyse and generate malayalam words using the morpho-phonotactical rules defined and based on a lexicon. We have a test corpora of Fifty thousand words and 82% of the words in it are recognized by the analyser. A python interface is released to make the usage of library very easy for developers. [Read More]

Malayalam morphology analyser – status update

For the last several months, I am actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introduction blog post is a good start. I was always skeptical about the approach and the whole project as such looked very ambitious. But, now I am almost confident that the approach is viable. I am making good progress in the project, so this is some updates on that. [Read More]