
Malayalam morphology analyser – First release

I am happy to announce the first version of the Malayalam morphology analyser.

After two years of development, I tagged version 1.0.0.

In this release

With this release, mlmorph can analyse and generate Malayalam words using its morpho-phonotactical rules and a lexicon. We have a test corpus of fifty thousand words, and 82% of the words in it are recognized by the analyser.
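To make the coverage figure concrete, here is a minimal sketch of how such a number can be computed: count the words for which the analyser returns at least one analysis and divide by the corpus size. The `coverage` helper, the `toy_analyse` stand-in and the five-word sample are all hypothetical and only illustrate the arithmetic; they are not the project's actual test harness.

```python
def coverage(words, analyse):
    """Fraction of words for which analyse() returns at least one analysis."""
    words = list(words)
    if not words:
        return 0.0
    recognized = sum(1 for word in words if analyse(word))
    return recognized / len(words)


# Hypothetical stand-in for a real analyser: it "recognizes" only these words.
KNOWN_WORDS = {"കേരളം", "മലയാളം", "പാലക്കാട്"}


def toy_analyse(word):
    """Return a non-empty list for known words, an empty list otherwise."""
    return [word] if word in KNOWN_WORDS else []


sample = ["കേരളം", "മലയാളം", "xyz", "പാലക്കാട്", "qwerty"]
print(f"Coverage: {coverage(sample, toy_analyse):.0%}")
```

Because `coverage` accepts any callable, a real analyser's analyse method could be passed in place of `toy_analyse`.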

A Python interface is released to make the library easy for developers to use. The library is available on pypi.org: https://pypi.org/project/mlmorph/

Installing it is very easy:

pip install mlmorph

This avoids the difficulty of compiling the sfst formalism and installing the required hfst and sfst packages.

For detailed Python API documentation and the command-line utility, refer to https://pypi.org/project/mlmorph/
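As a rough illustration, analysing a single word from Python might look like the sketch below. It assumes the package exposes an `Analyser` class with an `analyse` method, as I understand the PyPI page to describe; treat the PyPI documentation as the authority if the API differs. The `analyse_word` wrapper is a hypothetical helper added only so the example degrades gracefully when mlmorph is not installed.

```python
def analyse_word(word):
    """Return the analyses for one word, or [] if mlmorph is not installed."""
    try:
        # Assumed import path and class name; check the PyPI page to confirm.
        from mlmorph import Analyser
    except ImportError:
        print("mlmorph is not installed; run: pip install mlmorph")
        return []
    analyser = Analyser()
    # Assumption: analyse() returns the candidate analyses for the word.
    return list(analyser.analyse(word))


if __name__ == "__main__":
    for analysis in analyse_word("കേരളത്തിന്റെ"):
        print(analysis)
```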

Next

There are a lot of known limitations in the current release. I plan to address them in future releases.

  • Expand lexicon further: The current lexicon was compiled by testing various texts and adding the missing words found in them. Preparing the coverage test corpus also helped to grow the lexicon, but it still needs more improvement.
  • Many commonly used language-specific constructs that combine multiple conjunctions or adjectives are not well covered. Some examples are മറ്റൊരു, പിന്നീട്, അതുപോലെത്തന്നെ, എന്നതിന്റെ etc.
  • Optimizing the weight calculation: As the lexicon grows, many rarely used words can become alternate parts in the agglutination of words. For example, പാലക്കാട് can have an analysis of പാല്, അക്ക്, ആട്. Even though this is grammatically correct, it should get less preference than പാലക്കാട്<proper noun>.
  • Standardization of POS tags: mlmorph has its own POS tag definitions. These tags need documentation with examples. I tried to use Universal Dependencies as much as possible, but it is not enough to cover all the tags required for Malayalam.
  • Documentation of the formalism and tutorials for developers: So far I am the only developer on the project, which I am not happy about. The learning curve for this project is too steep to attract new developers. An above-average understanding of Malayalam grammar is a demanding requirement too. I am planning to write some tutorials to help new developers join.
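The weight calculation item above can be illustrated with a small sketch. It assumes that each analysis carries a numeric weight and that a lower weight means a more preferred analysis; the per-morpheme costs below are invented for the example and are not mlmorph's actual values. The `Candidate`, `total_weight` and `rank` names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    analysis: str
    # Hypothetical cost per morpheme; rarer morphemes get a higher cost.
    morpheme_costs: Tuple[int, ...]


def total_weight(candidate):
    """Sum the per-morpheme costs; assumption: lower totals are preferred."""
    return sum(candidate.morpheme_costs)


def rank(candidates):
    """Order candidate analyses from most to least preferred."""
    return sorted(candidates, key=total_weight)


candidates = [
    # Three-part split containing a rarely used word (cost 300).
    Candidate("പാല് + അക്ക് + ആട്", (100, 300, 100)),
    # Single lexicon entry for the place name.
    Candidate("പാലക്കാട്<proper noun>", (100,)),
]

for candidate in rank(candidates):
    print(total_weight(candidate), candidate.analysis)
```

With these made-up costs the single proper-noun entry ranks ahead of the three-part split, which is the behaviour described for പാലക്കാട്.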

Applications

The project is meaningful only when practical applications are built on top of it.


