I am happy to announce the first version of mlmorph, the Malayalam morphology analyser.
After two years of development, I tagged version 1.0.0.
In this release
In this release, mlmorph can analyse and generate Malayalam words using its morpho-phonotactic rules and a lexicon. We have a test corpus of fifty thousand words, and the analyser recognises 82% of them.
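To make the coverage figure concrete, here is a minimal sketch of how such a number can be computed: the share of corpus words for which the analyser returns at least one analysis. The `coverage` function, the toy lexicon and `toy_analyse` are hypothetical stand-ins written for this illustration; they are not part of mlmorph or of its actual test suite.

```python
def coverage(words, analyse):
    """Return the fraction of words for which analyse() yields at least one analysis."""
    if not words:
        return 0.0
    recognised = sum(1 for word in words if analyse(word))
    return recognised / len(words)


# Hypothetical stand-in for the real analyser: a tiny set of "known" words.
TOY_LEXICON = {"പാലക്കാട്", "പാല്", "ആട്"}


def toy_analyse(word):
    """Return a one-item list for known words and an empty list otherwise."""
    return [(word, 0)] if word in TOY_LEXICON else []


test_words = ["പാലക്കാട്", "പാല്", "ആട്", "xyz"]
print(f"coverage: {coverage(test_words, toy_analyse):.0%}")
```

With a real analyser plugged in for `toy_analyse`, the same function would give the coverage over the full test corpus.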
A Python interface is released to make the library easy for developers to use. The library is available on pypi.org: https://pypi.org/project/mlmorph/
Installing it is very easy:
pip install mlmorph
It avoids the difficulties of compiling the SFST formalism and installing the required hfst and sfst packages.
For detailed Python API documentation and the command line utility, refer to https://pypi.org/project/mlmorph/
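As a rough idea of what using the Python interface looks like, here is a small sketch. It assumes the package exposes an `Analyser` class with an `analyse()` method, as described on the PyPI project page; please check that page for the exact API of the version you install. The `analyse_word` helper is a hypothetical wrapper added only for this example, and it returns an empty list when mlmorph is not installed.

```python
# Assumption: mlmorph provides an Analyser class with an analyse() method,
# as described on its PyPI page. Verify against the installed version.
try:
    from mlmorph import Analyser
except ImportError:  # mlmorph is not installed in this environment
    Analyser = None


def analyse_word(word):
    """Return the analyses of a word, or an empty list if mlmorph is unavailable."""
    if Analyser is None:
        return []
    # Building the analyser on every call is simple but slow; reuse it in real code.
    return list(Analyser().analyse(word))


if __name__ == "__main__":
    for analysis in analyse_word("കേരളത്തിന്റെ"):
        print(analysis)
```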
There are a lot of known limitations in the current release. I plan to address them in future releases.
Expand the lexicon further: The current lexicon was compiled by testing various texts and adding the missing words found in them. Preparing the coverage test corpus also helped to grow the lexicon. But it still needs more improvement.
Many language-specific constructs that are commonly used but consist of multiple conjunctions or adjectives are not well covered. Some examples are മറ്റൊരു, പിന്നീട്, അതുപോലെത്തന്നെ, എന്നതിന്റെ etc.
Optimizing the weight calculation: As the lexicon grows, many rarely used words can become alternate parts in the agglutination of words. For example, പാലക്കാട് can have an analysis of പാല്, അക്ക്, ആട്. Even though this is grammatically correct, it should get less preference than പാലക്കാട്.
Standardization of POS tags: mlmorph has its own POS tag definitions. These tags need documentation with examples. I tried to use Universal Dependencies as much as possible, but it is not enough to cover all the tags required for Malayalam.
Documentation of the formalism and tutorials for developers: So far I am the only developer on the project, which I am not happy about. The learning curve for this project is too steep to attract new developers. An above-average understanding of Malayalam grammar is a difficult requirement too. I am planning to write some tutorials to help new developers join.
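The weight problem mentioned above for പാലക്കാട് can be illustrated with a toy ranking. This is only a sketch under my own assumptions: it treats a lower weight as more preferred and adds a fixed, made-up penalty for every extra segment. The `BOUNDARY_PENALTY` value and the `weight` and `rank` functions are hypothetical and do not describe how mlmorph actually computes weights.

```python
BOUNDARY_PENALTY = 50  # hypothetical cost added per morpheme boundary


def weight(segments, base=100):
    """Illustrative weight: a base cost plus a penalty for every extra segment."""
    return base + BOUNDARY_PENALTY * (len(segments) - 1)


def rank(analyses):
    """Sort candidate segmentations so the lowest weight comes first."""
    return sorted(analyses, key=weight)


candidates = [
    ["പാല്", "അക്ക്", "ആട്"],  # grammatically valid split into rarer parts
    ["പാലക്കാട്"],  # the whole word as a single lexicon entry
]
best = rank(candidates)[0]
print(best, weight(best))
```

Under this scheme the single-entry analysis wins because it crosses no morpheme boundary, which is the preference the example calls for.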
The project is meaningful only when practical applications are built on top of it.
A spellchecker based on mlmorph is being developed. See https://gitlab.com/smc/mlmorph-spellchecker. It is also published on pypi.org: https://pypi.org/project/mlmorph-spellchecker/. The spellchecker also inherits the known limitations of mlmorph explained above.
The web interface of the spellchecker is available at https://morph.smc.org.in/spellchecker
A LibreOffice extension for Malayalam spell checking is being prepared at https://gitlab.com/smc/mlmorph-libreoffice-spellchecker
A number spellout demo application that uses the mlmorph web API is available at https://codepen.io/santhoshtr/pen/MONZow/