Mlmorph

New version of Malayalam morphology analyser

Posted on March 20, 2021 | Santhosh Thottingal

In the previous blog post, I explained my efforts to modernize SFST. Since I had published a Python binding for SFST and modernized SFST so that it compiles on all operating systems, the next step was to drop mlmorph's HFST dependency and use the new SFST version, 1.5.0. mlmorph 1.3.0 has no dependency on HFST, and the installation problems on different operating systems and Python versions are now solved. The latest version is also available on PyPI. [Read More]

mlmorph fst

New version of Stuttgart Finite State Transducer

Posted on March 20, 2021 | Santhosh Thottingal

The Malayalam morphology analyser (mlmorph), which I have been actively developing since 2016, is based on the Stuttgart Finite State Transducer (SFST) formalism. SFST was developed by Dr Helmut Schmid. Since SFST had minimal developer APIs and no functional Python API, I was using the Helsinki Finite State Transducer (HFST) toolkit. HFST has a Python binding and better tooling and development support. HFST has an SFST backend, so there was no issue in using the SFST formalism. But HFST development has slowed down in the past few years. [Read More]

mlmorph fst

Foreign word detection in mlmorph

Posted on October 24, 2020 | Santhosh Thottingal

The test corpus for Malayalam morphological analysis has many foreign words. They are written either in a non-Malayalam script or in Malayalam script. For example, “ഇലക്ട്രിസിറ്റി”, “ഡോക്സ്”, “ഇന്റർമീഡിയറ്റ്”, “അബ്സ്ട്രാക്റ്റ്”, “ഇല്ലസ്ടേഷൻ”, “ഇല്ലിറ്ററേറ്റ്”, “റെക്കോർഡ്”, “procrastination”, “唐宸禹” are all foreign words, and it is useless to analyse them using mlmorph. Since mlmorph works from a root word lexicon, it is practically impossible to include all of them in the lexicon. So there should be a way to identify such words easily and tag them with the FW (foreign word) part-of-speech tag. [Read More]
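The simpler of the two cases above, words written in a non-Malayalam script, can be sketched with a Unicode range check. This is only an illustration and not necessarily how mlmorph does it: the function names are hypothetical, and the only fact relied on is that the Malayalam Unicode block spans U+0D00 to U+0D7F. Foreign words transliterated into Malayalam script pass this check and need a different approach, which this sketch does not attempt.

```python
import unicodedata

# Malayalam Unicode block boundaries (U+0D00..U+0D7F).
MALAYALAM_START = 0x0D00
MALAYALAM_END = 0x0D7F


def has_non_malayalam_letter(word: str) -> bool:
    """Return True if any letter in `word` lies outside the Malayalam block.

    Only characters in a Unicode "Letter" category are inspected, so digits,
    punctuation and joiners (ZWJ/ZWNJ) do not affect the result.
    """
    for ch in word:
        is_letter = unicodedata.category(ch).startswith("L")
        if is_letter and not (MALAYALAM_START <= ord(ch) <= MALAYALAM_END):
            return True
    return False


def tag_by_script(word: str):
    """Hypothetical helper: return "FW" for words in a non-Malayalam script.

    Returns None when the script check alone cannot decide, which includes
    foreign words that are transliterated into Malayalam script.
    """
    return "FW" if has_non_malayalam_letter(word) else None


if __name__ == "__main__":
    for w in ["procrastination", "唐宸禹", "മലയാളം"]:
        print(w, tag_by_script(w))
```

A check like this is cheap enough to run before any morphological analysis, so words in other scripts never reach the transducer.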

mlmorph foreign-word-detection

Stuttgart Finite State Transducer (SFST) formalism support for VS Code

Posted on September 11, 2020 | Santhosh Thottingal

I just published a VS Code language extension that adds syntax highlighting for the Stuttgart Finite State Transducer (SFST) formalism.

I learned how to write a language extension when I worked on OpenType feature file support. So I thought of applying that learning to SFST, which I regularly use for the Malayalam morphology analyser project.

vscode sfst mlmorph

Mlmorph at MT Summit 2019

Posted on August 22, 2019 | Santhosh Thottingal

I presented the Malayalam morphology analyser at the 2nd Workshop on Technologies for MT of Low Resource Languages (LORESMT) in Dublin, Ireland, held as part of the 17th MT Summit.

Paper: https://www.aclweb.org/anthology/W19-6801

Presentation: https://docs.google.com/presentation/d/…

Proceedings from the LORESMT workshop: https://aclweb.org/anthology/volumes/W19-68/

paper mlmorph

Updated web interface for mlmorph

Posted on June 8, 2019 | Santhosh Thottingal

The web interface of the Malayalam morphology analyser (mlmorph) has been updated. You can see the new interface at https://morph.smc.org.in/. The new web application is written in Vue.js using the Vuetify UI framework. The backend is Flask. Source code is available at https://gitlab.com/smc/mlmorph-web

mlmorph

Lexicon Curation for Mlmorph

Posted on May 26, 2019 | Santhosh Thottingal

One of the key components of Mlmorph is its lexicon. The lexicon contains the root words categorized as nouns, verbs, adjectives, adverbs etc. These are the components used with morphological rules to generate the vocabulary of Malayalam. I collected an initial lexicon of about 100,000 words from various sources such as Wikipedia, CLDR and many targeted web crawls. One problem with words collected this way is that they often contain spelling mistakes. Secondly, classifying these words is not possible without the tedious task of a person going through every word. [Read More]

mlmorph

LibreOffice Malayalam spellchecker using mlmorph

Posted on March 10, 2019 | Santhosh Thottingal

A few months back, I wrote about the spellchecker based on the Malayalam morphology analyser. I was also trying to integrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help me in its further development, please read on. Malayalam spellchecker – a morphology analyser based approach Blog post on spellchecker approach and pla Current status The LibreOffice spellchecker for Malayalam is available at https://gitlab. [Read More]

libreoffice mlmorph spell checker

Malayalam Named Entity Recognition using morphology analyser

Posted on March 10, 2019 | Santhosh Thottingal

Named Entity Recognition, the task of identifying and classifying real-world objects such as persons, places and organizations in a given text, is a well-known NLP problem. For Malayalam, several research papers have been published on this topic, but none of them is functional or reproducible. The morphological characteristics of Malayalam have always made this problem challenging. When named entities appear in an inflected or agglutinated complex word, the first step is to analyse such words and arrive at the root words. [Read More]

mlmorph morphology named entity recognition

Malayalam morphology analyser – First release

Posted on November 25, 2018 | Santhosh Thottingal

I am happy to announce the first version of the Malayalam morphology analyser. After two years of development, I tagged version 1.0.0. In this release, mlmorph can analyse and generate Malayalam words using the defined morpho-phonotactical rules and a lexicon. We have a test corpus of fifty thousand words, and 82% of the words in it are recognized by the analyser. A Python interface is released to make the library easy for developers to use. [Read More]

mlmorph morphology python