LibreOffice Malayalam spellchecker using mlmorph

A few months back, I wrote about the spellchecker based on Malayalam morphology analyser. I was also trying to intergrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help me in its further development, please read on.

Blog post on spellchecker approach and pla

Current status

The libreoffice spellchecker for Malayalam is available at https://gitlab.com/smc/mlmorph-libreoffice-spellchecker. You need to get the code using git checkout or download the master version as zip file

You need LibreOffice 4.1 or later. Latest version is recommended. In the source code directory, run make install to install the extension.

Open libreoffice writer, add some Malayalam text. Make sure to select the language as Malayalam by choosing it from the menu or bottom status bar. You should see the spelling check in action… if everything goes as expected 😉

LibreOffice language settings, You can see mlmorph listed.
Spellchecker in action- libreoffice writer.

How can you help?

Theoretically, the extension should work in non-Linux platforms as well. But I have not tested it. The extension need python3 and python-hfst for the operating system. But python-hfst is not available for Windows 64 bit python installation. If you test and get the extension working, please add documentation and if anything missing to make the installation more easy, let me know.

As the mlmorph project get wider support for Malayalam vocabulary, the quality of spellchecker improves automatically.

Malayalam Named Entity Recognition using morphology analyser

Named Entity Recognition, a task of identifying and classifying real world objects such as persons, places, organizations from a given text is a well known NLP problem. For Malayalam, there were several research papers published on this topic, but none are functional or reproducible research.

The morphological characteristics of Malayalam has been always a challenge to solve this problem. When the named entities appear in an inflected or agglutinated complex word, the first step is to analyse such words and arrive at the root words.

As the Malayalam morphology analyser is progressing well, I attempted to build a first version of Malayalam NER on top of it. Since mlmorph gives the POS tagging and analysis, there is not much to do in NER. We just need to look for tags corresponding to proper nouns and report.

You can try the system at https://morph.smc.org.in/ner

Malayalam named entity recognition example using https://morph.smc.org.in/ner

Known Limitations

  • The recognition is limited by the current lexicon of mlmorph. To recognize out of lexicon entities, a POS guesser would be needed. But this is a general problem not limited to NER. A morphology analyser should also have a POS guesser. In other words as the mlmorph improves, this system also improves automatically.
  • Currently the recognition is at word level. But sometimes, the entities are written in multiple consecutive words. To resolve that we will need to write a wrapper on top of word level detection system.
  • The current system is a javascript wrapper on top the mlmorph analyse api. I think NER deserve its own api.

Malayalam morphology analyser – First release

Photo by Ankush Minda on Unsplash

I am happy to announce the first version of Malayalam morphology analyser.

After two years of development, I tagged version 1.0.0

In this release

In this release, mlmorph can analyse and generate malayalam words using the morpho-phonotactical rules defined and based on a lexicon. We have a test corpora of Fifty thousand words and 82% of the words in it are recognized by the analyser.

A python interface is released to make the usage of library very easy for developers. The library is available in pypi.org – https://pypi.org/project/mlmorph/ Installing it is very easy:

Installing it is very easy:

pip install mlmorph

It avoids all difficulties of compiling the sfst formalism and installing the required hfst, sfst packages.

For detailed python api documentation and command line utility refer https://pypi.org/project/mlmorph/

Next

There are lot of known limitations with the current release. I plan to address them in future releases.

  • Expand lexicon further: The current lexicon was compiled by testing various text and adding missing words found in it. Preparing the coverage test corpora also helped to increase the lexicon. But it still need more improvement
  • Many language specific constructs which are commonly used, but consisting of multiple conjunctions, adjectives are not well covered. Some examples are മറ്റൊരു, പിന്നീട്, അതുപോലെത്തന്നെ, എന്നതിന്റെ etc.
  • Optimizing the weight calculation: As the lexicon size is increased, many rarely used words can become alternate parts in agglutination of the words. For example, പാലക്കാട് can have an analysis of പാല്, അക്ക്, ആട് -Even though this is grammatically correct, it should get less preference than പാലക്കാട്<proper noun>.
  • Standardization of POS tags: mlmorph has its own pos tags definition. These tags need documentation with examples. I tried to use universal dependencies as much as possible, but it is not enough to cover all required tags for malayalam.
  • Documentation of formalism and tutorials for developers. So far I am the only developer for the project, which I am not happy about. The learning curve for this project is too steep to attract new developers. Above average understanding of Malayalam grammar is a difficult requirement too. I am planning to write down some tutorials to help new developers to join.

Applications

The project is meaningful only when practical applications are built on top of this.



Malayalam morphology analyser – status update

For the last several months, I am actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introduction blog post is a good start. I was always skeptical about the approach and the whole project as such looked very ambitious. But, now  I am almost confident that the approach is viable. I am making good progress in the project, so this is some updates on that.

Analyser coverage statistics

Recently I added a large corpora to frequently monitor the percentage of words the analyser can parse.  The corpora was selected from two large chapters of ഐതിഹ്യമാല, some news reports, an art related essay, my own technical blog posts to have some diversity in the vocabulary.

Total words
15808
Analysed words10532
Coverage66.62%
Time taken
0.443 seconds

This is a very encouraging. Achieving a 66% for such a morphologically rich language Malayalam is no small task. From my reading, Turkish and Finnish, languages with same complexity of morphology achieved about 90% coverage. It may be more difficult to increase the coverage for me compared to achieving this much so far. So I am planning some frequency analysis on words that are not parsed by analyser, and find some patterns to improve.

The performance aspect is also notable. Once the automata is loaded to memory, the analysis or generation is super fast. You can see that ~16000 words were analyzed under half of a second.

Tests

From the very beginning the project was test driven. I now has 740 test cases for various word forms

The transducer

The compiled transducer now is 6.2 MB.  The transducer is written in SFST-PL and compile using SFST. It used to be compiled using hfst, but hfst is now severely broken for SFST-PL compilation, so I switched to SFST. But the compiled transducer is read using hfst python binding.

Fst type
SFST
arc typeSFST
Number of states
200562
Number or arcs
732268
Number of final states
130

The Lexicon

The POS tagged lexicon I prepared is from various sources like wiktionary, wikipedia(based on categories), CLDR. While developing I had to improve the lexicon several times since none of the above sources are accurate. The wiktionary also introduced a large amount of archaic or sanskrit terms to the lexicon. As of today, following table illustrates the lexicon status

Nouns
64763
Person names
505
Place names
2031
Postpositions
85
Pronouns
33
Quantifiers
57
Abbreviations
27
Adjectives
18
Adverbs
14
Affirmatives
6
Conjunctions
75
Demonstratives
9
English borrowed nouns
657
Interjections
36
Language names(nouns)
639
Affirmations and negations
8
Verbs
3844

As you can see, the lexicon is not that big. Especially it is very limited for proper nouns like names, places. I think the verb lexicon is much better. I need to find a way to expand this further.

POS Tagging

There is no agreement or standard on the POS tagging schema to be used for Malayalam. But I refused to set this is as a blocker for the project. I defined my own POS tagging schema and worked on the analyser. The general disagreement is about naming, which is very trivial to fix using a tag name mapper. The other issue is classification of features, which I found that there no elaborate schema that can cover Malayalam.

I started referring http://universaldependencies.org/ and provided links to the pages in it from the web interface.  But UD is also missing several tags that Malayalam require. So far I have defined 85 tags

Challenges

The main challenge I am facing is not technical, it is linguistic. I am often challenged by my limited understanding of Malayalam grammar. Especially about the grammatical classifications, I find it very difficult to come up with an agreement after reading several grammar books. These books were written in a span of 100 years and I miss a common thread in the approach for Malayalam grammar analysis. Sometimes a logical classification is not the purpose of the author too. Thankfully, I am getting some help from Malayalam professors whenever I am stuck.

The other challenge is I hardly got any contributor to the project except some bug reporting. There is a big entry barrier to this kind of projects. The SFST-PL is not something everybody familiar with. I need to write some simple examples for others to practice and join.

I found that some practical applications on top of the morphology analyser is attracting more people. For example, the number spellout application I wrote caught the attention of many people. I am excited to present the upcoming spellchecker that I was working recently. I will write about the theory of that soon.