Malayalam morphology analyser – status update

For the last several months, I have been actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introduction blog post is a good place to start. I was always skeptical about the approach, and the project as a whole looked very ambitious. But now I am almost confident that the approach is viable. I am making good progress, so here are some updates.

Analyser coverage statistics

Recently I added a large corpus so that I can frequently monitor the percentage of words the analyser can parse. The corpus was compiled from two large chapters of ഐതിഹ്യമാല, some news reports, an art-related essay, and my own technical blog posts, to have some diversity in the vocabulary.

<table>
<tr><th>Total words</th><th>Analysed words</th><th>Coverage</th><th>Time taken</th></tr>
<tr><td>15808</td><td>10532</td><td>66.62%</td><td>0.443 seconds</td></tr>
</table>

This is very encouraging. Achieving 66% coverage for a morphologically rich language like Malayalam is no small task. From my reading, analysers for Turkish and Finnish, languages with comparable morphological complexity, achieved about 90% coverage. Increasing the coverage further may be harder than getting this far. So I am planning some frequency analysis of the words that the analyser cannot parse, to find patterns that I can improve.

The performance aspect is also notable. Once the automaton is loaded into memory, analysis and generation are very fast. You can see that ~16000 words were analysed in under half a second.
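The coverage and timing figures in this section, and the planned frequency analysis of unparsed words, can be sketched roughly as follows. This is only an illustration, not the project's actual code: `coverage_report` and `toy_analyse` are hypothetical names, and the sketch assumes the analyser is available as a function that returns a list of analyses for a word, with an empty list meaning the word could not be parsed.

```python
import time
from collections import Counter
from typing import Callable, Dict, Iterable, List


def coverage_report(words: Iterable[str],
                    analyse: Callable[[str], List[str]]) -> Dict[str, object]:
    """Run `analyse` over every word and summarise coverage.

    `analyse` is an assumed interface: it returns a list of analyses,
    and an empty list when the word cannot be parsed.
    """
    words = list(words)
    unanalysed: Counter = Counter()  # frequency of words that failed to parse
    analysed = 0
    start = time.perf_counter()
    for word in words:
        if analyse(word):
            analysed += 1
        else:
            unanalysed[word] += 1
    elapsed = time.perf_counter() - start
    total = len(words)
    return {
        "total": total,
        "analysed": analysed,
        "coverage": 100.0 * analysed / total if total else 0.0,
        "seconds": elapsed,
        "unanalysed": unanalysed,
    }


# Toy stand-in for the real analyser (hypothetical): it only "knows" two words.
KNOWN_WORDS = {"വീട്", "മരം"}


def toy_analyse(word: str) -> List[str]:
    return [word + "<n>"] if word in KNOWN_WORDS else []


report = coverage_report(["വീട്", "മരം", "xyz", "xyz", "abc"], toy_analyse)
print(f"{report['analysed']}/{report['total']} words, "
      f"{report['coverage']:.2f}% coverage")
# The most frequent unparsed words are the first candidates for fixes.
print(report["unanalysed"].most_common(3))
```

With the numbers from the table, the same formula gives 100 × 10532 / 15808 ≈ 66.62%, and 15808 words in 0.443 seconds works out to roughly 35,000 words per second.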

Tests

From the very beginning the project was test driven. It now has 740 test cases for various word forms.

The transducer

The compiled transducer is now 6.2 MB. The transducer is written in SFST-PL and compiled using SFST. It used to be compiled using hfst, but hfst is now severely broken for SFST-PL compilation, so I switched to SFST. The compiled transducer is still read using the hfst Python binding.

<table>
<tr><th>Fst type</th><th>Arc type</th><th>Number of states</th><th>Number of arcs</th><th>Number of final states</th></tr>
<tr><td>SFST</td><td>SFST</td><td>200562</td><td>732268</td><td>130</td></tr>
</table>

The Lexicon

The POS-tagged lexicon I prepared comes from various sources like Wiktionary, Wikipedia (based on categories) and CLDR. While developing I had to improve the lexicon several times, since none of these sources is accurate. Wiktionary also introduced a large number of archaic or Sanskrit terms into the lexicon. As of today, the following table shows the lexicon status.

<table>
<tr><th>Category</th><th>Entries</th></tr>
<tr><td>Nouns</td><td>64763</td></tr>
<tr><td>Person names</td><td>505</td></tr>
<tr><td>Place names</td><td>2031</td></tr>
<tr><td>Postpositions</td><td>85</td></tr>
<tr><td>Pronouns</td><td>33</td></tr>
<tr><td>Quantifiers</td><td>57</td></tr>
<tr><td>Abbreviations</td><td>27</td></tr>
<tr><td>Adjectives</td><td>18</td></tr>
<tr><td>Adverbs</td><td>14</td></tr>
<tr><td>Affirmatives</td><td>6</td></tr>
<tr><td>Conjunctions</td><td>75</td></tr>
<tr><td>Demonstratives</td><td>9</td></tr>
<tr><td>English borrowed nouns</td><td>657</td></tr>
<tr><td>Interjections</td><td>36</td></tr>
<tr><td>Language names (nouns)</td><td>639</td></tr>
<tr><td>Affirmations and negations</td><td>8</td></tr>
<tr><td>Verbs</td><td>3844</td></tr>
</table>

As you can see, the lexicon is not that big. It is especially limited for proper nouns such as person and place names. I think the verb lexicon is much better. I need to find a way to expand it further.

POS Tagging

There is no agreement or standard on the POS tagging schema to be used for Malayalam. But I refused to let this be a blocker for the project. I defined my own POS tagging schema and worked on the analyser. The general disagreement is about naming, which is trivial to fix using a tag name mapper. The other issue is the classification of features, for which I found no elaborate schema that can cover Malayalam.
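To show why the naming disagreement is the easy part, here is a minimal sketch of a tag name mapper. It rests on assumptions that are not taken from the project: that an analysis is a string with tags in angle brackets, and that the tag names on both sides of `EXAMPLE_TAG_MAP` look like this. The names `map_tags` and `EXAMPLE_TAG_MAP` are made up for the example.

```python
import re
from typing import Dict

# Hypothetical mapping from one schema's tag names to another's.
# Both sides are illustrative assumptions, not the analyser's real tag set.
EXAMPLE_TAG_MAP: Dict[str, str] = {
    "n": "NOUN",
    "v": "VERB",
    "adj": "ADJ",
}

# Matches a single tag written as <name> (assumed notation).
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def map_tags(analysis: str, mapping: Dict[str, str]) -> str:
    """Rename every <tag> in `analysis` using `mapping`.

    Tags without an entry in `mapping` are left unchanged, so a partial
    mapping never loses information.
    """
    def rename(match: "re.Match[str]") -> str:
        tag = match.group(1)
        return "<" + mapping.get(tag, tag) + ">"

    return TAG_PATTERN.sub(rename, analysis)


print(map_tags("വീട്<n><pl>", EXAMPLE_TAG_MAP))
```

Because the mapper only rewrites names, it can be applied after analysis without touching the transducer. The classification of features is a different matter, since no renaming can fix it.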

I started referring to http://universaldependencies.org/ and provided links to its pages from the web interface. But UD is also missing several tags that Malayalam requires. So far I have defined 85 tags.

Challenges

The main challenge I am facing is not technical but linguistic. I am often challenged by my limited understanding of Malayalam grammar. On grammatical classifications in particular, I find it very difficult to arrive at a consistent view even after reading several grammar books. These books were written over a span of 100 years, and I cannot find a common thread in their approach to Malayalam grammar analysis. Sometimes a logical classification is not even the author's purpose. Thankfully, I get some help from Malayalam professors whenever I am stuck.

The other challenge is that I have hardly had any contributors to the project, apart from some bug reports. There is a big entry barrier to this kind of project. SFST-PL is not something everybody is familiar with. I need to write some simple examples for others to practise with and join.

I found that practical applications built on top of the morphology analyser attract more people. For example, the number spellout application I wrote caught the attention of many people. I am excited to present the upcoming spellchecker that I have been working on recently. I will write about the theory behind it soon.
