Foreign word detection in mlmorph

The test corpus for Malayalam morphological analysis contains many foreign words. They are written either in a non-Malayalam script or in Malayalam script. For example, “ഇലക്ട്രിസിറ്റി”, “ഡോക്സ്”, “ഇന്റർമീഡിയറ്റ്”, “അബ്സ്ട്രാക്റ്റ്”, “ഇല്ലസ്ടേഷൻ”, “ഇല്ലിറ്ററേറ്റ്”, “റെക്കോർഡ്”, “procrastination” and “唐宸禹” are all foreign words, and it is pointless to analyse them with mlmorph. Since mlmorph works from a root word lexicon, it is practically impossible to have all such words in the lexicon. So there should be an easy way to identify these words and tag them as FW, the foreign word part of speech. The presence of these foreign words also distorts the coverage statistics of mlmorph. A large part of the test corpus comes from Malayalam Wikipedia, which has many foreign words in articles about foreign places or people.

How to identify a foreign word?

For words written in a non-Malayalam script, it is quite easy. But it is a tricky problem if the words are transliterated to Malayalam. Even for humans, it is not always possible to say accurately whether a word is foreign or not. We usually use patterns in the pronunciation to guess, and we base that guess on our understanding of known patterns in Malayalam. So, if we want to write a program to detect foreign words, we can try some of these patterns, but that still does not guarantee 100% accuracy. I think that is acceptable, since language is not a domain where 100% accuracy is always achievable.
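The easy case, words containing characters from another script, can be sketched with a Unicode range check. This is only an illustration and not the code used in mlmorph: the function name `has_non_malayalam_script` is made up for this example, and it assumes that the Malayalam Unicode block (U+0D00 to U+0D7F) plus the zero-width joiners often used in Malayalam text are the only characters a Malayalam-script word may contain.

```python
# Malayalam Unicode block boundaries.
MALAYALAM_START, MALAYALAM_END = 0x0D00, 0x0D7F
# Zero-width non-joiner and joiner, which can appear inside Malayalam words.
JOINERS = {"\u200c", "\u200d"}


def has_non_malayalam_script(word: str) -> bool:
    """Return True if any character lies outside the Malayalam block.

    Hypothetical helper for illustration; it only handles the script
    check, not transliterated foreign words.
    """
    return any(
        not (MALAYALAM_START <= ord(ch) <= MALAYALAM_END) and ch not in JOINERS
        for ch in word
    )


for word in ["procrastination", "唐宸禹", "റെക്കോർഡ്"]:
    print(word, has_non_malayalam_script(word))
```

Note that a transliterated word such as “റെക്കോർഡ്” passes this check as Malayalam script, which is exactly why the pattern-based guess is needed for the harder case.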

So I wrote a very simple Python script with all the known patterns that can be attributed to a non-Malayalam word and integrated it with mlmorph. I could have written these patterns in the FST system of mlmorph, but I wanted the flexibility of using this foreign word detection system conditionally. There are two reasons for this. The first and most important one is that some foreign words get absorbed into the language and, after a few years, become part of the common vocabulary. So the mlmorph lexicon has a collection of such words that originated from Latin and Sanskrit. The words we borrowed from English follow the same inflection patterns as native Malayalam words (examples: ബഞ്ച്, ബഞ്ചിൽ, ബസ്സിന്റെ, ബോക്സിലുള്ള). Words of Sanskrit origin follow most of the inflection and agglutination rules, but there are exceptions. Because of this, foreign word detection should only be a fallback process in mlmorph analysis. Secondly, when mlmorph is used for spelling correction suggestions, it should not suggest foreign words based on these patterns.
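The fallback idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual mlmorph internals: `analyse` and `is_foreign` are stand-in callables supplied by the caller, and the function name `analyse_with_fallback`, the `use_foreign_detection` flag and the `<fw>` tag format are all hypothetical.

```python
from typing import Callable, List


def analyse_with_fallback(
    word: str,
    analyse: Callable[[str], List[str]],
    is_foreign: Callable[[str], bool],
    use_foreign_detection: bool = True,
) -> List[str]:
    """Try lexicon-based analysis first; tag as foreign only if that fails.

    All names here are hypothetical; the tag format is an assumption.
    """
    analyses = analyse(word)
    if analyses:
        # Borrowed words already in the lexicon are analysed normally.
        return analyses
    if use_foreign_detection and is_foreign(word):
        return [f"{word}<fw>"]
    return []


# Stand-in analyser and detector, used only to exercise the sketch.
toy_lexicon = {"ബഞ്ച്": ["ബഞ്ച്<n>"]}
toy_analyse = lambda w: toy_lexicon.get(w, [])
toy_is_foreign = lambda w: w in {"ഡോക്സ്", "procrastination"}

print(analyse_with_fallback("ബഞ്ച്", toy_analyse, toy_is_foreign))
print(analyse_with_fallback("ഡോക്സ്", toy_analyse, toy_is_foreign))
print(analyse_with_fallback("ഡോക്സ്", toy_analyse, toy_is_foreign,
                            use_foreign_detection=False))
```

Keeping the detector outside the analyser and behind a flag is what lets a spelling-suggestion caller switch it off while a coverage test leaves it on.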

This is part of mlmorph version 1.2.2, and you can try it in the Colab notebook.

The API is as follows:

from mlmorph import foreign_word_detector
words = ["ഇലക്ട്രിസിറ്റി", "ഏതാണെന്ന്", "ഡോക്സ്", "ഇന്റർമീഡിയറ്റ്", "അബ്സ്ട്രാക്റ്റ്", "ഇല്ലസ്ടേഷൻ", "ഇല്ലിറ്ററേറ്റ്", "റെക്കോർഡ്", "procrastination", "唐宸禹"]
for word in words:
  if foreign_word_detector.check_foreign_word(word) == 1:
    print(word)

This will print

ഇലക്ട്രിസിറ്റി
ഡോക്സ്
ഇന്റർമീഡിയറ്റ്
അബ്സ്ട്രാക്റ്റ്
ഇല്ലസ്ടേഷൻ
ഇല്ലിറ്ററേറ്റ്
റെക്കോർഡ്
procrastination
唐宸禹

The mlmorph CLI tool has a new option to check for foreign words: mlmorph -f.

Evaluation results

To evaluate this system, I fed the borrowed word lexicon of mlmorph to the foreign word detector, and it detected about 60% of the words (793 out of 1309) as foreign words.

cat lexicon/english-borrowed.lex | wc -l
1309
cat lexicon/english-borrowed.lex | mlmorph -f | awk '{ if($2 == 1) { print $1 }}' | wc -l
793
echo "(793/1309)*100" | bc -l
60.58059587471352177200

Integration results

Integrating the foreign word detection system gave a 12 percentage point increase in word coverage for mlmorph. On the test corpus of about 1,400,000 words (a non-curated collection of text from Malayalam Wikipedia and other websites), it helped to increase coverage from 45% to 57%.
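To make the size of the gain concrete, here is a small sketch of the arithmetic, using only the 45% and 57% figures quoted above. The variable names are chosen just for this example.

```python
coverage_before = 45.0  # percent, before foreign word detection
coverage_after = 57.0   # percent, after foreign word detection

# Absolute gain, in percentage points.
point_gain = coverage_after - coverage_before

# Relative gain, as a percentage of the earlier coverage.
relative_gain = point_gain / coverage_before * 100

print(f"{point_gain:.0f} percentage points")
print(f"{relative_gain:.1f}% relative increase")
```

The gain is 12 points in absolute terms, which is roughly a 26.7% improvement relative to the earlier coverage.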

But if the corpus is non-Wikipedia content with fewer foreign words, the coverage is generally 80% and above, as you can see from the coverage test results:

File name                           Words  Analysed  Percentage
mlwiki-all-unique-words-00.txt     199986    112281      56.14%
mlwiki-all-unique-words-01.txt     199989    108650      54.33%
mlwiki-all-unique-words-02.txt     199982    119577      59.79%
mlwiki-all-unique-words-03.txt     199996    106705      53.35%
mlwiki-all-unique-words-04.txt     199994    104216      52.11%
mlwiki-all-unique-words-05.txt     199992    102098      51.05%
mlwiki-all-unique-words-06.txt     105936     64755      61.13%
deshabhimani-all-unique-words.txt   47821     34919      73.02%
26.txt                               3481      2979      85.58%
18.txt                               5696      4690      82.34%
8.txt                                 366       320      87.43%
16.txt                               2043      1745      85.41%
20.txt                                474       394      83.12%
9.txt                                 456       386      84.65%
19.txt                                419       379      90.45%
24.txt                               1402      1175      83.81%
10.txt                                321       281      87.54%
25.txt                               3657      3032      82.91%
14.txt                                864       766      88.66%
2.txt                                3666      3080      84.02%
7.txt                                6627      5717      86.27%
21.txt                               1561      1300      83.28%
15.txt                               5805      5053      87.05%
13.txt                               3705      3284      88.64%
12.txt                                416       368      88.46%
3.txt                                1843      1543      83.72%
6.txt                                 422       375      88.86%
11.txt                                391       331      84.65%
23.txt                                848       676      79.72%
4.txt                                 527       436      82.73%
5.txt                                 248       217      87.50%
1.txt                                2026      1650      81.44%
17.txt                                787       691      87.80%
22.txt                               1010       855      84.65%
Total                             1402757    794924      56.67%
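The effect of the Wikipedia files on the total can be checked with a short sketch that splits the totals into Wikipedia and non-Wikipedia parts. It uses only the counts from the results above; the `coverage` helper and the variable names are made up for this example and are not part of the mlmorph test scripts.

```python
def coverage(words: int, analysed: int) -> float:
    """Coverage as a percentage of words that got an analysis."""
    return analysed / words * 100


# (words, analysed) for the seven mlwiki-all-unique-words-NN.txt files.
wiki_rows = [
    (199986, 112281), (199989, 108650), (199982, 119577), (199996, 106705),
    (199994, 104216), (199992, 102098), (105936, 64755),
]
total_words, total_analysed = 1402757, 794924  # the "Total" row

wiki_words = sum(w for w, _ in wiki_rows)
wiki_analysed = sum(a for _, a in wiki_rows)

wiki_cov = coverage(wiki_words, wiki_analysed)
other_cov = coverage(total_words - wiki_words, total_analysed - wiki_analysed)
total_cov = coverage(total_words, total_analysed)

print(f"Wikipedia files: {wiki_cov:.2f}%")
print(f"Other files:     {other_cov:.2f}%")
print(f"All files:       {total_cov:.2f}%")
```

Because the Wikipedia files account for more than 1.3 million of the 1.4 million words, the overall figure stays close to the Wikipedia coverage even though the remaining files do much better.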