Foreign word detection in mlmorph -

The test corpus for Malayalam Morphological analysis has many foreign words. They are either written in a non-Malayalam script or written in Malayalam. For example, “ഇലക്ട്രിസിറ്റി”, “ഡോക്സ്”, “ഇന്റർമീഡിയറ്റ്”, “അബ്സ്ട്രാക്റ്റ്”, “ഇല്ലസ്ടേഷൻ”, “ഇല്ലിറ്ററേറ്റ്”, “റെക്കോർഡ്”, “procrastination”, “唐宸禹” - These are all foreign words and it is useless to analyse them using mlmorph. Since mlmorph works based on a root word lexicon, it is practically impossible to have them in lexicon. So there should be a way to identify the words easily and tag them as FW - Foreign word Part of speech. The presence of these foreign words also distorts the coverage statistics of mlmorph. A good part of the test corpus is Malayalam wikipedia corpus and it has so many foreign words when the article is about foreign places or people.

How to identify a foreign word?

For words written in non-Malayalam script, it is quite easy. But it is a tricky problem if the words are transliterated to Malayalam. Even for humans, it is not possible to accurately say one word is foreign or not. We usually use some patterns in the pronunciation to guess it and we base that guess from our understanding of known patterns in Malayalam. So, if we want to write a program to detect foreign words, we can try some of these patterns, but still does not guarantee a 100% accuracy. But I think that is acceptable considering language is not something you can achieve 100% accuracy always.

So I wrote a very simple python script with all known pattens that can be attributed to a non-Malayalam word and integrated with mlmorph. I could write these patterns in the FST system of mlmorph, but I wanted the flexibility of conditionally using this foreign word detection system. There are two reasons for this. First and most important one is, some foreign words get infused to the language and after few years they become part of common vocabulary. So the mlmorph lexicon has a collection of such words that originated from Latin and Sanskrit. The words we borrowed from English will follow the same inflection pattern of native Malayalam words(Examples: ബഞ്ച്, ബഞ്ചിൽ, ബസ്സിന്റെ, ബോക്സിലുള്ള). Sanskrit originated words will follow most of the inflection and agglutination rules, but there are exceptions. The foreign word detection should only be a fallback process in mlmorph analysis because of this. Secondly, when mlmorph is used for spelling correction suggestions, I should not suggest foreign words with these patterns.

This is part of mlmorph version 1.2.2 and you can try it in the collab notebook.

The API is as follows:

from mlmorph import foreign_word_detector
words = ["ഇലക്ട്രിസിറ്റി", "ഏതാണെന്ന്", "ഡോക്സ്", "ഇന്റർമീഡിയറ്റ്", "അബ്സ്ട്രാക്റ്റ്", "ഇല്ലസ്ടേഷൻ", "ഇല്ലിറ്ററേറ്റ്", "റെക്കോർഡ്", "procrastination", "唐宸禹"]
for word in words:
  if foreign_word_detector.check_foreign_word(word) == 1:
    print(word)

This will print

ഇലക്ട്രിസിറ്റി
ഡോക്സ്
ഇന്റർമീഡിയറ്റ്
അബ്സ്ട്രാക്റ്റ്
ഇല്ലസ്ടേഷൻ
ഇല്ലിറ്ററേറ്റ്
റെക്കോർഡ്
procrastination
唐宸禹

mlmorph CLI tool has a new option to check for foregin words- mlmorph -f.

Evaluation results

To evaluate this system, I gave the borrowed word lexicon of mlmorph to foreign word detection and it detected 60% of the words as foreign words.

cat lexicon/english-borrowed.lex | wc -l
1309
cat lexicon/english-borrowed.lex | mlmorph -f | awk '{ if($2 == 1) { print $1 }}' | wc -l
793
echo "(793/1309)*100" | bc -l
60.58059587471352177200

Integration results

The integration of foreign word detection system, gave a 12% increase in the words coverage for mlmorph. In the 14,00,000 word in test corpus(which is non curated collection of text from Malayalam wikipedia and other websites), it helped to increase coverage from 45% to 57%.

But if the corpus is non-wikipedia content, and have less foreign words, the coverage is 80% and above as you can see from the coverage test results:

File name	Words	Analysed	Percentage
mlwiki-all-unique-words-00.txt	199986	112281	56.14%
mlwiki-all-unique-words-01.txt	199989	108650	54.33%
mlwiki-all-unique-words-02.txt	199982	119577	59.79%
mlwiki-all-unique-words-03.txt	199996	106705	53.35%
mlwiki-all-unique-words-04.txt	199994	104216	52.11%
mlwiki-all-unique-words-05.txt	199992	102098	51.05%
mlwiki-all-unique-words-06.txt	105936	64755	61.13%
deshabhimani-all-unique-words.txt	47821	34919	73.02%
26.txt	3481	2979	85.58%
18.txt	5696	4690	82.34%
8.txt	366	320	87.43%
16.txt	2043	1745	85.41%
20.txt	474	394	83.12%
9.txt	456	386	84.65%
19.txt	419	379	90.45%
24.txt	1402	1175	83.81%
10.txt	321	281	87.54%
25.txt	3657	3032	82.91%
14.txt	864	766	88.66%
2.txt	3666	3080	84.02%
7.txt	6627	5717	86.27%
21.txt	1561	1300	83.28%
15.txt	5805	5053	87.05%
13.txt	3705	3284	88.64%
12.txt	416	368	88.46%
3.txt	1843	1543	83.72%
6.txt	422	375	88.86%
11.txt	391	331	84.65%
23.txt	848	676	79.72%
4.txt	527	436	82.73%
5.txt	248	217	87.50%
1.txt	2026	1650	81.44%
17.txt	787	691	87.80%
22.txt	1010	855	84.65%
Total	1402757	794924	56.67%

mlmorph foreign-word-detection