Foreign word detection in mlmorph
The test corpus for Malayalam Morphological analysis has many foreign words. They are either written in a non-Malayalam script or written in Malayalam. For example, “ഇലക്ട്രിസിറ്റി”, “ഡോക്സ്”, “ഇന്റർമീഡിയറ്റ്”, “അബ്സ്ട്രാക്റ്റ്”, “ഇല്ലസ്ടേഷൻ”, “ഇല്ലിറ്ററേറ്റ്”, “റെക്കോർഡ്”, “procrastination”, “唐宸禹” - These are all foreign words and it is useless to analyse them using mlmorph. Since mlmorph works based on a root word lexicon, it is practically impossible to have them in lexicon. So there should be a way to identify the words easily and tag them as FW - Foreign word Part of speech.
[Read More]