Blogs -

POS Tagging: A review of BIS POS tagset and ILCI-II Malayalam Text Corpus

Posted on September 10, 2019 | Santhosh Thottingal

The Bureau of Indian Standards (BIS) has published a Part of Speech (POS) tagset for Indian languages. POS tagging is the process of assigning a part of speech marker to each word in a given text. In this article, I review the tagset defined in that standard. While developing the mlmorph project, I explored a candidate POS tagging schema for Malayalam. I did not choose the BIS tagset, for reasons I explain in this article. [Read More]

POS Tagging Morphology analysis

Mlmorph at MT Summit 2019

Posted on August 22, 2019 | Santhosh Thottingal

I presented the Malayalam Morphology Analyser at the 2nd Workshop on Technologies for MT of Low Resource Languages (LORESMT) in Dublin, Ireland, held as part of the 17th MT Summit.

Paper: https://www.aclweb.org/anthology/W19-6801

Presentation: https://docs.google.com/presentation/d/…

Proceedings from the LORESMT workshop: https://aclweb.org/anthology/volumes/W19-68/

paper mlmorph

Presidential award for contributions to Malayalam

Posted on August 16, 2019 | Santhosh Thottingal

Happy to share the news that I have been given an award by the President of India for contributions to the Malayalam language. The Maharshi Badrayan Vyas Samman, conferred by the Hon. President of India, is in recognition of my contributions in the field of Malayalam language. The award, instituted in 2016, is given for substantial contributions to languages such as Sanskrit, Persian, Arabic, Pali, Classical Oriya, Classical Kannada, Classical Telugu, and Classical Malayalam. It is given to young scholars in the age group of 30 to 45 years. [Read More]

award

Root Zone Label generation rules for Malayalam released

Posted on July 13, 2019 | Santhosh Thottingal

On July 10, 2019, ICANN released Label Generation Rules for eight scripts: Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Oriya, Tamil and Telugu. These rules are the criteria for determining valid domain names for the Root Zone of the Domain Name System (DNS). The Internet Corporation for Assigned Names and Numbers (ICANN) is a non-profit organization that takes care of the whole internet domain name system and registration process. Internationalized Top Level Domain Names are domain names not limited to English. [Read More]

icann idn

Markov chain for Malayalam

Posted on June 8, 2019 | Santhosh Thottingal

I have been trying to generate a Markov chain for Malayalam content. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (Wikipedia). For natural language, it represents a probabilistic model of words: the probability that one word can come after another word. This model can be prepared by feeding a large amount of text to a system that learns the probability of each word following another. [Read More]

markov
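
To make the idea in the excerpt above concrete, here is a minimal word-level sketch in Python. It is only an illustration and not the code used for the Malayalam model. The function names (`build_chain`, `transition_probabilities`, `generate`) and the toy corpus are assumptions made for this example. The chain stores, for each word, a count of the words that follow it; dividing those counts by their total gives the transition probabilities.

```python
import random
from collections import Counter, defaultdict


def build_chain(text):
    """For every word, count how often each other word follows it."""
    words = text.split()
    chain = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain


def transition_probabilities(chain, word):
    """Turn the follower counts of `word` into probabilities."""
    counts = chain.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def generate(chain, start, length, rng=None):
    """Walk the chain from `start`, picking each next word by its weight."""
    rng = rng or random.Random()
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never seen with a follower
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)


# Toy corpus (hypothetical); any whitespace-separated text works,
# including Malayalam.
corpus = "a b a c a b"
chain = build_chain(corpus)
probabilities = transition_probabilities(chain, "a")
sentence = generate(chain, "a", 5, random.Random(0))
```

In the toy corpus, "a" is followed twice by "b" and once by "c", so `probabilities` gives "b" two thirds and "c" one third. A real model would need a large text collection and probably a tokenizer that handles punctuation, which this sketch leaves out.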

Updated web interface for mlmorph

Posted on June 8, 2019 | Santhosh Thottingal

The web interface of the Malayalam morphology analyser (mlmorph) has been updated. You can see the new interface at https://morph.smc.org.in/. The new web application is written in Vue.js using the Vuetify UI framework. The backend is Flask. Source code is available at https://gitlab.com/smc/mlmorph-web

mlmorph

Chilanka version 1.400 released

Posted on June 5, 2019 | Santhosh Thottingal

A new version of the Chilanka typeface is available now. Version 1.400 is available for download from SMC’s font download and preview site, smc.org.in/fonts. For users, there are not many changes, but the source code and build system got a major upgrade. The source has been moved from the FontForge SFD format to the UFO format, which makes it possible to work with modern font editors. The master design now uses cubic beziers, and OTF is generated along with TTF. The original drawings for Chilanka used cubic beziers. [Read More]

fonts

Lexicon Curation for Mlmorph

Posted on May 26, 2019 | Santhosh Thottingal

One of the key components of Mlmorph is its lexicon. The lexicon contains the root words categorized as nouns, verbs, adjectives, adverbs etc. These are the components used with morphological rules to generate the vocabulary of Malayalam. I collected an initial lexicon of about 100,000 words from various sources such as Wikipedia, CLDR and many targeted web crawls. One problem with words collected this way is that they often contain spelling mistakes. Secondly, classifying these words is not possible without the tedious task of a person going through each and every word. [Read More]

mlmorph

LibreOffice Malayalam spellchecker using mlmorph

Posted on March 10, 2019 | Santhosh Thottingal

A few months back, I wrote about the spellchecker based on the Malayalam morphology analyser. I was also trying to integrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help me in its further development, please read on. Malayalam spellchecker – a morphology analyser based approach Blog post on spellchecker approach and pla Current status The LibreOffice spellchecker for Malayalam is available at https://gitlab. [Read More]

libreoffice mlmorph spell checker

Malayalam Named Entity Recognition using morphology analyser

Posted on March 10, 2019 | Santhosh Thottingal

Named Entity Recognition, the task of identifying and classifying real-world objects such as persons, places and organizations in a given text, is a well-known NLP problem. For Malayalam, several research papers have been published on this topic, but none of them is functional or reproducible. The morphological characteristics of Malayalam have always made this problem a challenge. When the named entities appear in an inflected or agglutinated complex word, the first step is to analyse such words and arrive at the root words. [Read More]

mlmorph morphology named entity recognition