For the last several months, I am actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introduction blog post is a good start. I was always skeptical about the approach and the whole project as such looked very ambitious. But, now I am almost confident that the approach is viable. I am making good progress in the project, so this is some updates on that.
Analyser coverage statistics
Recently I added a large corpora to frequently monitor the percentage of words the analyser can parse. The corpora was selected from two large chapters of ഐതിഹ്യമാല, some news reports, an art related essay, my own technical blog posts to have some diversity in the vocabulary.
This is a very encouraging. Achieving a 66% for such a morphologically rich language Malayalam is no small task. From my reading, Turkish and Finnish, languages with same complexity of morphology achieved about 90% coverage. It may be more difficult to increase the coverage for me compared to achieving this much so far. So I am planning some frequency analysis on words that are not parsed by analyser, and find some patterns to improve.
The performance aspect is also notable. Once the automata is loaded to memory, the analysis or generation is super fast. You can see that ~16000 words were analyzed under half of a second.
From the very beginning the project was test driven. I now has 740 test cases for various word forms
The compiled transducer now is 6.2 MB. The transducer is written in SFST-PL and compile using SFST. It used to be compiled using hfst, but hfst is now severely broken for SFST-PL compilation, so I switched to SFST. But the compiled transducer is read using hfst python binding.
Number of states
Number or arcs
Number of final states
The POS tagged lexicon I prepared is from various sources like wiktionary, wikipedia(based on categories), CLDR. While developing I had to improve the lexicon several times since none of the above sources are accurate. The wiktionary also introduced a large amount of archaic or sanskrit terms to the lexicon. As of today, following table illustrates the lexicon status
English borrowed nouns
Affirmations and negations
As you can see, the lexicon is not that big. Especially it is very limited for proper nouns like names, places. I think the verb lexicon is much better. I need to find a way to expand this further.
There is no agreement or standard on the POS tagging schema to be used for Malayalam. But I refused to set this is as a blocker for the project. I defined my own POS tagging schema and worked on the analyser. The general disagreement is about naming, which is very trivial to fix using a tag name mapper. The other issue is classification of features, which I found that there no elaborate schema that can cover Malayalam.
The main challenge I am facing is not technical, it is linguistic. I am often challenged by my limited understanding of Malayalam grammar. Especially about the grammatical classifications, I find it very difficult to come up with an agreement after reading several grammar books. These books were written in a span of 100 years and I miss a common thread in the approach for Malayalam grammar analysis. Sometimes a logical classification is not the purpose of the author too. Thankfully, I am getting some help from Malayalam professors whenever I am stuck.
The other challenge is I hardly got any contributor to the project except some bug reporting. There is a big entry barrier to this kind of projects. The SFST-PL is not something everybody familiar with. I need to write some simple examples for others to practice and join.
I found that some practical applications on top of the morphology analyser is attracting more people. For example, the number spellout application I wrote caught the attention of many people. I am excited to present the upcoming spellchecker that I was working recently. I will write about the theory of that soon.
Santhosh and I jointly presented a paper at Typoday 2018. The paper was titled ‘Spiral splines in typeface design: A case study of Manjari Malayalam typeface’. The full paper is available here. The presentation is available here.
Typoday is the annual conference where typographers and graphic designers from academia and industry come up with their ideas and showcase their work. Typoday 2018 was held at Convocation Hall, University of Mumbai.
The ligatures for the Malayalam conjuncts ച്ച, ള്ള have less popular variants as shown below
The second form is not seen in print but often in handwritten Malayalam. I have seen it a lot in bus boards especially at Thiruvananthapuram. There are no digital typefaces with the second style, except the Chilanka font I designed. It uses the second variant of ച്ച. I got lot of appreciation for that style variant, but also recieved request for the first form of ച്ച. I had a private copy of Chilanka with that variant and had given to whoever requested. I also recieved some requests for the second style of ള്ള. For the Manjari font too, I recieved requests for second variant.
Today I am announcing the new version of Manjary and Chilanka font, with these two forms as optional variants without the need for a different copy of a font. In a single font, you will get both these variants using the Opentype stylistic alternatives feature.
The default styles of ച്ച and ള്ള are not changed in new version. The fonts comes with an option to chose a different form.
Choosing the style for webfonts using CSS
Use the font-feature-settings CSS style to choose a style. For the element or class in the html, use it as follows:
For style 1:
font-feature-settings: "salt" 1;
For style 2:
font-feature-settings: "salt" 2;
Choosing the style variant in LibreOffice
In the place of the font name in font selector, append :salt=1 for first style, :salt=2 for second style. So you need to give Manjari Regular:salt=2 as the font name for example to get second style.
Writing a number 6493 as six thousand four hundred and ninety three is known as spellout of that number. The most familiar example of this is in cheques. Text to speech systems also need to convert numbers to words.
The reverse process of this, to convert a phrase like six thousand four hundred and ninety three to number 6493 – the number generation, is also common. In software, it is often required in Speech recognition and in general any kind of semantic analysis of text.
Numbers and its conversion to English words is not really a complex problem to solve with a computer. But how about other languages? In this article, I am discussing the nature of these words in Malayalam and an approach to parse the number and numbers written in words.
Malayalam number spellout
In Malayalam, the spellout of numbers forms a single word. For example, a number 108 is നൂറ്റെട്ട് – a single word. This word is formed by adjective form of നൂറ്(100) and എട്ട്(8). While these two words are glued, Malayalam phonological rules are also applied, resulting this single word നൂറ്റെട്ട്. This word formation characteristics are present for almost all possible numbers you can imagine. Parsing the number നൂറ്റെട്ട് and interpreting it as 108 or converting 108 to നൂറ്റെട്ട് is an interesting problem in Malayalam computing.
I came across this problem while I was trying to develop a dictionary based spellchecker years back. Such a dictionary should have all these single words for all possible numbers, right? Then how big it will be? Later when I was researching on Malayalam morphology analyser, I again encountered this problem. You cannot have all these words in lexicon as entries – it is not practical. At the same time, you should be able to parse these words and and also generate with correct morpho-phonological rules of Malayalam.
Like I mentioned in my introduction article of my Malayalam morphological analyser, project, Malayalam is a heavily agglutinative language. While I was learning the Finite transducer technology, Malayalam number words were one of the obvious candidates to try out. These numbers perfectly model Malayalam word formations. They get agglutinated and inflected, during which morpho-phonological rules get applied. നൂറ്റെട്ടിലായിരുന്നു, നൂറ്റെട്ടിനെ, നൂറ്റെട്ടോ? നൂറ്റെട്ടാം, നൂറ്റെട്ടാമത്തെ, നൂറ്റെട്ടര – All are examples of words you get on top number word നൂറ്റെട്ട്. Also, it is not two word agglutination, പതിനാറായിരത്തൊരുനൂറ്റെട്ട് – 16108 is an example where പതിനാറ്(16), ആയിരം(1000), നൂറ്(100), എട്ട്(8) – all joined to form a single word. In fact this is a common word you often see in literature because of this myth about Lord Krishna. The current year, 2017 is often written as രണ്ടായിരത്തിപ്പതിനേഴ്.
Let us examine a nature of these word formation.
Numbers between 0 and 9 has words as പൂജ്യം, ഒന്ന്, രണ്ട്, മൂന്ന്, നാല്, അഞ്ച്, ആറ്, ഏഴ്, എട്ട്, ഒമ്പത് respectively. The word ഒമ്പത് is sometimes written as ഒൻപത് too, which is phonetically similar to ഒമ്പത്. Each of these words ending with Virama(്) is sometimes written with Samvruthokaram too. ഒന്ന് – ഒന്നു്, രണ്ടു്, മൂന്നു്, നാലു് etc.
Number 10 is പത്ത്. Multiples of tens till 80 follows the rough pattern:
Adjective form of [രണ്ട്|മൂന്ന്|നാല്|അഞ്ച്|ആറ്|ഏഴ്|എട്ട്] + പത്.
So, they are ഇരുപത്(20), മുപ്പത്(30), നാല്പത്(40), അമ്പത്(50), അറുപത്(6), എഴുപത്(70), എൺപത്/എമ്പത്(80). But at 90, a new form emerges – തൊണ്ണൂറ് – Which has no root on ഒമ്പത് (9). Instead it is more like something before നൂറ്(100).
The numbers 11-19 are unique words. പതിനൊന്ന്, പന്ത്രണ്ട്, പതിമൂന്ന്, പതിനാല്, പതിനഞ്ച്, പതിനാറ്, പതിനേഴ്, പതിനെട്ട്, പത്തൊമ്പത് respectively.
All other two digit numbers between the multiples of tens follow the following pattern
[Word for 10x] + [Word for Ones]
So, 21 is ഇരുപത്(20)+ ഒന്ന്(1). But to form a single word, An adjective form is used, which is similar to female gender inflection of Malayalam nouns- ഇരുപത്തി + ഒന്ന് . Phonological rules should be applied to combine these two words. The vowel sign ി(i) at the end of ഇരുപത്തി will introduce a new consonant യ(ya). Also the first letter of ഒന്ന് – the vowel ഒ will change to its vowel sign form ൊ. So we get ഇരുപത്തി + യ + ൊന്ന്. It results ഇരുപത്തിയൊന്ന്. This phonological rule is actually Agama Sandhi / ആഗമ സന്ധി as per Malayalam grammer rules. But, ഇരുപത്തിയൊന്ന് has a more propular form, ഇരുപത്തൊന്ന് which is generated by dropping ി + യ from the generation process.
The words for 20s can be generated similarly. ഇരുപത്തിരണ്ട്(22), ഇരുപത്തിമൂന്ന്(23), ഇരുപത്തിനാല്(24), ഇരുപത്തിയഞ്ച്/ഇരുപത്തഞ്ച്(25), ഇരുപത്തിയാറ്/ഇരുപത്താറ്(26), ഇരുപത്തിയേഴ്/ഇരുപത്തേഴ്(27), ഇരുപത്തിയെട്ട്/ഇരുപത്തെട്ട്(28), ഇരുപത്തിയൊമ്പത്/ഇരുപത്തൊമ്പത്(29). For all other two digit numbers the pattern is same. Note that തൊണ്ണൂറ് (90) has the prefix form തൊണ്ണൂറ്റി. So 98 is തൊണ്ണൂറ്റിയെട്ട്/തൊണ്ണൂറ്റെട്ട്.
100 is നൂറ്. Its prefix form is നൂറ്റി. Multiples of 100s is somewhat similar to multiples of 10s we saw above. They are ഇരുന്നൂറ്(200), മുന്നൂറ്(300), നാനൂറ്(400), അഞ്ഞൂറ്(500), ആറുനൂറ്(600), എഴുന്നൂറ്(700), എണ്ണൂറ്(800), തൊള്ളായിരം(900). Here also the 900 deviates from others. The word is related to 1000(ആയിരം) than 100 – Just like the case of 90-തൊണ്ണൂറ് we discussed above.
Forming 3 digits numbers is, in general the prefix of multiple of hundred followed by Tens we explained above. So 623 is അറുനൂറ് + ഇരുപത്തിമൂന്ന് = അറുനൂറ്റിയിരുപത്തിമൂന്ന് or the more popular and short form അറുനൂറ്റിരുപത്തിമൂന്ന്. 817 is എണ്ണൂറ്റി+ പതിനേഴ് = എണ്ണൂറ്റിപ്പതിനേഴ് with gemination of consonant പ as per phonological rule. 999 is തൊള്ളായിരത്തിത്തൊണ്ണൂറ്റിയൊമ്പത് or തൊള്ളായിരത്തിത്തൊണ്ണൂറ്റൊമ്പത് or തൊള്ളായിരത്തിത്തൊണ്ണൂറ്റിയൊൻപത്.
Numbers between 100-199 may optionally prefixed by ഒരു – Adjective form of ഒന്ന്(1). 101 – ഒരുന്നൂറ്റിയൊന്ന് 122-ഒരുന്നൂറ്റിയിരുപത്തിരണ്ട് etc. നൂറ്(100) can be also ഒരുന്നൂറ്
1000 is ആയിരം. ആയിരത്തി is prefix for all other 4 digit numbers till 1 lakh(ലക്ഷം 100000). Multiples of 1000 can be generated by suffixing ആയിരം. For example, 4000 is നാല് + ആയിരം = നാലായിരം. 6000 – ആറായിരം. But 5000 is അയ്യായിരം, and അഞ്ചായിരം is less popular version. 8000 is എട്ട് + ആയിരം = എട്ടായിരം, but എണ്ണായിരം is popular form. 10000 is പത്ത് + ആയിരം = പത്തായിരം. But പതിനായിരം is the more familiar version. പതിനായിരം is the suffix for multiples of 10K. They are ഇരുപതിനായിരം, മുപ്പതിനായിരം, നാല്പതിനായിരം, അമ്പതിനായിരം, അറുപതിനായിരം, എഴുപതിനായിരം, എൺപതിനായിരം, തൊണ്ണൂറായിരം. 3000 is മുവ്വായിരം than മൂന്നായിരം. So 73000 is എഴുപത്തിമുവ്വായിരം or എഴുപത്തിമൂന്നായിരം.
Numbers between 1000-1999 may optionally prefixed by ഒരു – Adjective form of ഒന്ന്(1). 1008 – ഒരായിരത്തിയെട്ട് 1122-ഒരായിരത്തിയൊരുന്നൂറ്റിയിരുപത്തിരണ്ട് etc. ആയിരം(1000) can be also ഒരായിരം.
Lakhs & Crores
100, 000 is ലക്ഷം. ലക്ഷത്തി is prefix. 1,00, 00, 000 is കോടി. കോടി itself is prefix. 12,00,90 is പന്ത്രണ്ടുലക്ഷത്തിത്തൊണ്ണൂറ്. 99,00,00,00,00,00,00 is തൊണ്ണൂറ്റൊമ്പതുലക്ഷംകോടി.
Why morphology analyser?
From the above explanation of word formation for numbers in Malayalam, one can see that there are patterns and there are lot of exceptions. But still, isn’t it possible to write a generator using just a rule based program in a programming language. I would agree. Yes, it is possible. But other than mapping these numbers to word forms, handling exceptional rules, there are a few other things also we saw. When words are agglutinated, there are phonological rules in action. Also, I said that these words can be inflected again. We also want the bidirectional conversion – not just word generation, but converting those words back into a number. All these will make such a program so complicated and it has to duplicate so many things from morphology analyser. That is why I used morphology analyser here.
What are the morphemes in a string like ആയിരത്തിത്തൊള്ളായിരത്തിത്തൊണ്ണൂറ്റിയാറ്? ആയിരം, തൊള്ളായിരം, തൊണ്ണൂറ്, ആറ്? Sounds good, but we see that തൊള്ളായിരം is ഒമ്പത്, നൂറ്. and തൊണ്ണൂറ് is ഒമ്പത്, പത്ത്. So expanding it, we get ആയിരം, ഒമ്പത്, നൂറു, ഒമ്പത്, പത്ത്, ആറ്. But this sequence does not make any sense of the single word it created. What is missing? Can we consider തൊള്ളായിരം, തൊണ്ണൂറ് as single morphemes? We can, but…
If തൊള്ളായിരം is a morpheme, it means, it is in a lexicon. That makes all other 3 digit number also eligible to be listed as items in lexicon. So ultimately, we go back to the large lexicon/dictionary issue I mentioned in the beginning of the article.
Semantically, any number spellout is originated from Ones and their place value. So തൊണ്ണൂറ് is 9<tens>.
I have not seen any morphology analyser dealing with number spellout. It seems Malayalam numbers are so unique in this aspect. I read a few academic papers on dealing with this complexity using Rule based approaches(See References) and an automata like paradigm language(Richard Gillam – A Rule-Based Approach to Number Spellout).
The approach I derived after trying out some choices is as follows:
Introduce morphology tags for positional values. This is similar to POS tags, but here we apply for number spellouts. <ones>, <tens>, <hundreds>, <thousands>, <lakhs>, <crores> are those tags.
Parse a spellout to reach the atomic morphemes in a number spellout – they are ഒന്ന്, രണ്ട്, മൂന്ന്, നാല്, അഞ്ച്, ആറ്, ഏഴ്,എട്ട്, ഒമ്പത്, പൂജ്യം.
These morphemes will have the tags mentioned above.
To illustrate this, let use use some examples,
As you can observe, only the atomic numbers are used as morphemes and place values are indicated using tags. You can also see that the analysis is easy to interpret for a program to generate the number.
For example, if the analysis is രണ്ട്<ones><thousands> ഒന്ന്<tens> ഏഴ്<ones>, replace the words with its numbers, tags by position value. You get
2*1*1000 + 1*10 + 7*1 = 2000+10+7 = 2017
I said that, the advantage of morphology analyser is you can generate the word from analysis strings. The bidirectional property. This means, if you have a number, you can generate the spellout. For that we first need to some maths on the number. For example, for same number 2017, we can divide incrementally by lakhs, thousands, hundreds, tens and arrive at the following formation
2017 = 2*1000 + 0*100 + 1*10+ 7*1
Which can be converted to:
The morphology analyser can easily generate the word രണ്ടായിരത്തിപ്പതിനേഴ് by applying all grammatical rules.
I did not write a convertor from the spelled out word to number. You are free to write one. The web interface of mlmorph is available for trying out some analysis too – https://morph.smc.org.in/
Some illustrations on inflected spellout analysis
Ordinal form of numbers are used to show position. Examples are first, third etc. In Malayalam examples are ഒന്നാം, പതിനെട്ടാം ഏഴാമത്, ഒമ്പതാമത്തെ etc. Supporting those forms is just like inflections. See the below screenshot
The system can support numbers till 99,00,00,00,00,00,00 That is Ninety nine lakh crores
Some commonly used forms like മുപ്പത്തിമുക്കോടി is not supported yet.There are also variations like മുവ്വായിരം, മൂവായിരം.
If there are are multiple ways to generate a number word, the system generates all such forms. But some of these forms may be very obscure and not used at all.
There is a practice to insert space after some prefixes like ആയിരത്തി, ലക്ഷത്തി, കോടി. In the model I assumed the words are generated as single word.
We analysed the word formation for the spellout of the numbers in Malayalam. Usage of morphology analyser for analysis and generation of these word forms are introduced. A demo program that converts numbers to its word forms considering all morphophonological rules are presented. Algorithm for spelled out word to number conversion is given with example. Programmable API and Web API is given for the system.
Malayalam is a highly inflectional and agglutinative language. This has posed a challenge for all kind of language processing. Algorithmic interpretation of Malayalam’s words and their formation rules continues to be an untackled problem. My own attempts to study and try out some of these characteristics was big failure in the past. Back in 2007, when I tried to develop a spellchecker for Malayalam, the infinite number of words this language can have by combining multiple words together and those words inflected was a big challenge. The dictionary based spellechecker was a failed attempt. I had documented these issues.
I was busy with my type design projects for last few years, but continued to search for the solution of this problem. Last year(2016), during Google summer of code mentor summit at Google campus, California, mentors working on language technology had a meeting and I explained this challenge. It was suggested that I need to look at Finnish, Turkish, German and such similarly inflected and agglutinated languages and their attempts to solve this. So, after the meeting, I started studying some of the projects – Omorfi for Finnish, SMOR for German, TRMorph for Turkish. All of them use Finite state transducer technology.
There are multiple FST implementation for linguistic purposes – foma, XFST – The Xerox Finite State Toolkit, SFST – The Stuttgart Finite State Toolkit and HFST – The Helsinki Finite State Toolkit. I chose SFST because of good documentation(in English) and availability of reference system(TRMorph, SMOR). And now we have mlmorph – Malayalam morphology analyser project in development here: https://github.com/santhoshtr/mlmorph
I will document the system in details later. Currently it is progressing well. I was able to solve arbitrary level agglutination with inflection. Nominal inflection and Verbal inflections are being solved one by one. I will try to provide a rough high level outline of the system as below.
Lexicon: This is a large collection of root words, collected and manually curated, classified into various part of speech categories. So the collection is seperated to nouns, verbs, conjunctions, interjections, loan words, adverbs, adjectives, question words, affirmatives, negations and so on. Nouns themselves are divided to pronouns, person names, place names, time names, language names and so on. Each of them get a unique tag and will appear when you analyse such words.
Morphotactics: Morphology rules about agglutination and inflection. This includes agglutination rules based on Samasam(സമാസം) – accusative, vocative, nominative, genitive, dative, instrumental, locative and sociative. Also plural inflections, demonstratives(ചുട്ടെഴുത്തുകൾ) and indeclinables(അവ്യയങ്ങൾ). For verbs, all possible tense forms, converbs, adverbal particles, concessives(അനുവാദകങ്ങൾ) and so on.
Phonological rules: This is done on top of the results from morphotactics. For example, from morphotactics, ആൽ<noun>, തറ<noun>, ഇൽ<locative> will give ആൽ<noun>തറ<noun>ഇൽ<locative>. But after the phonological treatment it becomes, ആൽത്തറയിൽ with consonant duplication after ൽ, and ഇ becomes യി.
Automata definition for the above: This is where you say nouns can be concatenated any number of times, following optional inflection etc in regular expression like language.
Programmable interface, web api, command line tools, web interface for demos.
What it can do now? Following screenshot is from its web demo. You can see complex words get analysed to its stems, inflections, tense etc.
Note that this is bidirectional. You can give a complex word, it will give analysis. Similarly when you give root words and POS tags, it will generate the complex word from it. For example:
Covering all possible word formation rules for Malayalam is an ambitious project, but let us see how much we can achieve. Now the effort is more on linguistic aspects of Malayalam than technical. I will update about the progress of the system here.
TruFont the font-editing application written with Python3, ufoLib, defcon and PyQt5 now has support for pasting SVG images as glyphs. It now also support drag and dropping SVG files.
For my font design workflow I mainly use Inkscape to desgin master drawings and then use fonteditor for further editing. I am migrating the fonts we maintained to Trufont from Fontforge(It is no longer developed). But, not having SVG support with Trufont was a blocker for me. So today I filed two patches and got merged to Trufont master.
ഫോണ്ട് ഇപ്പോൾ TTF നു പകരം OTF ഫോർമാറ്റിൽ ആണ് സ്വതവേ വരുന്നതു്. മഞ്ജരി രൂപകല്പന ചെയ്തത് OTF ഫോർമാറ്റ് മുന്നിൽ കണ്ടായിരുന്നെങ്കിലും(ക്യുബിക് ബെസിയർ കർവുകൾ) എല്ലാ ഓപ്പറേറ്റിങ്ങ് സിസ്റ്റങ്ങളിലും കുറ്റമറ്റതായി കാണാത്തതുകൊണ്ട് TTF ൽ ആയിരുന്നു ആദ്യം പുറത്തിറക്കിയതു്. ഇപ്പോൾ ആ പ്രശ്നങ്ങളൊക്കെ പരിഹരിച്ചിട്ടുണ്ടു്. TTF, Webfonts എന്നിവയും വേണമെങ്കിൽ ഡൌൺലോഡ് ചെയ്യാം.
ഫോണ്ട് ഫോർജ് ആയിരുന്നു മഞ്ജരിയടക്കമുള്ള എല്ലാ ഫോണ്ടുകളും എഡിറ്റ് ചെയ്യാൻ ഞങ്ങൾ ഉപയോഗിച്ചിരുന്നതു്. വളരെ പഴയ ആ സോഫ്റ്റ്വെയർ അതിന്റെ ഡെവലപ്മെന്റ് പതിയെ നിർത്തുകയാണ്. Unified Font Object ഫോർമാറ്റിലേക്ക് എല്ലാ ഫോണ്ടുകളുടെയും സോഴ്സ് കോഡ് മാറുകയും Trufont പോലുള്ള പുതിയ എഡിറ്റർ വരുകയും ചെയ്യുന്നുണ്ട്. ഈ മാറ്റത്തിനൊപ്പം പോകാൻ ആദ്യമായി മഞ്ജരിയുടെ സോഴ്സ് കോഡ് UFO ഫോർമാറ്റിലേക്ക് മാറ്റി. ബാക്കി ഫോണ്ടുകളും പതിയെ അങ്ങനെ മാറ്റി ഫോണ്ട് ഫോർജുമായുള്ള ബന്ധം ഉപേക്ഷിക്കും.
Out of all my projects, this is the project that gave me highest satisfaction. I see people using it in social media every day for memes, banners, notices. I have seen the font used for Government publishings, notices, reports. I have seen wedding invitations, book covers, Movie titles with Manjari font. I am so happy that Malayalam speakers loved it.
Kavya and me are working on new version of Manjari with more glyphs to support latest Unicode, and optimize some of the ligature rules. The latest version of the font is always available at https://smc.org.in/fonts