LibreOffice Malayalam spellchecker using mlmorph

A few months back, I wrote about a spellchecker based on the Malayalam morphology analyser. I have also been trying to integrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help me in its further development, please read on.

Blog post on spellchecker approach and pla

Current status

The LibreOffice spellchecker for Malayalam is available at https://gitlab.com/smc/mlmorph-libreoffice-spellchecker. Get the code by cloning the repository with git, or download the master branch as a zip file.

You need LibreOffice 4.1 or later. Latest version is recommended. In the source code directory, run make install to install the extension.

Open LibreOffice Writer and add some Malayalam text. Make sure to select Malayalam as the language by choosing it from the menu or the bottom status bar. You should see the spelling check in action… if everything goes as expected 😉

LibreOffice language settings. You can see mlmorph listed.
Spellchecker in action in LibreOffice Writer.

How can you help?

Theoretically, the extension should work on non-Linux platforms as well, but I have not tested it. The extension needs python3 and python-hfst for the operating system. However, python-hfst is not available for 64-bit Python installations on Windows. If you test it and get the extension working, please add documentation, and if anything is missing that would make the installation easier, let me know.
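If you want to check the prerequisites before installing the extension, a minimal sketch like the following may help. It assumes that python-hfst is importable under the module name hfst; the function name is only illustrative and is not part of the extension. Note that LibreOffice may use its own Python interpreter on some platforms, so the check is only meaningful when run with that interpreter.

```python
import importlib.util
import sys


def check_spellchecker_prerequisites():
    """Report whether Python 3 and the hfst binding appear to be available."""
    return {
        "python3": sys.version_info.major >= 3,
        # find_spec returns None when the module cannot be located.
        # Assumption: python-hfst installs a module named "hfst".
        "hfst": importlib.util.find_spec("hfst") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_spellchecker_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```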

As the mlmorph project gets wider coverage of Malayalam vocabulary, the quality of the spellchecker improves automatically.

Gayathri – New Malayalam typeface

Swathanthra Malayalam Computing is proud to announce Gayathri – a new typeface for Malayalam. Gayathri is designed by Binoy Dominic, opentype engineering by Kavya Manohar and project coordination by Santhosh Thottingal.

This typeface was financially supported by Kerala Bhasha Institute, a Kerala government agency under the cultural department. This is the first time SMC has worked with the Kerala Government to produce a new Malayalam typeface.

Gayathri is a display typeface, available in Regular, Bold, Thin style variants. It is licensed under Open Font License. Source code, including the SVG drawings of each glyph is available in the repository. Gayathri is available for download from smc.org.in/fonts#gayathri

Gayathri has soft, rounded terminals, strokes with varying thickness and good horizontal packing. Gayathri has a large glyph set for supporting Malayalam traditional orthography, which is the new trend in contemporary Malayalam. With a total of 1124 glyphs, Gayathri also has basic Latin coverage. All Malayalam characters defined up to Unicode 11 are supported.

There are not many Malayalam typefaces designed for titles and large displays. We hope Gayathri will fill that gap.

This is also the first typeface by Binoy Dominic. He has proved his lettering skills in his profession as a graphic designer, working on branding with Malayalam content for his clients.

Binoy prepared all glyphs as SVGs, and our scripts converted them to UFO sources. Trufont was used for small edits. Important glyph information such as bearings and names was defined in a YAML configuration. Build scripts generated valid UFO sources and fontmake was used to build the OTF output. Of course, there were a lot of cycles of design fine-tuning. GitLab CI was used for running the build chain and testing. Fontbakery was used for quality assurance. The UFO Normalizer and UFO Lint tools were also part of the build system.

Swanalekha input method now available for Windows and Mac

The Swanalekha transliteration-based Malayalam input method is now available on Windows and Mac platforms. Thanks to Ramesh Kunnappully, who wrote the Keyman implementation.

I wrote this input method in 2008. In those days SCIM was the popular input method for Linux. Later it was rewritten for M17N and used with either IBus or FCITX. A few years later, this input method was made available on Android using Indic keyboard. Last year, due to requests from Windows and Mac users, Chrome and Firefox extensions were prepared. Thanks to SIL Keyman, we have now made it available on those operating systems as well.

By this, Swanalekha Malayalam becomes an input method you can use in all operating systems and phones.

Detailed documentation and downloads are available on the Swanalekha website. Source code: gitlab.com/smc/swanalekha. A small video illustrating the installation, configuration and use in Windows 10 is given below.

Update: The keyboard is now served by Keyman from their website, and the number of supported platforms has also increased.

Download options from https://keyman.com/keyboards/swanalekha_malayalam

Malayalam morphology analyser – First release

Photo by Ankush Minda on Unsplash

I am happy to announce the first version of Malayalam morphology analyser.

After two years of development, I tagged version 1.0.0.

In this release

In this release, mlmorph can analyse and generate Malayalam words using the defined morpho-phonotactical rules and a lexicon. We have a test corpus of fifty thousand words, and 82% of the words in it are recognized by the analyser.
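A coverage figure like this can be computed as the fraction of corpus words for which the analyser returns at least one analysis. The sketch below shows the idea only: toy_analyse and its tags are hypothetical stand-ins, not the real analyser or the project's own test script.

```python
def coverage(words, analyse):
    """Return the fraction of words for which analyse() yields a result."""
    words = list(words)
    if not words:
        return 0.0
    recognised = sum(1 for word in words if analyse(word))
    return recognised / len(words)


# Hypothetical stand-in for the real analyser: it knows only two words,
# and the tags shown are made up for illustration.
KNOWN = {"ചിരി": ["ചിരി<n>"], "ചിരിയുടെ": ["ചിരി<n><genitive>"]}


def toy_analyse(word):
    return KNOWN.get(word, [])


if __name__ == "__main__":
    corpus = list(KNOWN) + ["xyz", "abc"]
    print(f"coverage: {coverage(corpus, toy_analyse):.0%}")
```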

A Python interface is released to make the library very easy for developers to use. The library is available on pypi.org: https://pypi.org/project/mlmorph/

Installing it is very easy:

pip install mlmorph

It avoids all difficulties of compiling the sfst formalism and installing the required hfst, sfst packages.

For detailed Python API documentation and the command line utility, refer to https://pypi.org/project/mlmorph/
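As a rough idea of what using the library looks like, here is a minimal sketch. It assumes the package exposes an Analyser class with an analyse() method, as I recall from the project page; treat those names as assumptions and check the documentation linked above for the authoritative API. The analyse_words helper is purely illustrative.

```python
def analyse_words(words):
    """Analyse each word with mlmorph; return None if mlmorph is missing."""
    try:
        # Assumption: `pip install mlmorph` provides an Analyser class.
        from mlmorph import Analyser
    except ImportError:
        return None
    analyser = Analyser()
    # Assumption: analyse() returns the candidate analyses for one word.
    return {word: analyser.analyse(word) for word in words}


if __name__ == "__main__":
    print(analyse_words(["ചിരിയുടെ"]))
```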

Next

There are a lot of known limitations in the current release. I plan to address them in future releases.

  • Expand the lexicon further: The current lexicon was compiled by testing various texts and adding the missing words found in them. Preparing the coverage test corpus also helped to increase the lexicon. But it still needs more improvement.
  • Many language-specific constructs which are commonly used, but consist of multiple conjunctions and adjectives, are not well covered. Some examples are മറ്റൊരു, പിന്നീട്, അതുപോലെത്തന്നെ, എന്നതിന്റെ etc.
  • Optimizing the weight calculation: As the lexicon size increases, many rarely used words can become alternate parts in the agglutination of words. For example, പാലക്കാട് can have an analysis of പാല്, അക്ക്, ആട്. Even though this is grammatically correct, it should get less preference than പാലക്കാട്<proper noun>.
  • Standardization of POS tags: mlmorph has its own POS tag definitions. These tags need documentation with examples. I tried to use Universal Dependencies as much as possible, but it is not enough to cover all required tags for Malayalam.
  • Documentation of the formalism and tutorials for developers. So far I am the only developer for the project, which I am not happy about. The learning curve for this project is too steep to attract new developers. An above average understanding of Malayalam grammar is a difficult requirement too. I am planning to write some tutorials to help new developers join.
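To illustrate the weight problem mentioned in the list above, here is a minimal sketch of ranking candidate analyses. It assumes each analysis comes with a numeric weight and that a lower weight means a more preferred analysis, which is the usual convention in weighted finite state transducers. The weights and the <n> tag below are made up for illustration.

```python
def rank_analyses(analyses):
    """Sort (analysis, weight) pairs so the lowest weight comes first."""
    return sorted(analyses, key=lambda pair: pair[1])


# Hypothetical candidates for പാലക്കാട്; the weights are invented.
candidates = [
    ("പാല്<n>അക്ക്<n>ആട്<n>", 300),    # three-part compound reading
    ("പാലക്കാട്<proper noun>", 100),     # single proper noun reading
]

if __name__ == "__main__":
    best, weight = rank_analyses(candidates)[0]
    print(best, weight)
```

With weights like these, the proper noun reading is listed first while the compound reading is still kept as a lower-ranked alternative.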

Applications

The project is meaningful only when practical applications are built on top of this.



The many forms of ചിരി ☺️

This is an attempt to list all forms of the Malayalam word ചിരി (meaning: ☺️, smile, laugh). For those who are unfamiliar with Malayalam, it is a highly inflectional Dravidian language. I am actively working on a morphology analyser (mlmorph) for the language, as outlined in one of my previous blog posts.

I prepared this list as a test case for mlmorph project to evaluate the grammar rule coverage. So I thought of listing it here as well with brief comments.
1. ചിരി
ചിരി is a noun. So it can have all nominal inflections.

2. ചിരിയുടെ
3. ചിരിക്ക്
4. ചിരിയ്ക്ക്
5. ചിരിയെ
6. ചിരിയിലേയ്ക്ക്
7. ചിരികൊണ്ട്
8. ചിരിയെക്കൊണ്ട്
9. ചിരിയിൽ
10. ചിരിയോട്
11. ചിരിയേ

There is a plural form
12. ചിരികൾ

A number of agglutinations can happen at the end of the word using Affirmatives, negations, interrogatives etc. For example, ചിരിയുണ്ട്, ചിരിയില്ല, ചിരിയോ. But now I am ignoring all agglutinations and listing only the inflections.

ചിരിക്കുക is the verb form of ചിരി.
13.  ചിരിക്കുക

It can have the following tense forms
14. ചിരിച്ചു
15. ചിരിക്കുക
16. ചിരിക്കും

A concessive form for the word
17. ചിരിച്ചാലും

This verb has the following aspects
18. ചിരിക്കാറ്
19. ചിരിച്ചിരുന്നു
20. ചിരിച്ചിരിയ്ക്കുന്നു
21. ചിരിച്ചിരിക്കുന്നു
22. ചിരിച്ചിരിക്കും
23. ചിരിച്ചിട്ട്
24. ചിരിച്ചുകൊണ്ടിരുന്നു
25. ചിരിച്ചുകൊണ്ടേയിരുന്നു
26. ചിരിച്ചുകൊണ്ടേയിരിക്കുന്നു
27. ചിരിച്ചുകൊണ്ടിരിക്കുന്നു
28. ചിരിച്ചുകൊണ്ടിരിക്കും
29. ചിരിച്ചുകൊണ്ടേയിരിക്കും

There are number of mood forms for the verb ചിരിക്കുക
30. ചിരിക്കാവുന്നതേ
31. ചിരിച്ചേ
32. ചിരിക്കാതെ
33. ചിരിച്ചാൽ
34. ചിരിക്കണം
35. ചിരിക്കവേണം
36. ചിരിക്കേണം
37. ചിരിക്കേണ്ടതാണ്
38. ചിരിക്ക്
39. ചിരിക്കുവിൻ
40. ചിരിക്കൂ
41. ചിരിക്ക
42. ചിരിച്ചെനെ
43. ചിരിക്കുമേ
44. ചിരിക്കട്ടെ
45. ചിരിക്കട്ടേ
46. ചിരിക്കാം
47. ചിരിച്ചോ
48. ചിരിച്ചോളൂ
49. ചിരിച്ചാട്ടെ
50. ചിരിക്കാവുന്നതാണ്
51. ചിരിക്കണേ
52. ചിരിക്കേണമേ
53. ചിരിച്ചേക്കാം
54. ചിരിച്ചോളാം
55. ചിരിക്കാൻ
56. ചിരിച്ചല്ലോ
57. ചിരിച്ചുവല്ലോ

There are a few inflections with adverbial participles
58. ചിരിക്കാൻ
59. ചിരിച്ച്
60. ചിരിക്ക
61. ചിരിക്കിൽ
62. ചിരിക്കുകിൽ
63. ചിരിക്കയാൽ
64. ചിരിക്കുകയാൽ

The verb can act as an adverb clause. Examples
65. ചിരിച്ച
66. ചിരിക്കുന്ന
67. ചിരിച്ചത്
68. ചിരിച്ചതു്
69. ചിരിക്കുന്നത്

The above two forms act as nominal forms. Hence they have all nominal inflections too
70. ചിരിച്ചതിൽ
71. ചിരിക്കുന്നതിൽ
72. ചിരിക്കുന്നതിന്
73. ചിരിച്ചതിന്
74. ചിരിച്ചതിന്റെ
75. ചിരിക്കുന്നതിന്റെ
76. ചിരിച്ചതുകൊണ്ട്
77. ചിരിക്കുന്നതുകൊണ്ട്
78. ചിരിച്ചതിനോട്
79. ചിരിക്കുന്നതിനോട്
80. ചിരിക്കുന്നതിലേയ്ക്ക്

Now, a few voice forms for the verb ചിരിക്കുക
81. ചിരിക്കപ്പെടുക
82. ചിരിപ്പിക്കുക

These voice forms are again just verbs, so they can go through all the inflections listed above for the verb ചിരിക്കുക. I am not writing them here, since it would mostly be a repeat of what is already listed. ചിരിക്കപ്പെടുക has all inflections of the verb പെടുക. You can see them listed in my test case file though.

A noun can be derived from the verb ചിരിക്കുക too. That is
83. ചിരിക്കൽ

Since it is a noun, all nominal inflections apply.
84. ചിരിക്കലേ
85. ചിരിക്കലിനോട്
86. ചിരിക്കലിൽ
87. ചിരിക്കലിന്റെ
88. ചിരിക്കലിനെക്കൊണ്ട്
89. ചിരിക്കലിലേയ്ക്ക്
90. ചിരിക്കലിന്

My test file has 164 entries including the ones I skipped here. As of today, the morphology analyser can parse 74% of the items. You can check the test results here: https://paste.kde.org/pn5z0oh7g

A native Malayalam speaker may point out a variation of this word: ചിരിയ്ക്കുക, with യ് before ക്കുക. My intention is to support that variation as well. Obviously that word will also have the inflected forms listed above.

Now that I have written this list here, I think having a rough English translation of each item would be cool, but it is too tedious for me.

Talk on ‘Malayalam orthographic reforms’ at Grafematik 2018

Santhosh and I presented a paper on ‘Malayalam orthographic reforms: impact on language and popular culture’ at the Grafematik conference held at IMT Atlantique, Brest, France. Our session was chaired by Dr. Christa Dürscheid.

The paper we presented is available here. The video of our presentation is available on YouTube.

Grafematik is a conference, the first of its kind, bringing together disciplines concerned with writing systems and their representation in written communication. There were a lot of interesting talks on various scripts around the world, their digital representation, the role of Unicode, typeface design and so on. All the talk videos are available on the conference website.

Typoday 2018

Santhosh and I jointly presented a paper at Typoday 2018. The paper was titled ‘Spiral splines in typeface design: A case study of Manjari Malayalam typeface’. The full paper is available here. The presentation is available here.

Typoday is the annual conference where typographers and graphic designers from academia and industry come up with their ideas and showcase their work. Typoday 2018 was held at Convocation Hall, University of Mumbai.

Manjari 1.5 version released

A new version of Manjari typeface is available now. Version 1.5 is mainly a bug fix release.

In version 1.3, the build tooling of the project was changed from fontforge to fontmake. Two weeks back, a few people reported that the font no longer works in MS Word and Wordpad. The font selector lists the font, but when it is selected, the content remains the same. It works in all other applications without any issues, which is why the bug went unnoticed.

Debugging the issue was not easy since the font works everywhere else. I did a line by line diff of the ttx format (XML font format) of the old and new versions of the font and found that the OS/2 ulUnicodeRange and ulCodePageRange values were set to 0 in version 1.3. Apparently these values are really checked by MS Word and Wordpad. If they are missing, Wordpad and Word just reject the font. Correct values for these fields are set now.
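A check for this kind of regression can be sketched with the Python standard library alone, working on the XML text that the ttx tool dumps. This is not the script I used. It assumes the dump stores the OS/2 table in an OS_2 element, with fields named ulUnicodeRange1 to ulUnicodeRange4 and ulCodePageRange1 to ulCodePageRange2 whose value attributes are strings of binary digits. Verify this against a real dump before relying on it.

```python
import xml.etree.ElementTree as ET

# Field names as assumed to appear inside the <OS_2> element of a ttx dump.
RANGE_FIELDS = [f"ulUnicodeRange{i}" for i in range(1, 5)] + [
    "ulCodePageRange1",
    "ulCodePageRange2",
]


def empty_os2_ranges(ttx_xml):
    """Return True when the OS/2 range fields are missing or all zero."""
    os2 = ET.fromstring(ttx_xml).find("OS_2")
    if os2 is None:
        raise ValueError("no OS_2 table in this ttx dump")
    bits = ""
    for name in RANGE_FIELDS:
        element = os2.find(name)
        if element is not None:
            # Values are assumed to look like "00000000 00000000 ...".
            bits += element.get("value", "").replace(" ", "")
    return set(bits) <= {"0"}
```

Running such a check as part of the build would flag a font whose ranges were reset to 0 before it is released.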

New version 1.5 is available now. You can download the latest fonts from https://smc.org.in/fonts/#manjari

Stylistic Alternates for ച്ച, ള്ള in Manjari and Chilanka fonts

The ligatures for the Malayalam conjuncts ച്ച, ള്ള have less popular variants as shown below

The second form is not seen in print but is often seen in handwritten Malayalam. I have seen it a lot on bus boards, especially in Thiruvananthapuram. There are no digital typefaces with the second style, except the Chilanka font I designed, which uses the second variant of ച്ച. I got a lot of appreciation for that style variant, but also received requests for the first form of ച്ച. I had a private copy of Chilanka with that variant and gave it to whoever requested it. I also received some requests for the second style of ള്ള. For the Manjari font too, I received requests for the second variant.

Today I am announcing new versions of the Manjari and Chilanka fonts, with these two forms as optional variants, without the need for a different copy of the font. In a single font, you get both variants using the OpenType stylistic alternates feature.

The default styles of ച്ച and ള്ള are not changed in the new version. The fonts come with an option to choose a different form.

Choosing the style for webfonts using CSS

Use the font-feature-settings CSS style to choose a style. For the element or class in the html, use it as follows:

For style 1:

font-feature-settings: "salt" 1;

For style 2:

font-feature-settings: "salt" 2;

Choosing the style variant in LibreOffice

In the place of the font name in font selector, append :salt=1 for first style, :salt=2 for second style. So you need to give Manjari Regular:salt=2 as the font name for example to get second style.

Choosing the style variant in XeLaTeX

fontspec allows choosing alternate style variants with the Alternate=N syntax. Note that N starts from 0, so for style 1 use Alternate=0 and for style 2 use Alternate=1. Refer to section 2.8.3 of the fontspec documentation.

\documentclass[11pt]{article}
\usepackage{polyglossia}
\newfontfamily{\manjari}[Script=Malayalam]{Manjari}
\begin{document}

\manjari{\addfontfeature{Alternate=1}കാച്ചാണി, വെള്ളയമ്പലം}

\end{document}

This will produce the following rendering:

Choosing the style variant in Inkscape

The Inkscape font selection dialog has a feature to choose font style variants. It uses the property values of CSS font-feature-settings.

In Adobe InDesign, selecting the ligature will offer the stylistic alternate(s), if any, to choose from.

Updated fonts

Updated fonts are available in SMC’s font download microsite https://smc.org.in/fonts

Towards a Malayalam morphology analyser

Malayalam is a highly inflectional and agglutinative language. This has posed a challenge for all kinds of language processing. Algorithmic interpretation of Malayalam words and their formation rules continues to be an untackled problem. My own attempts to study and try out some of these characteristics were a big failure in the past. Back in 2007, when I tried to develop a spellchecker for Malayalam, the effectively infinite number of words this language can form by combining multiple words and inflecting them was a big challenge. The dictionary-based spellchecker was a failed attempt. I had documented these issues.

I was busy with my type design projects for the last few years, but continued to search for a solution to this problem. Last year (2016), during the Google Summer of Code mentor summit at the Google campus in California, mentors working on language technology had a meeting and I explained this challenge. It was suggested that I look at Finnish, Turkish, German and similarly inflected and agglutinated languages and their attempts to solve this. So, after the meeting, I started studying some of the projects: Omorfi for Finnish, SMOR for German, TRMorph for Turkish. All of them use finite state transducer (FST) technology.

There are multiple FST implementations for linguistic purposes: foma, XFST (the Xerox Finite State Toolkit), SFST (the Stuttgart Finite State Toolkit) and HFST (the Helsinki Finite State Toolkit). I chose SFST because of its good documentation (in English) and the availability of reference systems (TRMorph, SMOR). And now we have mlmorph, a Malayalam morphology analyser project, in development here: https://github.com/santhoshtr/mlmorph

I will document the system in detail later. Currently it is progressing well. I was able to solve arbitrary levels of agglutination with inflection. Nominal and verbal inflections are being solved one by one. I will try to provide a rough high-level outline of the system below.

  • Lexicon: This is a large collection of root words, collected and manually curated, classified into various part of speech categories. So the collection is separated into nouns, verbs, conjunctions, interjections, loan words, adverbs, adjectives, question words, affirmatives, negations and so on. Nouns themselves are divided into pronouns, person names, place names, time names, language names and so on. Each of them gets a unique tag, which will appear when you analyse such words.
  • Morphotactics: Morphology rules about agglutination and inflection. This includes agglutination rules based on Samasam(സമാസം) – accusative, vocative, nominative, genitive, dative, instrumental, locative and sociative. Also plural inflections, demonstratives(ചുട്ടെഴുത്തുകൾ) and indeclinables(അവ്യയങ്ങൾ). For verbs, all possible tense forms, converbs, adverbial participles, concessives(അനുവാദകങ്ങൾ) and so on.
  • Phonological rules: This is done on top of the results from morphotactics. For example, from morphotactics, ആൽ<noun>, തറ<noun>, ഇൽ<locative> will give ആൽ<noun>തറ<noun>ഇൽ<locative>. But after the phonological treatment it becomes ആൽത്തറയിൽ, with consonant duplication after ൽ, and ഇ becomes യി.
  • Automata definition for the above: This is where you say, in a regular-expression-like language, that nouns can be concatenated any number of times, followed by an optional inflection, and so on.
  • Programmable interface, web api, command line tools, web interface for demos.
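The phonological step in the outline above can be illustrated with a toy sketch in plain Python. This is not how mlmorph implements it (the real rules are written in the SFST formalism). The function below hard-codes only the two rules needed for the ആൽത്തറയിൽ example, and all names in it are illustrative.

```python
CHILLU_L = "\u0d7d"   # ൽ
VIRAMA = "\u0d4d"     # ്
YA = "\u0d2f"         # യ
# Independent vowel -> dependent vowel sign (only the pair needed here).
VOWEL_SIGNS = {"\u0d07": "\u0d3f"}   # ഇ -> ി


def is_consonant(ch):
    """True for the Malayalam consonants ക (U+0D15) to ഹ (U+0D39)."""
    return "\u0d15" <= ch <= "\u0d39"


def join_morphemes(morphemes):
    """Concatenate non-empty morphemes, applying two toy sandhi rules."""
    word = morphemes[0]
    for nxt in morphemes[1:]:
        if word.endswith(CHILLU_L) and is_consonant(nxt[0]):
            # Consonant duplication after ൽ: ത -> ത്ത
            nxt = nxt[0] + VIRAMA + nxt
        elif is_consonant(word[-1]) and nxt[0] in VOWEL_SIGNS:
            # Glide insertion before a vowel: ഇ -> യി
            nxt = YA + VOWEL_SIGNS[nxt[0]] + nxt[1:]
        word += nxt
    return word


if __name__ == "__main__":
    print(join_morphemes(["ആൽ", "തറ", "ഇൽ"]))
```

A finite state transducer expresses the same kind of rewrite as a composable rule, which is also what lets the system run in the reverse direction for analysis.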

What can it do now? The following screenshot is from its web demo. You can see complex words being analysed into their stems, inflections, tense etc.

Note that this is bidirectional. You can give a complex word, it will give analysis. Similarly when you give root words and POS tags, it will generate the complex word from it. For example:

ആടുക<v><past>കൊണ്ടിരിക്കുക<v><present> =>  ആടിക്കൊണ്ടിരിക്കുന്ന
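An analysis string in this notation is easy to post-process. The sketch below splits it into stems and their tags with regular expressions. It is an illustrative helper, not part of mlmorph, and it assumes that stems never contain the characters < or >.

```python
import re

# A morpheme is a stem followed by one or more <tag> groups.
MORPHEME = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
TAG = re.compile(r"<([^<>]+)>")


def parse_analysis(analysis):
    """Split 'stem<tag><tag>stem<tag>' into [(stem, [tags]), ...]."""
    return [
        (stem, TAG.findall(tags))
        for stem, tags in MORPHEME.findall(analysis)
    ]


if __name__ == "__main__":
    for stem, tags in parse_analysis("ആടുക<v><past>കൊണ്ടിരിക്കുക<v><present>"):
        print(stem, tags)
```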

Covering all possible word formation rules for Malayalam is an ambitious project, but let us see how much we can achieve. Now the effort is more on linguistic aspects of Malayalam than technical. I will update about the progress of the system here.