Malayalam Script LGR rules for public review

The Malayalam and Tamil Root Zone Label Generation Rules for Internationalized Domain Names have been released for public comment. See the announcement from ICANN. The rules were drafted by the Neo-Brahmi Script Generation Panel (NBGP), of which I am also a member.

Your comments on the proposal for the Malayalam Script Label Generation Rules for the Root Zone (LGR [XML, 18 KB] and supporting documentation [PDF, 998 KB]) can be submitted through the feedback form until November 7, 2018.

My earlier blog post on Internationalized Top Level Domain Names in Indian Languages has some detailed information about this.

Malayalam spellchecker – a morphology analyser based approach

My first attempt to develop a spellchecker for Malayalam was in 2007. I used hunspell and a word list based approach. It was not successful because of the rich morphology of Malayalam. Even though I prepared a manually curated list of 150K words, it came nowhere near covering the practically infinite vocabulary of Malayalam. For languages with productive morphological processes in compounding and derivation, which are capable of generating dictionaries of infinite length, a morphology analysis and generation system is required. Since my efforts towards building such a morphology analyser are progressing well, I am proposing a finite state transducer (FST) based spellchecker for Malayalam. In this article, I will first analyse the characteristics of Malayalam spelling mistakes and then explain how an FST can be used to implement the solution.

What is a spellchecker?

A spellchecker is an application that tells whether a given word is spelled correctly as per the language. If the word is not spelled correctly, the spellchecker often suggests possible alternatives to correct it. A word can be spellchecked independently or in the context of a sentence. For example, in the sentence “അസ്തമയസൂര്യൻ കടലയിൽ മുങ്ങിത്താഴ്ന്നു”, the word “കടലയിൽ” (in the peanut) is spelled correctly if considered independently. But in the context of the sentence, it is supposed to be “കടലിൽ” (in the sea).

The correctness of a word is tested by checking whether that word is in the language model. The language model can simply be a list of all known words in the language. Or it can be a system that knows what a word in the language looks like and can tell whether a given word is such a word. In the case of Malayalam, we saw that a finite dictionary is not possible. So we need a system that is ‘aware’ of all words in the language. We will see how a morphology analyser can be such a system.

If the word is misspelled, the system needs to suggest corrections. To generate correctly spelled words from a misspelled word form, an error model is needed. The most common error model is Levenshtein edit distance (strictly, its Damerau-Levenshtein variant once transpositions are included). In the edit distance algorithm, the misspelling is assumed to be the result of a finite number of operations applied to the characters of a string: deletion, insertion, change, or transposition. The number of operations is known as the ‘edit distance’. Any word from the known list of words in the language that is at a minimal distance is a candidate for suggestion. Peter Norvig explains such a functional spellchecker in his article “How to Write a Spelling Corrector”.

There are multiple problems with the edit distance based correction mechanism:

  • For a query word, we can calculate how many candidate words have to be generated by the four operations and then tested for correctness. For a word of length n, an alphabet of size a and an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations and a*(n+1) insertions, for a total of 2n+2an+a-1 candidates at search time. In the case of Malayalam, a is 117 if we consider all encoded characters in Unicode version 11. If we remove all archaic characters, we still need about 75 characters. So, for d=1 and a=75, a word with 10 characters gives 2*10+2*75*10+75-1 = 1594 candidates, and the number is much larger for larger d. In other words, you need 1594 lookups (spellchecks) in the language model to get the possible suggestions.
  • The assumption that the four edit operations cause all spelling mistakes is not accurate for Malayalam. Many common spelling mistakes in Malayalam are at an edit distance of 3 or 4 from the intended word. Edit distance based correction usually does not go beyond d=2, since the number of candidates grows rapidly.
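
The candidate count above can be checked with a few lines of Python. This is a minimal sketch in the style of Norvig's article, not code from my project: the function names are mine, and the 75-character alphabet is a stand-in built from consecutive code points in the Malayalam block rather than a curated character list.

```python
def edit1_candidates(word, alphabet):
    """Return every string one edit away from `word`.

    Duplicates are kept on purpose, so the length of the result
    matches the count 2n + 2an + a - 1 derived in the text.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return deletes + transposes + replaces + inserts


def edit1_count(n, a):
    """Closed form: n deletions, n-1 transpositions, a*n alterations, a*(n+1) insertions."""
    return 2 * n + 2 * a * n + a - 1


# Assumption: 75 consecutive code points stand in for the ~75 characters in use.
alphabet = [chr(0x0D00 + i) for i in range(75)]
word = "".join(alphabet[:10])  # any 10-character string works for counting

print(edit1_count(10, 75))                     # 1594
print(len(edit1_candidates(word, alphabet)))   # 1594
```

Each of those 1594 strings then needs a lookup in the language model, which is the cost the first point describes.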

The problems with hunspell based spellchecker and Malayalam

Hunspell has compounding support, but it is limited to two levels. Malayalam can have more than two levels of compounding, and sometimes the agglutinated words are also inflected. Hunspell has an affix dictionary and a suffix mapping system, but it is too limited to support a morphology as complex as that of Malayalam. With the help of Németh László, the Hunspell developer, I explored this path, but abandoned it because of the many limitations of Hunspell and the lack of programmatic control over the morphological rules.

Nature of Malayalam spelling mistakes

Malayalam uses an alphasyllabary writing system. Each letter you write is the grapheme representation of a phoneme. In a broad sense, Malayalam can be considered a language with one to one grapheme to phoneme correspondence. In English and similar languages, by contrast, a letter might represent a variety of sounds, or the same sound can be written in different ways. The way a person learns to write a language strongly depends on the writing system.

In Malayalam, since there is one and only one set of characters that can correspond to a syllable, this kind of letter confusion does not happen. For example, in English, Education, Ship, Machine and Mission all have the sh sound [ʃ]. So a person can mix up these combinations. But in Malayalam, if it is the sh sound [ʃ], then it is always ഷ.

Because of this, classifying spelling mistakes as the result of four edit operations (deletion, insertion, change, or transposition) may not be accurate for Malayalam. Let us try to classify and analyse the spelling mistake patterns of Malayalam.

  1. Phonetic approximation: The 1:1 grapheme to phoneme correspondence holds in theory. Because of it, an inaccurate utterance of a syllable leads directly to an incorrect spelling. For example, ബൂമി is a relaxed way of reading ഭൂമി, since it takes relatively less effort. Since the relaxed pronunciation is normal, people sometimes think that they are writing the wrong way and try to correct it unnecessarily. പീഢനം->പീഡനം is one such example.
    • Consonants: Each consonant in Malayalam has aspirated, unaspirated, voiced and unvoiced variants. It is very common to mix them up.
      • Aspirated and Unaspirated mix-up: Aspirated consonant can be mistakenly written as  Unaspirated consonant. For Example, ധ -> ദ, ഢ -> ഡ . Similarly Unaspirated consonant can be mistakenly written as aspirated consonant – Example, ദ ->ധ, ഡ ->ഢ.
      • Voiced and Voiceless mix-up. Voiced consonants like ഗ, ഘ can be mistakenly written as voiceless forms ക, ഖ. And vice versa.
      • Gemination of consonants is often relaxed or skipped in speech, hence it appears in writing too. Gemination in Malayalam script is written by combining two consonants using virama. നീലതാമര/നീലത്താമര is an example of this kind of mistake. There are a few debatable words too, like സ്വർണം/സ്വർണ്ണം, പാർടി/പാർട്ടി. Another way of indicating consonant stress is Unaspirated Consonant + Virama + Aspirated Consonant. The pairs അദ്ധ്യാപകൻ/അധ്യാപകൻ, തീർഥം/തീർത്ഥം, വിഡ്ഡി/വിഡ്ഢി are examples.
      • Hard, Soft variants confusion. Examples: ശ/ഷ, ര/റ, ല/ള
    • Vowels: Vowel elongation or shortening, gliding vowels and semi vowels are the cause for vowel related mistakes in writing.
      • Each vowel in Malayalam can be a short vowel or long vowel. Local dialect can confuse people to use one for the other. ചിലപ്പൊൾ/ചിലപ്പോൾ is one example. Since many input tools place the short and long vowels forms with very close keystrokes, it is possible to cause errors. In Inscript keyboard, short and long vowels are in normal and shift position. In transliteration based input methods, long vowel is often typed by repeated keys(i, ii for ി, ീ).
      •  The vowel ഋ is close to റി or റു in pronunciation. Example: ഋതു/റിതു. The vowel sign of ഋ while appearing with a consonant is close to ്ര. Example ഗൃഹം/ഗ്രഹം. ഹൃദയം/ഹ്രുദയം.
      • Gliding vowels ഐ, ഔ get confused with their constituent vowels. കൈ/കഇ/കയ്, ഔ/അഉ/അവു are examples.
      • In Malayalam, there is a tendency to use എ instead of ഇ, since it takes less effort. Examples: ചിലവ്/ചെലവ്, ഇല/എല, തിരയുക/തെരയുക. Due to the wide usage of these variants, it is sometimes very difficult to say that one of the forms is wrong. See the discussion about ‘Standard Malayalam’ at the end of this essay.
    • Chillus: Chillus are pure consonants. A consonant + virama sequence sometimes has no phonetic difference from a chillu. For example, the combinations കല്പന/കൽപന, നിൽക്കുക/നില്ക്കുക. The chillu ർ is sometimes confused with the ഋ sign. An example is പ്രവർത്തി/പ്രവൃത്തി. The chillu form of മ, that is ം, can appear as anuswara or in the ma+virama form. Examples: പംപ, പമ്പ. But it is not rare to see പംമ്പ for this. Sometimes the anuswara gets confused with ന്, and പമ്പ becomes പന്പ. There were a few buggy fonts that used ന്+പ for the മ്പ ligature too.
  2. Weak Phoneme-Grapheme correspondence: Due to the historic or evolutionary nature of the script, Malayalam also has some phonemes which have a weak relationship with the graphemes.
    • ഹ്മ/ മ്മ as in ബ്രഹ്മം/ബ്രമ്മം, ന്ദ/ന്ന as in നന്ദി/നന്നി, ഹ്ന/ന്ന  as in ചിഹ്നം/ചിന്നം are some examples where what you pronounce is not exactly same as what you write.
    • റ്റ, ന്റ – These two highly used conjuncts deviate heavily from their constituent letters and pronunciation. While writing with a pen, people don’t make many mistakes since they just draw the shape of these ligatures. But while typing, one needs to know the exact key sequence, and people get confused. Common mistakes for these conjuncts are ററ, ൻറ, ൻറ്റ, ൻററ.
  3. Visual similarity: While using visual input methods such as handwriting based or some onscreen keyboards, either the users or the input tool makes mistakes due to visual similarity
    • ൃ, ്യ often get confused.
    • ജ്ഞ, ഞ്ജ is one very common pair that confuses people. ആദരാജ്ഞലി/ആദരാഞ്ജലി.
    • ത്സ, ഝ is another such pair.
    • Handwriting based input methods like the Google handwriting tool are known for recognizing anuswara ം as zero, English o, O etc.
    • When people don’t know how to insert visarga ഃ, they use the colon : instead, since it looks very similar and has a key on the keyboard. Example: ദുഃഖം/ദു:ഖം
    • ള്ള, the geminated form of ള, is very similar to two adjacent ള. This kind of mistake is very frequent among people who learned Malayalam inputting informally. Two adjacent റ is another mistake, for റ്റ.
    • The informal, trial-and-error based Malayalam inputting training also introduced some other mistakes such as using open parenthesis ‘(‘ for ്ര, closing parenthesis ‘)’ for ാ sign.
  4. Ambiguity due to regional dialect: A good example for this is the insertion of യ് in verbs: കുറക്കുക/കുറയ്ക്കുക, ചിരിക്കുക/ചിരിയ്ക്കുക. It also happens in nominal inflections: പൂച്ചയ്ക്ക്/പൂച്ചക്ക്. The usage of Samvruthokaram to distinguish between a pure consonant and a stressed consonant at the end of a word is a highly debated topic. For example, അവന്/അവനു്/അവനു. All these forms are common, even though the usage of നു് has decreased after the script reformation. But since the script reformation was not an absolute transformation, the form still exists in usage.
  5. Spaces: Malayalam is an agglutinative language. Words can be agglutinated, but nothing prevents people from inserting spaces and writing them as simple words. This should be done carefully, since it can alter the meaning. An example is “ആന പുറത്തു കയറി”, “ആനപ്പുറത്തു കയറി”, “ആനപ്പുറത്തുകയറി”, “ആനപ്പുറത്ത് കയറി”. Another example: “മലയാള ഭാഷ”, “മലയാളഭാഷ”. Here, there is no valid word “മലയാള”. The anuswara at the end gets deleted only when the word joins with ഭാഷ as an adjective. A morphology analyser can correctly parse “മലയാളഭാഷ” as മലയാളം<proper-noun><adjective>ഭാഷ<noun>. But since usage has already broken this rule and many people use the space liberally, a spellchecker needs to handle these cases.
  6. Slip of Finger: Accidental insertions or omissions of key presses are a common reason for spelling mistakes. For alphabetic languages, this is the type of error that is mostly addressed. This type of accidental slip of the finger can happen for Malayalam too. For Latin based languages, we can make some analysis since we know the QWERTY keyboard layout, and do optimized checks for this kind of issue. Since Malayalam uses another level of mapping on top of QWERTY for inputting (inscript, phonetic, transliteration), it is not easy to analyse these errors. So, in general, we can expect random characters or the omission of some characters in the query word. An accidental space insertion has the added challenge that it splits the word in two, and if the spellchecking is done one word at a time, we will miss it.

I must add that the above classification is not based on a systematic study of any test data that I can share. Ideally, this classification should be done with real samples of Malayalam written on paper and on computer. The samples should then be manually checked for spelling mistakes, and the mistakes listed and analysed for patterns. This exercise would be very beneficial for spellcheck research. In my case, ever since I released my word list based spellchecker, noticing spelling errors on the internet (social media, mainly) has been my obsession. Sometimes I also tried to point out spelling mistakes to authors, and that was not a very pleasant experience for me 😁 . The above list is based on my observation of such patterns.

Malayalam spelling checker

To check whether a word is a valid, known, correctly spelled word, a simple lookup using the morphology analyser is enough. If the morphology analyser can parse the word, it is correctly spelled. Note that the word can be agglutinated at arbitrary levels and inflected at the same time.
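
The lookup can be sketched in a few lines of Python. This is only an illustration: `toy_analyse` and its tiny table are hypothetical stand-ins for the real morphology analyser, and the tag names are made up for the example.

```python
def is_spelled_correctly(word, analyse):
    """Accept a word when the analyser returns at least one parse for it."""
    return len(analyse(word)) > 0


# Hypothetical stand-in for the analyser: a lookup table of known parses.
# The real analyser is an FST and is not limited to a finite table.
TOY_ANALYSES = {
    "ചിരി": ["ചിരി<noun>"],
    "ചിരിയുടെ": ["ചിരി<noun><genitive>"],
}


def toy_analyse(word):
    """Return the list of analyses for `word`, or an empty list if it cannot be parsed."""
    return TOY_ANALYSES.get(word, [])


print(is_spelled_correctly("ചിരിയുടെ", toy_analyse))  # True
print(is_spelled_correctly("ചിരീ", toy_analyse))       # False
```

The spellcheck itself stays this small; all the linguistic knowledge lives in whatever is passed in as `analyse`.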

Out of lexicon words

Compared to a finite word list, the FST based morphology analyser and generator covers a large number of words, using its generation system based on morpho-phonotactics. For a discussion on this, see my previous blog post about the coverage test. Since the vocabulary of every language is a dynamic system, it is still impossible to cover 100% of the words in a language all the time. New words get added to the language every now and then. There are nouns such as place names, people's names and product names that are not in the lexicon of the morphology analyser. These words will be reported as unknown words by the spellchecker, and an unknown word is interpreted as a misspelled word too. This is a known problem. But since a spellchecker is usually used by a human user, the severity of the issue depends on whether the spellchecker fails to recognize a lot of commonly used words. Most spellcheckers provide an ‘add to dictionary’ option to work around this issue.

As part of the Morphology analyser, the expansion of the lexicon is a never ending task. As the lexicon grows, the spellchecker improves automatically.

Malayalam spelling correction

To provide spelling suggestions, the FST based morphology analyser can be used. This is a three step process:

  1. Generate a list of candidate words from the query word. The words in this list may be incorrect too. The words are generated using the patterns we defined based on the nature of spelling mistakes. We scan the query word for common error patterns and apply the fix for each pattern. Since there are dozens of patterns, we will have many candidate words.
  2. From the candidate list, find the correctly spelled words using the spellcheck method. This results in a very small number of words. These words are the probable replacements for the misspelled query word.
  3. Sort the candidate words so that the most probable suggestion comes first. For this, we can rank the suggestion strategies. A very common error pattern gets high priority at step 1, so the suggestions from it appear first in the candidate list. A more sophisticated approach would use a frequency model for the words, so that candidate words that are very frequent in the language appear first.
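
The three steps can be sketched as follows. This is a toy illustration under loud assumptions, not the mlmorph implementation: the five error patterns are a hand-picked sample of the mistakes listed earlier, the frequencies are invented, and a small dictionary stands in for the morphology analyser in step 2.

```python
# Hypothetical error patterns: (sequence as typed, likely intended sequence).
ERROR_PATTERNS = [
    (":", "ഃ"),      # colon typed for visarga
    ("ററ", "റ്റ"),   # two adjacent റ typed for the conjunct റ്റ
    ("ബ", "ഭ"),      # unaspirated written for aspirated
    ("ക", "ഗ"),      # voiceless written for voiced
    ("ദ", "ഥ"),      # relaxed pronunciation of ഥ
]

# Invented frequencies; this dictionary also stands in for the analyser lookup.
WORD_FREQUENCY = {"ദുഃഖം": 120, "കാറ്റ്": 300, "ഭൂമി": 250, "കഥ": 500, "ഗദ": 10}


def generate_candidates(word):
    """Step 1: apply each pattern at each position where it occurs, one fix at a time."""
    candidates = []
    for wrong, right in ERROR_PATTERNS:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate not in candidates:
                candidates.append(candidate)
            start = word.find(wrong, start + 1)
    return candidates


def is_known(word):
    """Step 2 helper: the real system would ask the morphology analyser here."""
    return word in WORD_FREQUENCY


def suggest(word):
    """Steps 2 and 3: keep the valid candidates, most frequent first."""
    if is_known(word):
        return []
    valid = [c for c in generate_candidates(word) if is_known(c)]
    return sorted(valid, key=lambda c: -WORD_FREQUENCY[c])


print(suggest("ദു:ഖം"))  # ['ദുഃഖം']
print(suggest("കദ"))      # ['കഥ', 'ഗദ']
```

For കദ, pattern order alone would put ഗദ first; the frequency sort in step 3 moves the far more common കഥ to the top.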

One thing I observed with the above approach is that, in reality, the number of candidate words left for Malayalam after all the above steps is one or two most of the time. This makes step 3 less relevant. In contrast, an edit distance based approach would have generated more than 5 candidate words for each misspelled word. The candidates from the edit distance based suggestion mechanism would also be very diverse, meaning they need not be related to the intended word at all. The following images illustrate the difference.

Spelling suggestion from the morphology analyser based system.
Spelling suggestions from edit distance based candidates

Context sensitive spellchecking

Usually, spellchecking and suggestion are done one word at a time. But if we know the context of the word, the spellchecking becomes more useful. The context is usually the words before and after the word. An example from English is “I am in Engineer”. Here “in” is a correctly spelled word, but within this context it is wrong. To mark “in” as wrong and provide “an” as a suggestion, one approach is an n-gram model of the parts of speech of the language. In simple words, the model captures what kind of word can appear between known kinds of words. If we build this model for a language, it will tell us that a locative POS such as “in” before “Engineer” is rare or has not been seen before.
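
A toy version of such a part of speech n-gram check is sketched below. Everything in it is an assumption made for illustration: the three-sentence tagged corpus, the tag names and the rule of flagging the middle word of an unseen trigram. A usable model would need a large tagged corpus and probabilities instead of a seen/unseen test.

```python
from collections import Counter

# Tiny hand-tagged corpus, purely illustrative.
TAGGED_CORPUS = [
    [("I", "PRON"), ("am", "VERB"), ("an", "DET"), ("engineer", "NOUN")],
    [("she", "PRON"), ("is", "VERB"), ("a", "DET"), ("doctor", "NOUN")],
    [("he", "PRON"), ("lives", "VERB"), ("in", "ADP"), ("Kochi", "PROPN")],
]


def pos_trigrams(tags):
    """Return all consecutive tag triples of a sentence."""
    return list(zip(tags, tags[1:], tags[2:]))


# Count every POS trigram seen in the corpus.
SEEN = Counter(
    trigram
    for sentence in TAGGED_CORPUS
    for trigram in pos_trigrams([tag for _, tag in sentence])
)


def unusual_words(tagged_sentence):
    """Flag the middle word of every POS trigram that never occurred in the corpus."""
    words = [word for word, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    return [
        words[i + 1]
        for i, trigram in enumerate(pos_trigrams(tags))
        if SEEN[trigram] == 0
    ]


wrong = [("I", "PRON"), ("am", "VERB"), ("in", "ADP"), ("engineer", "NOUN")]
fixed = [("I", "PRON"), ("am", "VERB"), ("an", "DET"), ("engineer", "NOUN")]
print(unusual_words(wrong))  # ['in']
print(unusual_words(fixed))  # []
```

The same check can validate a suggestion: replacing “in” with “an” turns every trigram of the sentence into one the model has seen.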

The Standard Malayalam or lack thereof

How do you determine which is the “correct” or “standard” way of writing a word? Malayalam has a lot of orthographic variants that were introduced to the language as genuine mistakes and later became common words (രാപ്പകൽ/രാപകൽ, ചിലവ്/ചെലവ്), as phonetic simplifications (അദ്ധ്യാപകൻ/അധ്യാപകൻ, സ്വർണ്ണം/സ്വർണം), or as old spellings (കർത്താവ്/കൎത്താവു്), and so on. A debate about the correctness of these words will hardly reach a conclusion. For our case, this is more an issue of selecting words for the lexicon. Which ones to include, which ones to exclude? It is easy to treat these debates as a blocker for the progress of the project and give up: “well, these things are not decided by academics so far, so we cannot do anything about it till they make up their mind”.

I did not want to end up in that deadlock. I decided to be liberal about the lexicon. If people commonly use some words, they are valid words, and the project needs to recognize as many of them as possible. That is the very liberal definition I have. I leave the standardization discussion to the linguists who care about it.

The news report from Mathrubhumi daily in 2007 about my old spelling checker

Back in 2007, when I developed the old Malayalam spellchecker, these debates came up. Dr. P Somanathan, who helps me a lot nowadays with this project, wrote about the issue of Malayalam spelling inconsistencies: “ചരിത്രത്തെ വീണ്ടെടുക്കുക:” and “വേണം നമുക്ക് ഏകീകൃതമായ ഒരെഴുത്തുരീതി”.

References

  1. A Data-Driven Approach to Checking and Correcting Spelling Errors in Sinhala. Asanka Wasala, Ruvan Weerasinghe, Randil Pushpananda, Chamila Liyanage and Eranga Jayalatharachchi [pdf] This paper discusses phonetic similarity based strategies to create a wordlist, instead of the edit distance approach.
  2. Finite-State Spell-Checking with Weighted Language and Error Models—Building and Evaluating Spell-Checkers with Wikipedia as Corpus Tommi A Pirinen, Krister Lindén [pdf] This paper outlines the usage of Finite state transducer technique to address the issue of infinite dictionary of morphologically rich languages. They use Finnish as the example language
  3. The Malayalam morphology analyser project by myself https://gitlab.com/smc/mlmorph is the foundation for the spellchecker.
  4. The common Malayalam spelling mistakes and confusables were presented in great depth by Renowned linguist and author Panmana Ramachandran Nair in his books  ‘തെറ്റില്ലാത്ത മലയാളം’, ‘തെറ്റും ശരിയും’, ‘ശുദ്ധ മലയാളം’ and ‘നല്ല മലയാളം’.
  5. Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams. Tommi A Pirinen, Miikka Silfverberg and Krister Lindén [pdf] This paper discusses the context sensitive spellchecker approach.

Where can I try the spellchecker?

If you are curious about the implementation of this approach, please refer to https://gitlab.com/smc/mlmorph and https://gitlab.com/smc/mlmorph/wikis/Spellchecker-Plan. Since the implementation is not complete, I will write a new article about it later. Thanks for reading!

A screenshot of Malayalam spellchecker in action. Along with incorrect words, some correct words are marked as misspelled too. This is because of the incomplete morphology analyser. As it improves, more words will be covered.

Malayalam morphology analyser – status update

For the last several months, I have been actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introductory blog post is a good start. I was always skeptical about the approach, and the project as a whole looked very ambitious. But now I am almost confident that the approach is viable. I am making good progress in the project, so here are some updates on it.

Analyser coverage statistics

Recently I added a large corpus to frequently monitor the percentage of words the analyser can parse. The corpus was compiled from two large chapters of ഐതിഹ്യമാല, some news reports, an art related essay and my own technical blog posts, to have some diversity in the vocabulary.

Total words: 15808
Analysed words: 10532
Coverage: 66.62%
Time taken: 0.443 seconds

This is very encouraging. Achieving 66% coverage for a language as morphologically rich as Malayalam is no small task. From my reading, analysers for Turkish and Finnish, languages with a morphology of similar complexity, achieved about 90% coverage. Increasing the coverage further may be more difficult than getting this far. So I am planning some frequency analysis of the words that are not parsed by the analyser, to find patterns that can be improved.
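
For clarity, the coverage figure is simply the share of corpus words for which the analyser returns at least one parse. A minimal sketch of that measurement follows; `analyse` is again a placeholder for the real analyser, and only the four numbers from the table above are taken from my test run.

```python
def coverage(words, analyse):
    """Return (number of analysed words, coverage percentage) for a list of words."""
    analysed = sum(1 for word in words if analyse(word))
    return analysed, 100.0 * analysed / len(words)


# Toy check with a placeholder analyser that only knows two words.
known = {"ചിരി": ["ചിരി<noun>"], "ചിരികൾ": ["ചിരി<noun><plural>"]}
count, percent = coverage(["ചിരി", "ചിരികൾ", "xyz", "abc"], lambda w: known.get(w, []))
print(count, percent)  # 2 50.0

# The figures reported above: 10532 of 15808 words in 0.443 seconds.
print(round(100.0 * 10532 / 15808, 2))  # 66.62
print(int(15808 / 0.443))               # words analysed per second, about 35 thousand
```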

The performance is also notable. Once the automaton is loaded into memory, analysis and generation are super fast. You can see that ~16000 words were analysed in under half a second.

Tests

From the very beginning the project was test driven. It now has 740 test cases for various word forms.

The transducer

The compiled transducer is now 6.2 MB. The transducer is written in SFST-PL and compiled using SFST. It used to be compiled using hfst, but hfst is now severely broken for SFST-PL compilation, so I switched to SFST. The compiled transducer is still read using the hfst python binding.

Fst type: SFST
Arc type: SFST
Number of states: 200562
Number of arcs: 732268
Number of final states: 130

The Lexicon

The POS tagged lexicon I prepared is from various sources like wiktionary, wikipedia (based on categories) and CLDR. While developing, I had to improve the lexicon several times, since none of the above sources is accurate. The wiktionary also introduced a large number of archaic or Sanskrit terms into the lexicon. As of today, the following table illustrates the status of the lexicon:

Nouns: 64763
Person names: 505
Place names: 2031
Postpositions: 85
Pronouns: 33
Quantifiers: 57
Abbreviations: 27
Adjectives: 18
Adverbs: 14
Affirmatives: 6
Conjunctions: 75
Demonstratives: 9
English borrowed nouns: 657
Interjections: 36
Language names (nouns): 639
Affirmations and negations: 8
Verbs: 3844

As you can see, the lexicon is not that big. It is especially limited for proper nouns like the names of people and places. I think the verb lexicon is in much better shape. I need to find a way to expand the lexicon further.

POS Tagging

There is no agreement or standard on the POS tagging schema to be used for Malayalam. But I refused to let this become a blocker for the project. I defined my own POS tagging schema and worked on the analyser. The general disagreement is about naming, which is trivial to fix using a tag name mapper. The other issue is the classification of features, for which I found no schema elaborate enough to cover Malayalam.

I started referring to http://universaldependencies.org/ and provided links to its pages from the web interface. But UD is also missing several tags that Malayalam requires. So far I have defined 85 tags.

Challenges

The main challenge I am facing is not technical, it is linguistic. I am often limited by my own understanding of Malayalam grammar. Especially for the grammatical classifications, I find it very difficult to arrive at a consistent view after reading several grammar books. These books were written over a span of 100 years, and I miss a common thread in their approach to Malayalam grammar analysis. Sometimes a logical classification is not even the purpose of the author. Thankfully, I am getting some help from Malayalam professors whenever I am stuck.

The other challenge is that I have hardly had any contributors to the project, except for some bug reports. There is a big entry barrier to this kind of project. SFST-PL is not something everybody is familiar with. I need to write some simple examples for others to practice with and join.

I found that practical applications built on top of the morphology analyser attract more people. For example, the number spellout application I wrote caught the attention of many people. I am excited to present the upcoming spellchecker that I have been working on recently. I will write about the theory behind it soon.

How to customize Malayalam fonts in Linux

Nowadays GNU/Linux distributions like Ubuntu, Debian and Fedora come with pre-configured fonts for Malayalam. For the sans-serif family it is Meera, and for serif it is Rachana. If you would like to change these fonts, there is no easy way to do it with the configuration tools in Gnome or KDE. They provide a general font selector for the whole desktop, but not for a given language.

The advantage of setting these preferences at the system level is that you don’t need to choose the fonts at the application level. For example, you don’t need to set them for firefox, chrome etc. All of them will follow the system preferences. We will use fontconfig for this.

First, create a file named ~/.config/fontconfig/conf.d/50-my-malayalam.conf. If the folders for this file do not exist, just create them. Add the following content to this file.

<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
<!-- Malayalam (ml) -->
<match target="font">
        <test name="lang" compare="contains">
                <string>ml</string>
        </test>
        <alias>
                <family>sans-serif</family>
                <prefer>
                        <family>Manjari</family>
                </prefer>
        </alias>
</match>

<match target="font">
        <test name="lang" compare="contains">
                <string>ml</string>
        </test>
        <alias>
                <family>serif</family>
                <prefer>
                        <family>Rachana</family>
                </prefer>
        </alias>
</match>

<!-- Malayalam (ml) ends -->

</fontconfig>

Save the file and you are done. You can check whether the default font for Malayalam has changed using the following command:

$ LANG=ml_IN fc-match

It should list Manjari. The configuration we added to the file is not complicated. You can see that we are setting the sans-serif font preference for the ml (Malayalam) language to Manjari, and the serif font preference to Rachana. You are free to change the fonts to whatever you prefer.

Note that you may need to close and reopen your applications for this preference to be applied.

You may also choose one of the fonts available at smc.org.in/fonts, download and install it, and use the above configuration with it.

Dignity of labour among the youth, and labour societies

This is a note about a crisis faced by the youth of our land and an idea that may help solve it.

In our land there are plenty of young people engaged in jobs such as various kinds of daily wage work that need special skills, driving, farm work, painting, building construction and mechanic work. Most of the time, all of them are in the unorganised sector. Most of them are young men who have not obtained a government or private sector job, or who do not have the education needed to obtain one. Young women, on the other hand, continue their education at most until marriage and then end up in family life. These young men, aged between twenty and thirty-five, are facing a new challenge. Samakalika Malayalam weekly recently published a detailed study report about it (നിത്യഹരിത വരൻമാർ (“Evergreen Grooms”) by Rekha Chandra, Samakalika Malayalam, July 16). The finding of the study is that, across the Malabar region, young men of this kind widely remain unmarried.

The reason for this is the cultural lack of interest that the families of young women have in men doing the jobs mentioned above. Nobody is ready to give a young woman in marriage to a man without a government or private company job. That article has details about new phenomena such as the ‘Kodagu wedding’ (കുടക് കല്യാണം). Caste, horoscopes and the like stand in the way more than they did in the past. In rural areas, the moral police mostly leave no room for love marriages. Young men take up such jobs and often try to give the young women of their own households a somewhat better education, but since those young women later seek only men with better jobs, the men end up in crisis again.

This problem is a cause for the growing aversion to physical labour. The ego called social status is slowly leading to a situation where nobody is available for the important jobs mentioned above. As the general level of education in society rises, this ego will grow considerably. I fear that an unhealthy social order will gradually emerge from this. Young women in particular, because of pressure from their families, are moving into a very narrow selection space of job possibilities. The situation that is emerging is one where our social conditions do not allow them to take up the jobs mentioned above. This is where guest workers found their opportunities.

In our society, where secular public platforms have become fewer in the social sphere, the challenge of keeping this youth force politically aware is also growing. The possibility of being apolitical becoming the default choice among the youth has to be resisted by all means.

To address the problems summarised so far, there is a need for a social movement among the young people mentioned above. The aims are these:

  • Build social acceptance for all kinds of unorganised jobs, whether or not they involve physical labour. Do not let the human resource potential of the youth be tied down by misconceptions and social constraints.
  • Bring such workers into the organised sector and make them politically aware. Organise secular spaces.
  • Provide job training, and encourage and carry out healthy reforms in the existing jobs. Make the jobs attractive.
  • Extend the social driving force that Kudumbashree brought about further to the youth.

ഇതിലേക്ക് എനിക്ക് നിർദ്ദേശിക്കാനുള്ള ഒരു ആശയം “തൊഴിൽ സൊസൈറ്റികൾ” ആണ്. അതിനെപ്പറ്റിയുള്ള ഏകദേശധാരണ ഇങ്ങനെയാണ്.

  • തൊഴിലാളികളെ ആവശ്യമുള്ളവരും തൊഴിലാളികളും തമ്മിലുള്ള ഒരു മീറ്റിങ്ങ് പോയിങ്ങ് ആയി ഈ സൊസൈറ്റികൾ പ്രവർത്തിക്കുന്നു.
  • യുവാക്കൾ അവിടെ രജിസ്റ്റർ ചെയ്യുന്നു, അവരുടെ കഴിവുകളും.
  • ഇത്തരം സൊസൈറ്റികളിൽ രജിസ്റ്റർ ചെയ്തവർ യൂണിഫോമുള്ളവരും നെയിംടാഗും തൊഴിൽ സുരക്ഷാവസ്ത്രങ്ങൾ/ഉപകരണങ്ങളോടുകൂടിയവരാണ്(to overcome social stigma, this is
    important)
  • ആർക്കും ഈ സൈസൈറ്റികളിൽ ജോലിക്കാരെ തേടാം. നേരിട്ട് പോയി അന്വേഷിക്കണമെന്നില്ല. അല്പസ്വല്പം ടെക്നോളജിയുടെ സഹായത്തോടെ ഈ കണക്ഷനുകൾ പെട്ടെന്നുണ്ടാക്കാം. മൊത്തത്തിൽ അപ്പോയിന്റ്മെന്റ് സിസ്റ്റം ഒക്കെ വെച്ച് പഴയ ഫ്യൂഡൽ കാലഘട്ടത്തിലെ മുതലാളി-പണിക്കാർ റിലേഷനെ പൊളിച്ചെഴുതലാണ് ഉദ്ദേശം. അതുവഴി ഏത് ജോലിയുടെയും ഉയർച്ച താഴ്ചകളെ പൊളിക്കലും.
  • സൊസൈറ്റികൾക്ക് കൂലിനിരക്കുകൾ നിശ്ചയിക്കാം. തൊഴിൽ അവകാശങ്ങളെപ്പറ്റി ബോധമുള്ളവരായിരിക്കും.

The capitalist system in Western countries has already started implementing this idea. Amazon Services is an example. Like Uber and Airbnb, such “online apps” will soon reach our part of the world too. But they will have no goals beyond exploiting the employer-worker relationship. My wish is that the people of Kerala enter that space early, with social and political goals.

The many forms of ചിരി ☺️

This is an attempt to list all forms of the Malayalam word ചിരി (meaning: ☺️, smile, laugh). For those unfamiliar with it, Malayalam is a highly inflectional Dravidian language. I am actively working on a morphology analyser (mlmorph) for the language, as outlined in one of my previous blog posts.

I prepared this list as a test case for the mlmorph project to evaluate its grammar rule coverage, so I thought of listing it here as well, with brief comments.
1. ചിരി
ചിരി is a noun. So it can have all nominal inflections.

2. ചിരിയുടെ
3. ചിരിക്ക്
4. ചിരിയ്ക്ക്
5. ചിരിയെ
6. ചിരിയിലേയ്ക്ക്
7. ചിരികൊണ്ട്
8. ചിരിയെക്കൊണ്ട്
9. ചിരിയിൽ
10. ചിരിയോട്
11. ചിരിയേ

There is a plural form
12. ചിരികൾ

A number of agglutinations can happen at the end of the word using affirmatives, negations, interrogatives and so on. For example: ചിരിയുണ്ട്, ചിരിയില്ല, ചിരിയോ. For now I am ignoring all agglutinations and listing only the inflections.

ചിരിക്കുക is the verb form of ചിരി.
13.  ചിരിക്കുക

It can have the following tense forms
14. ചിരിച്ചു
15. ചിരിക്കുന്നു
16. ചിരിക്കും

A concessive form for the word
17. ചിരിച്ചാലും

This verb has the following aspects
18. ചിരിക്കാറ്
19. ചിരിച്ചിരുന്നു
20. ചിരിച്ചിരിയ്ക്കുന്നു
21. ചിരിച്ചിരിക്കുന്നു
22. ചിരിച്ചിരിക്കും
23. ചിരിച്ചിട്ട്
24. ചിരിച്ചുകൊണ്ടിരുന്നു
25. ചിരിച്ചുകൊണ്ടേയിരുന്നു
26. ചിരിച്ചുകൊണ്ടേയിരിക്കുന്നു
27. ചിരിച്ചുകൊണ്ടിരിക്കുന്നു
28. ചിരിച്ചുകൊണ്ടിരിക്കും
29. ചിരിച്ചുകൊണ്ടേയിരിക്കും

There are a number of mood forms for the verb ചിരിക്കുക
30. ചിരിക്കാവുന്നതേ
31. ചിരിച്ചേ
32. ചിരിക്കാതെ
33. ചിരിച്ചാൽ
34. ചിരിക്കണം
35. ചിരിക്കവേണം
36. ചിരിക്കേണം
37. ചിരിക്കേണ്ടതാണ്
38. ചിരിക്ക്
39. ചിരിക്കുവിൻ
40. ചിരിക്കൂ
41. ചിരിക്ക
42. ചിരിച്ചെനെ
43. ചിരിക്കുമേ
44. ചിരിക്കട്ടെ
45. ചിരിക്കട്ടേ
46. ചിരിക്കാം
47. ചിരിച്ചോ
48. ചിരിച്ചോളൂ
49. ചിരിച്ചാട്ടെ
50. ചിരിക്കാവുന്നതാണ്
51. ചിരിക്കണേ
52. ചിരിക്കേണമേ
53. ചിരിച്ചേക്കാം
54. ചിരിച്ചോളാം
55. ചിരിക്കാൻ
56. ചിരിച്ചല്ലോ
57. ചിരിച്ചുവല്ലോ

There are a few inflections with adverbial participles
58. ചിരിക്കാൻ
59. ചിരിച്ച്
60. ചിരിക്ക
61. ചിരിക്കിൽ
62. ചിരിക്കുകിൽ
63. ചിരിക്കയാൽ
64. ചിരിക്കുകയാൽ

The verb can act as an adverb clause. Examples
65. ചിരിച്ച
66. ചിരിക്കുന്ന
67. ചിരിച്ചത്
68. ചിരിച്ചതു്
69. ചിരിക്കുന്നത്

The forms ചിരിച്ചത് and ചിരിക്കുന്നത് act as nominal forms. Hence they have all nominal inflections too
70. ചിരിച്ചതിൽ
71. ചിരിക്കുന്നതിൽ
72. ചിരിക്കുന്നതിന്
73. ചിരിച്ചതിന്
74. ചിരിച്ചതിന്റെ
75. ചിരിക്കുന്നതിന്റെ
76. ചിരിച്ചതുകൊണ്ട്
77. ചിരിക്കുന്നതുകൊണ്ട്
78. ചിരിച്ചതിനോട്
79. ചിരിക്കുന്നതിനോട്
80. ചിരിക്കുന്നതിലേയ്ക്ക്

Now, a few voice forms for the verb ചിരിക്കുക
81. ചിരിക്കപ്പെടുക
82. ചിരിപ്പിക്കുക

These voice forms are again just verbs, so they can go through all the inflections listed above for the verb ചിരിക്കുക. I am not writing them here, since they would mostly repeat what is already listed. ചിരിക്കപ്പെടുക has all the inflections of the verb പെടുക. You can see them listed in my test case file, though.

A noun can be derived from the verb ചിരിക്കുക too. That is
83. ചിരിക്കൽ

Since it is a noun, all nominal inflections apply.
84. ചിരിക്കലേ
85. ചിരിക്കലിനോട്
86. ചിരിക്കലിൽ
87. ചിരിക്കലിന്റെ
88. ചിരിക്കലിനെക്കൊണ്ട്
89. ചിരിക്കലിലേയ്ക്ക്
90. ചിരിക്കലിന്

My test file has 164 entries, including the ones I skipped here. As of today, the morphology analyser can parse 74% of the items. You can check the test results here: https://paste.kde.org/pn5z0oh7g
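The coverage figure above is simply the share of test words for which the analyser returns at least one analysis. A minimal sketch of that calculation follows. The `analyse` argument is a hypothetical stand-in for whatever analyser function is used (it is not the mlmorph API), assumed to return an empty list when a word cannot be parsed.

```python
from typing import Callable, Iterable, List, Tuple


def coverage(words: Iterable[str],
             analyse: Callable[[str], list]) -> Tuple[float, List[str]]:
    """Return (percentage of words with at least one analysis, unparsed words)."""
    words = list(words)
    # A word counts as parsed if the analyser returns a non-empty result.
    unparsed = [w for w in words if not analyse(w)]
    if not words:
        return 0.0, unparsed
    parsed = len(words) - len(unparsed)
    return 100.0 * parsed / len(words), unparsed
```

Returning the unparsed words alongside the percentage makes it easy to see which grammar rules are still missing.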

A native Malayalam speaker may point out the variant of this word, ചിരിയ്ക്കുക, with യ് before ക്കുക. My intention is to support that variant as well. Naturally, that word will also have the inflected forms listed above.
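When comparing test entries, it can help to treat the two spellings as the same word. The sketch below is a naive normalisation that drops the optional യ് before ക്ക. It is an assumption made for illustration, not how mlmorph handles the variants, and it would also rewrite any other word that happens to contain this sequence.

```python
# Naive normalisation for test comparisons only (an assumption of this sketch).
Y_KK = "\u0d2f\u0d4d\u0d15\u0d4d\u0d15"  # യ + virama + ക + virama + ക
KK = "\u0d15\u0d4d\u0d15"                # ക + virama + ക


def normalise_y(word: str) -> str:
    """Drop the optional യ് that can precede ക്ക."""
    return word.replace(Y_KK, KK)


def same_modulo_y(a: str, b: str) -> bool:
    """True when two words differ at most by the optional യ്."""
    return normalise_y(a) == normalise_y(b)
```

With this, pairs such as entries 3 and 4 or 20 and 21 in the list above compare as equal.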

Now that I have written this list, I think having a rough English translation of each item would be cool, but it is too tedious for me.

How to type Malayalam using Keyman 10 and Mozhi

This is a quick tutorial on installing Mozhi input method in Windows 10.

Mozhi is a transliteration-based keyboard for Malayalam. For example, you can type malayaalam to get മലയാളം. We will use Keyman as the input tool. Keyman is an open source input mechanism now developed by SIL. It supports a lot of languages, and Mozhi Malayalam is one of them.

Step 1: Download Keyman desktop with Mozhi Malayalam keyboard

Go to https://keyman.com/keyboards/mozhi_malayalam. There you will see the following download options. Select the first one, as shown below, and download the installer to your computer. The file is about 20 MB.

Keyman 10 Desktop download page.

Step 2: Installation

Double click the downloaded file to start installation. The installer will be like this:

Keyman 10 Desktop installer

Click on the Install Keyman Desktop button. You will see the below screen.

Keyman 10 Desktop welcome page.

Press the “Start keyman” button. The installation will proceed and the keyboard will start.

Step 3: Choose Mozhi input method

You will see a small icon at the bottom of your screen, near where the time is displayed.

Click on that to choose Mozhi.

Keyboard selection

Once you have chosen Mozhi, you can type in Manglish anywhere and you will see Malayalam. To learn how to type, click on “Keyboard Usage” as shown above.

Step 4: Start typing in Malayalam

You can type Malayalam directly in any application, without copying and pasting. Just start typing, as you would in English. Make sure to use a good Malayalam font. You can get them from https://smc.org.in/fonts/

Using Mozhi in LibreOffice. Notice that the font used is Manjari. What I typed is “ippOL enikk malayaalam ezhuthaanaRiyaam”.

Kindle supports custom fonts

I am pleasantly surprised to see that Amazon Kindle now supports installing custom fonts. This is a big step towards supporting non-Latin content on their devices. I can now read Malayalam ebooks on my Kindle with my favorite fonts.

Content rendered in Manjari font. Note that I installed Bold, Regular, Thin variants so that Kindle can pick up the right one

This feature was introduced in Kindle version 5.9.6.1, released in June 2018. Once you have updated to that version, all you need to do is connect the device to your computer using the USB cable, copy your fonts to the fonts folder on the device, and remove the USB cable. You will then see the fonts listed in the font selector.
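If you have many font files, the copy step can be scripted. The sketch below assumes the Kindle is already mounted as a USB drive. The function name, the example paths and the set of file extensions are assumptions for illustration, not something taken from Kindle documentation.

```python
import shutil
from pathlib import Path
from typing import List


def copy_fonts(font_dir: str, kindle_root: str) -> List[str]:
    """Copy .ttf/.otf files from font_dir into <kindle_root>/fonts."""
    target = Path(kindle_root) / "fonts"
    # The fonts folder should already exist on the device; it is created here
    # only so that the sketch also runs against an empty directory.
    target.mkdir(exist_ok=True)
    copied = []
    for font in sorted(Path(font_dir).iterdir()):
        if font.suffix.lower() in {".ttf", ".otf"}:  # assumed font extensions
            shutil.copy2(font, target / font.name)
            copied.append(font.name)
    return copied
```

A call might look like `copy_fonts("/home/me/fonts/manjari", "/media/me/Kindle")`, where both paths are placeholders that depend on your system.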

Kindle added Malayalam rendering support back in 2016, but the default font provided was one of the worst Malayalam fonts. It had wrong glyphs for certain conjuncts and only a minimal set of glyphs.

I tried some of the SMC Malayalam fonts in the new version of Kindle. Screenshots given below

Custom fonts selection screen. These fonts were copied to the device
Select a font other than the default one
Content in Rachana.
Make sure to check the version. 5.9.6.1 is the latest version and it supports custom fonts

Manjari 1.5 version released

A new version of Manjari typeface is available now. Version 1.5 is mainly a bug fix release.

In version 1.3, the build tooling of the project was changed from fontforge to fontmake. Two weeks ago a few people reported that the font no longer works in MS Word and Wordpad. The font selector lists the font, but when it is selected, the content remains the same. It works in all other applications without any issues, which is why the bug went unnoticed.

Debugging the issue was not easy, since the font works everywhere else. I did a line-by-line diff of the ttx (XML font format) dumps of the old and new versions of the font, and found that the OS/2 ulUnicodeRange and ulCodePageRange values were set to 0 in version 1.3. Apparently MS Word and Wordpad really check these values. If they are missing, Word and Wordpad simply reject the font. The correct values for these fields are now set.
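A check like this can be automated so that the problem is caught before a release. The following is a minimal sketch, assuming the fontTools Python library is installed. The helper names are made up for this example, and the rule that "all zero means broken" is only my reading of the symptom described above.

```python
from typing import Any, Dict

RANGE_FIELDS = (
    "ulUnicodeRange1", "ulUnicodeRange2", "ulUnicodeRange3", "ulUnicodeRange4",
    "ulCodePageRange1", "ulCodePageRange2",
)


def os2_ranges(os2: Any) -> Dict[str, int]:
    """Read the range bitfields from an OS/2 table object (missing fields count as 0)."""
    return {name: getattr(os2, name, 0) for name in RANGE_FIELDS}


def ranges_look_empty(os2: Any) -> bool:
    """True when all Unicode-range bits or all code-page bits are zero."""
    values = os2_ranges(os2)
    unicode_bits = [v for k, v in values.items() if k.startswith("ulUnicodeRange")]
    codepage_bits = [v for k, v in values.items() if k.startswith("ulCodePageRange")]
    return not any(unicode_bits) or not any(codepage_bits)


def check_font(path: str) -> bool:
    """Load a font with fontTools and report whether its ranges look empty."""
    from fontTools.ttLib import TTFont  # third-party: pip install fonttools
    return ranges_look_empty(TTFont(path)["OS/2"])
```

Running `check_font` on each built font file as part of the build would flag a font whose range fields were left at zero.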

New version 1.5 is available now. You can download the latest fonts from https://smc.org.in/fonts/#manjari

Stylistic Alternates for ച്ച, ള്ള in Manjari and Chilanka fonts

The ligatures for the Malayalam conjuncts ച്ച, ള്ള have less popular variants as shown below

The second form is not seen in print, but it is common in handwritten Malayalam. I have seen it a lot on bus boards, especially in Thiruvananthapuram. There are no digital typefaces with the second style, except the Chilanka font I designed, which uses the second variant of ച്ച. I got a lot of appreciation for that style variant, but also received requests for the first form of ച്ച. I had a private copy of Chilanka with that variant and gave it to whoever requested it. I also received some requests for the second style of ള്ള. For the Manjari font too, I received requests for the second variant.

Today I am announcing new versions of the Manjari and Chilanka fonts, with these two forms as optional variants, without the need for a separate copy of the font. In a single font, you get both variants through the OpenType stylistic alternates feature.

The default styles of ച്ച and ള്ള are not changed in the new version. The fonts come with an option to choose a different form.

Choosing the style for webfonts using CSS

Use the font-feature-settings CSS style to choose a style. For the element or class in the html, use it as follows:

For style 1:

font-feature-settings: "salt" 1;

For style 2:

font-feature-settings: "salt" 2;

Choosing the style variant in LibreOffice

In the place of the font name in font selector, append :salt=1 for first style, :salt=2 for second style. So you need to give Manjari Regular:salt=2 as the font name for example to get second style.

Choosing the style variant in XeLaTeX

fontspec allows you to choose alternate style variants using the Alternate=N syntax. Note that N starts from 0, so for style 1 use Alternate=0 and for style 2 use Alternate=1. Refer to section 2.8.3 of the fontspec documentation.

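% Compile with XeLaTeX; polyglossia loads fontspec.
\documentclass[11pt]{article}
\usepackage{polyglossia}
\newfontfamily{\manjari}[Script=Malayalam]{Manjari}
\begin{document}

% Alternate=1 selects the second style (numbering starts at 0).
% The braces keep the font switch local to this text.
{\manjari\addfontfeature{Alternate=1}കാച്ചാണി, വെള്ളയമ്പലം}

\end{document}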

This will produce the following rendering:

Choosing the style variant in Inkscape

The Inkscape font selection dialog has a feature to choose font style variants. It uses the property values of CSS font-feature-settings.

In Adobe InDesign, selecting the ligature offers its stylistic alternates, if any, to choose from.

Updated fonts

Updated fonts are available in SMC’s font download microsite https://smc.org.in/fonts