New version of Stuttgart Finite State Transducer

The Malayalam morphology analyser (mlmorph), which I have been actively developing since 2016, is based on the Stuttgart Finite State Transducer (SFST) formalism. SFST was developed by Dr Helmut Schmid. Since SFST had minimal developer APIs and no functional Python APIs, I was using the Helsinki Finite State Transducer (HFST) toolkit. HFST has Python bindings and better tooling and development support. HFST has an SFST backend, so there was no issue in using the SFST formalism. But HFST development has slowed down in the past few years. [Read More]

New version of Manjari Typeface released

A new version of the Manjari Malayalam typeface is available now. Version 2.000 comes with a few bugfixes and glyph additions. Most of the changes are based on a review by the Google Fonts team. The new version is available at the SMC website for preview and download. Changes: design correction for Malayalam number 10 to better match the attestations (0d6b0fc2); avoid glyphs that use components that are themselves components (6fe9b35d); add ligature glyphs for the ൻ + ് + റ version of nta (2c39ae08); review and update diacritics alignment (c498245d); Rename License. [Read More]

Tesseract OCR web interface

I prepared a web frontend for Tesseract OCR to do optical character recognition for Malayalam - https://ocr.smc.org.in

This application uses Tesseract.js, a JavaScript port of Tesseract.

You can use images with English or Malayalam content. Use the editor and the spellchecker to proofread the recognized text.

Your image does not leave your browser, since the recognition is done in the browser and does not use any remote servers.

Source code: https://gitlab.com/smc/tesseract-ocr-web

Fixing a bug in Malayalam ya, ra, va sign rendering

In Malayalam, the Ya, Va and Ra consonant signs have an interesting problem when they appear together. The Ra sign (്ര, also known as reph) is a prebase sign, meaning it goes to the left side of the consonant or conjunct to which it applies. The Ya sign (്യ) and Va sign (്വ) are post-base signs, meaning they go to the right side of the consonant or conjunct to which they apply. So, after a consonant or conjunct, if both a Ra sign and a Ya sign are present, the Ra sign goes to the left and the Ya sign remains on the right. [Read More]
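
To make the signs concrete, here is a minimal Python sketch that spells out the code points involved. Each sign is the virama (U+0D4D) followed by the consonant letter. The helper name `describe` and the choice of ക (U+0D15) as the base consonant are only for illustration; the sketch shows the data, not the font-level fix.

```python
import unicodedata

VIRAMA = "\u0d4d"            # MALAYALAM SIGN VIRAMA
YA_SIGN = VIRAMA + "\u0d2f"  # post-base Ya sign
RA_SIGN = VIRAMA + "\u0d30"  # pre-base Ra sign
VA_SIGN = VIRAMA + "\u0d35"  # post-base Va sign


def describe(text):
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


base = "\u0d15"  # MALAYALAM LETTER KA, an arbitrary base for the example

# The same two signs stored in the two possible orders.
ra_then_ya = base + RA_SIGN + YA_SIGN
ya_then_ra = base + YA_SIGN + RA_SIGN

for label, text in (("ra, then ya", ra_then_ya), ("ya, then ra", ya_then_ra)):
    print(label)
    for line in describe(text):
        print("   ", line)
```

The two strings contain the same code points in a different order, so they are different data even when a font draws the Ra sign on the left and the Ya sign on the right in both cases.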

English Malayalam Translation using OpusMT

SMC has started a machine translation service at translate.smc.org.in for English-Malayalam. This system uses Hugging Face transformers with OpusMT language models for translation. OPUS MT provides pre-trained neural translation models trained on OPUS data. These models can seamlessly run with the OPUS-MT translation servers that can be installed from the OPUS-MT GitHub repository. The translation service is powered by the Marian Neural MT engine. The quality of the machine translation depends on the availability of parallel corpus. [Read More]
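
As a rough sketch of what translating with an OPUS-MT model through the transformers library can look like: the checkpoint name `Helsinki-NLP/opus-mt-en-ml` is an assumption (the excerpt does not say which model the service loads), and the `translate` helper is hypothetical. It needs the `transformers`, `sentencepiece` and `torch` packages and downloads the model on first use.

```python
# Assumed checkpoint; the service may use a different or fine-tuned model.
MODEL_NAME = "Helsinki-NLP/opus-mt-en-ml"


def translate(sentences, model_name=MODEL_NAME):
    """Translate a list of English sentences with a MarianMT checkpoint."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Tokenize the batch, generate translations, and decode back to text.
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["How are you?"])` returns a list with one Malayalam string. A real service would load the tokenizer and model once at startup instead of on every call.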

Web application for learning Malayalam writing

In my previous blog post, I wrote about an experiment of using SVG path animation to help learn Malayalam letter writing. That prototype application was well received by many people and that encouraged me to host it as a proper application. Now, the Malayalam learning application is available at https://learn.smc.org.in. Source code: https://gitlab.com/smc/mlmash. I added all letters of Malayalam there, and a few common ligatures too. Kavya helped to record and add the pronunciation of these letters with a couple of examples. [Read More]

Animated SVGs for learning Malayalam writing

I wanted to make an educational typeface with writing directions in each glyph. Something like this: But considering the effort it takes, I was a bit confused whether it is really necessary to have a typeface or whether images like this would suffice. Recently, I read about SVG path animations and I thought animating the path inside each letter would be more helpful than a static image with drawing directions. The Chilanka and Manjari typefaces I designed have SVG images with strokes as master designs and, in most cases, the stroke path directions are the writing directions. [Read More]
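
One common way to get this "drawing" effect is to animate `stroke-dashoffset` on a path whose dash length equals its own length. The sketch below generates such an SVG from Python. It is not necessarily how the application does it; the function name `animated_stroke_svg` and the example path data are made up for illustration. Setting `pathLength="1"` normalises the path length so the dash values do not have to be measured.

```python
def animated_stroke_svg(path_d, width=100, height=100, seconds=2.0):
    """Return an SVG document in which the given path appears to be drawn."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">\n'
        # pathLength="1" lets a dash of length 1 cover the whole stroke.
        f'  <path d="{path_d}" pathLength="1" fill="none" stroke="black"\n'
        f'        stroke-width="4" stroke-linecap="round"\n'
        f'        stroke-dasharray="1" stroke-dashoffset="1">\n'
        # Sliding the dash offset from 1 to 0 reveals the stroke from its start.
        f'    <animate attributeName="stroke-dashoffset" from="1" to="0"\n'
        f'             dur="{seconds}s" fill="freeze"/>\n'
        f'  </path>\n'
        f'</svg>\n'
    )


# Hypothetical stroke, not a real glyph outline.
svg = animated_stroke_svg("M10 80 C 40 10, 65 10, 95 80")
```

Because the animation runs from the first point of the path to the last, the direction in which the stroke was drawn in the master design becomes the direction shown to the learner.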

A spellchecker webservice supporting 90 languages

https://spell.toolforge.org/ is a webservice providing a spellcheck web API for 90 languages. I wrote this service hoping that it can potentially be integrated into the Wikipedia editor to help contributors. The spellchecker backend is hunspell for the majority of the languages. It can also proxy similar webservices to provide a single interface. For the Malayalam language it uses such an external web API. This Express based Node.js service interfaces with hunspell using nodehun. API: GET spellcheck/:language/:word: Check the word in the given language for spelling mistakes. [Read More]
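
A minimal client-side sketch for the route above, in Python. It only builds the request URL. The helper name `spellcheck_url` is hypothetical, and it assumes the route sits directly under the site root, which the excerpt does not confirm.

```python
from urllib.parse import quote

BASE_URL = "https://spell.toolforge.org"  # service address from the post


def spellcheck_url(language, word, base_url=BASE_URL):
    """Build the URL for GET spellcheck/:language/:word.

    Assumption: the route is mounted at the site root.
    """
    # Percent-encode both segments so non-ASCII words are sent safely.
    lang_part = quote(language, safe="")
    word_part = quote(word, safe="")
    return f"{base_url.rstrip('/')}/spellcheck/{lang_part}/{word_part}"


print(spellcheck_url("en", "speling"))
```

The resulting URL can be fetched with `urllib.request.urlopen`. The response format is not described in the excerpt, so the sketch does not try to parse it.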

Digital garden

docs.thottingal.in is my personal knowledge space, where I try to document and share everything I know about this world in the form of an online GitBook hosted on GitHub. What is a digital garden? In basic terms, it is a different format for written content on the web. It’s about moving away from blog posts ordered by dates and categories, into more of an interlinked web of notes. Blog posts are articles at a point in time, while here it is a living document. [Read More]

Foreign word detection in mlmorph

The test corpus for Malayalam morphological analysis has many foreign words. They are written either in a non-Malayalam script or in Malayalam script. For example, “ഇലക്ട്രിസിറ്റി”, “ഡോക്സ്”, “ഇന്റർമീഡിയറ്റ്”, “അബ്സ്ട്രാക്റ്റ്”, “ഇല്ലസ്ടേഷൻ”, “ഇല്ലിറ്ററേറ്റ്”, “റെക്കോർഡ്”, “procrastination”, “唐宸禹”: these are all foreign words and it is useless to analyse them using mlmorph. Since mlmorph works based on a root word lexicon, it is practically impossible to have them all in the lexicon. So there should be a way to identify these words easily and tag them as FW, the foreign word part of speech. [Read More]
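
The easier half of the problem, words written in a non-Malayalam script, can be caught with a plain Unicode range check. The Python sketch below is only an illustration under that assumption and is not mlmorph's actual method; the function names are hypothetical. Foreign words written in Malayalam script pass this check and need a different approach.

```python
MALAYALAM_BLOCK = range(0x0D00, 0x0D80)  # Unicode Malayalam block
JOINERS = {"\u200c", "\u200d"}           # ZWNJ and ZWJ occur inside Malayalam words


def is_malayalam_script(word):
    """True if the word is non-empty and uses only Malayalam code points or joiners."""
    return bool(word) and all(
        ord(ch) in MALAYALAM_BLOCK or ch in JOINERS for ch in word
    )


def tag_by_script(word):
    """Return 'FW' for words outside the Malayalam script, else None."""
    return None if is_malayalam_script(word) else "FW"
```

With this check, "procrastination" and "唐宸禹" are tagged FW, while "റെക്കോർഡ്" returns None and is left for later stages to decide.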