Blogs -

Deshabhimani without plurals

Posted on February 21, 2022 | Santhosh Thottingal

Lately I have noticed that the Deshabhimani newspaper avoids plural forms, especially in headlines. I do not know whether this is an editorial decision. Even where the headline has no plural form, the body of the news report does use them. The practice is not consistent everywhere either. Purely out of curiosity, here are a few examples.

deshabhimani

Hyphenation of Indian languages

Posted on February 18, 2022 | Santhosh Thottingal

The latest version of Firefox, Firefox 97, supports hyphenation of Indian languages. I had filed a bug report asking Firefox to include the hyphenation patterns I prepared. That six-year-old bug report is now resolved. Hyphenation is the process of inserting hyphens between the syllables of a word so that, when text is justified, lines can break inside long words and the available space is used fully. The following languages are supported: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Panjabi, Tamil and Telugu. I had written several articles about how to do hyphenation for Indian languages in various applications. [Read More]
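As a rough illustration of how pattern-based hyphenation finds break points, here is a minimal Python sketch of the Liang (TeX-style) pattern algorithm. It assumes TeX-style patterns; the pattern list `TOY_PATTERNS` is a hypothetical toy set made up for this example, not the actual patterns shipped in Firefox, and the function names are illustrative only.

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless a line actually breaks here
ASCII_DIGITS = "0123456789"

# Hypothetical toy patterns: allow a break before some consonants,
# and forbid breaks on either side of the virama (U+0D4D).
TOY_PATTERNS = ["1മ", "1ല", "1യ", "1ള", "2\u0d4d2"]


def parse_pattern(pattern):
    """Split a pattern into its letters and the weights of the gaps around them."""
    letters = "".join(ch for ch in pattern if ch not in ASCII_DIGITS)
    weights = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch in ASCII_DIGITS:
            weights[position] = int(ch)
        else:
            position += 1
    return letters, weights


def hyphenate(word, patterns):
    """Return the pieces of `word` between the allowed break points."""
    dotted = "." + word + "."  # dots mark the word boundaries
    scores = [0] * (len(dotted) + 1)  # scores[i] is the gap before dotted[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                index = start + offset
                scores[index] = max(scores[index], weight)
            start = dotted.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(1, len(word)):
        # An odd score allows a break before word[j]; an even score forbids it.
        if scores[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


def with_soft_hyphens(word, patterns):
    """Join the pieces with soft hyphens, as a renderer-friendly form."""
    return SOFT_HYPHEN.join(hyphenate(word, patterns))


print(hyphenate("മലയാളം", TOY_PATTERNS))
```

The higher even weight around the virama overrides the odd weight before a consonant, which is how a pattern set can keep a conjunct together while still allowing breaks elsewhere.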

hyphenation

Using Manjari as a new orthography Malayalam font

Posted on October 29, 2021 | Santhosh Thottingal

Manjari is a traditional orthography font for Malayalam. It has a large set of ligatures: vowel signs like /u/ attach to their corresponding consonants to form ligatures. But sometimes new orthography Malayalam content needs to be shown in Manjari. Recently, Manjari was used to typeset an academic book about the Malayalam script, and some content had to be shown in new orthography with detached vowel signs and detached reph signs. [Read More]

Manjari opentype malayalam

One million Wikipedia articles by translation

Posted on October 22, 2021 | Santhosh Thottingal

I am happy to share some news from my work at the Wikimedia Foundation. The Wikipedia article translation system, known as Content Translation, has reached the milestone of one million articles created. This has been my major project at WMF since 2015, and I am the lead engineer for the project. The Content Translation system helps Wikipedia editors quickly translate and publish articles from one language wiki to another. This way, the knowledge gap between different languages is reduced. [Read More]

Wikipedia

New version of Malayalam morphology analyser

Posted on March 20, 2021 | Santhosh Thottingal

In the previous blog post I explained my efforts to modernize SFST. Since I had published a Python binding for SFST and modernized it to compile on all operating systems, the next step was to drop mlmorph's HFST dependency and use the new version of SFST, 1.5.0. mlmorph 1.3.0 has no dependency on HFST, and all installation problems across operating systems and Python versions are now solved. The latest version is also available on PyPI. [Read More]

mlmorph fst

New version of Stuttgart Finite State Transducer

Posted on March 20, 2021 | Santhosh Thottingal

The Malayalam morphology analyser (mlmorph) that I have been actively developing since 2016 is based on the Stuttgart Finite State Transducer (SFST) formalism. SFST was developed by Dr Helmut Schmid. Since SFST had minimal developer APIs and no functional Python APIs, I was using the Helsinki Finite State Transducer (HFST) toolkit. HFST has a Python binding and better tooling and development support. HFST has an SFST backend, so there was no issue in using the SFST formalism. But HFST development has slowed down in recent years. [Read More]

mlmorph fst

New version of Manjari Typeface released

Posted on March 19, 2021 | Santhosh Thottingal

A new version of the Manjari Malayalam typeface is available now. Version 2.000 comes with a few bug fixes and glyph additions. Most of the changes are based on a review by the Google Fonts team. The new version is available on the SMC website for preview and download.

Changes:

- Design correction for Malayalam number 10 to better match the attestations (0d6b0fc2)
- Avoid glyphs that use components that are themselves components (6fe9b35d)
- Add ligature glyphs for the ൻ + ് + റ version of nta (2c39ae08)
- Review and update diacritics alignment (c498245d)
- Rename License. [Read More]

manjari fonts releases

Tesseract OCR web interface

Posted on November 14, 2020 | Santhosh Thottingal

I prepared a web frontend for Tesseract OCR to do optical character recognition for Malayalam - https://ocr.smc.org.in

This application uses Tesseract.js, a JavaScript port of Tesseract.

You can use images with English or Malayalam content. Use the editor and the spellchecker to proofread the recognized text.

Your image does not leave your browser: recognition is done in the browser and does not use any remote servers.

Source code: https://gitlab.com/smc/tesseract-ocr-web

malayalam ocr

Fixing a bug in Malayalam ya, ra, va sign rendering

Posted on November 13, 2020 | Santhosh Thottingal

In Malayalam, the Ya, Va and Ra consonant signs pose an interesting problem when they appear together. The Ra sign (്ര, also known as reph) is a prebase sign, meaning it goes to the left of the consonant or conjunct to which it applies. The Ya sign (്യ) and Va sign (്വ) are postbase signs, meaning they go to the right of the consonant or conjunct to which they apply. So if a Ra sign and a Ya sign follow a consonant or conjunct, the Ra sign goes to the left and the Ya sign remains on the right. [Read More]
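To make the two possible orders concrete, the following Python sketch builds both code point sequences for a consonant followed by a Ra sign and a Ya sign. It only looks at the stored data; how a font and shaping engine render each sequence is outside this sketch. The base consonant ക and the helper name `codepoints` are choices made for this example.

```python
import unicodedata

KA = "\u0d15"      # ക, an example base consonant
VIRAMA = "\u0d4d"  # ്
YA = "\u0d2f"      # യ
RA = "\u0d30"      # ര

YA_SIGN = VIRAMA + YA  # ്യ, postbase
RA_SIGN = VIRAMA + RA  # ്ര, prebase

# The same two signs after the same consonant, typed in either order.
ya_then_ra = KA + YA_SIGN + RA_SIGN
ra_then_ya = KA + RA_SIGN + YA_SIGN


def codepoints(text):
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(ya_then_ra))
print(codepoints(ra_then_ya))

# Unicode normalization does not reorder these sequences, so the two
# strings stay distinct in data even if they look alike when rendered.
print(unicodedata.normalize("NFC", ya_then_ra) == ra_then_ya)
```

Because the two strings differ in data, any search, comparison or spellcheck that works on code points will treat them as different words, whatever the rendering looks like.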

malayalam typography opentype

English Malayalam Translation using OpusMT

Posted on November 1, 2020 | Santhosh Thottingal

SMC has started a machine translation service for English-Malayalam at translate.smc.org.in. This system uses Hugging Face transformers with OpusMT language models for translation. OPUS-MT provides pre-trained neural translation models trained on OPUS data. These models can run seamlessly with the OPUS-MT translation servers, which can be installed from the OPUS-MT GitHub repository. The translation service is powered by the Marian Neural MT engine. The quality of the machine translation depends on the availability of parallel corpus. [Read More]
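For readers who want to try a similar setup locally, here is a minimal Python sketch using the `transformers` library with a Marian model. It assumes the `Helsinki-NLP/opus-mt-en-ml` model from the Hugging Face Hub; whether translate.smc.org.in uses exactly this model or this code path is not stated above, and the function names are made up for this example.

```python
def opus_mt_model_name(source, target):
    """Build the Hub name that OPUS-MT models follow for a language pair."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


def translate(sentences, source="en", target="ml"):
    """Translate a list of sentences with a pre-trained OPUS-MT model.

    The import is kept inside the function so the module can be loaded
    without `transformers` installed. The first call downloads the model.
    """
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(source, target)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    # Tokenize the whole batch, padding shorter sentences to equal length.
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Example call (needs `transformers`, `torch`, `sentencepiece` and network access):
# translate(["Hyphenation helps justified text look better."])
```

Loading the model once and reusing it across requests would be the natural next step for a service, since `from_pretrained` is the expensive part.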

malayalam machine-translation