Tesseract OCR web interface

I prepared a web frontend for Tesseract OCR to do optical character recognition for Malayalam - https://ocr.smc.org.in

This application uses Tesseract.js, a JavaScript port of Tesseract.

You can use images with English or Malayalam content. Use the editor and the spellchecker to proofread the recognized text.

Your image does not leave your browser, since the recognition is done in the browser and does not use any remote servers.

Source code: https://gitlab.com/smc/tesseract-ocr-web

Fixing a bug in Malayalam ya, ra, va sign rendering

In Malayalam, the Ya, Va and Ra consonant signs pose an interesting problem when they appear together. The Ra sign (്ര, also known as reph) is a prebase sign, meaning it goes to the left side of the consonant or conjunct to which it applies. The Ya sign (്യ) and Va sign (്വ) are postbase signs, meaning they go to the right side of the consonant or conjunct to which they apply. So, after a consonant or conjunct, if both a Ra sign and a Ya sign are present, the Ra sign goes to the left and the Ya sign remains on the right. [Read More]
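The placement rule above can be illustrated with a small Python sketch. It is only a simplified model for illustration, not how a font or shaping engine implements it: it assumes the signs are encoded as virama (U+0D4D) followed by യ, ര or വ at the end of a cluster, and the function names `split_signs` and `visual_order` are hypothetical.

```python
VIRAMA = "\u0d4d"  # ് (chandrakkala)

# Consonants that form signs when they follow a virama (simplified model).
SIGN_NAMES = {
    "\u0d30": "RA_SIGN",  # ര, prebase: drawn to the left
    "\u0d2f": "YA_SIGN",  # യ, postbase: drawn to the right
    "\u0d35": "VA_SIGN",  # വ, postbase: drawn to the right
}
PREBASE = {"RA_SIGN"}


def split_signs(cluster: str):
    """Split a cluster into (base, signs), with signs in logical (typed) order."""
    signs = []
    end = len(cluster)
    # Peel trailing "virama + ya/ra/va" pairs, leaving at least one base character.
    while end >= 3 and cluster[end - 2] == VIRAMA and cluster[end - 1] in SIGN_NAMES:
        signs.append(SIGN_NAMES[cluster[end - 1]])
        end -= 2
    signs.reverse()
    return cluster[:end], signs


def visual_order(cluster: str):
    """Return the left-to-right order: prebase signs, base, postbase signs."""
    base, signs = split_signs(cluster)
    left = [s for s in signs if s in PREBASE]
    right = [s for s in signs if s not in PREBASE]
    return left + [base] + right


if __name__ == "__main__":
    # ക + ് + ര + ് + യ : Ra sign goes left of ക, Ya sign stays on the right.
    print(visual_order("\u0d15\u0d4d\u0d30\u0d4d\u0d2f"))
```

For the cluster ക + ് + ര + ് + യ, this returns `['RA_SIGN', 'ക', 'YA_SIGN']`: the Ra sign moves to the left of the base while the Ya sign stays on the right.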

English Malayalam Translation using OpusMT

SMC has started a machine translation service for English-Malayalam at translate.smc.org.in. This system uses Hugging Face transformers with OpusMT language models for translation. OPUS-MT provides pre-trained neural translation models trained on OPUS data. These models can seamlessly run with the OPUS-MT translation servers that can be installed from the OPUS-MT GitHub repository. The translation service is powered by the Marian Neural MT engine. The quality of the machine translation depends on the availability of parallel corpora. [Read More]

Web application for learning Malayalam writing

In my previous blog post, I wrote about an experiment in using SVG path animation to help people learn Malayalam letter writing. That prototype application was well received by many people, and that encouraged me to host it as a proper application. Now, the Malayalam learning application is available at https://learn.smc.org.in. Source code: https://gitlab.com/smc/mlmash. I added all the letters of Malayalam there, along with a few common ligatures. Kavya helped to record and add the pronunciation of these letters, with a couple of examples. [Read More]

Animated SVGs for learning Malayalam writing

I wanted to make an educational typeface with writing directions in each glyph. But considering the effort it takes, I was a bit unsure whether a typeface is really necessary or whether images with drawing directions would suffice. Recently, I read about SVG path animations, and I thought animating the path inside each letter would be more helpful than a static image with drawing directions. [Read More]
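As a rough illustration of SVG path animation in general, the Python sketch below generates an SVG in which a single stroke is drawn progressively. It is not necessarily the technique used in the post: it assumes the common trick of normalizing the path with `pathLength="1"` and animating `stroke-dashoffset` from 1 to 0 with a SMIL `<animate>` element. The function name `animated_stroke_svg` and the sample path data are hypothetical and do not represent a real Malayalam letter.

```python
from xml.sax.saxutils import quoteattr


def animated_stroke_svg(path_d: str, duration_s: float = 2.0, size: int = 100) -> str:
    """Build an SVG document whose single path appears to be drawn over time."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">\n'
        # pathLength="1" lets dash values be written as fractions of the path,
        # so the real geometric length never has to be computed.
        f'  <path d={quoteattr(path_d)} pathLength="1" fill="none" stroke="black"\n'
        f'        stroke-width="4" stroke-dasharray="1" stroke-dashoffset="1">\n'
        # Moving the offset from 1 to 0 reveals the stroke from start to end.
        f'    <animate attributeName="stroke-dashoffset" from="1" to="0"\n'
        f'             dur="{duration_s}s" fill="freeze"/>\n'
        f'  </path>\n'
        f'</svg>\n'
    )


if __name__ == "__main__":
    # Hypothetical curve used only as sample data.
    with open("stroke.svg", "w", encoding="utf-8") as f:
        f.write(animated_stroke_svg("M 10 80 C 40 10, 65 10, 95 80"))
```

Opening the generated `stroke.svg` in a browser that supports SMIL animation should show the curve being traced in the direction the path is defined, which is what makes this approach useful for showing stroke direction.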

Gayathri – New Malayalam typeface

Swathanthra Malayalam Computing is proud to announce Gayathri, a new typeface for Malayalam. Gayathri was designed by Binoy Dominic, with OpenType engineering by Kavya Manohar and project coordination by Santhosh Thottingal. This typeface was financially supported by Kerala Bhasha Institute, a Kerala government agency under the cultural department. This is the first time SMC has worked with the Kerala Government to produce a new Malayalam typeface. Gayathri is a display typeface, available in Regular, Bold and Thin style variants. [Read More]

Kindle supports custom fonts

I am pleasantly surprised to see that Amazon Kindle now supports installing custom fonts. This is a big step towards supporting non-Latin content on their devices. I can now read Malayalam ebooks on my Kindle with my favorite fonts. Content rendered in Manjari font. Note that I installed the Bold, Regular and Thin variants so that Kindle can pick the right one. This feature was introduced in the Kindle version released in June 2018. [Read More]

Talk on ‘Malayalam orthographic reforms’ at Grafematik 2018

Santhosh and I presented a paper on ‘Malayalam orthographic reforms: impact on language and popular culture’ at the Grafematik conference held at IMT Atlantique, Brest, France. Our session was chaired by Dr. Christa Dürscheid. The paper we presented is available here. The video of our presentation is available on YouTube. Grafematik is a conference, the first of its kind, bringing together disciplines concerned with writing systems and their representation in written communication. [Read More]

u and uː vowel signs of Malayalam

The reformed or simplified orthographic script style of Malayalam was introduced in 1971 by this government order. This is what is taught in schools. Textbook content is also in the reformed style. The prevailing academic situation does not give students the opportunity to learn the exhaustive and rich orthographic set of the Malayalam script. At the same time, they see a lot of wall writing, graffiti, billboards and handwriting that stick to the exhaustive orthographic set. [Read More]

The ‘u’ vowel signs of Malayalam

The reformed Malayalam script is what appears in textbooks today and what is taught in schools. As a result, formal education gives us no opportunity to become familiar with the stylistic variants of the traditional Malayalam script. Yet these letterforms are still in front of us: in wall writing, on the boards of buses, and in the handwriting of older people who learned to write traditional Malayalam. Most of the conjuncts that were split apart as part of the script reform join together again in our handwriting, knowingly or unknowingly, without any errors. But when the detached signs, especially the ു and ൂ signs, are written attached to the consonant, the styles get mixed up. Look at the picture below. Use of u-signs in wall writing. The reformed script can be seen inside the green mark and the traditional script inside the blue one. What is marked in red is a style that is not customary in Malayalam. [Read More]