It is your language and your pen

Google recently added voice typing support for more languages, Malayalam among them. The speech recognition is of good quality, and I see a lot of positive comments in my social media stream. Many people have started using it as their primary input mechanism. Without doubt this is a big step for Malayalam users. The technical difficulties of writing Malayalam on mobile devices are shrinking considerably. This will lead to more content being generated, which is one of the stated goals of Google’s Next Billion Users project. The cloud API for speech recognition will help Android developers build innovative new apps around the speech recognition feature.

Google added handwriting-based input for many of these languages in 2015. It too was well received by the Malayalam user community, and many chose it as their primary input method on mobile devices.

Google’s machine learning based language tools, including machine translation, are well-engineered projects that take language technology forward. For a language like Malayalam, which has relatively little language processing technology, this is a big boost. There is not even a competing product in the areas mentioned above.

All of the above technologies are closed source software, completely controlled by Google. Google’s open source strategy is a complicated one: it supports and uses open source to gain the maximum out of it, a pragmatic corporate exploitation. Machine learning based technologies are hard to fit into the traditional open source definition. For an ML-based service provider, the training toolkit might be open source (TensorFlow, for example) while the training data and models remain closed and secret. So the system can be reproduced only by those who own the data and have enough processing capacity. These emerging trends in language technology are also hard for individual open source developers to keep up with, because of resource constraints (data, processing capacity).

Is this model good for language?

Think about this. With no competition, the Android operating system with Google’s technology platform is undoubtedly becoming the default presence on the mobile devices of Malayalam speakers. The new language technologies are quickly being accepted as the one and only way to convey a person’s expressions to the digital world. No, this is not an exaggeration. The availability and quality of these tools are clearly winning over the mass of users. There is no formal education for Malayalam typing; people discover and try whatever is available. For a newcomer to the digital world, handwriting was the easiest way to input Malayalam. Now it is speech recognition, and that will be the only way these users know to enter Malayalam content. These tools are fully owned and controlled by Google, with no alternatives.

The open source alternatives for input methods are still at the stage of traditional typing keyboards. Along with their peers, they did win a large user base, and they even reached users before Google entered. For example, the Indic keyboard has 1.4 million installations and is actively improved by contributors for 23 languages. But I don’t see any open source project that is on par with the handwriting and speech recognition based input methods. As a developer working on Indic language technology based on free software, I see this as a failure of the open source community.

I contacted a few academic researchers working on speech recognition and handwriting recognition and asked what they think about these products by Google. For them, it has become more difficult to convince others of the value of their research. ‘Well, we have products from Google that do this and thousands are using them. Why do you want to work on it again?’ This question can’t be answered easily.

But to me, all of these products and their nature described above strongly emphasize the need for free software alternatives. Mediation by closed source systems of one of the fundamental language computing tasks, input, with no alternatives, puts the whole language and hence its users at heavy risk. Input method technologies, speech recognition, handwriting recognition: all of these are core to language technology. These technologies and the science behind them should be owned by the language’s speakers. People should be able to study and innovate on top of this technology, and should be able to build mechanisms for expressing their language that are free from any corporate control.

I don’t want to imply or spread fear and uncertainty that Google will one day start charging for these services or shut down the tools. That is not my concern. The language tools I mentioned are not to be built to face that situation. They are to be developed as fundamental communication tools for the people in the digital age: built, owned, learned, used, and maintained by the people.

Introductory Workshop on Version Control Systems

The IEEE student branch of College of Engineering, Chengannur has taken a commendable initiative in conducting a week-long student quality improvement programme. I was invited to give an introductory workshop on version control systems as part of ISQIP 2016.

It was a good experience to be with a group of enthusiastic youngsters. The workshop covered the need for version control systems and demonstrated version control using git. The slides from the presentation are here.

Feedback on KTU Syllabus of Electronics and Communication Engineering

Kerala Technological University (KTU) published a draft syllabus for the third and fourth semesters of Electronics and Communication Engineering for the coming academic year. It raised widespread concerns regarding:

  • The depth and vastness of contents
  • The obsoleteness of contents
  • Sequence of introducing concepts and the pedagogy involved
  • FOSS friendliness

To discuss the matter and collect feedback from a wider academic community, KTU called for a syllabi discussion meeting at its office on 13th May, 2016. More than a hundred faculty members from various engineering colleges in Kerala came and expressed their genuine concerns, comments and suggestions. I got the opportunity to attend.

The syllabi committee agreed to wait until 20th May to receive more comments before publishing the revised draft by 25th May. The collaboratively created document on the changes to be incorporated into the content of various courses can be found here.

Concerns on Electronic Design Automation Lab

As per the published draft syllabi, KTU plans to introduce a new course, ‘ELECTRONICS DESIGN AUTOMATION LAB’, in its third semester Electronics and Communication Engineering syllabus. It is well and good to familiarize students with the software tools needed to automate tasks such as design and simulation of electronic circuits, numerical computation, PCB design, and hardware description using HDL.

Many pedagogical concerns were raised in the meeting about introducing all these diverse EDA tools as a single course. The need for each tool should be obvious to the students while they learn it. It was proposed that SPICE simulations should go along with Network Theory and Electronic Circuits, and that Logic Circuit Design should be taught with the aid of HDL.

What is more of a concern is that the syllabus is not FOSS friendly. It clearly specifies proprietary tools such as MATLAB for numerical computation, analysis and plotting. It specifies PSpice for electronic circuit simulation and VHDL for logic design. This proposal would amount to forcing every technology institute to buy licensed versions of this software.

The Govt. of India has an Open Source Software adoption policy. The Kerala State Govt. too has a policy to adopt Free and Open Source software. As per this policy, use of proprietary tools is allowed only when there are no open source alternatives. Open source software such as Scilab and Xcos, Octave, SciPy and NumPy can be used for the numerical computation experiments specified in the syllabus.

Why is FOSS adoption important?

Teaching and learning should not be tool or product specific. The syllabus should be neutral and should not endorse a brand. Students should not be locked in to a specific vendor. Learners who wish to install the software and experiment further shouldn’t be restricted by licensing terms or high cost; otherwise the syllabus would encourage unethical practices like the use of pirated copies of software.

Open source software is developed through open collaboration. Its algorithmic implementations are not black boxes as in proprietary tools; they are openly licensed for enthusiasts to learn from and modify. Learning an EDA tool should not end with the lab course. Students should acquire the skills necessary to solve any engineering problem that comes their way using these tools.

There are MHRD initiatives like the FOSSEE (Free/Libre and Open Source Software for Education) project to promote the use of FOSS tools to improve the quality of education in our country. The aim is to reduce dependency on proprietary software in educational institutions. FOSSEE encourages the use of FOSS tools through various activities to ensure commercial software is replaced by equivalent FOSS tools. They even develop new FOSS tools and upgrade existing tools to meet requirements in academia and research. FOSSEE supports academic institutions in FOSS adoption through lab migration and textbook companion projects.

Why shouldn’t KTU collaborate with FOSSEE?

FOSS adoption might not be a very easy task. There might be a need for technical support to institutions and faculty members. KTU can collaborate with FOSSEE in this regard. They have created a repository of spoken tutorials for various FOSS tools for numerical computations, analog and digital circuit simulation etc.

Free software has training and maintenance costs, just like proprietary software. But the learning curve can be smoothed through joint efforts.

If the tools and software used are open source, KTU can plan to create an open repository of solved simulation experiments, which can be continuously enriched by contributions from faculty members and students. I hope KTU takes some steps in this direction as per the suggestions submitted.

Making of Keraleeyam font: From ASCII to Unicode

Keraleeyam is a new Unicode Malayalam font designed for titles. It was originally designed in 2005 for ‘Keraleeyam’, a magazine supporting environmental movements in Kerala, with ASCII encoding, and was distributed along with the Rachana editor software.

Unicode font feature tables for Malayalam are complex and include diverse rules for ligature formation and glyph positioning. Keraleeyam, which was originally ASCII encoded, contained no such rules. With 792 glyphs in the font, adding the rules manually for each glyph would have been a Herculean task. The rules also needed to be duplicated to support both the latest and the old OpenType specifications, which ensures that the font is rendered correctly by all applications on new and reasonably old operating systems.

I am happy to say that adding the font features turned out to be less difficult than one would expect, thanks to the existing Unicode font Rachana, which has few known bugs and an extensive set of 1083 glyphs, and thanks to Hussain K. H., who designed every glyph and gave it the same name as the corresponding glyph in Rachana. Rajeesh K. V. imported the feature tables of Rachana and applied them to Keraleeyam in a semi-automated manner.
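Why shared glyph names matter can be sketched in a few lines of Python. This is only an illustration under assumptions: the glyph names, the tuple-based rule format and the `transferable_rules` helper are all hypothetical, and the real import worked on OpenType feature tables rather than on simple tuples. The idea shown is that a rule written against glyph names can be reused unchanged wherever the target font has every glyph the rule mentions, leaving only the remainder for manual attention.

```python
# Hypothetical miniature of reusing feature rules between two fonts that
# share glyph names. The names and rules are made up for illustration and
# are not taken from Rachana or Keraleeyam.

def transferable_rules(rules, target_glyphs):
    """Split ligature rules into those the target font can take as-is
    and those that mention a glyph the target font lacks."""
    usable, skipped = [], []
    for components, ligature in rules:
        needed = set(components) | {ligature}
        if needed <= target_glyphs:
            usable.append((components, ligature))
        else:
            skipped.append((components, ligature))
    return usable, skipped


# Source font rules: (component glyph names) -> ligature glyph name
source_rules = [
    (("k1", "xx", "k1"), "k1k1"),
    (("k1", "xx", "th1"), "k1th1"),
    (("n1", "xx", "th1"), "n1th1"),
]

# The target font has a smaller glyph set but uses the same names.
target_glyphs = {"k1", "th1", "n1", "xx", "k1k1", "n1th1"}

usable, skipped = transferable_rules(source_rules, target_glyphs)
print(len(usable), "rules reusable;", len(skipped), "need manual attention")
```

With differently named glyphs, none of the rules would match, and each one would have to be rewritten by hand.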

What remained were the optimization tasks of kerning and positioning, and I contributed to that fine tuning. The beta version of the Keraleeyam font was released by Murali Thummarukudi as part of the 13th anniversary celebrations of Swathanthra Malayalam Computing at Vylopilli Samskrithi Bhavan on 16th December 2014.

The project is hosted here. We are seeking comments and feedback ahead of the release of a stable version soon.

 

Video of our presentation from 7th Multilingual Workshop by W3C

Video of our presentation from 7th Multilingual Workshop by W3C, Madrid, Spain, May 7-8


Best Practices on the Design of Translation: Pau Giner, David Chan and Santhosh Thottingal.

Abstract: Wikipedia is one of the most multilingual projects on the web today. In order to provide access to knowledge to everyone, Wikipedia is available in more than 280 languages. However, the coverage of topics and detail varies from language to language. The Language Engineering team from the Wikimedia Foundation is building open source tools to facilitate the translation of content when creating new articles to facilitate the diffusion of quality content across languages. The translation process in Wikipedia presents many different challenges. Translation tools are aimed at making the translation processes more fluent by integrating different tools such as translation services, dictionaries, and information from semantic databases as Wikidata.org. In addition to the technical challenges, ensuring content quality is one of the most important aspects considered during the design of the tool since any translation that does not read natural is not acceptable for a community focused on content quality. This talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.

GSOC 2014 – Mentoring for SMC

I am a mentor for Google Summer of Code 2014 for SMC. I will be helping Praveen Sridhar port input methods from jquery.ime to Firefox OS.

We started the project and Praveen already has a proof of concept ready.

Tim Chien and Rudy Lu from Mozilla are co-mentoring the project.

Collaboratively edited documentation for Indic font developers

Fonts are one of the integral building blocks for providing multilingual support for digital content, and OpenType fonts are currently the format of choice. With the increasing need to support languages beyond the Latin script, the TrueType font specification was extended to include elements for more elaborate writing systems. This effort was jointly undertaken in the 1990s by Microsoft and Adobe. Its outcome was the OpenType Specification, a successor to the TrueType font specification.

The Devanagari ddhrya-ligature, as displayed in the JanaSanskritSans font.

Fonts for Indic languages had traditionally been created for the printing industry. The TrueType specification provided the baseline for the digital fonts that were largely used in desktop publishing. These fonts however suffered from inconsistencies arising from technical shortcomings like non-uniform character codes. These shortcomings made the fonts highly unreliable for digital content and their use across platforms. The problems with character codes were largely alleviated with the gradual standardization through modification and adoption of Unicode character codes. The OpenType Specification additionally extended the styling and behavior for the typography.

The availability of the specification eased the process of creating Indic language fonts with consistent typographic behaviour as per the script’s requirement. However, disconnects between the styling and technical implementation hampered the font creation process. Several well-stylized fonts were upgraded to the new specification through complicated adjustments, which at times compromised their aesthetic quality. On the other hand, the technical details of the specification were comparatively new know-how for font designers. To strike a balance, a group of font developers and designers undertook an initiative to document the knowledge acquired from hands-on experience for the benefit of upcoming developers and designers in this field.

Glyphs inside Meera font

The outcome of the project will be an elaborate, illustrated guideline for font designers. A chapter will be dedicated to each of the Indic scripts – Bengali, Devanagari, Gujarati, Kannada, Malayalam, Odia, Punjabi, Tamil and Telugu. The guidelines will outline the technical representation of the canonical aspects of these complex scripts. This is especially important when designing for complex scripts where the shape or positioning of a character depends on its relation to other characters.

This project is open for participation and contributors can commit directly on the project repository.

New version of Malayalam fonts released

The Swathanthra Malayalam Computing project announced the release of a new version of its Malayalam Unicode fonts this week. This version brings many improvements to the popular Malayalam fonts Rachana and Meera, and some bug fixes to the Dyuthi font. I am listing the changes below.

  1. Meera font was small compared to other fonts. This was not really a problem in the GNOME environment, since fontconfig allows you to define a scaling factor to match other font sizes. But it was an issue in LibreOffice, KDE and especially Windows, where this kind of scaling does not work. Thanks to P Suresh for reworking the glyphs and fixing this issue.
  2. Rachana, Meera and Dyuthi had wrong glyphs used as placeholder glyphs. Bugs like these are fixed.
  3. Virama (0D4D) had a wrong LSB (left side bearing) that caused cursor positioning and glyph boundaries to go wrong. That bug is fixed.
  4. The atomic chillu code points introduced in Unicode 5.1 were missing from all the fonts SMC maintains, because of the controversial decision by Unicode and SMC’s stand against it. The issues still exist, but content using these code points is already out there, so to avoid difficulties for users we added those characters to the Meera and Rachana fonts.
  5. Rupee symbols added to Meera and Rachana. Thanks to Hiran for designing Sans and Serif glyphs for the rupee.
  6. Dot Reph (0D4E): the glyph for this was already present in Meera but was not mapped to any Unicode code point. GSUB lookup tables were added for the glyph according to the Unicode specification.
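For readers who want to inspect the code points named in this list, here is a small sketch using Python's standard `unicodedata` module. The virama and dot reph code points come from the list above. The range U+0D7A to U+0D7F for the atomic chillus and the NA + virama + ZWJ sequence used for comparison are additions made for this illustration. The last step shows why a font has to handle both encodings: normalization does not turn one into the other.

```python
import unicodedata

# Code points taken from the change list above.
listed = [0x0D4D, 0x0D4E]

# The atomic chillu letters added in Unicode 5.1 occupy U+0D7A..U+0D7F.
chillus = [chr(cp) for cp in range(0x0D7A, 0x0D80)]

for cp in listed:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
for ch in chillus:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Chillu N written as a sequence: NA + virama + zero width joiner.
sequence_n = "\u0d28\u0d4d\u200d"
# Chillu N written with the atomic code point.
atomic_n = "\u0d7b"

# The two spellings stay distinct even after NFC normalization, so text
# written either way must be rendered by the font.
same_after_nfc = (
    unicodedata.normalize("NFC", sequence_n)
    == unicodedata.normalize("NFC", atomic_n)
)
print("equal after NFC:", same_after_nfc)
```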

For a more detailed change description see this mail thread. There are some minor changes as well.

Thanks to Hussain K H (designer of both Meera and Rachana), P Suresh and Hiran for their valuable contributions, and thanks to the SMC community and font users for using the fonts and reporting bugs. We hope to bring this new version to your favorite GNU/Linux distros soon. Wikimedia’s WebFonts extension uses the Meera font, and the font will be updated there soon. The next release of GNU FreeFont is expected to update its Malayalam glyphs using Meera and Rachana for freefont-sans and freefont-serif respectively. We also plan to update the other fonts we maintain with these changes in their next versions. Some glyphs are still missing from these fonts with respect to the latest Unicode version.

 

Malayalam Wikisource Offline version

The Malayalam Wikisource community today released the first offline version of Malayalam Wikisource during the 4th annual wiki meetup of Malayalam Wikimedians. To the best of our knowledge, this is the first time a Wikisource project has released an offline version. The Malayalam wiki community released the first offline version of Malayalam Wikipedia one year ago.

Releasing an offline version of a Wikisource is a challenging project. I designed and implemented the technical aspects of the project, so let me share the details.

As you know, a Wikisource contains a lot of books. Each book varies in size and is divided into chapters or sections. There is no common pattern; each book has its own structure. The presentation of a novel is different from that of a collection of poems by a poet. Wikisource also has religious books like the Bible, Quran, Bhagavat Geeta and Ramayana. Since books are meant for continuous reading over a long time, readability and the way lengthy chapters are presented on screen also matter. Offline Wikipedia tools such as Kiwix do not modify the layout of the content and present it as it is shown in Wikipedia or Wikisource. The tool we wrote last year for the Malayalam Wikipedia offline version also presents vertically scrollable content on the screen. Neither can be configured to give different presentation styles depending on the nature of the book.

What we wanted was a book reader kind of application interface. Readers should be able to navigate easily to books and chapters. The chapter content can be very lengthy, and for reading it over a long time a long, vertically scrolled text is not a good idea. We also need to take care of line width: if each line spans 80-90% of the screen, especially on a widescreen monitor, it strains the neck and eyes.

 

Screenshot of Offline version.


The selection of books for the offline version was done by the active Wikimedians at Wikisource. Some of the selected books were proofread by many volunteers within the last 2 weeks.

The tools used for extracting the HTML were ad hoc and adapted to give a good presentation of each book, so there is not much to reuse here. Basically, our scripts fetched the HTML, took the content part alone using pyquery, and removed some unwanted sections. The content was then added to predefined HTML templates with proper CSS for the UI. The CSS3 multi-column feature was used for the book-like interface. Since IE did not implement this standard even in IE9, the book-like interface was not provided for that browser. Chrome versions below 12 could not support it either, because of these bugs: http://code.google.com/p/chromium/issues/detail?id=45840 and http://code.google.com/p/chromium/issues/detail?id=78155. For easy navigation, mouse wheel support and page navigation buttons are provided. To work around missing fonts, webfonts were integrated, with a selection box for choosing a favorite font. Readers can also select the font size to make reading comfortable.
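The extract, clean and template steps can be sketched roughly as follows. This is not the script we used: it swaps pyquery for Python's standard `html.parser` so that it runs without extra packages, and the element id `content`, the class names `editsection` and `printfooter`, the stylesheet name `book.css` and the sample page are all assumptions made for illustration.

```python
import html
from html.parser import HTMLParser
from string import Template

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ContentExtractor(HTMLParser):
    """Keep the markup inside one container element, minus unwanted blocks."""

    def __init__(self, content_id, unwanted_classes):
        super().__init__(convert_charrefs=True)
        self.content_id = content_id
        self.unwanted = set(unwanted_classes)
        self.depth = 0       # nesting depth inside the content container
        self.skip_depth = 0  # nesting depth inside an unwanted block
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Still outside the container: only look for its opening tag.
            if attrs.get("id") == self.content_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.parts.append(self.get_starttag_text())
            return
        self.depth += 1
        classes = set((attrs.get("class") or "").split())
        if self.skip_depth or classes & self.unwanted:
            self.skip_depth += 1
        else:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.depth > 0:  # do not emit the container's own end tag
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.skip_depth:
            self.parts.append(html.escape(data, quote=False))


# Hypothetical page template; the real templates carried the reader UI.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>$title</title>
<link rel="stylesheet" href="book.css"></head>
<body><div class="book">$content</div></body></html>""")


def extract_chapter(page_html, title):
    extractor = ContentExtractor("content", {"editsection", "printfooter"})
    extractor.feed(page_html)
    extractor.close()
    body = "".join(extractor.parts).strip()
    return PAGE_TEMPLATE.substitute(title=title, content=body)


page = """<html><body><div id="nav">menu</div>
<div id="content"><h2>Chapter 1<span class="editsection">[edit]</span></h2>
<p>First line<br>second line</p><div class="printfooter">footer</div></div>
</body></html>"""

result = extract_chapter(page, "Chapter 1")
print(result)
```

In practice each book needed its own list of sections to drop and its own template, which is why the scripts stayed ad hoc.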

Why static HTML? The variety of platforms and versions we need to support, the need for webfonts, complex script rendering, the effort to develop and customize the UI, the relatively small size of the data, and the wish to avoid installing any software on the user’s system made us choose static HTML + jQuery + CSS as the technology. The downside is that we could not provide full text search.

Apart from Wikisource, we also included a collection of copyleft images from Wikimedia Commons. Thanks to Nishan Naseer for preparing a gallery application using jQuery. We selected 4 categories from Commons that are related to Kerala. We hope everybody will like the pictures and that they will give a small introduction to Wikimedia Commons.

Even though the Python scripts are not ready for reuse in other projects, if anybody wants to have a look at them, please mail me. I am not making them public since the scripts do not make sense outside the context of each book and its existing presentation in Malayalam Wikisource.

The CD image is available for download here and one can also browse the CD content here.

Thanks to Shiju Alex for coordinating this project, and thanks to all Malayalam Wikisource volunteers for making this happen. We have included poems, folk songs, devotional songs, novel, grammar book, tales, and books on Hinduism, Islam, Christianity, Communism and Philosophy. With this release, it becomes the biggest offline digital archive of Malayalam books.