Video of our presentation from 7th Multilingual Workshop by W3C

Video of our presentation from 7th Multilingual Workshop by W3C, Madrid, Spain, May 7-8


Best Practices on the Design of Translation- Pau Giner, David Chan and Santhosh Thottingal.

Abstract: Wikipedia is one of the most multilingual projects on the web today. In order to provide access to knowledge to everyone, Wikipedia is available in more than 280 languages. However, the coverage of topics and detail varies from language to language. The Language Engineering team from the Wikimedia Foundation is building open source tools to facilitate the translation of content when creating new articles to facilitate the diffusion of quality content across languages. The translation process in Wikipedia presents many different challenges. Translation tools are aimed at making the translation processes more fluent by integrating different tools such as translation services, dictionaries, and information from semantic databases as Wikidata.org. In addition to the technical challenges, ensuring content quality is one of the most important aspects considered during the design of the tool since any translation that does not read natural is not acceptable for a community focused on content quality. This talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.

Typesetting Malayalam using XeTeX

XeTeX is an extension of TeX with built-in support for Unicode and OpenType. In this tutorial, we are going to learn how to typeset Malayalam using XeTeX. With some learning effort, we can produce high quality typesetting using XeTeX. 

Installing XeTeX

XeTeX is packaged for all famous GNU/Linux distros. The installation method depends your distro. For ease of installation and configuration, we suggest to use a TeXLive version 2012 or above – either standalone TeXLive distribution or install from your distribution’s package manager. Windows and OSX versions are also available.

Following packages are required to install to get a working xetex environment in your computer. Note that these packages are relatively large in size and will take time and bandwidth.

  1. texlive-xetex
  2. texlive-latex-extra
  3. texlive-lang-indic

You also need reasonably good unicode compatible Malayalam fonts. These fonts also comes with GNU/Linux distros. Search for malayalam fonts in your package manager and install if not already installed. Eg fonts: Meera, Rachana etc.

Creating documents using XeTeX

A simple document to learn usage of xetex is given below.

Using a text editor like gedit or kate, create a new file with .tex as file extension. Eg: example.tex. Copy the following content as the content for that file and save.

\documentclass[11pt]{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{malayalam}
\setmainfont[Script=Malayalam, HyphenChar="0000]{Rachana}
% In the above line we customized Hyphenation characters since
% visbile hyphen, aka Soft Hyphen is not used for Malayalam
\lefthyphenmin=3
\righthyphenmin=4
\title{\textbf{സ്വർണം}}
\author{മലയാളം വിക്കിപീഡിയ}
\date{}
\begin{document}

\maketitle

\section{സ്വർണം}

മൃദുവും തിളക്കമുള്ളതുമായ മഞ്ഞലോഹമാണ് സ്വർണം. വിലയേറിയ ലോഹമായ സ്വർണം, നാണയമായും, ആഭരണങ്ങളുടെ രൂപത്തിലും നൂറ്റാണ്ടുകളായി മനുഷ്യൻ ഉപയോഗിച്ചു പോരുന്നു. 
ചെറിയ കഷണങ്ങളും തരികളുമായി സ്വതന്ത്രാവസ്ഥയിൽത്തന്നെ പ്രകൃതിയിൽ ഈ ലോഹം കണ്ടുവരുന്നു. ലോഹങ്ങളിൽ വച്ച് ഏറ്റവും നന്നായി രൂപഭേദം വരുത്താവുന്ന ലോഹമാണിത്.
\footnote{http://www.webelements.com/webelements/elements/text/Au/key.html "Key properties of gold" (in ഇംഗ്ലീഷ്). ശേഖരിച്ചത് 2007-06-18.}

\section{ഗുണങ്ങൾ}
സ്വർണത്തിന്റെ അണുസംഖ്യ 79-ഉം പ്രതീകം Au എന്നുമാണ്. ഔറം എന്ന ലത്തീൻ വാക്കിൽ നിന്നാണ് Au എന്ന പ്രതീകം ഉണ്ടായത്.
ഏറ്റവും നന്നായി രൂപഭേദം വരുത്താൻ സാധിക്കുന്ന ലോഹമാണ് സ്വർണ്ണം. ഒരു ഗ്രാം സ്വർണ്ണം അടിച്ചു പരത്തി ഒരു ചതുരശ്രമീറ്റർ വിസ്തീർണ്ണമുള്ള ഒരു തകിടാക്കി മാറ്റാൻ സാധിക്കും. 
അതായത് 0.000013 സെന്റീമീറ്റർ വരെ ഇതിന്റെ കനം കുറക്കാൻ കഴിയും. അതു പോലെ വെറും 29 ഗ്രാം സ്വർണ്ണം ഉപയോഗിച്ച് 100 കിലോ മീറ്റർ നീളമുള്ള കമ്പിയുണ്ടാക്കാനും സാധിക്കും. 

\section{ചരിത്രം}
ചരിത്രാതീത കാലം മുതൽക്കേ അറിയപ്പെട്ടിരുന്ന അമൂല്യലോഹമാണ്‌ സ്വർണ്ണം. ഒരുപക്ഷേ മനുഷ്യൻ ആദ്യമായി ഉപയോഗിച്ച ലോഹവും ഇതുതന്നെയായിരിക്കണം.
ബി.സി.ഇ. 2600 ലെ ഈജിപ്ഷ്യൻ ഹീറോഗ്ലിഫിക്സ് ലിഖിതങ്ങളിൽ ഈജിപ്തിൽ സ്വർണ്ണം സുലഭമായിരുന്നെന്ന് പരാമർശിക്കുന്നുണ്ട്.
ചരിത്രം പരിശോധിച്ചാൽ ഈജിപ്തും നുബിയയുമാണ്‌ ലോകത്തിൽ ഏറ്റവുമധികം സ്വർണ്ണം ഉല്പ്പാദിപ്പിച്ചിരുന്ന മേഖലകൾ. ബൈബിളിലെ പഴയ നിയമത്തിൽ സ്വർണ്ണത്തെപ്പറ്റി പലവട്ടം പരാമർശിക്കുന്നുണ്ട്.

\end{document}

Now you need to compile this document to generate PDF.

xelatex example.tex

Output of the above content can be seen here.

The above tutorial is a very basic tutorial on using XeTeX with Malayalam. For detailed tutorial, please refer any tutorial available freely in internet. Example: https://en.wikibooks.org/wiki/LaTeX

GSOC 2014 – Mentoring for SMC

I am a mentor for Google Summer of Code 2014 for SMC. I will be helping Praveen Sridhar to port input methods from jquery.ime to the Firefox OS.

We started the project and Praveen already has a proof of concept ready.

Tim Chien and Rudy Lu from Mozilla is co-mentoring the same project

Parsing CLDR plural rules in javascript

English and many other languages have only 2 plural forms. Singular if the count is one and anything else is plural including zero.

But for some other languages, the plural forms are more than 2. Arabic, for example has 6 plural forms, sometimes referred as ‘zero’, ‘one’, ‘two’, ‘few’, ‘many’, ‘other’ forms. Integers 11-26, 111, 1011 are of ‘many’ form, while 3,4,..10 are ‘few’ form.

While preparing the interface messages for application user interfaces, grammatically correct sentences are must. “Found 1 results” or “Found 1 result(s)” are bad interface messages. For a developer, if the language in the context is English or languages having similar plural forms, it may be a matter of an if condition to conditionally choose one of the messages.

But that approach is not scalable if we want to deal with lot of languages. Some applications come with their own plural handling mechanism, probably by a module that tells you the plural form, given a number, and language. The plural forms per language and the rules to determine it is defined in CLDR. CLDR defines the plural rules in a markup language named LDML and releases the collections frequently.

If you look at the CLDR plural rules table you can easily understand this. The rules are defined in a particular syntax. For example, the Russian plural rules are given below.

One need to pass the value of the number to the variable in the above expressions and evaluate. If the expression evaluates to a boolean true, then the corresponding plural form should be used.

So, an expression like  n = 0 or n != 1 and n mod 100 = 1..19 mapped to ‘many’ holds true if the value of n=0,119, 219, 319. So we say that they are of ‘few’ plural form.

But in the Russian example given above, we don’t see n, but we see variables v, i etc. The meaning of these variables are defined in the standard as:

Symbol Value
n absolute value of the source number (integer and decimals).
i integer digits of n.
v number of visible fraction digits in n, with trailing zeros.
w number of visible fraction digits in n, without trailing zeros.
f visible fractional digits in n, with trailing zeros.
t visible fractional digits in n, without trailing zeros.

Keeping these definitions in mind, the expression v = 0 and i % 10 = 1 and i % 100 != 11 evaluates true for 1,21,31, 41 etc and false for 11. In other words, number 1,21,31 are of plural form “one” in Russian.

A module to support the plural forms for any language can manually(or semi automatically) convert this expressions to programming language one time and use it. Twitter-cldr a CLDR abstraction library by twitter follows this method. It converted the above given plural rules to the following javascript expression using a compiler.

This works. But CLDR updates the plural rules in every releases. Most of the time, it contains additional language support. Sometimes the rules are changed or improved too. The maintainer of the module need to recompile them to javascript expression in such cases.

If we can write a compiler to generate javascript from this expressions, can’t we write a parser-evaluator for the expressions? So that we just need to pass the rule and the number to that evaluator  and it returns the plural form?

CLDRPluralRuleParser

 CLDR Plural Rule Evaluator
CLDR Plural Rule Evaluator

CLDRPluralRuleParser is that evaluator. I wrote this parser when we at Wikimedia foundation wanted a data driven plural rule evaluation for the 300+ languages we support. It started as a free time project in June 2012. Later it became part of MediaWiki core to support front-end internationalization. We wanted a PHP version also to support interface messages constructed at server side. Tim Starling wrote a PHP CLDR plural rule evaluator.

It is javascript library that takes the standardized plural rule and an integer and returning true or false depending on whether the rule pass for the given integer. It is written with UMD/common.js pattern and available as a node module too.

The node module comes with command line interface, just to experiment with rules.

$ cldrpluralruleparser 'n is 1' 0

false

The module does not self contain the plural rules collection or data. Developers need to have that collection as an xml or json inside the application and need to pass to the module. In that sense, one cannot offload the whole i18n message processing task to this module. For a more handy internationalization with javascript, that takes care of plural, gender, grammar etc, you may consider jquery.i18n which contains CLDRPluralRuleParser.

An example showing how to use the CLDR supplied plural rule data and this library is included in the repository. You can play with that application here.

License: Initially the license of the module was GPL, but as per some of the collaboration discussion between Wikimedia, cldrjs, jQuery.globalize, moment.js, it was decided to change the license to MIT.

Browser language preferences: navigator.languages is coming

Browsers provide an option to choose the preferred language a website to be shown, often named as “Accept language“.

FirefoxLanguageSelection
Firefox accept language preference

These preference values allows websites to deliver a suitable language version to the user.

For the developers, to read this value, the existing options is to check the Accept-Language http header value.  It works and many websites use it already. But this value was never exposed at client side. Javascript cannot access the value of these preferences. There are many use cases where this is preferred. Handling i18n at client side is one of this.

Chrome accept language preference
Chrome accept language preference

navigator.language does exist, but that does not give the correct values. For chrome, it gives browsers UI language and it differs from what is meant by accept-languages. Firefox 5 onwards this property’s value is based on the value of the Accept-Language header value. It returns a string value, but accept-language is usually a list of language values in the order of preference.

The good news is, a patch just landed in Firefox to support navigator.languages

It returns an array of language tags representing the user’s preferred languages, with the most preferred language first.

The most preferred language is the one returned by navigator.language

See specification.

Now that it is landed in Firefox, Blink developers are also considering the implementation.

This will definitely improve the web experience to users and help a lot for internationalization developers.

W3C Workshop at Madrid

I will be speaking at the upcoming W3C workshop at Madrid. The workshop is on 7-8 May 2014 and the theme is “New Horizons for the Multilingual Web”.

I will be co-presenting with Pau Giner, David Chan from Wikimedia Foundation Language engineering team on best practices of translation at wikipedia. It will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.

Meera Tamil font in Ubuntu Trusty Tahr

Ubuntu Trusty Tahr is going to be released on April 17th 2014.Meera Tamil

Meera Tamil font, a free licensed unicode font for Tamil will be available in this release.

The font is already available in Debian. In both Ubuntu and Debian you can install the font by

sudo apt-get install fonts-meera-taml

Thanks Vasudev for packaging it for Debian.

Configurable node logger with winston

For an advanced logging system for nodejs applications, winston is very helpful. Winston is a multi-transport async logging library for node.js. Similar to famous logging systems like log4j, we can configure the log levels and winston allows to define multiple logging targets like file, console, database etc.

I wanted to configure logging as per usual nodejs production vs development environment. Of course with development mode, I am more interested in debug level logging and at production environment I am more interested in higher level logs.

I am sharing my singleton logger instance setup code.