So I did some cleanup and rewriting, added documentation and an example, and here it is: http://thottingal.in/projects/swanalekha/swanalekha-ml.html
Video of our presentation from 7th Multilingual Workshop by W3C, Madrid, Spain, May 7-8
Best Practices on the Design of Translation - Pau Giner, David Chan and Santhosh Thottingal.
Abstract: Wikipedia is one of the most multilingual projects on the web today. In order to provide access to knowledge to everyone, Wikipedia is available in more than 280 languages. However, the coverage of topics and detail varies from language to language. The Language Engineering team from the Wikimedia Foundation is building open source tools to facilitate the translation of content when creating new articles to facilitate the diffusion of quality content across languages. The translation process in Wikipedia presents many different challenges. Translation tools are aimed at making the translation processes more fluent by integrating different tools such as translation services, dictionaries, and information from semantic databases such as Wikidata.org. In addition to the technical challenges, ensuring content quality is one of the most important aspects considered during the design of the tool since any translation that does not read naturally is not acceptable for a community focused on content quality. This talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.
XeTeX is an extension of TeX with built-in support for Unicode and OpenType. In this tutorial, we are going to learn how to typeset Malayalam using XeTeX. With some learning effort, we can produce high quality typesetting using XeTeX.
XeTeX is packaged for all popular GNU/Linux distros. The installation method depends on your distro. For ease of installation and configuration, we suggest using TeX Live 2012 or later, either as the standalone TeX Live distribution or installed from your distribution's package manager. Windows and OS X versions are also available.
The following packages need to be installed to get a working XeTeX environment on your computer. Note that these packages are relatively large and will take time and bandwidth to download.
You also need reasonably good Unicode-compatible Malayalam fonts. These fonts also come with GNU/Linux distros. Search for Malayalam fonts in your package manager and install them if they are not already installed. Example fonts: Meera, Rachana.
Creating documents using XeTeX
A simple document for learning how to use XeTeX is given below.
Using a text editor like gedit or kate, create a new file with .tex as the file extension, for example example.tex. Copy the following content into that file and save it.
Now you need to compile this document to generate a PDF.
Output of the above content can be seen here.
The above is a very basic tutorial on using XeTeX with Malayalam. For a detailed tutorial, please refer to any of the tutorials freely available on the internet. Example: https://en.wikibooks.org/wiki/LaTeX
How to configure Ubuntu 14.04 for typing in Malayalam
English and many other languages have only 2 plural forms: singular if the count is one, and plural for anything else, including zero.
But some other languages have more than 2 plural forms. Arabic, for example, has 6 plural forms, sometimes referred to as the ‘zero’, ‘one’, ‘two’, ‘few’, ‘many’ and ‘other’ forms. The integers 11-26, 111 and 1011 are of the ‘many’ form, while 3, 4, ... 10 are of the ‘few’ form.
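The Arabic categories mentioned above can be sketched as a small function. This is only an illustration for non-negative integers, written by hand from the integer rules that CLDR lists for Arabic; the name `arabicPluralCategory` is hypothetical and not part of any library.

```javascript
// Hypothetical helper: plural category of a non-negative integer in Arabic.
// Integers only; numbers with fraction digits are not handled in this sketch.
function arabicPluralCategory(n) {
  if (n === 0) return 'zero';
  if (n === 1) return 'one';
  if (n === 2) return 'two';
  const rem = n % 100;
  if (rem >= 3 && rem <= 10) return 'few';
  if (rem >= 11 && rem <= 99) return 'many';
  return 'other';
}

console.log(arabicPluralCategory(5));   // few
console.log(arabicPluralCategory(111)); // many
console.log(arabicPluralCategory(100)); // other
```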
When preparing the messages for an application's user interface, grammatically correct sentences are a must. “Found 1 results” or “Found 1 result(s)” are bad interface messages. For a developer, if the language in question is English or a language with similar plural forms, it may be a matter of an if condition to choose one of the messages.
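For English, that conditional choice might look like the following. The function name and the message wording are only illustrative.

```javascript
// Illustrative only: choose between two English messages with a plain condition.
function foundMessage(count) {
  return count === 1 ? 'Found 1 result' : 'Found ' + count + ' results';
}

console.log(foundMessage(1)); // Found 1 result
console.log(foundMessage(0)); // Found 0 results
```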
But that approach does not scale if we want to deal with a lot of languages. Some applications come with their own plural handling mechanism, typically a module that tells you the plural form for a given number and language. The plural forms per language, and the rules to determine them, are defined in CLDR (the Unicode Common Locale Data Repository). CLDR defines the plural rules in a markup language named LDML and publishes updated collections regularly.
If you look at the CLDR plural rules table you can easily understand this. The rules are defined in a particular syntax. For example, the Russian plural rules are given below.
One needs to pass the value of the number to the variables in the above expressions and evaluate them. If an expression evaluates to a boolean true, then the corresponding plural form should be used.
So, an expression like n = 0 or n != 1 and n mod 100 = 1..19, mapped to ‘few’, holds true when n is 0, 119, 219 or 319. So we say that these numbers are of the ‘few’ plural form.
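Translated by hand into JavaScript, the expression above becomes the check below. In the CLDR rule syntax `and` binds more tightly than `or`, which the parentheses make explicit. The function name is made up for this illustration, and only integers are considered.

```javascript
// Hand translation of: n = 0 or n != 1 and n mod 100 = 1..19
function matchesRule(n) {
  const rem = n % 100;
  return n === 0 || (n !== 1 && rem >= 1 && rem <= 19);
}

console.log([0, 1, 2, 20, 119, 219].filter(matchesRule)); // [ 0, 2, 119, 219 ]
```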
But in the Russian example given above, we don’t see n; instead we see variables such as v and i. The meaning of these variables is defined in the standard as:
| n | absolute value of the source number (integer and decimals). |
| i | integer digits of n. |
| v | number of visible fraction digits in n, with trailing zeros. |
| w | number of visible fraction digits in n, without trailing zeros. |
| f | visible fractional digits in n, with trailing zeros. |
| t | visible fractional digits in n, without trailing zeros. |
Keeping these definitions in mind, the expression v = 0 and i % 10 = 1 and i % 100 != 11 evaluates to true for 1, 21, 31, 41 etc. and to false for 11. In other words, the numbers 1, 21 and 31 are of the plural form “one” in Russian.
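To make the operands concrete, here is a minimal sketch that derives them from a number and then applies the Russian “one” expression quoted above. The helper names are hypothetical, and the sketch assumes the number is given as a plain decimal string (no exponent notation), since a string keeps trailing zeros such as the one in "1.50" visible.

```javascript
// Hypothetical helper: derive the plural operands from a decimal string.
function pluralOperands(numberText) {
  const text = numberText.replace(/^-/, '');    // n is an absolute value
  const [intPart, fracPart = ''] = text.split('.');
  const trimmed = fracPart.replace(/0+$/, '');   // fraction without trailing zeros
  return {
    n: Number(text),
    i: Number(intPart),
    v: fracPart.length,
    w: trimmed.length,
    f: fracPart === '' ? 0 : Number(fracPart),
    t: trimmed === '' ? 0 : Number(trimmed),
  };
}

// Hand translation of: v = 0 and i % 10 = 1 and i % 100 != 11
function isRussianOne(numberText) {
  const { v, i } = pluralOperands(numberText);
  return v === 0 && i % 10 === 1 && i % 100 !== 11;
}

console.log(pluralOperands('1.50'));
console.log(['1', '11', '21', '1.0'].map(isRussianOne)); // [ true, false, true, false ]
```

Note that "1.0" does not match: it has one visible fraction digit, so v is 1 and the first condition fails.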
CLDRPluralRuleParser is an evaluator for these rules. I wrote this parser when we at the Wikimedia Foundation wanted data-driven plural rule evaluation for the 300+ languages we support. It started as a spare-time project in June 2012. Later it became part of MediaWiki core to support front-end internationalization. We also wanted a PHP version to support interface messages constructed on the server side, so Tim Starling wrote a PHP CLDR plural rule evaluator.
The Node.js module comes with a command-line interface, just for experimenting with rules.
$ cldrpluralruleparser 'n is 1' 0
License: Initially the license of the module was GPL, but following collaboration discussions between Wikimedia, cldrjs, jQuery.globalize and moment.js, it was decided to change the license to MIT.
Browsers provide an option to choose the preferred language in which a website should be shown, often called “Accept-Language”.
These preference values allow websites to deliver a suitable language version to the user.
navigator.language does exist, but it does not give the correct values. In Chrome, it gives the browser's UI language, which differs from what is meant by Accept-Language. From Firefox 5 onwards, this property’s value is based on the Accept-Language header. It returns a single string, whereas Accept-Language is usually a list of language values in order of preference.
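For comparison, this is roughly how a server could turn an Accept-Language header value into an ordered list. It is a simplified sketch, not a complete parser for the header: the function name is hypothetical, a missing q value is treated as 1, and entries with an unparsable q value are dropped.

```javascript
// Hypothetical helper: language tags from an Accept-Language value,
// sorted from most preferred to least preferred.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((part) => {
      const [tag, ...params] = part.trim().split(';');
      const q = params.map((p) => p.trim()).find((p) => p.startsWith('q='));
      return { tag: tag.trim(), q: q ? parseFloat(q.slice(2)) : 1 };
    })
    .filter((entry) => entry.tag !== '' && !Number.isNaN(entry.q))
    .sort((a, b) => b.q - a.q) // stable sort keeps header order for equal q
    .map((entry) => entry.tag);
}

console.log(parseAcceptLanguage('ml,en-US;q=0.8,en;q=0.6')); // [ 'ml', 'en-US', 'en' ]
```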
The good news is, a patch just landed in Firefox to support
It returns an array of language tags representing the user’s preferred languages, with the most preferred language first.
The most preferred language is the one returned by
Now that it has landed in Firefox, Blink developers are also considering an implementation.
This will definitely improve the web experience for users and help internationalization developers a lot.
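A page script could read the preferred languages along these lines. This sketch assumes the array-valued property described above is the standard `navigator.languages`, and falls back to `navigator.language` where it is missing. The function name is hypothetical, and the navigator object is passed in as a parameter so that the function can also be tried outside a browser.

```javascript
// Sketch: read the preferred languages from a navigator-like object.
function preferredLanguages(nav) {
  if (nav && Array.isArray(nav.languages) && nav.languages.length > 0) {
    return Array.from(nav.languages);
  }
  // Fall back to the single-valued property when the array is not available.
  return nav && nav.language ? [nav.language] : [];
}

// In a browser this would be called as preferredLanguages(window.navigator).
console.log(preferredLanguages({ languages: ['ml', 'en-US', 'en'], language: 'ml' }));
console.log(preferredLanguages({ language: 'en-US' })); // [ 'en-US' ]
```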
I will be speaking at the upcoming W3C workshop in Madrid. The workshop is on 7-8 May 2014 and the theme is “New Horizons for the Multilingual Web”.
I will be co-presenting with Pau Giner and David Chan from the Wikimedia Foundation Language Engineering team on best practices for translation at Wikipedia. The talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.