Best Practices on the Design of Translation- Pau Giner, David Chan and Santhosh Thottingal.
Abstract: Wikipedia is one of the most multilingual projects on the web today. In order to provide access to knowledge to everyone, Wikipedia is available in more than 280 languages. However, the coverage of topics and detail varies from language to language. The Language Engineering team from the Wikimedia Foundation is building open source tools to facilitate the translation of content when creating new articles to facilitate the diffusion of quality content across languages. The translation process in Wikipedia presents many different challenges. Translation tools are aimed at making the translation processes more fluent by integrating different tools such as translation services, dictionaries, and information from semantic databases as Wikidata.org. In addition to the technical challenges, ensuring content quality is one of the most important aspects considered during the design of the tool since any translation that does not read natural is not acceptable for a community focused on content quality. This talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.
XeTeX is an extension of TeX with built-in support for Unicode and OpenType. In this tutorial, we are going to learn how to typeset Malayalam using XeTeX. With some learning effort, we can produce high quality typesetting using XeTeX.
XeTeX is packaged for all famous GNU/Linux distros. The installation method depends your distro. For ease of installation and configuration, we suggest to use a TeXLive version 2012 or above – either standalone TeXLive distribution or install from your distribution’s package manager. Windows and OSX versions are also available.
Following packages are required to install to get a working xetex environment in your computer. Note that these packages are relatively large in size and will take time and bandwidth.
You also need reasonably good unicode compatible Malayalam fonts. These fonts also comes with GNU/Linux distros. Search for malayalam fonts in your package manager and install if not already installed. Eg fonts: Meera, Rachana etc.
Creating documents using XeTeX
A simple document to learn usage of xetex is given below.
Using a text editor like gedit or kate, create a new file with .tex as file extension. Eg: example.tex. Copy the following content as the content for that file and save.
Now you need to compile this document to generate PDF.
English and many other languages have only 2 plural forms. Singular if the count is one and anything else is plural including zero.
But for some other languages, the plural forms are more than 2. Arabic, for example has 6 plural forms, sometimes referred as ‘zero’, ‘one’, ‘two’, ‘few’, ‘many’, ‘other’ forms. Integers 11-26, 111, 1011 are of ‘many’ form, while 3,4,..10 are ‘few’ form.
While preparing the interface messages for application user interfaces, grammatically correct sentences are must. “Found 1 results” or “Found 1 result(s)” are bad interface messages. For a developer, if the language in the context is English or languages having similar plural forms, it may be a matter of an if condition to conditionally choose one of the messages.
But that approach is not scalable if we want to deal with lot of languages. Some applications come with their own plural handling mechanism, probably by a module that tells you the plural form, given a number, and language. The plural forms per language and the rules to determine it is defined in CLDR. CLDR defines the plural rules in a markup language named LDML and releases the collections frequently.
If you look at the CLDR plural rules table you can easily understand this. The rules are defined in a particular syntax. For example, the Russian plural rules are given below.
One need to pass the value of the number to the variable in the above expressions and evaluate. If the expression evaluates to a boolean true, then the corresponding plural form should be used.
So, an expression like n = 0 or n != 1 and n mod 100 = 1..19 mapped to ‘many’ holds true if the value of n=0,119, 219, 319. So we say that they are of ‘few’ plural form.
But in the Russian example given above, we don’t see n, but we see variables v, i etc. The meaning of these variables are defined in the standard as:
absolute value of the source number (integer and decimals).
integer digits of n.
number of visible fraction digits in n, with trailing zeros.
number of visible fraction digits in n, without trailing zeros.
visible fractional digits in n, with trailing zeros.
visible fractional digits in n, without trailing zeros.
Keeping these definitions in mind, the expression v = 0 and i % 10 = 1 and i % 100 != 11 evaluates true for 1,21,31, 41 etc and false for 11. In other words, number 1,21,31 are of plural form “one” in Russian.
CLDRPluralRuleParser is that evaluator. I wrote this parser when we at Wikimedia foundation wanted a data driven plural rule evaluation for the 300+ languages we support. It started as a free time project in June 2012. Later it became part of MediaWiki core to support front-end internationalization. We wanted a PHP version also to support interface messages constructed at server side. Tim Starling wrote a PHP CLDR plural rule evaluator.
The node module comes with command line interface, just to experiment with rules.
$ cldrpluralruleparser 'n is 1' 0
An example showing how to use the CLDR supplied plural rule data and this library is included in the repository. You can play with that application here.
Browsers provide an option to choose the preferred language a website to be shown, often named as “Accept language“.
These preference values allows websites to deliver a suitable language version to the user.
navigator.language does exist, but that does not give the correct values. For chrome, it gives browsers UI language and it differs from what is meant by accept-languages. Firefox 5 onwards this property’s value is based on the value of the Accept-Language header value. It returns a string value, but accept-language is usually a list of language values in the order of preference.
I will be co-presenting with Pau Giner, David Chan from Wikimedia Foundation Language engineering team on best practices of translation at wikipedia. It will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.
For an advanced logging system for nodejs applications, winston is very helpful. Winston is a multi-transport async logging library for node.js. Similar to famous logging systems like log4j, we can configure the log levels and winston allows to define multiple logging targets like file, console, database etc.
I wanted to configure logging as per usual nodejs production vs development environment. Of course with development mode, I am more interested in debug level logging and at production environment I am more interested in higher level logs.
I am sharing my singleton logger instance setup code.