Parsing CLDR plural rules in JavaScript

English and many other languages have only two plural forms: singular when the count is one, and plural for anything else, including zero.

Some other languages have more than two plural forms. Arabic, for example, has six, usually referred to as the ‘zero’, ‘one’, ‘two’, ‘few’, ‘many’ and ‘other’ forms. Integers such as 11-26, 111 and 1011 take the ‘many’ form, while 3, 4, ... 10 take the ‘few’ form.

Interface messages in an application must be grammatically correct sentences. “Found 1 results” or “Found 1 result(s)” are bad interface messages. If the language in question is English, or another language with similar plural forms, a developer can get away with an if condition that chooses one of two messages.
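For English, that if condition can be sketched as below. The helper name resultMessage is made up for this illustration and is not part of any library.

```javascript
// Hypothetical helper: choose between two English messages by count.
function resultMessage(count) {
  // English uses the singular only when the count is exactly 1;
  // zero and everything else take the plural.
  if (count === 1) {
    return 'Found 1 result';
  }
  return 'Found ' + count + ' results';
}

console.log(resultMessage(0)); // Found 0 results
console.log(resultMessage(1)); // Found 1 result
```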

But that approach does not scale if we want to deal with a lot of languages. Some applications come with their own plural handling mechanism, typically a module that tells you the plural form for a given number and language. The plural forms of each language, and the rules to determine them, are defined in CLDR. CLDR defines the plural rules in a markup language named LDML and publishes updated collections frequently.

If you look at the CLDR plural rules table, this becomes easy to understand. The rules are written in a particular syntax. For example, the Russian plural rules are given below.

One needs to assign the value of the number to the variable in the above expressions and evaluate them. If an expression evaluates to true, the corresponding plural form should be used.

So, an expression like n = 0 or n != 1 and n mod 100 = 1..19 mapped to ‘few’ holds true when n is 0, 119, 219 or 319. So we say that those numbers are of the ‘few’ plural form.

But in the Russian example given above, we don’t see n; we see variables such as v and i instead. The meaning of these variables is defined in the standard as:

Symbol  Value
n       absolute value of the source number (integer and decimals).
i       integer digits of n.
v       number of visible fraction digits in n, with trailing zeros.
w       number of visible fraction digits in n, without trailing zeros.
f       visible fractional digits in n, with trailing zeros.
t       visible fractional digits in n, without trailing zeros.

Keeping these definitions in mind, the expression v = 0 and i % 10 = 1 and i % 100 != 11 evaluates to true for 1, 21, 31, 41 etc. and to false for 11. In other words, the numbers 1, 21 and 31 are of the plural form “one” in Russian.
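To make the operand definitions concrete, here is a small sketch that derives n, i, v, w, f and t from a number and then checks the Russian “one” expression by hand. The function names operands and isRussianOne are made up for this illustration. The number is taken as a string so that trailing zeros such as in "1.50" stay visible.

```javascript
// Derive the plural operands from a number given as a string,
// so that trailing zeros in the fraction (e.g. "1.50") are preserved.
function operands(source) {
  const text = String(source).replace(/^-/, ''); // absolute value
  const [intPart, fracPart = ''] = text.split('.');
  const trimmed = fracPart.replace(/0+$/, ''); // fraction without trailing zeros
  return {
    n: Number(text),
    i: Number(intPart),
    v: fracPart.length,
    w: trimmed.length,
    f: Number(fracPart || 0),
    t: Number(trimmed || 0)
  };
}

// Hand-written check for: v = 0 and i % 10 = 1 and i % 100 != 11
function isRussianOne(source) {
  const { v, i } = operands(source);
  return v === 0 && i % 10 === 1 && i % 100 !== 11;
}

console.log(operands('1.50')); // { n: 1.5, i: 1, v: 2, w: 1, f: 50, t: 5 }
console.log(['1', '21', '31', '11', '1.0'].map(isRussianOne));
// [ true, true, true, false, false ]
```

Note that "1.0" fails the rule because it has one visible fraction digit, so v is 1.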

A module that supports the plural forms of any language can convert these expressions to a programming language manually (or semi-automatically), once, and then use the result. Twitter-cldr, a CLDR abstraction library by Twitter, follows this method. It converted the plural rules given above to the following JavaScript expression using a compiler.

This works. But CLDR updates the plural rules in every release. Most of the time a release adds support for more languages. Sometimes existing rules are changed or improved too. In such cases the maintainer of the module needs to recompile the rules to JavaScript expressions.

If we can write a compiler to generate JavaScript from these expressions, can’t we write a parser-evaluator for the expressions instead, so that we just need to pass the rule and the number to that evaluator and it returns the plural form?
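The idea of evaluating a rule string at run time can be sketched in a few lines. This is a toy evaluator written for this illustration, not the code of any real library. It assumes a small subset of the rule syntax: relations joined by "and" and "or", the operators =, !=, is and is not, an optional % or mod, and comma separated values or ranges. The function name evaluateRule is made up.

```javascript
// Toy evaluator for a subset of the plural rule syntax (illustration only).
function evaluateRule(rule, source) {
  // Compute the operands n, i, v, w, f, t from the number.
  const text = String(source).replace(/^-/, '');
  const [intPart, fracPart = ''] = text.split('.');
  const trimmed = fracPart.replace(/0+$/, '');
  const operands = {
    n: Number(text), i: Number(intPart),
    v: fracPart.length, w: trimmed.length,
    f: Number(fracPart || 0), t: Number(trimmed || 0)
  };

  // Evaluate one relation, e.g. "i % 100 != 11" or "n mod 100 = 1..19".
  function relation(relationText) {
    const pattern = /^([nivwft])(?:\s*(?:%|mod)\s*(\d+))?\s*(!=|=|is not|is)\s*(.+)$/;
    const match = pattern.exec(relationText.trim());
    if (!match) {
      throw new Error('Unsupported relation: ' + relationText);
    }
    const [, symbol, modulus, operator, rangeList] = match;
    const value = modulus ? operands[symbol] % Number(modulus) : operands[symbol];
    const inList = rangeList.split(',').some(function (item) {
      const [low, high = low] = item.trim().split('..').map(Number);
      // A range such as 1..19 only matches integer values.
      return value >= low && value <= high &&
        (low === high || Number.isInteger(value));
    });
    return operator === '=' || operator === 'is' ? inList : !inList;
  }

  // Drop the sample section that may follow the condition after "@".
  const condition = rule.split('@')[0].trim();
  if (condition === '') {
    return true; // an empty condition always matches
  }
  // "or" binds loosest, "and" next; the syntax has no parentheses.
  return condition.split(/\s+or\s+/).some(function (andCondition) {
    return andCondition.split(/\s+and\s+/).every(relation);
  });
}

console.log(evaluateRule('n is 1', 0)); // false
console.log(evaluateRule('v = 0 and i % 10 = 1 and i % 100 != 11', 21)); // true
console.log(evaluateRule('n = 0 or n != 1 and n mod 100 = 1..19', 119)); // true
```

Splitting on "or" first and "and" second is enough to respect operator precedence here, because the rule syntax has no grouping. A real parser needs to cover the full grammar and report errors properly.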

CLDRPluralRuleParser

[Image: CLDR Plural Rule Evaluator]

CLDRPluralRuleParser is that evaluator. I wrote this parser when we at the Wikimedia Foundation wanted data-driven plural rule evaluation for the 300+ languages we support. It started as a free-time project in June 2012. Later it became part of MediaWiki core to support front-end internationalization. We also wanted a PHP version to support interface messages constructed on the server side, and Tim Starling wrote a PHP CLDR plural rule evaluator.

It is a JavaScript library that takes a standardized plural rule and an integer, and returns true or false depending on whether the rule passes for the given integer. It is written in the UMD/CommonJS pattern and is available as a node module too.

The node module comes with a command line interface, just to experiment with rules.

$ cldrpluralruleparser 'n is 1' 0

false

The module does not bundle the plural rules collection or data. Developers need to keep that collection as XML or JSON inside the application and pass the rules to the module. In that sense, one cannot offload the whole i18n message processing task to this module. For more convenient internationalization in JavaScript that takes care of plural, gender, grammar etc., you may consider jquery.i18n, which contains CLDRPluralRuleParser.
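The division of labour described above, where the application holds the rule data and an evaluator only answers true or false for one rule, could look roughly like this. Everything here is an assumption made for illustration: the data is a simplified stand-in shaped like CLDR's JSON plural data, pluralForm is a made-up helper, and standInEvaluate is a deliberately tiny evaluator that only understands "n = <integer>". In a real application the evaluator argument would be a full rule evaluator such as CLDRPluralRuleParser.

```javascript
// Simplified stand-in for the plural data of one language,
// keyed the way CLDR's JSON data is ("pluralRule-count-<form>").
const englishRules = {
  'pluralRule-count-one': 'n = 1',
  'pluralRule-count-other': ''
};

// Stand-in evaluator that only understands "n = <integer>".
function standInEvaluate(rule, number) {
  const match = /^n = (\d+)$/.exec(rule);
  return match !== null && Number(match[1]) === number;
}

// Return the first plural form whose rule passes for the number,
// or 'other' as the fallback when no rule matches.
function pluralForm(rules, number, evaluate) {
  for (const key of Object.keys(rules)) {
    const form = key.replace('pluralRule-count-', '');
    if (form !== 'other' && evaluate(rules[key], number)) {
      return form;
    }
  }
  return 'other';
}

console.log(pluralForm(englishRules, 1, standInEvaluate)); // one
console.log(pluralForm(englishRules, 5, standInEvaluate)); // other
```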

An example showing how to use the CLDR supplied plural rule data and this library is included in the repository. You can play with that application here.

License: Initially the module was licensed under the GPL, but following collaboration discussions between Wikimedia, cldrjs, jQuery.globalize and moment.js, it was decided to change the license to MIT.

Configurable node logger with winston

For an advanced logging system in nodejs applications, winston is very helpful. Winston is a multi-transport async logging library for node.js. As in well-known logging systems like log4j, we can configure the log levels, and winston allows us to define multiple logging targets such as file, console, database etc.

I wanted to configure logging along the usual nodejs production vs development split. In development mode I am more interested in debug level logging, and in the production environment I am more interested in higher level logs.

I am sharing my singleton logger instance setup code.
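A setup along these lines might look like the following sketch. It is not the original setup code. It assumes the winston 3.x API (createLogger, format, transports); the function name createLogger, the 'warn' and 'debug' level choices and the file name app.log are placeholders picked for illustration. The winston module is passed in as an argument so that the environment-dependent logic is easy to see and to try out.

```javascript
// Sketch: build a logger whose level and targets depend on the environment.
// `winston` is the winston module (3.x API assumed), `env` is e.g. NODE_ENV.
function createLogger(winston, env) {
  const production = env === 'production';

  // Console logging in every environment.
  const transports = [new winston.transports.Console()];
  if (production) {
    // Also log to a file in production; the file name is a placeholder.
    transports.push(new winston.transports.File({ filename: 'app.log' }));
  }

  return winston.createLogger({
    // Debug output while developing, only warnings and errors in production.
    level: production ? 'warn' : 'debug',
    format: production
      ? winston.format.combine(winston.format.timestamp(), winston.format.json())
      : winston.format.combine(winston.format.colorize(), winston.format.simple()),
    transports: transports
  });
}

// In a module of its own, exporting one instance gives a singleton,
// because node caches modules after the first require:
//   module.exports = createLogger(require('winston'), process.env.NODE_ENV);
```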

 

Brackets, my favorite JavaScript IDE

I use Brackets for web development. I have tried several other IDEs, but Brackets is my current favorite. A few things I like are listed below.

Some extensions I use with Brackets are:

  1. Markdown Preview, for easy editing of markdown
  2. Brackets Git, for git integration
  3. Themes for Brackets, for the Monokai Dark Soda theme I use
  4. Brackets Linux UI
  5. Interactive Linter, for realtime JSHint/JSLint/CoffeeLint reports in Brackets as you work on your code
  6. WD Minimap, for a SublimeText-like code overview
  7. Beautify, for automatic code formatting as you save, using jsbeautify

The Beautify extension helps me a lot because most of the MediaWiki related code I write needs to follow the MediaWiki JavaScript coding convention. I never get it right if I format manually. The convention is a bit different from the usual js code formatting; in general you need to use a lot of whitespace. The extension used a default jsbeautify formatting configuration, and I wanted it to be customizable per project so that I could write my own .jsbeautifyrc file and get my code formatted as per the conventions.

There was an enhancement request for this. I wrote a patch for handling a project-specific .jsbeautifyrc and Martin Zagora merged it into the repo. Here is my .jsbeautifyrc for MediaWiki: https://gist.github.com/santhoshtr/9867861

Brackets is in active development and I look forward to more features. The most important bug I would like to see fixed, one that all code editors I tried suffer from, including Brackets, is the lack of pain-free complex script editing and rendering. Brackets uses CodeMirror for the code editor and I have reported this issue. It is not trivial to fix, and the root cause is related to the core design. Along with js, css, html, php etc., I have to work with files containing all kinds of natural language text, so this feature is important to me.

Hyphenation of Indian Languages in Webpages

In my last blog post I explained hyphenation of Indian language text in OpenOffice. In this blog post I will explain how hyphenation can be done in web pages.

As I explained, the importance of hyphenation comes into the picture when we justify the text. The length of the lines is controlled by the parent tags…. Unicode has defined a special character called the soft hyphen (U+00AD) for hyphenation. In HTML, the plain hyphen is represented by the “-” character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;).

User agents (browsers) can break the line wherever a soft hyphen is found. So if we have a JavaScript based implementation which inserts soft hyphens inside words based on language-specific rules, we can achieve hyphenation in web pages too.
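The insertion step itself is small. The sketch below puts a soft hyphen (U+00AD) at given offsets of a word. The function name insertSoftHyphens and the hard-coded break offsets are assumptions for illustration; a real hyphenator computes the offsets from the hyphenation patterns of the language.

```javascript
const SOFT_HYPHEN = '\u00AD';

// Insert a soft hyphen before each character offset listed in `breaks`.
// `breaks` holds the offsets where a line break is allowed; here they are
// supplied by hand, a real hyphenator derives them from language patterns.
function insertSoftHyphens(word, breaks) {
  let result = '';
  for (let index = 0; index < word.length; index++) {
    if (breaks.includes(index)) {
      result += SOFT_HYPHEN;
    }
    result += word[index];
  }
  return result;
}

const hyphenated = insertSoftHyphens('hyphenation', [2, 6]);
// The soft hyphens are invisible, so split on them to show the break points.
console.log(hyphenated.split(SOFT_HYPHEN)); // [ 'hy', 'phen', 'ation' ]
```

The browser shows a hyphen only at a soft hyphen where it actually breaks the line; elsewhere the character stays invisible.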

Hyphenator is a project which does exactly that. “Hyphenator.js brings client-side hyphenation of HTML-Documents on to every browser by inserting soft hyphens using hyphenation patterns and Frank M. Liangs hyphenation algorithm commonly known from LaTeX and Openoffice.”

Hyphenator had not been tested for any non-Latin languages so far. I tried to add support for Indian languages and the result was satisfactory. I used the same rules I defined for OpenOffice. Unlike Latin languages, Indian languages need very few hyphenation patterns, and the performance is good because of that.

I have added Malayalam, Tamil, Hindi, Oriya, Kannada, Telugu, Bengali, Gujarati and Panjabi support to it. You can see a working example here. (I wanted to embed an example here, but LiveJournal does not allow JavaScript inside the blog body.) The column layout is done with CSS. Try resizing the browser window and try a print preview too.

Don’t forget to read the source code of that page. It is very simple. If you want hyphenation in your web page, all you need to do is include the JavaScript as done in the example. We need to provide lang attributes for the nodes so that the required patterns for that language can be loaded. I have placed the new language patterns temporarily in the download area of SMC. I will ask the author of Hyphenator to include them upstream. Code is available here


Update (18-Dec-2008): Thanks to Mathias Nater, the author of Hyphenator, the patterns were added upstream.