A short story of one lakh Wikipedia articles

At Wikimedia Foundation, I am working on a project to help people translate articles from one language to another. The project started in 2014 and went to production in 2015.

Over the last year, a total of 100,000 new articles were created across many languages. A new article gets translated every five minutes, and more than 2,000 articles are translated per week.

The 100,000th Wikipedia article created with Content Translation is in Spanish, about the song ‘Crying, Waiting, Hoping’.

I designed the technical architecture and continue to be the main developer. I am so proud to be part of a project that has contributed this much to free knowledge.

Related blog posts in Wikimedia blog:

Content translation tool hits milestone with one hundred thousand articles

Wikipedia’s coverage of essential vaccines is expanding

Content Translation tool has now been used for 50,000 articles

Semi-automated content translation is coming to Scandinavian Wikipedias

Naradanews Malayalam published a note on this

ഓരോ അഞ്ച് മിനിറ്റിലും പുതിയ ലേഖനം; വിക്കിപീഡിയ ബഹുഭാഷാ സാന്നിധ്യം വർദ്ധിപ്പിക്കുന്നു (“A new article every five minutes; Wikipedia expands its multilingual presence”)

Translating HTML content using a plain text supporting machine translation engine

At Wikimedia, I am currently working on the ContentTranslation tool, a machine-aided translation system to help translate articles from one language to another. The tool is now deployed on several Wikipedias and people are creating new articles successfully.

The ContentTranslation tool provides machine translation as one of its translation tools, so that editors can use it as an initial version to improve upon. We use Apertium as the machine translation backend and plan to support more machine translation services soon.

A big difference when editing with ContentTranslation is that it does not involve wiki markup. Instead, editors edit rich text; under the hood these are contenteditable HTML elements. This also means that what you translate are HTML sections of articles.

The HTML contains all the markup that a typical Wikipedia article has. This means the machine translation operates on HTML content. But not all MT engines support HTML content.

Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words.

$ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
this is |0-1| a |2-2| small |3-3| house |4-4|

The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes (if words are reordered), or syntactic errors (if mappings are not one-to-one).

$ echo 'legal <b>persons</b>' | apertium en-es -f html
Personas <b>legales</b>
$ echo 'I <b>am</b> David' | apertium en-es -f html
Soy</b> David 

Other MT engines exhibit similar problems. This makes it challenging to provide machine translations of formatted text. This blog post explains how this challenge is tackled in ContentTranslation.

As we saw in the examples above, a machine translation engine can cause the following errors in the translated HTML. The errors are listed in descending order of severity.

  1. Corrupt markup – if the machine translation engine is unaware of the HTML structure, it can move the HTML tags around arbitrarily, causing corrupted markup in the MT result.
  2. Wrongly placed annotations – the two examples given above illustrate this. It is more severe if the content includes links and the link targets are swapped or placed arbitrarily in the MT output.
  3. Missing annotations – sometimes the MT engine may eat up some tags during the translation process.
  4. Split annotations – during translation, a single word can be translated into more than one word. If the source word has markup, say an <a> tag, will the MT engine apply the <a> tag wrapping both words, or apply it to each word separately?

All of the above issues can cause a bad experience for translators.

Apart from potential issues with markup transfer, there is another aspect to sending HTML content to MT engines. Compared to the plain text version of a paragraph, the HTML version is bigger in size (bytes). Most of this extra content is tags and attributes, which should be unaffected by the translation. This is unnecessary bandwidth usage. If the MT engine is metered (non-free, with API access measured and limited), we are not being economical.

An outline of the algorithm we use to transfer markup from the source content to the translated content is given below.

  1. The input HTML content is translated into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
  2. Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
  3. The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
  4. The alignment information is used to reapply markup to the translated text.
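
To make the linearization idea concrete, here is a minimal sketch of what such a representation could look like. This is an illustrative, hypothetical data structure only, not the actual LinearDoc API used in cxserver.

// Hypothetical, simplified linearization of '<p>A <b>Japanese</b> <i>BBC</i> article</p>'.
// Inline markup is kept as ranges over the plain text instead of tags inside the string.
var linearized = {
    plainText: 'A Japanese BBC article',
    annotations: [
        { tag: 'b', start: 2, length: 8 },  // 'Japanese'
        { tag: 'i', start: 11, length: 3 }  // 'BBC'
    ]
};

// Only plainText goes to the MT engine; the annotation ranges are remapped
// onto the translated text afterwards using the alignment information.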

This makes sure that MT engines translate only plain text, and markup is applied as a post-MT processing step.

Essentially, the algorithm does a fuzzy match to find the locations in the translated text where annotations should be applied. Here also, the content given to the MT engines is plain text only.

The steps are given below.

  1. For the text to translate, find the text of inline annotations like bold, italics, links, etc. We call these subsequences.
  2. Pass the full text and the subsequences to the plain text machine translation engine. Use a delimiter so that we can map the source items (full text and subsequences) to the translated items by array position.
  3. The translated full text will contain the subsequence translations somewhere in it. To locate a subsequence translation in the full text translation, use an approximate search algorithm.
  4. The approximate search algorithm returns the start position and length of the match. To that range we map the annotation from the source HTML.
  5. The approximate match involves calculating the edit distance between words in the translated full text and the translated subsequence. It is not strings that are searched, but n-grams with n = number of words in the subsequence. Each word in the n-gram is matched independently.
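
A rough sketch of this post-processing step is given below. The functions translatePlain() and approximateSearch() are hypothetical placeholders for the plain text MT call and the fuzzy matcher (a sketch of the matcher itself follows later); this is not the actual cxserver code.

// Sketch of the markup transfer driver described in the steps above.
function transferAnnotations(plainText, annotations, translatePlain, approximateSearch) {
    // 1. Texts of the inline annotations (the "subsequences").
    var subsequences = annotations.map(function (ann) {
        return plainText.substr(ann.start, ann.length);
    });

    // 2. Translate the full text and every subsequence in one batch, joined with a
    //    delimiter so the results can be mapped back to the source items by index.
    var delimiter = '\n';
    var translated = translatePlain([plainText].concat(subsequences).join(delimiter))
        .split(delimiter);
    var translatedText = translated[0];

    // 3-4. Locate each translated subsequence in the translated full text and
    //      map the original annotation onto that range.
    var result = [];
    annotations.forEach(function (ann, i) {
        var match = approximateSearch(translatedText, translated[i + 1]);
        if (match) {
            result.push({ tag: ann.tag, start: match.start, length: match.length });
        }
        // If no match is found, the annotation is dropped (a "missing annotation").
    });
    return { plainText: translatedText, annotations: result };
}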

To understand this, let us try the algorithm in some example sentences.

  1. Translating the Spanish sentence <p>Es <s>además</s> de Valencia.</p> to Catalan: the plain text version is Es además de Valencia. and the subsequence with annotation is además. We give both the full text and the subsequence to MT. The full text translation is És a més de València. and the word además is translated as a més. We do a search for a més in the full text translation. The search will be successful and the <s> tag will be applied, resulting in <p>És <s>a més</s> de València.</p>. The search performed in this example is a plain text exact search, but the following example illustrates why it cannot always be an exact search.
  2. Translating the English sentence <p>A <b>Japanese</b> <i>BBC</i> article</p> to Spanish: the full text translation is Un artículo de BBC japonés. One of the subsequences, Japanese, gets translated as Japonés. The case of the initial letter differs, and the search should be smart enough to identify japonés as a match for Japonés. The word order change between source text and translation is already handled by the algorithm. The following example illustrates that it is not just case changes that happen.
  3. Translating <p>A <b>modern</b> Britain.</p> to Spanish: the plain text version gets translated as Una Gran Bretaña moderna. and the annotated word modern gets translated as Moderno. We need to match moderna and Moderno. We get <p>Una Gran Bretaña <b>moderna</b>.</p>. This is a case of word inflection: a single letter at the end of the word changes.
  4. Now let us see an example where the subsequence is more than one word, and the case of nested subsequences. Translating the English sentence <p>The <b>big <i>red</i></b> dog</p> to Spanish: here, the subsequence big red is in bold, and inside it, red is in italics. In this case we need to translate the full text, the subsequence big red and the subsequence red. So we have El perro rojo grande as the full translation, and Rojo grande and Rojo as the translations of the subsequences. Rojo grande needs to be located first and the bold tag applied; then we search for Rojo and apply the italics. We get <p>El perro <b><i>rojo</i> grande</b></p>.
  5. How does it work with heavily inflected languages like Malayalam? Suppose we translate <p>I am from <a href="x">Kerala</a></p> to Malayalam. The plain text translation is ഞാന്‍ കേരളത്തില്‍ നിന്നാണു്. and the subsequence Kerala gets translated to കേരളം. So we need to match കേരളം and കേരളത്തില്‍. They differ by an edit distance of 7 and the changes are at the end of the word. This shows that we will require language-specific tailoring to produce reasonable output.

The algorithm to do an approximate string match can be a simple Levenshtein distance, but what would be an acceptable edit distance? That must be configurable per language module. And the following example illustrates that edit-distance-based matching alone won't work.

Translating <p>Los Budistas no <b>comer</b> carne</p> to English: the plain text translation is The Buddhists not eating meat, and comer translates as eat. With an edit distance approach, eat matches meat more closely than eating. To address such cases, we mix in a second criterion: the words should start with the same letter. This also illustrates that the algorithm needs language-specific modules.

There are still cases that cannot be solved by the algorithm mentioned above. Consider the following example.

Translating <p>Bees <b>cannot</b> swim</p>: the plain text translation to Spanish is Las Abejas no pueden nadar, and the phrase cannot translates as Puede no. Here we need to match Puede no and no pueden, which of course won't match with the approach we have explained so far.

To address this case, we do not treat the subsequence as a single string, but as an n-gram where n = number of words in the subsequence. The fuzzy matching is done per word in the n-gram, not for the entire string; i.e. Puede is fuzzy matched against no and pueden, and no is fuzzy matched against no and pueden, left to right, till a match is found. This takes care of word order changes as well as inflections.
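
Here is a minimal sketch of such a per-word n-gram matcher, using a standard Levenshtein edit distance plus the same-first-letter heuristic mentioned earlier. The edit distance threshold and the heuristic are illustrative and would be tuned per language; this is not the actual cxserver implementation.

// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
    var d = [];
    for (var i = 0; i <= a.length; i++) { d[i] = [i]; }
    for (var j = 0; j <= b.length; j++) { d[0][j] = j; }
    for (i = 1; i <= a.length; i++) {
        for (j = 1; j <= b.length; j++) {
            d[i][j] = Math.min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
            );
        }
    }
    return d[a.length][b.length];
}

// Do two words approximately match? Case is ignored, the edit distance must be
// small, and the words must start with the same letter (so 'eat' matches
// 'eating' but not 'meat'). Both criteria would be configurable per language.
function wordsMatch(a, b, maxDistance) {
    a = a.toLowerCase();
    b = b.toLowerCase();
    return a[0] === b[0] && levenshtein(a, b) <= maxDistance;
}

// Search the translated subsequence, treated as an n-gram of words, inside the
// translated full text. Each word of the target n-gram is matched independently
// against the words of the candidate window, left to right, so word order
// changes ('rojo grande' vs 'grande rojo') and inflections are tolerated.
function approximateSearch(fullTranslation, subseqTranslation) {
    var words = fullTranslation.split(/\s+/);
    var target = subseqTranslation.split(/\s+/);
    var n = target.length;
    for (var start = 0; start + n <= words.length; start++) {
        var candidate = words.slice(start, start + n);
        var allMatch = target.every(function (targetWord) {
            return candidate.some(function (w) {
                return wordsMatch(w, targetWord, 3);
            });
        });
        if (allMatch) {
            return { start: start, length: n };  // positions counted in words here
        }
    }
    return null;  // no match found: the annotation will be dropped
}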

Revisiting the four types of errors that can happen in annotation transfer, with the algorithm explained so far we see that, in the worst case, we will miss annotations. There is no case of corrupted markup.

As ContentTranslation adds support for more languages, language-specific customization of the above approach will be required.

You can see the algorithm in action by watching the video linked above. And here is a screenshot:

Translation of a paragraph from the Palak Paneer article of the Spanish Wikipedia to Catalan. Note the links, bold text, etc. applied at the correct positions in the translation on the right side.

If anybody is interested in the code, see https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/mt. It is a javascript module in the nodejs server which powers ContentTranslation.

Credits: David Chan, my colleague at Wikimedia, for extensive help in providing lots of example sentences of varying complexity to fine-tune the algorithm. The LinearDoc model that makes the whole algorithm work was written by him. David also wrote an algorithm to handle HTML translation using an upper-casing technique; you can read about it here. The approximation-based algorithm explained above replaced it.

Parsing CLDR plural rules in javascript

English and many other languages have only two plural forms: singular if the count is one, and plural for anything else, including zero.

But for some other languages, there are more than two plural forms. Arabic, for example, has six plural forms, sometimes referred to as the ‘zero’, ‘one’, ‘two’, ‘few’, ‘many’ and ‘other’ forms. Integers 11-26, 111 and 1011 are of the ‘many’ form, while 3, 4, ..., 10 are of the ‘few’ form.

While preparing interface messages for application user interfaces, grammatically correct sentences are a must. “Found 1 results” or “Found 1 result(s)” are bad interface messages. For a developer, if the language in context is English or a language with similar plural forms, it may be just a matter of an if condition to choose one of the messages.
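
For English, that if condition can be as simple as the snippet below (illustrative only, not from any particular library):

// Works for English, but does not scale to languages with more plural forms.
function resultsMessage(count) {
    return 'Found ' + count + ' ' + (count === 1 ? 'result' : 'results');
}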

But that approach does not scale if we want to deal with a lot of languages. Some applications come with their own plural handling mechanism, typically a module that tells you the plural form, given a number and a language. The plural forms per language and the rules to determine them are defined in CLDR. CLDR defines the plural rules in a markup language named LDML and releases the collections frequently.

If you look at the CLDR plural rules table you can easily understand this. The rules are defined in a particular syntax. For example, the Russian plural rules are given below.
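
The cardinal plural rules for Russian, lightly condensed from the CLDR plural rules table (sample number lists omitted; the exact wording may differ between CLDR releases):

one:   v = 0 and i % 10 = 1 and i % 100 != 11
few:   v = 0 and i % 10 = 2..4 and i % 100 != 12..14
many:  v = 0 and i % 10 = 0 or v = 0 and i % 10 = 5..9 or v = 0 and i % 100 = 11..14
other: (everything else)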

One needs to pass the value of the number to the variables in the above expressions and evaluate them. If an expression evaluates to true, then the corresponding plural form should be used.

So, an expression like n = 0 or n != 1 and n mod 100 = 1..19, mapped to the ‘few’ form in some language, holds true for n = 0, 119, 219, 319. So we say that those numbers are of the ‘few’ plural form.

But in the Russian example given above, we don’t see n; we see variables v, i, etc. The meanings of these variables are defined in the standard as follows:

  • n: absolute value of the source number (integer and decimals).
  • i: integer digits of n.
  • v: number of visible fraction digits in n, with trailing zeros.
  • w: number of visible fraction digits in n, without trailing zeros.
  • f: visible fractional digits in n, with trailing zeros.
  • t: visible fractional digits in n, without trailing zeros.

Keeping these definitions in mind, the expression v = 0 and i % 10 = 1 and i % 100 != 11 evaluates to true for 1, 21, 31, 41, etc. and to false for 11. In other words, the numbers 1, 21, 31 are of the plural form “one” in Russian.

A module to support the plural forms for any language can manually (or semi-automatically) convert these expressions to a programming language once and use them. Twitter-cldr, a CLDR abstraction library by Twitter, follows this method. It converted the above plural rules to a javascript expression using a compiler, roughly like the one sketched below.
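
The exact output depends on the Twitter-cldr version; the following is a hand-written approximation for integer inputs, shown only to illustrate what such compiled code looks like (it is not the actual Twitter-cldr output).

// Illustrative approximation of a compiled plural-form function for Russian,
// valid for integer inputs only (v is then 0 and i is the absolute value of n).
function russianPluralForm(n) {
    var i = Math.floor(Math.abs(n));
    var v = 0;
    if (v === 0 && i % 10 === 1 && i % 100 !== 11) { return 'one'; }
    if (v === 0 && i % 10 >= 2 && i % 10 <= 4 && !(i % 100 >= 12 && i % 100 <= 14)) { return 'few'; }
    if (v === 0 && (i % 10 === 0 || (i % 10 >= 5 && i % 10 <= 9) || (i % 100 >= 11 && i % 100 <= 14))) { return 'many'; }
    return 'other';
}

// russianPluralForm(1) === 'one', russianPluralForm(2) === 'few',
// russianPluralForm(11) === 'many', russianPluralForm(21) === 'one'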

This works. But CLDR updates the plural rules with every release. Most of the time a release contains additional language support; sometimes the rules are changed or improved too. The maintainer of the module needs to recompile them to javascript expressions in such cases.

If we can write a compiler to generate javascript from these expressions, can’t we write a parser-evaluator for the expressions, so that we just need to pass the rule and the number to that evaluator and it returns the plural form?

CLDRPluralRuleParser

CLDR Plural Rule Evaluator

CLDRPluralRuleParser is that evaluator. I wrote this parser when we at the Wikimedia Foundation wanted data-driven plural rule evaluation for the 300+ languages we support. It started as a free-time project in June 2012. Later it became part of MediaWiki core to support front-end internationalization. We also wanted a PHP version to support interface messages constructed on the server side; Tim Starling wrote a PHP CLDR plural rule evaluator.

It is a javascript library that takes a standardized plural rule and an integer and returns true or false, depending on whether the rule passes for the given integer. It is written with the UMD/CommonJS pattern and is available as a node module too.

The node module comes with a command line interface, just to experiment with rules.

$ cldrpluralruleparser 'n is 1' 0
false
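
Using it from node is just as simple. A minimal usage sketch, assuming the package exposes a single function taking the rule and the number, as described below (check the repository README for the exact require name):

// Hypothetical usage sketch; the require name is assumed to match the CLI name.
var pluralRuleParser = require('cldrpluralruleparser');

console.log(pluralRuleParser('n is 1', 1));  // true: the rule passes for 1
console.log(pluralRuleParser('n is 1', 0));  // false: same as the CLI example above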

The module does not itself contain the plural rules collection or data. Developers need to have that collection as XML or JSON inside the application and pass the relevant rule to the module. In that sense, one cannot offload the whole i18n message processing task to this module. For more convenient internationalization in javascript that takes care of plurals, gender, grammar, etc., you may consider jquery.i18n, which contains CLDRPluralRuleParser.

An example showing how to use the CLDR supplied plural rule data and this library is included in the repository. You can play with that application here.

License: Initially the license of the module was GPL, but following collaboration discussions between Wikimedia, cldrjs, jQuery.globalize and moment.js, it was decided to change the license to MIT.

W3C Workshop at Madrid

I will be speaking at the upcoming W3C workshop in Madrid. The workshop is on 7-8 May 2014 and the theme is “New Horizons for the Multilingual Web”.

I will be co-presenting with Pau Giner and David Chan from the Wikimedia Foundation Language Engineering team on best practices for translation at Wikipedia. It will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.

Collaboratively edited documentation for Indic font developers

One of the integral building blocks for providing multilingual support for digital content is fonts. In current times, OpenType fonts are the preferred choice. With the increasing need to support languages beyond the Latin script, the TrueType font specification was extended to include elements for the more elaborate writing systems that exist. This effort was jointly undertaken in the 1990s by Microsoft and Adobe. The outcome of this effort was the OpenType specification, a successor to the TrueType font specification.

The Devanagari ddhrya-ligature, as displayed in the JanaSanskritSans font.

Fonts for Indic languages had traditionally been created for the printing industry. The TrueType specification provided the baseline for the digital fonts that were largely used in desktop publishing. These fonts however suffered from inconsistencies arising from technical shortcomings like non-uniform character codes. These shortcomings made the fonts highly unreliable for digital content and their use across platforms. The problems with character codes were largely alleviated with the gradual standardization through modification and adoption of Unicode character codes. The OpenType Specification additionally extended the styling and behavior for the typography.

The availability of the specification eased the process of creating Indic language fonts with consistent typographic behaviour as per each script's requirements. However, disconnects between the styling and the technical implementation hampered the font creation process. Several well-stylized fonts were upgraded to the new specification through complicated adjustments, which at times compromised their aesthetic quality. On the other hand, the technical adoption of the specification details was comparatively new know-how for font designers. To strike a balance, an initiative was undertaken by a group of font developers and designers to document the knowledge acquired from their hands-on experience, for the benefit of upcoming developers and designers in this field.

Glyphs inside the Meera font.

The outcome of the project will be an elaborate, illustrated guideline for font designers. A chapter will be dedicated to each of the Indic scripts – Bengali, Devanagari, Gujarati, Kannada, Malayalam, Odia, Punjabi, Tamil and Telugu. The guidelines will outline the technical representation of the canonical aspects of these complex scripts. This is especially important when designing for complex scripts where the shape or positioning of a character depends on its relation to other characters.

This project is open for participation and contributors can commit directly on the project repository.

New version of Malayalam fonts released

The Swathanthra Malayalam Computing project announced the release of a new version of its Malayalam Unicode fonts this week. In this version, there are many improvements to the popular Malayalam fonts Rachana and Meera. The Dyuthi font has some bug fixes. I am listing the changes below.

  1. The Meera font was small compared to other fonts. This was not really a problem in the GNOME environment, since fontconfig allows you to define a scaling factor to match other fonts' sizes. But it was an issue in LibreOffice, KDE and especially Windows, where this kind of scaling feature does not work. Thanks to P Suresh for reworking the glyphs and fixing this issue.
  2. Rachana, Meera and Dyuthi had wrong glyphs used as placeholder glyphs. Bugs like these are fixed.
  3. Virama 0D4D had a wrong LSB that caused cursor positioning and the glyph boundary to go wrong. Fixed that bug.
  4. The atomic chillu code points introduced in Unicode 5.1 were missing in all the fonts that SMC maintained, because of the controversial decision by Unicode and SMC's stand against it. The issues still exist, but since content using those code points is out there, we added the characters to the Meera and Rachana fonts to avoid any difficulties for users.
  5. Rupee symbols were added to Meera and Rachana. Thanks to Hiran for designing the sans and serif glyphs for the rupee.
  6. Dot reph (0D4E): the glyph for this was already present in Meera but was not mapped to any Unicode point. GSUB lookup tables were added for the glyphs according to the Unicode specification.

For a more detailed change description see this mail thread. There are some minor changes as well.

Thanks to Hussain K H (designer of both Meera and Rachana), P Suresh and Hiran for their valuable contributions. And thanks to the SMC community and font users for using the fonts and reporting bugs. We hope that we can bring this new version to your favorite GNU/Linux distros soon. Wikimedia's WebFonts extension uses the Meera font and it will be updated there soon. The next release of GNU FreeFont is expected to update its Malayalam glyphs using Meera and Rachana for freefont-sans and freefont-serif respectively. We plan to update the other fonts we maintain with these changes in coming versions as well. There are still some glyphs missing in these fonts with respect to the latest Unicode version.

 

SVG Fonts

This post is some notes on the current state of SVG Fonts.

SVG is not a webfont format. The purpose of SVG fonts is to be embedded inside SVG documents (or linked to them), similar to the way you would embed standard TrueType or OpenType fonts in a PDF. SVG fonts are text files that contain the glyph outlines represented as standard SVG elements and attributes, as if they were single vector objects in the SVG image. Unlike the EOT, WOFF and TTF formats, an SVG font is a plain, uncompressed text file.

Even though it is not a webfont format, some browsers will accept SVG fonts in the CSS3 @font-face declaration.

Firefox and IE do not support SVG fonts. Here is the bug on Mozilla's Bugzilla about this, with a lengthy discussion: https://bugzilla.mozilla.org/show_bug.cgi?id=119490. This is one of the reasons why Firefox does not score 100 in the ACID3 test: http://www.itcode.org/why-firefox-is-not-scoring-100-in-the-acid3-test. Developers argue that WOFF is sufficient and that SVG fonts do not give any advantage. Support for SVG fonts in the web development and font communities has been declining for some time. There has already been discussion, without objection, of dropping SVG fonts from the Acid3 test, and the community has put forth a proposal in the SVG Working Group to give SVG fonts optional status.

A browser support matrix for SVG fonts is at http://caniuse.com/#feat=svg-fonts. IE and Firefox do not support them; WebKit-based browsers such as Chrome and Epiphany do. For browsers not supporting SVG features natively, there is a Flash-based javascript library named svgweb: http://code.google.com/p/svgweb/

Limitations of SVG fonts:

  • Not all of the OpenType features are available in the SVG specification.
  • For example, Indic fonts require many OpenType features for correct rendering; see the OpenType spec for Malayalam: http://www.microsoft.com/typography/otfntdev/malayot/shaping.aspx
  • Even though SVG font support is available in some browsers, in practice they cannot render SVG fonts for complex scripts such as the Indic scripts. Here is a sample SVG file with the Meera font defined in it: http://thottingal.in/tests/svg/Meera-fontembedding.svg. As you can see, the rendering is wrong.
  • The main drawback of SVG fonts is that there is no provision for font hinting. The SVG standard states: “SVG fonts contain unhinted font outlines. Because of this, on many implementations there will be limitations regarding the quality and legibility of text in small font sizes. For increased quality and legibility in small font sizes, content creators may want to use an alternate font technology, such as fonts that ship with operating systems or an alternate WebFont format.” (http://www.w3.org/TR/SVG/fonts.html)

There is an alternative proposal to use the OpenType features of the font and use SVG just for the glyphs: https://wiki.mozilla.org/SVGOpenTypeFonts

FontForge can be used for creating SVG fonts, but the created SVG font works only for simple scripts like Latin. It fails to export GPOS/GSUB tables to the SVG; bug report: http://sourceforge.net/mailarchive/message.php?msg_id=27964229. The GSUB issue can be solved either by hand-coding the Unicode sequences for glyphs or by writing an external script. But issues with more important OpenType features, such as vowel sign (matra) reordering, persist.

Even though SVG fonts (defining the font data inside the SVG itself) do not seem to attract much interest from developers, using webfonts inside SVG has some importance. Just as in web pages, webfonts can be used to render the text inside an SVG. The webfont format depends on the browser. Example: http://thottingal.in/tests/svg/Dyuthi-Webfont.svg (have a look at the source code of the file).

Malayalam Wikisource Offline version

The Malayalam Wikisource community today released the first offline version of Malayalam Wikisource during the 4th annual wiki meetup of Malayalam Wikimedians. To the best of our knowledge, this is the first time a Wikisource project has released an offline version. The Malayalam wiki community had released the first offline version of Malayalam Wikipedia one year back.

Releasing the offline version of a Wikisource is a challenging project. The technical aspects of the project were designed and implemented by me, so let me share the details.

As you know, a Wikisource contains a lot of books, each book varies in size, and each is divided into chapters or sections. There is no common pattern for books; each has its own structure. A novel is presented differently from a collection of poems by a poet. Wikisource also has religious books like the Bible, the Quran, the Bhagavad Gita, the Ramayana, etc. Since books are meant for continuous reading over a long time, readability and how we present the lengthy chapters on screen also matter. Offline Wikipedia tools such as Kiwix do not do any layout modification of the content and present it as it is shown on Wikipedia/Wikisource. The tool we wrote last year for the Malayalam Wikipedia offline version also presents scrollable vertical content on the screen. Neither is configurable to give different presentation styles depending on the nature of the book.

What we wanted was a book-reader kind of application interface. Readers should be able to easily navigate to books and chapters. Chapter content can be very lengthy, and for long reading sessions a lengthy, vertically scrolled text is not a good idea. We also needed to take care of the width of the lines: if each line spans 80-90% of the screen, especially on a wide screen monitor, it is a strain for the neck and eyes.

 

Screenshot of the offline version.


The selection of books for the offline version was done by the active Wikimedians at Wikisource. Some of the selected books were proofread by many volunteers within the last two weeks.

The tools used for extracting the HTML were ad hoc and adapted to give a good presentation of each book, so there is not much to reuse there. Extracting the HTML, taking the content part alone using pyquery and removing some unwanted sections from the HTML: basically, this is what our scripts did. The content is added to predefined HTML templates with proper CSS for the UI. The CSS3 multi-column feature was used for the book-like interface. Since IE did not implement this standard even in IE9, the book-like interface was not provided for that browser. Chrome versions below 12 could not support it because of these bugs: http://code.google.com/p/chromium/issues/detail?id=45840 and http://code.google.com/p/chromium/issues/detail?id=78155. For easy navigation, mouse wheel support and page navigation buttons were provided. To work around the non-availability of the required fonts, webfonts were integrated, with a selection box to choose a favorite font. Readers can also select the font size to make reading comfortable.

Why static HTML? The variety of platforms and versions we need to support, the necessity of webfonts, complex script rendering, the effort to develop and customize the UI, the relatively small size of the data, and avoiding any installation of software on users' systems made us choose static HTML + jQuery + CSS as the technology. The downside is that we could not provide full text search.

Apart from the Wikisource content, we also included a collection of copyleft images from Wikimedia Commons. Thanks to Nishan Naseer for preparing a gallery application using jQuery. We selected four categories from Commons which are related to Kerala. We hope everybody will like the pictures, and that it will give a small introduction to Wikimedia Commons.

Even though the Python scripts are not ready for reuse in other projects, if anybody wants to have a look at them, please mail me. I am not making them public since the scripts do not make sense outside the context of each book and its existing presentation in Malayalam Wikisource.

The CD image is available for download here and one can also browse the CD content here.

Thanks to Shiju Alex for coordinating this project, and thanks to all Malayalam Wikisource volunteers for making this happen. We have included poems, folk songs, devotional songs, a novel, a grammar book, tales, and books on Hinduism, Islam, Christianity, Communism and philosophy. With this release, it becomes the biggest offline digital archive of Malayalam books.

Mediawiki Berlin hackathon

I am just back from the MediaWiki Berlin hackathon. From May 13 to 15, MediaWiki developers attended the hackathon, squashed many bugs and discussed many features. The members of the language committee had their first real-life meeting in parallel with the hackathon. It was a nice event; I learned a lot and talked to many awesome hackers and linguists.