At Wikimedia, I am currently working on the ContentTranslation tool, a machine-aided translation system that helps translate articles from one language to another. The tool is now deployed on several Wikipedias and people are creating new articles with it successfully.
The ContentTranslation tool provides machine translation as one of its translation aids, so that editors can use the output as an initial version to improve upon. We use Apertium as the machine translation backend and plan to support more machine translation services soon.
A big difference in editing with ContentTranslation is that it does not involve wiki markup. Instead, editors edit rich text, which is basically contenteditable HTML elements. This also means that what you translate is HTML sections of articles.
The HTML contains all the markup that a typical Wikipedia article has, so the machine translation has to work on HTML content. But not all MT engines support HTML content.
Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words.
$ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
this is |0-1| a |2-2| small |3-3| house |4-4|
The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes (if words are reordered), or syntactic errors (if mappings are not one-to-one).
$ echo 'legal <b>persons</b>' | apertium en-es -f html
Personas <b>legales</b>
$ echo 'I <b>am</b> David' | apertium en-es -f html
Soy</b> David
Other MT engines exhibit similar problems. This makes it challenging to provide machine translations of formatted text. This blog post explains how this challenge is tackled in ContentTranslation.
As we saw in the examples above, a machine translation engine can cause the following errors in the translated HTML. The errors are listed in descending order of severity.
- Corrupt markup – If the machine translation engine is unaware of HTML structure, it can move HTML tags arbitrarily, corrupting the markup in the MT result.
- Wrongly placed annotations – The two examples given above illustrate this. It is more severe if the content includes links and the link targets are swapped or assigned arbitrarily in the MT output.
- Missing annotations – Sometimes the MT engine drops some tags during translation.
- Split annotations – During translation a single word can be translated to more than one word. If the source word has markup, say an <a> tag, will the MT engine wrap both words in one <a> tag or apply the tag to each word separately?
All of the above issues can cause a bad experience for translators.
Apart from potential issues with markup transfer, there is another aspect to sending HTML content to MT engines. Compared to the plain text version of a paragraph, the HTML version is bigger in bytes. Most of the extra content is tags and attributes, which should be unaffected by the translation, so sending it is unnecessary bandwidth usage. If the MT engine is metered (non-free, with API access measured and limited), we are not being economical.
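To get a feel for the overhead, the following sketch compares the UTF-8 size of a small HTML fragment with the size of its plain text. The regular expression used to strip tags is a crude assumption made for illustration only; it is not how ContentTranslation extracts text.

```python
import re
from typing import Tuple

# Crude tag matcher (assumption): good enough for a size estimate only.
TAG = re.compile(r"<[^>]+>")


def byte_sizes(html: str) -> Tuple[int, int]:
    """Return (bytes of the HTML, bytes of its plain text) in UTF-8."""
    plain = TAG.sub("", html)
    return len(html.encode("utf-8")), len(plain.encode("utf-8"))


html_size, text_size = byte_sizes('<p>I am from <a href="x">Kerala</a></p>')
print(html_size, text_size)  # prints: 39 16
```

Even in this tiny fragment, more than half of the bytes are markup that the MT engine has no reason to see.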
An outline of the algorithm we used to transfer markups from source content to translated content is given below.
- The input HTML content is translated into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
- Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
- The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
- The alignment information is used to reapply markup to the translated text.
This makes sure that MT engines translate only plain text and that markup is applied as a post-MT processing step.
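The idea of a linear array of text chunks carrying their inline markup can be sketched as follows. This is not the actual LinearDoc code: the class name `Lineariser`, the `INLINE_TAGS` set and the tuple layout are assumptions chosen for illustration, using only Python's standard `html.parser`.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Assumption: a small illustrative subset of inline tags.
INLINE_TAGS = {"a", "b", "i", "s"}

Chunk = Tuple[str, Tuple[str, ...]]  # (text, inline tags open around it)


class Lineariser(HTMLParser):
    """Collect text chunks together with the inline tags that wrap them."""

    def __init__(self) -> None:
        super().__init__()
        self.open_tags: List[str] = []
        self.chunks: List[Chunk] = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the innermost open occurrence of this tag.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]

    def handle_data(self, data):
        self.chunks.append((data, tuple(self.open_tags)))


def linearise(html: str) -> List[Chunk]:
    parser = Lineariser()
    parser.feed(html)
    parser.close()
    return parser.chunks


def plain_text(chunks: List[Chunk]) -> str:
    """The text that would be sent to the MT engine."""
    return "".join(text for text, _ in chunks)


chunks = linearise("<p>The <b>big <i>red</i></b> dog</p>")
print(chunks)
print(plain_text(chunks))  # The big red dog
```

With the markup held as per-chunk attributes, slicing or reordering the text becomes a list operation rather than string surgery on HTML.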
Essentially, the algorithm does a fuzzy match to find the target locations in the translated text at which to apply annotations. Here too, the content given to MT engines is plain text only.
The steps are given below.
- For the text to translate, find the text of inline annotations such as bold, italics and links. We call these subsequences.
- Pass the full text and the subsequences to the plain text machine translation engine. Use a delimiter so that we can map between the array of source items (full text and subsequences) and the array of translated items.
- The translated full text will contain the translated subsequences somewhere in it. To locate a subsequence translation in the full text translation, use an approximate search algorithm.
- The approximate search algorithm returns the start position and the length of the match. To that range we map the annotation from the source HTML.
- The approximate match involves calculating the edit distance between words in the translated full text and in the translated subsequence. It is not strings that are searched, but n-grams with n = number of words in the subsequence. Each word in the n-gram is matched independently.
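The delimiter step can be sketched like this. The names `translate_batch` and `fake_translate` are hypothetical, the newline delimiter is an assumption (a real engine may need a different one), and `fake_translate` is only a lookup table standing in for an MT engine so that the example runs on its own.

```python
from typing import Callable, List

# Assumption: the engine leaves line breaks intact, so they can act as a delimiter.
DELIMITER = "\n"


def translate_batch(items: List[str], translate: Callable[[str], str]) -> List[str]:
    """Translate several strings in one request and split the result back."""
    parts = translate(DELIMITER.join(items)).split(DELIMITER)
    if len(parts) != len(items):
        raise ValueError("the engine did not preserve the delimiter")
    return parts


def fake_translate(text: str) -> str:
    """Stand-in for an MT engine: looks each line up in a fixed table."""
    table = {
        "Es además de Valencia.": "A més de València.",
        "además": "a més",
    }
    return DELIMITER.join(table[line] for line in text.split(DELIMITER))


full, sub = translate_batch(["Es además de Valencia.", "además"], fake_translate)

# Simplest possible search: case-insensitive exact match.
start = full.lower().find(sub.lower())
print(start, len(sub), full[start:start + len(sub)])
```

The start position and length printed at the end are the range to which the source annotation would be mapped. A plain `find` is only enough for the easiest cases.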
To understand the matching, let us try the algorithm on some example sentences.
- Translating the Spanish sentence <p>Es <s>además</s> de Valencia.</p> to Catalan: The plain text version is "Es además de Valencia." and the subsequence with annotation is "además". We give both the full text and the subsequence to MT. The full text translation is "A més de València." and the word "además" is translated as "a més". We search for "a més" in the full text translation. The search succeeds and the <s> tag is applied, resulting in <p>És <s>a més</s> de València.</p>. The search performed in this example is a plain text exact search. But the following example illustrates why it cannot always be an exact search.
- Translating the English sentence <p>A <b>Japanese</b> <i>BBC</i> article</p> to Spanish: The full text translation is "Un artículo de BBC japonés". One of the subsequences, "Japanese", is translated as "Japonés". The case of "J" differs, and the search should be smart enough to identify "japonés" as a match for "Japonés". The difference in word order between the source text and the translation is already handled by the algorithm. The following example illustrates that it is not just the case that changes.
- Translating <p>A <b>modern</b> Britain.</p> to Spanish: The plain text version is translated as "Una Gran Bretaña moderna." and the annotated word "modern" is translated as "Moderno". We need to match "moderna" against "Moderno", and we get <p>Una Gran Bretaña <b>moderna</b>.</p>. This is a case of word inflection: a single letter at the end of the word changes.
- Now let us see an example where the subsequence is more than one word, and where subsequences are nested. Translating the English sentence <p>The <b>big <i>red</i></b> dog</p> to Spanish: here the subsequence "big red" is in bold and, inside that, "red" is in italics. In this case we need to translate the full text and the subsequences "big red" and "red". So we have "El perro rojo grande" as the full translation, and "Rojo grande" and "Rojo" as translations of the subsequences. "Rojo grande" needs to be located first and the bold tag applied. Then we search for "Rojo" and apply italics. Then we get <p>El perro <b><i>rojo</i> grande</b></p>.
- How does it work with heavily inflected languages like Malayalam? Suppose we translate <p>I am from <a href="x">Kerala</a></p> to Malayalam. The plain text translation is ഞാന് കേരളത്തില് നിന്നാണു്. And the subsequence Kerala gets translated to കേരളം. So we need to match കേരളം and കേരളത്തില്. They differ by an edit distance of 7, and the changes are at the end of the word. This shows that we will require language-specific tailoring to produce reasonable output.
The algorithm for approximate string matching can be a simple Levenshtein distance, but what would be an acceptable edit distance? That must be configurable per language module. And the following example illustrates that matching on edit distance alone won’t work.
Translating <p>Los Budistas no <b>comer</b> carne</p> to English, the plain text translation is "The Buddhists not eating meat" and "Comer" translates as "eat". With an edit distance approach, "eat" matches "meat" (distance 1) better than "eating" (distance 3). To address this kind of case, we mix in a second criterion: the words should start with the same letter. So this also illustrates that the algorithm should have language-specific modules.
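The effect of the second criterion can be checked with a small sketch. The function names and the `same_first_letter` flag are hypothetical and only illustrate the idea; the real matching may differ in detail.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_match(word: str, candidates: List[str],
               same_first_letter: bool = False) -> Optional[str]:
    """Return the candidate with the smallest edit distance to word."""
    word = word.lower()
    pool = [c for c in candidates
            if not same_first_letter or c.lower()[:1] == word[:1]]
    return min(pool, key=lambda c: levenshtein(word, c.lower()), default=None)


words = "The Buddhists not eating meat".split()
print(best_match("eat", words))                          # meat
print(best_match("eat", words, same_first_letter=True))  # eating
```

A same-first-letter rule suits languages where inflection happens at the end of the word; a language that inflects with prefixes would need a different rule, which is why such criteria belong in language-specific modules.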
Still, there are cases that cannot be solved by the algorithm described above. Consider the example <p>Bees <b>cannot</b> swim</p>. The plain text translation to Spanish is "Las Abejas no pueden nadar" and the phrase "cannot" translates as "Puede no". Here we need to match "Puede no" and "no pueden", which of course won’t match with the approach explained so far.
To address this case, we do not treat the subsequence as a string, but as an n-gram where n = number of words in the subsequence. The fuzzy matching is done per word in the n-gram, not for the entire string. That is, "Puede" is fuzzy matched against "no" and "pueden", and "no" is fuzzy matched against "no" and "pueden", left to right, until a match is found. This takes care of word order changes as well as inflections.
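A minimal sketch of this per-word n-gram search follows. Note that it swaps in `difflib.SequenceMatcher` from the Python standard library as the word similarity measure instead of an edit distance, and the 0.7 threshold is an arbitrary assumption; `similar` and `find_ngram` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Case-insensitive fuzzy comparison of two words."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find_ngram(words: List[str], ngram: List[str]) -> Optional[Tuple[int, int]]:
    """Find a window of len(ngram) words in which every n-gram word has a
    fuzzy match, in any order. Returns (start index, length) in words."""
    n = len(ngram)
    for start in range(len(words) - n + 1):
        remaining = words[start:start + n]
        for target in ngram:
            hit = next((w for w in remaining if similar(target, w)), None)
            if hit is None:
                break              # this window cannot match
            remaining.remove(hit)  # each window word is used only once
        else:
            return start, n
    return None


print(find_ngram("Las Abejas no pueden nadar".split(), "Puede no".split()))
print(find_ngram("El perro rojo grande".split(), "Rojo grande".split()))
```

Both calls return `(2, 2)`: the windows "no pueden" and "rojo grande" each start at word index 2 and span two words, and that word range is where the annotation would be applied.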
Revisiting the four types of errors that can happen in annotation transfer, we see that with the algorithm explained so far, the worst case is that we miss annotations. There is no case of corrupted markup.
As ContentTranslation adds support for more languages, language-specific customization of the above approach will be required.
You can see the algorithm in action by watching the video linked above. And here is a screenshot:
Credits: David Chan, my colleague at Wikimedia, gave extensive help by providing a lot of example sentences of varying complexity to fine-tune the algorithm. The LinearDoc model that makes the whole algorithm work was written by him. David also wrote an algorithm that handles HTML translation using upper casing; you can read about it here. The approximation-based algorithm explained above replaced it.
A new handwriting-style font for Malayalam is in development. The font is named “Chilanka” (ചിലങ്ക).
You may try the font using this editable page, which has the font embedded: http://smc.org.in/downloads/fonts/chilanka/tests/
Download the latest version: http://smc.org.in/downloads/fonts/chilanka/Chilanka.ttf
- Font license: Free licensed font, OFL.
- Source code: https://github.com/smc/Chilanka
- Tools used for drawing: Inkscape and FontForge
Chilanka/ചിലങ്ക is a musical anklet
A brief note on the workflow I used for font development follows.
- Prepared a template SVG in Inkscape that has all guidelines and the grid set up.
- Drew the glyphs. This is the hardest part. For this font, I used the Bezier tool of Inkscape. Only the SVG with the stroke is saved. I did not prepare the outline in Inkscape, which let me rework the drawing several times easily. To visualize how the stroke would look in the outlined version, I set the stroke width to 130, with rounded end points. All SVGs are version tracked. They are saved as Inkscape SVGs so that I can retain my guidelines and grids.
- In FontForge, imported these SVGs and created the outline using expand stroke, with stroke width 130, stroke height 130, pen angle 45 degrees, and line cap and line join set to round.
- Simplified the glyphs automatically and manually to reduce the impact of converting cubic Bezier curves to quadratic Bezier curves.
- Metrics tuning. Set both left and right bearings to 100 units (in general; there is glyph-specific tuning).
- The OpenType tables are the complex part. But for this font, they did not take much time since I used SMC’s existing, well-maintained feature tables. I could just focus on the design part.
- Tested using test scripts.
Some more details:
- Design: Santhosh Thottingal
- Technology: Santhosh Thottingal and Kavya Manohar
- Total number of glyphs: 676. Includes basic Latin glyphs.
- Project started on September 15, 2014
- Number of SVGs prepared: 271
- Em size: 2048. Ascent: 1434. Descent: 614
- 242 commits so far.
- Latest version: 1.0.0-alpha.20141027
- All drawings are in Inkscape. No paper involved, no tracing.
Thanks to all my friends who are helping me with testing, and for their encouragement.
Stay tuned for the first version announcement.
(Cross posted from http://blog.smc.org.in/new-handwriting-style-font-for-malayalam-chilanka/ )
So I did some cleanup and a rewrite, added documentation and an example, and here it is: http://thottingal.in/projects/swanalekha/swanalekha-ml.html
English and many other languages have only two plural forms: singular if the count is one, and plural for anything else, including zero.
But some other languages have more than two plural forms. Arabic, for example, has six plural forms, referred to as ‘zero’, ‘one’, ‘two’, ‘few’, ‘many’ and ‘other’. Integers 11-26, 111 and 1011 are of the ‘many’ form, while 3, 4, ..., 10 are of the ‘few’ form.
When preparing the messages for application user interfaces, grammatically correct sentences are a must. “Found 1 results” and “Found 1 result(s)” are bad interface messages. For a developer, if the language in context is English or a language with similar plural forms, it may be a matter of an if condition to choose one of the messages.
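For English-like languages, that if condition could look like the following sketch. The function name and the message text are only illustrative.

```python
def found_message(count: int) -> str:
    """Pick the English message form: singular only when count is exactly one."""
    if count == 1:
        return f"Found {count} result"
    return f"Found {count} results"


print(found_message(1))  # Found 1 result
print(found_message(0))  # Found 0 results
```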
But that approach does not scale if we want to deal with a lot of languages. Some applications come with their own plural handling mechanism, typically a module that tells you the plural form given a number and a language. The plural forms per language, and the rules to determine them, are defined in CLDR. CLDR defines the plural rules in a markup language named LDML and releases the collection frequently.
If you look at the CLDR plural rules table you can easily understand this. The rules are defined in a particular syntax. For example, the Russian plural rules are given below.
One needs to pass the value of the number to the variable in the above expressions and evaluate them. If an expression evaluates to a boolean true, then the corresponding plural form should be used.
So, an expression like n = 0 or n != 1 and n mod 100 = 1..19, mapped to ‘few’, holds true if the value of n is 0, 119, 219 or 319. So we say that these numbers are of the ‘few’ plural form.
But in the Russian example given above, we don’t see n; we see the variables v, i, etc. The meaning of these variables is defined in the standard as:
| Symbol | Value |
|---|---|
| n | absolute value of the source number (integer and decimals). |
| i | integer digits of n. |
| v | number of visible fraction digits in n, with trailing zeros. |
| w | number of visible fraction digits in n, without trailing zeros. |
| f | visible fractional digits in n, with trailing zeros. |
| t | visible fractional digits in n, without trailing zeros. |
Keeping these definitions in mind, the expression v = 0 and i % 10 = 1 and i % 100 != 11 evaluates to true for 1, 21, 31, 41, etc. and to false for 11. In other words, the numbers 1, 21 and 31 are of the plural form “one” in Russian.
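The operands and the expression above can be tried out with a small sketch. It hand-codes only the one expression quoted in the text and is not how CLDRPluralRuleParser works (that parses the rule syntax itself); `plural_operands` and `is_russian_one` are hypothetical names. The number is taken as a string, because trailing zeros such as in "1.50" are significant for v and f.

```python
from decimal import Decimal
from typing import Dict


def plural_operands(number: str) -> Dict[str, object]:
    """Compute the n, i, v, w, f, t operands from a decimal string."""
    text = number.lstrip("+-")
    integer_part, _, fraction = text.partition(".")
    trimmed = fraction.rstrip("0")
    return {
        "n": abs(Decimal(number)),     # absolute value
        "i": int(integer_part or "0"),  # integer digits
        "v": len(fraction),             # fraction digit count, with trailing zeros
        "w": len(trimmed),              # fraction digit count, without trailing zeros
        "f": int(fraction or "0"),      # fraction digits, with trailing zeros
        "t": int(trimmed or "0"),       # fraction digits, without trailing zeros
    }


def is_russian_one(number: str) -> bool:
    """Evaluate: v = 0 and i % 10 = 1 and i % 100 != 11."""
    op = plural_operands(number)
    return op["v"] == 0 and op["i"] % 10 == 1 and op["i"] % 100 != 11


print(plural_operands("1.50"))
print([x for x in ["1", "11", "21", "31", "41", "1.0"] if is_russian_one(x)])
```

The second line prints `['1', '21', '31', '41']`: 11 is excluded by the last condition, and "1.0" is excluded because it has a visible fraction digit (v = 1).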
CLDRPluralRuleParser is that evaluator. I wrote this parser when we at the Wikimedia Foundation wanted data-driven plural rule evaluation for the 300+ languages we support. It started as a free-time project in June 2012. Later it became part of MediaWiki core to support front-end internationalization. We also wanted a PHP version to support interface messages constructed on the server side, and Tim Starling wrote a PHP CLDR plural rule evaluator.
The node module comes with a command line interface, just to experiment with rules.
$ cldrpluralruleparser 'n is 1' 0
License: Initially the license of the module was GPL, but following collaboration discussions between Wikimedia, cldrjs, jQuery.globalize and moment.js, it was decided to change the license to MIT.
For an advanced logging system for Node.js applications, winston is very helpful. Winston is a multi-transport async logging library for Node.js. Similar to well-known logging systems like log4j, it lets us configure log levels and define multiple logging targets such as file, console and database.
I wanted to configure logging along the usual Node.js production versus development split. In development mode I am more interested in debug-level logging, and in production I am more interested in higher-level logs.
I am sharing my singleton logger instance setup code.
I use Brackets for web development. I have tried several other IDEs, but Brackets is my current favorite. A few things I like are listed below.
- It is free software licensed under the MIT License
- Availability of large number of extensions
Some extensions I use with Brackets are:
- Markdown Preview for easy editing of markdown
- Brackets Git for git integration
- Themes for Brackets, for the Monokai Darksoda theme I use
- Brackets Linux UI
- Interactive Linter for realtime JSHint/JSLint/CoffeeLint reports in Brackets as you work on your code
- WD Minimap for SublimeText like code overview
- Beautify for automatic code formatting as you save using jsbeautify
There was an enhancement bug for this. I wrote a patch for handling project specific jsbeautifyrc and Martin Zagora merged it to the repo. Here is my .jsbeautifyrc for MediaWiki https://gist.github.com/santhoshtr/9867861
Brackets is in active development and I look forward to more features. The most important bug I would like to see fixed, one that all code editors I have tried suffer from, including Brackets, is the lack of pain-free editing and rendering of complex scripts. Brackets uses CodeMirror for the code editor, and I have reported this issue. It is not trivial to fix, and the root cause is related to the core design. Along with JS, CSS, HTML, PHP, etc., I have to work with files containing all kinds of natural language text, so this feature is important to me.
One of the integral building blocks for providing multilingual support for digital content are fonts. In current times, OpenType fonts are the choice. With the increasing need for supporting languages beyond the Latin script, the TrueType font specification was extended to include elements for the more elaborate writing systems that exist. This effort was jointly undertaken in the 1990s by Microsoft and Adobe. The outcome of this effort was the OpenType Specification – a successor to the TrueType font specification.
Fonts for Indic languages had traditionally been created for the printing industry. The TrueType specification provided the baseline for the digital fonts that were largely used in desktop publishing. These fonts however suffered from inconsistencies arising from technical shortcomings like non-uniform character codes. These shortcomings made the fonts highly unreliable for digital content and their use across platforms. The problems with character codes were largely alleviated with the gradual standardization through modification and adoption of Unicode character codes. The OpenType Specification additionally extended the styling and behavior for the typography.
The availability of the specification eased the process of creating Indic language fonts with consistent typographic behaviour as per the script’s requirement. However, disconnects between the styling and technical implementation hampered the font creation process. Several well-stylized fonts were upgraded to the new specification through complicated adjustments, which at times compromised on their aesthetic quality. On the other hand, the technical adoption of the specification details was a comparatively new know-how for the font designers. To strike a balance, an initiative was undertaken by a group of font developers and designers to document the knowledge acquired from their hands-on experience for the benefit of upcoming developers and designers in this field.
The outcome of the project will be an elaborate, illustrated guideline for font designers. A chapter will be dedicated to each of the Indic scripts – Bengali, Devanagari, Gujarati, Kannada, Malayalam, Odia, Punjabi, Tamil and Telugu. The guidelines will outline the technical representation of the canonical aspects of these complex scripts. This is especially important when designing for complex scripts where the shape or positioning of a character depends on its relation to other characters.
This project is open for participation and contributors can commit directly on the project repository.
- Project: https://github.com/IndicFontbook/Fontbook
- Mailing list: https://groups.google.com/forum/#!forum/indicfontbook
- Latest version of the documentation: http://thottingal.in/documents/Fontbook.pdf