When Breath Becomes Air – Paul Kalanithi

I read this book after my friends recommended it, and wrote the following note on Goodreads:

I am not rating this book. Not because the book is bad. It is well written; I read it in a single day. But it gave me a lot of pain, and I would not want any of my friends to go through that pain. Reading it feels very personal, like watching a friend – the author of the book – suffering. I have had friends who went through very difficult illnesses and are no longer with us. I have close relatives who are going through similarly difficult times.

That is about rating and recommendation.

But I am thankful to the friends who introduced this book to me. In just about 100 pages, a friend appears in your life and leaves after a very short visit, presenting many perspectives on life and its struggles. He earns your respect very quickly, and your tears towards the last pages. I literally skipped the last few pages just because I did not want to.

Gujarat Files – Rana Ayyub

Fear – that is the single word I can use to describe my feeling after reading this book. My belief in the Indian judicial and parliamentary system is not a firm one, and this book shakes it badly. I am nobody to judge the facts revealed in the book, but what makes me afraid is that there is very little chance these facts will ever be verified by the current Indian political and legal system.

A lot of respect for the author and her courage in doing this work.

Feedback on KTU Syllabus of Electronics and Communication Engineering

Kerala Technological University (KTU) published a draft syllabus for the third and fourth semesters of Electronics and Communication Engineering for the coming academic year. It raised widespread concerns regarding:

  • The depth and vastness of the content
  • The obsolescence of some of the content
  • The sequence in which concepts are introduced, and the pedagogy involved
  • FOSS friendliness

To discuss the matter and collect feedback from the wider academic community, KTU called a syllabus discussion meeting at its office on 13th May, 2016. More than a hundred faculty members from various engineering colleges in Kerala attended and expressed their genuine concerns, comments and suggestions. I got the opportunity to attend as well.

The syllabus committee agreed to wait till 20th May to receive more comments before publishing the revised draft by 25th May. The collaboratively created document on the changes to be incorporated into the content of various courses can be found here.

Concerns on Electronic Design Automation Lab

As per the draft syllabus published, KTU plans to introduce a new course, ‘ELECTRONICS DESIGN AUTOMATION LAB’, into its third-semester Electronics and Communication Engineering syllabus. It is well and good to familiarize students with the software tools needed to automate tasks like the design and simulation of electronic circuits, numerical computation, PCB design, hardware description using HDLs, etc.

Many pedagogical concerns were raised in the meeting about introducing all these diverse EDA tools in a single course. The need for a tool should be obvious to students while they learn it. It was proposed that SPICE simulations should go along with Network Theory and Electronic Circuits, and that Logic Circuit Design should be taught with the aid of an HDL.

What is more of a concern is that the syllabus is not FOSS friendly. It explicitly specifies proprietary tools: MATLAB for numerical computation, analysis and plotting, PSpice for electronic circuit simulation, and VHDL for logic design. This proposal effectively forces every technical institute to buy licensed versions of these software packages.

The Govt. of India has an Open Source Software adoption policy. The Kerala State Govt. too has a policy to adopt Free and Open Source software. As per these policies, the use of proprietary tools is allowed only when there is no open source alternative. There is open source software, such as Scilab (with Xcos), Octave, and SciPy/NumPy, that can be used for the numerical computation experiments specified in the syllabus.

Why is FOSS adoption important?

Teaching and learning should not be tool- or product-specific. The syllabus should be neutral and should not endorse a brand; students should not be locked in to a specific vendor. Learners who wish to install the software and experiment further shouldn’t be restricted by licensing terms and high costs. Otherwise it would encourage unethical practices, like using pirated copies of the software.

Open source software is developed through open collaboration. The algorithmic implementations are not black boxes, as they are in proprietary tools; they are openly licensed for learning and modification by enthusiasts. Learning an EDA tool should not end with the lab course: students should acquire the skills necessary to solve any engineering problem that comes their way using these tools.

There are MHRD initiatives like the FOSSEE (Free and Open Source Software for Education) project to promote the use of FOSS tools and improve the quality of education in our country. The aim is to reduce dependency on proprietary software in educational institutions. FOSSEE encourages the use of FOSS tools through various activities, to ensure that commercial software is replaced by equivalent FOSS tools. It even develops new FOSS tools and upgrades existing ones to meet requirements in academia and research. FOSSEE supports academic institutions in FOSS adoption through its lab migration and textbook companion projects.

Why doesn’t KTU collaborate with FOSSEE?

FOSS adoption might not be a very easy task. Institutions and faculty members may need technical support, and KTU can collaborate with FOSSEE in this regard. FOSSEE has created a repository of spoken tutorials for various FOSS tools for numerical computation, analog and digital circuit simulation, etc.

Free software has training and maintenance costs just like proprietary software, but the learning curve can be smoothed through joint efforts.

If the tools and software used are open source, KTU can plan an open repository of solved simulation experiments, continuously enriched by contributions from faculty members and students. I hope KTU takes some steps in this direction, as per the suggestions submitted.

Internationalized Top Level Domain Names in Indian Languages

Medianama recently published a news report, “ICANN approves Kannada, Malayalam, Assamese & Oriya domain names”, which says:

ICANN (Internet Corporation for Assigned Names and Numbers) has approved four additional proposed Indic TLDs (top level domain names), in Malayalam, Kannada, Assamese and Oriya languages. The TLDs are yet to be delegated to NIXI (National Internet exchange of India). While Malayalam, Kannada and Oriya will use their own scripts, Assamese TLDs will use the Bengali script.

The news title says “domain names” while the report talks about TLDs. For many people, a domain name is simply something like “google.com” or “amazon.in”. So people may misinterpret the report as approval for domain names like “കേരളസർവ്വകലാശാല.ഭാരതം”. Many people asked me if that was the case. We are going to have such domain names in the future, but not yet.

I will try to explain the concept of TLD and IDN and the current status in this post.

The Internet Corporation for Assigned Names and Numbers (ICANN) is a non-profit organization that takes care of the internet domain name system and registration process. It achieves this with the help of many processes, policies, and domain registrars. In India, NIXI owns the .in registration process.

A domain name is a string used to identify a member of a network, based on the well-defined Domain Name System (DNS). So “google.com”, “thottingal.in”, etc. are domain names. The dots in a domain name indicate the hierarchy, read from right to left. In the domain name “thottingal.in”, “.in” is the top level, or root, of the naming, and under it there is “thottingal”. In “blog.thottingal.in”, “blog” is a subdomain under “thottingal.in”, and so on.

The top level domains are familiar to us: “.org”, “.com”, “.in”, “.uk” and “.gov” are all examples. Of these, “.com” and “.org” are generic top level domains, while “.gov” is a sponsored TLD. “.in” and “.uk” are country code top level domains, often abbreviated as ccTLDs. “.in” is obviously for India.

In November 2009, ICANN decided to allow these domain name strings in the scripts used in each country, so “.in” should be representable in Indian languages too. These are called Internationalized country code Top Level Domain names, abbreviated IDN ccTLDs.
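For the curious, such Unicode labels travel through the DNS in an ASCII-Compatible Encoding: the label is transformed with Punycode and prefixed with “xn--”. A quick sketch using Python's built-in punycode codec (this shows only the Punycode step; real IDNA processing also normalizes and validates the label first):

```python
# Convert a Unicode label to its ASCII-Compatible Encoding (ACE) form.
# Note: this is only the Punycode transform, not full IDNA processing.
label = "ഭാരതം"
ace = "xn--" + label.encode("punycode").decode("ascii")

# The ACE form is pure ASCII, so legacy DNS software can carry it.
assert all(ord(c) < 128 for c in ace)

# The encoding is reversible, so resolvers can recover the Unicode form.
decoded = ace[len("xn--"):].encode("ascii").decode("punycode")
assert decoded == label
```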

ICANN also defined a fast track process for defining these domains and delegating them to registrars, so that website owners can register such domain names. The actual policy document is available on the ICANN website [pdf], but in short the steps are: (1) preparation, (2) string validation and approval, and (3) delegation to registrars.

So far, the following languages finished all three steps, in 2014:

  1. Hindi:  .भारत
  2. Urdu: بھارت
  3. Telugu: .భారత్
  4. Gujarati: .ભારત
  5. Punjabi: .ਭਾਰਤ
  6. Bengali: .ভারত
  7. Tamil: .இந்தியா

What this means is that NIXI owns these TLDs and can assign domains to website owners. But as far as I know, NIXI is yet to start doing that.

And the following languages just got approval for the second step, string validation. ICANN announced this on April 13, 2016. String validation means that requests are evaluated in accordance with the technical and linguistic requirements for the IDN ccTLD string(s). IDN ccTLD requesters must fulfill a number of requirements:

  • The script used to represent the IDN ccTLDs must be non-Latin;
  • The languages used to express the IDN ccTLDs must be official in the corresponding country or territory; and
  • A specific set of technical requirements must be met.

The languages that passed the second stage now are:

  1. Kannada: .ಭಾರತ
  2. Malayalam: .ഭാരതം
  3. Assamese: .ভাৰত
  4. Oriya: .ଭାରତ

As the next step, these languages need delegation, with NIXI as the registrar. So, in short, nothing is ready yet for people who want to register domain names under the above TLDs.

We were talking about TLDs, top level domain names. Why is there a delay in allowing people to register domains once we have a TLD? It is not easy. Domain names are unique identifiers, and there must be well-defined rules to validate and register a domain. The domain should be a valid string based on the linguistic characteristics of the language. There should be a de-duplication process: nobody should be allowed to take a domain that is already registered. You may think this is trivial string comparison, but no, it is very complex. There are visually similar characters in these scripts, there are rules about how consonant-vowel combinations can appear, and there are canonically equivalent letters. There are also security issues [pdf] to consider.
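To illustrate canonical equivalence with one real Malayalam example (a Python sketch; the registry tooling itself is not Python): the vowel sign ൊ (U+0D4A) is canonically equivalent to the sequence U+0D46 U+0D3E, so two labels can look identical on screen while being different strings, and comparisons must normalize first.

```python
import unicodedata

# Two ways to write the same Malayalam syllable: the precomposed vowel
# sign U+0D4A, and the canonically equivalent pair U+0D46 U+0D3E.
composed = "ക\u0d4a"
decomposed = "ക\u0d46\u0d3e"

# They render identically but are different strings...
assert composed != decomposed

# ...so a registry must normalize before comparing candidate labels.
assert (unicodedata.normalize("NFD", composed)
        == unicodedata.normalize("NFD", decomposed))
```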

Before domain names can be allowed, the IDN policy for each script needs to be defined and approved. You can see a sample here: Draft IDN Policy for Tamil [PDF]. The definition of these rules was initially attempted by CDAC, but it was controversial and did not proceed much. I reviewed the Malayalam policy in 2010 and participated in the discussion meetings based on a critique we prepared.

ICANN has created Generation Panels to develop Root Zone Label Generation Rules, with specific reference to Neo-Brahmi scripts. I am a volunteer member of this panel. Once the rules are defined, registration will start, but I don’t know exactly when that will happen. The Khmer Generation Panel has completed its proposal for the Root Zone LGR, and the proposal has been released for public comments.

Redesigned font download page of SMC

The font preview and download page of SMC has a fresh look now.


The intention is to provide a font preview, typography showcase, and download site in a single page. Every font has multiple illustrations of its usage, and the text used is editable if you want to try your own text.

The old page, which was also designed by me, was not mobile friendly. It provided a single-page view to compare the fonts, each represented as a card, but it did not have enough flexibility to showcase some fine usages of typography.


The new design is mobile friendly.

On the technical side, I used flexbox and LESS. For carousel-style transitions, I used Cycle2. More importantly, I did not use Bootstrap :). See the code.

Fontconfig language matching

I had to spend a few hours debugging a problem with fontconfig not identifying a font for a language. Following the tradition of sharing knowledge acquired the hard way, let me note it down here for the search engines.

The font that I am designing now has three style variants: thin, regular, and bold. All have the same family name. So if you set this family, depending on the context, the thin, regular, or bold version will be picked. Regular is expected by default. Also, when you pick the font from a font selector, you would expect regular to be selected by default.

The problem I was facing was that Bold was getting selected as the default instead of Regular. In font selectors, Bold was listed first.

In GNU/Linux systems, this font matching and selection is done by fontconfig. I started with fc-match:

$ fc-match MyFont
MyFontBold.otf: "MyFont" "Bold"

So that confirms the problem. After fiddling with OS/2 properties, asking on the fontconfig mailing list, and reading the fontconfig documentation, I found that the lang property fontconfig calculates for the Regular variant of the font did not include ‘en’:

$ fc-list MyFont : family : style : lang 

I tried to find out how fontconfig calculates the languages supported by a font. The minimum set of code points that must be included in a font for fontconfig to declare that it supports a given language is defined in the fontconfig library; you can find these sets in the source code. For example, the code points (more precisely, the glyphs mapped to them) that must be present for English are defined in the en.orth file. I cross-checked each code point, and one was indeed missing from my regular font variant, while the bold version had everything. When I added it, everything started working normally.
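The coverage rule can be sketched like this (illustrative Python, not the fontconfig API; the code point sets are made-up stand-ins for a real cmap and a real .orth file):

```python
def supported_languages(font_codepoints, orth):
    """fontconfig-style rule: a language is supported only if every code
    point listed in its .orth file is present in the font's cmap."""
    return {lang for lang, required in orth.items()
            if required <= font_codepoints}

# Hypothetical orth data and font coverage.
orth = {"en": {0x41, 0x42, 0x43}}   # pretend en.orth requires A, B, C
regular = {0x41, 0x42}              # Regular is missing U+0043
bold = {0x41, 0x42, 0x43}           # Bold covers everything

# One missing code point means the language is not claimed at all,
# which is exactly why my Regular variant lost 'en'.
assert supported_languages(regular, orth) == set()
assert supported_languages(bold, orth) == {"en"}
```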

Later, fontconfig developer Akira TAGOH told me that I can also use fc-validate to check the language coverage:

$ fc-validate --lang=en MyFont.otf
MyFont.otf:0 Missing 1 glyph(s) to satisfy the coverage for en language

And after adding the missing glyph:

$ fc-validate --lang=en MyFont.otf
MyFont.otf:0 Satisfy the coverage for en language

And now fc-match lists Regular as the default style:

$ fc-match MyFont
MyFont.otf: "MyFont" "Regular"

Experimenting with eSim – A Tool for Electronic Circuit Simulation

I did not have much exposure to open source Electronic Design Automation tools during my undergraduate course in Electronics and Communication Engineering. My institute had proprietary EDA tools in its lab, and all my experience was limited to them. I must confess I never tried to explore the FOSS world for alternatives until I needed to offer a lab course on basic circuit simulation.

Web searches took me to the design suite eSim. It is an open source EDA tool for circuit design, simulation, analysis, and PCB design. It is an integrated tool built on open source software such as KiCad and Ngspice, and is released under the GPL. Its GUI guides the user through the steps of schematic creation, netlist generation, PCB design, and simulation. The eSim source code is hosted at https://github.com/FOSSEE/eSim .

eSim is developed by FOSSEE (Free and Open Source Software for Education), an initiative of MHRD, Govt. of India. FOSSEE promotes the migration of labs in educational institutions from proprietary tools to FOSS-only ones through lab migration projects. The source code of lab experiments is crowdsourced from faculty members and students under the lab migration project, and is made available by FOSSEE under the Creative Commons Attribution-ShareAlike 4.0 International Licence.

My proposal for migrating the basic electronics lab to eSim is under review. There was good technical support from the eSim team in solving various experimental issues. The user’s guide for carrying out the experiments proposed under this project is published here, under the Creative Commons Attribution-ShareAlike 4.0 India Licence. The guide provides solutions to specific simulation problems using eSim, with experimental procedures explained through screenshots.

Have a look and propose suggestions. If you have ideas for improving the content, feel free to contribute. Git repository of the user guide: https://github.com/kavyamanohar/e-design-simulation-guide


Leap Second (അധിക നിമിഷം)

The coming June 30 has something special about it: that day will be 24 hours and one second long. The extra second is called a leap second (അധിക നിമിഷം). You may not see it on the wristwatches or wall clocks we ordinarily use; and anyway, what is a single second worth to us? But this extra second cannot be dismissed so easily. Since it is very likely to cause problems in computers and devices that need second-level precision, technologists around the world are prepared to face the moment June 30 at 23:59:60, the instant that is past the last ordinary second of June 30 but is not yet July 1.

Where does this extra second come from? Very briefly, the adjustment is needed because the speed of the Earth's rotation is not constant over time. Movements of the Earth's crust, such as earthquakes, are a major cause of the slowing of the rotation. Our everyday time divides one rotation of the Earth into 24 hours, each hour into 60 minutes, and each minute into 60 seconds; this can be called astronomical time. But the precise definition of the second is not based on these divisions. The scientific and official definition of the second is the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of a caesium-133 atom.

All the clocks in the world keep time according to the Coordinated Universal Time (UTC) standard. Time in the various time zones, as well as timekeeping in computers, is reckoned based on it. UTC is the timekeeping system accepted by the scientific world, based on Greenwich Mean Time. The Indian time zone is written as UTC+5:30, meaning five and a half hours ahead of Greenwich time. Since 1972, UTC has followed International Atomic Time, which is based on the radiation of the caesium atom.

Let me clarify that the extra second on June 30 mentioned at the beginning is in UTC. In India, the moment will actually be 5:30 in the morning on July 1.

The everyday notion of time is based on the cycle of day and night. Since UTC is also meant for everyday use, these seconds are inserted now and then so that it keeps the precision of atomic time while staying in step with the Earth's rotation. The adjustment on June 30, 2015 will be the 26th of its kind. The last leap second was on June 30, 2012.

To be precise: on June 30, after 23:59:59, instead of rolling over to July 1, 00:00:00, the clock will read June 30, 23:59:60 for one second. Only after that does July begin.

The leap second causes trouble in several ways. In computers, the linear sequencing of all kinds of operations is based on timestamps. The operating system generates these ticks and serves them to the applications running on top of it. Needless to say, the count of ticks is used to compute minutes, hours, days, and so on. Starting from the confusion of whether 23:59:60 belongs to June 30 or July 1, there is no telling what kinds of problems this can cause. The Linux kernel had a mechanism to handle it, but during the 2012 leap second that mechanism did not work properly. The New York Stock Exchange has already announced that it will halt operations for about an hour on June 30.

Big websites have already prepared to face the leap second. Wikipedia will temporarily stop synchronizing its servers with UTC and run them on their hardware clocks; after the leap second has passed, the servers will be synchronized back to UTC in stages. Google uses a different approach: it slightly stretches the seconds near the leap second, and the extra milliseconds added to all those seconds together amount to one full second, while avoiding the arrival of a separate new second.
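The smearing idea attributed to Google above can be sketched as follows (illustrative Python; the window length is my assumption, not Google's actual parameter):

```python
# Spread one extra second linearly over a fixed window ending at the leap
# second, so every "second" in the window is slightly longer and no clock
# ever shows 23:59:60.
WINDOW = 20 * 60  # smear window in seconds (assumed value, for illustration)

def extra_seconds_absorbed(t):
    """How much of the leap second has been absorbed t seconds into the
    window (grows linearly from 0 to 1)."""
    return min(max(t, 0), WINDOW) / WINDOW

assert extra_seconds_absorbed(0) == 0.0
assert extra_seconds_absorbed(WINDOW) == 1.0   # full second absorbed
assert abs(extra_seconds_absorbed(WINDOW // 2) - 0.5) < 1e-9
```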

Discussions have also begun on how to avoid this headache altogether. Even if we account for leap seconds on Earth, our space observations still need the astronomical clock. A leap second can only be scheduled about six months in advance. The International Earth Rotation and Reference Systems Service (IERS) decides when a leap second is needed.

Further reading: https://en.wikipedia.org/wiki/Leap_second

Translating HTML content using a plain text supporting machine translation engine

At Wikimedia, I am currently working on the ContentTranslation tool, a machine-aided translation system to help translate articles from one language to another. The tool is deployed in several Wikipedias now, and people are successfully creating new articles with it.

ContentTranslation provides machine translation as one of its translation tools, so that editors can use it as an initial version to improve upon. We use Apertium as the machine translation backend and plan to support more machine translation services soon.

A big difference when editing with ContentTranslation is that it does not involve wiki markup. Instead, editors edit rich text: basically, contenteditable HTML elements. This also means that what you translate are the HTML sections of articles.

The HTML contains all the markup a typical Wikipedia article has, which means the machine translation must operate on HTML content. But not all MT engines support HTML content.

Some MT engines, such as Moses, output subsentence alignment information directly, showing which source words correspond to which target words.

$ echo 'das ist ein kleines haus' | moses -f phrase-model/moses.ini -t
this is |0-1| a |2-2| small |3-3| house |4-4|

The Apertium MT engine does not translate formatted text faithfully. Markup such as HTML tags is treated as a form of blank space. This can lead to semantic changes (if words are reordered), or syntactic errors (if mappings are not one-to-one).

$ echo 'legal <b>persons</b>' | apertium en-es -f html
Personas <b>legales</b>
$ echo 'I <b>am</b> David' | apertium en-es -f html
Soy</b> David 

Other MT engines exhibit similar problems. This makes it challenging to provide machine translations of formatted text. This blog post explains how this challenge is tackled in ContentTranslation.

As we saw in the examples above, a machine translation engine can introduce the following errors in the translated HTML, listed in descending order of severity:

  1. Corrupt markup – If the machine translation engine is unaware of the HTML structure, it can move the HTML tags around randomly, corrupting the markup in the MT result.
  2. Wrongly placed annotations – The two examples given above illustrate this. It is more severe if the content includes links and the link targets are swapped or placed randomly in the MT output.
  3. Missing annotations – Sometimes the MT engine may eat up some tags in the translation process.
  4. Split annotations – During translation, a single word can be translated to more than one word. If the source word has markup, say an <a> tag, will the MT engine apply the <a> tag around both words, or to each word separately?

All of the above issues make for a bad experience for translators.

Apart from potential issues with markup transfer, there is another aspect to sending HTML content to MT engines. Compared to the plain text version of a paragraph, the HTML version is bigger in size (bytes). Most of this extra content is tags and attributes, which should be unaffected by the translation; sending them is unnecessary bandwidth usage. And if the MT engine is metered (non-free, with API access measured and limited), we are not being economical.

An outline of the algorithm we use to transfer markup from the source content to the translated content is given below:

  1. The input HTML content is translated into a LinearDoc, with inline markup (such as bold and links) stored as attributes on a linear array of text chunks. This linearized format is convenient for important text manipulation operations, such as reordering and slicing, which are challenging to perform on an HTML string or a DOM tree.
  2. Plain text sentences (with all inline markup stripped away) are sent to the MT engine for translation.
  3. The MT engine returns a plain text translation, together with subsentence alignment information (saying which parts of the source text correspond to which parts of the translated text).
  4. The alignment information is used to reapply markup to the translated text.

This makes sure that MT engines translate only plain text, with markup applied as a post-MT processing step.
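The pipeline can be sketched as follows (illustrative Python; the chunk representation and function names are mine, not the actual LinearDoc API):

```python
# A linearized paragraph: text chunks with their inline annotations kept
# as attributes, so plain text can be extracted for the MT engine.
chunks = [
    {"text": "Es ", "tags": []},
    {"text": "además", "tags": ["s"]},
    {"text": " de Valencia.", "tags": []},
]

def plain_text(chunks):
    """Step 2: strip all inline markup before sending to the MT engine."""
    return "".join(c["text"] for c in chunks)

def reapply(translated, spans):
    """Step 4: wrap aligned ranges of the translated plain text in their
    tags. spans = (start, length, tag); assumed non-overlapping here, so
    applying right-to-left keeps earlier indices valid. Nested
    annotations need extra care."""
    for start, length, tag in sorted(spans, reverse=True):
        inner = translated[start:start + length]
        translated = (translated[:start] + "<%s>%s</%s>" % (tag, inner, tag)
                      + translated[start + length:])
    return translated

assert plain_text(chunks) == "Es además de Valencia."
# Hypothetical alignment: the annotated chunk maps to characters 3..7.
assert reapply("És a més de València.", [(3, 5, "s")]) \
    == "És <s>a més</s> de València."
```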

For MT engines that do not return alignment information, the algorithm essentially does a fuzzy match to find the target locations in the translated text at which to apply the annotations. Here too, the content given to the MT engine is plain text only.

The steps are given below.

  1. For the text to translate, find the text of inline annotations like bold, italics, links etc. We call these subsequences.
  2. Pass the full text and the subsequences to the plain text machine translation engine, using a delimiter so that we can map between the source items (full text and subsequences) and the translated items.
  3. The translated full text will contain the translated subsequences somewhere within it. To locate a subsequence translation inside the full text translation, use an approximate search algorithm.
  4. The approximate search algorithm returns the start position and length of the match. To that range we map the annotation from the source HTML.
  5. The approximate match involves calculating the edit distance between words in the translated full text and the translated subsequence. It is not whole strings that are searched, but n-grams with n = the number of words in the subsequence. Each word in the n-gram is matched independently.
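Steps 3 to 5 can be sketched in Python like this (my own illustration, not the actual cxserver code): slide an n-word window over the translated full text and score each candidate word-by-word with edit distance.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def find_subsequence(full_words, sub_words, max_dist_per_word=3):
    """Return (score, start word index) of the best n-gram match, or None.
    The distance threshold is an illustrative, language-tunable value."""
    n = len(sub_words)
    best = None
    for i in range(len(full_words) - n + 1):
        window = full_words[i:i + n]
        score = sum(edit_distance(s.lower(), w.lower())
                    for s, w in zip(sub_words, window))
        if score <= max_dist_per_word * n and (best is None or score < best[0]):
            best = (score, i)
    return best

# "Moderno" matches the inflected "moderna" (edit distance 1) at index 3.
assert find_subsequence(["Una", "Gran", "Bretaña", "moderna"],
                        ["Moderno"]) == (1, 3)
```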

To understand this, let us try the algorithm in some example sentences.

  1. Translating the Spanish sentence <p>Es <s>además</s> de Valencia.</p> to Catalan: The plain text version is Es además de Valencia. and the annotated subsequence is además. We give both the full text and the subsequence to MT. The full text translation is És a més de València. and the word además is translated as a més. We search for a més in the full text translation. The search is successful, and the <s> tag is applied, resulting in <p>És <s>a més</s> de València.</p>. The search performed in this example is a plain text exact search, but the following examples illustrate why it cannot always be an exact search.
  2. Translating the English sentence <p>A <b>Japanese</b> <i>BBC</i> article</p> to Spanish: The full text translation is Un artículo de BBC japonés. One of the subsequences, Japanese, gets translated as Japonés. The case of the initial letter differs, and the search should be smart enough to identify japonés as a match for Japonés. The word order difference between the source text and the translation is already handled by the algorithm. The following example illustrates that more than case changes can happen.
  3. Translating <p>A <b>modern</b> Britain.</p> to Spanish: The plain text version gets translated as Una Gran Bretaña moderna. and the annotated word modern gets translated as Moderno. We need to match moderna with Moderno. We get <p>Una Gran Bretaña <b>moderna</b>.</p>. This is a case of word inflection: a single letter at the end of the word changes.
  4. Now let us see an example where the subsequence is more than one word, and where subsequences are nested. Translating the English sentence <p>The <b>big <i>red</i></b> dog</p> to Spanish: Here the subsequence big red is in bold, and inside it, red is in italics. In this case we need to translate the full text, the subsequence big red, and red. So we have El perro rojo grande as the full translation, and Rojo grande and Rojo as the translations of the subsequences. Rojo grande needs to be located first and the bold tag applied; then we search for Rojo and apply the italics. We get <p>El perro <b><i>rojo</i> grande</b></p>.
  5. How does it work with heavily inflected languages like Malayalam? Suppose we translate <p>I am from <a href=”x”>Kerala</a></p> to Malayalam. The plain text translation is ഞാന്‍ കേരളത്തില്‍ നിന്നാണു്, and the subsequence Kerala gets translated to കേരളം. So we need to match കേരളം with കേരളത്തില്‍. They differ by an edit distance of 7, with the changes at the end of the word. This shows that we will require language-specific tailoring to get reasonable output.

The approximate string match can use simple Levenshtein distance, but what is an acceptable edit distance? That must be configurable per language module. And the following example illustrates that edit-distance-based matching alone won’t work.

Translating <p>Los Budistas no <b>comer</b> carne</p> to English: The plain text translation is The Buddhists not eating meat, and comer translates as eat. With an edit-distance approach, eat matches meat (distance 1) more closely than eating (distance 3). To address such cases, we mix in a second criterion: the words should start with the same letter. This also illustrates that the algorithm needs language-specific modules.
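The combined criterion can be sketched like this (illustrative Python; the penalty value is my own choice, not a value from the real implementation):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_match(word, candidates, penalty=3):
    """Pick the candidate closest to `word`, penalizing candidates that
    do not share its first letter (penalty value is illustrative)."""
    def score(c):
        d = edit_distance(word.lower(), c.lower())
        if c[:1].lower() != word[:1].lower():
            d += penalty
        return d
    return min(candidates, key=score)

# "eat" is closer to "meat" (distance 1) than to "eating" (distance 3),
# but the first-letter criterion makes "eating" win.
assert best_match("eat", ["The", "Buddhists", "not", "eating", "meat"]) \
    == "eating"
```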

Still, there are cases that cannot be solved by the algorithm mentioned above. Consider the following example.

Translating <p>Bees <b>cannot</b> swim</p>: The plain text translation to Spanish is Las Abejas no pueden nadar, and the phrase cannot translates as Puede no. Here we need to match Puede no with no pueden, which of course won’t match with the approach explained so far.

To address this case, we do not treat the subsequence as a single string, but as an n-gram where n = the number of words in the subsequence. The fuzzy matching is done per word in the n-gram, not on the entire string. That is, Puede is fuzzy-matched against no and pueden, and no is fuzzy-matched against no and pueden, left to right, until a match is found. This takes care of word order changes as well as inflections.

Revisiting the four types of errors that can happen in annotation transfer: with the algorithm explained so far, in the worst case we will miss annotations, but there is no case of corrupted markup.

As ContentTranslation adds more language support, language-specific customization of the above approach will be required.

You can see the algorithm in action in the video linked above. And here is a screenshot:

Translation of a paragraph from the Palak Paneer article of the Spanish Wikipedia to Catalan. Note the links, bold, etc. applied in the correct positions in the translation on the right side.

If anybody is interested in the code, see https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/mt – it is a JavaScript module in a Node.js server that powers ContentTranslation.

Credits: David Chan, my colleague at Wikimedia, for extensive help in providing many example sentences of varying complexity to fine-tune the algorithm. The LinearDoc model that makes the whole algorithm work was written by him. David also wrote an algorithm to handle HTML translation using an upper-casing approach, which you can read about here; the approximation-based algorithm explained above replaced it.