The time when none of us are ourselves

The name on my PAN card is not the one in my passport. The house name on my voter's ID is not the one on the PAN card. The house name, for that matter, is different on each document. Some have only an initial. Some have the initial written out in full. Some are in Malayalam. Some are in English. Some have spelling mistakes. Initials with dots. Initials without dots. തോട്ടിങ്ങൽ, തോട്ടുങ്ങൽ, തോട്ടിങ്ങല്… (all variants of Thottingal)

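My name on the Aadhaar card is a question without an answer: “സന്തോഷ് ടീ ആര്” (the initials “T R” spelled out in Malayalam, where “ആര്” also reads as “who?”). The photo on the card is no help at all in finding out who it is.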

I have not faced any major problem so far. But a few days ago I could not add my PAN card to my Provident Fund KYC documents. Apparently this was because the name in my employer's records and the name on the PAN card are not the same. That is true: the name on the PAN card uses initials, while my employer has the initials expanded. But why match names like that at all? Aren't the numbers on these documents there precisely to avoid this?

India is a strange country where we struggle to prove that we are ourselves. There is no point in going to any office without ID proofs. Even if we are going there to pay them money, they will not accept it without an ID. Only the piece of paper called an ID proof is needed; I have never seen anyone actually check the name or the photo on it. By default, the general policy is that we are not ourselves. Otherwise, we have to prove it.

Some may think that Aadhaar, the “unique(!) ID”, is the solution to all this. If so, they are wrong. Beyond the numbers, everyone should get ready for a tragic drama in which software compares Indian names.

You must have heard the happy news that a PAN card will be cancelled if it is not linked to Aadhaar. It seems this linking will not be that easy.

The income tax department has recently started accepting initials that can be put on the PAN card. However, you will be required to put your full name while applying for a verification of data. For instance, if your official name is M Ramamurthy where the full name is Madurai Ramamurthy where Ramamurthy is the first name, I-T will allow initials only for the first name. Hence, you can either print your name on the PAN card as either Ramamurthy or R Madurai.

Once you try to link PAN to your Aadhaar card which accepts initials, there is bound to be a name mismatch. This will lead to a rejection of request for linking Aadhaar card. In several parts of the country, especially the south of India, the names of villages are often suffixed or prefixed to the name of an individual. Here, any name except the first name of the person is considered as surname and will have to be mentioned in full on the PAN card.

Software is taking over this matching of names. If it is plain string comparison, most people's names will be rejected. Those who are rejected will have to go to the trouble of changing or correcting their names. Now suppose someone tries to design an algorithm to match Indian names. What should it take care of? Assume the names are written in English. If Indian languages are also considered, this becomes very complex.

  • Upper case and lower case differences should be ignored. Anand Chandran, ANAND Chandran, Anand CHANDRAN, anand chandran, ANAND CHANDRAN are all the same, after all.
  • Dots, commas and the like should be dropped. Rama C, Rama C.
  • Spaces should be ignored. Rama Chandran, Ramachandran: aren't these the same?
  • Spelling differences. Pradeep, Pradip, Pradeeb, Prathib, Pratheeb: these could all be different people, or one person.
  • First name, last name and middle name are very confusing concepts for us, especially in South Indian names. M Sudhakaran, Sudhakaran M, Sudhakaran Manoharan could all be one person.
  • Let us also remember that these algorithms can make people run from office to office because a name failed to match merely on account of one extra space.
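
The points above can be sketched in a few lines of Python. This is only a minimal illustration, assuming names written in Latin script; the function names (`normalize_name`, `may_be_same_person`) are hypothetical, and the pairing of initials with full names is a simple greedy heuristic, not a rule used by any actual government system. It does not handle spelling variants.

```python
import re


def normalize_name(name: str) -> str:
    """Ignore case, punctuation and spaces in a Latin-script name."""
    return re.sub(r"[^a-z]", "", name.casefold())


def name_tokens(name: str) -> list[str]:
    """Split a name into lower-case words, treating dots and commas as separators."""
    return re.sub(r"[^a-z]+", " ", name.casefold()).split()


def compatible(x: str, y: str) -> bool:
    """Two tokens agree if they are equal or one is the initial of the other."""
    if x == y:
        return True
    short, long_ = sorted((x, y), key=len)
    return len(short) == 1 and long_.startswith(short)


def may_be_same_person(a: str, b: str) -> bool:
    """Permissive, order-insensitive comparison of two names."""
    if normalize_name(a) == normalize_name(b):
        return True
    left, right = name_tokens(a), name_tokens(b)
    if len(left) != len(right):
        return False
    remaining = list(right)
    # Greedy pairing, longest tokens first, so that full names are
    # matched before initials get a chance to claim them.
    for tok in sorted(left, key=len, reverse=True):
        candidates = sorted(remaining, key=len, reverse=True)
        match = next((r for r in candidates if compatible(tok, r)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True
```

With this sketch, `may_be_same_person("M Sudhakaran", "Sudhakaran Manoharan")` and `may_be_same_person("Rama Chandran", "Ramachandran")` both return `True`. A match here only means "could be the same person": a permissive comparison like this trades false rejections for false matches.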

Such algorithms do exist for English. Soundex, Metaphone and the New York State Identification and Intelligence System (NYSIIS) are examples. Don't we also need a similar algorithm that takes the peculiarities of our names into account?
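
To show what such a phonetic algorithm does, here is a compact sketch of the classic American Soundex code in Python. It is written from the commonly published rules (keep the first letter, map consonants to digits, collapse repeated codes, pad to four characters), not taken from any particular library, and it is designed for English names, so it only approximates Indian spelling habits.

```python
# Consonant groups of American Soundex and the digit assigned to each.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Return the four-character Soundex code of an ASCII name."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate equal codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels have no code; they reset prev so that a repeated
        # consonant code after a vowel is kept.
        prev = code
    return (result + "000")[:4]
```

Under this sketch Pradeep, Pradip, Pradeeb, Prathib and Pratheeb all map to the same code, `P631`, which is exactly the kind of grouping the spelling-variation problem calls for.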

English alone is not enough; don't we need this for Indian languages too? Something like the Soundex algorithm mentioned above was developed for Indian languages under the name Indic Soundex. It can match names against each other even when they are written in different Indian languages. For example, it can tell that സന്തോഷ് and सन्तोष are the same name.

Internationalized Top Level Domain Names in Indian Languages

Medianama recently published a news report, “ICANN approves Kannada, Malayalam, Assamese & Oriya domain names”, which says:

ICANN (Internet Corporation for Assigned Names and Numbers) has approved four additional proposed Indic TLDs (top level domain names), in Malayalam, Kannada, Assamese and Oriya languages. The TLDs are yet to be delegated to NIXI (National Internet exchange of India). While Malayalam, Kannada and Oriya will use their own scripts, Assamese TLDs will use the Bengali script.

The news title says “domain names” and the report talks about TLDs. For many people, a domain name is simply something like “google.com” or “amazon.in”. So people may misinterpret the news report as approval for domain names like “കേരളസർവ്വകലാശാല.ഭാരതം”. Many people asked me if that is the case. We are going to have such domain names in the future, but not yet.

I will try to explain the concept of TLD and IDN and the current status in this post.

The Internet Corporation for Assigned Names and Numbers (ICANN) is a non-profit organization which takes care of the whole internet domain name system and registration process. It achieves this with the help of a number of domain processes, policies and domain registrars. In India, NIXI owns the .in registration process.

A domain name is a string used to identify a member of a network, based on a well-defined Domain Name System (DNS). So “google.com”, “thottingal.in” etc. are domain names. The dots in a domain name indicate the hierarchy, read from right to left. In the domain name “thottingal.in”, “.in” indicates a top level or root in naming, and under that there is “thottingal”. If there is “blog.thottingal.in”, “blog” is a subdomain under “thottingal.in”, and so on.
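
The right-to-left hierarchy can be made concrete with a small Python sketch. The function name `domain_hierarchy` is made up for this illustration; it only splits the string on dots and does not consult DNS.

```python
def domain_hierarchy(domain: str) -> list[str]:
    """List the levels of a domain name, starting from the top level domain."""
    labels = domain.strip(".").split(".")
    # Walk the labels from right to left, growing the name one level at a time.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


print(domain_hierarchy("blog.thottingal.in"))
# ['in', 'thottingal.in', 'blog.thottingal.in']
```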

The top level domains are familiar to us. “.org”, “.com”, “.in”, “.uk”, “.gov” are all examples. Out of these “.com”, “.org” and “.gov” are generic top level domains. “.in” and “.uk” are country code top level domains, often abbreviated as ccTLD.  “.in” is obviously for India.

In November 2009, ICANN decided to allow these domain name strings in the scripts used in the respective countries. So it should be possible to represent “.in” in Indian languages too. These are called Internationalized country code Top Level Domain names, abbreviated as IDN ccTLD.

ICANN also defined a fast track process for defining these domains and delegating them to registrars so that website owners can register such domain names. The actual policy document on this is available at the ICANN website[pdf], but in short, the steps are (1) preparation, (2) string validation and approval, (3) delegation to registrars.

So far, the following languages have finished all three steps, in 2014:

  1. Hindi:  .भारत
  2. Urdu: بھارت
  3. Telugu: .భారత్
  4. Gujarati: .ભારત
  5. Punjabi: .ਭਾਰਤ
  6. Bengali: .ভারত
  7. Tamil: .இந்தியா

What this means is that NIXI owns these TLDs and can assign domains to website owners. But as far as I know, NIXI is yet to start doing that.

The following languages have just received approval for the second step, string validation. ICANN announced this on April 13, 2016. String validation means that requests are evaluated in accordance with the technical and linguistic requirements for the IDN ccTLD string(s) criteria. IDN ccTLD requesters must fulfill a number of requirements:

  • The script used to represent the IDN ccTLDs must be non-Latin;
  • The languages used to express the IDN ccTLDs must be official in the corresponding country or territory; and
  • A specific set of technical requirements must be met.

The languages that have now passed the second stage are:

  1. Kannada: .ಭಾರತ
  2. Malayalam: .ഭാരതം
  3. Assamese: .ভাৰত
  4. Oriya: .ଭାରତ

As a next step, these languages need delegation, with NIXI as the registrar. So in short, nothing is ready yet for people who want to register domain names with the above TLDs.

We were talking about TLDs, top level domain names. Why is there a delay in allowing people to register domains once we have a TLD? It is not easy. Domain names are unique identifiers, and there should be well-defined rules to validate a domain and allow its registration. The domain should be a valid string based on the linguistic characteristics of the language. There should be a de-duplication process: nobody should be allowed to take a domain that is already registered. You may think that this is trivial string comparison, but it is in fact very complex. There are visually similar characters in these scripts, there are rules about how a consonant-vowel combination can appear, and there are canonically equivalent letters. There are security issues[pdf] to consider.
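
Canonical equivalence is the easiest of these problems to demonstrate. The Python sketch below uses only the standard library; the helper names `canonical_key` and `to_ascii_label` are hypothetical, and real IDN registration applies many more checks (IDNA rules, script-specific label generation rules, confusable detection) than this.

```python
import unicodedata


def canonical_key(label: str) -> str:
    """Normalize a label to NFC so canonically equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", label)


def to_ascii_label(label: str) -> str:
    """Encode a non-ASCII label in the 'xn--' Punycode form used in DNS."""
    return "xn--" + canonical_key(label).encode("punycode").decode("ascii")


# The Malayalam syllable "ko" can be typed with one vowel sign (U+0D4A)
# or with two (U+0D46 followed by U+0D3E). They look identical.
one_sign = "\u0d15\u0d4a"
two_signs = "\u0d15\u0d46\u0d3e"

print(one_sign == two_signs)                                # False
print(canonical_key(one_sign) == canonical_key(two_signs))  # True
print(to_ascii_label("ഭാരതം"))
```

A plain string comparison treats the two spellings as different domains; comparing normalized forms does not. Visually similar but non-equivalent characters are a separate problem that normalization does not solve.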

Before allowing domain names, the IDN policy for each script needs to be defined and approved. You can see a sample here: Draft IDN Policy for Tamil[PDF]. CDAC initially attempted to define these rules, but the attempt was controversial and did not proceed far. I had reviewed the Malayalam policy in 2010 and participated in the discussion meetings based on a critique we prepared.

ICANN has created Generation Panels to develop Root Zone Label Generation Rules (LGR), with specific reference to Neo-Brahmi scripts. I am a member of this panel as a volunteer. Once the rules are defined, registration will start, but I don’t know exactly when it will happen. The Khmer Generation Panel has completed its proposal for the Root Zone LGR. The proposal has been released for public comments.

New handwriting style font for Malayalam: Chilanka

A new handwriting style font for Malayalam is in development. The font is named “Chilanka” (ചിലങ്ക).

This is an alpha version release. Following is a sample rendering.

More samples here.

You may try the font using this editable page, which has the font embedded: http://smc.org.in/downloads/fonts/chilanka/tests/

Download the latest version: http://smc.org.in/downloads/fonts/chilanka/Chilanka.ttf

Chilanka/ചിലങ്ക is a musical anklet.

A brief note on the workflow I used for font development is as follows:

  1. Prepare a template SVG in Inkscape that has all guidelines and the grid set up.
  2. Draw the glyphs. This is the hardest part. For this font, I used the Bezier tool of Inkscape. Only the stroke is saved in the SVG. I did not prepare the outline in Inkscape; this helped me rework the drawings several times easily. To visualize what the stroke will look like in the outlined version, I set the stroke width to 130, with rounded end points. All SVGs are version tracked. SVGs are saved as Inkscape SVGs so that I can retain my guidelines and grids.
  3. In FontForge, import these SVGs and create the outline using expand stroke, with stroke width 130, stroke height 130, pen angle 45 degrees, and line cap and line join as round.
  4. Simplify the glyph automatically and manually to reduce the impact of the conversion from cubic Bezier to quadratic Bezier.
  5. Metrics tuning. Set both left and right bearings to 100 units (in general; there is glyph-specific tuning).
  6. The opentype tables are the complex part. But for this font, it did not take much time since I used SMC’s already existing well maintained feature tables. I could just focus on design part.
  7. Test using test scripts

Some more details:

  • Design: Santhosh Thottingal
  • Technology: Santhosh Thottingal and Kavya Manohar
  • Total number of glyphs: 676. Includes basic latin glyphs.
  • Project started on September 15, 2014
  • Number of svgs prepared: 271
  • Em size: 2048. Ascend: 1434. Descend: 614
  • 242 commits so far.
  • Latest version: 1.0.0-alpha.20141027
  • All drawings are in inkscape. No paper involved, no tracing.

Thanks to all my friends who are helping me with testing, and for their encouragement.
Stay tuned for first version announcement 🙂

(Cross posted from http://blog.smc.org.in/new-handwriting-style-font-for-malayalam-chilanka/ )

GSOC 2014 – Mentoring for SMC

I am a mentor for Google Summer of Code 2014 for SMC. I will be helping Praveen Sridhar to port input methods from jquery.ime to the Firefox OS.

We started the project and Praveen already has a proof of concept ready.

Tim Chien and Rudy Lu from Mozilla are co-mentoring the same project.

Meera Tamil font in Ubuntu Trusty Tahr

Ubuntu Trusty Tahr is going to be released on April 17th 2014.

Meera Tamil font, a freely licensed Unicode font for Tamil, will be available in this release.

The font is already available in Debian. In both Ubuntu and Debian you can install the font by

sudo apt-get install fonts-meera-taml

Thanks Vasudev for packaging it for Debian.

Collaboratively edited documentation for Indic font developers

Fonts are one of the integral building blocks for providing multilingual support for digital content. In current times, OpenType fonts are the choice. With the increasing need for supporting languages beyond the Latin script, the TrueType font specification was extended to include elements for the more elaborate writing systems that exist. This effort was jointly undertaken in the 1990s by Microsoft and Adobe. The outcome of this effort was the OpenType Specification, a successor to the TrueType font specification.

The Devanagari ddhrya-ligature, as displayed in the JanaSanskritSans font.

Fonts for Indic languages had traditionally been created for the printing industry. The TrueType specification provided the baseline for the digital fonts that were largely used in desktop publishing. These fonts however suffered from inconsistencies arising from technical shortcomings like non-uniform character codes. These shortcomings made the fonts highly unreliable for digital content and their use across platforms. The problems with character codes were largely alleviated with the gradual standardization through modification and adoption of Unicode character codes. The OpenType Specification additionally extended the styling and behavior for the typography.

The availability of the specification eased the process of creating Indic language fonts with consistent typographic behaviour as per the script’s requirement. However, disconnects between the styling and technical implementation hampered the font creation process. Several well-stylized fonts were upgraded to the new specification through complicated adjustments, which at times compromised on their aesthetic quality. On the other hand, the technical adoption of the specification details was a comparatively new know-how for the font designers. To strike a balance, an initiative was undertaken by a group of font developers and designers to document the knowledge acquired from hands-on experience for the benefit of upcoming developers and designers in this field.

Glyphs inside Meera font

The outcome of the project will be an elaborate, illustrated guideline for font designers. A chapter will be dedicated to each of the Indic scripts – Bengali, Devanagari, Gujarati, Kannada, Malayalam, Odia, Punjabi, Tamil and Telugu. The guidelines will outline the technical representation of the canonical aspects of these complex scripts. This is especially important when designing for complex scripts where the shape or positioning of a character depends on its relation to other characters.

This project is open for participation and contributors can commit directly on the project repository.

Hyphenation in web

This is a follow-up to a four-year-old blog post about hyphenation. Hyphenation allows the controlled splitting of words to improve the layout of paragraphs, typically splitting words at syllabic or morphemic boundaries and visually indicating the split (usually with a hyphen).

I wrote about how a webpage can use the Hyphenator JavaScript library to achieve hyphenation for text with the ‘justify’ style. Along with the hyphenation rules I wrote for many Indian languages, this solution works and some websites already use it. The Hyphenator library helps to insert soft hyphens at appropriate positions inside the text.
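
The effect of inserting soft hyphens can be sketched in Python. This is not how the Hyphenator library is implemented; the function `insert_soft_hyphens` and the break positions passed to it are assumptions for illustration. A real hyphenator derives the break positions from language-specific patterns.

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless the browser breaks the line here


def insert_soft_hyphens(word: str, break_positions: list[int]) -> str:
    """Insert a soft hyphen before each given character index of the word."""
    pieces = []
    last = 0
    for pos in sorted(set(break_positions)):
        # Ignore positions at or beyond the edges of the word.
        if 0 < pos < len(word):
            pieces.append(word[last:pos])
            last = pos
    pieces.append(word[last:])
    return SOFT_HYPHEN.join(pieces)


hyphenated = insert_soft_hyphens("hyphenation", [2, 6])
print(hyphenated.replace(SOFT_HYPHEN, "|"))  # hy|phen|ation
```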

Example showing the difference between Malayalam text hyphenated and not hyphenated. You can see a lot of line space wasted with white space in the non-hyphenated text.

More recently browsers such as Firefox, Safari and Chrome have begun to support the CSS3 hyphens property, with hyphenation dictionaries for a range of languages, to support automatic hyphenation.

For hyphenation to work correctly, the text must be marked up with language information, using the language tags described earlier. This is because hyphenation rules vary by language, not by script. The description of the hyphens property in CSS says “Correct automatic hyphenation requires a hyphenation resource appropriate to the language of the text being broken. The user agents is therefore only required to automatically hyphenate text for which the author has declared a language (e.g. via HTML lang or XML xml:lang) and for which it has an appropriate hyphenation resource.”

CSS Example

-webkit-hyphens: auto;
-moz-hyphens: auto;
-ms-hyphens: auto;
-o-hyphens: auto;
hyphens: auto;

Browser Compatibility

  • Chrome 13+ with -webkit prefix
  • Firefox 6.0+ with -moz prefix
  • IE 10+ with -ms prefix.

Hyphenation rules

CSS Text Level 3 does not define the exact rules for hyphenation, however user agents are strongly encouraged to optimize their line-breaking implementation to choose good break points and appropriate hyphenation points.

Firefox has hyphenation rules for about 40 languages. A complete list of languages supported in FF and IE is available at the Mozilla wiki.

You can see that none of the Indian languages are listed there. Hyphenation rules can be reused from the TeX hyphenation rules. Jonathan Kew was importing the hyphenation rules from TeX, and I had requested importing the hyphenation rules for Indian languages too. But that was more than a year ago, and there has not been much progress. Apparently there was a licensing issue with derived work, but it looks like it is already resolved.

CSS4 Text

While this is all well and good, it doesn’t provide the fine-grained control you may require to get professional results. For this, CSS4 Text introduces more features.

  • Limiting the number of hyphens in a row using hyphenate-limit-lines. This property is currently supported by IE10 and Safari, using the -ms- and -webkit- prefix respectively.
  • Limiting the word length, and number of characters before and after the hyphen using hyphenate-limit-chars
  • Setting the hyphenation character using hyphenate-character. Helps to override the default soft hyphen character

More reading

PS: Sometimes hyphenation can be very challenging. For example hyphenating the 746 letter long name of Wolfe+585, Senior.