Internationalized Top Level Domain Names in Indian Languages

Medianama recently published a news report, “ICANN approves Kannada, Malayalam, Assamese & Oriya domain names”, which says:

ICANN (Internet Corporation for Assigned Names and Numbers) has approved four additional proposed Indic TLDs (top level domain names), in Malayalam, Kannada, Assamese and Oriya languages. The TLDs are yet to be delegated to NIXI (National Internet exchange of India). While Malayalam, Kannada and Oriya will use their own scripts, Assamese TLDs will use the Bengali script.

The title says “domain names”, while the report talks about TLDs. For many people, a domain name is simply something like “google.com” or “amazon.in”. So they may misread the report as approval for domain names like “കേരളസർവ്വകലാശാല.ഭാരതം”. Many people asked me whether that is the case. We are going to have such domain names in the future, but not yet.

I will try to explain the concept of TLD and IDN and the current status in this post.

The Internet Corporation for Assigned Names and Numbers (ICANN) is a non-profit organization that takes care of the whole internet domain name system and its registration process. It does this through a set of processes and policies, and through domain registrars. In India, NIXI owns the .in registration process.

A domain name is a string used to identify a member of a network, based on the well-defined Domain Name System (DNS). So “google.com”, “thottingal.in” etc. are domain names. The dots in a domain name indicate the hierarchy, read from right to left. In the domain name “thottingal.in”, “.in” is the top level, or root, of the naming, and “thottingal” sits under it. In “blog.thottingal.in”, “blog” is a subdomain under “thottingal.in”, and so on.
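The right-to-left hierarchy can be illustrated with a few lines of Python. This is only a sketch that splits on dots; the function names are made up for this post and nothing here talks to the real DNS.

```python
def domain_hierarchy(name: str) -> list[str]:
    """Return the labels of a domain name, from the top level down."""
    labels = name.rstrip(".").split(".")
    return list(reversed(labels))


def parent_domains(name: str) -> list[str]:
    """List every enclosing domain, starting from the top-level domain."""
    labels = name.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


print(domain_hierarchy("blog.thottingal.in"))  # ['in', 'thottingal', 'blog']
print(parent_domains("blog.thottingal.in"))    # ['in', 'thottingal.in', 'blog.thottingal.in']
```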

The top level domains are familiar to us. “.org”, “.com”, “.in”, “.uk”, “.gov” are all examples. Out of these “.com”, “.org” and “.gov” are generic top level domains. “.in” and “.uk” are country code top level domains, often abbreviated as ccTLD.  “.in” is obviously for India.

In November 2009, ICANN decided to allow these country code strings in the scripts used in the respective countries. So it should be possible to represent “.in” in Indian languages too. These are called Internationalized country code Top Level Domain names, abbreviated as IDN ccTLD.

ICANN also defined a fast track process for defining these domains and delegating them to registrars, so that website owners can register such domain names. The actual policy document is available at the ICANN website[pdf], but in short, the steps are (1) preparation, (2) string validation and approval, (3) delegation to registrars.

So far, the following languages have finished all 3 steps, in 2014:

  1. Hindi:  .भारत
  2. Urdu: بھارت
  3. Telugu: .భారత్
  4. Gujarati: .ભારત
  5. Punjabi: .ਭਾਰਤ
  6. Bengali: .ভারত
  7. Tamil: .இந்தியா

What this means is that NIXI owns these TLDs and can assign domains to website owners. But as far as I know, NIXI is yet to start that.

The following languages have just got approval for the second step, string validation. ICANN announced this on April 13, 2016. String validation means that requests are evaluated in accordance with the technical and linguistic requirements for the IDN ccTLD string(s) criteria. IDN ccTLD requesters must fulfill a number of requirements:

  • The script used to represent the IDN ccTLDs must be non-Latin;
  • The languages used to express the IDN ccTLDs must be official in the corresponding country or territory; and
  • A specific set of technical requirements must be met.

The languages that have now passed the second stage are:

  1. Kannada: .ಭಾರತ
  2. Malayalam: .ഭാരതം
  3. Assamese: .ভাৰত
  4. Oriya: .ଭାରତ

As the next step, these TLDs need delegation, with NIXI as the registrar. So in short, nothing is ready yet for people who want to register domain names under the above TLDs.

We were talking about TLDs, top level domain names. Why is there a delay in allowing people to register domains once we have a TLD? It is not easy. Domain names are unique identifiers, and there should be well-defined rules to validate and allow the registration of a domain. The domain should be a valid string based on the linguistic characteristics of the language. There should be a de-duplication process: nobody should be allowed to take a domain that is already registered. You may think that this is a trivial string comparison, but no, it is very complex. There are visually similar characters in these scripts, there are rules about how a consonant-vowel combination can appear, and there are canonically equivalent letters. There are security issues[pdf] to consider.
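To see why plain string comparison is not enough, here is a small sketch of one of the problems listed above, canonically equivalent letters. It uses Python's standard unicodedata module; the helper name canonical is my own, and real registries need far more than this one normalization step.

```python
import unicodedata

# Two encodings of the same Malayalam syllable: the vowel sign OO as a
# single code point (U+0D4B) versus its two-part canonical
# decomposition (U+0D47 U+0D3E). They render identically.
composed = "\u0d24\u0d4b"
decomposed = "\u0d24\u0d47\u0d3e"


def canonical(label: str) -> str:
    """Bring a label to Unicode Normalization Form C before comparing."""
    return unicodedata.normalize("NFC", label)


print(composed == decomposed)                        # False
print(canonical(composed) == canonical(decomposed))  # True
```

Visually similar characters and consonant-vowel ordering rules are separate problems that normalization alone does not solve.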

Before domain names can be allowed, the IDN policy for each script needs to be defined and approved. You can see a sample here: Draft IDN Policy for Tamil[PDF]. The definition of these rules was initially attempted by CDAC, but it was controversial and did not proceed far. I reviewed the Malayalam policy in 2010 and participated in the discussion meetings based on a critique we prepared.

ICANN has created Generation Panels to develop Root Zone Label Generation Rules (LGR), with specific reference to Neo-Brahmi scripts. I am a volunteer member of this panel. Once the rules are defined, registration will start, but I don’t know exactly when that will happen. The Khmer Generation Panel has completed its proposal for the Root Zone LGR, and the proposal has been released for public comments.

Identifiers In Indic Languages

Recently, while preparing a critique of the IDN Policy for the Malayalam language prepared by CDAC, I noticed that ICANN does not allow control characters in domain names. Some time back I noticed that Python 3 does not allow control characters in identifiers either. This blog post attempts to analyze the issue by looking at the Unicode and ICANN specifications about these special characters.

Apart from the regular characters of each script, Zero Width Joiner (ZWJ) and Zero Width Non-Joiner (ZWNJ) are widely used in Indic languages to control how ligatures are formed. For some samples of how they are used, refer to the Wikipedia links. Being invisible control characters, they are often removed during normalization, particularly before a string comparison or collation (sorting).

Identifiers, the strings that uniquely represent some data, often have a policy on what kind of characters they can contain. For example, an email address is an identifier that unambiguously identifies somebody’s mailbox, and it does not allow ‘space’ characters in between. Some examples of this kind of identifier are email IDs, web domain addresses, and variables in programming languages.

Gone are the days when identifiers could be represented only using English characters. Python 3.0+ allows you to name a variable in a program using words from other languages, as long as they can be represented in Unicode. For more details on this Python feature, read PEP 3131 – Supporting Non-ASCII Identifiers. Some samples: programs written in Malayalam, in Tamil, and in Hindi.

The same is the case with web addresses. With the advent of Internationalized Domain Names (IDN), which allow you to register web addresses in your own language, the English-only web address scene is changing.

But this change raises some issues in the definition of ‘identifiers’: just as with English, which characters are allowed in a domain name or a programming language identifier? Standards and specifications are being drafted on this for each language. For Internationalized Domain Names in Indian languages, CDAC is drafting the policy. For Python, PEP 3131 has the specification.

As a general rule, the Unicode standard and the standards based on Unicode do not allow Unicode control characters such as ZWJ and ZWNJ in identifiers. Based on that, the Internet Corporation for Assigned Names and Numbers (ICANN) relies on RFC 3454, which excludes a list of control characters. RFC 3454 is used as a specification when converting a Unicode encoded domain name to its Punycode version for validation.

For example, Thottingal in Malayalam, തോട്ടിങ്ങല്‍ (0D24 0D4B 0D1F 0D4D 0D1F 0D3F 0D19 0D4D 0D19 0D32 0D4D 200D), becomes xn--fwcaqax2g2d7dtadc when converted to Punycode. This conversion excludes the ZWJ at the end of the word. If I do a reverse conversion from xn--fwcaqax2g2d7dtadc to Unicode, what I get is തോട്ടിങ്ങല് (0D24 0D4B 0D1F 0D4D 0D1F 0D3F 0D19 0D4D 0D19 0D32 0D4D). Note that code point 200D, ZWJ, is removed. That means I cannot register my domain thottingal.in in Malayalam properly. You can verify this using the ICU online converter.

Now another example: Tamilnadu in Malayalam, തമിഴ്‌നാട് (0D24 0D2E 0D3F 0D34 0D4D 200C 0D28 0D3E 0D1F 0D4D), becomes xn--lwcjmx4a2de7id. When I do a reverse conversion, I get തമിഴ്നാട് (0D24 0D2E 0D3F 0D34 0D4D 0D28 0D3E 0D1F 0D4D). Now ZWNJ (200C) is missing. Try it yourself using the converter. This means one cannot register a website with Tamilnadu written in Malayalam properly. The IDN policies for Indic languages are based on these exclusion rules for ZWJ and ZWNJ.
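If you prefer to try the round trip locally instead of the online converter, the sketch below uses the "idna" codec from Python's standard library. My assumption here is that this codec, which follows the older RFC 3454 based rules, behaves like the converter I used; the helper name round_trip is made up for this post. The code points are written as escapes so that the invisible joiners are explicit.

```python
ZWNJ = "\u200c"
ZWJ = "\u200d"

# Code points copied from the two examples above.
thottingal = "\u0d24\u0d4b\u0d1f\u0d4d\u0d1f\u0d3f\u0d19\u0d4d\u0d19\u0d32\u0d4d" + ZWJ
tamilnadu = "\u0d24\u0d2e\u0d3f\u0d34\u0d4d" + ZWNJ + "\u0d28\u0d3e\u0d1f\u0d4d"


def round_trip(label: str) -> tuple[str, str]:
    """Convert a label to its ASCII (Punycode) form and back to Unicode."""
    ascii_form = label.encode("idna").decode("ascii")
    restored = ascii_form.encode("ascii").decode("idna")
    return ascii_form, restored


for word in (thottingal, tamilnadu):
    ascii_form, restored = round_trip(word)
    print(ascii_form)
    print(" ".join(f"{ord(ch):04X}" for ch in restored))
    print("joiner survived:", ZWJ in restored or ZWNJ in restored)
```

The second printed line for each word lists the code points that came back, which makes it easy to compare against the sequences given above.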

In Python 3.0+, you cannot have an identifier with ZWJ, ZWNJ, or any other control character in it. See this bug report for more details: Issue 5358.

All of the above issues arise from the assumption that ZWJ and ZWNJ are prohibited in identifiers in all cases. But that is not true. Look at Unicode Standard Annex 31, “Unicode Identifier and Pattern Syntax” (TR31). TR31 is based on Public Review 96, “Allowing Special Characters in Identifiers”.

This annex describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax. It also supplies guidelines for use of normalization with identifiers. […]

default-ignorable characters are normally excluded from Unicode identifiers. However, visible distinctions created by certain format characters (particularly the Join_Control characters) are necessary in certain languages. A blanket exclusion of these characters makes it impossible to create identifiers with the correct visual appearance for common words or phrases in those languages. Identifier systems that attempt to provide more natural representations of terms in modern, customary use should allow these characters in input and display, but limit them to contexts in which they are necessary. […]

But since the characters are invisible, it should be clearly defined where we can use them, to meet the security considerations. What if a domain is registered with 5 consecutive ZWNJs in it? It will look the same as a string with 4 ZWNJs. So TR31 defines 3 valid cases where ZWNJ and ZWJ can be used in an identifier:

  • Allow ZWNJ in breaking a cursive connection
  • Allow ZWNJ in a conjunct context (example:  തമിഴ്‌നാട് , ദൃക്‌സാക്ഷി)
  • Allow ZWJ in a conjunct context (examples:  ന + ് + zwj -> ന്‍ ,  क+  ् +  zwj -> क्‍ )

These 3 cases cover all the ZWJ and ZWNJ usage patterns in our languages.
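A very simplified version of the two conjunct-context rules can be sketched in Python. This is my own approximation, not the TR31 definition: it only checks that each joiner directly follows a virama (a character with canonical combining class 9), and it does not model the cursive-connection case or the finer conditions in TR31. The function name is hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"
VIRAMA_CLASS = 9  # Canonical_Combining_Class value of Indic viramas


def joiners_in_conjunct_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in the label directly follows a virama.

    Simplification: the cursive-connection rule and the other TR31
    conditions are not checked here.
    """
    for index, char in enumerate(label):
        if char not in (ZWJ, ZWNJ):
            continue
        if index == 0:
            return False  # a joiner cannot start a label
        if unicodedata.combining(label[index - 1]) != VIRAMA_CLASS:
            return False  # also rejects runs of repeated joiners
    return True


tamilnadu = "\u0d24\u0d2e\u0d3f\u0d34\u0d4d" + ZWNJ + "\u0d28\u0d3e\u0d1f\u0d4d"
print(joiners_in_conjunct_context(tamilnadu))                           # True
print(joiners_in_conjunct_context(tamilnadu.replace(ZWNJ, ZWNJ * 5)))   # False
```

Because a joiner is accepted only right after a virama, a string with 5 consecutive ZWNJs is rejected: the second ZWNJ follows another ZWNJ, not a virama.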

So now it is clear that the Unicode standard allows them in identifiers. In that case, there should not be a conflict between the Unicode identifier policy and the ICANN policy, or any other identifier policy such as PEP 3131. A blanket exclusion of these characters is not acceptable. So RFC 3454 should be made compatible with TR31. The IDN policy for Indic languages should be based on that newer specification and not on the existing RFC 3454. Since CDAC is responsible for the Indic domain policy, they should take responsibility for bringing about this change.

To bring about a change in PEP 3131, Baiju M and I started a wiki page explaining what changes need to be made. Read it here.

Having said that, is it desirable to have two domains, one with a valid ZWJ/ZWNJ usage and another without? Of course, they will be visually different, which avoids any possibility of spoofing. Now the question is whether those two strings represent two different words in the language.

As far as Malayalam is concerned, there are three cases here:

  1. A missing ZWNJ is considered a spelling mistake. The pair തമിഴ്‌നാട് (correct), തമിഴ്നാട് (incorrect) is an example of that. Should we allow both domains? I don’t know of any case where a missing ZWNJ forms another valid word with a different meaning.
  2. A missing ZWJ means the word is a different word with a different meaning. This is very rare. The pair വന്‍യവനിക, വന്യവനിക is often cited as an example of this, but many people argue that this is not a valid case.
  3. A missing ZWJ is never a spelling mistake, just a writing style. There are many examples of this; നന്‍മ-നന്മ is an obvious one.

So the question is whether a domain that differs from an existing registered domain only by a valid ZWJ/ZWNJ use should be allowed or not. I would suggest using the existing policy for domain comparison here, i.e., if the collation weights of the existing domain and the to-be-registered domain are the same, don’t register the new one. ZWJ and ZWNJ are characters with zero collation weight, and they are ignored in collation or string comparison.
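The suggested check can be sketched as follows. This is not a real collation implementation: as a stand-in for collation weights, it simply drops the two joiners and applies NFC normalization before comparing. The names comparison_key and conflicts are hypothetical.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def comparison_key(label: str) -> str:
    """Approximate a collation key in which the joiners carry no weight."""
    stripped = "".join(ch for ch in label if ch not in JOINERS)
    return unicodedata.normalize("NFC", stripped)


def conflicts(candidate: str, registered: set[str]) -> bool:
    """Return True if the candidate compares equal to a registered label."""
    existing = {comparison_key(name) for name in registered}
    return comparison_key(candidate) in existing


with_zwnj = "\u0d24\u0d2e\u0d3f\u0d34\u0d4d\u200c\u0d28\u0d3e\u0d1f\u0d4d"
without_zwnj = with_zwnj.replace("\u200c", "")

print(conflicts(without_zwnj, {with_zwnj}))  # True: refuse the second registration
```

With such a rule, the joiners could be allowed in the registered label for correct display, while a label that differs only by joiners could not be registered by someone else.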

PEP 3131: http://www.python.org/dev/peps/pep-3131/