Tamil Collation in GLIBC

A few months back, we started fixing the collation rules for Indian languages in the GNU C Library (glibc). Pravin Satpute prepared patches for many languages, and I prepared patches for Malayalam and Tamil. Pravin later enhanced the Tamil patch.

You can read the rules used for Malayalam collation here [PDF document]. The Tamil patch was applied upstream, but the bug is still open because there is some confusion about the results.

Before reading the discussion below, please read the discussion in the bug report: [ta_IN] Tamil collation rules are not working in other locales

Since many Tamil friends can give valuable comments on this, I am explaining my patch here. K Sethu gave some interesting comments on the patch, and I would like to hear from others as well. Since collation is a very important component of Tamil support, I feel that an open discussion and consensus should happen among language speakers, outside the bug trackers.

This is the logic currently used in the Tamil collation rules; the Malayalam collation rules follow the same logic.

  1. Consider each consonant as a pure consonant + the implicit vowel a, i.e. க = க் + அ and த = த் + அ
  2. Similarly, கா = க் + ஆ and தி = த் + இ
  3. From #1 and #2, க் < க and த் < த. We get this output, for example:
    அ
    அக்
    அகம்
    அகால
    அக்கம்
    அக்கு
    But K Sethu questions this order in his comment here. According to him,
    ( consonant1 + virama + consonant2 ) < ( consonant1 + vowel + [consonant2] )
    or, in other words, the correct sequence should be அ, அக், அக்கம், அக்கு, அகம், அகால
    But as per my patch,
    ( consonant1 + virama + consonant2 ) > ( consonant1 + vowel + [consonant2] )
    i.e., all conjuncts of consonant1 come after all consonant1 + vowel + * sequences.
    So let me try to explain this behaviour.
  4. Let us take க்த and கத:
    க்த = க் + த் + அ
    கத = க் + அ + த் + அ
    Considering the weight comparison logic (significance decreases from left to right),
    the comparison is between
    க் + த் + அ and க் + அ + த் + அ
    Since க் is common as the first weight, we remove it, leaving
    த் + அ and அ + த் + அ
    Since த் > அ,
    த் + அ > அ + த் + அ
    and thereby
    க்த > கத
    So conjuncts come after the consonant + vowel pairs, hence the result given in #3.
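
To make the two competing orders concrete, here is a minimal Python sketch. It is not the glibc implementation (the real rules live in the locale definition files); it only assumes the reasoning described above: vowels weigh less than consonants, and consonants follow the traditional Tamil order. The names syllables, phonemic_key and syllable_key are made up for this illustration. phonemic_key compares pure consonant and vowel weights one by one, as in #4, while syllable_key treats each pure consonant or consonant+vowel syllable as one unit, which gives the order K Sethu expects.

```python
# Illustrative sketch only; the real rules live in the glibc locale files.
PULLI = "\u0bcd"  # Tamil virama (pulli)
# The 12 independent vowels, அ ... ஔ, written as escapes to avoid ambiguity.
VOWELS = "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
# Dependent vowel signs for ஆ ... ஔ, in the same order as VOWELS[1:].
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"
# The 18 consonants in traditional order (Grantha letters are left out).
CONSONANTS = "கஙசஞடணதநபமயரலவழளறன"

VOWEL_RANK = {v: i for i, v in enumerate(VOWELS)}
SIGN_RANK = {s: i + 1 for i, s in enumerate(VOWEL_SIGNS)}
CONSONANT_RANK = {c: i for i, c in enumerate(CONSONANTS)}


def syllables(word):
    """Split a word into (consonant, vowel) rank pairs; either part may be None."""
    units, i = [], 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else None
        if ch in VOWEL_RANK:
            units.append((None, VOWEL_RANK[ch]))
            i += 1
        elif ch in CONSONANT_RANK:
            c = CONSONANT_RANK[ch]
            if nxt == PULLI:            # pure consonant, e.g. க்
                units.append((c, None))
                i += 2
            elif nxt in SIGN_RANK:      # consonant + vowel sign, e.g. கா
                units.append((c, SIGN_RANK[nxt]))
                i += 2
            else:                       # bare consonant carries the implicit அ
                units.append((c, 0))
                i += 1
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return units


def phonemic_key(word):
    """Key in the spirit of the patch: க = க் + அ, compared weight by weight."""
    key = []
    for c, v in syllables(word):
        if c is not None:
            key.append(len(VOWELS) + c)  # every consonant outweighs every vowel
        if v is not None:
            key.append(v)
    return key


def syllable_key(word):
    """Key that compares whole syllables: க் < க < கா < ... < கௌ < ங்."""
    key = []
    for c, v in syllables(word):
        if c is None:
            key.append((0, v))
        else:
            key.append((1 + c, 0 if v is None else 1 + v))
    return key


WORDS = ["அக்கு", "அகால", "அ", "அக்கம்", "அகம்", "அக்"]
print(sorted(WORDS, key=phonemic_key))  # அ, அக், அகம், அகால, அக்கம், அக்கு
print(sorted(WORDS, key=syllable_key))  # அ, அக், அக்கம், அக்கு, அகம், அகால
```

The only difference between the two keys is the unit of comparison, which is exactly the point under discussion.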

Apart from these, equal weights are assigned to ோ (U+0BCB), ௌ (U+0BCC), and their canonical equivalent forms.
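
As a small illustration of what "canonical equivalent forms" means here, the following Python sketch uses the standard unicodedata module to show that the single code points and their two-part sequences normalize to the same string. Normalizing before comparison is only one way to obtain equal weights and is an assumption of this sketch; the patch itself assigns the weights in the locale files. The helper name nfd_key is hypothetical.

```python
import unicodedata


def nfd_key(text):
    """Canonically decompose text so equivalent spellings compare equal."""
    return unicodedata.normalize("NFD", text)


# Single code point versus the equivalent two-part sequence.
EQUIVALENTS = [
    ("க\u0bcb", "க\u0bc7\u0bbe"),  # கோ: U+0BCB versus U+0BC7 U+0BBE
    ("க\u0bcc", "க\u0bc6\u0bd7"),  # கௌ: U+0BCC versus U+0BC6 U+0BD7
]

for composed, parts in EQUIVALENTS:
    # The raw strings differ, but their normalized forms are identical.
    print(composed == parts, nfd_key(composed) == nfd_key(parts))  # False True
```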

If anybody is interested in testing the patch, get the ta_IN and iso14651_t1_common files from here, back up the existing files in /usr/share/i18n/locales, and place these two files there. Reconfigure your locale using “sudo dpkg-reconfigure locales”. Sort some random file using “LANG=ta_IN sort yourfile”. If your distro is not Debian-based, follow the instructions from here.
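
If you prefer to check the installed rules from a script rather than with sort, something like the following Python sketch should work. It relies only on the standard locale module, which hands the comparison to the C library. The function name sort_in_locale is made up for this example, and it assumes the locale is installed under the name ta_IN; if it is not, the function returns None instead of failing.

```python
import locale


def sort_in_locale(words, name="ta_IN"):
    """Sort words with the C library's collation for `name`.

    Returns None when the locale is not installed on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that sorts by LC_COLLATE rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting


print(sort_in_locale(["அக்கு", "அகால", "அ", "அக்கம்", "அகம்", "அக்"]))
```

The printed order depends entirely on which ta_IN rules are installed, so running it before and after replacing the files is a quick way to see the effect of the patch.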

There is an easier way to test this. The Silpa project provides an online application for Indic language collation. You can try it from here. It is an implementation of the Unicode Collation Algorithm. The Unicode collation definition has many mistakes, but we have a patched version. You can compare the results between the original and patched versions.

Feel free to share this discussion with anybody interested in Tamil computing. I would be happy to help with the implementation if we reach a consensus.


5 thoughts on “Tamil Collation in GLIBC”

  1. I second K Sethu. It is only natural to expect அக்கம் immediately after அக் rather than putting it after அகம், அகால etc.

  2. Dear Santosh

    Thank you for your continuing efforts to get the Tamil collation in GLIBC right.

    I will write later (within a few days) on the apparent contradictions that emerge when the phonetic or phonemic (whichever is the correct word in linguistics) decomposition of consonant+vowel syllables (into their atomic pure consonant + vowel) is used directly for the collation algorithm (which is the basis of your effort), and give my opinion on how they may be resolved.

    Now let me point out the relevant standards and documents that specify collation requirements.

    Firstly, as Mani Manivannan pointed out in the first comment, the standard of the TN government is at http://www.tn.gov.in/gosdb/gorders/it/it_e_29_2010.pdf . Appendix D (pages 42 to 44 of the PDF file) has the collation order. In it, the following sets (1 to 7) are grouped in collation order:

    1. Language independent symbols: ! ” $ …etc

    2. Special Tamil Symbols: Day sign, Month sign etc
    Note that in this set, the 4 signs for Full Moon, New Moon, Karthigai and Raja are encoded only in the TN Government’s TACE encoding, which the TN government has prescribed as an alternative to Unicode Tamil in applications that do not support, or only partially support, Unicode Tamil. The rest of the symbols are common to both Unicode and TACE. Since your collation is for ta_IN (i.e., Unicode Tamil), those are relevant, and the 4 TACE-only symbols are not.

    3. Tamil numerals and fractions. In this set too, the digits 0 to 9 and the numbers 10, 100 and 1000 are common to both Unicode Tamil and TACE and so are relevant for your collation work, but the rest, all fractions, are TACE-only and not relevant.

    4. 12 Vowels ( அ,ஆ,இ,ஈ…..ஒ,ஓ,ஔ) plus Aytham (visarga) ஃ
    The ascending collation order is left to right and so the place of ஃ is at the end of vowels.

    5. Consonants and vowel-consonants: this is the matrix of the 18 Tamil pure consonants and their 18×12 consonant+vowel syllables. The sorting order is left to right within each row, one row after another.

    That is க்<க<கா<கி<கீ<……………..கொ<கோ<கௌ<ங்<ங<ஙா<ஙி<ஙீ…………. and so on

    6. The Grantha groups of Tamil letters
    This set is similar to the above, with each pure Grantha consonant followed by its consonant+vowel syllables left to right, for ஜ்,ஶ்,ஷ்,ஸ்,ஹ், followed by the analogous row for the ligature க்ஷ் .

    Please note that the ligature க்ஷ் is the conjunct form of க்+ஷ் : U+0B95 U+0BCD U+0BB7 U+0BCD

    However, the majority usage among Tamils is the non-conjunct form க்‌ஷ, which is typed by placing a Zero Width Non-Joiner (U+200C) between க் and ஷ, i.e. U+0B95 U+0BCD U+200C U+0BB7 U+0BCD . Under Appendix C (page 37 of the file; page 9 of 13 in Appendix C) you can see the recommendation to provide ZWNJ in the Tamil99 key mappings for this purpose. (Sri Lanka's standard for Tamil Unicode usage has the same provision in the Modified Renganathan key map.)

    So there is a question to be addressed: where should the non-conjunct form க்‌ஷ be placed in the collation order?

    I believe non-conjunct form க்‌ஷ் should come after க்ஶௌ like this :
    ……. க்ஶ்,க்ஶ,க்ஶி…….க்ஶொ, க்ஶோ, க்ஶௌ,க்‌ஷ்,க்‌ஷ,க்‌ஷா, …….க்‌ஷொ,க்‌ஷோ, க்‌ஷௌ,க்ஸ்,க்ஸ,…..and so on, and should not be equivalent to க்ஷ் for collation purposes. The conjunct form க்ஷ் and its syllables come at the end, as they are in the last row of the table.

    7. Ligature ஶ்ரீ (SHrii)

    Finally the singleton ligature ஶ்ரீ that is conjunct of ஶ்+ரீ : U+0BB6 U+0BCD U+0BB0 U+0BC0

    It should be noted that before Unicode 4.1 this ligature was defined as the conjunct of ஸ்+ரீ : U+0BB8 U+0BCD U+0BB0 U+0BC0 . From Unicode 4.1 onwards it was changed to the current ஶ்+ரீ. The old form has to be deprecated, but it still lingers on in fonts. For example, the Lohit Tamil font, which is comprehensive in its coverage of Tamil Unicode characters, has the older definition in addition to the current one.

    I do not think that both definitions should be considered equivalent for collation purposes, because the older definition will eventually be deprecated. The issue is rather sticky, however, in view of the existing databases and documents that use the older definition, so the opinions of others are needed on this.

    I will continue in the next posting with some more on the minor differences in Sri Lanka's standard [http://www.icta.lk/attachments/651_651_SLS%201326-Part%201.pdf] (which would be relevant for the ta_LK locale), and also on some relevant issues raised by R. Padmakumar in a document written in December 2004 [http://www.angelfire.com/empire/thamizh/2/]

    K. Sethu

  3. Santosh

    Pardon me for the long delay. I wish to get back on track on this issue and continue with everything I wanted to add, but before that, a couple of questions for you:

    കൊച് കൊച്ചി കൊചി

    If you sort the above, in what order will they appear? The order I wrote above is as per linguistic expectations for Tamil, i.e., koch<kochchi<kochi (the first and third are perhaps not genuine words in Malayalam, but that doesn't matter).

    According to the comparison logic in the algorithm you would get the ascending order as: കൊച് കൊചി കൊച്ചി – Am I right?

    Is that accepted by Malayalam linguists? Is the ordering in Malayalam dictionaries, glossaries, or book indexes done in the same manner?

    K. Sethu

    1. @Sethu,
      The dictionaries are not consistent in the collation order as far as I know. The approach we followed for Malayalam is documented in

      a)http://smc.org.in/doc/malayalam-collation.pdf and
      b) http://smc.org.in/doc/rachana-malayalam-collation.pdf

      The basic idea is that a half consonant (or consonant without the inherent vowel) appears before its full consonant.
      കൊച്ച്
      കൊച്ച
      കൊച്ചു്
      കൊച്ചു
      is the resulting collated word list.
      You may try the glibc collation for Tamil, or any other language, in the current GNU C library using this simple online tool
      http://silpa.org.in/santhosh/collation/

      Thanks for your interest in this. I am unable to comment on any of the facts you mentioned in your previous comment since I am not a Tamil speaker. But I hope other Tamil scholars will read and document it clearly so that we can take it up for implementation.
