Creating a new language ecosystem: Sourashtra as an example

Sourashtra is a language spoken by the Sourashtra people living in southern Tamil Nadu and Gujarat in India. Derived from Brahmi and then Grantha, this language is the mother tongue of half a million people. But most of them are not familiar with its script; very few people can read and write the Sourashtra script. Sourashtra has the ISO 639-3 language code saz and the Unicode range U+A880–U+A8DF.
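
The Unicode range quoted above is enough to test whether a piece of text is written in the Sourashtra script. Here is a minimal Python sketch, assuming only the block boundaries given in this post; the function names are made up for illustration:

```python
# Block boundaries as quoted in the post: U+A880 to U+A8DF.
SAURASHTRA_START = 0xA880
SAURASHTRA_END = 0xA8DF


def is_saurashtra(ch: str) -> bool:
    """Return True if the single character ch lies in the Sourashtra block."""
    return SAURASHTRA_START <= ord(ch) <= SAURASHTRA_END


def saurashtra_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in text that are Sourashtra."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_saurashtra(c) for c in chars) / len(chars)
```

A check like this tells you only that the code points are Sourashtra. Whether the reader sees proper glyphs still depends on fonts and rendering support.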

Recently a Sourashtra Wikipedia project was started in the Wikimedia Incubator: http://incubator.wikimedia.org/wiki/Wp/saz and MediaWiki localization started on translatewiki. Since the language did not have any proper fonts or input tools, this was not going well.

When we add support for a new language in MediaWiki or start a Wikipedia in a new language, we need to develop the language technology ecosystem to support its growth. This ecosystem comprises Unicode code points for the script, proper fonts, rendering support, input tools, and the availability of these fonts and input tools in operating systems, or alternative ways to get them working there.

The Sourashtra language had a Unicode font developed by Prabu M Rengachari, itself named ‘Sourashtra’. The font had problems with browsers and operating systems, which we fixed so that it works properly. The font was also not properly licensed. Prabu agreed to release it under the GNU GPLv3 license with the font exception. He also agreed to rename the font to something other than the script name itself.

The font was renamed to Pagul, meaning ‘Footstep’ in Sourashtra, and is hosted on SourceForge.

Once we had a properly licensed font, we wanted it to be available in operating systems. I filed a packaging request in Debian. Vasudev Kamath of the Debian India team packaged it, and it is now available in Debian unstable (sid). Parag Nemade of Fedora India packaged the font for Fedora, and it will be available in the upcoming Fedora 15.

To add support for a new language in an operating system, we need a locale definition. In GNU/Linux this is the glibc locale definition. With the help of Prabu, I prepared the saz_IN locale file for glibc and filed a bug report to have it added to glibc. I hope it will soon be part of glibc.
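
Whether a glibc locale such as saz_IN is actually installed on a given system can be checked from Python with the standard locale module. This is only a sketch: the helper name locale_available is hypothetical, and the exact locale string to pass (for example saz_IN or saz_IN.UTF-8) depends on how the locale was generated on that system.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if the C library accepts the given locale name."""
    # Remember the current setting so the check has no lasting side effect.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the locale is not installed or not recognised.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)
```

For example, locale_available("saz_IN") returns True only on systems where that locale definition has been installed and generated.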

Well, all of this was possible because it was GNU/Linux, that is, free software. Things are a bit more difficult on the other side, in the world of proprietary operating systems. There is nothing we can do with those operating systems. Since there is no ‘market’ for these minority languages, adding support for them will not become a priority for those companies. Users will see squares or question marks when they visit the Sourashtra Wikipedia.

We are working on a solution for this, not only for Sourashtra but a common solution for all languages. We are developing a webfonts extension for MediaWiki that embeds fonts in wiki pages, so that users do not need to have the fonts installed on their computers. The extension is in development, and one can preview it in my test wiki. For Sourashtra, we have added webfonts support (preview).

Input tools need to be developed and packaged. For MediaWiki, we can easily add this support with the help of the Narayam extension.

With the Silpa project, I added server-side PDF/PNG/SVG rendering support for Sourashtra as well.

 

{ 8 } Comments

  1. Siebrand Mazeland | May 7, 2011 at 2:59 PM | Permalink

    Hey Santosh. Superb development for yet another Indic language. Would you consider submitting the glyphs for inclusion in an existing open source font like DejaVu? http://dejavu-fonts.org/wiki/Main_Page

    As this font is used, for example, in a PHP PDF class like tcpdf (http://www.tcpdf.org/), this would open up another world of opportunities.

  2. Siebrand Mazeland | May 7, 2011 at 3:00 PM | Permalink

    Oh, one addition: DejaVu is a public domain font. It would require a license change on the glyph collection.

  3. Santhosh | May 7, 2011 at 4:59 PM | Permalink

    Hi Siebrand,
    I doubt whether TCPDF can support this font. So far it has been incapable of rendering Indic scripts. But the PyPDFLib PDF library can (and its online version is here: http://silpa.org.in/Render ).
    We have included some of our GPL font glyphs in DejaVu before. We just need dual licensing. They need a Bitstream Vera style license, so we can relicense the font under GPLv3#FontException + BVL.

  4. Siebrand Mazeland | May 7, 2011 at 7:55 PM | Permalink

    So does tcpdf’s inability to render Indic scripts have to do with the actual fonts for it lacking, or with technical limitations? My hunch is the former, not the latter.

    If my assumption is true, that points to another weakness in open source that we are facing: the fonts may be there, but they are scattered all over the place, and getting them would be the next challenge. Webfonts are a solution only for the web; once the web content is put into documents, the issues are there again. A “cover all” font like DejaVu is one possibility. Are you aware of any other initiatives to tackle this issue?

  5. Santhosh | May 7, 2011 at 8:09 PM | Permalink

    It is a technical limitation. For TCPDF, the TrueType font needs to be converted to AFM format first; then, for each script, the diacritic and ligature rules are implemented in tcpdf itself. That is what was done to add Arabic/Persian support. For complex scripts this is not the correct approach. Shaping engines that handle Indic scripts, such as Pango, have taken about 10 years to evolve. The shaping logic is very complex, and duplicating it inside a PDF library is the wrong approach. Instead, the PDF library should depend on Pango or the upcoming HarfBuzz rendering engine. The PDF export library in MediaWiki uses the ReportLab PDF library. That also attempts to do the rendering by itself, and it ended up with no support for any Indic languages and many bugs for Arabic scripts (note that this extension is disabled in many Indian wiki projects). Fonts are not enough for rendering; a shaping engine is also required for complex scripts to interpret the glyph formation rules. This is what PyPDFLib is trying to solve by using Pango for script rendering and Cairo for graphics.

  6. Santhosh | May 7, 2011 at 8:29 PM | Permalink

    A ‘cover all’ font is a problem: its size will be very big. This is better handled with the font family concept, like the Sans family. If we have a document with multiple scripts, glyphs will be selected from multiple fonts. As long as the glyph is present in some font on the system, you will never see a square or question mark. Now suppose we create a PDF file with multiple scripts: the glyphs will be taken from multiple fonts and embedded. Please open this PDF generated using pypdflib http://thottingal.in/tmp/multiscript.pdf and look at the properties of the PDF -> Fonts. You can see the list of fonts used for these scripts. I am not mapping a script to a font in my code. Instead, I am asking Pango to use the Sans family for the document. It takes the required fonts from the system and does the job. (I hope I understood your question correctly.)

  7. Anivar Aravind | May 7, 2011 at 8:52 PM | Permalink

    Pagul glyphs can be included in GNU FreeFont: http://www.gnu.org/software/freefont/coverage.html GPLv3 + FE may not be compatible with the DejaVu license.

  8. Dr. M.S. Sureshkumar Ph. D., | January 16, 2012 at 8:46 PM | Permalink

    Happy to note the technical development of the Sourashtra language; kindly bring the progress.

    With happiness,
    Dr. Sureshkumar
