NotoSansMalayalam and nta

NotoSansMalayalam has the following ligature rules for ന്റ (nta). All of them use the Akhand OpenType feature.

  1. uni0D7B(ൻ) + uni0D4D(്) + uni0D31(റ) => ൻ + ് + റ
  2. uni0D28(ന) + uni0D4D(്) + uni200D(ZWJ) + uni0D31(റ) => ന്‍ + റ
  3. uni0D28(ന) + uni0D4D(്) + uni0D31(റ) => ന് + റ
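
To make the three sequences easier to compare, here is a minimal sketch that builds each one as a Java string and prints its code points. The class and constant names are made up for this illustration, and the code only shows the underlying characters. It does not show how any font shapes them.

```java
public class NtaSequences {
    // Sequence 1: chillu n (U+0D7B) + virama (U+0D4D) + rra (U+0D31)
    static final String CHILLU_FORM = "\u0D7B\u0D4D\u0D31";
    // Sequence 2: na (U+0D28) + virama + zero width joiner (U+200D) + rra
    static final String ZWJ_FORM = "\u0D28\u0D4D\u200D\u0D31";
    // Sequence 3: na + virama + rra
    static final String PLAIN_FORM = "\u0D28\u0D4D\u0D31";

    // Render a string as space-separated U+XXXX labels.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] {CHILLU_FORM, ZWJ_FORM, PLAIN_FORM}) {
            System.out.println(describe(s));
        }
    }
}
```

Text stored in any one of these three encodings will not match a plain string search for either of the other two, even when a font renders all of them with the same glyph.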


The first one is what is defined in Unicode chapter 09, section 9.9. The second is what Microsoft Kartika used for /nta/, as a bug. The last one is what all other fonts follow. If this is what standards can achieve, what can I say?

Hyphenation in web

This is a follow-up to a four-year-old blog post about hyphenation. Hyphenation allows the controlled splitting of words to improve the layout of paragraphs, typically splitting words at syllabic or morphemic boundaries and visually indicating the split (usually with a hyphen).

I wrote about how a webpage can use the Hyphenator JavaScript library to hyphenate text styled with 'justify' alignment. Along with the hyphenation rules I wrote for many Indian languages, this solution works, and some websites already use it. The Hyphenator library inserts soft hyphens at appropriate positions inside the text.
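
The effect of inserting soft hyphens can be sketched in a few lines. This is only an illustration and not the Hyphenator library's API: the class name, the method and the hand-picked break offsets are all assumptions, whereas a real hyphenator derives the break points from language-specific patterns.

```java
public class SoftHyphenDemo {
    // U+00AD SOFT HYPHEN: invisible unless a line is broken at that point.
    static final char SOFT_HYPHEN = '\u00AD';

    // Insert a soft hyphen at each offset. Offsets index into the original
    // word and must be given in ascending order.
    static String insertSoftHyphens(String word, int... breakOffsets) {
        StringBuilder sb = new StringBuilder(word);
        // Work from right to left so the earlier offsets stay valid.
        for (int i = breakOffsets.length - 1; i >= 0; i--) {
            sb.insert(breakOffsets[i], SOFT_HYPHEN);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hand-picked break points for illustration: hy|phen|ation
        String marked = insertSoftHyphens("hyphenation", 2, 6);
        System.out.println(marked.length()); // two characters longer than the input
    }
}
```

The marked word looks unchanged on screen. The browser shows a hyphen only at the soft hyphen where it actually breaks the line.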

Example showing the difference between hyphenated and non-hyphenated Malayalam text. You can see a lot of line space wasted as white space in the non-hyphenated text.


More recently browsers such as Firefox, Safari and Chrome have begun to support the CSS3 hyphens property, with hyphenation dictionaries for a range of languages, to support automatic hyphenation.

For hyphenation to work correctly, the text must be marked up with language information, for example with the HTML lang attribute. This is because hyphenation rules vary by language, not by script. The description of the hyphens property in CSS says “Correct automatic hyphenation requires a hyphenation resource appropriate to the language of the text being broken. The user agent is therefore only required to automatically hyphenate text for which the author has declared a language (e.g. via HTML lang or XML xml:lang) and for which it has an appropriate hyphenation resource.”

CSS Example

-webkit-hyphens: auto;
-moz-hyphens: auto;
-ms-hyphens: auto;
-o-hyphens: auto;
hyphens: auto;

Browser Compatibility

  • Chrome 13+ with -webkit prefix
  • Firefox 6.0+ with -moz prefix
  • IE 10+ with -ms prefix.

Hyphenation rules

CSS Text Level 3 does not define the exact rules for hyphenation. However, user agents are strongly encouraged to optimize their line-breaking implementation to choose good break points and appropriate hyphenation points.

Firefox has hyphenation rules for about 40 languages. A complete list of languages supported in Firefox and IE is available at the Mozilla wiki.

You can see that none of the Indian languages are listed there. Hyphenation rules can be reused from the TeX hyphenation rules. Jonathan Kew was importing the hyphenation rules from TeX, and I had requested that the rules for Indian languages be imported too. But that was more than a year ago, and there has not been much progress. Apparently there was a licensing issue with derived work, but it looks like that is already resolved.

CSS4 Text

While this is all well and good, it doesn’t provide the fine-grained control you may require to get professional results. For this, CSS4 Text introduces more features.

  • Limiting the number of hyphens in a row using hyphenate-limit-lines. This property is currently supported by IE10 and Safari, using the -ms- and -webkit- prefix respectively.
  • Limiting the word length, and number of characters before and after the hyphen using hyphenate-limit-chars
  • Setting the hyphenation character using hyphenate-character. This helps to override the default hyphen character.

More reading

PS: Sometimes hyphenation can be very challenging. For example hyphenating the 746 letter long name of Wolfe+585, Senior.

Malayalam Wikisource Offline version

The Malayalam Wikisource community today released the first offline version of Malayalam Wikisource during the 4th annual wiki meetup of Malayalam Wikimedians. To the best of our knowledge, this is the first time a Wikisource project has released an offline version. The Malayalam wiki community had released the first offline version of Malayalam Wikipedia a year ago.

Releasing an offline version of a Wikisource is a challenging project. The technical aspects of the project were designed and implemented by me, so let me share the details.

As you know, a Wikisource contains a lot of books. Each book varies in size and is divided into chapters or sections. There is no common pattern for books; each has its own structure. A novel is presented differently from a collection of poems by a poet. Wikisource also has religious books like the Bible, Quran, Bhagavat Geeta, Ramayana etc. Since books are meant for continuous reading over a long time, readability and how we present lengthy chapters on screen also matter. Offline Wikipedia tools, Kiwix for example, do not modify the layout of the content and present it as it is shown in Wikipedia/Wikisource. The tool we wrote last year for the Malayalam Wikipedia offline version also presents vertically scrollable content on the screen. Neither can be configured to give different presentation styles depending on the nature of the book.

What we wanted was a book-reader kind of application interface. Readers should be able to navigate easily to books and chapters. The chapter content can be very lengthy, and for reading it over a long time, a long vertically scrolled text is not a good idea. We also need to take care of the width of the lines. If each line spans 80-90% of the screen, especially on a widescreen monitor, it is a strain on the neck and eyes.


Screenshot of the offline version.

The selection of books for the offline version was done by the active Wikimedians at Wikisource. Some of the selected books were proofread by many volunteers within the last 2 weeks.

The tools used for extracting the HTML were ad hoc and adapted to give each book a good presentation, so there is not much to reuse here. Basically, our scripts extracted the HTML, took the content part alone using pyquery, and removed some unwanted sections. The content was added to predefined HTML templates with proper CSS for the UI. The CSS3 multicolumn feature was used for the book-like interface. Since IE did not implement this standard even in IE9, the book-like interface was not provided for that browser. Chrome versions below 12 could not support it either, because of browser bugs. For easy navigation, mouse wheel support and page navigation buttons are provided. To work around the non-availability of required fonts, webfonts were integrated, with a selection box for choosing a favorite font. The reader can also select the font size to make reading comfortable.

Why static HTML? The variety of platforms and versions we need to support, the necessity of webfonts, complex script rendering, the effort needed to develop and customize a UI, the relatively small size of the data, and the wish to avoid installing any software on the user's system made us choose static HTML + jQuery + CSS as the technology. The downside is that we could not provide full text search.

Apart from the Wikisource content, we also included a collection of copyleft images from Wikimedia Commons. Thanks to Nishan Naseer for preparing a gallery application using jQuery. We selected 4 categories from Commons which are related to Kerala. We hope everybody will like the pictures and that they will give a small introduction to Wikimedia Commons.

Even though the Python scripts are not ready for reuse in other projects, if anybody wants to have a look at them, please mail me. I am not putting them in public since the scripts do not make sense outside the context of each book and its existing presentation in Malayalam Wikisource.

The CD image is available for download here and one can also browse the CD content here.

Thanks to Shiju Alex for coordinating this project, and thanks to all Malayalam Wikisource volunteers for making this happen. We have included poems, folk songs, devotional songs, novels, a grammar book, tales, and books on Hinduism, Islam, Christianity, Communism and Philosophy. With this release, it becomes the biggest offline digital archive of Malayalam books.

On Machine Translation and God

I was reading an article named “Why Can’t a Computer Translate More Like a Person?” by Alan K. Melby. The article is about the challenges that machine translation technology faces in reaching an acceptable quality of translation. He explains the cultural sensitivity required of machine translation programs. The article lists a number of examples where MT can go wrong if context, culture etc. are not taken into consideration. There are very interesting arguments about how reductionism becomes a wrong choice while designing MT. If you are interested in natural language processing or machine translation and wonder whether there is any limit to how far computer programs can reach human language capabilities, please read it.

The article was written a long time back, and machine translation technologies have improved a lot since. There are commercial as well as free translation products for many languages. There is research going on in intra-Indic as well as English-Indic translation. I am not sure how far these technologies have solved the challenges mentioned in the article, but I believe that the questions are still valid.

The question is whether programs can understand our culture, language usage, emotions etc. For translating limited-domain or dry content, machine translation may be effective, but for general-purpose use, I don’t know how effective it is.

Melby argues :

That key factor which is missing from current theories is agency. By agency, I mean the capacity to make real choices by exercising our will, ethical choices for which we are responsible. […]. Any ‘choice’ that is a rigid and unavoidable consequence of the circumstances is not a real choice that could have gone either way and is thus not an example of agency. A computer has no real choice in what it will do next. Its next action is an unavoidable consequence of the machine language it is executing and the values of data presented to it. I am proposing that any approach to meaning that discounts agency will amount to no more than the mechanical manipulation of symbols such as words, that is, moving words around and linking them together in various ways instead of understanding them. Computers can already manipulate symbols. In fact, that is what they mostly do. But manipulating symbols does not give them agency and it will not let them handle language like humans. Symbol manipulation works only within a specific domain, and any attempt to move beyond a domain through symbol manipulation is doomed, for manipulation of symbols involves no true surprises, only the strict application of rules. General vocabulary, as we have seen, involves true surprises that could not have been predicted.

With all these advanced technologies, can we develop a universal, any-to-any language translation program? We have seen many examples where even human beings fail miserably at sensible translation. If you want to judge how effective translation between English and Hindi is, try this with Google Translate:

आप हिन्दी समझते है ? ==> You understand English?

So do you think that such a universal translation tool is nearly impossible and that “only god can create such a tool”?! Heard about the Babel fish (of The Hitchhiker’s Guide to the Galaxy)? The Babel fish is small, yellow, leech-like, and is a universal translator which simultaneously translates from one spoken language to another. When inserted into the ear, its nutrition processes convert sound waves into brain waves, neatly crossing the language divide between any species you should happen to meet whilst travelling in space. According to the Hitchhiker’s Guide, the Babel fish was put forth as an example for the non-existence of God:

“I refuse to prove that I exist,” says God, “for proof denies faith, and without faith I am nothing.”

“But,” says Man, “the Babel fish is a dead giveaway isn’t it? It could not have evolved by chance. It proves that you exist, and so therefore, by your own arguments, you don’t. Q.E.D.”

“Oh dear,” says God, “I hadn’t thought of that,” and promptly vanishes in a puff of logic.

Alan K Melby argues that Douglas Adams was saying that there can’t be any such fish.

The silliness of the above argument is intended, I believe, to show the futility of trying to prove the existence of God, through physics or any other route. Belief in God is a starting point, not a conclusion. If it were a conclusion, then that conclusion would have to be based on something else that is firmer than our belief in God. If that something else forces everyone to believe in God, then faith is denied. If that something else does not force us to believe in God, then it may not be a sufficiently solid foundation for our belief.

Adams may also be saying something about translation and the nature of language. I can speculate on what Adams had in mind to say about translation when he dreamed up the Babel fish. My own bias would have him saying indirectly that there could be no such fish since there is no universal set of thought patterns underlying all languages. Even with direct brain to brain communication, we would still need shared concepts in order to communicate. Words do not really fail us. If two people share a concept, they can eventually agree on a word to express it. Ineffable experiences are those that are not shared by others.

I have some friends working on machine translation for Indian languages. They are evaluating the shallow transfer method (statistical methods applied to the words surrounding the ambiguous word) using tools like Apertium. Let us hope that they succeed in their efforts.

Let me give one example translation between Tamil and Malayalam where context matters.

In Malayalam, for ‘wait, wait’, we usually say “നില്ക്കു് നില്ക്ക്” (literal meaning: ‘stand, stand’). For the same purpose, I have noticed that my Tamil-speaking friends use “இரு இரு” (literal meaning: ‘sit, sit’). Now if the translation is done without knowing this usage, it is going to be funny. Shallow transfer methods use multiple intermediate languages for translation. For example, if translation tools are available for a->b and b->c, then a->c is possible through a->b->c. I feel that this is going to be a big challenge: keeping the word meaning, context, common usage etc. Let us wait/sit/stand and see 😀
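
The a->b->c chaining can be sketched with two toy word lists. Everything here is an assumption made for illustration: the class name, the choice of English as the intermediate language b, and the romanized one-word dictionaries. It is not how Apertium or any real system stores its data.

```java
import java.util.HashMap;
import java.util.Map;

public class PivotTranslation {
    // Toy dictionaries with romanized words, for illustration only.
    static final Map<String, String> ML_TO_EN = new HashMap<>();
    static final Map<String, String> EN_TO_TA = new HashMap<>();
    static {
        ML_TO_EN.put("nilkku", "stand"); // literal gloss of the Malayalam word
        EN_TO_TA.put("stand", "nil");    // literal Tamil word for 'stand'
    }

    // What a Tamil speaker would idiomatically say for 'wait' (literally 'sit').
    static final String IDIOMATIC_TAMIL = "iru";

    // Look a word up; unknown words pass through unchanged.
    static String translate(Map<String, String> dictionary, String word) {
        return dictionary.getOrDefault(word, word);
    }

    // a->c obtained by chaining a->b and b->c.
    static String pivot(String malayalamWord) {
        return translate(EN_TO_TA, translate(ML_TO_EN, malayalamWord));
    }

    public static void main(String[] args) {
        String result = pivot("nilkku");
        System.out.println(result + " (idiomatic: " + IDIOMATIC_TAMIL + ")");
    }
}
```

Each hop is correct word for word, yet the chained result is the literal ‘stand’ rather than the idiomatic Tamil expression. The usage information is lost at the intermediate step.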

Since we saw “a nonexistence of God proof”, let me give another one that I read some time back.

  1. God is so powerful that he can do anything.
  2. If #1 is true, God can create anything.
  3. If #2 is true, he can create a big stone that he cannot lift!
  4. If he cannot lift a stone, then #1 is wrong, hence #2 is also wrong. So God does not exist!

Looks very silly, right? Or “logical”? 🙂

PDFBox : Extract Text from PDF

Recently I had to extract text from PDF files for indexing the content using Apache Lucene. Apache PDFBox was the obvious choice of Java library.

Apache PDFBox is an open source Java library for working with PDF files. The PDFBox library allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents. PDFBox also includes several command line utilities.

No recent build of PDFBox is available. SourceForge has only very old binaries, and that old version fails to work with the PDF 1.5 specification, so one needs to compile the latest code from SVN.

I am sharing the latest jar file built from SVN here.

The following example shows how to extract the text from a PDF file using PDFBox.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

	// Extract text from PDF Document
	static String pdftoText(String fileName) {
		PDFParser parser;
		String parsedText = null;
		PDFTextStripper pdfStripper = null;
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		File file = new File(fileName);
		if (!file.isFile()) {
			System.err.println("File " + fileName + " does not exist.");
			return null;
		}
		try {
			parser = new PDFParser(new FileInputStream(file));
		} catch (IOException e) {
			System.err.println("Unable to open PDF Parser. " + e.getMessage());
			return null;
		}
		try {
			// Parse the file before asking the parser for the document
			parser.parse();
			cosDoc = parser.getDocument();
			pdfStripper = new PDFTextStripper();
			pdDoc = new PDDocument(cosDoc);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (Exception e) {
			System.err.println("An exception occurred in parsing the PDF Document. "
					+ e.getMessage());
		} finally {
			// Release the document resources whether or not parsing succeeded
			try {
				if (cosDoc != null)
					cosDoc.close();
				if (pdDoc != null)
					pdDoc.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
		return parsedText;
	}

	public static void main(String args[]) {
		// Pass the path of the PDF file as the first argument
		System.out.println(pdftoText(args[0]));
	}
}



More details on the APIs can be read from here

Youtube to MPEG or Ogg video conversion

Here is a quick method to convert a YouTube video to an Ogg Theora video.
Locate clive and ffmpeg2theora in your package manager and install them.
$ clive (replace this with the YouTube address you want)
It will create an flv file.
Convert it to an MPEG video file (replace AmericaAmerica.flv with the name of the flv file the previous command created):
$ ffmpeg -i AmericaAmerica.flv AmericaAmerica.mpg
Convert it to an Ogg video file:
$ ffmpeg2theora AmericaAmerica.mpg
Done. You can see the .ogg file in the directory from which you executed the above commands.

Dhvani-KDE integration

By adding the Dhvani text-to-speech system to the KDE desktop, you can have Malayalam text (and text in the other languages Dhvani supports) read aloud in kedit, kate, kwrite and konqueror. Dhvani can also be used to read Malayalam web pages in the Konqueror web browser. I have not written any special code for this. :) It is done using a facility called the command plugin in ktts (KDE's TTS system).
Go to the Kontrol center and open Text-to-speech in the Regional and Accessibility section. There, in the Talkers tab, click the Add button. For Synthesizer, choose Show All and then select Command. Set Language to Other. You will now get a window for adding the synthesizer command. There, add
dhvani %f

Make this Talker the default. That's it. In any of the applications mentioned above, select the part to be read and choose Speak text from the Tools menu.
For more information about dhvani, how to install it etc., see the documentation.