On Machine Translation and God

I was reading an article titled “Why Can’t a Computer Translate More Like a Person?” by Alan K. Melby. The article is about the challenges machine translation technology faces in reaching an acceptable quality of translation. He explains the cultural sensitivity required of machine translation programs, and lists a number of examples where MT can go wrong if context, culture, etc. are not taken into consideration. There are very interesting arguments about how reductionism becomes the wrong choice when designing MT. If you are interested in natural language processing or machine translation and are wondering whether there is any limit to how far computer programs can approach human language capabilities, please read it.

The article was written a long time ago, and machine translation technology has improved a lot since then. There are commercial as well as free translation products for many languages, and there is ongoing research on intra-Indic as well as English-Indic translation. I am not sure how far these technologies have solved the challenges mentioned in the article, but I believe the questions are still valid.

The question is whether programs can understand our culture, language usage, emotions, etc. For translating limited-domain or dry content, machine translation may be effective, but for general-purpose use I don’t know how effective it is.

Melby argues:

That key factor which is missing from current theories is agency. By agency, I mean the capacity to make real choices by exercising our will, ethical choices for which we are responsible. […]. Any ‘choice’ that is a rigid and unavoidable consequence of the circumstances is not a real choice that could have gone either way and is thus not an example of agency. A computer has no real choice in what it will do next. Its next action is an unavoidable consequence of the machine language it is executing and the values of data presented to it. I am proposing that any approach to meaning that discounts agency will amount to no more than the mechanical manipulation of symbols such as words, that is, moving words around and linking them together in various ways instead of understanding them. Computers can already manipulate symbols. In fact, that is what they mostly do. But manipulating symbols does not give them agency and it will not let them handle language like humans. Symbol manipulation works only within a specific domain, and any attempt to move beyond a domain through symbol manipulation is doomed, for manipulation of symbols involves no true surprises, only the strict application of rules. General vocabulary, as we have seen, involves true surprises that could not have been predicted.

With all these advanced technologies, can we develop a universal, any-to-any language translation program? We have seen many examples where human beings fail miserably at sensible translation. If you want to check the effectiveness of Hindi->English translation, try this using Google Translate:

आप हिन्दी समझते है ? ==> You understand English?

(The Hindi sentence actually means “Do you understand Hindi?”; the word “Hindi” itself got translated as “English”.)

So do you think that such a universal translation tool is next to impossible, and that “only God can create such a tool”?! Have you heard of the Babel fish (from The Hitchhiker’s Guide to the Galaxy)? The Babel fish is small, yellow, and leech-like, and is a universal translator which simultaneously translates from one spoken language to another. When inserted into the ear, its nutrition processes convert sound waves into brain waves, neatly crossing the language divide between any species you should happen to meet whilst travelling in space. According to the Hitchhiker’s Guide, the Babel fish was put forth as an argument for the non-existence of God:

“I refuse to prove that I exist,” says God, “for proof denies faith, and without faith I am nothing.”

“But,” says Man, “the Babel fish is a dead giveaway, isn’t it? It could not have evolved by chance. It proves that you exist, and so therefore, by your own arguments, you don’t. Q.E.D.”

“Oh dear,” says God, “I hadn’t thought of that,” and promptly vanishes in a puff of logic.

Alan K. Melby argues that Douglas Adams was saying that there can’t be any such fish:

The silliness of the above argument is intended, I believe, to show the futility of trying to prove the existence of God, through physics or any other route. Belief in God is a starting point, not a conclusion. If it were a conclusion, then that conclusion would have to be based on something else that is firmer than our belief in God. If that something else forces everyone to believe in God, then faith is denied. If that something else does not force us to believe in God, then it may not be a sufficiently solid foundation for our belief.

Adams may also be saying something about translation and the nature of language. I can speculate on what Adams had in mind to say about translation when he dreamed up the Babel fish. My own bias would have him saying indirectly that there could be no such fish since there is no universal set of thought patterns underlying all languages. Even with direct brain to brain communication, we would still need shared concepts in order to communicate. Words do not really fail us. If two people share a concept, they can eventually agree on a word to express it. Ineffable experiences are those that are not shared by others.

I have some friends working on machine translation for Indian languages. They are evaluating the shallow transfer method (applying statistical methods to the words surrounding an ambiguous word) using tools like Apertium. Let us hope that they succeed in their efforts.

Let me give an example of translation between Tamil and Malayalam where context matters.

In Malayalam, for ‘wait, wait’, we usually say “നില്ക്കു് നില്ക്ക്” (literal meaning: ‘stand, stand’). For the same purpose, I have noticed that my Tamil-speaking friends use “இரு இரு” (literal meaning: ‘sit, sit’). Now if the translation is done without knowing this usage, it is going to be funny. Shallow transfer methods also use intermediate languages for translation: for example, if translation tools are available for a->b and b->c, then a->c becomes possible through a->b->c. I feel that this is going to be a big challenge: keeping the word meaning, the context, the common usage, etc. Let us wait/sit/stand and see 😀
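Just to make the pivoting idea concrete, here is a toy sketch in Python. The dictionaries and functions here are entirely made up for illustration; real shallow-transfer systems like Apertium work on morphologically analysed chunks, not bare word lookups.

# Toy illustration of pivot translation (a -> b -> c), not a real MT system.
ML_TO_EN = {"നില്ക്ക്": "stand"}   # Malayalam 'wait!' is literally 'stand'
EN_TO_TA = {"stand": "நில்"}       # literal English -> Tamil, knows nothing about usage

def translate(text, dictionary):
    # Word-by-word replacement: keeps whatever the dictionary says, literally.
    return " ".join(dictionary.get(word, word) for word in text.split())

def pivot(text, first, second):
    # a -> b -> c: each hop sees only the literal output of the previous hop.
    return translate(translate(text, first), second)

# Malayalam 'wait, wait' pivoted through English comes out as Tamil 'stand, stand',
# while an idiomatic Tamil speaker would actually say 'இரு இரு' ('sit, sit').
print(pivot("നില്ക്ക് നില്ക്ക്", ML_TO_EN, EN_TO_TA))

Each hop preserves only what the previous hop wrote down, so idiomatic usage that was never encoded in the intermediate language is lost for good.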

Since we saw a “nonexistence of God” proof, let me give another one that I read some time back:

  1. God is so powerful that he can do anything.
  2. If #1 is true, God can create anything.
  3. If #2 is true, he can create a stone so big that he cannot lift it!
  4. If he cannot lift the stone, then #1 is wrong, and hence #2 is also wrong. So God does not exist!

Looks very silly, right? Or “logical”? :)

PDFBox: Extract Text from PDF

Recently I had to extract text from PDF files for indexing the content using Apache Lucene. Apache PDFBox was the obvious choice for the Java library to use.

Apache PDFBox is an open-source Java library for working with PDF files. The PDFBox library allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. PDFBox also includes several command line utilities.

There is no recent binary build available for PDFBox; SourceForge has very old binaries, and the old version fails to work with the PDF 1.5 specification. So one needs to compile the latest code from SVN.

I am sharing the latest jar file built from SVN here.

The following example shows how to extract text from a PDF file using PDFBox.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

	// Extract text from PDF Document
	static String pdftoText(String fileName) {
		PDFParser parser;
		String parsedText = null;
		PDFTextStripper pdfStripper = null;
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		File file = new File(fileName);
		if (!file.isFile()) {
			System.err.println("File " + fileName + " does not exist.");
			return null;
		}
		try {
			parser = new PDFParser(new FileInputStream(file));
		} catch (IOException e) {
			System.err.println("Unable to open PDF Parser. " + e.getMessage());
			return null;
		}
		try {
			// Parse the PDF and wrap the low-level COSDocument in a PDDocument
			parser.parse();
			cosDoc = parser.getDocument();
			pdfStripper = new PDFTextStripper();
			pdDoc = new PDDocument(cosDoc);
			// Extract text from pages 1 to 5 only; remove these two calls
			// to extract the whole document
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(5);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (Exception e) {
			System.err.println("An exception occurred while parsing the PDF Document: "
					+ e.getMessage());
		} finally {
			try {
				if (cosDoc != null)
					cosDoc.close();
				if (pdDoc != null)
					pdDoc.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
		return parsedText;
	}
	public static void main(String args[]){
		System.out.println(pdftoText("/home/santhosh/pdfbox/test.pdf"));
	}

}


More details on the APIs can be found here.

Announcing Project Silpa

Many of my friends already know about a project I am working on; this is its public announcement.

The project is named Silpa, maybe an acronym of Swathanthra (Mukth, Free as in Freedom) Indian Language Processing Applications. It is a web framework and a set of applications for processing Indian languages in many ways. In other words, it is a platform for porting existing and upcoming language processing applications to the web.

Before going into the details, you can have a quick preview of the application here: http://smc.org.in/silpa

The project is designed so that applications/utilities can be added as plugins. The framework is written from scratch in Python. As you can see in the development version, there are a number of modules already written. Most of the modules require some more work to make them _complete_. The application is free software, and there is a link to the source code at the bottom of the application.
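Just to illustrate the plugin idea, a module could be a small Python class that the framework discovers and exposes over the web. The base class, method names and registry below are hypothetical, for illustration only, and are not Silpa's actual API.

# Hypothetical sketch of a Silpa-style plugin; the class names and the
# registration mechanism are illustrative, not the project's real code.

class SilpaModule:
    # Assumed base class: the framework would expose process() on the web.
    name = "base"

    def process(self, text):
        raise NotImplementedError


class WordCounter(SilpaModule):
    # A trivial module: counts words in the given text, whatever the script.
    name = "wordcount"

    def process(self, text):
        return {"words": len(text.split())}


# The framework would typically keep a registry of the available modules.
MODULES = {cls.name: cls() for cls in (WordCounter,)}

if __name__ == "__main__":
    print(MODULES["wordcount"].process("ഇത് ഒരു പരീക്ഷണം ആണ്"))   # {'words': 4}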

As it is meant to cover all languages of India, all modules should be capable of handling all scripts of India (and sometimes English too). At the same time, the language of the input data is transparent, meaning the user need not specify the language in which she is entering the data. Unlike desktop applications that ask for the language along with the input data (for example, a spell checker), the modules should try to detect the language themselves. And if possible, modules try to process the data even when the input is in multiple Indic scripts.
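Script detection is usually straightforward for Indic text, because each script occupies its own Unicode block. Here is a minimal sketch of the idea; this is my own illustration, not necessarily how Silpa implements it.

# Minimal script detection by Unicode block; an illustration of the idea only.
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Oriya":      (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(word):
    # Return the script of the first Indic character, or 'Other' (e.g. English).
    for char in word:
        code = ord(char)
        for script, (start, end) in INDIC_BLOCKS.items():
            if start <= code <= end:
                return script
    return "Other"

# Mixed-script input: each word is classified on its own.
for w in "നില്ക്ക് இரு wait".split():
    print(w, "->", detect_script(w))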

The modules may be general purpose (e.g. dictionary, spell check, sort, transliteration, font conversion) or technology/algorithm demonstrations (e.g. hyphenation, stemmer, search algorithms).
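To give a flavour of one such module: inter-Indic transliteration can be roughly approximated by exploiting the fact that the Indic Unicode blocks share a common ISCII-derived layout, so many characters map to each other by a fixed code-point offset. The sketch below is a simplified illustration of that trick, not the actual transliteration module; a real one needs exception handling for characters that do not exist in the target script.

# Naive inter-Indic transliteration by code-point offset; possible because the
# Indic Unicode blocks share an ISCII-derived layout. Illustration only.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Tamil":      0x0B80,
    "Malayalam":  0x0D00,
}

def transliterate(text, source, target):
    # Shift each character from the source block to the same slot in the target block.
    offset = BLOCK_START[target] - BLOCK_START[source]
    start = BLOCK_START[source]
    result = []
    for char in text:
        if start <= ord(char) <= start + 0x7F:
            result.append(chr(ord(char) + offset))
        else:
            result.append(char)          # leave spaces, punctuation, etc. alone
    return "".join(result)

# Devanagari 'namaste' rendered in Malayalam characters: നമസ്തേ
print(transliterate("नमस्ते", "Devanagari", "Malayalam"))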

Some of the modules are usable as of now, while some are still in development. You may just try them out. User data will not be logged, except when a crash occurs (in which case the user data and exception trace will be logged for later debugging).

And this is also a call for contributors. You may propose new ideas for modules, feature suggestions, etc. A few students showed interest in the project, but unfortunately Python is not a language in their college syllabus. So if you are good at Python and are interested in contributing to the project, drop me a mail :). There is no development version separate from the one at http://smc.org.in/silpa; all development happens there, and any change in the code is immediately available for use! (Or immediately starts crashing on user data.)

I will write later about some of the interesting algorithms I used in the modules. If you are curious, read the code!