Malayalam Wikipedia releases selected articles on CD

As part of Malayalam Wikipedia Meetup 2010, Malayalam Wikipedia today released 500 selected articles on a CD ROM. This is the first time in India that a local-language Wikipedia has released its articles for offline use. I handled the technology part of the project.

The idea was to get the selected articles onto the CD in static form. But this is not as easy as one might imagine. It is not like saving each page from the browser to the local machine. These were the challenges:

  • Automate the process of getting each page and the images in it. Wikipedia articles change frequently, so the program needs to fetch the latest version of each article from the wiki whenever it is executed.
  • Fix all the links, CSS, JavaScript and image references so that everything resolves within the CD itself.
  • Provide a categorized index of the articles so that an article is easy to locate.
  • Provide a search over the article titles.
  • The ISO 9660 filesystem used on CDs/DVDs has lots of limitations. There are restrictions on Unicode file names, the length of file names, directory depth, special characters in file names, etc. Wikipedia article and image names contain Unicode and special characters, and most of the time they exceed the file name length limit. To avoid all these problems, we have to rename most of the files and then fix the cross references in all files.
  • It should work on all operating systems, so all the content should be presented with HTML, JavaScript and CSS. Since the content is in Malayalam, it should be readable even if the user does not have the required fonts on her/his machine (font embedding required).
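The renaming challenge above can be sketched roughly as follows. This is a minimal Python illustration, not the code used in the project: the helper names `safe_name` and `rewrite_links`, the hash-based naming scheme and the 31-character limit are all assumptions chosen for the example, and the exact limits depend on which ISO 9660 level and extensions are used when burning.

```python
import hashlib
import re


def safe_name(original, max_len=31):
    """Map any (possibly Unicode) file name to a short ASCII name.

    Hypothetical scheme: hash the part before the last dot and keep a
    cleaned-up extension. The same input always gives the same output.
    """
    stem, dot, ext = original.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = original, ""
    digest = hashlib.md5(stem.encode("utf-8")).hexdigest()[:12]
    ext = re.sub(r"[^a-z0-9]", "", ext.lower())[:4]
    name = digest + ("." + ext if ext else "")
    return name[:max_len]


def rewrite_links(html, mapping):
    """Replace href/src targets that appear in `mapping`; leave others alone."""
    def swap(match):
        attr, q, target = match.group(1), match.group(2), match.group(3)
        return f"{attr}={q}{mapping.get(target, target)}{q}"

    return re.sub(r'(href|src)=(["\'])(.*?)\2', swap, html)


if __name__ == "__main__":
    old = "കേരളം.html"
    mapping = {old: safe_name(old)}
    print(rewrite_links(f'<a href="{old}">Kerala</a>', mapping))
```

Because the new name is derived from the old one, the mapping can be rebuilt on every run, which matters when the articles are fetched afresh each time. A regular expression is enough for a sketch; a real HTML parser would be more robust.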

Solving all these challenges manually is not the way to go. So I wrote a program which just takes the article titles, does all the above tasks and finally creates a repository ready for burning to a CD ROM.

Wget disappointed me in fetching the content from the wiki. There is an open bug in wget which makes downloading non-Latin URLs impossible.
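One way to sidestep this kind of problem in Python is to percent-encode the UTF-8 title before requesting it. The snippet below is only a sketch of that idea and not necessarily what the program does: the function names `article_url` and `fetch`, the base URL and the User-Agent string are assumptions made for the example.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL for the example; adjust for another language edition.
BASE = "https://ml.wikipedia.org/wiki/"


def article_url(title, base=BASE):
    """Build an ASCII-only URL for an article title in any script."""
    # Wiki URLs use underscores for spaces; quote() percent-encodes the
    # UTF-8 bytes of everything that is not an unreserved ASCII character.
    return base + quote(title.replace(" ", "_"), safe="")


def fetch(title):
    """Download the current HTML of an article (needs network access)."""
    request = Request(article_url(title), headers={"User-Agent": "wiki2cd-sketch"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(article_url("കേരളം"))
```

Since the URL is built on each run, calling `fetch` always retrieves the latest revision of the article.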

Have a look at the CD content we created: Malayalam Wikipedia Selected 500 Articles. Hiran helped me with the artwork.

The CD cover image designed by Hiran

Since the entire process is automated, the program can be used for any other language. I am releasing the program for the benefit of everybody. You can get the program from here. It is written in Python, and jQuery was used for the UI. For details on usage, customization, etc., read the wiki page of the project.

For those who can’t read Malayalam, here is a sample wiki created by the wiki2cd program from 10 selected articles of the English Wikipedia.

The Malayalam Wikipedia community hopes that this is a big step towards reaching the majority of the people who do not have internet access. If printed, these 500 articles would run to at least 5000 pages. The CD ROM also includes information about commonly used free software tools for Malayalam computing. Some writing tools and fonts are distributed on the same CD ROM.

Thanks to Malayalam Wikipedia for giving me this great opportunity to work on this project.

The ISO image of the CD is available here for download.

21 thoughts on “Malayalam Wikipedia releases selected articles on CD”

  1. Santhu,

    Your hard work and commitment to build up a strong base for the e-Malayalam will never go unnoticed. Initiatives like this would require tremendous amount of hard work, technical knowledge and a good vision.
    You are wonderful when you dont stay for the credits but when you move on to the next projects once things like these are done
    All the best buddy!

  2. hai! wiki cd is excelent. how to run the programme(win2cd)? through which software i can run it?python?

  3. thanks for your great work….you bring technology & knowledge closer to and empower our people..thanks once again….

  4. Hello,

    I’d like to express my respect for your work. Sadly, my Malayalam is ~46 years old and I can’t read or speak it anymore. Yet I still hope to have the opportunity to learn it again.

    Greetings from Germany,

