Soundex is a phonetic indexing algorithm. It is used to search/retrieve words having similar pronunciation but slightly different spelling. Soundex was developed by Robert C. Russell and Margaret K. Odell. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. It is also described in Donald Knuth’s The Art of Computer Programming. The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government.
The soundex code for a word is an english alphabet followed by a number of digits. The algorithm is explained with examples here
By this algorithm, if my name is written as Santhosh , Santosh , Santhos or Santos , the soundex code remains same and it is S532
Soundex had many limitations and sometimes creates false positive or false negative errors. There are variants for soundex and one important variant is Metaphone alogirthm by Lawrence Philip . Metaphone have another improved version called Double Metaphone.
Well, it works fine, but only for English. Just like our languages also have varying spelling. But more than spelling, in India , we have another issue to be addressed: Words(often nouns) getting transliterated among Indian Languages. Let me give some examples: In railway reservation chart, your name will be written in English as well as in Hindi. You are from Kerala(or some other state) and your name is transliterated to some other language. The only thing remain same is its pronunciation. It will be great if we can search on this data based on pronunciation, right?
We see a lot of discussions happening around e-governance and other computerization projects such as national UID etc now a days. Projects that handle Indic text heavily will require efficient search and string processing algorithms.
You got a list of names in Bengali and you dont know Bengali but you know Malayalam. Obviously you can’t search on Bengali text using Malayalam. But can’t we develop such algorithm?
So our requirement is this:
- Language Independent Search on Indic text.
- Comparison should be based on pronunciation
- Should be tolerant to spelling variation
The above discussed soundex algorithm just fits to the solution. Only issue is that algorithm is defined only for English. So time to design soundex for our languages!.
Original Soundex algorithm is not multilingual. And not designed for language independent indexing. And the digits to represent phonetic categories for Indian languages will not fit into 6. We have more phonetical families. So our algorithm will not be exact “localization” of English soundex, but we will use the concept. Let us call it as IndicSoundex 🙂 . Infact Metaphone algorithm uses 16 families for English.
One of the characteristcs of Indian languages is that all languages share same phonetic features. We have vowels a, aa, i, ii, u…. then consonant families ka, cha, ta, tha, pa.. etc. What we need to do is mapping all these sets to a common representation which is independent of language. And we call that representation as soundex for an Indic word.
While grouping and mapping Indic letters to phonetic codes, the following are facts are taken into consideration.
- Group short and long vowel to a single code. e and ee is considered as equal
- Consider half consonants as full consonants. For this ignore halants.
- Group consonant families. ka, kha,ga,gha, nga becomes a single family. Same is the case of cha, ta, tha,pa
- Group ra, Ra,
- Group la,La, zha
- Group sa,Sa,sha
When I grouped like this I got 20 groups. I have prepared a table for all Indic letters and corresponding code here http://thottingal.in/soundex/soundex.html
- For each letter in the word except first letter, get the corresponding soundex digit from the character map, which is nothing but a table like this.
- If the letter is not found in character map, the soundex digit for that letter is 0
- Duplicate consecutive soundex codes are skipped. ie, effectively क्क will be considered as क.
- Replace first digit with first alpha character.
- remove all 0s from the soundex code.
- Return soundex code padded to the required length (ie, if required length of code is 5 and soundex is സBCD, then soundex returned will be സBCD0.
From the above algorithm , the soundex code for സന്തോഷ് is സLKES000. and for सन्तौष , it is सLKES000 . So if I need to compare सन्तौष and സന്തോഷ് and need a positive result, we should do something in comparison logic and there by making the comparison language independent
An example :
കാര്തിക്, കാര്ത്തിക്, കാര്തിഗ് = കAPKBF00
- Compare the two string without calculating the soundex codes. If there are same return 0
- Calculate the sounedx codes for both strings, if the match return 1. ie, both strings are from same language and sounds alike
- If the soundex of strings has different fist letter and rest of the part is same, check whether the first letters are with same soundex digit. If so, both words from different languages, but sounds alike. Return code for this will be 2
- If none of the above conditions match, return -1, indicating both strings are completely different.
Use http://libindic.org/Soundex to experiment with this algorithm. Following screenshot shows how കാര്ത്തിക് and કાર્તિક્ are found to be phonetically similar or “sounds alike”
Ok, we got the soundex code. The soundex code itself is not useful. We need to implement a search utility based on this. As I explained above, an intra-indic search program. We will see it later.
Feel free to comment on algorithm and please suggest any improvement we can do.