New version of Stuttgart Finite State Transducer

The Malayalam morphology analyser(mlmorph) I am actively developing since 2016 is based on Stuttgart Finite State Transducer(SFST) formalism. SFST was developed by Dr Helmut Schmid. Since SFST had minimal developer apis and no functional python APIs, I was using Helsinki Finite State Transducer(HFST) toolkit. HFST has python binding and better tooling and development support. HFST has SFST backend so there was no issue in using SFST formalism.

But HFST development has slowed down in past years. In 2017, I had to switch back to SFST for compiling my automata because HFST’s compile support for SFST was broken. That complicated my build system, but still I was able to use the python binding of HFST. But the development and release cycles of HFST further slowed down and python bindings were not released for python 3.8 or later versions. So mlmorph was stuck with python 3.8 and this was a big pain point since developers trying to use mlmorph found it not working in their python versions. It became more serious when Ubuntu and such popular distros moved to later python versions as default ones. I tried to contact the upstream developers. They were trying to help, but there were no active developers. Helping HFST to release python bindings was not easy because of very complicated build system they have.

Software is good only if you can run it in a system. So to save mlmorph from this dependency issue, I started looking at SFST source code and see if I can improve it. My intention was to write a python binding for SFST itself. SFST is written in C++. I have not written in serious C++ ever since I left my engineering college. I tried Pybind11 for writing the binding. Pybind11 documentation is very detailed, but it prefers CMake build system. SFST was using Automake. So I first migrated SFST source code from GNU Make to CMake. Once everything was compiling, pybind11 worked smoothly and I got the python binding working.

But it was not over. I integrated github workflows to build python wheels for major operating system and architectures and the build started failing because the VC++ is not able to compile SFST in windows. This was because of deprecated usage of hash_map. To use with C++11 or C++17, the correct way is to use unordered_map. So I migrated the codebase to unordered_map. This was not easy because the primary data structure for the transducer is this hashmap. To get everything working, I had to change all char * usages to std::string.

The SFST progamming interfaces were expecting File points to write results, instead of returning the actual transducer result. This made me to use a temporary file in python binding and then read and return its content to python client code. That was not efficient. I had to write a transducer path traversal logic to return a string representation of its result for reuse in client side.

And at last, python bindings were ready, python wheels were bulding all possible python versions and operating systems supported by cibuildwheel.

I wrote to Dr Schmid to upstream these changes. To my surprise he just updated the SFST website with link to my SFST version as latest version of SFST. Thanks Dr Schmid.

So I released a new version of SFST(version 1.5.0), and its python bindings at https://pypi.org/project/sfst/.

I was not expecting this much troubles to get it working, but the effort was worth it. I am happy that mlmorph is no longer stuck with HFST development and just depends on SFST that I maintain. And I brushed up my C++ skills a bit.

mlmorph  fst 
comments powered by Disqus