Stemming with Haskell

Last week we worked on building a small search engine with Haskell. As you might know, a search engine needs an index to search against, and usually stemming as well, so that people can search for variants of a word and still get accurate results.
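
To make the idea concrete, here is a minimal sketch of an index keyed on stems rather than on raw words. Everything in it is an assumption made for illustration: toyStem is a hypothetical stand-in that strips a handful of suffixes (it is not Snowball and not part of our library), and the Index type, buildIndex and search are invented names.

```haskell
module Main where

import Data.Char (toLower)
import Data.List (isSuffixOf)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical stand-in for a real stemmer: lower-cases the word and strips
-- the first matching suffix. Snowball's rules are far more thorough.
toyStem :: String -> String
toyStem word = go ["ing", "ed", "es", "s"]
  where
    lower = map toLower word
    go [] = lower
    go (s:rest)
      -- only strip when at least three characters would remain
      | s `isSuffixOf` lower && length lower > length s + 2 =
          take (length lower - length s) lower
      | otherwise = go rest

type DocId = Int

-- Maps each stem to the set of documents containing a word with that stem.
type Index = Map.Map String (Set.Set DocId)

-- Index every word of every document under its stem.
buildIndex :: [(DocId, String)] -> Index
buildIndex docs = Map.fromListWith Set.union
  [ (toyStem w, Set.singleton docId) | (docId, body) <- docs, w <- words body ]

-- Stem the query word the same way before looking it up.
search :: Index -> String -> [DocId]
search index query =
  Set.toList (Map.findWithDefault Set.empty (toyStem query) index)

main :: IO ()
main = do
  let index = buildIndex
        [ (1, "fishing boats")
        , (2, "he fished all day")
        , (3, "cooking rice")
        ]
  print (search index "fishes") -- prints [1,2]
```

Because both the documents and the query go through the same stemming function, a search for "fishes" finds the documents that only contain "fishing" or "fished".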

Fortunately for us, there are already good libraries and tools out there. So instead of writing everything from scratch, we made a small library on top of Snowball’s libstemmer_c, plus a very (very!) rough start of a Sphinx client (more about that in a later post).

We’ve released the library on Hackage, so check out stemmer 0.1.

A small code example to give you a taste…

module Main where

import qualified NLP.Stemmer as Stemming
import Control.Exception (bracket)
import Control.Monad (unless)
import System.IO (hSetBuffering, stdout, BufferMode(NoBuffering))

main :: IO ()
main = do
    hSetBuffering stdout NoBuffering -- so the "> " prompt shows up before getLine blocks
    putStrLn "Enter a sentence to stem, an empty line to stop."
    -- bracket calls Stemming.delete even if the loop exits with an exception (e.g. EOF on stdin)
    bracket (Stemming.new Stemming.English) Stemming.delete stemUserInput

-- Read lines, stem every word and echo the result, until an empty line is entered.
stemUserInput :: Stemming.Stemmer -> IO ()
stemUserInput stemmer = do
    putStr "> "
    line <- getLine
    unless (null line) $ do
        stems <- mapM (Stemming.stem stemmer) (words line)
        putStrLn $ "< " ++ unwords stems
        stemUserInput stemmer

Save this as Main.hs, then build and run it with something like

$ ghc --make Main.hs -o stemmer
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking stemmer ...
$ ./stemmer
Enter a sentence to stem, an empty line to stop.
> The fishes worked forever with their fins
< The fish work forev with their fin
> Stemming with Haskell
< Stem with Haskel

The library was pretty easy to implement, and writing it was also a nice exercise in using Haskell’s Foreign Function Interface.
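
For readers who have not used the FFI before, the sketch below shows the general shape of a binding. It is not the code from our library: to stay runnable without extra dependencies it imports strlen from the C standard library, and the wrapper name cStringLength is made up for this example. Functions from libstemmer_c would be imported in the same style, with the program additionally linked against that library.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CSize(..))

-- Declare the C function together with the Haskell type we give it.
-- strlen only reads its argument, so an "unsafe" (cheap) call is fine here.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Hypothetical wrapper: marshal a Haskell String to a temporary C string,
-- call into C, and convert the result back to an Int.
cStringLength :: String -> IO Int
cStringLength s = withCString s $ \cstr ->
  fromIntegral <$> c_strlen cstr

main :: IO ()
main = cStringLength "stemming" >>= print -- prints 8
```

The pattern is the same for larger bindings: a foreign import per C function, and small Haskell wrappers that handle the marshalling so callers never touch raw pointers.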

