Stemming with Haskell reloaded

Thanks to the nice discussion with Reinier Lamers on the previous post, I’ve updated and released the stemmer library with a more Haskell-like interface. As a point of reference, here’s a new version of the example from the previous post.

import NLP.Stemmer
import Control.Monad (unless)
import System.IO (hSetBuffering, stdout, BufferMode(NoBuffering))
 
main :: IO ()
main = do
    putStrLn "Enter a sentence to stem, an empty line to stop."
    hSetBuffering stdout NoBuffering -- to print a prompt
    stemUserInput
 
stemUserInput :: IO ()
stemUserInput = do
    putStr "> "
    string <- getLine
    unless (string == "") $ do 
        putStrLn $ (++) "< " $ unwords $ 
                               stemWords English $ 
                               words string
        stemUserInput

You see? Much nicer: no more C-like pointer administration cruft that you wouldn’t expect in Haskell, just a simple (pure) function ‘stemWords’ (line 17) which, given an algorithm, stems a list of strings. (I suppose a next version of the library should have an implementation for bytestrings as well.)

As the attentive reader might have noticed, the example above is not semantically equivalent to the previous one: nothing indicates that the stemmer is constructed only once and deleted at the very end. To explain the implementation of the library, I’d like to show how you can still use the more C-like interface, minus the tedious pointer administration.

Let me introduce withStemmer, inspired by withHMatrix.

withStemmer :: Algorithm -> (Stemmer -> IO a) -> IO a
withStemmer algorithm action = do
    stemmer <- new algorithm
    result  <- action stemmer
    delete stemmer
    return result
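
One thing the definition above does not do is run delete when the action throws an exception, so a failing action would leak the stemmer. Below is a minimal sketch of an exception-safe variant built on bracket from Control.Exception. Note that FakeStemmer, newFake, deleteFake and withFake are hypothetical stand-ins for the library's Stemmer, new, delete and withStemmer, so that the example runs without the C library; the counter of live handles exists only to make the cleanup observable.

```haskell
import Control.Exception (IOException, bracket, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for the library's Stemmer handle.
data FakeStemmer = FakeStemmer

-- Stand-in for `new`: registers one more live handle.
newFake :: IORef Int -> IO FakeStemmer
newFake live = modifyIORef' live (+ 1) >> return FakeStemmer

-- Stand-in for `delete`: unregisters the handle.
deleteFake :: IORef Int -> FakeStemmer -> IO ()
deleteFake live _ = modifyIORef' live (subtract 1)

-- Same shape as withStemmer, but bracket runs deleteFake
-- whether the action returns normally or throws.
withFake :: IORef Int -> (FakeStemmer -> IO a) -> IO a
withFake live = bracket (newFake live) (deleteFake live)

-- An action that always fails, to exercise the cleanup path.
failing :: FakeStemmer -> IO ()
failing _ = ioError (userError "boom")

-- Runs the failing action and reports how many handles are still alive.
leakedAfterFailure :: IO Int
leakedAfterFailure = do
    live <- newIORef 0
    _ <- (try (withFake live failing) :: IO (Either IOException ()))
    readIORef live
```

Under these assumptions leakedAfterFailure returns 0, because the cleanup step runs even though the action failed. With the do-block version the counter would stay at 1.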

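Using withStemmer I can now repeat the example above, but with the semantics of the example from the previous post. (For a quick scan: the substantive changes are on lines 1, 9, 16 and 17; in addition, the stemmer argument is threaded through stemUserInput on lines 11, 12 and 18.)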

import NLP.Stemmer.C
import Control.Monad (unless)
import System.IO (hSetBuffering, stdout, BufferMode(NoBuffering))
 
main :: IO ()
main = do
    putStrLn "Enter a sentence to stem, an empty line to stop."
    hSetBuffering stdout NoBuffering -- to print a prompt
    withStemmer English stemUserInput
 
stemUserInput :: Stemmer -> IO ()
stemUserInput stemmer = do
    putStr "> "
    string <- getLine
    unless (string == "") $ do
        string' <- mapM (stem stemmer) $ words string
        putStrLn $ (++) "< " $ unwords string'
        stemUserInput stemmer

Notice that I’m using NLP.Stemmer.C (line 1) and that the stemming now must be done inside the IO monad (line 16). In practice this is probably a mere inconvenience, but a pure Haskell interface is of course much nicer… Introducing to the stage unsafePerformIO (organ sounds: dum, dum, duuum).

I’ve defined a nice ‘unsafe’ version of withStemmer as a helper:

{-# NOINLINE withStemmer #-}
withStemmer :: Algorithm -> (C.Stemmer -> IO a) -> a
withStemmer algorithm action = unsafePerformIO $ 
    C.withStemmer algorithm action

And now I can easily define two very nice functions:

stem :: Algorithm -> String -> String
stem algorithm input = 
    withStemmer algorithm (\stemmer -> C.stem stemmer input)
 
stemWords :: Algorithm -> [String] -> [String]
stemWords algorithm input = 
    withStemmer algorithm (\stemmer -> mapM (C.stem stemmer) input)

stemWords is there for efficiency reasons, since I suppose stemming a list of words is a common action. Every call to stem constructs and deletes its own stemmer, so it’s much nicer (and a bit faster) to ‘keep the stemmer alive’ and use a mapM internally than to let the user do map stem ... themselves.
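
To make the difference concrete, here is a small sketch that counts how many stemmers get constructed in each style. It is an illustration under stated assumptions, not the library's code: FakeStemmer, withCounted and fakeStem are hypothetical stand-ins (fakeStem merely lowercases a word), and everything stays in IO so that the counting is easy to follow.

```haskell
import Data.Char (toLower)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for a stemmer handle.
data FakeStemmer = FakeStemmer

-- Shaped like withStemmer, but bumps a counter every time a handle is built.
withCounted :: IORef Int -> (FakeStemmer -> IO a) -> IO a
withCounted built action = do
    modifyIORef' built (+ 1)
    action FakeStemmer

-- Toy "stemming": lowercase the word. Not a real stemming algorithm.
fakeStem :: FakeStemmer -> String -> IO String
fakeStem _ = return . map toLower

-- One handle per word, which is what mapping stem over a list amounts to.
perWord :: IORef Int -> [String] -> IO [String]
perWord built = mapM (\w -> withCounted built (\s -> fakeStem s w))

-- One handle for the whole list, which is the stemWords approach.
perList :: IORef Int -> [String] -> IO [String]
perList built ws = withCounted built (\s -> mapM (fakeStem s) ws)

-- Returns (handles built per word, handles built per list).
countBuilds :: [String] -> IO (Int, Int)
countBuilds ws = do
    a <- newIORef 0
    b <- newIORef 0
    _ <- perWord a ws
    _ <- perList b ws
    (,) <$> readIORef a <*> readIORef b
```

For a three-word input such as ["Stemming", "With", "Haskell"], countBuilds yields (3, 1): three constructions in the per-word style against a single one when the handle is shared across the list.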

So, unsafePerformIO, eh? It took me a couple of tries to get to this interface, but I think it turned out pretty okay. A word of advice: make sure that you read the tips and docs carefully. Also, don’t mix multiple unsafePerformIO calls; that was a sure recipe for segfaults, at least in my case ;)

