Stemming with Haskell reloaded
Thanks to the nice discussion with Reinier Lamers on the previous post, I’ve updated and released the stemmer library with a more Haskell-like interface. As a point of reference, here’s a new version of the example from the previous post.
 1  import NLP.Stemmer
 2
 3  import Control.Monad (unless)
 4  import System.IO (hSetBuffering, stdout, BufferMode(NoBuffering))
 5
 6  main :: IO ()
 7  main = do
 8      putStrLn "Enter a sentence to stem, an empty line to stop."
 9      hSetBuffering stdout NoBuffering -- to print a prompt
10      stemUserInput
11
12  stemUserInput :: IO ()
13  stemUserInput = do
14      putStr "> "
15      string <- getLine
16      unless (string == "") $ do
17          putStrLn $ (++) "< " $ unwords $ stemWords English $ words string
18          stemUserInput
You see? Much nicer: no more C-like pointer administration cruft that you wouldn’t expect in Haskell, just a simple (pure) function ‘stemWords’ (line 17) which, given an algorithm, stems a list of strings. (I suppose a next version of the library should have an implementation for bytestrings as well.)
As the attentive reader might have noticed, the example above is not semantically equal to the previous one, since there is no sign the stemmer is constructed only once and deleted at the very end. To explain the implementation of the library, I’d like to show how you can still use the more C-like interface, minus the tedious pointer administration.
Let me introduce withStemmer, inspired by withHMatrix.
withStemmer :: Algorithm -> (Stemmer -> IO a) -> IO a
withStemmer algorithm action = do
    stemmer <- new algorithm
    result <- action stemmer
    delete stemmer
    return result
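One caveat with a definition of this shape: if the action throws an exception, delete is never reached and the stemmer leaks. A minimal sketch of an exception-safe variant, using bracket from Control.Exception, could look like the following. Note that Stemmer, new and delete here are hypothetical stand-ins (a counter of live handles) and not the library's real C bindings, and that withStemmerSafe and leakedAfterFailure are names made up for this illustration.

```haskell
import Control.Exception (ErrorCall (..), bracket, throwIO, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for the C stemmer handle: it only tracks
-- how many handles are currently alive.
newtype Stemmer = Stemmer (IORef Int)

-- Stand-in for the library's 'new' (assumption: the real one takes an Algorithm).
new :: IORef Int -> IO Stemmer
new live = modifyIORef' live (+ 1) >> return (Stemmer live)

-- Stand-in for the library's 'delete'.
delete :: Stemmer -> IO ()
delete (Stemmer live) = modifyIORef' live (subtract 1)

-- Same shape as withStemmer, but 'bracket' runs 'delete' even when
-- the action throws.
withStemmerSafe :: IORef Int -> (Stemmer -> IO a) -> IO a
withStemmerSafe live = bracket (new live) delete

-- Runs a failing action and reports how many handles are still alive.
leakedAfterFailure :: IO Int
leakedAfterFailure = do
    live <- newIORef 0
    _ <- try (withStemmerSafe live (\_ -> throwIO (ErrorCall "boom")))
             :: IO (Either ErrorCall ())
    readIORef live
```

With bracket the count of live handles is back to zero after the failure; with the plain do-block version the delete step would be skipped.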
Using withStemmer I can now repeat the example above, but with the semantics of the example from the previous post. (For a quick scan: only lines 1, 9, 16 and 17 changed.)
 1  import NLP.Stemmer.C
 2  import Control.Monad (unless)
 3  import System.IO (hSetBuffering, stdout, BufferMode(NoBuffering))
 4
 5  main :: IO ()
 6  main = do
 7      putStrLn "Enter a sentence to stem, an empty line to stop."
 8      hSetBuffering stdout NoBuffering -- to print a prompt
 9      withStemmer English stemUserInput
10
11  stemUserInput :: Stemmer -> IO ()
12  stemUserInput stemmer = do
13      putStr "> "
14      string <- getLine
15      unless (string == "") $ do
16          string' <- mapM (stem stemmer) $ words string
17          putStrLn $ (++) "< " $ unwords string'
18          stemUserInput stemmer
Notice that I’m using NLP.Stemmer.C (line 1) and that the stemming now must be done inside the IO monad (line 16). In practice this is probably a mere inconvenience, but a pure Haskell interface is of course much nicer… Introducing to the stage unsafePerformIO (organ sounds: dum, dum, duuum).
I’ve defined a nice ‘unsafe’ version of withStemmer as a helper:
{-# NOINLINE withStemmer #-}
withStemmer :: Algorithm -> (C.Stemmer -> IO a) -> a
withStemmer algorithm action = unsafePerformIO $ C.withStemmer algorithm action
And now I can easily define two very nice functions:
stem :: Algorithm -> String -> String
stem algorithm input = withStemmer algorithm (\stemmer -> C.stem stemmer input)

stemWords :: Algorithm -> [String] -> [String]
stemWords algorithm input =
    withStemmer algorithm (\stemmer -> mapM (C.stem stemmer) input)
stemWords is there for efficiency reasons, since I suppose stemming a list of words is a common action. It’s much nicer (and a bit faster) to ‘keep the stemmer alive’ and use a mapM internally than to let the user do map stem ... and construct a fresh stemmer for every word.
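To make the difference concrete, here is a small sketch that counts how often a stemmer gets constructed in each style. Everything in it is a mock made up for illustration: Stemmer carries no C handle, withStemmer only bumps a counter, and stem just lower-cases its input instead of doing real stemming. The names perWord, shared and constructions are hypothetical as well.

```haskell
import Data.Char (toLower)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for the C stemmer handle.
data Stemmer = Stemmer

-- Mock withStemmer: counts constructions; there is nothing to delete.
withStemmer :: IORef Int -> (Stemmer -> IO a) -> IO a
withStemmer built action = do
    modifyIORef' built (+ 1) -- "construct" a stemmer
    action Stemmer

-- Toy "stemming": lower-case the word. Not a real stemming algorithm.
stem :: Stemmer -> String -> IO String
stem _ = return . map toLower

-- One stemmer per word, which is what mapping a per-word wrapper amounts to.
perWord :: IORef Int -> [String] -> IO [String]
perWord built = mapM (\w -> withStemmer built (\s -> stem s w))

-- One stemmer for the whole list, the stemWords approach.
shared :: IORef Int -> [String] -> IO [String]
shared built ws = withStemmer built (\s -> mapM (stem s) ws)

-- Returns (constructions with perWord, constructions with shared).
constructions :: [String] -> IO (Int, Int)
constructions ws = do
    a <- newIORef 0
    b <- newIORef 0
    _ <- perWord a ws
    _ <- shared b ws
    (,) <$> readIORef a <*> readIORef b
```

For a three-word input, constructions reports three constructions for the per-word style and one for the shared style.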
So, unsafePerformIO, eh? It took me a couple of tries to get to this interface, but I think it turned out pretty okay. A word of advice: make sure that you read the tips and docs carefully. Also, don’t mix multiple unsafePerformIO calls; that was a sure recipe for segfaults, at least in my case ;)