<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tupil Code Blog &#187; stemmer</title>
	<atom:link href="http://blog.tupil.com/tag/stemmer/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.tupil.com</link>
	<description>(Get up early, code often)</description>
	<lastBuildDate>Wed, 02 Sep 2009 19:38:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Stemming with Haskell reloaded</title>
		<link>http://blog.tupil.com/stemming-with-haskell-reloaded/</link>
		<comments>http://blog.tupil.com/stemming-with-haskell-reloaded/#comments</comments>
		<pubDate>Sat, 19 Jul 2008 14:46:48 +0000</pubDate>
		<dc:creator>Eelco Lempsink</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[stemmer]]></category>

		<guid isPermaLink="false">http://blog.tupil.com/?p=23</guid>
		<description><![CDATA[Thanks to the nice discussion with Reinier Lamers on the previous post, I&#8217;ve updated and released the stemmer library with a more Haskell-like interface.  As a point of reference, here&#8217;s a new version of the example from the previous post.

import NLP.Stemmer
import Control.Monad &#40;unless&#41;
import System.IO &#40;hSetBuffering, stdout, BufferMode&#40;NoBuffering&#41;&#41;
&#160;
main :: IO &#40;&#41;
main = do
   [...]]]></description>
			<content:encoded><![CDATA[<p>Thanks to the nice discussion with Reinier Lamers on the <a href="http://blog.tupil.com/?p=22">previous post</a>, I&#8217;ve updated and <a href='http://hackage.haskell.org/cgi-bin/hackage-scripts/package/stemmer'>released</a> the stemmer library with a more Haskell-like interface. <span id="more-23"></span> As a point of reference, here&#8217;s a new version of the example from the previous post.</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="code"><pre class="haskell"><span style="color: #06c; font-weight: bold;">import</span> NLP<span style="color: #339933; font-weight: bold;">.</span>Stemmer
<span style="color: #06c; font-weight: bold;">import</span> Control<span style="color: #339933; font-weight: bold;">.</span><span style="color: #cccc00; font-weight: bold;">Monad</span> <span style="color: green;">&#40;</span>unless<span style="color: green;">&#41;</span>
<span style="color: #06c; font-weight: bold;">import</span> System<span style="color: #339933; font-weight: bold;">.</span><span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span>hSetBuffering<span style="color: #339933; font-weight: bold;">,</span> stdout<span style="color: #339933; font-weight: bold;">,</span> BufferMode<span style="color: green;">&#40;</span>NoBuffering<span style="color: green;">&#41;</span><span style="color: green;">&#41;</span>
&nbsp;
main <span style="color: #339933; font-weight: bold;">::</span> <span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span><span style="color: green;">&#41;</span>
main <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    <span style="font-weight: bold;">putStrLn</span> <span style="">&quot;Enter a sentence to stem, an empty line to stop.&quot;</span>
    hSetBuffering stdout NoBuffering <span style="color: #5d478b; font-style: italic;">-- to print a prompt</span>
    stemUserInput
&nbsp;
stemUserInput <span style="color: #339933; font-weight: bold;">::</span> <span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span><span style="color: green;">&#41;</span>
stemUserInput <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    <span style="font-weight: bold;">putStr</span> <span style="">&quot;&gt; &quot;</span>
    string <span style="color: #339933; font-weight: bold;">&lt;-</span> <span style="font-weight: bold;">getLine</span>
    unless <span style="color: green;">&#40;</span>string <span style="color: #339933; font-weight: bold;">==</span> <span style="">&quot;&quot;</span><span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="color: #06c; font-weight: bold;">do</span> 
        <span style="font-weight: bold;">putStrLn</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="color: green;">&#40;</span><span style="color: #339933; font-weight: bold;">++</span><span style="color: green;">&#41;</span> <span style="">&quot;&lt; &quot;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="font-weight: bold;">unwords</span> <span style="color: #339933; font-weight: bold;">$</span> 
                               stemWords English <span style="color: #339933; font-weight: bold;">$</span> 
                               <span style="font-weight: bold;">words</span> string
        stemUserInput</pre></td></tr></table></div>

<p>You see?  Much nicer: no more C-like pointer administration cruft that you wouldn&#8217;t expect in Haskell, just a simple (pure) function <code>stemWords</code> (line 17) which, given an algorithm, stems a list of strings.  (I suppose a next version of the library should have an implementation for bytestrings as well.)</p>
<p>As the attentive reader might have noticed, the example above is not semantically equal to the previous one, since there is no sign the stemmer is constructed only once and deleted at the very end. To explain the implementation of the library, I&#8217;d like to show how you can still use the more C-like interface, minus the tedious pointer administration.</p>
<p>Let me introduce <code>withStemmer</code>, inspired by <a href='http://www.haskell.org/pipermail/glasgow-haskell-users/2002-July/003681.html'>withHMatrix</a>.</p>

<div class="wp_syntax"><div class="code"><pre class="haskell">withStemmer <span style="color: #339933; font-weight: bold;">::</span> Algorithm <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: green;">&#40;</span>Stemmer <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">IO</span> a<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">IO</span> a
withStemmer algorithm action <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    stemmer <span style="color: #339933; font-weight: bold;">&lt;-</span> new algorithm
    result  <span style="color: #339933; font-weight: bold;">&lt;-</span> action stemmer
    delete stemmer
    <span style="font-weight: bold;">return</span> result</pre></div></div>
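
<p>One thing the definition above does not cover is an exception thrown by the action: <code>delete</code> would then never run. A minimal sketch of an exception-safe variant, using <code>bracket</code> from <code>Control.Exception</code>, could look like this. Note that <code>FakeStemmer</code> and the <code>new</code> and <code>delete</code> below are hypothetical stand-ins so the example runs without the C library; they are not the library&#8217;s real types.</p>

<div class="wp_syntax"><div class="code"><pre class="haskell">import Control.Exception (bracket)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical stand-in for the C stemmer handle: a flag that records
-- whether the "stemmer" is still allocated.
newtype FakeStemmer = FakeStemmer (IORef Bool)

new :: IO FakeStemmer
new = fmap FakeStemmer (newIORef True)

delete :: FakeStemmer -&gt; IO ()
delete (FakeStemmer alive) = writeIORef alive False

-- bracket runs delete whether the action returns normally or throws.
withStemmer :: (FakeStemmer -&gt; IO a) -&gt; IO a
withStemmer = bracket new delete

main :: IO ()
main = do
    -- Leak the handle on purpose, only to inspect the flag afterwards.
    FakeStemmer alive &lt;- withStemmer return
    readIORef alive &gt;&gt;= print</pre></div></div>

<p>Running this prints <code>False</code>: the stand-in has been deleted by the time <code>withStemmer</code> returns. With a real foreign pointer you would of course never let the handle escape like this.</p>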

<p>Using <code>withStemmer</code> I can now repeat the example above, but with the semantics of the example of the <a href="http://blog.tupil.com/?p=22">previous post</a>.  (For a quick scan: only lines 1, 9, 11, 12 and 16 to 18 changed.)</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre class="haskell"><span style="color: #06c; font-weight: bold;">import</span> NLP<span style="color: #339933; font-weight: bold;">.</span>Stemmer<span style="color: #339933; font-weight: bold;">.</span>C
<span style="color: #06c; font-weight: bold;">import</span> Control<span style="color: #339933; font-weight: bold;">.</span><span style="color: #cccc00; font-weight: bold;">Monad</span> <span style="color: green;">&#40;</span>unless<span style="color: green;">&#41;</span>
<span style="color: #06c; font-weight: bold;">import</span> System<span style="color: #339933; font-weight: bold;">.</span><span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span>hSetBuffering<span style="color: #339933; font-weight: bold;">,</span> stdout<span style="color: #339933; font-weight: bold;">,</span> BufferMode<span style="color: green;">&#40;</span>NoBuffering<span style="color: green;">&#41;</span><span style="color: green;">&#41;</span>
&nbsp;
main <span style="color: #339933; font-weight: bold;">::</span> <span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span><span style="color: green;">&#41;</span>
main <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    <span style="font-weight: bold;">putStrLn</span> <span style="">&quot;Enter a sentence to stem, an empty line to stop.&quot;</span>
    hSetBuffering stdout NoBuffering <span style="color: #5d478b; font-style: italic;">-- to print a prompt</span>
    withStemmer English stemUserInput
&nbsp;
stemUserInput <span style="color: #339933; font-weight: bold;">::</span> Stemmer <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span><span style="color: green;">&#41;</span>
stemUserInput stemmer <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    <span style="font-weight: bold;">putStr</span> <span style="">&quot;&gt; &quot;</span>
    string <span style="color: #339933; font-weight: bold;">&lt;-</span> <span style="font-weight: bold;">getLine</span>
    unless <span style="color: green;">&#40;</span>string <span style="color: #339933; font-weight: bold;">==</span> <span style="">&quot;&quot;</span><span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="color: #06c; font-weight: bold;">do</span>
        string' <span style="color: #339933; font-weight: bold;">&lt;-</span> <span style="font-weight: bold;">mapM</span> <span style="color: green;">&#40;</span>stem stemmer<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="font-weight: bold;">words</span> string
        <span style="font-weight: bold;">putStrLn</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="color: green;">&#40;</span><span style="color: #339933; font-weight: bold;">++</span><span style="color: green;">&#41;</span> <span style="">&quot;&lt; &quot;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="font-weight: bold;">unwords</span> string'
        stemUserInput stemmer</pre></td></tr></table></div>

<p>Notice that I&#8217;m using <code>NLP.Stemmer.C</code> (line 1) and that the stemming now must be done inside the IO monad (line 16).  In practice this is probably a mere inconvenience, but a pure Haskell interface is of course much nicer&#8230; Introducing to the stage <code>unsafePerformIO</code> (organ sounds: dum, dum, duuum).</p>
<p>I&#8217;ve defined a nice &#8216;unsafe&#8217; version of withStemmer as a helper:</p>

<div class="wp_syntax"><div class="code"><pre class="haskell"><span style="color: #5d478b; font-style: italic;">{-# NOINLINE withStemmer #-}</span>
withStemmer <span style="color: #339933; font-weight: bold;">::</span> Algorithm <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: green;">&#40;</span>C<span style="color: #339933; font-weight: bold;">.</span>Stemmer <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">IO</span> a<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">-&gt;</span> a
withStemmer algorithm action <span style="color: #339933; font-weight: bold;">=</span> unsafePerformIO <span style="color: #339933; font-weight: bold;">$</span> 
    C<span style="color: #339933; font-weight: bold;">.</span>withStemmer algorithm action</pre></div></div>

<p>And now I can easily define two very nice functions:</p>

<div class="wp_syntax"><div class="code"><pre class="haskell">stem <span style="color: #339933; font-weight: bold;">::</span> Algorithm <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">String</span> <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">String</span>
stem algorithm input <span style="color: #339933; font-weight: bold;">=</span> 
    withStemmer algorithm <span style="color: green;">&#40;</span>\stemmer <span style="color: #339933; font-weight: bold;">-&gt;</span> C<span style="color: #339933; font-weight: bold;">.</span>stem stemmer input<span style="color: green;">&#41;</span>
&nbsp;
stemWords <span style="color: #339933; font-weight: bold;">::</span> Algorithm <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: green;">&#91;</span><span style="color: #cccc00; font-weight: bold;">String</span><span style="color: green;">&#93;</span> <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: green;">&#91;</span><span style="color: #cccc00; font-weight: bold;">String</span><span style="color: green;">&#93;</span>
stemWords algorithm input <span style="color: #339933; font-weight: bold;">=</span> 
    withStemmer algorithm <span style="color: green;">&#40;</span>\stemmer <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="font-weight: bold;">mapM</span> <span style="color: green;">&#40;</span>C<span style="color: #339933; font-weight: bold;">.</span>stem stemmer<span style="color: green;">&#41;</span> input<span style="color: green;">&#41;</span></pre></div></div>

<p><code>stemWords</code> is there for efficiency reasons, since I suppose stemming a list of words is a common action. With <code>map (stem algorithm) ...</code> a stemmer is created and deleted for every single word, so it&#8217;s much nicer (and a bit faster) to &#8216;keep the stemmer alive&#8217; and use a <code>mapM</code> internally.</p>
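
<p>To make the difference concrete, here is a small self-contained sketch that counts how often a stemmer would be constructed in both styles. Everything in it is a hypothetical stand-in: <code>toyStem</code> just chops a trailing &#8217;s&#8217; (it is not the Snowball algorithm) and <code>withToyStemmer</code> only bumps a counter where the real code would call <code>new</code>.</p>

<div class="wp_syntax"><div class="code"><pre class="haskell">import Data.IORef (IORef, newIORef, modifyIORef, readIORef)

-- Hypothetical toy stemmer: drops one trailing 's'. Not Snowball.
toyStem :: String -&gt; String
toyStem w
    | length w &gt; 1 &amp;&amp; last w == 's' = init w
    | otherwise                     = w

-- Stand-in for withStemmer: counts every "construction" of a stemmer.
withToyStemmer :: IORef Int -&gt; ((String -&gt; IO String) -&gt; IO a) -&gt; IO a
withToyStemmer counter action = do
    modifyIORef counter (+1)      -- where the real code would call new
    action (return . toyStem)     -- nothing to delete in this toy

main :: IO ()
main = do
    perWord &lt;- newIORef 0
    shared  &lt;- newIORef 0
    let ws = words "the fishes worked with their fins"
    -- Style 1: one stemmer per word, like map (stem algorithm).
    a &lt;- mapM (\w -&gt; withToyStemmer perWord (\stem -&gt; stem w)) ws
    -- Style 2: one stemmer for the whole list, like stemWords.
    b &lt;- withToyStemmer shared (\stem -&gt; mapM stem ws)
    print (a == b)
    readIORef perWord &gt;&gt;= print
    readIORef shared &gt;&gt;= print</pre></div></div>

<p>Both styles give the same result, but the first one constructs six stemmers for the six words and the second only one.</p>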
<p>So, <code>unsafePerformIO</code>, eh? It took me a couple of tries to get to this interface, but I think it turned out pretty okay.  A word of advice: make sure that you read the <a href='http://www.haskell.org/haskellwiki/GHC:FAQ#When_is_it_safe_to_use_unsafe_functions_such_as_unsafePerformIO.3F'>tips</a> and <a href='http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO-Unsafe.html#v%3AunsafePerformIO'>docs</a> carefully.  Also, don&#8217;t mix multiple <code>unsafePerformIO</code> calls; in my case, at least, that was a sure recipe for segfaults ;)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tupil.com/stemming-with-haskell-reloaded/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Stemming with Haskell</title>
		<link>http://blog.tupil.com/stemming-with-haskell/</link>
		<comments>http://blog.tupil.com/stemming-with-haskell/#comments</comments>
		<pubDate>Mon, 14 Jul 2008 14:55:41 +0000</pubDate>
		<dc:creator>Eelco Lempsink</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[Snowball]]></category>
		<category><![CDATA[stemmer]]></category>

		<guid isPermaLink="false">http://blog.tupil.com/?p=22</guid>
		<description><![CDATA[Last week we worked on building a small search engine with Haskell. As you might know, searching requires an index to search in, and possibly stemming, so that people can search for variants of a word and still get accurate results.
Fortunately for us, there are already good libraries and tools out there [...]]]></description>
			<content:encoded><![CDATA[<p>Last week we worked on building a small search engine with Haskell. As you might know, searching requires an <em>index</em> to search in, and possibly <a href='http://en.wikipedia.org/wiki/Stemming'>stemming</a>, so that people can search for variants of a word and still get accurate results.</p>
<p>Fortunately for us, there are already good libraries and tools out there to help us. So instead of trying to write everything from scratch, we made a small library based on <a href='http://snowball.tartarus.org/'>Snowball&#8217;s libstemmer_c</a> and a very (very!) rough start of a <a href='http://www.sphinxsearch.com/'>Sphinx</a> client (more about that in a later post).</p>
<p>We&#8217;ve released the library on <a href='http://hackage.haskell.org/'>Hackage</a>, so check out <a href='http://hackage.haskell.org/cgi-bin/hackage-scripts/package/stemmer'>stemmer 0.1</a>.</p>
<p>A small code example to give you a taste&#8230;</p>

<div class="wp_syntax"><div class="code"><pre class="haskell"><span style="color: #06c; font-weight: bold;">module</span> Main <span style="color: #06c; font-weight: bold;">where</span>
&nbsp;
<span style="color: #06c; font-weight: bold;">import</span> <span style="color: #06c; font-weight: bold;">qualified</span> NLP<span style="color: #339933; font-weight: bold;">.</span>Stemmer <span style="color: #06c; font-weight: bold;">as</span> Stemming
<span style="color: #06c; font-weight: bold;">import</span> Control<span style="color: #339933; font-weight: bold;">.</span><span style="color: #cccc00; font-weight: bold;">Monad</span> <span style="color: green;">&#40;</span>unless<span style="color: green;">&#41;</span>
<span style="color: #06c; font-weight: bold;">import</span> System<span style="color: #339933; font-weight: bold;">.</span><span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span>hSetBuffering<span style="color: #339933; font-weight: bold;">,</span> stdout<span style="color: #339933; font-weight: bold;">,</span> BufferMode<span style="color: green;">&#40;</span>NoBuffering<span style="color: green;">&#41;</span><span style="color: green;">&#41;</span>
&nbsp;
main <span style="color: #339933; font-weight: bold;">::</span> <span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span><span style="color: green;">&#41;</span>
main <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    stemmer <span style="color: #339933; font-weight: bold;">&lt;-</span> Stemming<span style="color: #339933; font-weight: bold;">.</span>new Stemming<span style="color: #339933; font-weight: bold;">.</span>English
    <span style="font-weight: bold;">putStrLn</span> <span style="">&quot;Enter a sentence to stem, an empty line to stop.&quot;</span>
    hSetBuffering stdout NoBuffering <span style="color: #5d478b; font-style: italic;">-- to print a prompt</span>
    stemUserInput stemmer
    Stemming<span style="color: #339933; font-weight: bold;">.</span>delete stemmer
&nbsp;
stemUserInput <span style="color: #339933; font-weight: bold;">::</span> Stemming<span style="color: #339933; font-weight: bold;">.</span>Stemmer <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #cccc00; font-weight: bold;">IO</span> <span style="color: green;">&#40;</span><span style="color: green;">&#41;</span>
stemUserInput stemmer <span style="color: #339933; font-weight: bold;">=</span> <span style="color: #06c; font-weight: bold;">do</span>
    <span style="font-weight: bold;">putStr</span> <span style="">&quot;&gt; &quot;</span>
    string <span style="color: #339933; font-weight: bold;">&lt;-</span> <span style="font-weight: bold;">getLine</span>
    unless <span style="color: green;">&#40;</span>string <span style="color: #339933; font-weight: bold;">==</span> <span style="">&quot;&quot;</span><span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="color: #06c; font-weight: bold;">do</span> 
        string' <span style="color: #339933; font-weight: bold;">&lt;-</span> <span style="font-weight: bold;">mapM</span> <span style="color: green;">&#40;</span>Stemming<span style="color: #339933; font-weight: bold;">.</span>stem stemmer<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="font-weight: bold;">words</span> string
        <span style="font-weight: bold;">putStrLn</span> <span style="color: #339933; font-weight: bold;">$</span> <span style="">&quot;&lt; &quot;</span> <span style="color: #339933; font-weight: bold;">++</span> <span style="font-weight: bold;">unwords</span> string'
        stemUserInput stemmer</pre></div></div>

<p>Save this to Main.hs and then do something like<br />
<code><br />
$ ghc --make Main.hs -o stemmer<br />
[1 of 1] Compiling Main             ( Main.hs, Main.o )<br />
Linking stemmer ...<br />
$ ./stemmer<br />
Enter a sentence to stem, an empty line to stop.<br />
&gt; The fishes worked forever with their fins<br />
&lt; The fish work forev with their fin<br />
&gt; Stemming with Haskell<br />
&lt; Stem with Haskel<br />
</code></p>
<p>It was pretty easy to implement this library and also a nice exercise in using <a href='http://www.cse.unsw.edu.au/~chak/haskell/ffi/'>Haskell's Foreign Function Interface</a>.</p>
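<p>If you haven&#8217;t seen the FFI before, this is roughly what a binding looks like. It&#8217;s a minimal sketch that binds to <code>sin</code> from the C standard library rather than to libstemmer_c, so it compiles without any extra C code; the name <code>c_sin</code> is just my choice for the example.</p>

<div class="wp_syntax"><div class="code"><pre class="haskell">{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

-- The constructor has to be in scope for use in a foreign import.
import Foreign.C.Types (CDouble(..))

-- Bind the C function sin from math.h as a pure Haskell function.
foreign import ccall unsafe "math.h sin"
    c_sin :: CDouble -&gt; CDouble

main :: IO ()
main = print (c_sin 0)</pre></div></div>

<p>A binding to a library like libstemmer_c follows the same pattern, only with pointers and <code>IO</code> in the types, since its functions allocate and mutate C state.</p>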
]]></content:encoded>
			<wfw:commentRss>http://blog.tupil.com/stemming-with-haskell/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>
