Information Theory for spell checking

July 29, 2011

On my flights to and from Portland I read part of “Introduction to Information Theory : Signals and Noise” by James Pierce.

In the the section on symbol frequency, Claude Shannon’s original work with bigrams and trigrams is described, and several English looking words are generated.

I’ve finished code that generates bigrams, it’s below.

I’ve not used Haskell’s System.Random in years, still trying to figure out how to thread a Random value around to generate random likely looking words.

module NGram where
import qualified Data.Map as M
import Data.List
-- count the frequency of letters inside words

type Unimap = M.Map Char Integer

builder c m = M.insertWith (+) c 1 m

unimapped = foldr builder M.empty singleinput
    where singleinput = foldr1 (++) inputs

unimap        :: String -> Unimap
unimap string = foldr builder M.empty string

main = do bigstring <- readFile "/usr/share/dict/words"
          print $ show (unimap bigstring)

In writing this, I realize that ngrams for any n can be done with the same code, but I haven’t quite figured out that part.

Got any suggestions for improvement?