Information Theory for spell checking
July 29, 2011
On my flights to and from Portland I read part of “Introduction to Information Theory : Signals and Noise” by James Pierce.
In the the section on symbol frequency, Claude Shannon’s original work with bigrams and trigrams is described, and several English looking words are generated.
I’ve finished code that generates bigrams, it’s below.
I’ve not used Haskell’s System.Random in years, still trying to figure out how to thread a Random value around to generate random likely looking words.
module NGram where
import qualified Data.Map as M
import Data.List
-- count the frequency of letters inside words
type Unimap = M.Map Char Integer
builder c m = M.insertWith (+) c 1 m
unimapped = foldr builder M.empty singleinput
where singleinput = foldr1 (++) inputs
unimap :: String -> Unimap
unimap string = foldr builder M.empty string
main = do bigstring <- readFile "/usr/share/dict/words"
print $ show (unimap bigstring)
In writing this, I realize that ngrams for any n can be done with the same code, but I haven’t quite figured out that part.
Got any suggestions for improvement?