Simple Spelling Corrector project

One of the most important NLP projects is the spelling corrector. In this post you will find a simple Python spelling corrector based on Peter Norvig's tutorial.

 
The idea

Given a word (w), find its correction (c), even if the correction turns out to be the word itself.
 

How it works

 

Some Probability Theory

We will say that we are trying to find the correction c, out of all candidate corrections, that maximizes the probability of c given the original word w:

argmax_c P(c|w) = argmax_c P(w|c) P(c)

(By Bayes' theorem, P(c|w) = P(w|c) P(c) / P(w); the denominator P(w) is the same for every candidate c, so it can be dropped.)

This equation has three parts:

  • P(c) : how likely the word c is to appear in English text; this is called the language model
  • P(w|c) : how likely it is that the writer typed the word w when the word c was intended; this is called the error model
  • argmax_c : the control mechanism, which enumerates all feasible values of c and picks the one with the highest probability
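As a toy illustration (not part of Norvig's code), here is the argmax selection with made-up probabilities for correcting the typo "teh":

```python
# Toy numbers, invented purely for illustration: a tiny "language model"
# P(c) and "error model" P(w|c) for the typo w = 'teh'.
p_c = {'the': 0.05, 'then': 0.002, 'them': 0.003}            # P(c)
p_w_given_c = {'the': 0.01, 'then': 0.0001, 'them': 0.0001}  # P(w|c)

# argmax_c P(w|c) * P(c)
best = max(p_c, key=lambda c: p_w_given_c[c] * p_c[c])
print(best)  # the
```

Here 'the' wins because both its language-model and error-model probabilities dominate the other candidates.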

The Algorithm 

  1. Train the model by reading all the words in the training data (a text file containing the training set) and counting how many times each word occurs. This creates the uni-gram language model; add-one smoothing handles new words that have not been seen in the training set.

  2. Calculate the edit distance between the given word and the possible corrections. An edit can be a deletion (remove one letter), a transposition (swap two adjacent letters), an alteration (change one letter to another), or an insertion (add a letter).

  3. Choose the candidates with the smallest edit distance from the original word, then pick the most probable of them according to the highest P(c) in the training set.
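The add-one smoothing from step 1 can be seen in isolation with a tiny three-word corpus (a sketch using the same defaultdict trick as the full code below):

```python
import collections

# Every word's count starts at 1, so a word never seen in training
# still gets a nonzero count instead of a zero probability.
model = collections.defaultdict(lambda: 1)
for w in ['the', 'the', 'cat']:
    model[w] += 1

print(model['the'])     # 3  (two occurrences + the smoothing 1)
print(model['unseen'])  # 1  (never seen, still counted once)
```

This is why the trained model can still rank a correction it has barely (or never) seen, rather than ruling it out entirely.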

The Code

This code is from Peter Norvig's tutorial, in Python:
 
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Counts start at 1 rather than 0: this is the add-one smoothing,
    # so unseen words never end up with zero probability.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
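For a word of length n, edits1 generates n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, i.e. 54n + 25 candidate strings before deduplication. A quick self-contained check (the same construction as edits1, but returning a list so duplicates are kept):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1_list(word):
    # Same four edit types as edits1, but as a list so that
    # duplicate strings are not collapsed by a set.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return deletes + transposes + replaces + inserts

print(len(edits1_list('something')))  # 54*9 + 25 = 511
```

This is why correct() only goes two edits deep: the candidate space grows very quickly, and known() is used to keep only candidates that actually appear in the training data.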

 
To run the code

You need to download the sample training set big.txt and place it in the same working folder as your project. To test it, type:

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'
 
