One of the most important NLP projects is the spelling corrector, in this post you can find a simple version of a Python spelling corrector based of Peter Norvig tutorial
The idea
Given a word(w), find find it”s correction(c) even if it was the word itself
How it works
Some Probability Theory
We will say that we are trying to find the correction c out of all corrections, that maximizes the probability of thee original word w:
argmaxc P(c|w) = argmaxc P(w|c) P(c)
This equation has 3 parts
- P(c) : how likely the word c appear in English … this is called the language model
- P(w|c) : how likely is it that the writer will type the word w when the word c is intended … called the error model
- argmaxc : this is the control mechanism, to enumerate all feasible values of c and then give the highest probability
The Algorithm
- Train the model by reading all the in the training data (text file contains the training set) and count how many times each word occur … create the uni-gram language model and use the add one smoothing with new words which has not been seen before in the training set
- calculate the edit distance between the given word and the possible corrections the edits can be deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another) or insertion (add a letter)
- choose the smallest edit distance from the original word then choose the most probable word in the training set according to the highest P(c)
The Code
This code from Peter Norvig tutorial in Python
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
To run the code
you need to download the sample training set big.txt and place it in the same working folder of your project and to test it type
>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'