Simple Spelling Corrector project

One of the most important NLP projects is the spelling corrector, in this post you can find a simple version of a Python spelling corrector based of Peter Norvig tutorial

The idea

Given a word(w), find find it”s correction(c) even if it was the word itself

How it works

Some Probability Theory

We will say that we are trying to find the correction c out of all corrections, that maximizes the probability of thee original word w:

argmax_c P(c|w) = argmax_c P(w|c) P(c)

This equation has 3 parts

P(c) : how likely the word c appear in English … this is called the language model

P(w|c) : how likely is it that the writer will type the word w when the word c is intended … called the error model

argmax_{c :}this is the control mechanism, to enumerate all feasible values of c and then give the highest probability

The Algorithm

Train the model by reading all the in the training data (text file contains the training set) and count how many times each word occur … create the uni-gram language model and use the add one smoothing with new words which has not been seen before in the training set
calculate the edit distance between the given word and the possible corrections the edits can be deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another) or insertion (add a letter)
choose the smallest edit distance from the original word then choose the most probable word in the training set according to the highest P(c)

The Code

This code from Peter Norvig tutorial in Python

import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

To run the code

you need to download the sample training set big.txt and place it in the same working folder of your project and to test it type

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'

Rania Nabil

Robotics and IT geek

Simple Spelling Corrector project

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply