The series has been put on hold for now. I’m not happy with how the writing is going so for. There’s probably a better way to present the material that I need to think about.
The following list of posts in this series will be updated as I add more posts.
In the previous post, we looked at Levenshtein distances and proved a simple lemma about delete edits. Today we will look at Levenshtein automata for fixed words for a fixed maximal edit distance . The automaton will recognize all words that are at most distance away from .
Some Terminology
We will use to denote the alphabet that characters of our strings come from. does not include the sentinel character .
For a character , we will use the notation to denote a transition from the state to on input character . will denote the set of transitions .
Levenshtein Automaton for a Fixed Word
Definition: The non-deterministic Levenshtein automaton for a string with maximal edit distance is defined as follows.
- The states of the automaton are . Note: is not exponentiation. It’s just syntax.
- is the only start state.
- The final states are for all .
- The transitions are:
- (for matches).
- (for inserts).
- (for substitutions).
- (for deletes).
Example: The figure below shows the automaton .
It is easy to see that any path from to into a stream of matches and character edits and that the atomaton allows for at most character edits. Rigorous proofs of this can be found in the literature.
Removing Epsilon Transitions
Using the lemma about delete edits from the previous post, we can easily remove the epsilon transitions. We simply replace the epsilon transitions with the following and add some additional final states.
- (for deletes).
We add the states from which we could traverse to some via epsilon transitions to the set of final states.
- The final states are for all where .
Example:
The figure below shows the automaton for abac
with without epsilon transitions.
Single Column of Final States
For removing epsilon transitions, we had to add additional final states in case the input text was too short. We can go back to final states being in a single column if we use the sentinel character to signal the end of the text and the end of the input.
For simplicity, we extend the alphabet to . We will add some additional states for .
Given an input string of length , we will feed extra characters to the automaton till we have fed exactly characters.
Example:
The figure below shows the automaton for abac
with without epsilon
transitions which allows for the sentinel character as an input.