My recent discovery of (Non-deterministic) Universal Levenshtein Automata (NULA) have been really engaging creatively. In this series of posts, I am going to collect together some definitions, theorems, and proofs about Levenshtein Distances and NULAs, and then provide a second set of definitions so that we can use bitwise operations for the implementation of NULAs.
The series has been put on hold for now. I’m not happy with how the writing is going so for. There’s probably a better way to present the material that I need to think about.
The following list of posts in this series will be updated as I add more posts.
Levenshtein Distance
Definition: The Levenshtein Distance between two strings is the minimal number of character edits required to change one string into the other. A character edit is inserting a character, deleting a character, or substituting a character for another.
Example:
The Levenshtein distance between "abc"
and "abd"
is 1.
The character edit substitutes 'c'
with 'd'
.
Example:
The Levenshtein distance between "abc"
and "abdc"
is 1.
The character edit inserts 'd'
before 'c'
.
Example:
The Levenshtein distance between "abc"
and "ac"
is 1.
The character edit delete the 'b'
.
Example:
The Levenshtein distance between "abc"
and "acd"
is 2.
There are a few ways of doing two edits to "abc"
to get "acd"
.
One way is to substitute the 'b'
for 'c'
and the 'c'
for 'd'
.
Another way is to delete the 'b'
and insert the 'd'
.
This shows that the character edits involved need not be unique.
In some areas of applications, it is useful to consider more kinds of edits. For example, in spell checking inputs from a keyboard, you might consider swapping two adjacent characters (transpositions) as an edit, and in optical character recognition you might consider splitting one character into two and merging two characters into one as edits.
Some Terminology
Let be a string of length . We will represents the -th character of as . So .
We will represent character edits as follows.
- Substituting the -th character will be represented as .
- Deleting the -th character will be represented as .
- An insert edit will be represented as , where is the -th insert edit.
We will not use or for strings, and it will be clear from context whether we are using as an index or as an insert edit.
We will use as a sentinel character when needed, often to represent the end of strings. Replacting will not be allowed for substitution edits.
Example:
Let be the string "abc"
.
Then = 'a'
, = 'b'
, and = 'c'
.
One way of minimally editing to get "acd"
can be represented as , where = 'c'
and = 'd'
.
Another way of minimally editing to get "acd"
can be represented as , where = 'd'
.
A Lemma About Delete Edits
Lemma: Let and be two strings. Given a set of minimal character edits to transform to , we can rearrange the edits so that every maximal sequence of delete edits are followed by a character match or are at end of the string.
Proof: We will shift maximal sequences of delete edits to the right till they are followed by a character match or they are at the end of the string.
Suppose we have the sequence in , i.e. we are deleting characters starting at character and then inserting a character. This can be rewritten as . This changes the edits so that we are inserting a character first, and then deleting characters starting at .
Suppose we have the sequence in with . Here we are deleting characters starting at character and then substituting the -th character. This can be rewritten as . This changes the edits so that we are substituting the -th character, and then deleting characters starting at . Note that cannot equal since that would mean we found a smaller set of edits than the minimal set of edits we started with.
The above two steps can be repeated for maximal sequences of delete edits in until all maximal sequences of delete edits are followed by a match or are at the end of the string.
Update
We do not need to worry about delete edits followed by a transposition.
A delete followed by a transposition is two edits, but so is a substitution
followed by a match followed by a delete.
If our string is = abc
, and we are matching with cb
, we can write cb
as .
Longer delete sequences followed by a transposition , can be
similarly thought of as , where .