Keywords and Phrases: Spelling Detection and Correction, Approximate String Matching, Word Processing, Logic Programming
Second, it may be the case that normalization will require the removal of formatting symbols. This is particularly true of electronic documents created with modern word processing software with concurrent formatting, for there may be no unformatted copy of the document to work with. Some of the formatting symbols, the so-called 'control characters' are non-printable characters. These formatting symbols are by far the easiest to work with for their encodings are mutually exclusive with those of the text. More problematic is the set of printable formatting characters, for they may serve a variety of functions in the electronic document. To illustrate, hyphens which function as delimiters (as in the case of social security numbers and compound words) are essential, whereas those which signify word-breaks at lines' end are extraneous.
Third, grammatical and punctuation symbols must be handled in some fashion. Again, this operation must be accomplished with finesse, for the symbols may be used for a variety of purposes. For example, apostrophes which serve as 'scare quotes' must be extracted, while those which indicate contractions would normally be preserved. In addition, characters which are used as word boundaries (periods, commas, spaces, etc.) must be stripped from the adjoining character strings before comparison.
Last, we may require some standardization for alternative spellings ('judgment' vs. 'judgement') in order to reduce number of dictionary entries. In such a case, one alternative might be normalized to another. This might be of value in handling both British and American spellings with one lexicon, for example. One might wish to extend the principle of standardized spelling even further by reducing word tokens to root forms and affixes.
These operations illustrate the sorts of activities that are involved in converting an electronic document into a canonical form. Once in this form, presumably all word-candidates are identified and the document is amenable to verification and correction. First, the verification procedure will identify those word tokens whose spelling cannot be verified. Next, a correction procedure will suggest a set of alternative spellings for each unverified token. In the ideal case, the correction would be largely automatic, where each misspelled word would be replaced with the intended, correctly spelled surrogate. However, this goal is unrealistic given that no computer can recover the writer's intentions. The best that we can hope for is an 'educated guess' based upon the syntactic and semantic context.
For example, in the case of verification (the determination that there is in fact a spelling error), there are at least two radically different approaches which have currency in the literature. On the one hand, there are those who favor a deterministic approach. Exemplary techniques include pattern matching and table look-up. In this case, verification amounts to the logical intersection of a set of 'target' words and a lexicon. (From now on, we will use the term 'lexicon' rather than 'dictionary' when referring to word lists without accompanying definitions). The main advantage of this approach is that it is exact (assuming a lexicon of sufficient size). One major disadvantage is that it is difficult to do quickly without a primary-resident dictionary of considerable size. By consuming large amounts of primary storage, the pattern matching technique favors machines with larger internal memories.
The principles behind lexical searching are well understood, and the algorithms are highly refined. For a general discussion, see Knuth [21]. Several sub-linear string search algorithms are available [9][22] which presumably have found their way into current software products. In addition, considerable attention has been given to the organization of the lexicon to optimize searching. This organization has been based on partial [18] and total [7] hashing, tree-structured domains [28][30], and segmented lists [26][40], to name but a few.
The second common approach to spelling verification is probabilistic. This is typified by constituent analysis. In this case, no attempt is made to match a target string with a lexical entry. Rather, algorithms are employed to associate with each target a 'coefficient of peculiarity'. Those targets which seem most peculiar are identified as likely candidates for misspellings. This type of analysis can be much faster than search routines, although they are obviously less precise.
The most common form of constituent analysis is 'n-gram' or 'string segment'
analysis. An n-gram of a string
c1c2...cL
is any segment of length n
Once the target word has been decomposed into its constituent n-grams, each n- gram is assigned a weight which reflects its frequency of occurrence in some corpus (presumably, the lexicon in use). In the simplest form of constituent analysis, these weights are then used directly in the construction of an overall index of peculiarity for the containing word. This is basically the technique used in the TYPO system [27].
Of course, there are many variations on this theme. Zamora, et al [47], extend the analysis by calculating an n-gram error probability which is the ratio of the frequency of occurrence of n-grams in misspellings to its overall frequency. By using error probabilities rather than simple frequency measures, it is possible to avoid flagging many correctly spelled, though orthographically unusual, words.
It should be remembered that n-gram analysis is a subclass of constituent analysis. There are also verification techniques based upon graphemic analysis [46] (a grapheme is taken to be an orthographical approximation of a phonetic representation). By incorporating phonetic information into the technique, there exists a facility for identifying misspellings which are a result of 'spelling by sound' (e.g., 'numonia' vs. 'pneumonia').
In addition, there are constituent analyses based upon information-theoretic principles. An example of this approach appears in Pollock and Zamora [33]. In this case, 'keys' are associated with each target. These keys are transformations of the target with some of the orthographical redundancy eliminated. They are based upon the general principle of data compaction which holds that it is possible to eliminate redundancy in data, without loss of meaning. In the case of Pollock and Zamora, the compaction results from the elimination of duplicate characters, while preserving the original orders of first occurrence. These sorts of compaction techniques have their modern origins in the SOUNDEX system [29]. While this type of constituent analysis is associated with spelling correction (see below), it is easy to imagine an extension to verification as well: one could construct a peculiarity index for the compacted representation.
In addition, it should be mentioned that there are also hybrid techniques which have features common to both look-up and constituent analyses. 'Stripping' is such a technique. In this case, constituent analysis is used to identify and remove legitimate affixes from word tokens. Then a table look-up procedure is employed to determine whether the root of the word is correctly spelled. This allows the verifier to check a large number of variations of the same root-word, with minimal lexicon. Stripping has been used in several spelling checkers, most notably the Stanford spelling checker, SPELL. For further details, see Peterson [31] and Turba [43].
This by no means exhausts the range of spelling verifiers, whether conceived or implemented. However, it does call attention to the general considerations which guide their design.
Spelling correction, unlike verification, must of necessity involve a lexicon, for the notion of 'correctly spelled word' is defined in terms of membership in one or more lexical lists. Usually, whenever primary storage is at a premium (as in current microcomputer environments), several lexical lists are used. Typically, these lists are arranged in a hierarchical fashion. A small lexicon which consists of very common words is kept in primary memory. This is complemented with a much larger, master lexicon on secondary storage. In some cases, tertiary lists are also employed for application-specific (e.g., 'professional dictionaries' used in medical and legal offices) and user- specific terminology.
Lexical lists can be organized in a variety of ways. The simplest form is a sequential list where words are stored alphabetically, just as they would appear in a dictionary. In order to increase efficiency, one would normally index this list so that the partitions of interest can be accessed directly. The two most common elementary indexing techniques are those which index by collating sequence and those which index by length. Many algorithms employ both of these indexing techniques in list organization.
The two other data structures used for indexing word lists are tree structures, either at the word or character level, and hashing, whether partial or complete. For extensive treatment of the search characteristics of these techniques, see Knuth [21]. Turba [43] provides detailed discussion of their use in the organization of lexicons for spelling checking.
We assume, then, that a spelling corrector will have access to one or more lexical lists, organized in such a way that efficient retrieval is possible. The objective of the corrector can be stated quite simply: find the subset of lexical entries which are 'similar' to the misspelled targets. The trick is to operationalize the similarity relation in such a way that the resulting procedure is co-extensive with the relation. These issues are generally subsumed under the topic of approximate string matching to which we now turn.
Spelling correction, on the other hand, involves the association of a target word (presumably, one which is misspelled) with a set of words in the lexicon which satisfy a similarity relation. We will think of the set of words which are in this similarity relation with the target as a similarity set. The essence of spelling correction can now be seen to be a two-fold activity involving the application of appropriate similarity relations to create the desired similarity sets, for all misspelled words in the document. Again, in practice, similarity relations are implemented by means of approximate string matching procedures.
Formally, approximate string matching is a procedure which associates non- identical character strings with one another on the basis of some criterion (or set of criteria) of similarity. That is, if S is a set of strings, we can think of the k-wise clustering of S into similarity sets S1,S2,...,Sk such that (V-Si) (V-s)(V-s') (s,s'{-Si <-> s~s'), where '~' denotes a similarity relation, and s,s' are distinct strings. There may, of course, be many identifiable similarity relations over S, each one of which would correspond to a distinct clustering. Further, a single string may be a member of several similarity sets for any given similarity relation.
Faulk [15] provides a useful, straightforward categorization of similarity relations. In his view, such relations are of three main types. Positional similarity is a relation which refers to the degree to which matching characters in two strings are in the same position. Ordinal similarity refers to the degree to which the matching characters are in the same order. Finally, material similarity refers to the degree to which two strings consist of the same character tokens. Of course, Faulk's characterization deals exclusively with orthographical similarity relations. We will expand upon this topic shortly.
To illustrate this classification, we borrow upon the early work of Damerau [12]. In this classic paper on data errors, Damerau reports that four single errors account for the majority of typing mistakes. These four errors result from: accidental insertion of an unwanted character, accidental omission of a desired character, accidental substitution of one character for another, and accidental transposition of adjacent characters. According to Damerau, these four sources of mistakes, taken together, account for over 80% of all typing errors. This result was confirmed by Morgan [26] and Shaffer and Hardwick [38]. A later study by Pollock and Zamora [33] suggests that over 90% of all typing errors may involve a single mistake, though not necessarily those identified by Damerau.
Brief reflection will indicate that the Damerau conditions fall neatly into the classification scheme proposed by Faulk. Specifically, accidental substitution is an instance of positional similarity; transposition is a case of material similarity; and both accidental insertion and omission constitute examples of ordinal similarity. We now define these errors as specific similarity relations.
Let O,I,S and T be sets of character strings (word tokens) which are defined with respect to the Damerau conditions of omission, insertion, substitution and transposition, respectively. Let s be an arbitrary string and y be a character. We define O,I,S and T as sets of character strings similar to c1c2c3...cn-1cn, such that
One common approach is to explicate the relation by some sort of similarity measures. As an example, the n-gram analysis techniques used for spelling verification are commonly used to create measures of similarity as well.
A number of similarity measures have appeared in the literature. Some of them provide a single type of similarity relation (i.e., in Faulk's sense). For example, Glantz [17] proposes a method which measures only positional similarity. While this method may be of use in restricted areas (e.g., OCR and literary comparison), it is generally felt [2][31] that positional similarity is too narrow to be of much use by itself in spelling correction.
Measures of material similarity have also been developed. Siderov [39] uses correlation coefficients as a measure of character similarity. Unlike positional similarity, which is thought to be too narrow, material similarity is too broad for spelling correction (all anagrams are materially similar!). Thus, measures of material similarity are unlikely to be viable strategies for determining the extension of similarity sets.
Ordinal similarity measures are by far the most common. The early work in SOUNDEX [29] is the earliest example that we know of. Refinements have been discussed frequently in the literature [1][6], and have been used with some success in restricted applications [14][10]. Current research which deals with ordinal similarity measures includes the character-node trie approach of Muth and Tharp [28] and the similarity key method of Pollock and Zamora [33].
The essential weakness of single-relation measures is that they deal with only one aspect of string similarity. It is not surprising, therefore, that current interest has shifted to agglomerative measures which incorporate several similarity measures in their design.
1) d(s,s')>_0 2) d(s,s')=0 <-> s=s' 3) d(s,s')=d(s',s) 4) d(s,s') + d(s',s") >_ d(s,s")
A popular similarity measure which is both agglomerative and metric is the so- called 'Levenshtein Metric', named after the pioneer in coding theory who first suggested its use [23]. This measure has been applied to spelling correction by Wagner, et al [25][45]. We now describe this technique based upon the presentation in Hall and Dowling [19].
Let d(i,j) be an agglomerative measure of similarity with respect to strings s=c1c2...ci and s'=c'1c'2...c'j. We define it recursively as follows:
d(0,0) = 0 d(i,j) = min[ d(i,j-1)+1, d(i-1,j)+1, d(i-1,j-1)+v(ci,c'j), d(i-2,j-2)+v((ci-1,c'j)+v(ci,c'j-1)+1 ]
v(ci,c'j)=0 <-> ci=c'j, and v(ci,c'j)=1 <-> ci=/c'j.
Note here that the terms of the minimization function approximate the Damerau conditions set forth in section 2.1. Assume that the shortest path weight is 1, for two strings whose lengths differ by 1. Further, assume that this path weight resulted from the minimizing term d(i-1,j)+1. We would take this to mean that the shorter string is related to the longer by accidental omission of a character.
While there are other metric similarity measures, for example the probabilistic metric of Fu [16], the Levenshtein metric seems to be the dominant model. However, the greatest attention is now being given to non-metric agglomerative similarity measures based upon n-gram analysis. We illustrate its use by describing the general structure of the procedure developed by Angell [2].
Let S and S' be sequences of n-grams corresponding to strings c1c2...ci and c'1c'2...c'j, respectively. We now determine the number of n-grams, k, which are common to S and S', by pairing them in order, as they appear in S and S', and discarding them after each match. We then stipulate that the measure of similarity is to be some value which is related to both k and the string lengths, say 2k/(i+j). This value may then be used to define the similarity sets for a string by simply requiring that it exceed a pre-established threshold. While this is just one particular application of n-gram analysis, variations on this theme abound [11],[34],[44],[46],[47].
In the above, we have tried to show how agglomerative measures of similarity are used to define similarity relations. In some cases, for example the Levenshtein metric, the connection between the relations and the measures is more obvious than in others (e.g., n-gram analysis). However, in any case, the use of similarity measures suffers from two defects: first, the measures are always approximations; second, they are of no immediate use in identifying the similarity sets required for spelling correction. To elaborate, while they can be of some value in determining whether two strings are similar, it isn't at all clear that they would be useful in extracting the similar strings from a lexical list in the first place. One could, of course, employ the similarity measures as a basis for organizing the list, itself, (cf., [36]), but this strikes us as undesirable for it destroys the alphabetical ordering of the lexicon.
3. THE ACCURACY OF SIMILARITY MEASURES
As we saw in the last section, spelling correction requires that we have some procedure which identifies the members of the similarity sets for our misspelled tokens. What bothers us about the conventional approaches is that they use measures of similarity to define the actual similarity relations. It is our objective to develop a spelling corrector which operationalizes the similarity relations, directly.
Our motivation comes from the observed inaccuracy of conventional approaches. This 'inaccuracy' can be formally stated in terms of the familiar evaluation measures of information retrieval theory. Assume that there are j words in a lexicon of size s which are known to be similar to a target word. Let n be the size of a similarity set actually produced. Further, let k<_n be the number of members of the similarity set which actually do stand in the
similarity relation to the token. We now define the following evaluation parameters after Salton [37]:
recall = k/j precision = k/n fallout = (n-k)/(s-j)
Our objective is to find a method of approximate string matching which has complete recall, 100% precision and zero fallout. That is, we are looking for a procedure, Pi, which will enumerate all and only those strings which are in the ith similarity relation to a target. Assume that we have a priori knowledge of the similarity set, Si, which corresponds to a certain relation with respect to some word token. We then require that (V-x)(Pi=>x <-> x{-Si). In other words, we require a procedure which is coextensive with the actual similarity relation.
One can best appreciate this goal when one refers to the descriptions of the performance of conventional approaches in the literature. For example, Pollock and Zamora [33] report that only 71% of the misspelled word targets were corrected by their method; 11.54% were miscorrected, and 17.45% were uncorrected. Yannakoudakis and Fawthrop [46] reveal that greater than 25% of the target words caused their algorithm to fail completely. Even when the algorithm worked, it was only able to suggest the correct spelling of a word token which was corrupted by a Damerau error 90% of the time. These results seem to be typical of spelling error detectors and correctors based upon measures of similarity [47]. Although Muth and Tharp [28] reported greater effectiveness, the practicality of their approach seems to be in doubt [13].
4. APPROXIMATE STRING MATCHING AS THEOREM PROVING
We propose a new approach to spelling correction which overcomes the inaccuracies of similarity measures. In our method, we use standard logic programming techniques to directly implement the desired similarity relations. We do this by creating two sets of axioms, and then using an inference engine to derive counterexamples to the claim that there is no word which is similar to a misspelled token.
The first axiom set defines the lexicon. Consider a lexicon, L, which consists of predicative expressions of the form W(x), where x is a string variable and the predicate is interpreted as 'x is a word'. We will assume that there be one such lexical axiom for each correctly spelled word of interest.
The second axiom set is a set of logical procedures P={P1,P2,...} which will define the similarity relations in question. Our intention is to directly represent the similarity relations in terms of these procedural axioms, so that a procedure can be found which finds similarity sets without approximation.
We make some general observations concerning these axioms. First, we note that all correctly spelled words in the lexicon are logical consequences of the lexical axioms. Another way of putting this is that if x is a word (i.e., if W(x) is one of the lexical axioms), then '-W(x)' is inconsistent with the axioms.
Second, if the procedural axioms, P, are complete and consistent with respect to the underlying relations, we can say that for all strings, s,s' ((W(s) & ((W(s) U P) |-s')) <-> s~s'). That is, we can show that a string is similar to a target if and only if that string is logically implied by the axioms. Eventually, we will create procedural axioms which are coextensive with the set-theoretical definitions of the Damerau relations which we gave in section 2.1. But first we will describe a general procedure for implementing similarity relations according to Faulk's classification.
4.1 Defining the Axioms in PROLOG
While a detailed discussion of PROLOG is beyond the scope of this paper, a few general remarks are in order. First, PROLOG is a logic-based language which contains a built-in inference mechanism based upon the resolution principle defined by Robinson [35]. For our purposes, the important ingredient of this mechanism is the principle of reductio ad absurdum whose tautological counterpart can be expressed as ((A->(B&-B))->-A. In semantic terms, this says that any formula which implies a contradiction must be false.
The resolution principle is defined for Horn clauses [20]. Horn clauses are elementary disjunctions with at most one unnegated atomic formula. Following convention, we will call Horn clauses with one unnegated formula headed, and those without headless. We note that the truth-functional equivalence Bv-A1v...v-An <-> (B <- (A1&...&An)) means that we can represent any Horn clause as a conditional. In the notation of PROLOG, the '<-' is represented
as ':-', and is referred to as the 'neck' of the clause. Further, the conjunction is denoted by the comma.
Finally, PROLOG makes available an infinite number of functors, one of which is the list functor. For our purposes, we will represent lists with bracket-notation. In this way, we may think of the structure '[e1,...,en]' as an ordered sequence of n elements. We manipulate lists by splitting them into a head and a tail. The head, denoted by the 'X' in the structure '[X|Y]', is the first element in the list, while the tail, 'Y', is the remaining sublist. By selective use of the list structure, we will be able to account for all three categories of similarity relations, as described by Faulk.
4.2. PROLOG Schema for General Similarity Relations
Let A={a1,...,ak} be an alphabet and A* be a set of strings over A. Let u,v,w,x{-A*. We define a string membership relation, <-, in the following way: v<-w <-> w=uvx, for some u,x{-w. Further, we define a positional string membership relation, <-k, as follows: v<-kw <-> w=uvx & |u|=k-1.
Now, we define three string difference coefficients, dm, dp, do, for material, positional and ordinal similarity, respectively. For each a{-A and w{-A*, let na(w) be the number of occurrences of a in w (i.e., the number of instances in which a<-w is satisfied). Let fm:A* X A* -> Nk be a mapping from pairs of words onto k-tuples of integers. We define this function as follows:
fm(v,w) = <|na1(v)-na1(w)|,...,|nak(v)-nak(w)|> .
We then define the material difference coefficient to be the result of summing the coefficients of the vector. That is,
k < dm(v,w) = < (|nai(v) - nai(w)|) . i=1
We observe that dm(v,w) is in effect a count of the number of times that the logical expression -(c<-v <-> c<-w) is satisfied with respect to strings v,w for any character token, c. We make this observation because we feel that it is easier to see the correlation between the PROLOG clause sets and this logical expression, than it is for the clauses and the mathematical description.
Next, given v,w{-A*. For non-null v,w, let v=i(v)v' and w=i(w)w', where i(x) is the initial character token of x. We define the positional difference coefficient, dp:A* X A* -> N, recursively:
dp(v,0/) = |v| = dp(0/,v)
dp(v,w) = dp(v',w') if i(v)=i(w)
1 + dp(v',w') if i(v)=/i(w).
Similarly, we note that dp(v,w) is a count of the number of times that the expression -(c<-kv <-> c<-kw) is satisfied.
Finally, let An be strings of length n over A. If x{-An and w{-A* then let nx(w) denote the number of times that x appears as a segment in w. We define the ordinal difference coefficient by the function, don:A* X A* -> N, such that
<
don(v,w) = < (|nx(v)-nx(w)|)
x{-An
In this case, the logical representation would involve the use of another relation for 'string-segment membership'. We omit this expression in the interest of economy.
We now turn to the PROLOG realization of these functions. It is our objective to illustrate a general approach to the types of relations described by Faulk. Once this general approach is given, we will implement more specific similarity relations which are actually of use in spelling correction.
First, we observe that the string membership relation, <-, can be explicated in terms of the PROLOG structure, '[_|_]'. That is to say, if an element is in a list (in this case, a string) it is either the head of the list or in the tail. By defining string membership recursively, we can locate the element as the head of some sublist. This is accomplished by means of the 'string_member' predicate, below. In order to determine the number of mismatches, one augments the clause set with a 'delete_token' predicate so that the same character token is not counted twice.
If the relation <-k is of interest, we simply ensure that the heads of the lists (i.e., current character tokens of the strings) are always in the same respective position. This is easier to do, for it only involves a comparison of heads of strings (rather than the head of one string with an entire string). Notice that the more general membership predicate is unnecessary in the clause set for positional similarity.
Finally, we provide a general procedure for determining the degree to which two strings have matched characters in the same order. Inasmuch as ordinal similarity has to be determined with regard to string segments (ordinal similarity defined with regard to entire strings is positional similarity), this procedure is essentially an n-gram analysis. In general, n-grams are defined recursively as n consecutive heads of segments. For present purposes, we restrict our attention to trigrams.
We now offer illustrative procedures for the PROLOG realization of the three categories of string similarity defined by Faulk.
{MATERIAL SIMILARITY}
degree_of_m_similarity(STRING1,STRING2,COEFFICIENT):-
((string_member(TOKEN,STRING1),
string_member(TOKEN,STRING2),
not TOKEN=[],
delete_token(TOKEN,STRING1,S1_SHORT),
delete_token(TOKEN,STRING2,S2_SHORT),!,
(degree_of_m_similarity(S1_SHORT,S2_SHORT,N);N=0));
N= -1),
COEFFICIENT is N+1.
string_member(X,[X|_]).
string_member(X,[_|Y]):-string_member(X,Y).
delete_token(X,[X|T],T).
delete_token(X,[_|Y],R):-delete_token(X,Y,R).
{POSITIONAL SIMILARITY}
degree_of_p_similarity([],[],0).
degree_of_p_similarity(STRING1,STRING2,COEFFICIENT):-
STRING1=[TOKEN|TAIL1],STRING2=[TOKEN|TAIL2],
degree_of_p_similarity(TAIL1,TAIL2,N),
COEFFICIENT is N+1.
degree_of_p_similarity(STRING1,STRING2,COEFFICIENT):-
STRING1=[TOKEN1|TAIL1],
STRING2=[TOKEN2|TAIL2],
not TOKEN1=TOKEN2, degree_of_p_similarity(TAIL1,TAIL2,COEFFICIENT).
{ORDINAL SIMILARITY}
test:- findall(X,trigram(STRING1,X),SEGMENT_SET1),
findall(X,trigram(STRING2,X),SEGMENT_SET2),!,
degree_of_o_similarity(SEGMENT_SET1,SEGMENT_SET2,
COEFFICIENT).
degree_of_o_similarity(SEGMENT_SET1,SEGMENT_SET2,COEFFICIENT):-
((segment_set_member(SEGMENT,SEGMENT_SET1),
segment_set_member(SEGMENT,SEGMENT_SET2),
not SEGMENT=[],
delete_segment(SEGMENT,SEGMENT_SET1,SHORT_SET1),
delete_segment(SEGMENT,SEGMENT_SET2,SHORT_SET2),!,
(degree_of_o_similarity(SHORT_SET1,SHORT_SET2,N);N=0));
N= -1),
COEFFICIENT is N+1.
trigram([X],[]).
trigram([X,Y],[]).
trigram([X|[Y|[Z|T]]],[X,Y,Z]).
trigram([H|T],Tt):-trigram(T,Tt),not Tt=[].
segment_set_member(X,[X|_]).
segment_set_member(X,[_|Y]):-segment_set_member(X,Y).
delete_segment(X,[X|T],T).
delete_segment(X,[_|Y],R):-delete_segment(X,Y,R).
4.3. PROLOG Implementation of Damerau Relations
In the previous section, we saw how one might approach the categories of similarity relations described by Faulk. We now apply these techniques to the specific similarity relations suggested by Damerau.
We begin with the set theoretical descriptions presented in section 2.1. We define the extension of the similarity sets O,I,S and T by means of PROLOG clauses which are defined with respect to a list representation of the character string, c1c2...cn. For example, the extension of O is found to be defined by
is_an_element_of_o(X):- (OMISSION} add_character(X,Y), word(Y).
where,
add_character(X,[Y|X]). add_character([U|V],[U|W]):- add_character(V,W).
This clause set is read in the following way: a character string, X, is an element of the set in question (the similarity set for string c1c2...cn which corresponds to the similarity relation of omission) if the result of adding a character to the string is itself a word. 'Adding a character' is then defined by using the PROLOG list notation. In the first clause for the predicate, 'add_character', the uninstantiated variable, Y, represents the character-slot in the first position which will take on the added character. If this clause fails, i.e. if the cause of the spelling error was not a missing character in the first position, the second clause inserts a variable in the 2nd through nth position, one by one, by systemmatically inserting the uninstantiated variable as the head of tails of lists.
As the above example illustrates, the PROLOG code is trivial once one has the set-theoretical definition of the similarity relation and a general understanding of the underlying logic behind tests for ordinal similarity. Since the Damerau condition is a special case of a test for ordinal similarity, where the only segments of interest are those to the left and right of the uninstantiated variable, the code fragment above is much simpler than that for the general case.
We now present PROLOG clauses for the remaining Damerau conditions without discussion.
{INSERTION}
is_an_element_of_i(X):-
delete_character(X,Y),
word(Y).
delete_character([X|Y],Y).
delete_character([X|U],[X|V]):-
delete_character(U,V)
{SUBSTITUTION} {TRANSPOSITION}
is_an_element_of_s(X):- is_an_element_of_t(X):-
replace_character(X,Y), transposes_to(X,Y),
word(Y). word(Y).
replace_character([X|Y],[U|Y]). transposes_to([],[]).
replace_character([X|V],[X|W]:- transposes_to([X|Y],[X|Z]):-
replace_character(V,W). transposes_to(Y,Z).
transposes_to([X|[W|Y]],[W|[X|Y]]).
The actual spelling correction is a byproduct of the resolution/unification mechanism of PROLOG. Remember that the clause sets beginning with 'is_an_element_of' are the procedural axioms. If it can be determined that the spelling of a word, w, has not been verified, we query the database with 'W(w)?' The inference engine of PROLOG then attempts to find a contradiction for the negated fact, '-W(w)'. However, the procedural axioms 'massage' w into a variety of forms (related to w by the Damerau relations) and then test each of these forms against the lexicon. If a contradiction is found, this implies that at least one of the massaged forms is a word. More specifically, the current substitution instances for the characters in the word token will define the correctly spelled word(s). In the purest sense, word tokens corrupted by the sort of mistakes described by Damerau are theorems of the spelling correction system.
4.3. Extending the Method
It should be clear from the foregoing that this approach to spelling correction is extendible to include any similarity relation which can be expressed through Horn clauses (which is, for all practical purposes, the entire set of similarity relations). For example, were we to be interested in correcting target words which were altered by unwanted repetitions of characters (which is common on keyboards with 'repeat' keys), we could easily define the similarity relation by means of a new set. Let R be a set of strings similar to c1c2...cn, except that all triple repetitions are reduced to a single character. That is,
x{-R <-> (x=(c3...cn) <- (c1=c2=c3))
v (x=(c1c4...cn) <- (c2=c3=c4))
v ...
Further, were we to try to incorporate misspellings due to homonymous morphemes, we might define a set, H, such that
x{-H <-> (x=(pc3...cn) <- f(c1c2)=p)
v (x=(c1pc4...cn) <- f(c2c3)=p)
v ...
where f would be a function which mapped one set of character strings (in this case, of length 2) onto another set (in this case, a single character) which is pronounced the same (e.g., 'PH' onto 'F' as in the incorrect spelling, 'Philipinos'). In addition, multiple errors can be easily detected by nesting the similarity relations, rather than treating them independently. If affix-stripping were found to be useful, incorporation would amount to the addition of a trivial list-manipulation operation, and so forth.
As a specific illustration, we consider the enhancement of the program to include tests for two extra characters, two missing characters and two characters transposed around a third (e.g., 'narutal' for 'natural'). These have been identified by Peterson [32] as the next most common typing errors after Damerau-type errors. We illustrate the simplicity of their inclusion by juxtapositioning them with the related Damerau tests for insertion, omission and transposition (minus those portions of the clause sets which are unaffected by the change). To wit,
Insertion Double Insertion
is_an_element_of_i(X):- is_an_element_of_di(X):- delete_character(X,Y), delete character(X,Y),
word(Y). delete character(Y,Z),
word(Z).
Omission Double Omission is_an_element_of_o(X):- is_an_element_of_do(X):-
add_character(X,Y), add_character(X,Y),
word(Y). add_character(Y,Z),
word(Z).
Transposition Transposition Around a Third
is_an_element_of_t(X):- is_an_element_of_tat(X):-
transposes_to(X,Y), tat_transposes_to(X,Y),
word(Y). word(Y).
transposes_to([],[]). tat_transposes_to([],[]).
transposes_to([X|Y],[X|Z}):- tat_transposes_to([X|Y],[X|Z]):-
transposes_to(Y,Z). tat_transposes_to(Y,Z).
transposes_to([X|[W|Y]],[W|[X|Y]]) tat_transposes_to([X|[M|[T1|T2]]],
[T1|[M|[X|T2]]]).
We note that the enhancements represent simple and intuitive modifications of the original tests, just as we would expect.
The variations are endless. The only general restriction is that the similarity relation has to be expressible in first order logic. Further, we note that the PROLOG code which follows trivially from the set-theoretical description is more concise, for its recursive procedures significantly simplify the iterative definitions.
The flexibility of the current approach applies to constraints on the search, as well. At the highest level, search is controlled by simply adding, deleting and permuting the clause sets. Beneath that level, one can build any number of logical constraints into the clauses, themselves. To illustrate, one might wish to encode 'rules of thumb' into the program. We have in mind such things as "i before e, except after c...". For whatever its limitations, this rule of thumb is expressed in a logical form, and is easily encoded into the program. These sorts of 'peremptory rules' might increase the efficiency of the correction procedure by precluding time-consuming searches for more general sorts of errors. That is, in the general case, it may be faster to preprocess the target and make a second attempt at verification than to attempt correction after the first verification failure. We would expect that this type of control structure would be primarily of use in the correction of 'errors of ignorance' (vs. typing errors).
However, the most interesting type of extension is one which includes grammatical information as well. In particular, the logic-based approach to spelling correction is consistent with the definite clause grammar model for natural language parsing. As we have suggested elsewhere [3], a simple extension of the lexical axioms, to accommodate the grammatical features of words, can be used concurrently by both a spelling corrector and transformational parser (see also, [5]).
In short, the logic-based approach seems ideally suited to the types of operations which we might require of a spelling corrector. In fact, it is easy to demonstrate that even the rival agglomerative similarity measures can be expressed within this framework. Notice that the iterative definition of the Levenshtein distance metric can be replaced with the following recursive procedure:
d([],[],0).
d([X|T1],[X|T2],D):- d(T1,T2,D).
d([X|T1],[Y|T2],D):- X=/Y,d(T1,T2,Td),D is Td+1.
d([X|T],Y,D):- d(T,Y,Td),D is Td+1.
d(Y,[X|T],D):- d(Y,T,Td),D is Td+1.
d([H1|[H2|T1]],[H2|[H1|T2]],D):- d(T1,T2,Td),D is Td+1.
Where the first clause is the boundary condition, the second clause is the string-processing procedure, and the remaining clauses define the minimizing terms for substitution, omission, insertion and transposition, respectively. It almost goes without saying that our method can accommodate the n-gram analysis described in section 2.3., for n-gram analysis is the paradigm case of Faulk's ordinal similarity. To wit,
t:-
string_length(C1,L1),
string_length(C2,L2),
findall(T,find_trigram(C1,T),B1),
findall(T,find_trigram(C2,T),B2),
count(B1,B2,K),
similarity_measure is (2*K) / (L1+L2),
find_trigram([X|[Y|[Z|T]]],[X,Y,Z]).
find_trigram([H|T],Tt):-find_trigram(T,Tt).
count(L1,L2,N):-((member(M,L1),
member(M,L2),
not M=[],
delete(M,L1,Ll1),
delete(M,L2,Ll2)),!,
(count(Ll1,Ll2,Nr);Nr=0));Nr=-1),
N is Nr+1.
member(X,[X|_]).
member(X,[_|Y]):-member(X,Y).
delete(X,[X|T],T).
delete(X,[_|Y],R):-delete(X,Y,R).
The capability of easily expressing the underlying logic of agglomerative measures of similarity gives some idea of the expressive power of the logic-based approach to spelling correction. We submit that this sort of versatility is unobtainable within the framework of the conventional approaches.
4.4. Efficiency of the Method
We think that it is obvious from the above that we have achieved our objective of complete recall, 100% precision and zero fallout, with respect to the similarity relations in use. The reason for this is that the procedures are direct translations of the actual similarity relations. No measures or approximations of similarity are needed. Further, the procedures are easily generalized to other sorts of relations regardless of the source of error (e.g., typing, ignorance of spelling conventions, etc). As we have shown above, this is a near trivial undertaking when the similarity relations fall into one of the categories defined by Faulk. We actually employ a prototype spelling corrector, like the one outlined above, in our lab. For a more complete description of an early version of the program, see [4].
While the ultimate efficiency of our method cannot be accurately predicted at this writing, some factors are known. For one, since the tests for similarity are always with respect to the lexicon, the search space is never larger than it would be with a brute force approach. Further, the number of required tests or comparisons will usually be less. For example, a generate-and-test strategy for the Damerau conditions of accidental substitution and omission would involve n(k-1) and k(n+1) tests, respectively, for a k-letter alphabet on strings of length n. These values are reduced to n and n+1, respectively, by our method. The reason for this is that the notion of 'test' is inextricably linked to lexical entry. There is simply no opportunity to test any string which is not already a word due to the way that unification works.
In terms of the actual number of lexical comparisons which need to be made, the logic programming approach can be shown to be quite parsimonious. However, the case is not as clear with regard to searching. Typically, PROLOG clause sets are indexed by predicate name and arity. Given this indexing scheme, the search characteristics should be roughly the same as with conventional algorithms defined over length-segmented lists hashed to segment boundaries and searched sequentially within segments. This is consistent with our empirical data. When we compared PROLOG and PASCAL programs which checked sixteen misspellings for Damerau errors against a 474 word lexicon on an IBM PC/AT, the PASCAL program took 29.2 seconds and the PROLOG program ran in 7.3 seconds. However, the data structure used in the PASCAL program was a length-segmented list! This result only shows that our logic-based approach to spelling correction is competitive with the a high-level approach, so long as the high-level approach is burdened with the same inefficient search strategy.
To provide some estimate of the degree of inefficiency, we processed an electronic document with four Damerau-related spelling errors. Each of the following similarity sets were generated in approximately 3 seconds when running a compiled version of our program on an IBM/PC AT with a dictionary of 4,000 words:
target similarity set
hte => {the,hate,he,ate,hoe,hue} bal => {ball,bald,balk,al,pal,bad,bag,bar,bay} warr => {war,ward,warm,warn,wars} rwd => {red,rid,rod}
On this basis, an electronic document of 5,000 words with 10% errors would require approximately 25 minutes to correct. However, there are some mitigating circumstances.
First, the next wave of PROLOG compilers will include more efficient search strategies. With sophisticated hashing schemes or the use of multiway search trees, PROLOG searches will accelerate (there is no reason, in principle, that PROLOG cannot use the same range of data structures as a high-level language). In fact, at least one current product supports B-trees. We are currently exploring these avenues.
Second, PROLOG is not well-suited for serial, von Neumann architectures. Our host system is an IBM PC, which delivers a maximum of 103 logical inferences per second (LIPS). This is between two and three orders of magnitude slower than the rate which is within current technological limits [42] and seven orders of magnitude slower than the fifth-generation machines planned by ICOT. An increase in performance of a few orders of magnitude yields run times which are competetive with the conventional approaches. Thus, if we are not obsessed with short-term efficiency, the effectiveness of the logical approach justifies continued interest. (As an aside, we mention that it takes approximately 20n logical inferences to apply all four Damerau conditions to a string of length n, so the execution time of a similar program can be quickly approximated if one knows the speed of the host PROLOG system in LIPS).
Third, we observe that most current PROLOG environments are poorly suited for integration with conventional, high-level languages. In our opinion, traditional deterministic document normalization and verification procedures are entirely adequate. While these tasks can be easily handled in PROLOG as well, it is pointless to do so. The best of both worlds can be achieved when one can readily integrate the PROLOG-based corrector with a conventional verifier. In this case, the conventional, procedural approach handles most of the work (over 80% of the word tokens, by some estimates [8][24]) while the PROLOG portion handles the more difficult problems of correction. This lack of integration is more a short-term nuisance than a long-term difficulty.
5. CONCLUSIONS
What we have described, above, is a technique for spelling correction which directly encodes the set-theoretical definitions of a set of similarity relations into a PROLOG program. This program then returns the similarity sets in question to the user in the form of a list of alternative spellings. The actual behavior of our program is much like that of any other interactive spelling checker.
We feel that there are several advantages to the logic-based approach which makes it worthy of further study. First and foremost, the method is exact. This is a consequence of the logical, top-down design. Related to this is the fact that the approach allows the designer to solve the problems of spelling correction (or spelling checking, for that matter) at the conceptual rather than the procedural level. Given a precise description of the problem, either in terms of set theory, iterative or recursive definitions, the code results from a straightforward translation.
Further, a logic-based method avoids the necessity of ad hoc procedures common to the agglomerative measures of similarity. As an illustration, consider that the complexity of converting the single Damerau condition of transposition into a distance metric prompted a separate article [25]. We maintain that this is a direct result of what we see as an unintuitive and awkward representation of the problem. Similarly, the more advanced, non-metric, n-gram analyses are built upon elaborate procedures for calculating frequency analyses, error probabilities and the like for n-grams, none of which seem prima facie related to the topic of spelling errors.
We suspect that the environment in which a spelling checker is used will have an influence upon the particular tests that are found to be useful. We would expect that accomplished spellers could rely heavily on the verification procedure alone, since most of their errors are likely to be typographical. Poor spellers, on the other hand, are likely to expect much more of a spelling checker than knowledge that an error occurred. A significant advantage of the method under discussion is that it easily admits extensions and modifications. In this regard, it is quite unlike agglomerative measures. It is not at all obvious to us how one would extend distance metrics and n-gram analyses to cover a broad enough range of spelling errors that the resulting product would be relevant to a wide variety of applications. Further, the more rigorous the spelling checking procedure, the more natural it becomes to exploit the inherent AND-parallelism in PROLOG.
In addition, we note that certain spelling errors are impossible to correct when isolated from their grammatical context. The paradigm case of such errors are targets which have undergone 'complete corruption'. This occurs when an intended spelling has been corrupted to the point that it is a legitimate spelling of another word (e.g., the noun, 'wand', corrupted to the verb, 'want'; the preposition, 'to', to the adjective, 'too'). These sorts of errors can only be detected, and corrected, if the spelling checking is done in conjunction with at least a partial parse of the containing sentence. We see no way to generalize conventional spelling checkers beyond the orthographical level. Of course, there are even more serious forms of complete corruption which cannot be detected without an understanding of the sentence (i.e., when a word is corrupted into another word with similar grammatical properties, as in 'wonder' -> 'wander'). We shall defer these topics to another forum.
Thus, we have in the present effort an attempt to provide a spelling correction facility which has the characteristics which we feel will be required in the future. What is interesting to us is not the particular details involved in developing a PROLOG-based system, for we would be satisfied with an equipotent system in any language environment, but rather that the solution follows directly from the logic of the problem. Even if the PROLOG approach fails to pass the test of time, the lessons which we can learn from its design should serve us well in providing a logical framework for products to follow.
Acknowledgment: We wish to thank M. Kunze, S. Margolis, R. Morein, G. Nagy, S. Seth, P. Smith, E. Traudt, and the anonymous referees of Information Processing and Management for their invaluable comments and suggestions.
REFERENCES
[1] Alberga, C., "String Similarity and Misspellings", Communications of the ACM, 10, pp. 302-313 (1967).
[2] Angell, R., G. Freund and P. Willett, "Automatic Spelling Correction Using a Trigram Similarity Measure", Information Processing and Management, 19, pp. 255-261 (1983).
[3] Berghel, H., "Extending the Capabilities of Word Processing Software through Horn Clause Lexical Databases", Proceedings of NCC-86, pp. 251-257 (1986).
[4] Berghel, H. and E. Traudt, "Spelling Verification in Prolog", SIGPLAN Notices, 21, pp. 12-27 (1986).
[5] Berghel, H. and J. Weeks, "On Implementing Elementary Movement Transformations with Definite Clause Grammars", Proceedings of the Fifth Phoenix Conference on Computers and Communications, IEEE Computer Society, pp. 366-370, (1986).
[6] Blair, C., "A Program for Correcting Spelling Errors", Information and Control, 3, pp. 60-67 (1960).
[7] Bloom, B., "Space/Time Trade-Offs in Hash Coding with Allowable Errors", Communications of the ACM, 13, pp. 422-426 (1970).
[8] Bourne, C., "Frequency and Impact of Spelling Errors in Bibliographic Data Bases", Information Processing and Management, 13, pp. 1-12 (1977).
[9] Boyer, R. and J. Moore, "A Fast String Searching Algorithm", Communications of the ACM, 20, pp. 762-772 (1977).
[10] Bryant, J. and S. Fenlon, "The Design and Implementation of an On-Line Index", referenced in [15].
[11] Cornew, R., "A Statistical Method of Spelling Correction", Information and Control, 12, pp. 79-93 (1968).
[12] Damerau, F., "A Technique for Computer Detection and Correction of Spelling Errors", Communications of the ACM, 7, pp. 171-176 (1964).
[13] Damerau, F., Review of [28], Computing Reviews, pp. 231 (June, 1978).
[14] Davidson, L., "Retrieval of Misspelled Names in an Airlines Passenger Record System", Communications of the ACM, 5, pp. 169-171 (1962).
[15] Faulk, R., An Inductive Approach to Language Translation", Communications of the ACM, 7, pp. 647-653 (1964).
[16] Fu, K., "Error-Correcting Parsing for Syntactic Pattern Recognition", in Klinger, et al (eds): Data Structures, Computer Graphics and Pattern Recognition, Academic Press, New York (1976).
[17] Glantz, H., "On the Recognition of Information with a Digital Computer:, Journal of the ACM, 4, pp. 178-188 (1957).
[18] Gorin, R., "SPELL: Spelling Check and Correction Program", referenced in [43].
[19] Hall, P. and G. Dowling, "Approximate String Matching", Computing Surveys, 12, pp. 381-402 (1980).
[20] Horn, A., "On Sentences which are True of Direct Unions of Algebras", Journal of Symbolic Logic, 16, pp. 14-21 (1951).
[21] Knuth, D., Sorting and Searching, Addison-Wesley, Reading (1973).
[22] Knuth, D., J. Morris and V. Pratt, "Fast Pattern Matching in Strings", SIAM Journal on Computing, 6, pp. 323-350 (1977).
[23] Levenshtein, V., "Binary Codes Capable of Correcting Deletions, Insertions and Reversals", Sov. Phys, Dokl., 10, pp. 707-710 (1966).
[24] Litecky, C. and G. Davis, "A Study of Errors, Error-Proneness, and Error Diagnosis in COBOL", Communications of the ACM, 19, pp. 33-37 (1976).
[25] Lowrance, R. and R. Wagner, "An Extension of the String-to-String Correction Problem", Journal of the ACM, 22, pp. 177-183 (1975).
[26] Morgan, H., "Spelling Correction in Systems Programs", Communications of the ACM, 13, pp. 90-94 (1970).
[27] Morris, R. and L. Cherry, "Computer Detection of Typographical Errors", IEEE Transactions on Professional Communication, PC-18, 1, pp. 54-64 (1975).
[28] Muth, F. and A. Tharp, "Correcting Human Error in Alphanumeric Terminal Input", Information Processing and Management, 13, pp. 329-337 (1977).
[29] Odell, M. and R. Russell, U.S Patents 1,261,167 (1918) and 1,435,663 (1922).
[30] Partridge, D. and E. James, "Natural Information Processing", International Journal of Man-Machine Studies, 6, pp. 205-235 (1974).
[31] Peterson, J., "Computer Programs for Detecting and Correcting Spelling Errors", Communications of the ACM, 23, pp. 676-687 (1980).
[32] Peterson, J., "A Note on Undetected Typing Errors", Communications of the ACM, 29, pp. 633-637 (1986).
[33] Pollock, J. and A. Zamora, "Automatic Spelling Correction in Scientific and Scholarly Text", Communications of the ACM, 27, pp. 358-368 (1984).
[34] Riseman, E. and A. Hanson, "A Contextual Post-Processing System for Error-Correction Using Binary N-Grams", IEEE Transactions on Computing, C-23, 5, pp. 480-493 (1974).
[35] Robinson, J., "A Machine Oriented Logic Based upon the Resolution Principle", Journal of the ACM, 12, pp. 23-41 (1965).
[36] Salton, G., Automatic Information Organization and Retrieval, McGraw-Hill, New York (1968).
[37] Salton, G. and M. McGill, Introduction to Modern Information Retrieval, McGraw Hill, New York (1983).
[38] Shaffer, L. and J. Hardwick, "Typing Performance as a Function of Text", Quarterly Journal of Experimental Psychology, 20, pp. 360-369 (1968).
[39] Siderov, A., "Analysis of Word Similarity in Spelling Correction Systems", referenced in [2].
[40] Szanser, A., "Error-Correcting Methods in Natural Language Processing", Information Processing 68, IFIPS, pp. 1412-1416 (1969).
[41] Thomas, R. and M. Kassler, "Character Recognition in Context", Information and Control, 10, pp. 43-64 (1967).
[42] Tick, E. and D. Warren, "Towards a Pipelined Prolog Processor", Proceedings of the 1984 International Symposium on Logic Programming, IEEE Computer Society Press, Washington, pp. 29-40 (1984).
[43] Turba, T., "Checking for Spelling and Typographical Errors in Computer-Based Text", Proceedings of the ACM SIGPLAN/SIGOA Symposium on Text Manipulation, ACM, New York, pp. 51-60 (1981).
[44] Ullman, J., "A Binary N-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion and Reversal Errors in Words", The Computer Journal, 20, pp. 141-147 (1977).
[45] Wagner, R. and M. Fischer, "The String-to-String Correction Problem", Journal of the ACM, 21, pp. 168-178 (1974).
[46] Yannakoudakis, E. and D. Fawthrop, "An Intelligent Spelling Error Corrector", Information Processing and Management, 19, pp. 101-108 (1983).
[47] Zamora, E., J. Pollock and A. Zamora, "The Use of Trigram Analysis for Spelling Error Detection", Information Processing and Management, 17, pp. 305-316 (1981).