$29.99
Objectives
➔ Naive Approach for String Matching
➔ Introduction
➔ Boyer-Moore algorithm
➔ Galil Rule Optimization
➔ Complexity Analysis
➔ Boyer-Moore-Horspool algorithm
➔ Real-world applications of Boyer-Moore algorithm
➔ Comparison with other string matching algorithms
Naive approach to String matching
Text: TCACT, Pattern: ACT
T C A C T
A C T
T C A C T
A C T
T C A C T
A C T
Time Complexity: O(m*n) n - length of Text, m - length of pattern
Introduction
● The Boyer-Moore (BM) algorithm is a string matching algorithm that was developed by Robert S. Boyer and J Strother Moore in 1977. It is a powerful and widely used algorithm for finding occurrences of a pattern within a text, and is particularly effective when the pattern is long or has many repeating characters.
● This makes it an attractive choice for many applications that require efficient string matching. It has been widely used in applications such as text editors, search engines, and bioinformatics, where efficient string matching is crucial.
- Bad Character Heuristic
Case: 1 The mismatch becomes a match
T G A C T G A C T C A C T
A C T C
T G A C T G A C T C A C T
A C T C
T
P
T
P
-
Case: 2 Pattern moves past the mismatch character
T G A C T G A C T C A C T
A C T C
T G A C T G A C T C A C T
A C T C
T
P
T
P
-
Construction of shift table (last occurrence of character in the pattern)
Pattern - ACTC
c A C T G
delta1(c) 0 3 2 -1
shift = max(1, delta1(Text(c))) c - mismatch character
-
Shift table construction code
Time complexity: O(n) Space complexity: O(|Σ|)
n - length of the pattern
|Σ| - total number of characters in the text
Case 1: Another occurrence of matched suffix in the pattern
t
A C T A A T T T A T C T A T
A T C T A T
T P
t
A C T A A T T T A T C T A T
A T C T A T
T
P
Case 2: A suffix of t in T, matches with a prefix in P
t
A C T A A T T T A T C T A T
A T C T A T
A C T A A T T T A T C T A T
A T C T A T
T
P
T
P
Case 3: P moves past t (case 1 and case 2 don’t satisfy)
t
A C T A T C T A A T C T A T
C T A A T
A C T A T C T A A T C T A T
C T A A T
T
P
T
P
Galil extension to Boyer-Moore algorithm
Skips comparison of a know failure
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
G C G G C G G C
T
P
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
G C G G C G G C
T
P
Boyer-Moore algorithm - Good suffix Heuristic
k Pattern delta2(k)
1 ATCT̅AT 2
2 A̅ T̅ CTAT 4
3 A̅ T̅ CTAT 4
4 A̅ T̅CTAT 4
5 A̅T̅ CTAT 4
Construction of shift table Pattern - ATCTAT
shift = delta2(k)
k - len of the pattern suffix that matched
Boyer-Moore algorithm - Good suffix Heuristic
Shift table construction code
Time complexity: O(n) Space complexity: O(n) n - length of the pattern
Text - GTTATAGCTGATCGCGGCGTAGCGGCGAA
Pattern - GTAGCGGCG Good suffix shift table
k pattern delta2(k)
1 GTAGCGG̅CG 2
2 GTAGC̅G̅GCG 3
3 GTAG̅C̅G̅GCG 3
4 G̅TAGCGGCG 8
5 G̅TAGCGGCG 8
6 G̅TAGCGGCG 8
7 G̅TAGCGGCG 8
8 G̅TAGCGGCG 8
c A T C G
delta1(c) 2 1 7 8
Bad character shift table
shift =
max(max(1,j-delta1(Text(shift+j))), delta2(k))
Step 1:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
G T A G C G G C G
T
P
delta1(T) = 1, delta2(0) = 0; shift = max(8-delta1(T), 0) = max(8-1, 0) = 7 Step 2:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
G T A G C G G C G
T
P delta1(Text(7+5)) = delta1(C)=7, delta2(3) = 3; shift = max(max(1,5-delta1(C)), 3) = max(max(1, -2), 3) = max(1,3) = 3 Step 3:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
G T A G C G G C G
T
P
delta1(Text(10+2)) = delta1(C) = 7, delta2(6) = 8; shift = max(max(1,2-7), 8) = max(1,8) = 8 Step 4:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
G T A G C G G C G
T
P
Boyer-Moore algorithm
String matching code
Time complexity: O(m+n) Space complexity: O(n + |Σ|)
n - length of the pattern,
|Σ| - total number of alphabets in the text
Boyer-Moore Horspool algorithm
Published in 1980 as an extension to Boyer-Moore algorithm.
● Only applies the bad character heuristic when matching the string
● Makes 2 changes to the bad character heuristic
○ Omits the last character when building the shift table
○ When a mismatch is detected, always use the character in the Text string that is aligned with the last character in the pattern to determine the amount of shift ● It is more efficient than Boyer-Moore for small alphabets sizes.
● It is more easier to implement.
Applications of Boyer-Moore Algorithm
● Used in GNU’s grep
● Used in the Golang strings library
● Also part of the standard C++ library since C++17
● Search in text editors and word processors
● Code editors and IDEs
● File and data compression
Comparative Analysis
demo.ipynb
References
● A Fast String Searching algorithm https://dl.acm.org/doi/pdf/10.1145/359842.359859
● On improving the worst case running time of the Boyer–Moore string matching algorithm https://doi.org/10.1145%2F359146.359148
● https://dearxxj.github.io/post/4/
● Ch10 Information Retrieval: Data Structures & Algorithms http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap10.htm
● Good suffix rule in Boyer Moore algorithm explained simply
● Practical fast searching in strings https://onlinelibrary.wiley.com/doi/10.1002/spe.4380100608
THANK YOU!