November 15, 2021

Boyer Moore Algorithm with Bad character Heuristic

Ubaid AhmedUAbhati

DURATION

9min

Tags

C++

Love to compete?

Join Topcoder Challenges

Pattern matching problems are very popular among string problems. There are some well-known algorithms to solve them, among them the KMP algorithm, Rabin-Karp algorithm, and the Boyer Moore algorithm, which are the most efficient and most used. We will discuss the Boyer Moore algorithm in this article. Similar to the KMP algorithm, we also do some pre-processing in the Boyer Moore algorithm.

In this algorithm, we start to check characters from the right and then move to the left. We compare characters of the pattern (let us call it P) and text (let us call it T). We get a match then we check the next characters of both P and T, and if we get a mismatch then we check whether the character in T which got the mismatch (let us call it ‘c’) is present anywhere in P. If it is present then we shift P until we get c occurrence in P aligned with T, and if it is not present then we completely shift P past that character in T. This type of shift avoids a comparison of needless characters again and again, improving the running time of the algorithm. More specifically, this is bad character heuristics, which we will discuss in detail. There is also a different version of this strategy known as good suffix heuristics, in which efficiency further improves, and it can also be used with the bad character heuristic for better implementation.

We use two different strategies (formally called heuristics) in this algorithm, which work independently or together, which reduces the search.

Bad character heuristics
Good suffix heuristics

We will discuss bad character heuristics which is used more and is easier to implement as compared to good suffix heuristics, which is more efficient but is harder to implement.

Bad character heuristics

Here we consider two cases, and we call the character of the text which is not matching with the pattern character a bad character.

Case 1: The mismatched character of text T is present in pattern P.

In this case, we will shift the pattern P until it gets aligned to the mismatched character of T.

Since we got a mismatch between the ‘R’ of text (the bad character) and ‘C’ of the pattern, but we also know that ‘R’ is present in the pattern, so it is the first case, then we will shift the pattern until it matches with ‘R’ (bad character) of the text.

Here is what we got after shifting the pattern. We shifted it because it might be the case, in some situations, that we may get a matching pattern from that position.

Case 2: The mismatch character of text T is not present in pattern P.
In this case, we will shift the pattern P until it gets past that mismatched character of T.

Here we got a mismatch, but “G” is not present in the pattern so there is no point in comparing the pattern again to any previous one, so we will shift pattern P until after “G” of text.

After the shift, we got our pattern in the above example, but there may be a case in which we didn’t get any matches. In that case we would return -1.

Implementation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#include <bits/stdc++.h>

using namespace std;
# define Total_Chars 256

// The preprocessing function
void badCharHeuristic(string str, int size, int bad_characters[Total_Chars]) {
  int i;
  // Initialize with -1
  for (i = 0; i < Total_Chars; i++)
    bad_characters[i] = -1;

  // Fill the actual value of last occurrence
  // of a character
  for (i = 0; i < size; i++) {
    bad_characters[(int) str[i]] = i;
  }
}

// Boyer Moore algorithm using bad character Heuristic
void boyermoore(string T, string P) {
  int m = P.size();
  int n = T.size();

  int bad_character[Total_Chars];

  /* Preprocessing function to  fill the bad character array */
  badCharHeuristic(P, m, bad_character);

  int idx = 0; /* idx defines how much pattern is shifted */
  while (idx <= (n - m)) {
    int j = m - 1;

    /* Reducing j of P until we gets a
    mismatch character, at this shift idx */
    while (j >= 0 && P[j] == T[idx + j])
      j--;

    /* In case if we get j=-1 which signify that
    P is present at current shift */
    if (j < 0) {
      cout << "Shift at which pattern occurs = " << idx << "\n";

      /* We will shift P such that the next
        character in T aligns with the last
        occurrence of it in P.
        To cover the case when P occur at end
        of T we need the condition idx+m < n */

      idx += (idx + m < n) ? m - bad_character[T[idx + m]] : 1;
    }
    // other case
    else
      /* In this case also, We will shift P such
        that the next character in T aligns
        with the last occurrence of it in P.
        To unsure that we get positive
        shift we are using max function.
        The negative shift may occur when the
        last occurrence of bad character in P
        is on the right side of the current
        character. */

      idx += max(1, j - bad_character[T[idx + j]]);
  }
}

int main() {
  string text = "AYRRQMGRPCRQ";
  string pattern = "RPCRQ";
  boyermoore(text, pattern);
  return 0;
}

We are using a preprocessing function to fill the bad_character array to get the last occurrence of a character which will take O(n).

In the above code, n indicates the size of text T and m indicates the size of pattern P. We are using the while loop (looping till n-m) in which we are using variable ‘idx’, which indicates how much the pattern P is shifted, and j which is used to find a mismatch. There will be two conditions, one when j becomes -1, which indicates pattern P is present at the current shift so we will print it. Also, we will shift P such that the next character in text T is aligned with the last occurrence of it in pattern P so we can get the next position where pattern P is present if it is present more than once in text T. As we stated (idx+m<n) condition is used to cover the case when P occurs at the end of T. Another one is when j != -1. In this case, we will simply shift P as per the previous rule, but we need only positive shift, not negative, so we are using max function.

Bad character heuristics worst-case time complexity is O(mn) which may occur when both text T and pattern P have all characters the same, like in this case T=”ppppppppp” and P=”pppp”.
In the best-case scenario, the bad character heuristic takes O(n/m) which may occur when all characters of pattern P and text T are different.

Overview

The Boyer Moore algorithm is very useful for pattern searching due to preprocessing, which increases its efficiency. It utilizes different implementation strategies, which can be used independently or together, to get better efficiency.

Chat on Discord

November 15, 2021

Boyer Moore Algorithm with Bad character Heuristic

DURATION

categories

Tags

share

Join Topcoder Challenges

Bad character heuristics

Implementation

Overview

Recommended for you

How To Compete in SRMs

Suffix Array- LCP and Finding Unique Substring

Getting Started with Hash Table Data Structure - Introduction