November 14, 2021

Suffix Array- LCP and Finding Unique Substring

Nikhil Kumar SinghVrishchik

DURATION

9min

What is a suffix?

A nonempty substring at the end of a string is known as a suffix of a string.

Example:

What is a suffix array?
A suffix array is an array consisting of all the sorted suffixes of a string.

The array of sorted indices is the actual ‘suffix array’. This provides a compressed representation of the sorted suffixes without the need to store the suffixes.
The suffix array provides a space-efficient alternative to a suffix tree, which already is a compressed version of a tree.

Note: suffix arrays can do everything suffix trees can, with the help of some additional information, such as a Longest Common Prefix ( LCP ) array.

Longest Common Prefix ( LCP )

In the LCP array each index stores how many characters two sorted suffixes have in common with each other.
Let’s find the LCP array of the string ‘ABABBAB’.

Finding Unique Substrings

Many commonplace problems in competitive programming revolve around finding/counting all the unique substrings of a string.
The naive algorithm can generate all substrings and sort them in an array in an O(n2) algorithm. But using the LCP array provides a better approach; it is both space and time-efficient.

Suppose we wish to find the unique substring of ‘ABABA’

Number of unique substrings: 9

Now let’s generate the LCP array of our string and find out how we can solve this using suffix arrays.

We generate the LC array for the string ABABA. It represents that two adjacent suffixes of the original string share certain characters with each other. So, if the LCP value at a certain index is x, then those two suffixes have x characters in common, or in other words, there are x repeated substrings between those two suffixes since they both come from the same larger string.

Now, let’s use the above information to calculate the unique substrings.

LCP position at index 1, we could see that it has a value of 1.
We see that A is repeated.

LCP position at index 2, we could see that it has a value of 3.
We can see that A, AB, ABA are repeated.

LCP position at index 4, we could see that it has a value of 2.
We can see that B, BA, BAB are repeated.

So if we generalize we can formulate our concept:

The number of unique substrings in a string is given by:

For ur string : n = 5

Unique substring: 5(5+1)/2 - (1+3+0+2+0) = 9

We are going to use Kasai’s algorithm for calculating the LCP array, which is a linear time algorithm, i.e O(n) time.

Code implementation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
#include <bits/stdc++.h>

using namespace std;

struct suffix {
  int idx;
  int R[2];
};

int comp(suffix a, suffix b) {
  return (a.R[0] == b.R[0]) ?
    (a.R[1] < b.R[1] ? 1 : 0) :
    (a.R[0] < b.R[0] ? 1 : 0);
}

vector < int > build(string txt, int n) {
  suffix suffixes[n];
  int i, j;

  for (i = 0; i < n; i++) {
    suffixes[i].idx = i;
    suffixes[i].R[0] = txt[i] - 'A';
    suffixes[i].R[1] = ((i + 1) < n) ?
      (txt[i + 1] - 'a') : -1;
  }

  sort(suffixes, suffixes + n, comp);

  int ind[n];
  for (j = 4; j < 2 * n; j = j * 2) {
    int R = 0;
    int prev_R = suffixes[0].R[0];
    suffixes[0].R[0] = R;
    ind[suffixes[0].idx] = 0;

    for (i = 1; i < n; i++) {
      if (suffixes[i].R[0] == prev_R &&
        suffixes[i].R[1] == suffixes[i - 1].R[1]) {
        prev_R = suffixes[i].R[0];
        suffixes[i].R[0] = R;
      } else {
        prev_R = suffixes[i].R[0];
        suffixes[i].R[0] = ++R;
      }
      ind[suffixes[i].idx] = i;
    }

    for (i = 0; i < n; i++) {
      int nextidx = suffixes[i].idx + j / 2;
      suffixes[i].R[1] = (nextidx < n) ?
        suffixes[ind[nextidx]].R[0] : -1;
    }

    sort(suffixes, suffixes + n, comp);
  }

  vector < int > sa;
  for (i = 0; i < n; i++)
    sa.push_back(suffixes[i].idx);

  return sa;
}

vector < int > kasai(string txt, vector < int > sa) {
  int n = sa.size();

  vector < int > lcp(n, 0);

  vector < int > inverseSuff(n, 0);
  int i, j;
  for (i = 0; i < n; i++)
    inverseSuff[sa[i]] = i;

  int k = 0;

  for (i = 0; i < n; i++) {
    if (inverseSuff[i] == n - 1) {
      k = 0;
      continue;
    }

    int j = sa[inverseSuff[i] + 1];

    while (i + k < n && j + k < n && txt[i + k] == txt[j + k])
      k++;

    lcp[inverseSuff[i]] = k;

    if (k > 0)
      k--;
  }

  return lcp;
}

int countUniqueSubstring(string txt) {
  int n = txt.length();

  vector < int > sa = build(txt, n);
  vector < int > lcp = kasai(txt, sa);

  int result = n - sa[0];

  for (int i = 1; i < lcp.size(); i++)
    result += (n - sa[i]) - lcp[i - 1];

  result++;
  return result;
}

int main() {
  string str;
  cin >> str;
  cout << countUniqueSubstring(str);
  return 0;
}

Input:
ABABA

Output:
9

Chat on Discord

November 14, 2021

Suffix Array- LCP and Finding Unique Substring

DURATION

categories

Tags

share

Topcoder Thrive

What is a suffix?

Longest Common Prefix ( LCP )

Finding Unique Substrings

Recommended for you

Introduction to String Searching Algorithms

Marathon Processor Specification and Scoring Terminology

How To Compete in a Topcoder Marathon Match