Suffix array is an incredibly powerful data structure to have in your toolbox when you encounter some string-related processing. It is a relatively new data structure designed due to the heavy memory consumption needs of suffix trees.
A nonempty substring at the end of a string is known as a suffix of a string.
Example:
What is a suffix array?
A suffix array is an array consisting of all the sorted suffixes of a string.
The array of sorted indices is the actual ‘suffix array’. This provides a compressed representation of the sorted suffixes without the need to store the suffixes.
The suffix array provides a space-efficient alternative to a suffix tree, which already is a compressed version of a tree.
Note: suffix arrays can do everything suffix trees can, with the help of some additional information, such as a Longest Common Prefix ( LCP ) array.
In the LCP array each index stores how many characters two sorted suffixes have in common with each other.
Let’s find the LCP array of the string ‘ABABBAB’.
Many commonplace problems in competitive programming revolve around finding/counting all the unique substrings of a string.
The naive algorithm can generate all substrings and sort them in an array in an O(n2) algorithm. But using the LCP array provides a better approach; it is both space and time-efficient.
Suppose we wish to find the unique substring of ‘ABABA’
Number of unique substrings: 9
Now let’s generate the LCP array of our string and find out how we can solve this using suffix arrays.
We generate the LC array for the string ABABA. It represents that two adjacent suffixes of the original string share certain characters with each other. So, if the LCP value at a certain index is x, then those two suffixes have x characters in common, or in other words, there are x repeated substrings between those two suffixes since they both come from the same larger string.
Now, let’s use the above information to calculate the unique substrings.
LCP position at index 1, we could see that it has a value of 1.
We see that A is repeated.
LCP position at index 2, we could see that it has a value of 3.
We can see that A, AB, ABA are repeated.
LCP position at index 4, we could see that it has a value of 2.
We can see that B, BA, BAB are repeated.
So if we generalize we can formulate our concept:
The number of unique substrings in a string is given by:
For ur string : n = 5
Unique substring: 5(5+1)/2 - (1+3+0+2+0) = 9
We are going to use Kasai’s algorithm for calculating the LCP array, which is a linear time algorithm, i.e O(n) time.
Code implementation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
#include <bits/stdc++.h>
using namespace std;
struct suffix {
int idx;
int R[2];
};
int comp(suffix a, suffix b) {
return (a.R[0] == b.R[0]) ?
(a.R[1] < b.R[1] ? 1 : 0) :
(a.R[0] < b.R[0] ? 1 : 0);
}
vector < int > build(string txt, int n) {
suffix suffixes[n];
int i, j;
for (i = 0; i < n; i++) {
suffixes[i].idx = i;
suffixes[i].R[0] = txt[i] - 'A';
suffixes[i].R[1] = ((i + 1) < n) ?
(txt[i + 1] - 'a') : -1;
}
sort(suffixes, suffixes + n, comp);
int ind[n];
for (j = 4; j < 2 * n; j = j * 2) {
int R = 0;
int prev_R = suffixes[0].R[0];
suffixes[0].R[0] = R;
ind[suffixes[0].idx] = 0;
for (i = 1; i < n; i++) {
if (suffixes[i].R[0] == prev_R &&
suffixes[i].R[1] == suffixes[i - 1].R[1]) {
prev_R = suffixes[i].R[0];
suffixes[i].R[0] = R;
} else {
prev_R = suffixes[i].R[0];
suffixes[i].R[0] = ++R;
}
ind[suffixes[i].idx] = i;
}
for (i = 0; i < n; i++) {
int nextidx = suffixes[i].idx + j / 2;
suffixes[i].R[1] = (nextidx < n) ?
suffixes[ind[nextidx]].R[0] : -1;
}
sort(suffixes, suffixes + n, comp);
}
vector < int > sa;
for (i = 0; i < n; i++)
sa.push_back(suffixes[i].idx);
return sa;
}
vector < int > kasai(string txt, vector < int > sa) {
int n = sa.size();
vector < int > lcp(n, 0);
vector < int > inverseSuff(n, 0);
int i, j;
for (i = 0; i < n; i++)
inverseSuff[sa[i]] = i;
int k = 0;
for (i = 0; i < n; i++) {
if (inverseSuff[i] == n - 1) {
k = 0;
continue;
}
int j = sa[inverseSuff[i] + 1];
while (i + k < n && j + k < n && txt[i + k] == txt[j + k])
k++;
lcp[inverseSuff[i]] = k;
if (k > 0)
k--;
}
return lcp;
}
int countUniqueSubstring(string txt) {
int n = txt.length();
vector < int > sa = build(txt, n);
vector < int > lcp = kasai(txt, sa);
int result = n - sa[0];
for (int i = 1; i < lcp.size(); i++)
result += (n - sa[i]) - lcp[i - 1];
result++;
return result;
}
int main() {
string str;
cin >> str;
cout << countUniqueSubstring(str);
return 0;
}
Input:ABABA
Output:9