Throwback: A 2007 TCO Round 2 Problem
With 2018 TCO Round 1 ending on May 3rd, the hype for Round 2 is picking up. In tribute to this, this week we'll analyze a problem from a previous Topcoder Open Round 2, from back in 2007.
The question we're analyzing this week is "WordSplit", Level Two problem statement of the contest.
The summary of the problem statement is as follows -
Given a string of lowercase letters, split it up into pieces so that the letters in each piece are distinct. Form as few pieces as possible.
If more than one way of splitting is minimal, return the sorted sequence of pieces that is first lexicographically.
For example, the input "facetiously" returns {"facetiously"} unchanged. But the input string "aba" returns {"a", "ab"}
Now, TCO07 was won by Jan_Kuipers, so let's analyze the solution submitted by the eventual TCO champion of that year.
The problem ultimately resolves to finding the optimal method to break the input string into substrings or pieces that satisfy the required conditions -
Each piece consists entirely of unique letters, i.e. no letter occurs twice or more in a piece.
The sorted sequences of pieces is lexicographically smallest.
For example, consider the input "aabba". Now, there are multiple valid ways of breaking this input into valid pieces -
{"a", "a", "b", "ba"} or {"a", "ab", "ba"} or {"a", "ab", "b", "a"} or {"a", "a", "b", "b", "a"} are some examples of valid splitting, prior to sorting or checking for minimal number of pieces.
The answer here is {"a", "ab", "ba"} since the number of elements are minimal here.
Given we now have an understanding of the question, let's analyze the approach taken by the champion.
First, we identify that this problem can be broken into subproblems. If I can optimally divide a string into 2 substrings and optimally divide those substrings, by repeatedly performing these operations, I am guaranteed to eventually find an optimal solution.
Doing this efficiently requires some memoization. Specifically, we memoize the optimal splitting of pieces, for a given string. That is, the memoized value for "aba" will be {"a", "ab"}.
In addition, we also need to verify that our splitting of strings produces the minimal number of pieces. This can be done in linear time by leveraging a hash table by mapping the ASCII values of characters to array indices.
After solving the subproblems, we'll need to determine the sorted sequence of pieces. The optimal way to do this, is similar to the merging aspect of merge sort. We have two vectors of strings, and to our result vector, we maintain two pointers for both string vectors, and push the lexicographically lower of the two string vectors and increment the corresponding vector's pointer to show the next string in the vector. This generates the sorted string in linear time.
Overall, the algorithm runs in quadratic time, as finding the right split position requires linear time, and for each split determining if its an optimally minimal split requires linear time, as there is no use in checking splits that don't generate minimal number of pieces.
Jan_Kuiper's implementation is given below -
[text]
using namespace std;
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <ctype.h>
#include <string.h>
#include <string>
#include <sstream>
#include <iostream>
#include <vector>
#include <queue>
#include <stack>
#include <set>
#include <map>
#include <algorithm>
#include <functional>
#define PB push_back
#define SZ size()
#define REP(v, hi) for (int v=0; v<(hi); v++)
#define REPD(v, hi) for (int v=((hi)-1); v>=0; v--)
#define FOR(v, lo, hi) for (int v=(lo); v<(hi); v++)
#define FORD(v, lo, hi) for (int v=((hi)-1); v>=(lo); v--)
typedef vector <int> VI;
typedef vector <string> VS;
/* ############################ THE REAL CODE ############################ */
int go (string s) {
int res=0;
VI u(26,1);
REP(i,s.SZ) {
if (!u[s[i]-'a'])
u[s[i]-'a'] = 1;
else {
u=VI(26,0);
u[s[i]-'a'] = 1;
res++;
}
}
return res;
}
map<string, VS> cache;
VS solve (string s) {
if (cache.count(s)) return cache[s];
int best = go(s);
if (best==1) return cache[s] = VS(1,s);
VS res;
FOR(split,1,s.SZ) {
string s1 = s.substr(0,split);
string s2 = s.substr(split);
if (go(s1) + go(s2) == best) {
VS r1 = solve(s1);
VS r2 = solve(s2);
VS r;
int i1=0, i2=0;
while (i1<r1.SZ || i2<r2.SZ) {
if (i1==r1.SZ) r.PB(r2[i2++]);
else if (i2==r2.SZ) r.PB(r1[i1++]);
else if (r1[i1]<r2[i2]) r.PB(r1[i1++]);
else r.PB(r2[i2++]);
}
if (res==VS() || r<res) res=r;
}
}
return cache[s] = res;
}
class WordSplit {
public:
vector <string> pieces(string s) {
cache[""] = VS();
return solve(s);
}
};
[/text]
The function go() determines the minimal number of pieces that a string can have if split optimally.
It is invoked for every potential split of the current string at hand.
The solve function is where the string splitting, dividing and merging happens.
First the memoized table is checked, to see if there already exists solved solution for the parameter string. If not, the minimal number of splits possible is stored in an integer "best".
If "best" is 1, the string need not be broken down further and already consists of distinct characters, and the memoized table stores a vector of size 1, containing the same string as its value and its key.
But if "best" is not 1, then the optimal split position needs to be determined. This is done, by iterating over all possible points, and strings s1 and s2 store the first and second part of the parameter string accordingly.
The following if-condition is an important optimization -
if (go(s1) + go(s2) == best)
As the next part of the algorithm is the invoking of the recursive calls, we can prune the number of our recursive calls by checking if this split is an optimal split.
For example, splitting the string "aabcd" as "a" + "abcd" (where go(s1) + go(s2) = 1+1 = 2) is optimal as compared to splitting it as "aab" + "cd" (where go(s1) + go(s2) = 2+1 = 3)
Thus, we only check for optimal splits by leveraging our go() function and optimize the algorithm by doing so.
We then recursively call the solve() function to solve the problem for the substrings and store their solutions.
Then we proceed to merge the results using a method similar to merge sort's merging process.
Now we compare the present value in the memoized table with our merged result vector of strings, and if the result vector is lexicographically lower than the memoized table's value, we update the memoized table to store the new lower value.
And finally we obtain our final answer, the lexicographically lowest, minimal number of pieces, that can be obtained by optimally breaking the input string.