cover
SRM

Problem Of The Week: Strings, Dynamic Programming and Edit Distance



This week, we look into a problem that appeared in SRM 698, a round sponsored by Google.
This was a division 1 Level 1 problem who's summary is as follows -
Given a string, you are allowed to perform any of the following operations -

  1. Delete any character from the string.

  2. Replace any character with any other

  3. Add any character into any part of the string.


Your task, is to find the minimum number of operations to be performed, in order to convert the string to a "perfect square".
A "perfect square" is defined as a string which can be represented as the concatenation of two equal strings.
For example, these strings are perfect squares - "abcdabcd" or "xyxy" or "abxxabxx"
While these are not perfect squares - "abcdefg" or "xyza"
Now let's breakdown the question into simpler subtasks.
Lets our output string be called S. Now as S is a perfect square string, it can be represented as the concatenation of two equal substrings. Let us call this substring T.
Therefore S = T+T
Now since we want to split S into two parts, we need to define WHERE we want to split S.
There are length(S)+1 ways to split S.
So after splitting S into two substrings, let us label these two substrings as A and B.
Our objective is to find the minimum number of operations to make A and B equal to each other.
If we can do this for any substring A and B, we can find the overall minimum number of operations to make S a perfect square string by iterating over all possible ways of splitting S, and thus all possible values of A and B.
So how do we find the minimum number of operations to make A and B equal to one another? Enter Dynamic Programming. In fact, this problem involves the application of a famous algorithm, called “Edit Distance” or “Levenshtein Distance”.
Let’s understand the intuition behind this.
Dynamic Programming is the idea of solving a large problem, by solving its own subproblems and remembering the results.
In our case, our “problem” is to find the minimum number of operations to make A and B equal. Consequently, our “subproblem” is to find the minimum number of operations to make the prefixes of A and B equal, if we decide construct A and B from the bottom-up, that is from a null string to their complete state.
So now, how does solving our subproblems help us solve our main problem? In other words, what does our recurrence relation look like?
Well, let’s define two new strings X and Y as the current prefixes of A and B. Over the course of our algorithm’s runtime, X and Y start as null strings and by the end, become A and B, while we memorize all results made during this transition.
Consequently, for X to become X+A[ len(X) + 1], either

  • a new character must be added (if the length of X exceeds Y)

  • a character must be replaced (if the new character does not match the corresponding character in Y)

  • no operation needs to be performed (if the new character matches the corresponding character in Y)


We need not consider the case of removing a character as we are building the strings X and Y from the ground up, and the removal of a character from a string means the same as the addition of the same character in the other string.
Let’s look at the pseudocode of this algorithm.
memset(dp, INF, sizeof(dp)) // Initialize all values of dp to infinity
dp[0][0] = 0 // It takes zero moves to make two null strings equal
Let i iterate over all prefixes of X
Let j iterate over all prefixes of Y
dp[i][j] = min( dp[i-1][j] + 1, dp[i][j-1] +1, dp[i-1][j-1] + A[i-1]!=B[j-1], dp[i][j])


Thus, after doing so, our answer is stored in dp[ len(X) ][ len(Y) ].
This is because our state dp[ i ][ j ] defines the minimum number of operations to make the prefixes of length i and j in X and Y respectively, equal to one another.
Thus, we are able to check all possible combinations of constructing our strings in this manner.
Consequently, if we were to substitute A and B for the strings obtained when iterating over all the splits of the initial input S, and take the global minimum, we can successfully solve this problem.
The time complexity for solving this problem with this approach, is O(N^3) where N is the length of the input string, as there are O(N) possible splits, and the dynamic programming tabulation is performed in O(N^2).
This approach is similar to that used by the fastest submission, by subscriber [https://www.topcoder.com/members/subscriber], who solved this in 3 minutes.
Here is his submission, for those readers interested in the full implementation of this approach - http://community.topcoder.com/stat?c=problem_solution&cr=22883353&rd=16802&pm=14391