Longest Common Subsequence Problem

SE 504
The Longest Common Subsequence Problem

If x is a sequence, then any sequence that can be obtained by "removing" zero or more of the elements of x is a subsequence of x. Note that the removed elements need not be adjacent to each other. For example, consider the character sequence x = abbabcab. If the 2nd, 4th, 5th, and 8th elements are removed, we are left with the subsequence abca.

Consider now the sequence y = babacbaca. Notice that abca is a subsequence of it, too. (Indeed, there are at least three different ways of removing elements from y to obtain abca.) Hence, we have that abca is a common subsequence of x and y.

The Longest Common Subsequence problem is as follows: Given as input two sequences x and y containing elements is of some type T (char if they are character strings), find a sequence z that is a subsequence of both x and y and is maximum in length among all such sequences.

A somewhat more difficult version of the problem is to find the set consisting of all common maximum-length subsequences of x and y (rather than only a particular element of this set). Here we will consider a less difficult version of the problem, which is to find the length of a longest common subsequence of x and y, which are represented by the arrays X[1..M] and Y[1..N].

For the purpose of developing the program, we should first explore the the relevant concept for the purposes of coming to a decent understanding of it. So consider two arbitrary sequences x and y (whose elements are of type T) and two elements of T, a and b. For the purposes of brevity, hereafter the phrase "longest common subsequence" will be abbreviated as "lcs".

Suppose that we already had identified an lcs z of x and y (and thus its length), and now we wanted to identify an lcs of xa and yb. There are essentially two cases:

a = b. Then za (equivalently, zb) must be an lcs of xa and yb.
a ≠ b. Then any lcs of xa and yb must be either an lcs of xa and y or an lcs of x and yb.

Program Specification

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat r := ? {Q: r = LLCS.M.N } ]|

Program Specification
`\|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat r := ? {Q: r = LLCS.M.N } ]\|`

Now we transfer that understanding into the realm of the program:

For 0≤m≤M and 0≤n≤N, let LLCS.m.n = the Length of a Longest Common Subsequence of X[1..m] and Y[1..n]. Then we can specify our program as shown to the right.

Consistent with the observations that we made above about empty sequences, we get

(1) LLCS.0.n = 0 (for all n in 0..N)
(2) LLCS.m.0 = 0 (for all m in 0..M)

For the case of non-empty sequences, and consistent with the observations made earlier:
Let 1≤m≤M and 1≤n≤N. If X.m = Y.n, and if Z is a longest common subsequence of X[1..m-1] and Y[1..n-1], then appending X.m to Z yields a longest common subsequence of X[1..m] and Y[1..n]. This observation gives rise to

(3) LLCS.m.n = LLCS.(m-1).(n-1) + 1 if X.m = Y.n (1≤m≤M, 1≤n≤N)

If, on the other hand, X.m ≠ Y.n, then any common subsequence of X[1..m] and Y[1..n] must also be a common subsequence either of X[1..m] and Y[1..n-1] or of X[1..m-1] and Y[1..n] (or both). This observation gives rise to

(4) LLCS.m.n = LLCS.m.(n-1) max LLCS.(m-1).n if X.m ≠ Y.n (1≤m≤M, 1≤n≤N)

Summarizing (1) through (4), we have

LLCS.m.n = {   0    if m=0 ∨ n=0

LLCS.(m-1).(n-1) + 1    if 1≤m≤M ∧ 1≤n≤N ∧ X.m = Y.n

LLCS.m.(n-1) max LLCS.(m-1).n    if 1≤m≤M ∧ 1≤n≤N ∧ X.m ≠ Y.n

Note that this definition covers all cases, so it is complete.

Imagine the values LLCS.m.n in a two-dimensional table. For example, if X and Y were of lengths three and two, respectively, the table would be like this: 3 LLCS.3.0 LLCS.3.1 LLCS.3.2 2 LLCS.2.0 LLCS.2.1 LLCS.2.2 1 LLCS.1.0 LLCS.1.1 LLCS.1.2 0 LLCS.0.0 LLCS.0.1 LLCS.0.2 0 1 2

3	LLCS.3.0	LLCS.3.1	LLCS.3.2
2	LLCS.2.0	LLCS.2.1	LLCS.2.2
1	LLCS.1.0	LLCS.1.1	LLCS.1.2
0	LLCS.0.0	LLCS.0.1	LLCS.0.2
	0	1	2

Generally speaking, the value LLCS.m.n would occupy the cell at location (m,n) (i.e., in row m and column n), for all m and n satisfying 0≤m≤M ∧ 0≤n≤N. In particular, the solution to the problem is found in the cell at location (M,N) (the upper right corner).

From our description of LLCS, it follows that each cell in row zero contains zero, as does each cell in column zero. It also follows that, for every other cell, its value depends upon either the cell diagonal to it below and left or else the cells immediately under it and to its left. That is, the value at each such location (m,n) depends upon either the one at location (m-1,n-1) or upon the ones at locations (m,n-1) and (m-1,n). (Of course, which of the two dependencies exist is determined by whether or not X.m = Y.n.)

So, having filled row zero and column zero with zeros, we can fill in the rest of the table row-by-row (with row numbers increasing as we go along), with each row being filled in with column numbers increasing as we go along. (Alternatively, we could fill the rest of the table column-by-column, or even diagonal-by-diagonal.)

A Program Solution

Using the ideas developed above, here is a program solution, which uses a two-dimensional array llcs to store the values of LLCS:

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat; var m,n : nat; var llcs : array[0..M,0..N] of nat; llcs[0,0..N] := 0; // fill row zero of llcs with zeros llcs[0..M,0] := 0; // fill column zero of llcs with zeros m := 1; {Loop invariant I_(m): (∀i,j | 0≤i<m ∧ 0≤j≤N : llcs[i,j] = LLCS.i.j) ∧ 1≤m≤M+1 } {Bound function t: M+1-m} do m ≠ M+1 ---> n := 1; {Loop invariant I_(n): I_(m) ∧ (∀j | 0≤j<n : llcs[m,j] = LLCS.m.j) ∧ 1≤n≤N+1 } {Bound function t: N+1-n} do n ≠ N+1 ---> if X[m]= Y[n] ---> llcs[m,n] := llcs[m-1,n-1] + 1 [] X[m] ≠ Y[n] ---> llcs[m,n] := llcs[m,n-1] max llcs[m-1,n] fi ;n := n+1 od ;m := m+1; od {I_(m) ∧ m=M+1, from which it follows that llcs[M,N] = LLCS.M.N} ;r := llcs[M,N] {Q: r = LLCS.M.N } ]|

Informally, the invariant I_(m) says that rows 0..m-1 of the llcs[] array are filled correctly (with the corresponding values of LLCS) and the invariant I_(n) says that, within row m of llcs[], columns 0..n-1 are filled correctly.

Here is an alternative way of doing the looping, so that a single loop (with multiple guards) suffices, rather than nesting one loop inside the other:

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat; var m,n : nat; var llcs : array[0..M,0..N] of nat; llcs[0,0..N] := 0; // fill row zero of llcs with zeros llcs[0..M,0] := 0; // fill column zero of llcs with zeros m,n := 1,1; {Loop invariant I: I_(m) ∧ I_(n), where I_(m) : (∀i,j | 0≤i<m ∧ 0≤j≤N : llcs[i,j] = LLCS.i.j) ∧ 1≤m≤M+1 I_(n) : (∀j | 0≤j<n : llcs[m,j] = LLCS.m.j) ∧ 1≤n≤N+1 } {Bound function t: (M+1)(N+1) - (N+1)m - n} do m ≠ M+1 ∧ n = N+1 ---> m,n := m+1,1 [] m ≠ M+1 ∧ n ≠ N+1 ---> if X.m = Y.n ---> llcs[m,n] := llcs[m-1,n-1] + 1 [] X[m] ≠ Y[n] ---> llcs[m,n] := llcs[m,n-1] max llcs[m-1,n] fi ;n := n+1 od {I_(m) ∧ m=M+1, from which it follows that llcs[M,N] = LLCS.M.N} ;r := llcs[M,N] {Q: r = LLCS.M.N } ]|

For that matter, we don't need the selection command if we split the second loop guard into two (and distribute the assignment n := n+1 over both resulting guarded commands):

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat; var m,n : nat; var llcs : array[0..M,0..N] of nat; llcs[0,0..N] := 0; // fill row zero of llcs with zeros llcs[0..M,0] := 0; // fill column zero of llcs with zeros m,n := 1,1; {Loop invariant I: I_(m) ∧ I_(n), where I_(m) : (∀i,j | 0≤i<m ∧ 0≤j≤N : llcs[i,j] = LLCS.i.j) ∧ 1≤m≤M+1 I_(n) : (∀j | 0≤j<n : llcs[m,j] = LLCS.m.j) ∧ 1≤n≤N+1 } {Bound function t: (M+1)(N+1) - (N+1)m - n} do m ≠ M+1 ∧ n = N+1 ---> m,n := m+1,1 [] m ≠ M+1 ∧ n ≠ N+1 ∧ X[m] = Y[n] ---> n, llcs[m,n] := n+1, llcs[m-1,n-1] + 1 [] m ≠ M+1 ∧ n ≠ N+1 ∧ X[m] ≠ Y[n] ---> n, llcs[m,n] := n+1, llcs[m,n-1] max llcs[m-1,n] od {I_(m) ∧ m=M+1, from which it follows that llcs[M,N] = LLCS.M.N} ;r := llcs.M.N {Q: r = LLCS.M.N } ]|

Reducing Space Usage

The solution above uses a quantity of space that is proportional to the product M×N of the lengths of the two input sequences. (Indeed, the two-dimensional array llcs has (M+1)×(N+1) elements.) Can we rewrite the program to use significantly less space? Consider that as we fill the elements of llcs in a row-by-row fashion, at no time is it necessary to refer to any element that is not either in the row currently being filled or the row immediately preceding that one. From this we conclude that it suffices to store two rows of llcs, the "current" one and the preceding one. Based on this idea, we offer the following program, in which arrays llcsCur and llcsPrev are used for storing the "current" row (m) and the preceding row (m-1) of the table containing the LLCS values.

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat; var m,n : nat; var llcsCur, llcsPrev : array[0..N] of nat; llcsPrev[0..N] := 0; // fill llcsPrev[] with zeros llcsCur[0] := 0; // fill zero-th elem of llcsCur[] with zero m := 1; {Loop invariant I_(m) : llcsPrev[0..N] = LLCS.(m-1).[0..N] ∧ 1≤m≤M+1 } {Bound function t: M+1-m} do m ≠ M+1 ---> n := 1; {Loop invariant I_(n) : I_(m) ∧ llcsCur[0..n) = LLCS.m[0..n) ∧ 1≤n≤N+1 } {Bound function t: N+1-n} do n ≠ N+1 ---> if X[m] = Y[n] ---> llcsCur[n] := llcsPrev[n-1] + 1 [] X[m] ≠ Y[n] ---> llcsCur[n] := llcsCur[n-1] max llcsPrev[n] fi ;n := n+1 od ;llcsCur, llcsPrev := llcsPrev, llcsCur // swap (references to) the two arrays ;m := m+1; od {I_(m) ∧ m=M+1, from which it follows that llcsPrev[N] = LLCS.M.N} ;r := llcsPrev[N] {Q: r = LLCS.M.N } ]|

Here is the alternative single-loop version:

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat; var m,n : nat; var llcsCur, llcsPrev : array[0..N] of nat; llcsPrev[0..N] := 0; // fill llcsPrev with zeros llcsCur[0] := 0; // fill zero-th element of llcsCur with zero m,n := 1,1; {Loop invariant I: I_(m) ∧ I_(n), where I_(m) : llcsPrev[0..N] = LLCS.(m-1).[0..N] ∧ 1≤m≤M+1 I_(n) : llcsCur[0..n) = LLCS.m.[0..n) ∧ 1≤n≤N+1 } {Bound function t: (M+1)(N+1) - (N+1)m - n} do m ≠ M+1 ∧ n = N+1 ---> llcsCur, llcsPrev := llcsPrev, llcsCur; // swap (references to) the arrays m,n := m+1,1; [] m ≠ M+1 ∧ n ≠ N+1 ---> if X[m] = Y[n] ---> llcsCur[n] := llcsPrev[n-1] + 1 [] X[m] ≠ Y[n] ---> llcsCur[n] := llcsCur[n-1] max llcsPrev[n] fi ;n := n+1 od {I_(m) ∧ m=M+1, from which it follows that llcsPrev[N] = LLCS.M.N} ;r := llcsPrev.N {Q: r = LLCS.M.N } ]|

Having replaced the two dimensional array llcs[] by the two one-dimensional arrays llcsPrev[] and llcsCur[], the program's space complexity falls from Θ(MN) to Θ(N), which is a huge improvement.

It turns out that we can do a little bit better, however, as we don't really need both of the arrays! Consider that, as llcsCur[] is being "filled up", we have llcsCur[0..n) = LLCS.m.[0..n). Because we also have llcsPrev[0..N] = LLCS.(m-1).[0..N] and nothing has been done to any element in llcsCur[n..N] since the last time the two arrays were swapped, it must also be that llcsCur[n..N] = LLCS.(m-1).[n..N]. In other words, the elements in llcsCur[n..N] still contain the corresponding values from the previous row, m-1, of LLCS. In particular, llcsCur[n] = llcsPrev[n]. Hence, the assignment

llcsCur[n] := llcsCur[n-1] max llcsPrev[n]

could just as well have been written

llcsCur[n] := llcsCur[n-1] max llcsCur[n]

Rewriting the assignment

llcsCur[n] := llcsPrev[n-1] + 1

without referring to llcsPrev is more problematic, however, because llcsCur[n-1] = LLCS.m.(n-1) but llcsPrev[n-1] = LLCS.(m-1).(n-1). (What happened is that, on the previous loop iteration, the value that had been in llcsCur[n-1] —namely, LLCS.(m-1).(n-1)— was overwritten by LLCS.m.(n-1). What we need is a way to "remember" the value previously occupying llcsCur[n-1]! Ah, but that is solved by introducing a fresh variable, old, and strengthening the invariant by stipulating that old = LLCS.(m-1).(n-1). Having done that, the job of the two arrays llcsPrev[] and llcsCur[] can be served by a single array, which we call llcs[]. The "trick" to maintaining the new invariant is to assign the value of llcs[n] to old while simultaneously modifying llcs[n]. Here is the program, using the single-loop and selection-free approach:

|[ con M, N : nat; con X : array [1..M] of T; con Y : array [1..N] of T; var r : nat; var m,n : nat; var llcs : array[0..N] of nat; var old : nat; llcs[0..N] := 0; // fill llcs[] with zeros m, n, old := 1, 1, 0; {Loop invariant I: llcs[0..n) = LLCS.m.[0..n) ∧ llcs[n..N] = LLCS.(m-1).[n..N] ∧ old = LLCS.(m-1).(n-1) ∧ 1≤m≤M+1 ∧ 1≤n≤N+1 } {Bound function t: (M+1)(N+1) - (N+1)m - n} do m ≠ M+1 ∧ n = N+1 ---> m, n, old := m+1, 1, 0 [] m ≠ M+1 ∧ n ≠ N+1 ∧ X[m] = Y[n] ---> n, old, llcs[n] := n+1, llcs[n], old + 1 [] m ≠ M+1 ∧ n ≠ N+1 ∧ X[m] ≠ Y[n] ---> n, old, llcs[n] := n+1, llcs[n], llcs[n-1] max llcs[n] od {I ∧ m=M+1 ∧ n=1, from which it follows that llcs[N] = LLCS.M.N} ;r := llcs[N]; {Q: r = LLCS.M.N } ]|

Note that the claim that n=1 is necessarily true when the loop terminates is not rigorously justified here, but it is clearly true, because the loop's final iteration will be one in which m is increased from M to M+1 and n is set to 1. To justify the claim rigorously, we should add the implication m=M+1 ⇒ n=1 as a new conjunct to the loop invariant.

LLCS.m.n = {	0	if m=0 ∨ n=0
	LLCS.(m-1).(n-1) + 1	if 1≤m≤M ∧ 1≤n≤N ∧ X.m = Y.n
	LLCS.m.(n-1) max LLCS.(m-1).n	if 1≤m≤M ∧ 1≤n≤N ∧ X.m ≠ Y.n

SE 504 The Longest Common Subsequence Problem

A Program Solution

Reducing Space Usage

SE 504
The Longest Common Subsequence Problem