SE 504 Spring 2020
Prog. Assg. #1: Longest Common Subsequences
Due: Noon, May 4

Where x and y are strings of lengths M and N, respectively, and m and n satisfy 0≤m≤M and 0≤n≤N, respectively, let x(m) and y(n) refer to the prefixes of x and y of lengths m and n, respectively. Recall from class that LLCS.m.n is the length of any longest common subsequence (LCS) of x(m) and y(n).

Constructed using String objects x and y, an instance of the Java class LongComSubseq is intended to be able to provide information about LCS's of any prefixes of x and y.

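Specifically, for given m and n, it can provide an LCS of x(m) and y(n), a maximal (m,n)-matching (defined next), and the number of distinct maximal (m,n)-matchings.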

Let x(m) = x_0 x_1 ... x_{m-1} and y(n) = y_0 y_1 ... y_{n-1}. Then an (m,n)-matching is a pair of mappings (f,g), where, for some k≥0, f : [0..k) ⟶ [0..m) and g : [0..k) ⟶ [0..n), such that both f and g are increasing and, for all i satisfying 0≤i<k, x_{f.i} = y_{g.i}. An (m,n)-matching is maximal if k = LLCS.m.n.

As an example, consider

    0   1   2   3   4   5   6   7   8   9
  +---+---+---+---+---+---+---+---+---+---+
x | a | a | b | a | c | b | a | c | b | a | 
  +---+---+---+---+---+---+---+---+---+---+

    0   1   2   3   4   5   6   7   8
  +---+---+---+---+---+---+---+---+---+
y | c | a | c | a | b | b | c | c | a | 
  +---+---+---+---+---+---+---+---+---+

Then the following (non-maximal) (10,9)-matching is consistent with the fact that x_0 x_3 x_5 x_8 x_9 = aabba = y_1 y_3 y_4 y_5 y_8:

    0   1   2   3   4
  +---+---+---+---+---+
f | 0 | 3 | 5 | 8 | 9 |
  +---+---+---+---+---+

    0   1   2   3   4
  +---+---+---+---+---+
g | 1 | 3 | 4 | 5 | 8 |
  +---+---+---+---+---+

Your job is to complete the class. The source code given to you includes three stubbed methods: one to compute an LCS, another to compute a maximal matching, and a third to compute the number of distinct maximal matchings, all with respect to given m and n.

As a technical detail, a matching is to be represented by an array of type int[][] of length two (i.e., having two "rows"). Its two elements (each of type int[], of course) are to correspond to what we called f and g above.
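To make the representation concrete, here is a minimal sketch of a validity check for a matching stored as an int[][] with rows f and g. The class name MatchingCheck and the method isMatching are hypothetical (they are not part of the code given to you), and the sketch assumes 0≤m≤x.length() and 0≤n≤y.length().

```java
final class MatchingCheck {
    /**
     * Returns true if matching (row 0 = f, row 1 = g) is an (m,n)-matching
     * of x and y: both rows have the same length k, both are increasing,
     * f stays within [0..m), g stays within [0..n), and
     * x.charAt(f[i]) == y.charAt(g[i]) for all i in [0..k).
     * Assumes 0 <= m <= x.length() and 0 <= n <= y.length().
     */
    static boolean isMatching(String x, String y, int m, int n, int[][] matching) {
        if (matching == null || matching.length != 2) return false;
        int[] f = matching[0];
        int[] g = matching[1];
        if (f == null || g == null || f.length != g.length) return false;
        int prevF = -1, prevG = -1;
        // Loop invariant: f[0..i) and g[0..i) are increasing, within range,
        //   and pairwise matched; prevF and prevG hold f[i-1] and g[i-1]
        //   (or -1 when i == 0).
        // Bound function: f.length - i
        for (int i = 0; i < f.length; i++) {
            if (f[i] <= prevF || f[i] >= m) return false;
            if (g[i] <= prevG || g[i] >= n) return false;
            if (x.charAt(f[i]) != y.charAt(g[i])) return false;
            prevF = f[i];
            prevG = g[i];
        }
        return true;
    }

    public static void main(String[] args) {
        String x = "aabacbacba", y = "cacabbcca";
        int[][] matching = { {0, 3, 5, 8, 9}, {1, 3, 4, 5, 8} };
        // The example matching uses position 9 of x and position 8 of y,
        // so it needs m >= 10 and n >= 9.
        System.out.println(isMatching(x, y, 10, 9, matching)); // true
        System.out.println(isMatching(x, y, 9, 8, matching));  // false
    }
}
```

A check like this can be handy for testing whatever your matching method returns, independently of how the matching was computed.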

The method that computes the LLCS function (as described in class) is given to you. You can make use of the matrix that it produces (which is stored in an instance variable) to compute an LCS or a maximal matching. You can use a maximal matching to compute an LCS, and vice versa.
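For orientation, the following sketch shows one common way to build such a matrix and how the f row of a matching yields the corresponding common subsequence. It is an illustration only: the class and method names (LlcsSketch, llcsTable, subsequenceOf) are hypothetical, and the method actually provided to you may be organized differently.

```java
final class LlcsSketch {
    /** Builds a table t with t[m][n] = LLCS.m.n for 0<=m<=M and 0<=n<=N. */
    static int[][] llcsTable(String x, String y) {
        int M = x.length(), N = y.length();
        int[][] t = new int[M + 1][N + 1];   // row 0 and column 0 remain 0
        // Outer invariant: rows 0..m-1 of t are filled in correctly.
        // Outer bound function: M + 1 - m
        for (int m = 1; m <= M; m++) {
            // Inner invariant: t[m][0..n) is filled in correctly.
            // Inner bound function: N + 1 - n
            for (int n = 1; n <= N; n++) {
                if (x.charAt(m - 1) == y.charAt(n - 1)) {
                    t[m][n] = t[m - 1][n - 1] + 1;
                } else {
                    t[m][n] = Math.max(t[m - 1][n], t[m][n - 1]);
                }
            }
        }
        return t;
    }

    /** Reads off the common subsequence that a matching describes, via its f row. */
    static String subsequenceOf(String x, int[][] matching) {
        StringBuilder sb = new StringBuilder();
        for (int i : matching[0]) {
            sb.append(x.charAt(i));
        }
        return sb.toString();
    }
}
```

If the matching passed to subsequenceOf is maximal, the string it returns is an LCS.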

Your source code is expected to be augmented by comments that describe loop invariants and bound functions.

For purposes of testing your work, you are provided with the Java application LongComSubseqTester.


Counting Maximal Matchings

To illustrate, consider the strings x:adbcbca and y:bdaab. The set of LCS's of x and y is { aa, ab, ba, bb, da, db }. For each common subsequence there is at least one matching that gives rise to it, and possibly more.

Describing a matching (f,g) as the sequence of ordered pairs [(f.0,g.0), (f.1,g.1), ...] (rather than as a pair of vectors), there are four matchings by which the LCS ba is formed (positions are numbered from zero, as in the definition of a matching):

[(2,0),(6,2)]
[(2,0),(6,3)]
[(4,0),(6,2)]
[(4,0),(6,3)]

Let NLCS.m.n = the number of distinct (necessarily maximal) matchings describing LCS's of x(m) and y(n).

For our example of x:adbcbca and y:bdaab, the final answer will be 14, as the number of ways to form aa, ab, ba, bb, da, and db are 1, 4, 4, 1, 2, and 2, respectively.
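When testing your counting method on small inputs, a brute-force count can serve as an independent check. The sketch below is not the table-based solution the assignment asks for; it simply enumerates every matching whose length equals the LLCS value. The class and method names are hypothetical, and the running time is exponential, so it is suitable only for short strings. To count with respect to prefixes, pass x.substring(0,m) and y.substring(0,n).

```java
final class BruteForceMatchingCount {
    /** Length of an LCS of x and y, computed with the usual table. */
    static int llcs(String x, String y) {
        int[][] t = new int[x.length() + 1][y.length() + 1];
        for (int m = 1; m <= x.length(); m++) {
            for (int n = 1; n <= y.length(); n++) {
                t[m][n] = (x.charAt(m - 1) == y.charAt(n - 1))
                        ? t[m - 1][n - 1] + 1
                        : Math.max(t[m - 1][n], t[m][n - 1]);
            }
        }
        return t[x.length()][y.length()];
    }

    /**
     * Number of matchings consisting of exactly k pairs, using only
     * positions >= i of x and positions >= j of y. Each matching is
     * counted once, according to the choice of its first pair (a,b).
     */
    static long count(String x, String y, int i, int j, int k) {
        if (k == 0) return 1;   // only the empty matching
        long total = 0;
        for (int a = i; a < x.length(); a++) {
            for (int b = j; b < y.length(); b++) {
                if (x.charAt(a) == y.charAt(b)) {
                    total += count(x, y, a + 1, b + 1, k - 1);
                }
            }
        }
        return total;
    }

    /** Number of distinct maximal matchings of x and y. */
    static long countMaximalMatchings(String x, String y) {
        return count(x, y, 0, 0, llcs(x, y));
    }

    public static void main(String[] args) {
        // Prints 14, in agreement with the example in the text.
        System.out.println(countMaximalMatchings("adbcbca", "bdaab"));
    }
}
```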

The key to arriving at a solution is to properly characterize the value of NLCS.m.n (for m,n > 0) in terms of NLCS.m.(n-1), NLCS.(m-1).n, and NLCS.(m-1).(n-1). (However, this characterization is a little more complicated than the analogous characterization of LLCS.m.n in terms of LLCS.m.(n-1), LLCS.(m-1).n, and LLCS.(m-1).(n-1).)

One observation that may be crucial in understanding the various recursive cases of NLCS follows from the fact that LLCS ascends along its rows, columns, and diagonals, but by at most one at each step:

If LLCS.m.n = w (where m,n > 0), then each of LLCS.m.(n-1), LLCS.(m-1).n, and LLCS.(m-1).(n-1) is either w-1 or w.

One can picture it like this, where we focus on one two-by-two section of a table holding the values of LLCS:

2-by-2 segment of LLCS Table

          n-1         n
      +----------+----------+
   m  | w-1 or w |    w     |
      +----------+----------+
  m-1 | w-1 or w | w-1 or w |
      +----------+----------+

Assuming that the value in the cell at location (m,n) is w, each of the remaining three cells must contain either w-1 or w (two possibilities), which makes for 2^3, or 8, possible cases. However, only five of those cases can occur in an LLCS table, again due to the fact that it must ascend along rows and columns. (Indeed, if LLCS.(m-1).(n-1) is w, then neither LLCS.m.(n-1) nor LLCS.(m-1).n can be w-1. That eliminates three of the eight possibilities.)
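The case count can be confirmed mechanically. The sketch below (with hypothetical names) enumerates all eight assignments of w-1 or w to the three neighboring cells and keeps those in which the cell at (m-1,n-1) does not exceed the cell to its right or the cell above it in the picture, which is the only constraint the argument above relies on.

```java
final class LlcsNeighborCases {
    /**
     * Counts the assignments of {w-1, w} to the three neighbors of cell
     * (m,n) that are consistent with the table ascending along rows and
     * columns.
     */
    static int countFeasibleCases() {
        int feasible = 0;
        // In each mask, a set bit means "this neighbor equals w" and a
        // clear bit means "this neighbor equals w-1".
        for (int mask = 0; mask < 8; mask++) {
            boolean sameRow = (mask & 1) != 0;  // LLCS.m.(n-1) == w
            boolean sameCol = (mask & 2) != 0;  // LLCS.(m-1).n == w
            boolean diag    = (mask & 4) != 0;  // LLCS.(m-1).(n-1) == w
            // If the diagonal cell holds w, both other neighbors must too.
            if (!diag || (sameRow && sameCol)) {
                feasible++;
            }
        }
        return feasible;
    }

    public static void main(String[] args) {
        System.out.println(countFeasibleCases()); // prints 5
    }
}
```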

To compute NLCS.m.n, it helps to figure out how many of the maximal matchings of x(m) and y(n-1) (respectively, x(m-1) and y(n)) are also maximal matchings of x(m) and y(n). And how many of them are in common, so that you don't count them twice? Similarly, how many of the maximal matchings of x(m-1) and y(n-1) also are maximal matchings of x(m) and y(n) or can be extended by one element to become so?