SE 504 Spring 2020
Prog. Assg. #1: Longest Common Subsequences
Due: Noon, May 4

Where x and y are strings of lengths M and N, respectively, and m and n satisfy 0≤m≤M and 0≤n≤N, respectively, let x_(m) and y_(n) refer to the prefixes of x and y of lengths m and n, respectively. Recall from class that LLCS.m.n is the length of any longest common subsequence (LCS) of x_(m) and y_(n).

An instance of the Java class LongComSubseq, constructed from String objects x and y, is intended to provide information about LCS's of any prefixes of x and y.

Specifically, it can provide

The value LLCS.m.n for any m and n satisfying 0≤m≤M and 0≤n≤N. In fact, one method prints a table showing the values of LLCS.m.n for all such m and n.

One LCS of x_(m) and y_(n) (for specified m and n). (There could be many such LCS's, of course.)

One maximal (m,n)-matching (for specified m and n) (see below).

The number of distinct maximal (m,n)-matchings (for specified m and n) (see below).

Let x_(m) = x₀x₁...xₘ₋₁ and y_(n) = y₀y₁...yₙ₋₁. Then an (m,n)-matching is a pair of mappings (f,g), where, for some k≥0, f : [0..k) ⟶ [0..m) and g : [0..k) ⟶ [0..n), such that both f and g are strictly increasing and, for all i satisfying 0≤i<k, the character of x at position f.i equals the character of y at position g.i (i.e., x[f.i] = y[g.i]). An (m,n)-matching is maximal if k = LLCS.m.n.

As an example, consider

      0   1   2   3   4   5   6   7   8   9
    +---+---+---+---+---+---+---+---+---+---+
  x | a | a | b | a | c | b | a | c | b | a |
    +---+---+---+---+---+---+---+---+---+---+

      0   1   2   3   4   5   6   7   8
    +---+---+---+---+---+---+---+---+---+
  y | c | a | c | a | b | b | c | c | a |
    +---+---+---+---+---+---+---+---+---+

Then the following (non-maximal) (10,9)-matching is consistent with the fact that x₀x₃x₅x₈x₉ = aabba = y₁y₃y₄y₅y₈:

      0   1   2   3   4
    +---+---+---+---+---+
  f | 0 | 3 | 5 | 8 | 9 |
    +---+---+---+---+---+

      0   1   2   3   4
    +---+---+---+---+---+
  g | 1 | 3 | 4 | 5 | 8 |
    +---+---+---+---+---+

Your job is to complete the class. The source code given to you includes three stubbed methods: one is to compute an LCS, another is to compute a maximal matching, and the third is to compute the number of distinct maximal matchings, all with respect to given m and n.

As a technical detail, a matching is to be represented by an array of type int[][] of length two (i.e., having two "rows"). Its two elements (each of type int[], of course) are to correspond to what we called f and g above.
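The definition and the int[][] representation can be combined into a small validity check. The following is only a sketch: the class name MatchingCheck and the method isMatching are hypothetical helpers, not part of the provided LongComSubseq source. It tests the example strings and matching shown above.

```java
final class MatchingCheck {

    /**
     * Returns true if matching[0] (playing the role of f) and matching[1]
     * (playing the role of g) form an (m,n)-matching of x and y.
     * Hypothetical helper; not part of the provided class.
     */
    static boolean isMatching(String x, String y, int m, int n, int[][] matching) {
        if (m < 0 || m > x.length() || n < 0 || n > y.length()) return false;
        if (matching == null || matching.length != 2) return false;
        int[] f = matching[0];
        int[] g = matching[1];
        if (f == null || g == null || f.length != g.length) return false;
        // Invariant: the pairs at indices [0..i) satisfy all three conditions.
        // Bound function: f.length - i
        for (int i = 0; i < f.length; i++) {
            // f.i must lie in [0..m) and g.i must lie in [0..n)
            if (f[i] < 0 || f[i] >= m || g[i] < 0 || g[i] >= n) return false;
            // both mappings must be strictly increasing
            if (i > 0 && (f[i] <= f[i - 1] || g[i] <= g[i - 1])) return false;
            // matched positions must hold the same character
            if (x.charAt(f[i]) != y.charAt(g[i])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String x = "aabacbacba";
        String y = "cacabbcca";
        int[][] matching = { {0, 3, 5, 8, 9}, {1, 3, 4, 5, 8} };
        System.out.println(isMatching(x, y, x.length(), y.length(), matching));
    }
}
```

A check like this says nothing about maximality; for that, the length of the rows would also have to equal LLCS.m.n.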

The method that computes the LLCS function (as described in class) is given to you. You can make use of the matrix that it produces (which is stored in an instance variable) to compute an LCS or a maximal matching. You can use a maximal matching to compute an LCS, and vice versa.
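For reference, the LLCS matrix can be filled in with the standard recurrence. The sketch below assumes that this is the recurrence described in class; the provided method may differ in naming and layout, and the class name LlcsTable is hypothetical. The comments show the style of loop invariant and bound function expected in your own code.

```java
final class LlcsTable {

    /**
     * Returns a table t with t[m][n] = LLCS.m.n for all 0<=m<=M and 0<=n<=N,
     * where M and N are the lengths of x and y.
     */
    static int[][] build(String x, String y) {
        int M = x.length();
        int N = y.length();
        int[][] t = new int[M + 1][N + 1];   // row 0 and column 0 remain 0
        // Invariant: rows [0..m) of t hold correct LLCS values.
        // Bound function: M + 1 - m
        for (int m = 1; m <= M; m++) {
            // Invariant: t[m][0..n) holds correct LLCS values.
            // Bound function: N + 1 - n
            for (int n = 1; n <= N; n++) {
                if (x.charAt(m - 1) == y.charAt(n - 1)) {
                    // last characters of the two prefixes agree
                    t[m][n] = t[m - 1][n - 1] + 1;
                } else {
                    t[m][n] = Math.max(t[m - 1][n], t[m][n - 1]);
                }
            }
        }
        return t;
    }
}
```

For the example strings x = aabacbacba and y = cacabbcca, this yields LLCS.10.9 = 6, which is why the five-element matching shown earlier is not maximal.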

Your source code is expected to be augmented by comments that describe loop invariants and bound functions.

For purposes of testing your work, you are provided with the Java application LongComSubseqTester.

Counting Maximal Matchings

To illustrate, consider the strings x:adbcbca and y:bdaab. The set of LCS's of x and y is { aa, ab, ba, bb, da, db }. For each common subsequence there is at least one matching that gives rise to it, and possibly more.

Describing a matching (f,g) as the sequence of ordered pairs [(f.0,g.0), (f.1,g.1), ...] (rather than as a pair of vectors), and numbering positions from 1 here (unlike the zero-based indexing used above), there are four matchings by which the LCS ba is formed:

[(3,1),(7,3)]
[(3,1),(7,4)]
[(5,1),(7,3)]
[(5,1),(7,4)]

Let NLCS.m.n = the number of distinct (necessarily maximal) matchings describing LCS's of x_(m) and y_(n).

For our example of x:adbcbca and y:bdaab, the final answer, NLCS.7.5, is 14, as the numbers of ways to form aa, ab, ba, bb, da, and db are 1, 4, 4, 1, 2, and 2, respectively.
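The counts above can be confirmed by exhaustive enumeration. The following sketch is a brute-force checker only, useful as a test oracle on short strings; it takes exponential time and is not the table-based characterization that the assignment asks for. The class and method names are hypothetical.

```java
final class MatchingCounter {

    /** Length of an LCS of all of x and all of y, via the standard recurrence. */
    static int llcs(String x, String y) {
        int[][] t = new int[x.length() + 1][y.length() + 1];
        for (int m = 1; m <= x.length(); m++) {
            for (int n = 1; n <= y.length(); n++) {
                t[m][n] = (x.charAt(m - 1) == y.charAt(n - 1))
                        ? t[m - 1][n - 1] + 1
                        : Math.max(t[m - 1][n], t[m][n - 1]);
            }
        }
        return t[x.length()][y.length()];
    }

    /**
     * Number of ways to choose `remaining` pairs (p,q) with p >= i, q >= j,
     * equal characters at x[p] and y[q], strictly increasing in both coordinates.
     */
    static long countFrom(String x, String y, int i, int j, int remaining) {
        if (remaining == 0) return 1;
        long total = 0;
        for (int p = i; p < x.length(); p++) {
            for (int q = j; q < y.length(); q++) {
                if (x.charAt(p) == y.charAt(q)) {
                    total += countFrom(x, y, p + 1, q + 1, remaining - 1);
                }
            }
        }
        return total;
    }

    /** Brute-force value of NLCS.m.n. */
    static long countMaximal(String x, String y, int m, int n) {
        String xm = x.substring(0, m);
        String yn = y.substring(0, n);
        return countFrom(xm, yn, 0, 0, llcs(xm, yn));
    }

    /** Number of matchings of x and y that spell out the string z. */
    static long countSpelling(String x, String y, String z) {
        return spell(x, y, z, 0, 0, 0);
    }

    private static long spell(String x, String y, String z, int i, int j, int k) {
        if (k == z.length()) return 1;
        long total = 0;
        for (int p = i; p < x.length(); p++) {
            for (int q = j; q < y.length(); q++) {
                if (x.charAt(p) == z.charAt(k) && y.charAt(q) == z.charAt(k)) {
                    total += spell(x, y, z, p + 1, q + 1, k + 1);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countMaximal("adbcbca", "bdaab", 7, 5)); // prints 14
    }
}
```

Comparing the output of your table-based method against countMaximal on many short random strings is one way to gain confidence in your characterization of NLCS.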

The key to arriving at a solution is to properly characterize the value of NLCS.m.n (for m,n > 0) in terms of NLCS.m.(n-1), NLCS.(m-1).n, and NLCS.(m-1).(n-1). (However, this characterization is a little more complicated than the analogous characterization of LLCS.m.n in terms of LLCS.m.(n-1), LLCS.(m-1).n, and LLCS.(m-1).(n-1).)

One observation that may be crucial in understanding the various recursive cases of NLCS follows from the fact that LLCS ascends along its rows, columns, and diagonals, but by at most one at each step:

If LLCS.m.n = w (where m,n > 0), then each of LLCS.m.(n-1), LLCS.(m-1).n, and LLCS.(m-1).(n-1) is either w-1 or w.

One can picture it like this, where we focus on one two-by-two section of a table holding the values of LLCS:

2-by-2 segment of LLCS table

          n-1          n
m       w-1 or w       w
m-1     w-1 or w     w-1 or w

Assuming that the value in the cell at location (m,n) is w, each of the remaining three cells must contain either w-1 or w (two possibilities), which makes for 2³, or 8, possible cases. However, only five of those cases can occur in an LLCS table, again due to the fact that it must ascend along rows and columns. (Indeed, if LLCS.(m-1).(n-1) is w, then neither LLCS.m.(n-1) nor LLCS.(m-1).n can be w-1. That eliminates three of the eight possibilities.)
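The case analysis above can be checked empirically on any pair of strings. In the sketch below (hypothetical class and method names), each two-by-two section is encoded as a three-character string giving how far the cells at (m-1,n-1), (m,n-1), and (m-1,n), in that order, fall below w: '0' for w and '1' for w-1. Under this encoding the five permitted cases are 000, 100, 101, 110, and 111.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class LlcsPatterns {

    /** The five encodings that the argument above permits. */
    static final Set<String> ALLOWED =
            new TreeSet<>(Arrays.asList("000", "100", "101", "110", "111"));

    /** Standard LLCS table: t[m][n] = LLCS.m.n. */
    static int[][] llcsTable(String x, String y) {
        int[][] t = new int[x.length() + 1][y.length() + 1];
        for (int m = 1; m <= x.length(); m++) {
            for (int n = 1; n <= y.length(); n++) {
                t[m][n] = (x.charAt(m - 1) == y.charAt(n - 1))
                        ? t[m - 1][n - 1] + 1
                        : Math.max(t[m - 1][n], t[m][n - 1]);
            }
        }
        return t;
    }

    /** Collects the encodings of all two-by-two sections of the table. */
    static Set<String> patterns(String x, String y) {
        int[][] t = llcsTable(x, y);
        Set<String> seen = new TreeSet<>();
        for (int m = 1; m <= x.length(); m++) {
            for (int n = 1; n <= y.length(); n++) {
                int w = t[m][n];
                int diag = w - t[m - 1][n - 1];   // drop at (m-1,n-1)
                int left = w - t[m][n - 1];       // drop at (m,n-1)
                int up = w - t[m - 1][n];         // drop at (m-1,n)
                seen.add("" + diag + left + up);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<String> seen = patterns("aabacbacba", "cacabbcca");
        System.out.println(seen + " all allowed: " + ALLOWED.containsAll(seen));
    }
}
```

Thinking about what each of the five encodings implies for NLCS.m.n is one possible way to organize the recursive cases.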

To compute NLCS.m.n, it helps to figure out how many of the maximal matchings of x_(m) and y_(n-1) (respectively, x_(m-1) and y_(n)) are also maximal matchings of x_(m) and y_(n). And how many of them are in common, so that you don't count them twice? Similarly, how many of the maximal matchings of x_(m-1) and y_(n-1) also are maximal matchings of x_(m) and y_(n) or can be extended by one element to become so?
