CMPS 260 Spring 2020
Prog. Assg. #1: Regular Expressions
Due Date: April 16
The Relevant Java Components
The purpose of this assignment is to solidify your understanding
of regular expressions. Provided are a Java interface and eleven
Java classes, three of which are incomplete. Instances of
those three classes represent composite regular expressions
whose main operators are union, concatenation, and Kleene/star
closure, respectively.
- RegularExpression:
Abstract class, each of whose child classes represents
regular expressions of a particular variety.
The child classes are as follows. The first two are provided
in full; each of the others has some missing pieces that
the student is to supply.
- RegExprNullSet:
Instances of this class model the atomic regular expression
∅, which represents the empty set of strings.
- RegExprWord:
Instances of this (convenience) class model semi-atomic
regular expressions that represent languages containing a
single string.
Examples: abbab, a, λ.
- RegExprUnion:
Instances of this class model composite regular expressions
having union as their main operator.
Example: ab + b(aba)*.
- RegExprConcat:
Instances of this class model composite regular expressions
having concatenation as their main operator.
Example: (a*b* + ba) . ((c+a).bba)*.
- RegExprStar:
Instances of this class model composite regular expressions
having Kleene/star closure as their main operator.
Example: (a*b* + ba)*.
The remaining Java classes are provided in full.
- RegExprSymbols:
Defines the "plain text" symbols that are used in regular expressions
(as entered on a keyboard and displayed in a console window) to refer
to the operators (union (+), concatenation (.),
and Kleene/star closure (*)),
the null set (N), and the empty string (L).
- RegExprTokenizer:
An instance of this class identifies the sequence of tokens within a
regular expression (as entered by a user on the keyboard) and provides
them to clients through an Iterator-style set of methods
(e.g., hasNext(), next()).
- Stack: Java interface for the stack ADT.
- StackViaArray: class that implements Stack.
- RegExprBuilder:
This class has two static methods, both of which make use of
instances of the RegExprTokenizer and
StackViaArray classes:
- isValid(): verifies the syntactic validity of a given
string that is purported to be a regular expression.
- parse(): given a syntactically valid regular expression,
returns a corresponding instance of the class
RegularExpression.
- RegExprApp:
Java application that allows a user to enter a regular expression
and then answers questions about the language that it represents,
including whether or not it is finite, the length of its shortest
string (and longest string, if it is finite), and whether a given
string is a member.
The intent is that students will use this as a means for testing
their work on this assignment.
Review of Regular Expressions
You will recall that regular expressions come in these varieties:
- Atomic:
- ∅: describes the language having no members.
That is, L(∅) = {}.
- λ: describes the language { λ }
whose lone member is the empty string.
That is, L(λ) = { λ }.
- a: (where a is a letter/symbol) describes
the language { a } containing a lone string of
length one.
That is, L(a) = { a }.
- Composite:
- r + s (where each of r and s is
a regular expression and + represents the union
operator): L(r + s) = L(r) ∪ L(s).
- r · s (where each of r and s is
a regular expression and · represents the
concatenation operator):
L(r · s) = L(r) · L(s).
- r* (where r is a regular expression and
* represents the star closure operator):
L(r*) = L(r)*.
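These definitions are enough to work out, by structural recursion, the kind of statistics that RegExprApp reports (shortest member, finiteness). What follows is only a minimal sketch under stated assumptions: the names LanguageStatsSketch, Lang, Null, Word, Union, Concat, and Star are hypothetical and are not the classes supplied with this assignment, whose methods may be organized quite differently.

```java
import java.util.OptionalInt;

// Hypothetical sketch: every name below is illustrative only and is NOT
// one of the classes supplied with the assignment.
final class LanguageStatsSketch {
    interface Lang {
        OptionalInt shortest();       // empty when the language has no members
        boolean isFinite();           // true when there are finitely many members
        boolean hasNonEmptyMember();  // true when some member has length >= 1
    }

    static final class Null implements Lang {   // the null set
        public OptionalInt shortest() { return OptionalInt.empty(); }
        public boolean isFinite() { return true; }
        public boolean hasNonEmptyMember() { return false; }
    }

    static final class Word implements Lang {   // a single string ("" for lambda)
        private final String w;
        Word(String w) { this.w = w; }
        public OptionalInt shortest() { return OptionalInt.of(w.length()); }
        public boolean isFinite() { return true; }
        public boolean hasNonEmptyMember() { return !w.isEmpty(); }
    }

    static final class Union implements Lang {
        private final Lang r, s;
        Union(Lang r, Lang s) { this.r = r; this.s = s; }
        public OptionalInt shortest() {
            OptionalInt a = r.shortest(), b = s.shortest();
            if (!a.isPresent()) return b;
            if (!b.isPresent()) return a;
            return OptionalInt.of(Math.min(a.getAsInt(), b.getAsInt()));
        }
        public boolean isFinite() { return r.isFinite() && s.isFinite(); }
        public boolean hasNonEmptyMember() {
            return r.hasNonEmptyMember() || s.hasNonEmptyMember();
        }
    }

    static final class Concat implements Lang {
        private final Lang r, s;
        Concat(Lang r, Lang s) { this.r = r; this.s = s; }
        // L(r) . L(s) has no members as soon as either operand has none.
        private boolean hasNoMembers() {
            return !r.shortest().isPresent() || !s.shortest().isPresent();
        }
        public OptionalInt shortest() {
            if (hasNoMembers()) return OptionalInt.empty();
            return OptionalInt.of(r.shortest().getAsInt() + s.shortest().getAsInt());
        }
        public boolean isFinite() {
            return hasNoMembers() || (r.isFinite() && s.isFinite());
        }
        public boolean hasNonEmptyMember() {
            return !hasNoMembers() && (r.hasNonEmptyMember() || s.hasNonEmptyMember());
        }
    }

    static final class Star implements Lang {
        private final Lang r;
        Star(Lang r) { this.r = r; }
        // lambda is always a member of L(r*).
        public OptionalInt shortest() { return OptionalInt.of(0); }
        // L(r*) is infinite exactly when L(r) has a non-empty member to repeat.
        public boolean isFinite() { return !r.hasNonEmptyMember(); }
        public boolean hasNonEmptyMember() { return r.hasNonEmptyMember(); }
    }
}
```

Under these assumptions, the structure for (ab + ∅) · cc has a shortest member of length 4 and is finite, while the star of ∅ still has λ as its lone member.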
Deciding Membership
Deciding whether a given string is a member of the language described
by an atomic regular expression is straightforward. As for the
composite regular expressions:
- x ∈ L(r + s) iff either
x ∈ L(r) or
x ∈ L(s)
- x ∈ L(r · s) iff
there exist strings y and z such that
- x = yz,
- y ∈ L(r), and
- z ∈ L(s)
- x ∈ L(r*) iff
either x = λ or
there exist strings y and z such that
- x = yz,
- y ≠ λ,
- y ∈ L(r), and
- z ∈ L(r*)
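The three rules above translate almost directly into a recursive membership test that tries every way of splitting x into y and z. The following is a minimal sketch under stated assumptions, not the intended solution: MembershipSketch, Rx, and the factory methods are hypothetical names, and the provided classes may expose membership through a different interface. Trying all splits can take exponential time in the worst case; the sketch favors clarity over efficiency.

```java
// Hypothetical sketch: the names here are illustrative only and are NOT
// the classes supplied with the assignment.
final class MembershipSketch {
    interface Rx { boolean matches(String x); }

    // The null set has no members.
    static Rx nullSet() { return x -> false; }

    // A word (possibly the empty string) matches only itself.
    static Rx word(String w) { return x -> x.equals(w); }

    // x is in L(r + s) iff x is in L(r) or x is in L(s).
    static Rx union(Rx r, Rx s) { return x -> r.matches(x) || s.matches(x); }

    // x is in L(r . s) iff x = yz with y in L(r) and z in L(s).
    static Rx concat(Rx r, Rx s) {
        return x -> {
            for (int i = 0; i <= x.length(); i++) {
                if (r.matches(x.substring(0, i)) && s.matches(x.substring(i))) {
                    return true;
                }
            }
            return false;
        };
    }

    // x is in L(r*) iff x is empty, or x = yz with y non-empty,
    // y in L(r), and z in L(r*).
    static Rx star(Rx r) {
        return new Rx() {
            public boolean matches(String x) {
                if (x.isEmpty()) return true;
                // Starting at i = 1 enforces y != lambda, so z is strictly
                // shorter than x and the recursion terminates.
                for (int i = 1; i <= x.length(); i++) {
                    if (r.matches(x.substring(0, i)) && this.matches(x.substring(i))) {
                        return true;
                    }
                }
                return false;
            }
        };
    }
}
```

For example, with e = star(union(word("aa"), star(word("ba")))), the call e.matches("aabaaa") is true (aa, ba, aa) and e.matches("aabaaaa") is false. The requirement y ≠ λ is what guarantees termination even when L(r) itself contains λ.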
Treating "words" as Atomic Regular Expressions
Technically, the regular expression abbab is an abbreviation
for a·b·b·a·b, which is itself
an abbreviation for (((a·b)·b)·a)·b.
For the sake of convenience (and efficiency), we treat
regular expressions such as abbab as being atomic.
Such expressions are modeled by the class RegExprWord.
From a user's point of view, the main consequence of this is
that, for example, abbab* is equivalent to
(abbab)* rather than to abba · b*.
Because, by convention, the star operator has higher precedence
than concatenation (which itself has higher precedence than union),
the normal interpretation would be the latter rather than the former.
One way to look at it is that, in our system, implicit concatenation
has higher precedence than star. For the sake of avoiding ambiguity,
the image of a regular expression, as produced by the toString()
method in the relevant class (i.e., RegExprStar), will include
parentheses surrounding any sub-expression of length two or more
to which the star operator applies.
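The parenthesization rule can be illustrated with a small helper. This is a sketch only: the class and method names (StarImageSketch, starImage) are hypothetical, and it assumes that "length" means the number of characters in the image of the sub-expression being starred, which may not be exactly how RegExprStar measures it.

```java
// Hypothetical helper, not part of the provided classes.
final class StarImageSketch {
    // Builds the image of r* from the image of r, parenthesizing the
    // operand whenever its image is two or more characters long
    // (an assumed reading of "length two or more").
    static String starImage(String operandImage) {
        if (operandImage.length() >= 2) {
            return "(" + operandImage + ")*";
        }
        return operandImage + "*";
    }
}
```

Under that assumption, starImage("a") yields a*, starImage("bb") yields (bb)*, and starImage("(aa + b.a*)") yields ((aa + b.a*))*, so a reader can always tell exactly what the star applies to.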
Dealing with ASCII Limitations
Because the ASCII character set has no superscripts and no
symbols such as ∅, λ, or ·, we have to use substitutes when
entering regular expressions at the keyboard and viewing them
in a console window.
The ones chosen are reflected by the symbolic constants defined
in the RegExprSymbols class. In particular,
- N is used in place of ∅
- L is used in place of λ
- . is used in place of ·
- * (written on the line) is used in place of the superscript *
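Because each operator is a single keyboard character, splitting an entered expression into tokens amounts to a left-to-right scan. The following is a minimal sketch of an Iterator-style tokenizer. It is not the provided RegExprTokenizer: the class name TokenizerSketch, the choice of String as the token type, and the treatment of parentheses as tokens are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch; the provided RegExprTokenizer may differ.
final class TokenizerSketch implements Iterator<String> {
    // Single-character tokens: the three operators plus parentheses (assumed).
    private static final String SINGLE_CHAR_TOKENS = "+.*()";
    private final String text;
    private int pos = 0;

    TokenizerSketch(String text) {
        this.text = text;
        skipSpaces();
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    public boolean hasNext() {
        return pos < text.length();
    }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        int start = pos;
        if (SINGLE_CHAR_TOKENS.indexOf(text.charAt(pos)) >= 0) {
            pos++;  // an operator or a parenthesis is a token by itself
        } else {
            // A word (including N or L) is a maximal run of other characters.
            while (pos < text.length()
                    && !Character.isWhitespace(text.charAt(pos))
                    && SINGLE_CHAR_TOKENS.indexOf(text.charAt(pos)) < 0) {
                pos++;
            }
        }
        String token = text.substring(start, pos);
        skipSpaces();
        return token;
    }
}
```

Given the input (ab + bb*) . bab, repeated calls to next() yield the tokens (, ab, +, bb, *, ), ., and bab, in that order. Keeping bb together as one token is what makes bb* mean (bb)* in this system.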
Submission of Work
All that you should submit are the three .java files
corresponding to the classes RegExprUnion,
RegExprConcat, and RegExprStar.
Use the file submission system, to which there is a link on
the course web page.
Make sure that you use comments to list names of people with
whom you collaborated and to acknowledge any defects that
you have identified.
Sample Dialog with RegExprApp
What follows is a "transcript" of an interaction between a user
and the RegExprApp application. The first thing that
the program does is to print the "help page", which lists all
the commands that the program can respond to.
All user input appears to the right of the > prompt.
Commands:
---------
q: to quit.
h: for this list.
n <regular expression>: to establish a new regular expression.
Example: n aba.(ba)* + bba
d: to display the current regular expression
m <string>: to test string for membership
Example: m bbabaab
s: to display stats about the current regular expression.
g [seed]: to display a random member of the language.
r: to display reverse of current rexpr.
> n (ab + bb*) . bab + a*
New regular expression is ((ab + (bb)*).bab + a*)
> s
Image: ((ab + (bb)*).bab + a*)
Shortest member has length 0
Has infinitely many members.
> g
Random member: |aaa|
> g
Random member: |bbbbbbbbbab|
> m bbbbbab
The string |bbbbbab| is a member.
> m
The string || is a member.
> n (aa + ba*)*
New regular expression is ((aa + (ba)*))*
> g
Random member: |aabaaa|
> m aabaaaa
The string |aabaaaa| is NOT a member.
> r
Reverse is ((aa + (ab)*))*
> n (aa + b.a*)*
New regular expression is ((aa + b.a*))*
> g
Random member: |aabaaaa|
> g
Random member: |aa|
> g
Random member: |baabaaaaaa|
> s
Image: ((aa + b.a*))*
Shortest member has length 0
Has infinitely many members.
> m aabaaaaaabaaabbbbbaa
The string |aabaaaaaabaaabbbbbaa| is a member.
> n (a.(L + b))* . cca
New regular expression is (a.(L + b))*.cca
> g 27
Random member: |aaabcca|
> s
Image: (a.(L + b))*.cca
Shortest member has length 3
Has infinitely many members.
> n (ab + N) . cc
New regular expression is (ab + N).cc
> s
Image: (ab + N).cc
Shortest member has length 4
Longest member has length 4
> n N
New regular expression is N
> s
Image: N
Has no members.
> h
Commands:
---------
q: to quit.
h: for this list.
n <regular expression>: to establish a new regular expression.
Example: n aba.(ba)* + bba
d: to display the current regular expression
m <string>: to test string for membership
Example: m bbabaab
s: to display stats about the current regular expression.
g [seed]: to display a random member of the language.
r: to display reverse of current rexpr.
> q
Goodbye.