Regular Expressions Revisited

Most Java developers seem to shy away from regular expressions because they are difficult. And I fully agree: the syntax devised for regular expressions puts the word terse to shame. We all know how to match everything with .* but when it gets to negative lookahead or atomic groups, most of my peers, including me, tend to run away. It is a bit of a shame. Regular expressions, properly used, are incredibly powerful. You can catch an enormous number of error cases in input that would otherwise take hundreds of lines of code to analyze.

And the syntax is one thing; the secondary peeve is that Pattern and Matcher feel as antiquated as the Dictionary we still seem to cherish in OSGi. Much of the API could benefit greatly from Stream and Optional.

These thoughts have been playing through my mind for many a year. This week I needed some regular expressions and decided to just scratch my itch. First, the more conscientious will tell me: “but there is an open source project X that does this already!” They’re of course completely right, but I’ve made it a point of honor to never have any external dependency in bndlib. Second, I am arrogant enough to think I can do better :sunglasses:. And hey, it was a fun library to write.

I did take a peek at JavaVerbalExpressions but the syntax felt awkward. (I know I am prejudiced.) Although I love the Builder pattern, it seems quite off for regular expressions, primarily because you should be able to build complex expressions out of simple expressions, and the builder pattern is not really suited for that.

So I took the dive, and of course spent too much time on it. You can find the result here. The whole library is in 2 classes so it is easy to copy the code. No dependencies but Java 17.

So how would you use this?

First, the implementation is in the class Catalog. Most methods are static and can be statically imported to use their short version. A tip for bndtools/Eclipse users: you can add Catalog to the preferences in Favorites to get automatic imports. The API is in the interface RE.

The simplest is matching the character a.

RE a = lit("a");

This is a literal. I’ve kept the names short to make the grammar more readable. You can quantify a regular expression with the methods set, some, opt, or multiple. set maps to the infamous *, some is +, and opt is ?.

RE a = lit("a");
RE as = set(a);

If you call toString on an RE, it will look like a. To match a string you’d call a.matches("a"). This will return an Optional with a present value of Match. A Match implements CharSequence and toString() will provide a String with exactly the matched text. It contains a number of other useful methods we discuss later.

The Optional removes the awkward pattern where you first have to store the Matcher in a temporary variable, then call the find, matches, or lookingAt, and then get the results.

Pattern SOME_A_P = Pattern.compile("a+");
Matcher m = SOME_A_P.matcher("some string");
while(m.find()) { .... }

With an optional, the chores are done for you and you only use the result when there actually is one.

int count = as
  .findIn( "bbaaaaaaacccccc" )
  .map( m-> m.length())
  .orElse(-1);
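For comparison, the same Optional-returning shape can be approximated on top of the plain JDK. This is only a minimal sketch: the helper name findIn below is my own assumption for illustration, not the RE library's implementation, and it uses nothing beyond Matcher.find() and Matcher.toMatchResult().

```java
import java.util.Optional;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalFind {
    // Hypothetical helper (not part of the JDK or the RE library):
    // hides the "create Matcher, call find, read result" sequence behind an Optional.
    static Optional<MatchResult> findIn(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? Optional.of(m.toMatchResult()) : Optional.empty();
    }

    public static void main(String[] args) {
        Pattern someA = Pattern.compile("a+");
        // Length of the first run of a's, or -1 when there is none.
        int length = findIn(someA, "bbaaaaaaacccccc")
            .map(r -> r.group().length())
            .orElse(-1);
        System.out.println(length);
    }
}
```

The caller never sees the Matcher, so there is no temporary variable to manage and no way to read a group before a successful find.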

When you use the find() method on the classic Matcher, you often use it in a loop to process all the matches. In the RE library this is replaced with the Java Stream library. For example, we want to make a list of words in a text.

String poem = """
    I wandered lonely as a cloud
    That floats on high o'er vales and hills,
    When all at once I saw a crowd,
    A host, of golden daffodils;
    Beside the lake, beneath the trees,
    Fluttering and dancing in the breeze.
""";
Catalog.word.findAllIn(poem).map(Object::toString)
      .map(String::toLowerCase)
      .distinct()
      .sorted()
      .toList();
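To be fair to the JDK, Matcher.results() (available since Java 9) already offers a stream of matches. The sketch below shows roughly the same word list with plain java.util.regex. Note the assumption: I use \w+ as a stand-in for a word, which may well differ from what Catalog.word matches.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

final class WordList {
    // Assumption: \w+ approximates "a word"; Catalog.word may be defined differently.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns the distinct words of the text, lower-cased and sorted.
    static List<String> distinctWords(String text) {
        return WORD.matcher(text)
            .results()                 // Stream<MatchResult>, one element per match
            .map(MatchResult::group)   // the matched text
            .map(String::toLowerCase)
            .distinct()
            .sorted()
            .toList();
    }
}
```

With this stand-in pattern a word such as o'er is split at the apostrophe, which is exactly the kind of detail a named, reusable expression can hide.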

One of the parts I like best is how you can compose really complicated regular expressions out of simpler ones. Imagine you need to match Java Identifiers. Java identifiers can contain a lot of Unicode characters and have some special rules. This is how it looks in RE.

RE id = g( javaIdentifierStart, set(javaIdentifierPart));

We can now use the java identifier to construct a fully qualified name.

RE fqn = g( id, set( dot, id ));

The RE library is also a bit more pragmatic. In most parsing situations the whitespace handling tends to take a lot of effort. For example, matching a string like "biz.aQute.bnd, javax, com.example" makes many textual regular expressions a pain to read. In the library, the term and list methods automatically match zero or more whitespace characters ahead of each RE. This greatly simplifies the grammar. For example, to represent a comma separated list of fully qualified names:

RE fqns = list(fqn);
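To show what this composition replaces, here is a sketch of the same three steps (identifier, fully qualified name, comma separated list) built by string concatenation in plain java.util.regex. The class and constant names are mine, and the exact whitespace rules of list are an assumption; I simply allow optional whitespace around each comma and at both ends.

```java
import java.util.regex.Pattern;

final class FqnList {
    // \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart} map to the
    // Character.isJavaIdentifierStart/Part methods.
    static final String ID =
        "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";

    // An identifier followed by zero or more ".identifier" segments.
    static final String FQN = ID + "(?:\\." + ID + ")*";

    // Assumed whitespace handling: optional whitespace around commas and at the ends.
    static final Pattern FQNS =
        Pattern.compile("\\s*" + FQN + "(?:\\s*,\\s*" + FQN + ")*\\s*");

    static boolean isFqnList(String input) {
        return FQNS.matcher(input).matches();
    }
}
```

It works, but every level of nesting needs its own non-capturing group and its own escaped whitespace, and nothing stops me from forgetting a parenthesis during concatenation.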

A powerful aspect of the Pattern class is its named groups. For example, you’re parsing a Java file and want to find the package name. You will need to match the keyword package to make sure you really got the package.

RE package_    = lit("package");
RE packageDecl = term(package_,ws,g("pname",fqn));
String pname   = packageDecl
  .findIn(someProgram)
  .map( m->m.group("pname") )
  .orElse(".");
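For readers who want to see the classic equivalent, this is a sketch of the same lookup with a named group in plain java.util.regex. The class and method names are hypothetical, and I require at least one whitespace character after the keyword, which is my assumption about what ws means.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PackageName {
    static final String ID =
        "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
    static final String FQN = ID + "(?:\\." + ID + ")*";

    // The keyword, some whitespace, then the name captured in the group "pname".
    static final Pattern PACKAGE_DECL =
        Pattern.compile("\\bpackage\\s+(?<pname>" + FQN + ")");

    // Returns the declared package name, if the source has a package declaration.
    static Optional<String> packageName(String source) {
        Matcher m = PACKAGE_DECL.matcher(source);
        return m.find() ? Optional.of(m.group("pname")) : Optional.empty();
    }
}
```

Like the RE version, this searches the text naively, so a package statement inside a comment would also match.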

Regular expressions have some facilities that are hard to use in their regular form but become quite useful in RE. For example, we want to count quotes that are not escaped.

RE unescapedQuote = g( behind("\\").not(), lit("'"));

assertThat(unescapedQuote.findAllIn("a', b\\', c', d\\', e'")
  .count()).isEqualTo(3);
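In raw regex syntax, "a quote that is not preceded by a backslash" is a negative lookbehind. A minimal sketch with plain java.util.regex, using hypothetical names, shows how many backslashes that takes once Java string escaping is added on top:

```java
import java.util.regex.Pattern;

final class UnescapedQuotes {
    // Regex text: (?<!\\)'  which reads as "a quote not preceded by a backslash".
    // Each regex backslash must be doubled again inside a Java string literal.
    static final Pattern UNESCAPED_QUOTE = Pattern.compile("(?<!\\\\)'");

    // Counts the single quotes that are not directly preceded by a backslash.
    static long count(String input) {
        return UNESCAPED_QUOTE.matcher(input).results().count();
    }
}
```

This simple form does not handle an escaped backslash directly in front of a quote; it only looks at the one character before the quote.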

Last, a real-world example. OSGi uses a syntax for manifest headers that is quite powerful but also quite complex. In the spec it is called Parameters. In bnd, we relaxed many of the OSGi requirements to accept more types of input. For the values we support single quoted strings and double quoted strings as well as anything that cannot be confused with the delimiters. With RE, the syntax looks like:

RE id         = g(javaJavaIdentifierStart, set(javaJavaIdentifierPart));
RE strings    = or(string('\''), string('"'));
RE freeform   = set(not(cc("\'\",;")));
RE value      = or(strings, freeform);
RE property   = term(id, lit("="), value);
RE clause     = term(list(id, semicolon), set(term(semicolon, property)));
RE parameters = list(clause);

It is useful to verify that a string is a proper Parameters. However, it would be nicer to construct a proper Parameters object out of it.

For this purpose, the Match object that is passed to the Optional and Stream callbacks maintains a rover. This is the current index in the string that the whole expression matched.

A number of convenient methods are provided to match from the rover position to the end of the matching string. The take(re) method expects the re to match from the rover position forward; this is called lookingAt in the Java API. The rover is then updated to the end of this match and the string that was matched is returned. The check(re) method also looks at the current rover position. If it matches, the rover is updated to the end of the match and true is returned. Otherwise nothing is changed and false is returned.

If take(re) finds no match, an exception is thrown. This rather brute approach is acceptable here because the whole syntax is already verified by the outer RE. If these inner matches do not match, then the code is clearly incorrect.
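The rover idea is easy to illustrate with the plain JDK. The class below is a hypothetical sketch, not the library's Match class: it keeps a position in the input and uses Matcher.region plus lookingAt to match only at that position. It assumes take throws and check returns false when nothing matches, and it does no whitespace skipping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical cursor over an input string, for illustration only. */
final class Rover {
    private final String input;
    private int position; // current rover index into input

    Rover(String input) {
        this.input = input;
    }

    /** Matches at the rover, moves the rover past the match, returns the matched text. */
    String take(Pattern p) {
        Matcher m = p.matcher(input).region(position, input.length());
        if (!m.lookingAt())
            throw new IllegalStateException("no match at index " + position);
        position = m.end();
        return m.group();
    }

    /** Moves the rover and returns true on a match; otherwise changes nothing. */
    boolean check(Pattern p) {
        Matcher m = p.matcher(input).region(position, input.length());
        if (!m.lookingAt())
            return false;
        position = m.end();
        return true;
    }

    int position() {
        return position;
    }
}
```

Because check leaves the rover untouched on failure, it can drive loops such as "while there is another separator", while take is for the parts that must be present.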

To parse a Parameters out of the matching expression, the following code would do it.

ParameterMap pars = x.parameters.matches(
  "a ;b, c;foo = '\"bar\"',   d;e ; f;g=3, h;s=\";bla\\\"bla,\"")
.map(m -> {
	ParameterMap ps = new ParameterMap();
	do {
		Set<String> aliases = new LinkedHashSet<>();
		Attributes attrs = new Attributes();
		do {
			String key = m.take(x.id);
			if (m.check(x.eq)) {
				String value = m.take(x.value);
				value = fixupStrings(value);
				attrs.put(key, value);
			} else
				aliases.add(key);
		} while (m.check(semicolon));
		aliases.forEach(k -> ps.put(k, attrs));
	} while (m.check(comma));
	return ps;
})
.orElse(new ParameterMap());

The RE library is brand new and written from scratch so we’ll probably find some shortcomings but it turned out, imho, quite nice. There are numerous test cases so it should do the basic stuff quite well. It is in aQute.libg so it will end up on Maven Central when we release 7.1.0. It will soon be available from our snapshot repository.