Expression Pattern Language (eXPL)

Regular Expressions

Regular expressions match text to a pattern and optionally collect portions of text referred to as groups. eXPL regular expressions are implemented using the Java Pattern class. An eXPL regular expression behaves like the ? operator in that a false condition causes execution to short circuit.

Format

The regular expression declaration starts with the keyword regex, followed by a match expression and an optional set of one or more groups enclosed in braces:

regex[( option-set )] match-expression [ { group-set } ]

Match Expression

The match expression has the format: input ? regular-expression. The input is a variable, potentially assigned to an expression, from which the text is extracted to perform the match. The regular expression is a string literal or a variable. A variable is useful for building a regular expression by concatenating components instead of having one unwieldy string literal.

As in Java, a backslash \ character, when used in a regular expression, needs to be escaped by using 2 backslashes \\.
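
Since eXPL follows the Java convention here, the escaping rule can be illustrated in plain Java. The following is only a sketch: the class and method names are invented for illustration, and the pattern is the leading part of the one used in the Groups Example further on.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Written with two backslashes in source code. The string itself holds one
    // backslash, so the regex engine sees "\." which matches a literal dot.
    static final String ABBREVIATION = "^(.)\\. ";

    static boolean startsWithAbbreviation(String text) {
        return Pattern.compile(ABBREVIATION).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(ABBREVIATION.length());                  // 7
        System.out.println(startsWithAbbreviation("n. a word"));    // true
        System.out.println(startsWithAbbreviation("no dot here"));  // false
    }
}
```

The length of 7 confirms that the doubled backslash in the source becomes a single character in the pattern.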

Groups

The group set is a comma-delimited list of identifiers, with the sequence corresponding to the occurrence of groups defined in the regular expression. If a group is made optional in the pattern using the "?" operator, then the group variable may not be set. In this case, the variable needs to be set to a default value before the regular expression match occurs. A group variable can also be assigned a type if conversion from a string to a number type is required.
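
The reason for the default value can be seen in the underlying Java behaviour, where an optional group that takes no part in the match is reported as null. The sketch below is plain Java, not eXPL, and assumes a hypothetical pattern of letters followed by an optional run of digits. The default is set before the match and the string to number conversion happens only when the group is present.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Hypothetical pattern: letters, then an optional group of digits.
    static final Pattern CODE = Pattern.compile("^([a-z]+)(\\d+)?$");

    static int numberOf(String text) {
        int number = 0;                              // default, set before the match
        Matcher m = CODE.matcher(text);
        if (m.matches() && m.group(2) != null) {     // group(2) is null when absent
            number = Integer.parseInt(m.group(2));   // string to number conversion
        }
        return number;
    }

    public static void main(String[] args) {
        System.out.println(numberOf("abc12")); // 12
        System.out.println(numberOf("abc"));   // 0
    }
}
```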

Options

The option set is comma-delimited. The options are the Java Pattern class flags, but in lower case, e.g. "case_insensitive" for Pattern.CASE_INSENSITIVE, which enables case-insensitive matching.
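
As an illustration of the naming convention, the sketch below shows one way a lower-case option set could be turned into Java Pattern flags. It is an assumption made for this example, not a description of how eXPL itself resolves options, and the flagsFor helper is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class OptionFlagsDemo {
    // Hypothetical helper: upper-cases each option name and looks up the
    // public constant of the same name on java.util.regex.Pattern.
    static int flagsFor(String optionSet) {
        int flags = 0;
        for (String option : optionSet.split(",")) {
            String name = option.trim().toUpperCase(Locale.ROOT);
            try {
                flags |= Pattern.class.getField(name).getInt(null);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Unknown option: " + option.trim(), e);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        int flags = flagsFor("case_insensitive, multiline");
        System.out.println(flags == (Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)); // true
        System.out.println(Pattern.compile("dog", flags).matcher("Dog").matches());  // true
    }
}
```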

Short Circuit Behaviour

A regular expression can be used either as a term in a template or calculator, or as the condition part of a calculator conditional block. When used as a term, the regular expression operation causes evaluation to short circuit on no match, thus acting as a filter mechanism. When a match occurs, the term contains the input to the regular expression, which may be useful for providing a query solution.

When used in a conditional block, a regular expression that fails to match causes evaluation to skip over the block. When a regular expression is intended only to capture group values, an empty block may be used to prevent term short circuit behaviour.
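
The filter behaviour of a regex term has a rough analogue in a Java stream filter, where inputs that fail to match are dropped and matching inputs pass through unchanged. The sketch below is an analogy only, with invented class and method names, and does not reproduce eXPL evaluation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexFilterDemo {
    // Keeps only the inputs in which the pattern is found. Each surviving
    // element is the original input, as with a matching regex term.
    static List<String> keepMatches(List<String> inputs, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return inputs.stream()
                .filter(s -> pattern.matcher(s).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("inadequate", "outcome", "incentive");
        System.out.println(keepMatches(words, "^in")); // [inadequate, incentive]
    }
}
```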

Case-insensitive Matching

Application Pets of tutorial11 takes advantage of the "case_insensitive" option to handle the fact that the input has a mix of cases. It works with pet information in XML format to print out details on dogs. The species elements have a mix of "dog", "Dog", "cat" and "Cat". The application traverses the input data using a cursor in an unconditional loop. To prevent the loop from being exited prematurely when the first cat is encountered, the regex is used in a conditional branch. Here is the loop:

{
  ? pet.fact,
  regex(case_insensitive) dog = (pet++) ? petRegex { name, color }
  { dogs += name + " is a " + color + " dog." }
}
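
For readers more familiar with Java, the loop can be approximated as follows. This is a sketch under stated assumptions: the PET pattern is a hypothetical stand-in for petRegex, whose definition is not shown here, and the input is simplified to one text line per pet rather than XML.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DogLoopDemo {
    // Hypothetical stand-in for petRegex, matching lines like "Dog: Rex, black".
    static final Pattern PET =
            Pattern.compile("^dog: (\\w+), (\\w+)$", Pattern.CASE_INSENSITIVE);

    static String describeDogs(List<String> pets) {
        StringBuilder dogs = new StringBuilder();
        for (String pet : pets) {
            Matcher m = PET.matcher(pet);
            if (m.matches()) {
                // Conditional use: a cat is skipped and the loop carries on.
                dogs.append(m.group(1)).append(" is a ")
                    .append(m.group(2)).append(" dog.\n");
            }
        }
        return dogs.toString();
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<String> pets = Arrays.asList("dog: Lassie, brown", "Cat: Tom, grey", "Dog: Rex, black");
        System.out.print(describeDogs(pets));
    }
}
```

With the sample data, the cat in the middle of the list does not stop the second dog from being reported.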

Groups Example

The RegexGroups application of tutorial11 displays a selection of words with dictionary details. It has a regular expression which defines two groups to separate the part of speech (noun, verb, adverb or adjective) from the definition. The part of speech is subsequently expanded for readability from one character to four. This is the regular expression pattern, declared as a string variable, followed by the regex term and the two terms that export the groups:

  string defRegex = "^(.)\\. (.*+)";
...
. regex definition ? defRegex { part, def },
  expnd[part],
  def

Here are the first 3 of 54 results formed by concatenating the exported terms:

inadequate (adj.) not sufficient to meet a need
incentive (noun) a positive motivational influence
incidence (noun) the relative frequency of occurrence...

Points of note are:

  • The definition input is prevented from being exported by placing a dot in front of the regex term.
  • The "def" group variable is placed as a template term so it will be exported.
  • If, for some reason, the definition regular expression fails to match, then the whole entry will be skipped. This is preferred to creating a partial entry.
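
The same pattern can be tried out directly in Java. The sketch below is not the tutorial code: the expansion table is a hypothetical stand-in for the expnd list, and the one-letter codes it uses are assumptions. An entry whose definition fails to match is returned as empty, mirroring the skip behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefinitionDemo {
    // Same pattern as defRegex: one character, a literal dot, a space,
    // then the rest of the line (possessive quantifier, no backtracking).
    static final Pattern DEF_REGEX = Pattern.compile("^(.)\\. (.*+)");

    // Hypothetical expansion table; the actual codes are assumptions.
    static final Map<String, String> EXPAND = new HashMap<>();
    static {
        EXPAND.put("n", "noun");
        EXPAND.put("v", "verb");
        EXPAND.put("a", "adj.");
        EXPAND.put("r", "adv.");
    }

    static Optional<String> entry(String word, String definition) {
        Matcher m = DEF_REGEX.matcher(definition);
        if (!m.find()) {
            return Optional.empty();   // no match: the whole entry is skipped
        }
        String part = EXPAND.getOrDefault(m.group(1), m.group(1));
        return Optional.of(word + " (" + part + ") " + m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(entry("incentive", "n. a positive motivational influence").orElse("skipped"));
        System.out.println(entry("incentive", "no part of speech").orElse("skipped"));
    }
}
```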

Literal Regex

A second groups example, appropriately named RegexGroups2, is similar to the first, but there is now a single input consisting of an entire dictionary entry, so a third group is required to extract each word. The only notable difference is that the regular expression is a string literal, which turns out to be quite short despite defining 3 groups:

. regex entry ?
"(^in[^ ]+) - (.)\\. (.*+)" { word, pos, definition },

Group Type and Default

Application ServiceItems in tutorial11 takes a record of a set of services, each identified by a code, together with charges in dollar and cent amounts, and translates it to an axiom list. A regular expression extracts service codes and amounts as groups. Each amount needs to be converted from text to a decimal number, so the amount group variable is assigned a currency type and the conversion occurs automatically. Some services are free, which is signalled in the input by the absence of an amount. To handle this, the regular expression uses the "?" operator on the amount group to indicate it occurs 0 or 1 times. Because the amount may be absent, the amount variable is assigned a default value of 0.0 in addition to having a currency type. Finally, the input has a line specifying an account number, which the regular expression will fail to match. To allow for this, the regular expression is used in a conditional branch. Here is the amount declaration and translation loop:

currency $ "US" amount,
{
  ? item.fact,
  amount = 0.0,
  regex line = (item++) ? itemRegex { service, amount }
  {
    charges += axiom{
      Service = service,
      Amount = amount.format}
  }
}
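
A rough Java equivalent of this loop is sketched below. The ITEM pattern and the sample lines are hypothetical, since itemRegex and the input record are not reproduced here, and each axiom is represented as a plain string. BigDecimal and a US currency formatter stand in for the eXPL currency type.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ServiceItemsDemo {
    // Hypothetical stand-in for itemRegex: a service code, then an optional amount.
    static final Pattern ITEM = Pattern.compile("^(S\\d+) ?(\\d+\\.\\d{2})?$");

    static List<String> charges(List<String> lines) {
        NumberFormat usd = NumberFormat.getCurrencyInstance(Locale.US);
        List<String> charges = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ITEM.matcher(line);
            if (m.matches()) {                        // the account line fails and is skipped
                BigDecimal amount = BigDecimal.ZERO;  // default for free services
                if (m.group(2) != null) {
                    amount = new BigDecimal(m.group(2));
                }
                charges.add("Service = " + m.group(1) + ", Amount = " + usd.format(amount));
            }
        }
        return charges;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        for (String charge : charges(Arrays.asList("Account 12345", "S01 12.50", "S02"))) {
            System.out.println(charge);
        }
    }
}
```

Resetting the amount to zero inside the loop corresponds to the amount = 0.0 assignment that precedes the regex in the eXPL version.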