Roo

The Regex data type

Regular expressions are represented by the Regex class. In short, a regular expression (regex) is a sequence of characters that defines a search pattern. This pattern can then be applied to text to find characters, words and phrases of interest. Explaining how to write regex patterns is outside this document’s scope, but for interactively testing an expression in a browser, https://regex101.com is excellent.

A Regex is typically created with a regex literal written in PCRE syntax, with one exception: because pipes delimit the literal, a pipe character (|) inside the pattern must be escaped with a backslash (\|). The literal consists of UTF-8 encoded text enclosed by pipes (|):

|\d+| # Matches one or more digits.
|foo\|bar| # Matches "foo" or "bar". Note the escaped pipe.

Modifiers

The closing pipe delimiter may be followed by a number of optional modifiers to adjust the matching behaviour of the regular expression.

  • i: Case-insensitive matching. Letters in the pattern match both upper and lower case letters in the search text. Default is False.
  • s: Normally the period matches any character except a new line. This option allows it to match new lines as well. Default is False.
  • e: Indicates whether patterns are allowed to match an empty string. Default is False.
  • u: Ungreedy matching. By default quantifiers are greedy, meaning they match as much text as possible (everything from the first delimiter to the last one, including whatever lies in between). This option makes them match as little text as possible. Default is False.
  • m: Multiline matching. ^ and $ match at the start and end of each line within the text, not just at the start and end of the whole text. Default is False.

|foo|.matches?("FOO") # => False.
|foo|i.matches?("FOO") # => True.
|bar|ieu # Multiple modifiers can be combined.
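
The effect of these modifiers can be hard to picture from a description alone. The sketch below is not Roo code: it is an analogue written with Python's standard re module, assuming that Roo's i, s and m modifiers behave like Python's IGNORECASE, DOTALL and MULTILINE flags. Python has no ungreedy flag, so the lazy quantifier *? stands in for u, and e has no direct counterpart and is left out. Python's engine is not PCRE, so treat this as an illustration of the concepts rather than a statement of Roo's exact behaviour.

```python
import re

text = "Foo\nbar <b>one</b> <b>two</b>"

# i: case-insensitive matching.
plain = re.search(r"foo", text)                        # no match: "Foo" differs in case
ignore_case = re.search(r"foo", text, re.IGNORECASE)   # matches "Foo"

# s: let the period match new line characters too.
dot_default = re.search(r"Foo.bar", text)              # no match: "." stops at "\n"
dot_all = re.search(r"Foo.bar", text, re.DOTALL)       # matches across the line break

# m: let ^ and $ match at the start and end of every line.
start_default = re.search(r"^bar", text)               # no match: "bar" is not at the start of the text
start_multi = re.search(r"^bar", text, re.MULTILINE)   # matches at the start of line two

# u: ungreedy. A greedy quantifier grabs as much as it can, a lazy one as little.
greedy = re.search(r"<b>.*</b>", text).group()
lazy = re.search(r"<b>.*?</b>", text).group()

print(greedy)  # <b>one</b> <b>two</b>
print(lazy)    # <b>one</b>
```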

To find where a pattern matches some text, we can call the match() method either on a Text object (passing in our Regex object) or on a Regex object (passing in the text to search). Both approaches return a RegexResult object if at least one match is found, or Nothing if no match is found:

var r = |\w+\.\w+@\w+\.com|
"pepper.potts@stark.com".match(r) # => <RegexResult instance>
r.match("pepper.potts@stark.com") # => <RegexResult instance>
"hello there".match(|\w+\.\w+@\w+\.com|) # => Nothing.

If you just want to see if a pattern matches some text, you can use the matches?() method on either a Regex object or a Text object:

"hello".matches?(|.lo|) # => True
|.lo|.matches?("boo") # => False
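
For readers more familiar with other languages, the same "result object or nothing" pattern can be sketched in Python's standard re module. This is an analogue rather than Roo code: search() returning None plays the role of Nothing, and the matches() helper is a hypothetical function written for this example, mirroring what matches?() is described as doing above.

```python
import re

EMAIL = re.compile(r"\w+\.\w+@\w+\.com")


def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


hit = EMAIL.search("pepper.potts@stark.com")   # a match object
miss = EMAIL.search("hello there")             # None, as nothing matched

print(hit.group())                             # pepper.potts@stark.com
print(miss)                                    # None
print(matches(re.compile(r".lo"), "hello"))    # True
print(matches(re.compile(r".lo"), "boo"))      # False
```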

RegexResult objects

A RegexResult object is returned when the query text matches the regex search pattern. This object contains information about the search result. Since it’s possible for there to be multiple matches to a pattern within a single piece of text, this object contains one or more RegexMatch objects. Each RegexMatch object represents a single match to the pattern. There are a number of ways to get these matches:

var r = |love\|hate| # Matches either `love` or `hate`
var result = "Sally loves Harry. Batman hates the Joker".match(r)
result.length # => 2 (as there are two matches, 'love' and 'hate').
var match1 = result.first_match # Could use result.match(0)
var match2 = result.match(1) # The second match has an index of `1` because `match()` is zero-based

RegexMatch

A RegexMatch object contains everything you need to know about an individual match: the (zero-based) start and finish positions of the match within the original query text, the actual text value of the match and information about any capture groups. As the values below show, the finish position is the index just past the last matched character:

# Following on from the love/hate matches above...
match1.value # => "love"
match1.start # => 6
match1.finish # => 10

match2.value # => "hate"
match2.start # => 26
match2.finish # => 30
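
The start and finish values above can be cross-checked outside Roo. The sketch below uses Python's standard re module as an analogue (it is not Roo's API): finditer() yields one match object per occurrence, and its start() and end() follow the same convention of a zero-based start and an end index one past the last matched character.

```python
import re

text = "Sally loves Harry. Batman hates the Joker"

# Collect the value and position of every match of `love` or `hate`.
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r"love|hate", text)]

for value, start, finish in found:
    # Slicing with these positions returns the matched text itself.
    print(value, start, finish, text[start:finish])
# love 6 10 love
# hate 26 30 hate
```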

Capture groups

One of the great strengths of regular expressions is the ability to capture portions of matched text. This is done with capture groups. Any regex contained within parentheses is a capture group. An easy way to get all captures of a match as an array is with RegexMatch.captures:

var r = |(\w+)\.(\w+)@(\w+\.com)|
var captures = "pepper.potts@stark.com".match(r).first_match.captures
print(captures) # ["pepper", "potts", "stark.com"]

If a regex pattern contains capture groups, you can get the contents of a particular group using its group number. The first group is numbered 1. The RegexMatch.group() method returns a MatchInfo object describing the text captured in that group:

var result = "Dr McCoy".match(|(\w+) (\w+)|)
var group1 = result.first_match.group(1)
group1.value # => "Dr"
group1.start # => 0
group1.finish # => 2

var group2 = result.first_match.group(2)
group2.value # => "McCoy"
group2.start # => 3
group2.finish # => 8
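
As a cross-check of the numbering and positions above, here is the same pair of patterns in Python's standard re module. This is an analogue, not Roo code: groups() plays the role described for RegexMatch.captures, and group(n), start(n) and end(n) together cover what a MatchInfo object is described as holding. Groups are numbered from 1 here too.

```python
import re

# All captures of a match at once.
email = re.search(r"(\w+)\.(\w+)@(\w+\.com)", "pepper.potts@stark.com")
captures = list(email.groups())
print(captures)  # ['pepper', 'potts', 'stark.com']

# Individual groups by number, with their positions in the searched text.
m = re.search(r"(\w+) (\w+)", "Dr McCoy")
group1 = (m.group(1), m.start(1), m.end(1))
group2 = (m.group(2), m.start(2), m.end(2))
print(group1)    # ('Dr', 0, 2)
print(group2)    # ('McCoy', 3, 8)
```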

A pattern can also contain named capture groups. These function just like regular capture groups (and indeed are included in the numbered groups) but they are assigned a name to make it easier to retrieve them. Named capture groups are created with the regex syntax (?<name>REGEX). Their data is encapsulated as a MatchInfo object.

var r = |(?<forename>\w+)\.(?<surname>\w+)@(?<domain>\w+\.com)|
var t = "pepper.potts@stark.com"
var match = t.match(r).first_match # Just get the first match (we know there's only one)
var d = match.name("domain") # The `domain` group
print(d.value + " (" + d.start + ", " + d.finish + ")") # => stark.com (13, 22)
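
The claim that named groups are also included in the numbered groups can be illustrated with Python's standard re module. Again this is an analogue rather than Roo code, and one difference matters: Python spells a named group (?P<name>REGEX), whereas Roo uses the (?<name>REGEX) form shown above.

```python
import re

pattern = re.compile(r"(?P<forename>\w+)\.(?P<surname>\w+)@(?P<domain>\w+\.com)")
m = pattern.search("pepper.potts@stark.com")

# Look a group up by name and read its value and position.
domain = m.group("domain")
start, finish = m.span("domain")
summary = f"{domain} ({start}, {finish})"
print(summary)              # stark.com (13, 22)

# The named group is still reachable by its number (it is the third group).
same_by_number = m.group(3) == domain
print(same_by_number)       # True
```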

Comprehensive documentation of the Regex object’s methods and getters can be found in the Regex section of the standard library documentation. It’s also worth familiarising yourself with the MatchInfo, RegexMatch and RegexResult documentation if you’ll be working with regular expressions.