The Regex data type
Regular expressions are represented by the Regex class. This section will not explain how to write regular expression (regex) patterns, as that is outside this document’s scope. I will say that for dynamically testing an expression in a browser, https://regex101.com is excellent. In short, a regex is a sequence of characters that defines a search pattern. This pattern can then be applied to text to find characters, words and phrases of interest.
A Regex is typically created with a regex literal using PCRE syntax. The exception to this is that the pipe character (|) must be escaped within regex literals with a backslash: \|. The literal consists of UTF-8 encoded text enclosed by pipes (|):
|\d+| # Matches one or more digits.
|foo\|bar| # Matches "foo" or "bar". Note the escaped pipe.
The closing pipe delimiter may be followed by a number of optional modifiers to adjust the matching behaviour of the regular expression.
i: Case-insensitive matching. Letters in the pattern match both upper and lower case letters in the search text. Default is
s: Normally the period matches everything except a new line. This option allows it to match new lines as well. Default is
e: Indicates whether patterns are allowed to match an empty string. The default is
u: Ungreedy. A greedy search finds everything from the beginning of the first delimiter to the end of the last delimiter and everything in between. An ungreedy search stops at the first closing delimiter it finds. Default is
m: Multiline matching. ^ and $ match at new lines within the data, not only at the start and end of the whole text. Default is
|foo|.matches?("FOO") # => False.
|foo|i.matches?("FOO") # => True.
|bar|ieu # => Multiple options.
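For readers who know another regex engine, the modifiers above have rough counterparts in Python's standard `re` module. The sketch below is a Python analogue, not Roo code. Mapping `i`, `s` and `m` to `re.IGNORECASE`, `re.DOTALL` and `re.MULTILINE` is an assumption based on the descriptions above. Python has no ungreedy flag, so a lazy quantifier stands in for `u`, and no counterpart to `e` is shown.

```python
import re

# i: case-insensitive matching.
case_sensitive = re.search(r"foo", "FOO")                   # no match
case_insensitive = re.search(r"foo", "FOO", re.IGNORECASE)  # match

# s: allow the period to match new lines as well.
dot_default = re.search(r"a.b", "a\nb")                     # no match
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)              # match

# m: ^ and $ match at line boundaries inside the text.
single_line = re.findall(r"^\w+$", "one\ntwo")
multi_line = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)

# u: Python has no ungreedy flag; a lazy quantifier (+?) is used instead.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

print(single_line, multi_line)  # [] ['one', 'two']
print(greedy, lazy)             # <a><b> <a>
```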
To see if some text matches a pattern we can use either the match() method on a Text object (passing in our Regex object) or the match() method on a Regex object (passing in the text to search). Both approaches return a RegexResult object if at least one match is found, or Nothing if no match is found:
var r = |\w+\.\w+@\w+\.com|
"pepper.potts@stark.com".match(r) # => <RegexResult instance>
r.match("pepper.potts@stark.com") # => <RegexResult instance>
"hello there".match(|\w+\.\w+@\w+\.com|) # => Nothing.
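The "result object or nothing" contract described above is common to many regex APIs. As a hedged comparison (Python, not Roo), `re.Pattern.search()` returns a `re.Match` object on success and `None` on failure. The sample address is only an illustrative input.

```python
import re

# Compile once, then search several pieces of text with the same pattern.
pattern = re.compile(r"\w+\.\w+@\w+\.com")

hit = pattern.search("pepper.potts@stark.com")  # a re.Match object
miss = pattern.search("hello there")            # None, as nothing matches

# Test for None before touching the result.
for text, result in (("pepper.potts@stark.com", hit), ("hello there", miss)):
    if result is not None:
        print(text, "->", result.group())
    else:
        print(text, "-> no match")
```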
If you just want to see if a pattern matches some text, you can use the matches?() method on either a Regex object or a Text object:
"hello".matches?(|.lo|) # => True
|.lo|.matches?("boo") # => False
A RegexResult object is returned when the query text matches the regex search pattern. This object contains information about the search result. Since it’s possible for there to be multiple matches to a pattern within a single piece of text, this object contains one or more RegexMatch objects. Each RegexMatch object represents a single match to the pattern. There are a number of ways to get these matches:
var r = |love\|hate| # Matches either `love` or `hate`
var result = "Sally loves Harry. Batman hates the Joker".match(r)
result.length # => 2 (as there are two matches, 'love' and 'hate').
var match1 = result.first_match # Could use result.match(0)
var match2 = result.match(1) # Second match has a value of `1` because `match()` is zero-based
Following on from the above example, a RegexMatch object contains everything you need to know about an individual match. The RegexMatch object contains the (zero-based) start and finish position of the match within the original query text, the actual text value of the match and information about any capture groups:
# Following on from the love/hate matches above...
match1.value # => "love"
match1.start # => 6
match1.finish # => 10
match2.value # => "hate"
match2.start # => 26
match2.finish # => 30
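The positions quoted above can be cross-checked in another engine. This is a Python sketch rather than Roo code, and it assumes `re.finditer()` is a fair stand-in for walking the matches in a RegexResult. Python's `Match.end()` is exclusive, which agrees with the finish values shown above.

```python
import re

text = "Sally loves Harry. Batman hates the Joker"

# Collect every non-overlapping match of `love` or `hate`.
matches = list(re.finditer(r"love|hate", text))

# Summarise each match as (value, start, finish).
summary = [(m.group(), m.start(), m.end()) for m in matches]

print(len(matches))  # 2
print(summary)       # [('love', 6, 10), ('hate', 26, 30)]
```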
One of the great strengths of regular expressions is the ability to capture portions of matched text. This is done with capture groups. Any regex contained within parentheses is a capture group. An easy way to get all captures of a match as an array is with RegexMatch.captures:
var r = |(\w+)\.(\w+)@(\w+\.com)|
var captures = "pepper.potts@stark.com".match(r).first_match.captures
print(captures) # => ["pepper", "potts", "stark.com"]
You can get the contents of a particular capture group using its group number. The first group is numbered 1. If a regex pattern contains capture groups then you can get information about the text captured in that group with the RegexMatch.group() method. This returns an object holding the captured text together with its start and finish positions:
var result = "Dr McCoy".match(|(\w+) (\w+)|)
var group1 = result.first_match.group(1)
group1.value # => "Dr"
group1.start # => 0
group1.finish # => 2
var group2 = result.first_match.group(2)
group2.value # => "McCoy"
group2.start # => 3
group2.finish # => 8
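Numbered groups behave the same way in most PCRE-style engines. The following is a Python analogue, not the Roo API: it assumes `Match.group(n)` and `Match.span(n)` together play the role of the value, start and finish shown above. Group numbering also starts at 1 in Python.

```python
import re

m = re.search(r"(\w+) (\w+)", "Dr McCoy")

# All captures at once, in group order.
all_groups = m.groups()

# Per-group text and (start, finish) positions; finish is exclusive.
group1 = (m.group(1), m.span(1))
group2 = (m.group(2), m.span(2))

print(all_groups)  # ('Dr', 'McCoy')
print(group1)      # ('Dr', (0, 2))
print(group2)      # ('McCoy', (3, 8))
```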
A pattern can also contain named capture groups. These function just like regular capture groups (and indeed are included in the numbered groups) but they are assigned a name to make it easier to retrieve them. Named capture groups are created with the regex syntax (?<name>REGEX). Their data is retrieved by name with the RegexMatch.name() method:
var r = |(?<forename>\w+)\.(?<surname>\w+)@(?<domain>\w+\.com)|
var t = "pepper.potts@stark.com"
var match = t.match(r).first_match # Just get the first match (we know there's only one)
var d = match.name("domain") # The `domain` group
print(d.value + " (" + d.start + ", " + d.finish + ")") # => stark.com (13, 22)
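The claim that named groups are also counted among the numbered groups can be checked in another engine. This sketch is Python, not Roo. Note that Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, and the input address is only an illustrative value.

```python
import re

pattern = r"(?P<forename>\w+)\.(?P<surname>\w+)@(?P<domain>\w+\.com)"
m = re.search(pattern, "pepper.potts@stark.com")

# Retrieve a group by name...
domain = m.group("domain")
domain_span = m.span("domain")

# ...or by number: `domain` is the third group in the pattern.
same_by_number = m.group(3)

# Every named group as a dictionary.
named = m.groupdict()

print(domain, domain_span)  # stark.com (13, 22)
print(named)
```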
Comprehensive documentation of the Regex object’s methods and getters can be found in the Regex section of the standard library documentation. It’s also worth familiarising yourself with the MatchInfo, RegexMatch and RegexResult documentation if you’ll be working with regular expressions.