String Regex Annotator Tutorial

The StringRegexAnnotator makes it extremely easy to match several complex regular expressions against a document and annotate the matches and/or the part of a match corresponding to a capturing regular expression group.

It also has a simple macro substitution feature that makes it easy to build more complex regular expressions from simpler ones.

import os
from gatenlp import Document
from gatenlp.processing.gazetteer import StringRegexAnnotator, StringGazetteer

Creating the Annotator

Similar to the gazetteer annotators, the annotator can be created in several ways: from a file that contains the regular expression rules, from a multi-line string that contains the rules (basically the content of a file as a string), or from prepared rule objects. Which of these to use is specified with the source_fmt parameter of either the constructor or the append method.

Create from a string with rules

The following example shows a string that contains a single simple rule which finds a date in ISO format (YYYY-MM-DD) and annotates it with annotation type “Date”:

rules1 = """
|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date
"""

annt1 = StringRegexAnnotator(source=rules1, source_fmt="string")

doc1 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")

annt1(doc1)
doc1

The rules file/string format

A rules file must contain one or more rules.

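Each rule consists of a pattern line, which starts with “|” followed by the regular expression, and one or more action lines, which start with a group number followed by “=>” and the annotation type.

The action line specifies how an annotation should get created for one or more groups of a matching regular expression.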

The simple rules string above contains one rule, with one pattern line and one action line:

|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date

The pattern line |[0-9]{4}-[0-9]{2}-[0-9]{2} specifies the simple regular expression.

The action line 0 => Date specifies that an annotation with the annotation type “Date” should get created for the match, spanning “group 0”. The convention with regular expressions is that “group 0” always refers to whatever is matched by the whole regular expression.

Using groups

In addition to group 0, anything within simple parentheses in the regular expression is a “capturing group”. Capturing groups get numbered by their opening parenthesis when counting from left to right. For example, the following regular expression has 3 additional groups for the year, month and day part of the whole ISO date. The rule then refers to the whole matched date via group 0 but also creates annotations of type Year, Month and Day for each of the groups:

|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date
1 => Year
2 => Month
3 => Day

rules2 = """
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date
1 => Year
2 => Month
3 => Day
"""

annt2 = StringRegexAnnotator(source=rules2, source_fmt="string")

doc2 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")

annt2(doc2)
doc2
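
The group numbering can be checked with Python's standard re module alone. The following is a minimal sketch that does not use gatenlp; it only shows which part of the text each group number refers to, using the same pattern as the rule above:

```python
import re

# The same pattern as in the rule above, without the leading "|"
pattern = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

text = "A date here: 2013-01-12"
m = pattern.search(text)

# Group 0 is the whole match, groups 1 to 3 are the parenthesised parts,
# numbered by their opening parenthesis from left to right
for group_nr, ann_type in [(0, "Date"), (1, "Year"), (2, "Month"), (3, "Day")]:
    start, end = m.span(group_nr)
    print(group_nr, ann_type, start, end, m.group(group_nr))
```

Each printed line corresponds to one action line of the rule: the group number, the annotation type, and the offsets the annotation would span.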

Adding features to annotations

For each annotation that gets created for a match it is possible to also specify features to set in each action. Feature values can be specified as constants or as the value of one of the matched groups. To illustrate this, the following example assigns the year, month and day string to all annotations (Date, Day, Month, Year). In addition it assigns the constant value “iso” to the “type” feature of the “Date” annotation. To assign the value of some group number n, the variable “Gn” can be used, e.g. “G2” for group 2:

rules3 = """
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date  type="iso", year=G1, month=G2, day=G3
1 => Year  year=G1, month=G2, day=G3
2 => Month year=G1, month=G2, day=G3
3 => Day year=G1, month=G2, day=G3
"""

annt3 = StringRegexAnnotator(source=rules3, source_fmt="string")

doc3 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")

annt3(doc3)
doc3
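
How a feature assignment such as year=G1 could be resolved can be sketched with plain Python. The helper build_features and the way the assignments are represented here are made up for illustration and are not part of gatenlp:

```python
import re

def build_features(match, spec):
    """Build a feature dict from a regular expression match.

    spec maps each feature name to ("const", value) or ("group", n).
    This representation is hypothetical and only used in this sketch.
    """
    features = {}
    for name, (kind, value) in spec.items():
        if kind == "group":
            # "Gn" in the rule: use the text matched by group n
            features[name] = match.group(value)
        else:
            # a quoted constant in the rule: use the value as is
            features[name] = value
    return features

pattern = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

# Corresponds to: type="iso", year=G1, month=G2, day=G3
spec = {
    "type": ("const", "iso"),
    "year": ("group", 1),
    "month": ("group", 2),
    "day": ("group", 3),
}

m = pattern.search("also here: 1999-12-31")
features = build_features(m, spec)
print(features)
```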

Having more than one rule

A rule file/string can contain any number of rules. The following example includes 2 rules, matching either an ISO date, or a traditional date (DD/MM/YYYY). The example also contains comment lines, which start either with a “#” or a double slash “//”:

rules4 = """
// The ISO date:
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date  type="iso", year=G1, month=G2, day=G3

# The traditional way of writing a date:
|([0-9]{2})/([0-9]{2})/([0-9]{4})
0 => Date  type="traditional", year=G3, month=G2, day=G1
"""

annt4 = StringRegexAnnotator(source=rules4, source_fmt="string")

doc4 = Document("A document that contains a date here: 2013-01-12 and also here: 14/02/1991")

annt4(doc4)
doc4
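
To make the layout of such a rules string more concrete, here is a rough sketch of how the lines could be sorted into rules. It is not the parser used by gatenlp: parse_rules is a hypothetical helper, it only handles single pattern lines, and it ignores the feature assignments.

```python
import re

# An action line starts with a group number, "=>" and the annotation type
ACTION = re.compile(r"^(\d+)\s*=>\s*(\w+)")

def parse_rules(text):
    """Split a rules string into dicts with a pattern and its actions."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        # skip empty lines and comment lines
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        if line.startswith("|"):
            # a pattern line starts a new rule
            rules.append({"pattern": line[1:], "actions": []})
        else:
            m = ACTION.match(line)
            if m and rules:
                rules[-1]["actions"].append((int(m.group(1)), m.group(2)))
    return rules

rules_text = """
// The ISO date:
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date  type="iso", year=G1, month=G2, day=G3

# The traditional way of writing a date:
|([0-9]{2})/([0-9]{2})/([0-9]{4})
0 => Date  type="traditional", year=G3, month=G2, day=G1
"""

parsed = parse_rules(rules_text)
for rule in parsed:
    print(rule)
```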

Longest match only

Dates are sometimes written with only 2 digits for the year. The following example has two rules for a traditional date format, one for each variant. Because of the parameter longest_only=False, the second date now matches both the first and the second rule.

rules5 = """
# The traditional way of writing a date, 2 digit year
|([0-9]{2})/([0-9]{2})/([0-9]{2})
0 => Date  type="traditional-short", year=G3, month=G2, day=G1

# The traditional way of writing a date, 4 digit year
|([0-9]{2})/([0-9]{2})/([0-9]{4})
0 => Date  type="traditional-long", year=G3, month=G2, day=G1

"""

annt5 = StringRegexAnnotator(source=rules5, source_fmt="string", longest_only=False)

doc5 = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")

annt5(doc5)
doc5

With longest_only=True, only the longest match at each matching position is annotated (or all longest matches if several matches share the same longest length). Now only the rule that produces the longer match is used:

annt5a = StringRegexAnnotator(source=rules5, source_fmt="string", longest_only=True)

doc5a = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")

annt5a(doc5a)
doc5a
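
The effect of keeping only the longest match at a position can be reproduced with the standard re module. This is a simplified sketch of the idea, not the gatenlp implementation; matches_at is a hypothetical helper:

```python
import re

patterns = [
    ("traditional-short", re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{2})")),
    ("traditional-long", re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")),
]

def matches_at(text, pos, longest_only=False):
    """Return (name, matched text) for every pattern that matches at pos."""
    found = []
    for name, rx in patterns:
        m = rx.match(text, pos)
        if m:
            found.append((name, m.group(0)))
    if longest_only and found:
        # keep only the match or matches with the maximum length
        longest = max(len(matched) for _, matched in found)
        found = [(n, matched) for n, matched in found if len(matched) == longest]
    return found

text = "14/02/1991"
print(matches_at(text, 0))
print(matches_at(text, 0, longest_only=True))
```

For a date with a 2 digit year such as 12/04/98 only the short pattern matches, so the setting makes no difference there.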

Choosing matching rules

It is possible that several rules match at the same position. The select_rules parameter can be used to configure which of all matching rules should actually be used. The default is “all”, so all matching rules are considered, but if longest_only=True then only the rules that produce the longest match are considered.

If select_rules="first" then whichever rule is the first (in order of appearance in the rule file/string) to match is the one used; all other rules which may also match at that position are ignored. Similarly, if select_rules="last", only the last of all matching rules is used.

In the following example, longest_only=False and select_rules="first" so the first rule that matches is the only one used:

annt5b = StringRegexAnnotator(source=rules5, source_fmt="string", longest_only=False, select_rules="first")

doc5b = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")

annt5b(doc5b)
doc5b
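
The three selection modes can be pictured as a simple filter over the rules that match at one position, kept in rule order. The function select below is a hypothetical illustration in plain Python; how gatenlp combines this step internally with longest_only is not shown here:

```python
import re

rules = [
    ("traditional-short", re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{2})")),
    ("traditional-long", re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")),
]

def select(found, select_rules="all"):
    """Pick from the matches found at one position, given in rule order."""
    if select_rules == "first":
        return found[:1]
    if select_rules == "last":
        return found[-1:]
    return list(found)

text = "14/02/1991"
# all rules that match at the start of the text, in order of appearance
found = [(name, rx.match(text).group(0)) for name, rx in rules if rx.match(text)]

for mode in ("all", "first", "last"):
    print(mode, select(found, mode))
```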

Overlapping matches

Sometimes matches from different rules, or from the same rule, can overlap. Here is a simple example: the following rule matches any sequence of one or more basic ASCII lower case characters. At each position where such a sequence starts, a match is found and an annotation is created.

rules6a = """
|[a-z]+
0 => Match
"""

annt6a = StringRegexAnnotator(source=rules6a, source_fmt="string")

doc6a = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")

annt6a(doc6a)
print("Matching:", [doc6a[a] for a in doc6a.annset()])
doc6a
Matching: ['document', 'ocument', 'cument', 'ument', 'ment', 'ent', 'nt', 't', 'that', 'hat', 'at', 't', 'contains', 'ontains', 'ntains', 'tains', 'ains', 'ins', 'ns', 's', 'a', 'date', 'ate', 'te', 'e', 'here', 'ere', 're', 'e', 'and', 'nd', 'd', 'also', 'lso', 'so', 'o', 'here', 'ere', 're', 'e']

In such cases, it is often desirable to only try and find a match after any match that has already been found, so in this case, once “document” has been matched, only try and find the next match after the end of that match. This can be achieved by setting the parameter skip_longest=True:

rules6b = """
|[a-z]+
0 => Match
"""

annt6b = StringRegexAnnotator(source=rules6b, source_fmt="string", skip_longest=True)

doc6b = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")

annt6b(doc6b)
print("Matching:", [doc6b[a] for a in doc6b.annset()])
doc6b
Matching: ['document', 'that', 'contains', 'a', 'date', 'here', 'and', 'also', 'here']
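
The difference between the two runs comes down to where the search continues after a match. The following sketch with the standard re module mimics both behaviours; scan and its skip argument are made up for this illustration and are not gatenlp code:

```python
import re

def scan(text, pattern, skip=False):
    """Try to match pattern at every position of text.

    With skip=True the scan continues after the end of a match
    instead of at the next character.
    """
    rx = re.compile(pattern)
    found = []
    pos = 0
    while pos < len(text):
        m = rx.match(text, pos)
        if m and m.end() > pos:
            found.append(m.group(0))
            pos = m.end() if skip else pos + 1
        else:
            pos += 1
    return found

text = "A date here"
print(scan(text, r"[a-z]+"))
print(scan(text, r"[a-z]+", skip=True))
```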

Macros: creating complex regular expressions from simpler ones

Complex regular expressions can quickly get hard to read, especially when there are many nested alternatives, and often the same complex sub-expression is part of a bigger expression several times.

The StringRegexAnnotator therefore provides a macro mechanism which allows complex regular expressions to be composed from simpler ones in steps: one can assign the simpler regular expressions to macro variables and then use those variables in the final complex regular expression.

Here is an example where either ISO or “traditional” dates should get matched and where the year, month and day parts of the regular expression are more specific than in the examples above. Instead of copy-pasting those sub-expressions for the year, month and day into each rule, macro assignments are used:

rules7 = """
year=(19[0-9]{2}|20[0-9]{2})
month=(0[0-9]|10|11|12)
day=([012][0-9]|3[01])

// The ISO date:
|{{year}}-{{month}}-{{day}}
0 => Date  type="iso", year=G1, month=G2, day=G3

# The traditional way of writing a date:
|{{day}}/({{month}})/{{year}}
0 => Date  type="traditional", year=G3, month=G2, day=G1
"""

annt7 = StringRegexAnnotator(source=rules7, source_fmt="string")

doc7 = Document("""
A document that contains a date here: 2013-01-12 and also here: 14/02/1991. This should not 
get matched: 1833-12-21 and nor should this 45/03/2012 but this should 13/12/2012 and also
this, despite not being a valid data: 31/02/2000
""")

annt7(doc7)
doc7
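
Macro substitution is plain text replacement before the regular expression is compiled. The sketch below assumes that a macro is referenced as {{name}} in a pattern line; expand is a hypothetical helper written with the standard re module, not the gatenlp implementation:

```python
import re

macros = {
    "year": "(19[0-9]{2}|20[0-9]{2})",
    "month": "(0[0-9]|10|11|12)",
    "day": "([012][0-9]|3[01])",
}

def expand(pattern, macros):
    """Replace every {{name}} in pattern with the text of the macro name."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: macros[m.group(1)], pattern)

iso = re.compile(expand("{{year}}-{{month}}-{{day}}", macros))
print(iso.pattern)
print(iso.fullmatch("2013-01-12").groups())
print(iso.fullmatch("1833-12-21"))

# Parentheses inside a macro count as groups of the expanded pattern,
# and so do any extra parentheses written around a macro reference
wrapped = re.compile(expand("{{day}}/({{month}})/{{year}}", macros))
print(wrapped.groups)
```

Because group numbers are assigned in the expanded pattern, wrapping a macro reference in additional parentheses adds a group and shifts the numbers of all groups after it.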

Combine with a String Gazetteer

In addition to the type of rules described above, there is a special rule which can be used to combine the regular expressions with StringGazetteer matching. The initialized StringGazetteer instances can be specified when creating the StringRegexAnnotator.

The rule consists of a single line of the form GAZETTEER => or GAZETTEER => feat1 = val1, feat2=val2 to assign some constant features (in addition to the features from the gazetteer entry and gazetteer list).

This example illustrates this by adding a small string gazetteer to the previous example which matches the strings “date”, “a date”, “and”, “also”:

gazlist1 = [
    ("date", ),
    ("a date",),
    ("and",),
    ("also",),
]

gaz1 = StringGazetteer(source=gazlist1, source_fmt="gazlist")

rules8 = """
year=(19[0-9]{2}|20[0-9]{2})
month=(0[0-9]|10|11|12)
day=([012][0-9]|3[01])

// The ISO date:
|{{year}}-{{month}}-{{day}}
0 => Date  type="iso", year=G1, month=G2, day=G3

# The traditional way of writing a date:
|{{day}}/({{month}})/{{year}}
0 => Date  type="traditional", year=G3, month=G2, day=G1

# The rule to match the GAZETTEER
GAZETTEER => somefeature="some value"
"""

annt8 = StringRegexAnnotator(source=rules8, source_fmt="string", string_gazetteer=gaz1)

doc8 = Document("""
A document that contains a date here: 2013-01-12 and also here: 14/02/1991. This should not 
get matched: 1833-12-21 and nor should this 45/03/2012 but this should 13/12/2012 and also
this, despite not being a valid data: 31/02/2000
""")

annt8(doc8)
doc8

Using the StringRegexAnnotator API directly

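The main methods of StringRegexAnnotator used in this tutorial are: calling the annotator on a document (as in annt8(doc8)) to add annotations for all matches, append to add further rules, and find_all to get the matches for a string.

The find_all method can be useful when some string outside of a document should get processed, or when the matches need to get processed by code before they should get added as annotations to the document.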

The following shows the result of calling find_all on the document text with the annotator configured above:

for m in annt8.find_all(doc8.text):
    print(m)
GazetteerMatch(start=26, end=32, match='a date', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=28, end=32, match='date', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=39, end=49, match='2013-01-12', features={'type': 'iso', 'year': '2013', 'month': '01', 'day': '12'}, type='Date')
GazetteerMatch(start=50, end=53, match='and', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=54, end=58, match='also', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=65, end=75, match='14/02/1991', features={'type': 'traditional', 'year': '02', 'month': '02', 'day': '14'}, type='Date')
GazetteerMatch(start=118, end=121, match='and', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=165, end=175, match='13/12/2012', features={'type': 'traditional', 'year': '12', 'month': '12', 'day': '13'}, type='Date')
GazetteerMatch(start=176, end=179, match='and', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=180, end=184, match='also', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=223, end=233, match='31/02/2000', features={'type': 'traditional', 'year': '02', 'month': '02', 'day': '31'}, type='Date')
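
One reason to process matches in code first is to drop dates that match the pattern but do not exist in the calendar, such as 31/02/2000. The sketch below does not call gatenlp: Match is a stand-in that only copies some of the field names printed above, and is_real_date is a hypothetical helper based on the standard datetime module.

```python
from datetime import datetime
from typing import NamedTuple

class Match(NamedTuple):
    """Stand-in with some of the fields printed above (not the gatenlp class)."""
    start: int
    end: int
    match: str
    type: str

def is_real_date(text):
    """Check whether text is an existing calendar date in one of the two layouts."""
    fmt = "%Y-%m-%d" if "-" in text else "%d/%m/%Y"
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

matches = [
    Match(39, 49, "2013-01-12", "Date"),
    Match(50, 53, "and", "Lookup"),
    Match(223, 233, "31/02/2000", "Date"),
]

# keep everything except Date matches that are not real dates
kept = [m for m in matches if m.type != "Date" or is_real_date(m.match)]
for m in kept:
    print(m)
```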

Example use: GATE default tokenizer

The StringRegexAnnotator is used to implement the default_tokenizer, a tokenizer annotator which should work in the same way as the Java GATE DefaultTokenizer PR. The rules from the Java tokenizer have been directly converted into StringRegexAnnotator rules:

from gatenlp.lang.en.gatetokenizers import default_tokenizer, default_tokenizer_rules

print(default_tokenizer_rules)
#words#
// a word can be any combination of letters, including hyphens,
// but excluding symbols and punctuation, e.g. apostrophes
// Note that there is an alternative version of the tokeniser that
// treats hyphens as separate tokens


|(?:\p{Lu}(?:\p{Mn})*)(?:(?:\p{Ll}(?:\p{Mn})*)(?:(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*)*
0 =>  Token orth="upperInitial", kind="word", 

|(?:\p{Lu}(?:\p{Mn})*)(?:\p{Pd}|\p{Cf})*(?:(?:\p{Lu}(?:\p{Mn})*)|\p{Pd}|\p{Cf})+
0 =>  Token orth="allCaps", kind="word", 

|(?:\p{Ll}(?:\p{Mn})*)(?:(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*
0 =>  Token orth="lowercase", kind="word", 

// MixedCaps is any mixture of caps and small letters that doesn't
// fit in the preceding categories

|(?:(?:\p{Ll}(?:\p{Mn})*)(?:\p{Ll}(?:\p{Mn})*)+(?:\p{Lu}(?:\p{Mn})*)+(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*))*)|(?:(?:\p{Ll}(?:\p{Mn})*)(?:\p{Ll}(?:\p{Mn})*)*(?:\p{Lu}(?:\p{Mn})*)+(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*)|(?:(?:\p{Lu}(?:\p{Mn})*)(?:\p{Pd})*(?:\p{Lu}(?:\p{Mn})*)(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*(?:(?:\p{Ll}(?:\p{Mn})*))+(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*)|(?:(?:\p{Lu}(?:\p{Mn})*)(?:\p{Ll}(?:\p{Mn})*)+(?:(?:\p{Lu}(?:\p{Mn})*)+(?:\p{Ll}(?:\p{Mn})*)+)+)|(?:(?:(?:\p{Lu}(?:\p{Mn})*))+(?:(?:\p{Ll}(?:\p{Mn})*))+(?:(?:\p{Lu}(?:\p{Mn})*))+)
0 =>  Token orth="mixedCaps", kind="word", 

|(?:\p{Lo}|\p{Mc}|\p{Mn})+
0 => Token kind="word", type="other", 

#numbers#
// a number is any combination of digits
|\p{Nd}+
0 => Token kind="number", 

|\p{No}+
0 => Token kind="number", 

#whitespace#
|(?:\p{Zs}) 
0 => SpaceToken kind="space", 

|(?:\p{Cc}) 
0 => SpaceToken kind="control", 

#symbols#
|(?:\p{Sk}|\p{Sm}|\p{So}) 
0 =>  Token kind="symbol", 

|\p{Sc} 
0 =>  Token kind="symbol", symbolkind="currency", 

#punctuation#
|(?:\p{Pd}|\p{Cf}) 
0 => Token kind="punctuation", subkind="dashpunct", 

|(?:\p{Pc}|\p{Po})
0 => Token kind="punctuation", 

|(?:\p{Ps}|\p{Pi}) 
0 => Token kind="punctuation", position="startpunct", 

|(?:\p{Pe}|\p{Pf}) 
0 => Token kind="punctuation", position="endpunct", 

doc = Document("""
This is a short document. Has miXedCaps and ALLUPPER and 1234 and hyphen-word. 
Also something after a new line. And another sentence. A float 3.4123 and a code XZ-2323-a.
""")

default_tokenizer(doc)
doc
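
The \p{...} classes in the tokenizer rules refer to Unicode general categories such as Lu (uppercase letter) or Nd (decimal digit). Python's built-in re module does not support this notation, but the categories themselves can be inspected with the standard unicodedata module. The mapping below is a much reduced, hypothetical illustration of how categories relate to the token kinds in the rules; it is not how the tokenizer is implemented:

```python
import unicodedata

# Hypothetical, much reduced mapping from general category to token kind
KIND_BY_CATEGORY = {
    "Nd": "number",
    "Zs": "space",
    "Cc": "control",
    "Sc": "symbol",
    "Sm": "symbol",
    "Pd": "punctuation",
    "Po": "punctuation",
}

def char_kind(ch):
    """Return a rough token kind for a single character."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        # all letter categories (Lu, Ll, Lo, ...) count as word characters
        return "word"
    return KIND_BY_CATEGORY.get(cat, "other")

for ch in ["a", "Z", "7", " ", "\n", "$", "-", "."]:
    print(repr(ch), unicodedata.category(ch), char_kind(ch))
```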

Notebook last updated

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1