String Regex Annotator Tutorial
The StringRegexAnnotator makes it easy to match several complex regular expressions against a document and to annotate the matches, the parts of a match that correspond to capturing groups, or both.
It also has a simple macro substitution feature for building more complex regular expressions from simpler ones.
import os
from gatenlp import Document
from gatenlp.processing.gazetteer import StringRegexAnnotator, StringGazetteer
Creating the Annotator
Similar to the gazetteer annotators, there are several ways to create the annotator: from a file that contains the regular expression rules, from a multi-line string that contains the rules (basically the content of a file as a string), or from prepared rule objects. Which of these to use is specified with the source_fmt parameter of either the constructor or the append method.
Create from a string with rules
The following example shows a string that contains a single simple rule which finds a date in ISO format (YYYY-MM-DD) and annotates it with the annotation type “Date”:
rules1 = """
|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date
"""
annt1 = StringRegexAnnotator(source=rules1, source_fmt="string")
doc1 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")
annt1(doc1)
doc1
The rules file/string format
A rules file must contain one or more rules.
Each rule consists of:
- one or more pattern lines, each of which must start with “|”, followed by
- one or more action lines, each of which must start with a comma-separated list of group numbers, followed by “=>”, followed by the annotation type to assign, optionally followed by feature assignments.
The action line specifies how an annotation should get created for one or more groups of a matching regular expression.
The simple rules string above contains one rule, with one pattern line and one action line:
|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date
The pattern line |[0-9]{4}-[0-9]{2}-[0-9]{2}
specifies the simple regular expression.
The action line 0 => Date
specifies that an annotation with the annotation type “Date” should get created
for the match, spanning “group 0”. By convention, “group 0” always refers to
whatever is matched by the whole regular expression.
Using groups
In addition to group 0, anything within plain parentheses in the regular expression is a “capturing group”. Capturing groups are numbered by the position of their opening parenthesis, counting from left to right. For example, the following regular expression has 3 additional groups for the year, month and day parts of the ISO date. The rule refers to the whole matched date via group 0 and also creates annotations of type Year, Month and Day for the respective groups:
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date
1 => Year
2 => Month
3 => Day
rules2 = """
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date
1 => Year
2 => Month
3 => Day
"""
annt2 = StringRegexAnnotator(source=rules2, source_fmt="string")
doc2 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")
annt2(doc2)
doc2
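The group numbering can be checked independently of the annotator. The following is a minimal sketch that uses only Python's built-in re module, not the StringRegexAnnotator itself; the GROUP_TYPES mapping and the spans list are illustrative names that mirror the action lines of the rule above.

```python
import re

# Same pattern as in the rule above: three capturing groups inside the date.
ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

text = "A document that contains a date here: 2013-01-12 and also here: 1999-12-31"

# Annotation type per group number, mirroring the action lines of the rule
# (illustrative mapping, not part of the gatenlp API).
GROUP_TYPES = {0: "Date", 1: "Year", 2: "Month", 3: "Day"}

spans = []
for m in ISO_DATE.finditer(text):
    for group_nr, ann_type in GROUP_TYPES.items():
        # span(n) gives the start and end offsets of group n in the text
        start, end = m.span(group_nr)
        spans.append((start, end, ann_type, m.group(group_nr)))

for span in spans:
    print(span)
```

Each tuple holds the offsets, the annotation type and the matched text for one group, which is the information an action line needs to create an annotation.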
Adding features to annotations
Each action can also specify features to set on the annotation it creates. Feature values can be given as constants or as the value of one of the matched groups. To illustrate this, the following example assigns the year, month and day strings to all annotations (Date, Day, Month, Year). In addition, it assigns the constant value “iso” to the “type” feature of the “Date” annotation. To assign the value of group number n, the variable “Gn” can be used, e.g. “G2” for group 2:
rules3 = """
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date type="iso", year=G1, month=G2, day=G3
1 => Year year=G1, month=G2, day=G3
2 => Month year=G1, month=G2, day=G3
3 => Day year=G1, month=G2, day=G3
"""
annt3 = StringRegexAnnotator(source=rules3, source_fmt="string")
doc3 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")
annt3(doc3)
doc3
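Conceptually, a “Gn” reference is replaced by the text of group n, while a constant is copied unchanged. The sketch below illustrates this with the built-in re module; the ("const", value) / ("group", n) representation and the resolve_features helper are assumptions made for this illustration and say nothing about how gatenlp stores features internally.

```python
import re

ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

# Hypothetical representation of: type="iso", year=G1, month=G2, day=G3
DATE_FEATURES = {
    "type": ("const", "iso"),
    "year": ("group", 1),
    "month": ("group", 2),
    "day": ("group", 3),
}


def resolve_features(match, spec):
    """Build a feature dict: group references become the matched group text."""
    features = {}
    for name, (kind, value) in spec.items():
        features[name] = match.group(value) if kind == "group" else value
    return features


m = ISO_DATE.search("A date here: 2013-01-12")
features = resolve_features(m, DATE_FEATURES)
print(features)
```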
Having more than one rule
A rule file/string can contain any number of rules. The following example includes 2 rules, matching either an ISO date, or a traditional date (DD/MM/YYYY). The example also contains comment lines, which start either with a “#” or a double slash “//”:
rules4 = """
// The ISO date:
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date type="iso", year=G1, month=G2, day=G3
# The traditional way of writing a date:
|([0-9]{2})/([0-9]{2})/([0-9]{4})
0 => Date type="traditional", year=G3, month=G2, day=G1
"""
annt4 = StringRegexAnnotator(source=rules4, source_fmt="string")
doc4 = Document("A document that contains a date here: 2013-01-12 and also here: 14/02/1991")
annt4(doc4)
doc4
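To make the structure of such a rules string concrete, here is a minimal sketch of a parser for the subset of the format described so far: comment lines, pattern lines and action lines. The parse_rules function is hypothetical and deliberately simplified; it is not the parser used by the StringRegexAnnotator and does not handle macro lines or the special gazetteer rule.

```python
def parse_rules(rules_string):
    """Split a rules string into a list of (patterns, actions) pairs.

    Each action is a (group_numbers, annotation_type, feature_text) tuple.
    """
    rules = []
    patterns, actions = [], []
    for line in rules_string.splitlines():
        line = line.strip()
        # Skip empty lines and comment lines
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        if line.startswith("|"):
            # A pattern line that follows action lines starts a new rule
            if actions:
                rules.append((patterns, actions))
                patterns, actions = [], []
            patterns.append(line[1:])
        elif "=>" in line:
            groups, _, rest = line.partition("=>")
            group_numbers = [int(g) for g in groups.split(",")]
            ann_type, _, feature_text = rest.strip().partition(" ")
            actions.append((group_numbers, ann_type, feature_text.strip()))
    if patterns:
        rules.append((patterns, actions))
    return rules


example = """
// The ISO date:
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date type="iso", year=G1, month=G2, day=G3
# The traditional way of writing a date:
|([0-9]{2})/([0-9]{2})/([0-9]{4})
0 => Date type="traditional", year=G3, month=G2, day=G1
"""
parsed = parse_rules(example)
for rule_patterns, rule_actions in parsed:
    print(rule_patterns, rule_actions)
```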
Longest match only
Dates are sometimes written with only 2 digits for the year. The following example has two rules for a traditional date format, one for each year length. Because of the parameter longest_only=False, the second date in the document matches both the first and the second rule.
rules5 = """
# The traditional way of writing a date, 2 digit year
|([0-9]{2})/([0-9]{2})/([0-9]{2})
0 => Date type="traditional-short", year=G3, month=G2, day=G1
# The traditional way of writing a date, 4 digit year
|([0-9]{2})/([0-9]{2})/([0-9]{4})
0 => Date type="traditional-long", year=G3, month=G2, day=G1
"""
annt5 = StringRegexAnnotator(source=rules5, source_fmt="string", longest_only=False)
doc5 = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")
annt5(doc5)
doc5
With longest_only=True, only the longest match at each matching position is annotated (or all longest matches, if several matches share the same longest length). Now only the rule that produces the longer match is used:
annt5a = StringRegexAnnotator(source=rules5, source_fmt="string", longest_only=True)
doc5a = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")
annt5a(doc5a)
doc5a
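The effect of longest_only can be sketched with the built-in re module. The matches_at function below is a hypothetical helper that tries every pattern at one position and optionally keeps only the longest results; it illustrates the idea and is not the annotator's implementation.

```python
import re

# The two traditional date patterns from the rules above, without groups
PATTERNS = [
    ("traditional-short", re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{2}")),
    ("traditional-long", re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")),
]


def matches_at(text, pos, longest_only):
    """Return (rule_name, matched_text) for every pattern matching at pos."""
    found = []
    for name, pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m:
            found.append((name, m.group(0)))
    if longest_only and found:
        # Keep only the matches that share the maximum length
        longest = max(len(matched) for _, matched in found)
        found = [(n, matched) for n, matched in found if len(matched) == longest]
    return found


text = "14/02/1991"
all_matches = matches_at(text, 0, longest_only=False)
longest_matches = matches_at(text, 0, longest_only=True)
print(all_matches)
print(longest_matches)
```

Without the filter, the short pattern also matches the first eight characters of the four-digit-year date, which is why that date is annotated by both rules when longest_only=False.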
Choosing matching rules
It is possible that several rules match at the same position. The select_rules parameter configures which of the matching rules are actually used. The default is “all”, so all matching rules are considered, but if longest_only=True then only the rules that produce the longest match are considered.
If select_rules="first" then the first rule to match (in order of appearance in the rule file/string) is the one used, and all other rules which also match at that position are ignored. Similarly, if select_rules="last" only the last of all matching rules is used.
In the following example, longest_only=False
and select_rules="first"
so the first rule that matches is the only one used:
annt5b = StringRegexAnnotator(source=rules5, source_fmt="string", longest_only=False, select_rules="first")
doc5b = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")
annt5b(doc5b)
doc5b
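The selection step itself is simple to picture. The following sketch assumes that the matches found at one position are available as (rule_index, matched_text) pairs; both that representation and the select_matches helper are hypothetical and only mirror the three documented values of select_rules.

```python
def select_matches(matches, select_rules="all"):
    """Pick matches found at one position according to select_rules.

    matches is a list of (rule_index, matched_text) pairs, where rule_index
    is the position of the rule in the rule file/string.
    """
    if not matches or select_rules == "all":
        return list(matches)
    # Order by appearance of the rule in the rule file/string
    ordered = sorted(matches, key=lambda item: item[0])
    if select_rules == "first":
        return [ordered[0]]
    if select_rules == "last":
        return [ordered[-1]]
    raise ValueError(f"unknown select_rules value: {select_rules!r}")


candidates = [(1, "14/02/1991"), (0, "14/02/19")]
first_only = select_matches(candidates, "first")
last_only = select_matches(candidates, "last")
print(first_only, last_only)
```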
Overlapping matches
Matches from different rules, or from the same rule, can overlap. Here is a simple example: the following rule matches any non-empty sequence of basic ASCII lower case characters. At every position where such a sequence can start, a match is found and an annotation is created.
rules6a = """
|[a-z]+
0 => Match
"""
annt6a = StringRegexAnnotator(source=rules6a, source_fmt="string")
doc6a = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")
annt6a(doc6a)
print("Matching:", [doc6a[a] for a in doc6a.annset()])
doc6a
Matching: ['document', 'ocument', 'cument', 'ument', 'ment', 'ent', 'nt', 't', 'that', 'hat', 'at', 't', 'contains', 'ontains', 'ntains', 'tains', 'ains', 'ins', 'ns', 's', 'a', 'date', 'ate', 'te', 'e', 'here', 'ere', 're', 'e', 'and', 'nd', 'd', 'also', 'lso', 'so', 'o', 'here', 'ere', 're', 'e']
In such cases, it is often desirable to look for the next match only after the end of a match that has already been found. In this example, once “document” has been matched, the search for the next match continues after the end of that match. This can be achieved by setting the parameter skip_longest=True:
rules6b = """
|[a-z]+
0 => Match
"""
annt6b = StringRegexAnnotator(source=rules6b, source_fmt="string", skip_longest=True)
doc6b = Document("A document that contains a date here: 12/04/98 and also here: 14/02/1991")
annt6b(doc6b)
print("Matching:", [doc6b[a] for a in doc6b.annset()])
doc6b
Matching: ['document', 'that', 'contains', 'a', 'date', 'here', 'and', 'also', 'here']
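The difference between the two outputs comes down to where the search resumes after a match. A minimal sketch with the built-in re module, where the find_matches helper is hypothetical and only imitates the behaviour described above:

```python
import re

WORD = re.compile(r"[a-z]+")


def find_matches(text, skip_longest):
    """Try to match at successive positions; optionally skip past each match."""
    results = []
    pos = 0
    while pos < len(text):
        m = WORD.match(text, pos)
        if m:
            results.append(m.group(0))
            # Either continue after the match, or just at the next character
            pos = m.end() if skip_longest else pos + 1
        else:
            pos += 1
    return results


text = "A document that"
overlapping = find_matches(text, skip_longest=False)
non_overlapping = find_matches(text, skip_longest=True)
print(overlapping)
print(non_overlapping)
```

Advancing by one character after every match produces a match for each suffix of a word, while jumping to the end of the match yields one match per word.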
Macros: creating complex regular expressions from simpler ones
Complex regular expressions quickly become hard to read, especially when there are many nested alternatives, and the same complex sub-expression often occurs several times within a bigger expression.
The StringRegexAnnotator therefore provides a macro mechanism which allows complex regular expressions to be composed from simpler ones in steps: simpler regular expressions are assigned to macro variables, and those variables are then used in the final complex regular expression.
Here is an example where either ISO or “traditional” dates should get matched and where the year, month and day parts of the regular expression are more specific than in the examples above. Instead of copy-pasting those sub-expressions for the year, month and day into each rule, macro assignments are used:
rules7 = """
year=(19[0-9]{2}|20[0-9]{2})
month=(0[0-9]|10|11|12)
day=([012][0-9]|3[01])
// The ISO date:
|{{year}}-{{month}}-{{day}}
0 => Date type="iso", year=G1, month=G2, day=G3
# The traditional way of writing a date:
|{{day}}/({{month}})/{{year}}
0 => Date type="traditional", year=G3, month=G2, day=G1
"""
annt7 = StringRegexAnnotator(source=rules7, source_fmt="string")
doc7 = Document("""
A document that contains a date here: 2013-01-12 and also here: 14/02/1991. This should not
get matched: 1833-12-21 and nor should this 45/03/2012 but this should 13/12/2012 and also
this, despite not being a valid data: 31/02/2000
""")
annt7(doc7)
doc7
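Macro substitution amounts to textual replacement before the regular expression is compiled. The sketch below assumes that a macro is referenced as {{name}} inside a pattern line; the expand_macros helper is hypothetical and only illustrates the idea with the built-in re module.

```python
import re


def expand_macros(pattern, macros):
    """Replace every {{name}} in pattern by the expression stored for name."""

    def substitute(m):
        # A function replacement is inserted literally, without escape handling
        return macros[m.group(1)]

    return re.sub(r"\{\{(\w+)\}\}", substitute, pattern)


# The macro definitions from the rules above
macros = {
    "year": "(19[0-9]{2}|20[0-9]{2})",
    "month": "(0[0-9]|10|11|12)",
    "day": "([012][0-9]|3[01])",
}

expanded = expand_macros("{{year}}-{{month}}-{{day}}", macros)
iso_date = re.compile(expanded)
print(expanded)
print(iso_date.search("here: 2013-01-12").groups())
```

Because each macro definition contains its own parentheses, every macro reference contributes one capturing group to the expanded expression, which is what the group numbers in the action lines refer to.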
Combine with a String Gazetteer
In addition to the type of rules described above, there is a special rule which can be used to combine the regular expressions with StringGazetteer matching. The initialized StringGazetteer instance can be specified when creating the StringRegexAnnotator.
The rule consists of a single line of the form GAZETTEER => or GAZETTEER => feat1 = val1, feat2=val2 to assign some constant features (in addition to the features from the gazetteer entry and gazetteer list).
The following example illustrates this by adding a small string gazetteer, which matches the strings “date”, “a date”, “and” and “also”, to the previous example:
gazlist1 = [
("date", ),
("a date",),
("and",),
("also",),
]
gaz1 = StringGazetteer(source=gazlist1, source_fmt="gazlist")
rules8 = """
year=(19[0-9]{2}|20[0-9]{2})
month=(0[0-9]|10|11|12)
day=([012][0-9]|3[01])
// The ISO date:
|{{year}}-{{month}}-{{day}}
0 => Date type="iso", year=G1, month=G2, day=G3
# The traditional way of writing a date:
|{{day}}/({{month}})/{{year}}
0 => Date type="traditional", year=G3, month=G2, day=G1
# The rule to match the GAZETTEER
GAZETTEER => somefeature="some value"
"""
annt8 = StringRegexAnnotator(source=rules8, source_fmt="string", string_gazetteer=gaz1)
doc8 = Document("""
A document that contains a date here: 2013-01-12 and also here: 14/02/1991. This should not
get matched: 1833-12-21 and nor should this 45/03/2012 but this should 13/12/2012 and also
this, despite not being a valid data: 31/02/2000
""")
annt8(doc8)
doc8
Using the StringRegexAnnotator API directly
The main methods of StringRegexAnnotator are:
- append(source, source_fmt="file", list_features=None): adds one or more rule files, strings or rule lists to the annotator
- find_all(..): searches a string using the stored rules and returns a generator of Match objects
The find_all method is useful when a string outside of a document should be processed, or when the matches need to be processed by code before they are added as annotations to the document.
The following shows the result of calling find_all on the document text with the annotator configured above:
for m in annt8.find_all(doc8.text):
print(m)
GazetteerMatch(start=26, end=32, match='a date', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=28, end=32, match='date', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=39, end=49, match='2013-01-12', features={'type': 'iso', 'year': '2013', 'month': '01', 'day': '12'}, type='Date')
GazetteerMatch(start=50, end=53, match='and', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=54, end=58, match='also', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=65, end=75, match='14/02/1991', features={'type': 'traditional', 'year': '02', 'month': '02', 'day': '14'}, type='Date')
GazetteerMatch(start=118, end=121, match='and', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=165, end=175, match='13/12/2012', features={'type': 'traditional', 'year': '12', 'month': '12', 'day': '13'}, type='Date')
GazetteerMatch(start=176, end=179, match='and', features={'somefeature': 'some value'}, type='Lookup')
GazetteerMatch(start=180, end=184, match='also', features={'somefeature': 'some value'}, type='Lookup')
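GazetteerMatch(start=223, end=233, match='31/02/2000', features={'type': 'traditional', 'year': '02', 'month': '02', 'day': '31'}, type='Date')
Note that the traditional dates in this output carry the month value in their year feature: the additional parentheses around the month in the traditional rule create an extra capturing group, so G3 refers to the month and the year is captured as group 4.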
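As an example of processing matches by code before they become annotations, the sketch below drops date matches that are not real calendar dates, such as 31/02/2000. SimpleMatch is a hypothetical stand-in with the same fields as the printed matches above, and the sample feature values are made up for the illustration; with the annotator, the candidates would come from find_all instead.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in with the fields shown in the printed matches above
SimpleMatch = namedtuple("SimpleMatch", "start end match features type")


def is_real_date(features):
    """Return True if the year, month and day features form a calendar date."""
    try:
        date(int(features["year"]), int(features["month"]), int(features["day"]))
        return True
    except (KeyError, ValueError):
        # Missing feature, non-numeric value, or impossible date
        return False


# Illustrative sample data, not output of the annotator
candidates = [
    SimpleMatch(0, 10, "13/12/2012", {"year": "2012", "month": "12", "day": "13"}, "Date"),
    SimpleMatch(20, 30, "31/02/2000", {"year": "2000", "month": "02", "day": "31"}, "Date"),
]

valid = [m for m in candidates if m.type == "Date" and is_real_date(m.features)]
for m in valid:
    print(m.start, m.end, m.match)
```

Only the matches that pass the check would then be added to the document as annotations.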
Example use: GATE default tokenizer
The StringRegexAnnotator is used to implement the default_tokenizer, a tokenizer annotator which should work in the same way as the Java GATE DefaultTokenizer PR. The rules from the Java tokenizer have been directly converted into StringRegexAnnotator rules:
from gatenlp.lang.en.gatetokenizers import default_tokenizer, default_tokenizer_rules
print(default_tokenizer_rules)
#words#
// a word can be any combination of letters, including hyphens,
// but excluding symbols and punctuation, e.g. apostrophes
// Note that there is an alternative version of the tokeniser that
// treats hyphens as separate tokens
|(?:\p{Lu}(?:\p{Mn})*)(?:(?:\p{Ll}(?:\p{Mn})*)(?:(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*)*
0 => Token orth="upperInitial", kind="word",
|(?:\p{Lu}(?:\p{Mn})*)(?:\p{Pd}|\p{Cf})*(?:(?:\p{Lu}(?:\p{Mn})*)|\p{Pd}|\p{Cf})+
0 => Token orth="allCaps", kind="word",
|(?:\p{Ll}(?:\p{Mn})*)(?:(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*
0 => Token orth="lowercase", kind="word",
// MixedCaps is any mixture of caps and small letters that doesn't
// fit in the preceding categories
|(?:(?:\p{Ll}(?:\p{Mn})*)(?:\p{Ll}(?:\p{Mn})*)+(?:\p{Lu}(?:\p{Mn})*)+(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*))*)|(?:(?:\p{Ll}(?:\p{Mn})*)(?:\p{Ll}(?:\p{Mn})*)*(?:\p{Lu}(?:\p{Mn})*)+(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*)|(?:(?:\p{Lu}(?:\p{Mn})*)(?:\p{Pd})*(?:\p{Lu}(?:\p{Mn})*)(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*(?:(?:\p{Ll}(?:\p{Mn})*))+(?:(?:\p{Lu}(?:\p{Mn})*)|(?:\p{Ll}(?:\p{Mn})*)|\p{Pd}|\p{Cf})*)|(?:(?:\p{Lu}(?:\p{Mn})*)(?:\p{Ll}(?:\p{Mn})*)+(?:(?:\p{Lu}(?:\p{Mn})*)+(?:\p{Ll}(?:\p{Mn})*)+)+)|(?:(?:(?:\p{Lu}(?:\p{Mn})*))+(?:(?:\p{Ll}(?:\p{Mn})*))+(?:(?:\p{Lu}(?:\p{Mn})*))+)
0 => Token orth="mixedCaps", kind="word",
|(?:\p{Lo}|\p{Mc}|\p{Mn})+
0 => Token kind="word", type="other",
#numbers#
// a number is any combination of digits
|\p{Nd}+
0 => Token kind="number",
|\p{No}+
0 => Token kind="number",
#whitespace#
|(?:\p{Zs})
0 => SpaceToken kind="space",
|(?:\p{Cc})
0 => SpaceToken kind="control",
#symbols#
|(?:\p{Sk}|\p{Sm}|\p{So})
0 => Token kind="symbol",
|\p{Sc}
0 => Token kind="symbol", symbolkind="currency",
#punctuation#
|(?:\p{Pd}|\p{Cf})
0 => Token kind="punctuation", subkind="dashpunct",
|(?:\p{Pc}|\p{Po})
0 => Token kind="punctuation",
|(?:\p{Ps}|\p{Pi})
0 => Token kind="punctuation", position="startpunct",
|(?:\p{Pe}|\p{Pf})
0 => Token kind="punctuation", position="endpunct",
doc = Document("""
This is a short document. Has miXedCaps and ALLUPPER and 1234 and hyphen-word.
Also something after a new line. And another sentence. A float 3.4123 and a code XZ-2323-a.
""")
default_tokenizer(doc)
doc
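The \p{..} classes in the rules above refer to Unicode general categories, which Python's built-in re module does not support. To see which category a character falls into, the standard unicodedata module can be used. The following sketch maps single characters to the token kinds named in the rules; the char_kind helper and its lookup table are illustrative only and much coarser than the tokenizer rules, which work on character sequences.

```python
import unicodedata

# Unicode general categories grouped by the "kind" values used in the rules
KIND_BY_CATEGORY = {
    "Nd": "number", "No": "number",
    "Zs": "space", "Cc": "control",
    "Sk": "symbol", "Sm": "symbol", "So": "symbol", "Sc": "symbol",
    "Pd": "punctuation", "Cf": "punctuation",
    "Pc": "punctuation", "Po": "punctuation",
    "Ps": "punctuation", "Pi": "punctuation",
    "Pe": "punctuation", "Pf": "punctuation",
}
# Categories that the word rules are built from
WORD_CATEGORIES = {"Lu", "Ll", "Lo", "Mc", "Mn"}


def char_kind(ch):
    """Return the token kind suggested by the Unicode category of ch."""
    category = unicodedata.category(ch)
    if category in WORD_CATEGORIES:
        return "word"
    return KIND_BY_CATEGORY.get(category, "unknown")


for ch in ["a", "7", " ", "\n", "$", "-", "."]:
    print(repr(ch), unicodedata.category(ch), char_kind(ch))
```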
Notebook last updated
import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1