Processing Resource JavaRegexpAnnotator

Overview

The Java Regexp Annotator PR offers an easy way to use Java regular expressions to annotate document text. Different binding groups of a regular expression can be annotated in different ways by the same rule, and can be used to set the values of features of the created annotations. The regular expressions can be built in a structured way by defining macros, which can be reused in several regular expressions.

The matching process works like this:

Each document is processed from start to end.
All rules are tried and one or more matching rules at the next possible position are selected according to the matchPreference parameter.
For each matching rule, all the matches for the pattern are found and processed by creating an annotation and optionally features according to the rule body.
After a match has been processed the algorithm advances either to the next possible match position (if overlappingMatches is true) or to the next possible match position after the end of the longest current match (if overlappingMatches is false).
If the containingAnnotationType parameter is specified, all these steps are carried out separately for each span covered by a containing annotation.
If the inputAnnotationType parameter is specified, the steps are carried out for the text generated from all those annotations in sequence, separated by a space for each sequence of spaceAnnotationType annotations between them. The actual text will be taken from the document text covered by the annotation if textFeature is empty or from the value of the feature if it is specified.
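The difference between the two settings of overlappingMatches can be sketched with plain java.util.regex. This is only an illustration for a single pattern, not the plugin's actual code: the class and method names are made up, and it assumes that "the next possible match position" means the offset directly after the start of the previous match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchLoopSketch {
    /** Returns the matched spans as "start-end" strings (hypothetical helper). */
    public static List<String> findSpans(Pattern pattern, String text, boolean overlappingMatches) {
        List<String> spans = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        int from = 0;
        // Matcher.find(int) resets the matcher and searches from the given offset.
        while (from <= text.length() && m.find(from)) {
            spans.add(m.start() + "-" + m.end());
            if (overlappingMatches) {
                // Assumption: retry at the offset right after the start of this match.
                from = m.start() + 1;
            } else {
                // Continue after the end of the match; always advance by at least
                // one offset so that an empty match cannot loop forever.
                from = Math.max(m.end(), m.start() + 1);
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("aba");
        System.out.println(findSpans(p, "ababa", false)); // [0-3]
        System.out.println(findSpans(p, "ababa", true));  // [0-3, 2-5]
    }
}
```

With overlapping matches disabled, the second occurrence of "aba" is skipped because it starts inside the first match.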

Init Time Parameters

patternFileURL (URL, no default): the URL of a file containing the Java regular expression pattern rules. See Pattern File Format below.

Runtime Parameters

containingAnnotationType (String, no default): if specified, the rules will be applied to each part of the document covered by an annotation of that type in the inputAnnotationSet separately and only for those parts.
inputAnnotationSet (String, default="" for the default annotation set): the name of the annotation set from which to take the containingAnnotationType and/or inputAnnotationType annotations.
inputAnnotationType (String, no default): if specified, the text to annotate will be taken based on these annotations: if textFeature is not specified, from the covered document text, otherwise from the value of that feature.
outputAnnotationSet (String, default="" for the default annotation set): the name of the annotation set where the generated annotations should be stored.
overlappingMatches (boolean, default: false): if set to true, the next offset is tried for a new match after a successful match, so matches may overlap; if set to false, matching continues at the offset following the longest match of any rule.
matchPreference (enumeration, default: LONGEST_LASTRULE): determines which of the matching rules are applied. One of the following values:
- ALL: all rules that match at some position will be applied
- FIRSTRULE: the first rule that matches is applied, even if the match is shorter than for some other rule
- LASTRULE: the last rule that matches is applied, even if the match is shorter than for some other rule
- LONGEST_FIRSTRULE: of all the rules that produce a longest match, the first one is applied
- LONGEST_LASTRULE: of all the rules that produce a longest match, the last one is applied
- LONGEST_ALLRULES: all rules that produce a longest match are applied
textFeature (String, no default): if this is specified and inputAnnotationType is also specified, the value of that feature of those annotations is used for matching instead of the covered document text.
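The six matchPreference values can be read as a selection over the rules that match at one position. The following sketch only illustrates that reading; the enum, the method and the representation of matches as an array of lengths are assumptions made for the example and are not taken from the plugin.

```java
import java.util.ArrayList;
import java.util.List;

public class MatchPreferenceSketch {
    /** Hypothetical enum that mirrors the documented parameter values. */
    public enum Preference {
        ALL, FIRSTRULE, LASTRULE, LONGEST_FIRSTRULE, LONGEST_LASTRULE, LONGEST_ALLRULES
    }

    /**
     * lengths[i] is the match length of rule i at the current position,
     * or -1 if rule i does not match there. Returns the selected rule indices.
     */
    public static List<Integer> select(int[] lengths, Preference pref) {
        List<Integer> matching = new ArrayList<>();
        int longest = -1;
        for (int i = 0; i < lengths.length; i++) {
            if (lengths[i] >= 0) {
                matching.add(i);
                longest = Math.max(longest, lengths[i]);
            }
        }
        if (matching.isEmpty()) {
            return matching;
        }
        // Rules whose match is as long as the longest match of any rule.
        List<Integer> longestRules = new ArrayList<>();
        for (int i : matching) {
            if (lengths[i] == longest) {
                longestRules.add(i);
            }
        }
        switch (pref) {
            case ALL:               return matching;
            case FIRSTRULE:         return List.of(matching.get(0));
            case LASTRULE:          return List.of(matching.get(matching.size() - 1));
            case LONGEST_FIRSTRULE: return List.of(longestRules.get(0));
            case LONGEST_LASTRULE:  return List.of(longestRules.get(longestRules.size() - 1));
            default:                return longestRules; // LONGEST_ALLRULES
        }
    }
}
```

For example, with match lengths {3, 5, -1, 5, 2}, FIRSTRULE selects rule 0 even though rules 1 and 3 match more text, while LONGEST_LASTRULE selects rule 3.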

Pattern File Format

lines starting with double slashes are comments
empty lines are ignored
there should be one or more pattern rules
a macro line is of the form name=pattern where name is any name consisting of ASCII letters, numbers and underscores, and pattern is any Java regular expression. The pattern can be substituted in a later pattern line by using the macro identifier <<name>>. Note that if name has not been defined in an earlier macro line, then <<name>> is left as is in the pattern line. Macros can also be substituted in later macro definition lines, which makes it possible to build increasingly complex nested macros and patterns.
a pattern rule consists of one or more pattern lines followed by one or more action lines
pattern lines must start with a vertical bar (|) followed by a Java regular expression. The final regular expression used consists of each line enclosed in non-binding (non-capturing) parentheses, with the lines interpreted as alternatives.
action lines start with something like 1,2 => and must have the following content, in order:
- a comma-separated list of matching group numbers: these indicate the number of the binding group as counted from the first line of the expression. All these groups will, if they are part of a match, each get annotated by a separate annotation in the way specified in the remaining part of the rule action. Group number 0 stands for the whole match.
- a right-arrow => optionally preceded and/or followed by white space
- the name of the annotation type to use for this rule, optionally followed by whitespace
- optionally, a list of feature assignments: a comma-separated list of assignments of the form featurename=$n, where n is a number, or of the form featurename="string". If $n is used, the value of the feature will be the text matched by the nth group.
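The macro substitution and the way several pattern lines become one regular expression can be sketched as follows. This is a simplified reading of the format description, not the plugin's parser; the class and method names are invented for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MacroSketch {
    // Macro references look like <<name>>, with ASCII letters, digits and underscores.
    private static final Pattern MACRO_REF = Pattern.compile("<<([A-Za-z0-9_]+)>>");

    /** Replaces every known <<name>> in the line; unknown names are left as is. */
    public static String expand(String line, Map<String, String> macros) {
        Matcher m = MACRO_REF.matcher(line);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start());
            String body = macros.get(m.group(1));
            out.append(body != null ? body : m.group());
            last = m.end();
        }
        out.append(line.substring(last));
        return out.toString();
    }

    /** Wraps each pattern line in a non-capturing group and joins the lines as alternatives. */
    public static String combine(List<String> patternLines) {
        StringBuilder sb = new StringBuilder();
        for (String line : patternLines) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?:").append(line).append(')');
        }
        return sb.toString();
    }
}
```

Because the wrapping parentheses are non-capturing, group numbers keep counting across lines: in combine(List.of("a(b)", "c(d)")) the group (d) on the second line is group 2, which is the number an action line would have to use.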

Examples

Example 1

The following rule matches any of three possible strings, creating the annotation type “EmailAddress” and setting the feature “which” to the constant value “mine” for the first address and to “alternate” for the other two addresses. Also it sets the feature “string” to the actual address:

|(my\.name@somewhere\.com)
|(myother\.name@somewhere\.com)
|(my\.name@somewhereelse\.com)
1 => EmailAddress which="mine",string=$0
2,3 => EmailAddress which="alternate",string=$0
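To see why the two action lines of this rule apply to different addresses, it helps to check which binding group takes part in a match. The sketch below builds the combined expression by hand, following the description of pattern lines, with the dots escaped so that they match literally. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberSketch {
    // The three pattern lines, each wrapped in a non-capturing group and joined as alternatives.
    static final Pattern RULE = Pattern.compile(
        "(?:(my\\.name@somewhere\\.com))"
        + "|(?:(myother\\.name@somewhere\\.com))"
        + "|(?:(my\\.name@somewhereelse\\.com))");

    /** Returns the numbers of the binding groups that took part in the first match. */
    public static List<Integer> matchedGroups(String text) {
        List<Integer> groups = new ArrayList<>();
        Matcher m = RULE.matcher(text);
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                // group(g) is null when group g did not take part in the match.
                if (m.group(g) != null) {
                    groups.add(g);
                }
            }
        }
        return groups;
    }
}
```

Only one of the three groups takes part in any given match, so for each address exactly one of the two action lines creates an annotation.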

Example 2

The following rule matches email addresses and annotates the whole email address (group 0) with the annotation type “EmailAddress”, setting the feature “local” to the local name of the address (group 1) and the feature “domain” to the domain name of the address (group 2). This pattern reflects the format of an email address as specified in RFC 5321 and RFC 5322, but does not correctly limit the maximum length of the local or hostname parts. It also does not allow internationalized addresses.

|(\b[a-zA-Z0-9!#$%&'*+/=\?^_`{|}~-]{1,64}(?:\.[a-zA-Z0-9!#$%&'*+/=\?^_`{|}~-]{1,64}){0,32})@([a-zA-Z0-9-]{1,63}(?:\.[a-zA-Z0-9-]{1,63}){1,32}\b)
0 => EmailAddress local=$1,domain=$2
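The groups used for the “local” and “domain” features can be checked outside of GATE with plain java.util.regex. The sketch below uses the same expression as the rule, split into two string constants for readability. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailGroupsSketch {
    // One dot-separated part of the local name, as in the rule above.
    static final String ATOM = "[a-zA-Z0-9!#$%&'*+/=\\?^_`{|}~-]{1,64}";

    static final Pattern EMAIL = Pattern.compile(
        "(\\b" + ATOM + "(?:\\." + ATOM + "){0,32})"
        + "@([a-zA-Z0-9-]{1,63}(?:\\.[a-zA-Z0-9-]{1,63}){1,32}\\b)");

    /** Returns {local, domain} for the first address found in the text, or null if there is none. */
    public static String[] localAndDomain(String text) {
        Matcher m = EMAIL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

For the text "mail john.doe@example.com now", group 1 is "john.doe" and group 2 is "example.com", which are the values the rule would assign to the two features.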

Example 4

The following rule matches the word “lookmeup” ignoring the case of the match:

|(?i)lookmeup
0 => FoundIt
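The (?i) prefix in this rule is the standard embedded flag of java.util.regex for case-insensitive matching. A small sketch of its effect, with invented class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IgnoreCaseSketch {
    /** Counts the non-overlapping matches of the regular expression in the text. */
    public static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "lookmeup LOOKMEUP LookMeUp";
        System.out.println(countMatches("(?i)lookmeup", text)); // 3
        System.out.println(countMatches("lookmeup", text));     // 1
    }
}
```

Note that in Java (?i) on its own only ignores case for ASCII letters; for other scripts the (?u) flag has to be added as well, as in (?iu).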

Example 5

The following illustrates the use of macros for matching dates (simplified and stripped down example):

daynum=(?:[0-2]?[1-9]|10|20|30|31)
monthnum=(?:0?[1-9]|10|11|12)
year4=(?:19[5-9][0-9]|20[0-2][0-9])

|(<<daynum>>)(-|\. ?| |/) {0,3}(<<monthnum>>)\2(<<year4>>)
0 => Date kind="date",format="d.m.y",dayString=$1,monthString=$3,yearString=$4,monthIsNumber="true"

|(<<daynum>>)(<<monthnum>>)(<<year4>>)(?![0-9\p{L}])
0 => Date kind="date",format="dmy",dayString=$1,monthString=$2,yearString=$3,monthIsNumber="true"

|(<<year4>>)(-|\. ?| |/)(<<monthnum>>)\2(<<daynum>>)
0 => Date kind="date",format="y.m.d",dayString=$4,monthString=$3,yearString=$1,monthIsNumber="true"
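The first date rule can be tried in plain Java to see how the back-reference \2 forces the same separator on both sides of the month. In this sketch the macros are substituted with String.replace rather than with the plugin's own macro handling, and the class and method names are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateMacroSketch {
    // Macro bodies copied from the macro lines above.
    static final String DAYNUM = "(?:[0-2]?[1-9]|10|20|30|31)";
    static final String MONTHNUM = "(?:0?[1-9]|10|11|12)";
    static final String YEAR4 = "(?:19[5-9][0-9]|20[0-2][0-9])";

    // First pattern line, with the macro references replaced by their bodies.
    static final Pattern DMY = Pattern.compile(
        "(<<daynum>>)(-|\\. ?| |/) {0,3}(<<monthnum>>)\\2(<<year4>>)"
            .replace("<<daynum>>", DAYNUM)
            .replace("<<monthnum>>", MONTHNUM)
            .replace("<<year4>>", YEAR4));

    /** Returns {day, month, year} from groups 1, 3 and 4, or null if no date is found. */
    public static String[] parse(String text) {
        Matcher m = DMY.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3), m.group(4) };
    }
}
```

Here parse("24.12.2015") yields day "24", month "12" and year "2015", while "24.12-2015" is not matched because the second separator differs from the one captured by group 2.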
