Processing Resource JavaRegexpAnnotator
Overview
The Java Regexp Annotator PR offers an easy way to use Java regular expressions to annotate document text. Different binding groups of a regular expression can be annotated in different ways by the same rule and can be used to set feature values on the created annotations. Regular expressions can be built in a structured way by defining macros, which can be re-used in several regular expressions.
The matching process works like this:
- Each document is processed from start to end.
- All rules are tried and one or more matching rules at the next possible position are selected according to the matchPreference parameter.
- For each matching rule, all the matches for the pattern are found and processed by creating an annotation and optionally features according to the rule body.
- After a match has been processed the algorithm advances either to the next possible match position (if overlappingMatches is true) or to the next possible match position after the end of the longest current match (if overlappingMatches is false).
- If the containingAnnotationType parameter is specified, all these steps are carried out for each span covered by a containing annotation.
- If the inputAnnotationType parameter is specified, the steps are carried out for the text generated from all those annotations in sequence, separated by a space for each sequence of spaceAnnotationType annotations between them. The actual text is taken from the document text covered by the annotation if textFeature is empty, or from the value of that feature if it is specified.
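The way the scan position advances can be illustrated with plain java.util.regex. This is a minimal sketch for a single pattern, not the PR's actual implementation; the class and method names (ScanSketch, scan) are made up for illustration, and rule selection via matchPreference is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScanSketch {
  /** Returns the {start, end} offsets of all matches found while scanning the text. */
  static List<int[]> scan(Pattern pattern, String text, boolean overlappingMatches) {
    List<int[]> spans = new ArrayList<>();
    Matcher m = pattern.matcher(text);
    int from = 0;
    while (from <= text.length() && m.find(from)) {
      spans.add(new int[] {m.start(), m.end()});
      if (overlappingMatches) {
        // Try again at the offset right after the start of this match.
        from = m.start() + 1;
      } else {
        // Continue after the end of this match; always move forward by at
        // least one offset so that an empty match cannot loop forever.
        from = Math.max(m.end(), m.start() + 1);
      }
    }
    return spans;
  }

  public static void main(String[] args) {
    for (boolean overlapping : new boolean[] {false, true}) {
      for (int[] span : scan(Pattern.compile("a+"), "aaa b", overlapping)) {
        System.out.println(overlapping + ": " + span[0] + "-" + span[1]);
      }
    }
  }
}
```

With overlappingMatches set to false the text "aaa b" yields one match for a+, with true it yields one match per starting offset inside the run of letters.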
Init Time Parameters
- patternFileURL (URL, no default): the URL of a file containing the Java regular expression pattern rules. See Pattern File Format below.
Runtime Parameters
- containingAnnotationType (String, no default): if specified, the rules are applied separately to each part of the document covered by an annotation of that type in the inputAnnotationSet, and only to those parts.
- inputAnnotationSet (String, default="" for the default annotation set): the name of the annotation set from which to take the containingAnnotationType and/or inputAnnotationType annotations.
- inputAnnotationType (String, no default): if specified, the text to annotate is taken from these annotations: from the covered document text if textFeature is not specified, otherwise from the value of that feature.
- outputAnnotationSet (String, default="" for the default annotation set): the name of the annotation set where the generated annotations are stored.
- overlappingMatches (boolean, default: false): if set to true, after a successful match the next offset is considered for a new match; if set to false, matching continues at the offset following the longest match of any rule.
- matchPreference (enumeration, default: LONGEST_LASTRULE): one of the following values:
  - ALL: all rules that match at some position are applied
  - FIRSTRULE: the first rule that matches is applied, even if the match is shorter than for some other rule
  - LASTRULE: the last rule that matches is applied, even if the match is shorter than for some other rule
  - LONGEST_FIRSTRULE: of all the rules that produce a longest match, the first one is applied
  - LONGEST_LASTRULE: of all the rules that produce a longest match, the last one is applied
  - LONGEST_ALLRULES: all rules that produce a longest match are applied
- textFeature (String, no default): if this is specified and inputAnnotationType is also specified, the values of that feature of those annotations are used for matching.
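The matchPreference values can be read as a selection over the rules that match at one position. The following is a minimal sketch of that selection, assuming the candidates are given in rule-file order; the Candidate class and the select method are hypothetical helpers for illustration and are not part of the PR's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MatchPreferenceSketch {
  enum MatchPreference {
    ALL, FIRSTRULE, LASTRULE, LONGEST_FIRSTRULE, LONGEST_LASTRULE, LONGEST_ALLRULES
  }

  /** Hypothetical record of one rule that matched at the current position. */
  static final class Candidate {
    final int ruleIndex; // position of the rule in the pattern file
    final int length;    // length of the match produced by that rule

    Candidate(int ruleIndex, int length) {
      this.ruleIndex = ruleIndex;
      this.length = length;
    }
  }

  /** Picks the rules to apply; candidates must be in rule-file order. */
  static List<Candidate> select(List<Candidate> candidates, MatchPreference pref) {
    List<Candidate> pool = new ArrayList<>(candidates);
    if (pool.isEmpty()) {
      return pool;
    }
    if (pref.name().startsWith("LONGEST")) {
      int max = 0;
      for (Candidate c : pool) {
        max = Math.max(max, c.length);
      }
      final int longest = max;
      pool.removeIf(c -> c.length < longest); // keep only the longest matches
    }
    switch (pref) {
      case FIRSTRULE:
      case LONGEST_FIRSTRULE:
        return new ArrayList<>(pool.subList(0, 1));
      case LASTRULE:
      case LONGEST_LASTRULE:
        return new ArrayList<>(pool.subList(pool.size() - 1, pool.size()));
      default: // ALL and LONGEST_ALLRULES keep everything left in the pool
        return pool;
    }
  }

  public static void main(String[] args) {
    List<Candidate> candidates =
        Arrays.asList(new Candidate(0, 3), new Candidate(1, 5), new Candidate(2, 5));
    for (MatchPreference pref : MatchPreference.values()) {
      StringBuilder rules = new StringBuilder();
      for (Candidate c : select(candidates, pref)) {
        rules.append(' ').append(c.ruleIndex);
      }
      System.out.println(pref + " -> rules" + rules);
    }
  }
}
```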
Pattern File Format
- lines starting with double slashes are comments
- empty lines are ignored
- there should be one or more pattern rules
- a macro line is of the form name=pattern where name is any name consisting of ASCII letters, numbers and underscores, and pattern is any Java pattern. The pattern can be substituted in a pattern line later by using the macro identifier <<name>>. Note that if name has not been defined in a macro line, then <<name>> is left as is in a pattern line. Macros can also be substituted in later macro definition lines, building increasingly complex nested macros and patterns.
- a pattern rule consists of one or more pattern lines followed by one or more action lines
- pattern lines must start with a vertical bar (|) followed by a Java regular expression. The final regular expression used consists of each line enclosed in non-binding parentheses and interpreted as alternatives.
- action lines start with something like 1,2 => and must have the following content, in order:
  - a comma-separated list of matching group numbers: these indicate the number of the binding group as counted from the first line of the expression. Each of these groups that is part of a match gets its own annotation, created as specified in the remaining part of the rule action. Group number 0 stands for the whole match.
  - a right-arrow => optionally preceded and/or followed by whitespace
  - the name of the annotation type to use for this rule, optionally followed by whitespace
  - optionally, a list of feature assignments: a comma-separated list of assignments of the form featurename=$n where n is a number, or of the form featurename="string". If $n is used, the value of the feature is the nth matched group.
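Macro substitution and the joining of pattern lines can be sketched in a few lines of Java. This is an illustration of the format described above under stated assumptions, not the PR's own parser: the class PatternFileSketch and its methods expand and combine are hypothetical names, and the pattern lines are assumed to be passed without their leading vertical bar.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternFileSketch {
  // A macro reference: <<name>> with ASCII letters, digits and underscores.
  private static final Pattern MACRO_REF = Pattern.compile("<<([A-Za-z0-9_]+)>>");

  /** Replaces every known <<name>> with its pattern; unknown names are left as is. */
  static String expand(String text, Map<String, String> macros) {
    Matcher m = MACRO_REF.matcher(text);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      String body = macros.get(m.group(1));
      String replacement = (body != null) ? body : m.group();
      // quoteReplacement keeps backslashes and $ in the macro body literal.
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }

  /** Wraps each pattern line in a non-capturing group and joins them as alternatives. */
  static String combine(List<String> patternLines) {
    StringBuilder sb = new StringBuilder();
    for (String line : patternLines) {
      if (sb.length() > 0) {
        sb.append('|');
      }
      sb.append("(?:").append(line).append(')');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> macros = new LinkedHashMap<>();
    macros.put("digit", "[0-9]");
    // A later macro definition may use earlier macros.
    macros.put("num", expand("<<digit>>+", macros));
    System.out.println(expand("(<<num>>)-(<<num>>)", macros));
    System.out.println(combine(Arrays.asList("(a)", "(b)")));
  }
}
```

Because the wrapping groups are non-capturing, the binding groups keep counting across lines, which is why action lines can refer to group numbers "as counted from the first line of the expression".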
Examples
Example 1
The following rule matches any of three possible strings, creating the annotation type “EmailAddress” and setting the feature “which” to the constant value “mine” for the first address and to “alternate” for the other two addresses. Also it sets the feature “string” to the actual address:
|(my\.name@somewhere\.com)
|(myother\.name@somewhere\.com)
|(my\.name@somewhereelse\.com)
1 => EmailAddress which="mine",string=$0
2,3 => EmailAddress which="alternate",string=$0
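To see how group numbers run across the pattern lines of a rule like this one, the combined expression can be tried directly in Java. This is a sketch only: the class AlternativeGroupsSketch and the method whichFeature are hypothetical, the dots are escaped here so that they match literally, and the feature value is returned as a string instead of being set on an annotation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternativeGroupsSketch {
  // Each pattern line is wrapped in a non-capturing group, so the binding
  // groups are numbered 1, 2, 3 across the three lines.
  static final Pattern RULE = Pattern.compile(
      "(?:(my\\.name@somewhere\\.com))"
          + "|(?:(myother\\.name@somewhere\\.com))"
          + "|(?:(my\\.name@somewhereelse\\.com))");

  /** Returns the value the "which" feature would get, or null if nothing matches. */
  static String whichFeature(String text) {
    Matcher m = RULE.matcher(text);
    if (!m.find()) {
      return null;
    }
    if (m.group(1) != null) {
      return "mine"; // corresponds to the action line for group 1
    }
    if (m.group(2) != null || m.group(3) != null) {
      return "alternate"; // corresponds to the action line for groups 2,3
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(whichFeature("mail my.name@somewhere.com now"));
    System.out.println(whichFeature("mail myother.name@somewhere.com now"));
  }
}
```

A group that did not take part in the match is reported as null by Matcher.group, which mirrors the rule that only groups that are part of a match get annotated.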
Example 2
The following rule matches email addresses and annotates the whole email address (group 0) with the annotation type “EmailAddress”, setting the feature “local” to the local name of the address (group 1) and the feature “domain” to the domain name of the address (group 2). (This pattern reflects the format of an email address as specified in RFC 5321 and RFC 5322 but does not correctly limit the maximum length of the local or hostname parts. It also does not allow internationalized addresses.)
|(\b[a-zA-Z0-9!#$%&'*+/=\?^_`{|}~-]{1,64}(?:\.[a-zA-Z0-9!#$%&'*+/=\?^_`{|}~-]{1,64}){0,32})@([a-zA-Z0-9-]{1,63}(?:\.[a-zA-Z0-9-]{1,63}){1,32}\b)
0 => EmailAddress local=$1,domain=$2
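The group layout of this rule can be checked outside of the PR with java.util.regex. The sketch below uses the same expression as the pattern line, written as a Java string literal; the class EmailGroupsSketch and the method parts are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailGroupsSketch {
  // Same expression as the pattern line above, with backslashes doubled
  // for the Java string literal.
  static final Pattern EMAIL = Pattern.compile(
      "(\\b[a-zA-Z0-9!#$%&'*+/=\\?^_`{|}~-]{1,64}"
          + "(?:\\.[a-zA-Z0-9!#$%&'*+/=\\?^_`{|}~-]{1,64}){0,32})"
          + "@([a-zA-Z0-9-]{1,63}(?:\\.[a-zA-Z0-9-]{1,63}){1,32}\\b)");

  /** Returns {whole match, local part, domain} for the first address, or null. */
  static String[] parts(String text) {
    Matcher m = EMAIL.matcher(text);
    if (!m.find()) {
      return null;
    }
    return new String[] {m.group(0), m.group(1), m.group(2)};
  }

  public static void main(String[] args) {
    String[] p = parts("Contact john.doe@example.org today");
    if (p != null) {
      System.out.println("EmailAddress=" + p[0] + " local=" + p[1] + " domain=" + p[2]);
    }
  }
}
```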
Example 4
The following rule matches the word “lookmeup” ignoring the case of the match:
|(?i)lookmeup
0 => FoundIt
Example 5
The following illustrates the use of macros for matching dates (simplified and stripped down example):
daynum=(?:[0-2]?[1-9]|10|20|30|31)
monthnum=(?:0?[1-9]|10|11|12)
year4=(?:19[5-9][0-9]|20[0-2][0-9])
|(<<daynum>>)(-|\. ?| |/) {0,3}(<<monthnum>>)\2(<<year4>>)
0 => Date kind="date",format="d.m.y",dayString=$1,monthString=$3,yearString=$4,monthIsNumber="true"
|(<<daynum>>)(<<monthnum>>)(<<year4>>)(?![0-9\p{L}])
0 => Date kind="date",format="dmy",dayString=$1,monthString=$2,yearString=$3,monthIsNumber="true"
|(<<year4>>)(-|\. ?| |/)(<<monthnum>>)\2(<<daynum>>)
0 => Date kind="date",format="y.m.d",dayString=$4,monthString=$3,yearString=$1,monthIsNumber="true"
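The first of these date rules can be tried in Java by expanding the macros by hand. This is a sketch under stated assumptions: the class DateMacroSketch and the method parseDmy are hypothetical, string concatenation stands in for the <<name>> substitution, and only the d.m.y rule is covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateMacroSketch {
  // The three macros from the example, as plain Java strings.
  static final String DAYNUM = "(?:[0-2]?[1-9]|10|20|30|31)";
  static final String MONTHNUM = "(?:0?[1-9]|10|11|12)";
  static final String YEAR4 = "(?:19[5-9][0-9]|20[0-2][0-9])";

  // First rule: group 1 = day, group 2 = separator, group 3 = month, group 4 = year.
  // The backreference \2 requires the same separator on both sides of the month.
  static final Pattern DMY = Pattern.compile(
      "(" + DAYNUM + ")(-|\\. ?| |/) {0,3}(" + MONTHNUM + ")\\2(" + YEAR4 + ")");

  /** Returns {dayString, monthString, yearString} for the first date, or null. */
  static String[] parseDmy(String text) {
    Matcher m = DMY.matcher(text);
    if (!m.find()) {
      return null;
    }
    return new String[] {m.group(1), m.group(3), m.group(4)};
  }

  public static void main(String[] args) {
    String[] d = parseDmy("Signed on 24.12.2019.");
    if (d != null) {
      System.out.println("dayString=" + d[0] + " monthString=" + d[1] + " yearString=" + d[2]);
    }
    System.out.println(parseDmy("24.12-2019") == null); // mixed separators
  }
}
```

Since the macros are themselves non-binding groups, they do not shift the group numbers used in the action lines.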