Processing Resource JavaRegexpAnnotator

Overview

The Java Regexp Annotator PR offers an easy way to use Java regular expressions to annotate document text. Different binding (capturing) groups of a regular expression can be annotated in different ways by the same rule, and they can be used to set the values of features of the created annotations. The Java regular expressions can be built in a structured way by defining macros, which can be re-used in several regular expressions.

The matching process works like this:

Init Time Parameters

Runtime Parameters

Pattern File Format

Examples

Example 1

The following rule matches any of three possible strings and creates an annotation of type “EmailAddress”. It sets the feature “which” to the constant value “mine” for the first address and to “alternate” for the other two addresses, and it sets the feature “string” to the actual address:

|(my\.name@somewhere\.com)
|(myother\.name@somewhere\.com)
|(my\.name@somewhereelse\.com)
1 => EmailAddress which="mine",string=$0
2,3 => EmailAddress which="alternate",string=$0
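The group logic of this rule can be tried out in plain java.util.regex. The following is a minimal sketch and not the PR's own code: it assumes that the three pattern lines are combined into one alternation (which is what the group numbers 1, 2 and 3 in the rule suggest), and the class and method names are made up for illustration. The output strings only imitate the rule's right-hand side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationGroupsSketch {
    // Assumption: the three pattern lines form one alternation with groups 1..3.
    // The dots are escaped so that they match only a literal '.'.
    static final Pattern ADDRESSES = Pattern.compile(
        "(my\\.name@somewhere\\.com)"
        + "|(myother\\.name@somewhere\\.com)"
        + "|(my\\.name@somewhereelse\\.com)");

    // Returns one description per match, mimicking what the rule would annotate.
    public static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = ADDRESSES.matcher(text);
        while (m.find()) {
            // Only the group of the alternative that matched is non-null.
            String which = (m.group(1) != null) ? "mine" : "alternate";
            out.add("EmailAddress which=" + which + ",string=" + m.group(0));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Write to my.name@somewhere.com or myother.name@somewhere.com";
        for (String line : describe(text)) {
            System.out.println(line);
        }
    }
}
```

Checking which group is non-null is the plain-Java counterpart of writing separate action lines for group 1 and for groups 2 and 3.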

Example 2

The following rule matches email addresses and annotates the whole address (group 0) with the annotation type “EmailAddress”, setting the feature “local” to the local part of the address (group 1) and the feature “domain” to the domain name (group 2). (This pattern reflects the format of an email address as specified in RFC 5321 and RFC 5322, but it does not correctly limit the maximum length of the local or hostname parts, and it does not allow internationalized addresses.)

|(\b[a-zA-Z0-9!#$%&'*+/=\?^_`{|}~-]{1,64}(?:\.[a-zA-Z0-9!#$%&'*+/=\?^_`{|}~-]{1,64}){0,32})@([a-zA-Z0-9-]{1,63}(?:\.[a-zA-Z0-9-]{1,63}){1,32}\b)
0 => EmailAddress local=$1,domain=$2
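To see which text ends up in groups 0, 1 and 2, the same expression can be run directly with java.util.regex. This is only a sketch under the assumption that the PR hands the pattern to the standard Java regex engine unchanged; the class name, the helper firstAddress and the constants ATOM and LABEL are hypothetical and merely split the expression into readable pieces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailGroupsSketch {
    // One dot-free run of the local part, copied from the rule.
    private static final String ATOM = "[a-zA-Z0-9!#$%&'*+/=\\?^_`{|}~-]{1,64}";
    // One label of the domain name, copied from the rule.
    private static final String LABEL = "[a-zA-Z0-9-]{1,63}";

    // Reassembles the rule's pattern: group 1 = local part, group 2 = domain.
    static final Pattern EMAIL = Pattern.compile(
        "(\\b" + ATOM + "(?:\\." + ATOM + "){0,32})"
        + "@(" + LABEL + "(?:\\." + LABEL + "){1,32}\\b)");

    // Returns {whole match, local part, domain} for the first address, or null.
    public static String[] firstAddress(String text) {
        Matcher m = EMAIL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = firstAddress("Contact john.doe@example.org today.");
        System.out.println("EmailAddress local=" + parts[1] + ",domain=" + parts[2]);
    }
}
```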

Example 4

The following rule matches the word “lookmeup” ignoring the case of the match:

|(?i)lookmeup
0 => FoundIt
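The embedded flag (?i) is standard Java regex syntax, so its effect can be checked outside the PR. The sketch below uses a made-up class and method name and shows that the inline flag behaves like passing Pattern.CASE_INSENSITIVE at compile time. Note that in Java this flag alone only covers US-ASCII letters; Unicode-aware case folding additionally needs the u flag, as in (?iu).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseInsensitiveSketch {
    // The pattern exactly as written in the rule, with the inline flag.
    static final Pattern INLINE_FLAG = Pattern.compile("(?i)lookmeup");
    // Equivalent form using a compile-time flag instead of (?i).
    static final Pattern COMPILE_FLAG =
        Pattern.compile("lookmeup", Pattern.CASE_INSENSITIVE);

    // Counts the non-overlapping matches of the rule's pattern in the text.
    public static int countMatches(String text) {
        Matcher m = INLINE_FLAG.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("LookMeUp, LOOKMEUP and lookmeup"));
        System.out.println(COMPILE_FLAG.matcher("LookMeUp").matches());
    }
}
```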

Example 5

The following illustrates the use of macros for matching dates (a simplified, stripped-down example):

daynum=(?:[0-2]?[1-9]|10|20|30|31)
monthnum=(?:0?[1-9]|10|11|12)
year4=(?:19[5-9][0-9]|20[0-2][0-9])

|(<<daynum>>)(-|\. ?| |/) {0,3}(<<monthnum>>)\2(<<year4>>)
0 => Date kind="date",format="d.m.y",dayString=$1,monthString=$3,yearString=$4,monthIsNumber="true"

|(<<daynum>>)(<<monthnum>>)(<<year4>>)(?![0-9\p{L}])
0 => Date kind="date",format="dmy",dayString=$1,monthString=$2,yearString=$3,monthIsNumber="true"

|(<<year4>>)(-|\. ?| |/)(<<monthnum>>)\2(<<daynum>>)
0 => Date kind="date",format="y.m.d",dayString=$4,monthString=$3,yearString=$1,monthIsNumber="true"
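How the PR expands macros internally is not described here. As a rough illustration only, the sketch below treats a macro reference <<name>> as plain text substitution and then runs the first date rule with java.util.regex. The class name and the helpers expand and parseDotted are hypothetical, and the returned string only imitates the features set by the rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MacroExpansionSketch {
    // Assumption: a macro reference looks like <<name>> and is replaced textually.
    private static final Pattern MACRO_REF = Pattern.compile("<<([A-Za-z0-9_]+)>>");

    // Replaces every <<name>> in the pattern with the body of that macro.
    public static String expand(String pattern, Map<String, String> macros) {
        Matcher m = MACRO_REF.matcher(pattern);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("Undefined macro: " + m.group(1));
            }
            // quoteReplacement keeps '\' and '$' in the macro body literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(body));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // The three macros from the example above.
    public static Map<String, String> dateMacros() {
        Map<String, String> macros = new LinkedHashMap<>();
        macros.put("daynum", "(?:[0-2]?[1-9]|10|20|30|31)");
        macros.put("monthnum", "(?:0?[1-9]|10|11|12)");
        macros.put("year4", "(?:19[5-9][0-9]|20[0-2][0-9])");
        return macros;
    }

    // Applies the first date rule (format "d.m.y"); returns null if nothing matches.
    public static String parseDotted(String text) {
        String rule = "(<<daynum>>)(-|\\. ?| |/) {0,3}(<<monthnum>>)\\2(<<year4>>)";
        Matcher m = Pattern.compile(expand(rule, dateMacros())).matcher(text);
        if (!m.find()) {
            return null;
        }
        // Group 2 is the separator; \2 forces the same separator to be used twice.
        return "dayString=" + m.group(1) + ",monthString=" + m.group(3)
            + ",yearString=" + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(parseDotted("Due on 24.12.2019 at noon"));
    }
}
```

Because each macro body is wrapped in a non-capturing group (?:...), substituting it does not shift the numbering of the capturing groups in the rule, which is why $1, $3 and $4 stay predictable.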