The PythonPr Processing Resource

The PythonPr processing resource can be used to run a Python program on documents.

When a pipeline that contains the PythonPr processing resource is run, the following main steps happen:

Here is a simple example Python program. It uses a regular expression to split the document into tokens separated by whitespace or the punctuation characters , . ! ? and creates an annotation of type “Token” in the default annotation set for each token. For each token annotation, the feature “tokennr” is set to the sequence number of the token in the document. The program also stores the total number of tokens as a document feature.

This example implements the code to run for each document as a function named run, which takes the document to process as its first parameter and accepts arbitrary additional keyword arguments (**kwargs).

To actually invoke the function for each document, the interact() function must be called at the end of the Python script.

import re
from gatenlp import GateNlpPr, interact


@GateNlpPr
def run(doc, **kwargs):
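    # Work on the default annotation set and remove any earlier annotations
    set1 = doc.annset()
    set1.clear()
    text = doc.text
    # Match separator runs; the ^ and $ alternatives also match at the start and end of the text
    whitespaces = [m for m in re.finditer(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$", text)]
    # Each token spans from the end of one separator match to the start of the next
    for k in range(len(whitespaces) - 1):
        fromoff = whitespaces[k].end()
        tooff = whitespaces[k + 1].start()
        set1.add(fromoff, tooff, "Token", {"tokennr": k})
    # Store the token count as a document feature
    doc.features["nr_tokens"] = len(whitespaces) - 1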


interact()
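
The offset arithmetic in the loop can be checked outside GATE on a plain string. The following is a minimal sketch that uses only the standard re module; the helper name token_spans and the sample sentence are made up for illustration, and no gatenlp objects are involved.

```python
import re

# Same separator pattern as in the example above.
SEPARATORS = re.compile(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$")


def token_spans(text):
    """Return (start, end, token_text) for the text between consecutive separator matches."""
    seps = list(SEPARATORS.finditer(text))
    spans = []
    for k in range(len(seps) - 1):
        start = seps[k].end()        # token begins where one separator ends
        end = seps[k + 1].start()    # and stops where the next separator begins
        spans.append((start, end, text[start:end]))
    return spans


# Hypothetical sample text, used only to show the offsets.
for start, end, tok in token_spans("Hello, world! Bye"):
    print(start, end, tok)
```

Each printed triple corresponds to the fromoff, tooff and covered text of one “Token” annotation that the example would create for the same string.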

The function is passed the document, a gatenlp.Document, as its first argument. It also receives all parameters defined in the PythonPr programParams parameter as keyword arguments, plus the _config_file parameter as an additional keyword argument if it was set in the PR. Note that if the function does not declare **kwargs, it is called without any keyword arguments.
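
The calling convention described above can be illustrated with a small stand-in dispatcher. This is a sketch only, not the actual gatenlp implementation: call_with_params, accepts_kwargs, the two sample functions and the parameter name outset are hypothetical, and a plain string stands in for the gatenlp.Document.

```python
import inspect


def accepts_kwargs(func):
    """True if func declares a **kwargs parameter."""
    params = inspect.signature(func).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)


def call_with_params(func, doc, params):
    """Pass params as keyword arguments only if func can take them (assumed behaviour)."""
    if accepts_kwargs(func):
        return func(doc, **params)
    return func(doc)


def run_with_kwargs(doc, **kwargs):
    # Reads a hypothetical parameter, falling back to an empty string.
    return kwargs.get("outset", "")


def run_plain(doc):
    # No **kwargs declared, so it is called with the document only.
    return "no kwargs"


params = {"outset": "Tokens"}
print(call_with_params(run_with_kwargs, "some text", params))
print(call_with_params(run_plain, "some text", params))
```

Reading parameters with kwargs.get and a default keeps the function usable even when a parameter has not been set in the PR.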

Instead of a function, a callable class can be decorated with the @GateNlpPr decorator.

The class must implement the __call__ method, but in addition can also implement the start, finish, reduce and result methods. The following example implements the same tokenizer as above in a class but also counts and prints out the total number of tokens over all documents. Again the interact() call must be placed at the end of the Python script.

import re
from gatenlp import GateNlpPr, interact, logger

@GateNlpPr
class MyProcessor:
    def __init__(self):
        self.tokens_total = 0

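    def start(self, **kwargs):
        # Reset the running total before documents are processed
        self.tokens_total = 0

    def finish(self, **kwargs):
        # Report the total accumulated over all documents
        logger.info("Total number of tokens: {}".format(self.tokens_total))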

    def __call__(self, doc):
        set1 = doc.annset()
        set1.clear()
        text = doc.text
        whitespaces = [m for m in re.finditer(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$", text)]
        nrtokens = len(whitespaces) - 1
        for k in range(nrtokens):
            fromoff = whitespaces[k].end()
            tooff = whitespaces[k + 1].start()
            set1.add(fromoff, tooff, "Token", {"tokennr": k})
        doc.features["nr_tokens"] = nrtokens
        self.tokens_total += nrtokens

interact()
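
The order in which the methods of such a class are used can be pictured with a stand-in driver. This is a sketch under stated assumptions, not how the plugin is implemented: drive, CountingProcessor and the whitespace split are hypothetical, plain strings replace documents, it is assumed that start runs once before the documents and finish once after them, and only start, __call__ and finish are shown.

```python
class CountingProcessor:
    """Hypothetical processor that counts whitespace-separated tokens in plain strings."""

    def __init__(self):
        self.tokens_total = 0
        self.report = None

    def start(self, **kwargs):
        # Reset state before a run begins.
        self.tokens_total = 0

    def finish(self, **kwargs):
        # Summarise once all documents have been seen.
        self.report = "Total number of tokens: {}".format(self.tokens_total)

    def __call__(self, doc, **kwargs):
        # Per-document work: here simply count the tokens.
        self.tokens_total += len(doc.split())


def drive(processor, docs, **params):
    """Stand-in for the pipeline: start, then one call per document, then finish."""
    processor.start(**params)
    for doc in docs:
        processor(doc, **params)
    processor.finish(**params)
    return processor


proc = drive(CountingProcessor(), ["one two", "three four five"])
print(proc.report)
```

Because the instance persists across documents, state such as the running total can be kept in attributes rather than in global variables.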

Advantages of using a callable class:

PythonPr Init Parameters

Parameters that must be set when the processing resource is created:

If a file URL is specified and the file is writable, the file can be edited within GATE by double-clicking the processing resource in the GUI. See Python Editor.

PythonPr Runtime Parameters

Parameter Preconfiguration File

When a PythonPr is created from a Python file and another file exists that has the same name as the Python file with “.parms” appended, then this file is expected to be a JSON map which is used to pre-set the programParams for this script.

For example, when a PythonPr is created from the Python file mydir/myscript.py and a file mydir/myscript.py.parms exists, then this file is used to pre-set the programParams. The parms file should contain a JSON map; each key in the JSON map is used as the name of an entry in the programParams FeatureMap, and the corresponding value is used as the value in the feature map.

Note that both feature names and values can only be entered as Strings in the programParams FeatureMap, so the JSON map should also contain String keys and values. If a value is “null”, the parameter is added but assigned null/None. If the value is not a string, it is converted to a String using the object's toString() method before it is assigned to the FeatureMap.

For example, if the file mydir/myscript.py.parms contains the following:

{
  "parm1": "val1",
  "parm2": false,
  "parm3": null,
  "parm4": 33
}

then the programParams of the PythonPr for this script will be set to the following FeatureMap: “parm1”=“val1”, “parm2”=“false”, “parm3”=null, “parm4”=“33”.
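
The conversion rules above can be emulated in a few lines of Python. This is only an approximation for illustration: the real conversion happens inside the GATE plugin using toString(), the helper name load_parms is hypothetical, and json.dumps is used here as a stand-in that gives the same strings for booleans and integers (other value types, such as floats or nested maps, may be rendered differently by the plugin).

```python
import json


def load_parms(json_text):
    """Emulate the described pre-setting: strings and null are kept, other values become strings."""
    raw = json.loads(json_text)
    parms = {}
    for key, value in raw.items():
        if value is None or isinstance(value, str):
            parms[key] = value
        else:
            # json.dumps renders False as "false" and 33 as "33"
            parms[key] = json.dumps(value)
    return parms


example = '{"parm1": "val1", "parm2": false, "parm3": null, "parm4": 33}'
print(load_parms(example))
```

Writing all values as JSON strings in the .parms file avoids depending on how non-string values are converted.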