GATE Worker

The GATE Worker is a module that lets you run arbitrary code in a Java GATE process from Python and exchange documents between Python and Java.

One possible use of this is to run an existing GATE pipeline on a Python GateNLP document.

This works by having the Python module communicate with a Java process over a socket connection. Java calls made on the Python side are sent to Java and executed there, and the result is sent back to Python.
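
The request/response idea described above can be illustrated with a toy example that uses only the Python standard library. This is not the protocol the GATE Worker actually uses: the JSON message format, the `HANDLERS` table and the function names are all assumptions made for this sketch, and a Python thread stands in for the Java process.

```python
import json
import socket
import threading

# Toy stand-in for the Java side: functions that can be called remotely.
HANDLERS = {
    "upper": lambda text: text.upper(),
    "length": lambda text: len(text),
}


def serve_one_client(server_sock):
    """Accept one connection and answer newline-delimited JSON requests."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:  # the loop ends when the client disconnects
            request = json.loads(line)
            result = HANDLERS[request["method"]](*request["args"])
            writer.write(json.dumps({"result": result}) + "\n")
            writer.flush()


def call(reader, writer, method, *args):
    """Send one request and block until the reply arrives."""
    writer.write(json.dumps({"method": method, "args": list(args)}) + "\n")
    writer.flush()
    return json.loads(reader.readline())["result"]


# Bind to port 0 so the operating system picks a free port.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
worker = threading.Thread(target=serve_one_client, args=(server,), daemon=True)
worker.start()

with socket.create_connection(("127.0.0.1", port)) as client, \
        client.makefile("r", encoding="utf-8") as reader, \
        client.makefile("w", encoding="utf-8") as writer:
    upper_result = call(reader, writer, "upper", "gate")
    length_result = call(reader, writer, "length", "gate worker")

worker.join(timeout=5)
server.close()
print(upper_result, length_result)
```

Each call on the client side blocks until the reply arrives, which is why a remote method call can look like an ordinary local call.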

For this to work, GATE and Java have to be installed on the machine that runs the GATE Worker.

The easiest way to run this is by first manually starting the GATE Worker in the Java GATE GUI and then connecting to it from the Python side.

Manually starting the GATE Worker from GATE

  1. Start GATE
  2. Load the Python plugin using the CREOLE Plugin Manager
  3. Create a new Language Resource (NOTE: not a Processing Resource!): “PythonWorkerLr”

When creating the PythonWorkerLr, the following initialization parameters can be specified:

A GATE Worker started via the PythonWorkerLr keeps running until the resource is deleted or GATE is closed.

Using the GATE Worker from Python

Once the PythonWorkerLr resource has been created, it is ready to be used by a Python program:

from gatenlp.gateworker import GateWorker

To connect to an already running worker process, the parameter start=False must be specified. In addition, the auth token must be provided, along with the port and host if they differ from the defaults.

gs = GateWorker(start=False, auth_token="verysecretauthtoken")
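
When the connection settings should not be hard-coded, they can be collected in one place first. The following is a minimal sketch: the environment variable names are hypothetical and chosen only for this example, the defaults are the ones printed by `gatenlp-gate-worker --help`, and the `host` and `port` keyword names are assumed to match the parameters mentioned above.

```python
import os

# Defaults as shown in the gatenlp-gate-worker help text.
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 25333


def worker_connection_kwargs(env=None):
    """Build keyword arguments for connecting to an already running worker.

    The environment variable names are hypothetical, invented for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get("GATE_WORKER_AUTH_TOKEN")
    if not token:
        raise ValueError("GATE_WORKER_AUTH_TOKEN is not set")
    return {
        "start": False,  # connect only, do not launch a new Java process
        "host": env.get("GATE_WORKER_HOST", DEFAULT_HOST),
        "port": int(env.get("GATE_WORKER_PORT", DEFAULT_PORT)),
        "auth_token": token,
    }


kwargs = worker_connection_kwargs({"GATE_WORKER_AUTH_TOKEN": "verysecretauthtoken"})
print(kwargs)
# With gatenlp installed and a worker running: gs = GateWorker(**kwargs)
```

Keeping the token out of the source code also makes it easier to share a notebook without leaking it.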

The GateWorker instance can now be used to run arbitrary Java methods on the Java side. It provides a number of useful methods directly (see PythonDoc for gateworker).

In addition, a larger number of utility methods is available through gs.worker (see the PythonWorker source code). Here are a few examples:

# Create a new Java document from a string
# You should see how the document gets created in the GATE GUI
gdoc1 = gs.worker.createDocument("This is a 💩 document. It mentions Barack Obama and George Bush and New York.")
gdoc1
JavaObject id=o5
# you can call the API methods for the document directly from Python
print(gdoc1.getName())
print(gdoc1.getFeatures())
GATE Document_00016
{'gate.SourceURL': 'created from String'}
# so far the document only "lives" in the Java process. In order to copy it to Python, it has to be converted
# to a Python GateNLP document:
pdoc1 = gs.gdoc2pdoc(gdoc1)
pdoc1.text
'This is a 💩 document. It mentions Barack Obama and George Bush and New York.'
# Let's load ANNIE on the Java side and run it on that document:
# First we have to load the ANNIE plugin:
gs.worker.loadMavenPlugin("uk.ac.gate.plugins", "annie", "8.6")
# now load the prepared ANNIE pipeline from the plugin
pipeline = gs.worker.loadPipelineFromPlugin("uk.ac.gate.plugins","annie", "/resources/ANNIE_with_defaults.gapp")
pipeline.getName()
'ANNIE'
# run the pipeline on the document and convert it to a GateNLP Python document and display it
gs.worker.run4Document(pipeline, gdoc1)
pdoc1 = gs.gdoc2pdoc(gdoc1)

pdoc1

Manually starting the GATE Worker from Python

After installation of Python gatenlp, the command gatenlp-gate-worker is available.

You can run gatenlp-gate-worker --help to get help information:

usage: gatenlp-gate-worker [-h] [--port PORT] [--host HOST] [--auth AUTH]
                          [--noauth] [--gatehome GATEHOME]
                          [--platform PLATFORM] [--log_actions] [--keep]

Start Java GATE Worker

optional arguments:
  -h, --help           show this help message and exit
  --port PORT          Port (25333)
  --host HOST          Host to bind to (127.0.0.1)
  --auth AUTH          Auth token to use (generate random)
  --noauth             Do not use auth token
  --gatehome GATEHOME  Location of GATE (environment variable GATE_HOME)
  --platform PLATFORM  OS/Platform: windows or linux (autodetect)
  --log_actions        If worker actions should be logged
  --keep               Prevent shutting down the worker

For example, to start a GATE worker as with the PythonWorkerLr above, but this time reusing the exact same auth token and switching on logging of the actions:

gatenlp-gate-worker --auth 841e634a-d1f0-4768-b763-a7738ddee003 --log_actions
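
If the worker is to be launched from a script rather than typed by hand, the command line can be assembled programmatically. This is a sketch only: the helper name `build_worker_command` is made up for this example, and it uses nothing beyond the flags listed in the help output above.

```python
import shlex


def build_worker_command(port=None, host=None, auth=None, log_actions=False, keep=False):
    """Assemble an argument list for gatenlp-gate-worker from flags in its help text."""
    cmd = ["gatenlp-gate-worker"]
    if port is not None:
        cmd += ["--port", str(port)]
    if host is not None:
        cmd += ["--host", host]
    if auth is not None:
        cmd += ["--auth", auth]
    if log_actions:
        cmd.append("--log_actions")
    if keep:
        cmd.append("--keep")
    return cmd


cmd = build_worker_command(auth="841e634a-d1f0-4768-b763-a7738ddee003", log_actions=True)
print(shlex.join(cmd))
# To launch it for real (requires GATE, Java and gatenlp to be installed):
# import subprocess; proc = subprocess.Popen(cmd)
```

Passing the arguments as a list avoids shell quoting problems if the auth token contains special characters.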

Again the Python program can connect to the server as before:

gs = GateWorker(start=False, auth_token="841e634a-d1f0-4768-b763-a7738ddee003")
gs
<gatenlp.gateworker.GateWorker at 0x7fb6204e67f0>

The GATE worker started that way keeps running until it is interrupted from the keyboard using “Ctrl-C” or until the Python GateWorker instance sends the “close” request:

gs.close()

Automatically starting the GATE Worker from Python

When using the GateWorker class from Python, it is possible to start the worker process automatically in the background by setting the parameter start to True:

gs = GateWorker(start=True, auth_token="my-super-secret-auth-token")
Trying to start GATE Worker on port=25333 host=127.0.0.1 log=false keep=false
PythonWorkerRunner.java: starting server with 25333/127.0.0.1/my-super-secret-auth-token/false
gdoc1 = gs.worker.createDocument("This is a 💩 document. It mentions Barack Obama and George Bush and New York.")
gdoc1
JavaObject id=o0
# when done, the gate worker should get closed:
gs.close()

A better way to close the GATE Worker

# using the GateWorker this way will automatically close it when exiting the with block:
with GateWorker(start=True) as gw:
    print(gw.gate_version)
    
Trying to start GATE Worker on port=25333 host=127.0.0.1 log=false keep=false
Process id is 8778


9.0.1


PythonWorkerRunner.java: starting server with 25333/127.0.0.1/OQ__kPvCOvkanlu4S9TGcpQrssg/false
Java GatenlpWorker ENDING: 8778

Using the GateWorkerAnnotator

The GateWorkerAnnotator is an annotator that simplifies the common task of letting a GATE Java annotation pipeline annotate a set of Python gatenlp documents. It can be used like other annotators (see Processing).

To run the GateWorkerAnnotator, Java must be installed and the java command must be on the path. Currently only Java version 8 has been tested.

A simple way to install Java on Linux and choose from various Java versions is SDKMan.

Also, the GATE_HOME environment variable must be set, or the path to an installed Java GATE must be passed using the gatehome parameter.
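
These prerequisites can be checked before trying to start a worker. The helper below is a rough sketch, not part of gatenlp: the function name is hypothetical, and it assumes that the GATE location is a directory on the local machine.

```python
import os
import shutil


def missing_prerequisites(gatehome=None, env=None, which=shutil.which):
    """Return a list of problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if which("java") is None:
        problems.append("the java command is not on the PATH")
    # An explicit gatehome takes precedence over the environment variable.
    home = gatehome or env.get("GATE_HOME")
    if not home:
        problems.append("GATE_HOME is not set and no gatehome was given")
    elif not os.path.isdir(home):
        problems.append(f"GATE location {home!r} is not a directory")
    return problems


for problem in missing_prerequisites():
    print("Problem:", problem)
```

Running such a check first gives a clearer message than a failure to start the Java process.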

An installed Java GATE can be one of:

from gatenlp import Document
# Create a small corpus of documents to process
texts = [
    "A very simple document.",
    "Another document, this one mentions New York and Washington. It also mentions the person Barack Obama.",
    "One more document for this little test."
]
corpus = [Document(t) for t in texts]
from gatenlp.gateworker import GateWorker, GateWorkerAnnotator
from gatenlp.processing.executor import SerialCorpusExecutor
# use the path of your GATE pipeline instead of annie.xgapp
# To create the GateWorkerAnnotator a GateWorker must first be created

# To run the pipeline on a corpus, first initialize the pipeline using start(), then annotate all documents, 
# then finish the pipeline using finish().
# At this point the same annotator can be used in the same way again to run on another corpus.
# If the GateWorkerAnnotator is not used any more, use close() to stop the GateWorker (the GATE worker is also
# stopped automatically when the Python process ends)

# If an executor is used, only the final close() is necessary, as the executor takes care of everything else

with GateWorker() as gw:
    pipeline = GateWorkerAnnotator("annie.xgapp", gw)
    executor = SerialCorpusExecutor(pipeline, corpus=corpus)
    executor()

    
Trying to start GATE Worker on port=25333 host=127.0.0.1 log=false keep=false
Process id is 10995
PythonWorkerRunner.java: starting server with 25333/127.0.0.1/6C8L67T0iLuVFHEovPN07nNGz2c/false
Java GatenlpWorker ENDING: 10995
# Show the second document
corpus[1]
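
The comments in the example above describe the manual sequence that applies when no executor is used: start(), annotate every document, then finish(). A minimal sketch of that sequence follows. So that it runs without GATE, it uses a stand-in class, `RecordingAnnotator`, which is not part of gatenlp. The sketch also assumes that an annotator is applied by calling it with a document.

```python
class RecordingAnnotator:
    """Stand-in used so the sketch runs without GATE; not part of gatenlp."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def __call__(self, doc):
        self.events.append(f"annotate:{doc}")
        return doc

    def finish(self):
        self.events.append("finish")


def annotate_corpus(annotator, docs):
    """Call start(), annotate each document in order, then call finish()."""
    annotator.start()
    results = [annotator(doc) for doc in docs]
    annotator.finish()
    return results


demo = RecordingAnnotator()
annotated = annotate_corpus(demo, ["doc-1", "doc-2"])
print(demo.events)
```

With a real GateWorkerAnnotator in place of the stand-in, the worker would still need to be closed afterwards, either with close() or by leaving the with block.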