{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GATE Worker\n", "\n", "The GATE Worker is a module that lets you run arbitrary code in a Java GATE process from Python and exchange documents between Python and Java.\n", "\n", "One possible use of this is to run an existing GATE pipeline on a Python GateNLP document.\n", "\n", "The Python module does this by communicating with a Java process over a socket connection. \n", "Java calls made on the Python side are sent to Java and executed there, and the result is sent back to Python. \n", "\n", "For this to work, GATE and Java have to be installed on the machine that runs the GATE Worker.\n", "\n", "The easiest way to run this is by first manually starting the GATE Worker in the Java GATE GUI and then \n", "connecting to it from the Python side. \n", "\n", "## Manually starting the GATE Worker from GATE\n", "\n", "1. Start GATE\n", "2. Load the Python plugin using the CREOLE Plugin Manager\n", "3. Create a new Language Resource (NOTE: not a Processing Resource!): \"PythonWorkerLr\"\n", "\n", "When creating the PythonWorkerLr, the following initialization parameters can be specified:\n", "* `authToken`: this is used to prevent other processes from connecting to the worker. You can either specify \n", " a string here or, with `useAuthToken` set to `true`, let GATE choose a random one and display it in the \n", " message pane after the resource has been created. \n", " * for testing this, enter \"verysecretauthtoken\" \n", "* `host`: the host name or address to bind to. The default 127.0.0.1 makes the worker visible only on the same\n", " machine. To make it visible to other machines, use the host name or IP address on the network\n", " or use 0.0.0.0 \n", " * for testing, keep the default of 127.0.0.1\n", "* `logActions`: if this is set to true, the actions requested by the Python process are logged to the message pane. \n", " * for testing, change to \"true\"\n", "* `port`: the port number to use. 
Each worker requires its own port number, so if more than one worker is running\n", " on a machine, they need to use different, unused port numbers. \n", " * for testing, keep the default\n", "* `useAuthToken`: if this is set to false, no auth token is generated or used, and any process that connects \n", " to that port number can establish a connection. \n", " * for testing, keep the default\n", "\n", "A GATE Worker started via the PythonWorkerLr keeps running until the resource is deleted or GATE exits.\n", "\n", "\n", "## Using the GATE Worker from Python\n", "\n", "Once the PythonWorkerLr resource has been created, it is ready to be used by a Python program:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from gatenlp.gateworker import GateWorker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To connect to an already running worker process, the parameter `start=False` must be specified. \n", "In addition, the auth token must be provided, as well as the port and host if they differ from the defaults." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "gs = GateWorker(start=False, auth_token=\"verysecretauthtoken\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The GateWorker instance can now be used to run arbitrary Java methods on the Java side. 
\n", "The GateWorker instance provides a number of useful methods directly (see [PythonDoc for gateworker](https://gatenlp.github.io/python-gatenlp/pythondoc/gatenlp/gateworker.html)):\n", "* `gs.load_gdoc(filepath, mimetype=None)`: load a GATE document on the Java side and return it to Python\n", "* `gs.save_gdoc(gatedocument, filepath, mimetype=None, inline_anntypes=None, inline_annset=\"\", inline_features=True)`: save a GATE document on the Java side\n", "* `gs.gdoc2pdoc(gatedocument)`: convert the Java GATE document to a Python GateNLP document and return it\n", "* `gs.pdoc2gdoc(doc)`: convert the Python GateNLP document to a Java GATE document and return it\n", "* `gs.del_resource(gatedocument)`: remove a Java GATE document on the Java side (this is necessary to release memory).\n", " This can also be used to remove other kinds of GATE resources like ProcessingResource, Corpus, LanguageResource\n", " etc.\n", "* `gs.load_pdoc(filepath, mimetype=None)`: load a document on the Java side using the file format specified via the mime type and return it as a Python GateNLP document\n", "* `gs.log_actions(trueorfalse)`: switch logging of actions on the worker side on or off\n", "\n", "In addition, a larger number of utility methods are available through `gs.worker` (see the \n", "[PythonWorker source code](https://github.com/GateNLP/gateplugin-Python/blob/master/src/main/java/gate/plugin/python/PythonWorker.java)). Here are a few examples:\n", "\n", "* `loadMavenPlugin(group, artifact, version)`: make the plugin identified by the given Maven coordinates available\n", "* `loadPipelineFromFile(filepath)`: load the pipeline/controller from the given file path and return it\n", "* `loadDocumentFromFile(filepath)`: load a GATE document from the file and return it\n", "* `loadDocumentFromFile(filepath, mimetype)`: load a GATE document from the file using the format corresponding to the given mime type and return it\n", "* `saveDocumentToFile(gatedocument, filepath, 
mimetype=None, inline_anntypes=None, inline_annset=\"\", inline_features=True)`: save the document to the file, using the format corresponding to the mime type\n", "* `createDocument(content)`: create a new document from the given String content and return it\n", "* `run4Document(pipeline, document)`: run the given pipeline on the given document\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "JavaObject id=o6" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new Java document from a string\n", "# You should see how the document gets created in the GATE GUI\n", "gdoc1 = gs.worker.createDocument(\"This is a 💩 document. It mentions Barack Obama and George Bush and New York.\")\n", "gdoc1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GATE Document_0001B\n", "{'gate.SourceURL': 'created from String'}\n" ] } ], "source": [ "# you can call the API methods for the document directly from Python\n", "print(gdoc1.getName())\n", "print(gdoc1.getFeatures())" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a 💩 document. It mentions Barack Obama and George Bush and New York.'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# so far the document only \"lives\" in the Java process. 
In order to copy it to Python, it has to be converted\n", "# to a Python GateNLP document:\n", "pdoc1 = gs.gdoc2pdoc(gdoc1)\n", "pdoc1.text" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Let's load ANNIE on the Java side and run it on that document:\n", "# First we have to load the ANNIE plugin:\n", "gs.worker.loadMavenPlugin(\"uk.ac.gate.plugins\", \"annie\", \"8.6\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ANNIE'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# now load the prepared ANNIE pipeline from the plugin\n", "pipeline = gs.worker.loadPipelineFromPlugin(\"uk.ac.gate.plugins\",\"annie\", \"/resources/ANNIE_with_defaults.gapp\")\n", "pipeline.getName()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# run the pipeline on the document and convert it to a GateNLP Python document and display it\n", "gs.worker.run4Document(pipeline, gdoc1)\n", "pdoc1 = gs.gdoc2pdoc(gdoc1)\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "