European Language Grid (ELG) Annotator

The ElgTextAnnotator is an annotator that uses one of the ELG services to annotate documents: for each document that gets processed, the data is sent to an HTTP endpoint and processed there, and the information sent back is used to annotate the document.

import os
from gatenlp import Document
# It is called "ElgTextAnnotator" because ELG also provides services to analyze audio and other kinds of data.
from gatenlp.processing.client.elg import ElgTextAnnotator
from elg import Authentication
import elg

print("ELG version:", elg.__version__)
ELG version: 0.4.22

Let's try annotating a document with the UDPipe English: Morphosyntactic Analysis service (https://live.european-language-grid.eu/catalogue/tool-service/423).

The service requires authentication, and there are several ways to get a key and provide the information to the ElgTextAnnotator class:

To create and store authentication for offline access in a file tokens.json you can run the following code interactively (it will show the offline access URL from above and prompt for the success code):

auth = Authentication.init(scope="offline_access")
auth.to_json("tokens.json")
print(f"The tokens will expire: {auth.refresh_expires_time}")

The following code assumes that a valid tokens.json file exists.

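The service to use has to be specified when the annotator is created; the example below passes the URL of the service's execution endpoint via the url parameter.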

Each service returns its own set of annotations and features. The ElgTextAnnotator constructor accepts a mapping via the anntypes_map parameter, which replaces the original annotation types with new names when the document is processed.

if not os.path.exists("tokens.json"):
    auth = Authentication.init(scope="offline_access")
    auth.to_json("tokens.json")
    print(f"The tokens will expire: {auth.refresh_expires_time}")
else:
    auth = Authentication.from_json("tokens.json")
    print("Expires:", auth.expires_time)
Expires: time.struct_time(tm_year=2022, tm_mon=6, tm_mday=28, tm_hour=1, tm_min=10, tm_sec=29, tm_wday=1, tm_yday=179, tm_isdst=0)
doc = Document("Barack Obama visited Microsoft in New York last May.")
annt = ElgTextAnnotator(
    url="https://live.european-language-grid.eu/execution/process/udpipeen",
    auth_file="tokens.json",
    # Rename the service's annotation types to shorter names
    anntypes_map={"udpipe/paragraphs": "Paragraph", "udpipe/sentences": "Sentence", "udpipe/tokens": "UDPToken"},
)
doc = annt(doc)
doc
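
To make the effect of anntypes_map more concrete, here is a minimal sketch of the renaming step in plain Python. It does not call the ELG service or gatenlp: `mock_response` is a hand-written stand-in (not real service output), `rename_types` is a hypothetical helper, and the sketch assumes that types missing from the map keep their original name.

```python
# Hand-written stand-in for a service response: annotation type -> list of spans.
# Offsets refer to "Barack Obama visited Microsoft in New York last May."
mock_response = {
    "udpipe/paragraphs": [{"start": 0, "end": 52}],
    "udpipe/sentences": [{"start": 0, "end": 52}],
    "udpipe/tokens": [{"start": 0, "end": 6}, {"start": 7, "end": 12}],
}


def rename_types(response, anntypes_map=None):
    """Return (type, start, end) tuples, renaming types via anntypes_map."""
    mapping = anntypes_map or {}
    result = []
    for anntype, spans in response.items():
        # Assumption: a type that is not in the map keeps its original name.
        new_type = mapping.get(anntype, anntype)
        for span in spans:
            result.append((new_type, span["start"], span["end"]))
    return result


renamed = rename_types(
    mock_response,
    {"udpipe/sentences": "Sentence", "udpipe/tokens": "UDPToken"},
)
for item in renamed:
    print(item)
```

Here "udpipe/paragraphs" is deliberately left out of the map to show the fallback assumed above.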

UDPToken annotations are not very useful as they are, because each one carries a single feature that describes the one or more words the token corresponds to (UDPipe services support multi-word tokens).

The following utility function converts the UDPToken annotations: each single-word token becomes a Token annotation, and each multi-word token becomes an MWT (multi-word token) annotation plus Token annotations for its words. The function also adapts the dependency parser ids so that they refer to the ids of the respective Token annotations (the head id is always mapped to the containing sentence annotation) and splits the “feats” feature into separate features on the annotation.

from gatenlp.processing.client.elg import udptoken2tokens
udptoken2tokens(doc.annset().with_type("UDPToken"), doc.annset().with_type("Sentence"), doc.annset())
doc
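
The splitting of the “feats” feature mentioned above can be illustrated with a small standalone sketch. It assumes the feature value follows the CoNLL-U convention of `Name=Value` pairs joined by `|`, with `_` meaning "no features"; `split_feats` is a hypothetical helper written for illustration, not the code used inside udptoken2tokens.

```python
def split_feats(feats):
    """Split a CoNLL-U style FEATS string into a dict of separate features."""
    if not feats or feats == "_":
        # An empty value or "_" means the token has no morphological features.
        return {}
    result = {}
    for pair in feats.split("|"):
        # partition splits on the first "=" only, leaving the value intact.
        name, _, value = pair.partition("=")
        result[name] = value
    return result


feats = split_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(feats)
```

Each key of the resulting dict would then become a feature of its own on the Token annotation.
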
import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8.dev3