Processing

GateNLP does not restrict how documents are stored in collections, how they are iterated over, or how functions are applied to them to modify them.

However, GateNLP provides a number of abstractions to help with this in an organized fashion:

Annotators

Any callable that takes a document and returns that document can act as an annotator. An annotator usually modifies the annotations or features of the document it receives. This happens in place, so the annotator would not strictly have to return the document. However, by convention annotators always return the modified document, to signal this to downstream annotators or document destinations.

If an annotator returns a list instead, the result of processing is the documents in that list, which may be empty or contain more than one document. This convention allows a processing pipeline to filter documents or to generate several documents from a single one, as in the sketch below.
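For example, a filtering annotator could drop a document by returning an empty list. This is only a sketch; the function name and the length threshold are made up for illustration:

def keep_long_docs(doc):
    # drop short documents from the pipeline by returning an empty list
    if len(doc.text) < 20:
        return []
    # otherwise pass the document on unchanged, wrapped in a list
    return [doc]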

Let's create a simple annotator as a function and apply it to a corpus of documents, which in its simplest form is just a list of documents:

from gatenlp import Document
# the executor import is not needed for this simple example; executors are
# discussed in the section on annotator classes below
from gatenlp.processing.executor import SerialCorpusExecutor

# an annotator implemented as a plain function: it adds one annotation
# to the default annotation set and returns the document
def annotator1(doc):
    doc.annset().add(2, 3, "Type1")
    return doc

texts = [
    "Text for the first document.",
    "Text for the second document. This one has two sentences.",
    "And another one.",
]

corpus = [Document(txt) for txt in texts]

# everything happens in memory here, so we can ignore the returned document
for doc in corpus:
    annotator1(doc)
    
for doc in corpus:
    print(doc)


    
Document(Text for the first document.,features=Features({}),anns=['':1])
Document(Text for the second document. This one has two sentences.,features=Features({}),anns=['':1])
Document(And another one.,features=Features({}),anns=['':1])

Annotator classes

When scaling up, annotators and processing pipelines become more complex, a corpus may no longer fit into memory, and so on. For these situations, GateNLP provides abstractions that keep processing manageable.

Annotator classes must always implement the __call__ special method, so that an instance of the class can be used just like a function. In addition, annotator classes can implement the following methods:

start(): called once before processing of a corpus begins, e.g. to initialize counters or other resources.
finish(): called once after processing of the corpus has finished; its return value becomes the result of the run.
reduce(results): for multiprocessing, combines the per-process results returned by finish() into a single result.

The result of processing a corpus returned by the executor is whatever the finish method returns for a single-process run, or what the reduce method returns for multiprocessing. (NOTE: multiprocessing executors are not implemented yet!)
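The following sketch shows what such an annotator class could look like, and what a single-process executor does conceptually. The class name, the counter, and the manual start/finish calls are illustrative only, not part of the gatenlp API:

# a minimal sketch of an annotator class that counts the annotations it adds
class CountingAnnotator:
    def start(self):
        # called once before the corpus is processed
        self.n_added = 0
    def __call__(self, doc, **kwargs):
        # annotate the document and count what was added
        doc.annset().add(0, 1, "Type2")
        self.n_added += 1
        return doc
    def finish(self):
        # called once after the corpus has been processed;
        # the return value becomes the result of the run
        return self.n_added
    def reduce(self, results):
        # for multiprocessing: combine the per-process finish() results
        return sum(results)

# what a single-process executor does, spelled out by hand
ann = CountingAnnotator()
ann.start()
for doc in corpus:
    ann(doc)
print("Annotations added:", ann.finish())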

Notebook last updated

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1