Module gatenlp.processing.annotator
Module with the base class and supporting functions for all annotators. Any callable that can be called by passing a document can be used as an annotator, but the base class "Annotator" defined in here is designed to allow for a more flexible approach to do things.
Expand source code
"""
Module with the base class and supporting functions for all annotators.
Any callable that can be called by passing a document can be used as an annotator,
but the base class "Annotator" defined in here is designed to allow for a more
flexible approach to do things.
"""
from typing import Union, List, Iterable, Any, Callable
from abc import ABC, abstractmethod
from gatenlp import Document
__pdoc__ = {"Annotator.__call__": True}
class Annotator(ABC):
@abstractmethod
def __call__(self, doc: Document, **kwargs) -> Union[Document, List[Document], None]:
"""
This method MUST get implemented in a concrete subclass to do the actual processing
and annotation. It must accept a document and arbitrary keyword arguments and it must
return either a document which may be the same or a different object than the document passed,
or None or an empty list or a list of one or more documents. The method also may raise an
exception.
The semantics of returning None or an empty list are not strictly defined: this may be used to
handle processing errors where documents which cannot be processed are quietly ignored or
filtering.
The method must accept arbitrary keyword arguments which will be passed on to sub-annotators and
may be used to configure or parametrize processing.
NOTE: some annotators may set or use special document features in order to handle
document context or the document id when processing a corpus or streams where a document id
is important.
Args:
doc: the document to process
kwargs: any arguments to pass to the annotator or sub-annotators called by this annotator
Returns:
a document, None, or a possibly empty list of documents
"""
raise Exception("This method must be implemented!")
def pipe(self, documents: Iterable[Document], **kwargs):
"""
If this method gets overridden, it should take an iterable of documents and yield processed documents.
This allows for batching, caching, and other optimizations over streams of documents.
If with_context is True, then the documents parameter should be an iterable over tuples (document, context).
Args:
documents: an iterable over documents
**kwargs: arbitrary other keyword arguments must be accepted
Yields:
processed documents
"""
for document in documents:
if document is not None:
docordocs = self.__call__(document, **kwargs)
if docordocs is not None:
if isinstance(docordocs, list):
for doc in docordocs:
if doc is not None:
yield doc
else:
yield docordocs
def start(self) -> None:
"""
A method that gets called when processing starts, e.g. before the first document in
corpus gets processed. This is invoked by an executor to initialize processing a batch
of documents.
This is different from initializing the Annotator: initializing may load large data which
can be reused even if the same annotator instance is run several times over documents.
"""
pass
def finish(self) -> Any:
"""
A method that gets called when processing ends, e.g. when all documents of a corpus
have been processed. It should return some result for processing the whole batch of documents
it has seen - that result may be None.
Returns:
The overall result of processing all documents or None
"""
pass
def reduce(self, results: Iterable[Any]) -> Any:
"""
A method that should know how to combine the results passed on in some collection into a
single result. This method should behave like a static method, i.e. not make use of any
data that is specific to the concrete instance.
This can be used to combine corpus results obtained from several processes running on
different parts of a corpus.
This gets invoked by the executor if more than one instance of the annotator was run
over separate sets of documents. If only a single instance was used, the result returned
from finish is used directly.
Args:
results: an iterable of individual results over some documents each or None if no results are available.
If no results have been passed back from the finish method of any of the processes, the executor should
not call reduce, but if it does, reduce should accept None or an iterator of all None and return None.
Returns:
The combined overall result or None if there are no individual results
"""
return results
class AnnotatorFunction(Annotator):
def __init__(self, funct: Callable):
self.funct = funct
def __call__(self, doc: Document, **kwargs) -> Union[Document, List[Document], None]:
return self.funct(doc, **kwargs)
Classes
class Annotator
-
Helper class that provides a standard way to create an ABC using inheritance.
Expand source code
class Annotator(ABC): @abstractmethod def __call__(self, doc: Document, **kwargs) -> Union[Document, List[Document], None]: """ This method MUST get implemented in a concrete subclass to do the actual processing and annotation. It must accept a document and arbitrary keyword arguments and it must return either a document which may be the same or a different object than the document passed, or None or an empty list or a list of one or more documents. The method also may raise an exception. The semantics of returning None or an empty list are not strictly defined: this may be used to handle processing errors where documents which cannot be processed are quietly ignored or filtering. The method must accept arbitrary keyword arguments which will be passed on to sub-annotators and may be used to configure or parametrize processing. NOTE: some annotators may set or use special document features in order to handle document context or the document id when processing a corpus or streams where a document id is important. Args: doc: the document to process kwargs: any arguments to pass to the annotator or sub-annotators called by this annotator Returns: a document, None, or a possibly empty list of documents """ raise Exception("This method must be implemented!") def pipe(self, documents: Iterable[Document], **kwargs): """ If this method gets overridden, it should take an iterable of documents and yield processed documents. This allows for batching, caching, and other optimizations over streams of documents. If with_context is True, then the documents parameter should be an iterable over tuples (document, context). Args: documents: an iterable over documents **kwargs: arbitrary other keyword arguments must be accepted Yields: processed documents """ for document in documents: if document is not None: docordocs = self.__call__(document, **kwargs) if docordocs is not None: if isinstance(docordocs, list): for doc in docordocs: if doc is not None: yield doc else: yield docordocs def start(self) -> None: """ A method that gets called when processing starts, e.g. before the first document in corpus gets processed. This is invoked by an executor to initialize processing a batch of documents. This is different from initializing the Annotator: initializing may load large data which can be reused even if the same annotator instance is run several times over documents. """ pass def finish(self) -> Any: """ A method that gets called when processing ends, e.g. when all documents of a corpus have been processed. It should return some result for processing the whole batch of documents it has seen - that result may be None. Returns: The overall result of processing all documents or None """ pass def reduce(self, results: Iterable[Any]) -> Any: """ A method that should know how to combine the results passed on in some collection into a single result. This method should behave like a static method, i.e. not make use of any data that is specific to the concrete instance. This can be used to combine corpus results obtained from several processes running on different parts of a corpus. This gets invoked by the executor if more than one instance of the annotator was run over separate sets of documents. If only a single instance was used, the result returned from finish is used directly. Args: results: an iterable of individual results over some documents each or None if no results are available. If no results have been passed back from the finish method of any of the processes, the executor should not call reduce, but if it does, reduce should accept None or an iterator of all None and return None. Returns: The combined overall result or None if there are no individual results """ return results
Ancestors
- abc.ABC
Subclasses
- GateWorkerAnnotator
- AnnSpacy
- AnnStanza
- PampacAnnotator
- AnnotatorFunction
- ElgTextAnnotator
- GateCloudAnnotator
- GoogleNlpAnnotator
- IbmNluAnnotator
- PerspectiveAnnotator
- RewireAnnotator
- TagMeAnnotator
- TextRazorTextAnnotator
- EmojiAnnotator
- GazetteerBase
- FeatureGazetteer
- Normalizer
- Pipeline
- Tokenizer
Methods
def __call__(self, doc: Document, **kwargs) ‑> Union[Document, List[Document], None]
-
This method MUST get implemented in a concrete subclass to do the actual processing and annotation. It must accept a document and arbitrary keyword arguments and it must return either a document which may be the same or a different object than the document passed, or None or an empty list or a list of one or more documents. The method also may raise an exception.
The semantics of returning None or an empty list are not strictly defined: this may be used to handle processing errors where documents which cannot be processed are quietly ignored or filtering.
The method must accept arbitrary keyword arguments which will be passed on to sub-annotators and may be used to configure or parametrize processing.
NOTE: some annotators may set or use special document features in order to handle document context or the document id when processing a corpus or streams where a document id is important.
Args
doc
- the document to process
kwargs
- any arguments to pass to the annotator or sub-annotators called by this annotator
Returns
a document, None, or a possibly empty list of documents
Expand source code
@abstractmethod def __call__(self, doc: Document, **kwargs) -> Union[Document, List[Document], None]: """ This method MUST get implemented in a concrete subclass to do the actual processing and annotation. It must accept a document and arbitrary keyword arguments and it must return either a document which may be the same or a different object than the document passed, or None or an empty list or a list of one or more documents. The method also may raise an exception. The semantics of returning None or an empty list are not strictly defined: this may be used to handle processing errors where documents which cannot be processed are quietly ignored or filtering. The method must accept arbitrary keyword arguments which will be passed on to sub-annotators and may be used to configure or parametrize processing. NOTE: some annotators may set or use special document features in order to handle document context or the document id when processing a corpus or streams where a document id is important. Args: doc: the document to process kwargs: any arguments to pass to the annotator or sub-annotators called by this annotator Returns: a document, None, or a possibly empty list of documents """ raise Exception("This method must be implemented!")
def finish(self) ‑> Any
-
A method that gets called when processing ends, e.g. when all documents of a corpus have been processed. It should return some result for processing the whole batch of documents it has seen - that result may be None.
Returns
The overall result of processing all documents or None
Expand source code
def finish(self) -> Any: """ A method that gets called when processing ends, e.g. when all documents of a corpus have been processed. It should return some result for processing the whole batch of documents it has seen - that result may be None. Returns: The overall result of processing all documents or None """ pass
def pipe(self, documents: Iterable[Document], **kwargs)
-
If this method gets overridden, it should take an iterable of documents and yield processed documents. This allows for batching, caching, and other optimizations over streams of documents. If with_context is True, then the documents parameter should be an iterable over tuples (document, context).
Args
documents
- an iterable over documents
**kwargs
- arbitrary other keyword arguments must be accepted
Yields
processed documents
Expand source code
def pipe(self, documents: Iterable[Document], **kwargs): """ If this method gets overridden, it should take an iterable of documents and yield processed documents. This allows for batching, caching, and other optimizations over streams of documents. If with_context is True, then the documents parameter should be an iterable over tuples (document, context). Args: documents: an iterable over documents **kwargs: arbitrary other keyword arguments must be accepted Yields: processed documents """ for document in documents: if document is not None: docordocs = self.__call__(document, **kwargs) if docordocs is not None: if isinstance(docordocs, list): for doc in docordocs: if doc is not None: yield doc else: yield docordocs
def reduce(self, results: Iterable[Any]) ‑> Any
-
A method that should know how to combine the results passed on in some collection into a single result. This method should behave like a static method, i.e. not make use of any data that is specific to the concrete instance.
This can be used to combine corpus results obtained from several processes running on different parts of a corpus.
This gets invoked by the executor if more than one instance of the annotator was run over separate sets of documents. If only a single instance was used, the result returned from finish is used directly.
Args
results
- an iterable of individual results over some documents each or None if no results are available. If no results have been passed back from the finish method of any of the processes, the executor should not call reduce, but if it does, reduce should accept None or an iterator of all None and return None.
Returns
The combined overall result or None if there are no individual results
Expand source code
def reduce(self, results: Iterable[Any]) -> Any: """ A method that should know how to combine the results passed on in some collection into a single result. This method should behave like a static method, i.e. not make use of any data that is specific to the concrete instance. This can be used to combine corpus results obtained from several processes running on different parts of a corpus. This gets invoked by the executor if more than one instance of the annotator was run over separate sets of documents. If only a single instance was used, the result returned from finish is used directly. Args: results: an iterable of individual results over some documents each or None if no results are available. If no results have been passed back from the finish method of any of the processes, the executor should not call reduce, but if it does, reduce should accept None or an iterator of all None and return None. Returns: The combined overall result or None if there are no individual results """ return results
def start(self) ‑> None
-
A method that gets called when processing starts, e.g. before the first document in corpus gets processed. This is invoked by an executor to initialize processing a batch of documents.
This is different from initializing the Annotator: initializing may load large data which can be reused even if the same annotator instance is run several times over documents.
Expand source code
def start(self) -> None: """ A method that gets called when processing starts, e.g. before the first document in corpus gets processed. This is invoked by an executor to initialize processing a batch of documents. This is different from initializing the Annotator: initializing may load large data which can be reused even if the same annotator instance is run several times over documents. """ pass
class AnnotatorFunction (funct: Callable)
-
Helper class that provides a standard way to create an ABC using inheritance.
Expand source code
class AnnotatorFunction(Annotator): def __init__(self, funct: Callable): self.funct = funct def __call__(self, doc: Document, **kwargs) -> Union[Document, List[Document], None]: return self.funct(doc, **kwargs)
Ancestors
- Annotator
- abc.ABC
Inherited members