Comparison with the GATE API
Generally, the priority in the
gatenlp package is to make the API
pythonic, make use of Python features (e.g. keyword arguments) and use
names that are easy to remember and easy to understand when completion in
an IDE is used.
In the Java GATE API, many functions are overloaded (e.g. AnnotationSet.get) and return different things depending on the type of the parameter(s). In the Python API, the method name should make it clear what gets returned, unless the return value is obvious or easy to assume.
Also, most methods related to a kind of object are directly available through
the instance or class, e.g. in the Java GATE API, to create or load a document one has
to call a method of the Factory, to save a document one has to find the
correct Exporter class and call its method, and so on. In Python
all these actions are performed by methods on the class or instance.
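The design idea can be sketched with a toy class. This is only an illustration under stated assumptions: `ToyDocument`, its `load` and `save` methods, and the JSON layout are hypothetical stand-ins, not the gatenlp implementation.

```python
import json
import os
import tempfile


class ToyDocument:
    """Hypothetical stand-in used only to illustrate the design."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def load(cls, path):
        # Loading is a method on the class: no separate factory object.
        with open(path, encoding="utf-8") as infp:
            return cls(json.load(infp)["text"])

    def save(self, path):
        # Saving is a method on the instance: no separate exporter class.
        with open(path, "w", encoding="utf-8") as outfp:
            json.dump({"text": self.text}, outfp)


def roundtrip(text):
    """Save a document to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "doc.json")
        ToyDocument(text).save(path)
        return ToyDocument.load(path).text
```

Everything a caller needs is reachable from the class or the instance, which is also what makes IDE completion useful here.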
There is no support or equivalent implementation for the
following GATE features and functionality in gatenlp:
- editing, i.e. modifying the text of a document: once the document text is set, it is immutable. To obtain a document with modified text, a new document must be created.
- features work somewhat differently as it is not possible to replace the features object stored with a document or annotation, only modify it.
- toXml: the GATE XML serialization format is not supported for saving. Instead the formats also implemented in the Java GATE Format_Bdoc plugin are supported: Bdoc JSON and Bdoc MsgPack.
- DocumentContent: not necessary
- Additional document classes
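The first two points can be illustrated with a small sketch. It assumes nothing about the real gatenlp classes: `ToyDocument` and `with_replaced_text` are hypothetical names, and read-only properties are just one way such behaviour could be modelled.

```python
class ToyDocument:
    """Hypothetical stand-in, not the gatenlp Document class."""

    def __init__(self, text, features=None):
        self._text = text
        self._features = dict(features or {})

    @property
    def text(self):
        # Read-only: there is no setter, so assignment raises AttributeError.
        return self._text

    @property
    def features(self):
        # The dict itself can be modified, but the attribute cannot be rebound.
        return self._features


def with_replaced_text(doc, old, new):
    """Return a NEW document whose text has `old` replaced by `new`."""
    return ToyDocument(doc.text.replace(old, new), features=doc.features)
```

The sketch ignores annotations entirely; in a real document, the offsets of existing annotations would no longer match the new text.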
There is no support or equivalent for the following GateNLP functions in GATE:
- to_type: to change the offset type between Java and Python
- set_changelog: to record changes to a changelog
- doc[span]: where span is either an offset, an offset range, or an annotation
- a number of other methods of the AnnotationSet and Annotation classes
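To make the `doc[span]` item concrete, here is a minimal sketch of how indexing by an offset, an offset range, or an annotation could work. `ToyDocument` and `ToyAnnotation` are hypothetical stand-ins, and representing an offset range as a Python slice is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAnnotation:
    """Hypothetical annotation carrying only its offsets."""
    start: int
    end: int


class ToyDocument:
    """Hypothetical stand-in, not the gatenlp Document class."""

    def __init__(self, text):
        self.text = text

    def __getitem__(self, span):
        # A single offset or a slice is applied to the text directly.
        if isinstance(span, (int, slice)):
            return self.text[span]
        # Anything else is assumed to carry start/end offsets.
        return self.text[span.start:span.end]
```

With this toy class, `doc[0]`, `doc[0:5]` and `doc[ann]` all return the covered text.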
Here is a comparison of the most important API methods related to documents, showing first the GATE method, then the corresponding Python method (if any) and/or remarks:
get/setContent(): -, not needed
get/setSourceUrl(): not needed internally, the end user can simply store this or any other information about the document source in a document feature
add/removeDocumentListener(listener): not needed as we do not have a GUI
edit(...): not supported as gatenlp documents are immutable
getEncoding(): -, not needed: internally, all texts are Unicode. If a file with a different encoding needs to be loaded, this can be done simply by passing an open file object that has been opened with the proper encoding.
getMimeType(): not needed: the file format is specified when loading and not relevant once the Document has been created
getNextAnnotationId(): not needed, ids are allocated per annotation set
getNextNodeId(): not needed, the annotation set implementation does not use nodes
getMarkupAware(): not supported
getPreserveOriginalContent(): there is no support for “markup-awareness”
save(): saving to (or loading from) GATE XML format is not supported; saving and loading in the supported formats work via methods on the Document class or instance
get/setSourceUrlStart/EndOffset(): not supported
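The remark on getEncoding() relies on standard Python file handling: the decoding happens when the file is opened, so the text that reaches the document is already Unicode. The sketch below uses only the standard library; `read_text` and `demo` are hypothetical helper names, and how the open file is then handed to the loading method is deliberately left out.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a file as Unicode text, decoding it with the given encoding."""
    with open(path, encoding=encoding) as infp:
        return infp.read()


def demo():
    """Write Latin-1 bytes to a temporary file and read them back as str."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "latin1.txt")
        with open(path, "wb") as outfp:
            outfp.write("café".encode("latin-1"))
        return read_text(path, "latin-1")
```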
- So far, only Annotations and Documents can have features
- Features behave much like a Python dict
- Features are stored in a Features object
- The Features object for a Document or Annotation cannot be replaced
- Features must have string keys and should have values that are JSON-serializable (otherwise, the document cannot be saved in Bdoc JSON format)
Setting a feature:
obj.features[name] = value
Getting a feature:
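Since features behave much like a Python dict, both operations can be illustrated with a plain dict standing in for the features object. This is a minimal sketch: the feature names are made up, and `is_json_serializable` is a hypothetical helper, not part of gatenlp.

```python
import json

# A plain dict standing in for obj.features (an assumption of this sketch).
features = {}

# Setting a feature.
features["lang"] = "en"

# Getting a feature, with and without a default for missing keys.
lang = features["lang"]
missing = features.get("no_such_feature", None)


def is_json_serializable(value):
    """Return True if `value` can be written by the standard json module."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False
```

A check like this can be run before saving, for example to find out that a `set` value would need to be converted to a list first.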
For annotations, the main differences and properties are:
- no listeners
- offsets are int not Long
- no nodes
- Important: annotation ids must be unique per set, not per document
- as for documents, the feature dictionary cannot be replaced as a whole, only modified
- the offsets and type of an annotation are immutable
- ordering is based on increasing start offset, then increasing annotation id.
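The immutability and ordering rules above can be sketched as follows. `ToyAnnotation` and `in_order` are hypothetical names; a frozen dataclass is merely one convenient way to model immutable offsets, type and id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAnnotation:
    """Hypothetical annotation: offsets, type and id cannot be reassigned."""
    start: int
    end: int
    type: str
    id: int


def in_order(annotations):
    """Sort by increasing start offset, then by increasing annotation id."""
    return sorted(annotations, key=lambda ann: (ann.start, ann.id))
```

Note that the end offset plays no role in this ordering: two annotations starting at the same offset are ordered by id alone.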
Here is a comparison of the most important API methods related to annotations, showing first the GATE method, then the corresponding Python method (if any) and/or remarks:
iscoextensive(annorsetorrange): many methods for annotations or annotation sets allow offset pairs, annotations or annotation sets to be used as an offset range
is[Partially]Compatible(ann): not implemented
id: this is an immutable property in Python
Python also has
getStartNode(): not needed in Python, since there are no nodes.
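The idea of accepting an offset pair, an annotation, or an annotation set as an offset range can be sketched like this. All names here (`ToyAnnotation`, `as_range`, `is_coextensive`) are hypothetical, and treating a set of annotations as the range from its smallest start to its largest end is an assumption of this sketch, not a statement about gatenlp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAnnotation:
    """Hypothetical annotation carrying only its offsets."""
    start: int
    end: int


def as_range(*args):
    """Normalize (start, end), an annotation, or annotations to offsets."""
    if len(args) == 2:
        # Two positional arguments are taken as a start/end offset pair.
        return args[0], args[1]
    (obj,) = args
    if hasattr(obj, "start") and hasattr(obj, "end"):
        return obj.start, obj.end
    # Otherwise assume an iterable of annotations and use the span
    # that covers all of them (an assumption of this sketch).
    anns = list(obj)
    return min(a.start for a in anns), max(a.end for a in anns)


def is_coextensive(ann, *other):
    """True if `ann` starts and ends exactly where the given range does."""
    return (ann.start, ann.end) == as_range(*other)
```

Funnelling every range-like argument through one normalizing helper keeps the individual comparison methods short and lets them share the same calling conventions.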