Comparison with the GATE API
Generally, the priority in the gatenlp package is to make the API pythonic, to make use of Python features (e.g. keyword arguments), and to use names that are easy to remember and easy to understand when completion in an IDE is used.
In the Java GATE API, many functions are overloaded (e.g. AnnotationSet.get) and return different things depending on the type of their parameter(s). In the Python API, the method name should make it clear what gets returned, unless this is already obvious or easy to assume.
Also, most methods related to a kind of object are directly available through the instance or class. For example, in the Java GATE API, to create or load a document one has to call a method of the Factory, and to save a document one has to find the correct Exporter class and call its method, and so on. In Python gatenlp, all these actions are performed by methods on the class or instance.
Documents:
There is no support or equivalent implementation for the following GATE features and functionality in gatenlp:
- listeners
- editing, i.e. modifying the text of a document: once the document text is set, it is immutable. To obtain a document with modified text, a new document must be created
- markup-aware/repositioning/preserveOriginalContent
- features work somewhat differently as it is not possible to replace the features object stored with a document or annotation, only modify it.
- sourceUrlOffsets
- toXml: the GATE XML serialization format is not supported for saving. Instead the formats also implemented in the Java GATE Format_Bdoc plugin are supported: Bdoc JSON and Bdoc MsgPack.
- DocumentContent: not necessary
- Additional document classes
There is no support or equivalent for the following GateNLP functions in GATE:
- to_type: to change the offset type between java/python
- set_changelog: to record changes to a changelog
- doc[span]: where span is either an offset, an offset range, or an annotation
- a number of other methods of the AnnotationSet and Annotation classes
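The two offset types that to_type switches between differ because Java strings are indexed by UTF-16 code units, while Python strings are indexed by Unicode code points. The following is a minimal, stdlib-only sketch of that difference. The helper names python_to_java_offset and java_to_python_offset are hypothetical and this is not how gatenlp implements the conversion.

```python
def python_to_java_offset(text: str, offset: int) -> int:
    """Convert a code-point offset (Python) into a UTF-16 code-unit offset (Java)."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    # A code point above U+FFFF is stored as a surrogate pair (2 code units) in UTF-16.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:offset])


def java_to_python_offset(text: str, offset: int) -> int:
    """Convert a UTF-16 code-unit offset (Java) into a code-point offset (Python)."""
    units = 0
    for index, ch in enumerate(text):
        if units == offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
    if units == offset:
        return len(text)
    raise ValueError("offset out of range or inside a surrogate pair")


# U+1F600 is one code point in Python but two UTF-16 code units in Java.
text = "a\U0001F600b"
print(python_to_java_offset(text, 2))  # "b" starts at 2 in Python, 3 in Java
print(java_to_python_offset(text, 3))  # and back again: 2
```

For text that contains only characters from the Basic Multilingual Plane, both offset types coincide, so the conversion only matters for documents with characters such as emoji or rare CJK ideographs.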
Here is a comparison of the most important API methods related to documents, showing first the GATE method, then the corresponding Python method (if any) and/or remarks:
- getContent().size(): len(doc)
- getAnnotations(): annset()
- getAnnotations(name): annset(name)
- getAnnotationSetNames(): annset_names()
- removeAnnotationSet(name): remove_annset(name)
- get/setContent(): not needed
- get/setSourceUrl(): not needed internally, the end user can simply store this or any other information about the document source in a document feature
- add/removeDocumentListener(listener): not needed as we do not have a GUI
- edit(...): not supported as gatenlp documents are immutable
- getEncoding(): not needed: internally, all texts are Unicode; if a file with a different encoding should get loaded, this can be done simply by passing an open file connection that has been opened with the proper encoding
- getMimeType(): not needed: the file format is specified when loading and not relevant once the Document has been created
- getNextAnnotationId(): not needed, ids are allocated per annotation set
- getNextNodeId(): not needed, the annotation set implementation does not use nodes
- getCollectRepositioningInfo(): not supported
- getMarkupAware(): not supported
- getPreserveOriginalContent(): there is no support for "markup-awareness"
- toXml(): save(); saving to (or loading from) GATE XML format is not supported, saving/loading using supported formats works via doc.save(...) and Document.load(...)
- get/setSourceUrlStart/EndOffset(): not supported
Features:
- So far, only Annotations and Documents can have features
- Features behave much like a Python dict
- Features are stored in a Features object
- The Features object for a Document or Annotation cannot be replaced
- Features must have string keys and should have values that are JSON-serializable (otherwise, the document cannot get saved in Bdoc JSON format)
Setting a feature:
- GATE: obj.getFeatures().put(name, value)
- Python: obj.features[name] = value
Getting a feature:
- GATE: obj.getFeatures().get(name)
- Python: obj.features.get(name [, defaultval]) or obj.features[name]
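The feature rules above (dict-like access, string keys, a features object that can be modified but not replaced, JSON-serializable values) can be illustrated with a small stdlib-only model. The classes Features and Thing below are toy stand-ins written for this illustration and are not the gatenlp classes; the only assumption is that the real objects behave as described in the list above.

```python
import json


class Features(dict):
    """Toy stand-in for a features mapping; only item assignment checks the key type."""

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("feature names must be strings")
        super().__setitem__(key, value)


class Thing:
    """Toy stand-in for a document or annotation that owns a Features object."""

    def __init__(self):
        self._features = Features()

    @property
    def features(self):
        # Read-only property: the Features object can be modified but not replaced.
        return self._features


obj = Thing()
obj.features["lang"] = "en"            # set a feature
print(obj.features["lang"])            # get a feature; raises KeyError if missing
print(obj.features.get("genre", "?"))  # get a feature with a default value

try:
    obj.features = {}                  # replacing the whole object is rejected
except AttributeError:
    print("features cannot be replaced")

json.dumps(obj.features)               # works: all values are JSON-serializable
obj.features["when"] = {1, 2}          # a set is not JSON-serializable
try:
    json.dumps(obj.features)
except TypeError:
    print("not JSON-serializable")
```

The last step shows why values should stay JSON-serializable: the assignment itself succeeds, and the problem only surfaces when the data is serialized.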
Annotations:
The main differences and properties are:
- no listeners
- offsets are int not Long
- no nodes
- important: annotation ids are (and have to be) unique per set, not per document
- as for documents, the feature dictionary cannot be replaced as a whole, only modified
- the offsets and type of an annotation are immutable
- ordering is based on increasing start offset, then increasing annotation id.
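The ordering rule and the immutability of offsets and type can be sketched with a frozen dataclass and a sort key. The class Ann and the function sort_key are hypothetical names for this illustration only and do not reflect how gatenlp implements annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    """Toy annotation: frozen, so offsets, type and id cannot be changed."""
    start: int
    end: int
    type: str
    id: int


def sort_key(ann: Ann):
    # Order by increasing start offset, then by increasing annotation id.
    return (ann.start, ann.id)


anns = [Ann(5, 9, "Token", 0), Ann(0, 4, "Token", 2), Ann(0, 9, "Sentence", 1)]
ordered = sorted(anns, key=sort_key)
print([a.id for a in ordered])  # [1, 2, 0]
```

Note that the end offset plays no role in this ordering: two annotations starting at the same offset are ordered by id only.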
Here is a comparison of the most important API methods related to annotations, showing first the GATE method, then the corresponding Python method (if any) and/or remarks:
- coextensive(ann): iscoextensive(annorsetorrange); many methods for annotations or annotation sets allow offset pairs, annotations or annotation sets to get used as an offset range
- is[Partially]Compatible(ann): not implemented
- overlaps(ann): isoverlapping(annorsetorrange)
- withinSpanOf(ann): iswithin(annorsetorrange)
- getId(): id; this is an immutable property in Python
- getType(): type
- getEndNode().getOffset(): end
- getStartNode().getOffset(): start
Python also has iscovering(annorsetorrange) and isinside(offset).
The GATE methods getEndNode() and getStartNode() are not needed in Python.
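To make the span relations named above concrete, here is a toy model over plain offset ranges. The class Span is hypothetical, its methods merely borrow the names used above, and the exact semantics (half-open ranges where the end offset is exclusive) are an assumption of this sketch, not a statement about the gatenlp implementation.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Toy half-open offset range [start, end); not the gatenlp Annotation class."""
    start: int
    end: int

    def iscoextensive(self, other: "Span") -> bool:
        # Same start offset and same end offset.
        return self.start == other.start and self.end == other.end

    def isoverlapping(self, other: "Span") -> bool:
        # Each range starts before the other one ends.
        return self.start < other.end and other.start < self.end

    def iswithin(self, other: "Span") -> bool:
        # This range lies completely inside the other range.
        return other.start <= self.start and self.end <= other.end

    def iscovering(self, other: "Span") -> bool:
        # This range completely contains the other range.
        return other.iswithin(self)

    def isinside(self, offset: int) -> bool:
        # The single offset falls inside this range (end is exclusive).
        return self.start <= offset < self.end


token = Span(4, 9)
sentence = Span(0, 20)
print(token.iswithin(sentence))          # True
print(sentence.iscovering(token))        # True
print(token.isoverlapping(Span(8, 12)))  # True
print(token.isoverlapping(Span(9, 12)))  # False: the end offset is exclusive here
print(token.isinside(9))                 # False
```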