Versions and changes
- AnnotationSet.with_type() previously returned a detached set with all annotations if no types were specified;
this now returns a detached set with no annotations, which is more logical.
- Fixed bug in Token Gazetteer: issue #93
- Pampac: use the term “matches” instead of “data” for the named information stored for each named pattern that
fits the document. A single one of these is often called “match info” and the index for a specific info is called
“matchidx”. See issue #89
- Pampac: a Result is now an Iterable of match infos.
- Pampac: the .contains(..) etc. constraints now accept a separate annotation set, e.g.
.within("Person", annset=doc.annset("Other")). See issue #57
- Pampac: RemoveAnn action has been added
- Pampac: UpdateAnnFeatures has been improved
- Span objects are now immutable. Equality and hashing of Span objects are based on their start and end offsets.
- Annotation equality and hashing have been changed back to the Python default: variables compare equal only if they
reference the same object, and hashing is based on object identity.
For comparing annotations by content, the methods ann.equal(other) (compare content without annotation id) and
ann.same(other) (compare content including annotation id) have been implemented.
- API changes:
anntype has been changed to
- The Feature() constructor kw arg
logger has been changed to
deepcopy has been changed to
- The GateWorkerAnnotator parameters have been changed: instead of the parameters gatehome and port,
the parameter gateworker now needs to receive a GateWorker instance.
The update_document parameter has been added and now allows both updating and replacing
the Python document from the Java GATE document
- Issue #66: make it possible to show annotations over new-lines in the html ann viewer
- Issue #65: provide ParagraphTokenizer and SplitPatternTokenizer to easily annotate paragraphs
and other spans separated by some split pattern
- Issue #73: pickle document with offset index created
- Issue #68: rename the main development branch from “master” to “main”
- Issue #74: fix a bug in PAMPAC related to matching an annotation after some text
- Various improvements, additions and bug fixes in Pampac
- Issue #75: GateWorker now shows any Java exception when starting the Java process fails
- Issue #76: GateWorker has a new method
- Issue #77: GateWorkerAnnotator now automatically loads a pipeline from a URL if the value
passed to the pipeline parameter is a string that looks like a URL or is the result of urllib.parse.urlparse.
It is always treated as a file if it is a pathlib.Path
- Added the Actions action for Pampac to recursively wrap several actions into one
- Allow each Rule to have any number of actions; the signature changes to
Rule(pattern, *actions, priority=0)
- The Pampac AddAnn action no longer requires a value for the name parameter; if it is not specified, the
full span of the match is used.
- New method add_anns(anniterable) to add an iterable of annotations to a set
- The document viewer now also works in Google Colab
- The GateWorker can now be used as a context manager:
with GateWorker() as gw:
- add training course slides
- fix issue #63: could not import html document from a local file
- Fix issues with logging and error handling in executor module
- Improve/add/change document sources/destination JsonLinesFile
- Implement multi-word tokens (MWTs) for the Stanza annotator
- Add support for space tokens for the Stanza annotator
- Support showing annotations over trailing spaces in the html ann viewer
- Add the Document.attach(annset) method (mostly for internal use only!)
- Add the ConllUFileSource to import CoNLL-U corpora
- Fix a problem in the html ann viewer where unnecessary spans were created
- Add an option to the Document.show() method to style the document text div
- Fix issue #56: Rename GateSlave to GateWorker
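
The equality rules listed above for Span objects and annotations can be illustrated with a small standalone sketch. It does not import gatenlp: SketchSpan and SketchAnnotation are hypothetical stand-ins written only for illustration, and the bodies of equal() and same() are assumptions derived from the one-line descriptions in the changelog entry, not the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSpan:
    """Hypothetical stand-in for an immutable span (not the gatenlp class).

    frozen=True blocks attribute assignment and makes equality and
    hashing depend only on the start and end offsets.
    """
    start: int
    end: int


class SketchAnnotation:
    """Hypothetical stand-in for an annotation (not the gatenlp class)."""

    def __init__(self, start, end, anntype, features=None, annid=0):
        self.start = start
        self.end = end
        self.type = anntype
        self.features = dict(features or {})
        self.id = annid

    # No __eq__ or __hash__ is defined here, so Python falls back to
    # its default: equality and hashing are based on object identity.

    def equal(self, other):
        """Compare content, ignoring the annotation id (assumed semantics)."""
        mine = (self.start, self.end, self.type, self.features)
        theirs = (other.start, other.end, other.type, other.features)
        return mine == theirs

    def same(self, other):
        """Compare content including the annotation id (assumed semantics)."""
        return self.equal(other) and self.id == other.id


# Two spans with identical offsets are equal and collapse in a set.
s1, s2 = SketchSpan(0, 5), SketchSpan(0, 5)
span_set = {s1, s2}

# Two annotations with identical content but different ids.
a1 = SketchAnnotation(0, 5, "Person", {"gender": "f"}, annid=1)
a2 = SketchAnnotation(0, 5, "Person", {"gender": "f"}, annid=2)
```

In this sketch, a1 == a2 is False because the two variables reference different objects, while a1.equal(a2) is True and a1.same(a2) is False because only the ids differ.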
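
For the Issue #77 entry above, the following standalone sketch shows one way a pipeline argument could be classified as a URL or a file. The helper name classify_pipeline_arg and the "has both a scheme and a network location" heuristic are assumptions made for illustration; the changelog only says that the string "looks like a URL" and does not describe the actual check.

```python
from pathlib import Path
from urllib.parse import ParseResult, urlparse


def classify_pipeline_arg(pipeline):
    """Return "url" or "file" for a pipeline argument (hypothetical helper)."""
    # A pathlib.Path is always treated as a file.
    if isinstance(pipeline, Path):
        return "file"
    # The result of urllib.parse.urlparse is treated as a URL.
    if isinstance(pipeline, ParseResult):
        return "url"
    # Assumed heuristic: a string "looks like a URL" if it parses with
    # both a scheme and a network location.
    parsed = urlparse(pipeline)
    if parsed.scheme and parsed.netloc:
        return "url"
    return "file"
```

Under this heuristic, a Windows path such as C:\gate\app.xgapp is still classified as a file: urlparse reads the drive letter as a scheme, but there is no network location.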