Versions and changes
Upcoming (1.0.9)
- Minimum required Python version is now 3.8!!
- Fixed issues #191, #189, #187, #199, #197, #196, #202
- New method `Document.annslist(annspec)`
- All methods for selecting annotations from a document using an annotation specification (`annspec`) now take the optional parameter `single_set=False`, which can be set to True to make sure that the specification only includes annotations from one set.
- (INCOMPATIBLE CHANGE) Pampac: Actions now take an optional `annset_name` parameter instead of the `annset` parameter. This makes sure that the action can run as part of a PampacAnnotator on several documents and always use the right set from the correct document.
- New method `Annotation.owning_set` which returns the set that owns an Annotation:
  - this is the attached Annotation set the Annotation is contained in
  - or None, if the Annotation is not a member of an attached Annotation set (but of one or more detached Annotation sets)
- AnnotationSet constructor: now always creates a detached set
  - The only supported way to create an attached set is via `doc.annset(name)`
  - for internal use in the library, the constructor `AnnotationSet._create(name,owner_doc)` has been added
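
To illustrate what the `single_set` check guards against, here is a minimal, library-independent sketch. It assumes that an annspec is a list whose entries are either a set name or a tuple of set name and type name(s), in line with the description of `annspec` under 1.0.8a1. The helpers `sets_in_annspec` and `check_annspec` are hypothetical names and not part of gatenlp.

```python
from typing import List, Sequence, Tuple, Union

# Assumed shape of one annspec entry: a set name, or
# (set name, type name or list of type names).
AnnSpecEntry = Union[str, Tuple[str, Union[str, Sequence[str]]]]


def sets_in_annspec(annspec: Sequence[AnnSpecEntry]) -> List[str]:
    """Return the distinct set names referenced by the spec, in first-seen order."""
    names: List[str] = []
    for entry in annspec:
        name = entry if isinstance(entry, str) else entry[0]
        if name not in names:
            names.append(name)
    return names


def check_annspec(annspec: Sequence[AnnSpecEntry], single_set: bool = False) -> List[str]:
    """Hypothetical helper: reject specs that span several sets when single_set=True."""
    names = sets_in_annspec(annspec)
    if single_set and len(names) > 1:
        raise ValueError(f"annspec refers to more than one set: {names}")
    return names


# Two entries, but both refer to the same set, so single_set=True is satisfied.
print(check_annspec([("Tokens", "Token"), ("Tokens", ["Sentence", "Person"])], single_set=True))
# Two different sets are fine as long as single_set is left at its default.
print(check_annspec(["Tokens", ("NER", "Person")]))
```

With `single_set=True`, a spec such as `["Tokens", "NER"]` would raise a `ValueError` in this sketch, which is the kind of early failure the new parameter is meant to provide.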
1.0.8 (2022-11-09)
- improve inline XML output support in the GateWorker (contributed by https://github.com/paulbriton)
- correct documentation text (contributed by https://github.com/dmarx)
- Client for the Rewire service, see Rewire Client
- Client for the Perspective service, see Perspective Client
1.0.8a1 (2022-07-02)
- Adding several annotations to an annotation set at once can now be done with the more pythonic `update` method instead of the `add_anns` method (same method signature as before).
- Removed dependency on the recordclass package and removed the `gazetteer` extra (no specific dependencies needed for the gazetteer classes any more)
- The GateWorker by default now automatically tries additional ports when the default or specified port is in use when starting.
- Added the `EmojiAnnotator`
- Added the command `gatenlp-dir2dir` for running a pipeline either in a single process or in several processes in parallel using Ray
- Added the `CorpusViewer`
- Added `ConcatCorpus` and `ConcatSource`
- CAUTION!!! Renamed parameters for specifying lists of setnames and optional typenames to `annspec`. Most of these parameters were previously called `annsets` or `ann_desc`. This requires updating the client code wherever such parameters were used!!
- Added parameter “presel” to `Document.show` to preselect annotation types
- Added parameter “palette” to `Document.show` to replace the default colour palette
- Added parameter “cols4types” to `Document.show` to specify colours for specific setname / typename combinations
- Add support for installing and using core functionality under pyodide and jupyter-lite
Also new:
- the package gatenlp-ml-tner can be used to train and apply transformers-based chunking/NER models
1.0.7 (2022-02-06)
- Added new parameter `reset_annids` to `AnnotationSet.clear`
- Fixed a bug in `StringGazetteer` when the gazetteer list contains a string-only entry instead of a tuple.
- Added parameters `row1_style` and `row2_style` to `Document.show`
- Fixed issue 149: reset the annotation id after clearing the annotation set
- ELG annotator: allow the use of a pre-initialized authentication file
- ELG annotator: add utility function to convert UDPipe annotations: `udptoken2tokens`
- Added IBM NLU annotator
- Added Google NLP annotator
- Loading a GATE XML document using `Document.load(path, fmt="gatexml")` now supports feature values of type Map, List, Set, Array, Date
1.0.6 (2021-11-19)
- The minimum Python version has been changed from 3.6 to 3.7. This now allows the use of postponed evaluation of type annotations and the use of dataclasses.
- Add `Document.edit(edits, affected_strategy="keepadapt")` method: update document text and change offsets/indices for all annotations, if necessary.
- New annotator StringRegexpAnnotator, which makes it possible to annotate documents using Python regular expressions in a very simple and flexible way. See the StringRegexpAnnotator Documentation.
- The StringGazetteer has been implemented. See the Gazetteers documentation.
- The TokenGazetteer parameter names were changed to match the corresponding `StringGazetteer` names
- ! the parameter name `out_set` in `gatenlp.processing.tokenizer` was changed to `outset_name` to be consistent with the name used elsewhere.
- ! the parameter name `out_annset` in `gatenlp.processing.client` was changed to `outset_name` to be consistent with the name used elsewhere.
- The `Document.clone()` method can be used to easily create an exact copy of a document, where none of the data is shared (deep copy)
- The TextNormalizer has been added. It can be used to normalize the unicode representation of the text in a document.
- Loading a document from bdocjs format now does not require any keys in the JSON map and also ignores all unknown keys. This makes it easier to import ad-hoc documents which e.g. only contain the text or text and annotations (if no offset type is specified, python is assumed).
- The documentation has been updated and extended (especially for gazetteers and PAMPAC)
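
The idea behind regular-expression based annotation can be sketched with the standard `re` module alone. This is not the StringRegexpAnnotator API: the function `regex_spans` and the rule format (a type name mapped to a pattern) are hypothetical. The sketch only shows how matches turn into the kind of start/end offsets plus type that an annotation carries, with Python offsets where the end is exclusive.

```python
import re
from typing import Dict, List, Tuple


def regex_spans(text: str, rules: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) for every match of every rule, sorted by offset."""
    spans = []
    for anntype, pattern in rules.items():
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), anntype))
    return sorted(spans)


text = "Released 2021-11-19, previous release 2021-10-09."
# Hypothetical rules: annotation type name -> regular expression
rules = {"Date": r"\d{4}-\d{2}-\d{2}", "Year": r"\b\d{4}\b"}
for start, end, anntype in regex_spans(text, rules):
    print(anntype, start, end, text[start:end])
```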
1.0.5.1 (2021-10-09)
- Bug fix: make `lib_spacy` support both versions 2.x and 3.x (1.0.5 used a method which is only available in 3.x)
1.0.5 (2021-10-08)
Changes that break backwards compatibility:
- `AnnotationSet.with_type()` previously returned a detached set with all annotations if no types were specified; it now returns a detached set with no annotations, which is more logical.
- API changes: in `pam.pampac.actions.AddAnn`, parameter `anntype` has been changed to `type`
- The Feature() constructor kw arg `logger` has been changed to `_change_logger` and `deepcopy` has been changed to `_deepcopy`
- Pampac: use the term “matches” instead of “data” for the named information stored for each named pattern that fits the document. A single one of these is often called “match info” and the index for a specific info is now called “matchidx” instead of “dataidx”. See issue #89
- Parameter `spacetoken_type` for `AnnSpacy` and `spacy2gatenlp` has been changed to `space_token_type` to conform to the parameter name used for `AnnStanza` and `stanza2gatenlp`.
- Stanford Stanza support now requires Stanza version 1.3.0 or higher
- Changes to `lib_spacy`: new parameter `containing_anns` to apply the spacy pipeline only to the part of the document covered by each of the annotations in the annotation set or iterator. New parameters `component_cfg` to specify a component config for Spacy and `retrieve_spans` to retrieve additional span types.
- Several bugfixes in Pampac.
Other changes and improvements:
- New method `AnnotationSet.create_from(anniterable)` to create a detached, immutable annotation set from an iterable of annotations
- New method `Document.anns(annspec)` creates a detached set of all annotations that match the specification
- New method `Document.yield_anns(annspec)` yields all annotations which match the specification
- Fixed bug in Token Gazetteer: issue #93
- Pampac: there is now a PampacAnnotator class to simplify using Pampac in a pipeline.
- Pampac: New parameter `containing_anns` for `Pampac.run`: if specified, runs the rules on each span of each of the containing annotations
- Pampac: a Result is now an Iterable of match infos.
- Pampac: the `.within(..)`, `.contains(..)` etc. constraints now allow using a separate annotation set, e.g. `.within("Person", annset=doc.annset("Other"))`. See issue #57
- Pampac: `RemoveAnn` action has been added
- Pampac: `UpdateAnnFeatures` has been improved
- Pampac: `AddAnn` action supports getter helpers in feature values
- `Span` objects are now immutable. Equality and hashing of `Span` objects are based on their start and end offsets. `Annotation` equality and hashing has been changed back to the Python default: variables only compare equal if they reference the same object, and hashing is based on object identity. For comparing annotations by content, the methods `ann.equal(other)` (compare content without annotation id) and `ann.same(other)` (compare content including annotation id) have been implemented.
- Documents can be saved in “tweet-v1” format
- Fixed a problem with the HTML viewer: leading and multiple whitespace annotations now show correctly.
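
The difference between value-based `Span` equality and identity-based `Annotation` equality can be illustrated with two small stand-in classes. This is a sketch under stated assumptions, not the gatenlp implementation: `SpanSketch` and `AnnSketch` are hypothetical names, and features are left out so that "content" here only means offsets and type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Immutable value object: equality and hash derive from (start, end)."""
    start: int
    end: int


class AnnSketch:
    """Keeps Python's default identity-based == and hash; content checks are explicit."""

    def __init__(self, start: int, end: int, anntype: str, annid: int):
        self.start, self.end, self.type, self.id = start, end, anntype, annid

    def equal(self, other: "AnnSketch") -> bool:
        # compare content, ignoring the annotation id
        return (self.start, self.end, self.type) == (other.start, other.end, other.type)

    def same(self, other: "AnnSketch") -> bool:
        # compare content including the annotation id
        return self.equal(other) and self.id == other.id


a = AnnSketch(0, 5, "Token", annid=0)
b = AnnSketch(0, 5, "Token", annid=1)
print(SpanSketch(0, 5) == SpanSketch(0, 5))  # True: same offsets
print(a == b, a.equal(b), a.same(b))         # False True False
```

One practical consequence of this design: two distinct annotation objects with identical content can both be members of a Python `set` or keys of a `dict`, while equal spans collapse into one entry.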
1.0.4 (2021-04-10)
- The GateWorkerAnnotator parameters have been changed: instead of the parameters gatehome and port, the parameter gateworker now needs to receive a GateWorker instance. Also, the `update_document` parameter has been added and now allows both updating and replacing the Python document from the Java GATE document
- Issue #66: make it possible to show annotations over new-lines in the html ann viewer
- Issue #65: provide ParagraphTokenizer and SplitPatternTokenizer to easily annotate paragraphs and other spans separated by some split pattern
- Issue #73: pickle document with offset index created
- Issue #68: rename the main development branch from “master” to “main”
- Issue #74: fix a bug in PAMPAC related to matching an annotation after some text
- Various improvements, additions and bug fixes in Pampac
- Issue #75: GateWorker now shows any Java exception when starting the Java process fails
- Issue #76: GateWorker has a new method `loadPipelineFromUri(uri)`
- Issue #77: GateWorkerAnnotator now automatically loads a pipeline from a URL if the string passed to the `pipeline` parameter looks like a URL or if it is the result of urllib.parse.urlparse. It is always treated like a file if it is a pathlib.Path
- Added the `Actions` action for Pampac to recursively wrap several actions into one
- Allow each Rule to have any number of actions, change signature to `Rule(pattern, *actions, priority=0)`
- The Pampac AddAnn action does not require a value for the name parameter any more; if not specified, the full span of the match is used.
- New method `add_anns(anniterable)` to add an iterable of annotations to a set
- The document viewer now also works in Google Colab
- The GateWorker can now be used as context manager: `with GateWorker() as gw:`
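
The URL-versus-file decision described for issue #77 can be sketched as a small classifier. The function `pipeline_kind` is hypothetical, and the test for "looks like a URL" (a scheme followed by `://`) is an assumption; the changelog does not say which heuristic the GateWorkerAnnotator actually applies. The example URL and paths are placeholders.

```python
from pathlib import Path
from typing import Union
from urllib.parse import ParseResult, urlparse


def pipeline_kind(pipeline: Union[str, Path, ParseResult]) -> str:
    """Hypothetical classifier: return "url" or "file" for a pipeline argument."""
    if isinstance(pipeline, Path):
        return "file"  # a pathlib.Path is always treated like a file
    if isinstance(pipeline, ParseResult):
        return "url"   # the result of urllib.parse.urlparse
    # Assumed heuristic for "looks like a URL": a scheme followed by "://"
    if urlparse(pipeline).scheme and "://" in pipeline:
        return "url"
    return "file"


print(pipeline_kind("https://example.org/pipelines/app.xgapp"))
print(pipeline_kind("/home/user/pipelines/app.xgapp"))
print(pipeline_kind(Path("pipelines/app.xgapp")))
print(pipeline_kind(urlparse("file:///tmp/app.xgapp")))
```

Requiring `://` in addition to a scheme keeps Windows-style paths such as `C:\apps\app.xgapp` on the file side in this sketch, since `urlparse` would otherwise report `c` as their scheme.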
1.0.3.1 (2021-03-01)
- add training course slides
- fix issue #63: could not import html document from a local file
1.0.3 (2021-02-22)
- Fix issues with logging and error handling in executor module
- Improve/add/change document sources/destination JsonLinesFile
- Add `Span.embed` method
- Implement multi-word tokens (MWTs) for the Stanza annotator
- Add support for space tokens for the Stanza annotator
- Support showing annotations over trailing spaces in the html ann viewer
- Add the `Document.attach(annset)` method (mostly for internal use only!)
- Add the ConllUFileSource to import CoNLL-U corpora
- Fix a problem in the html ann viewer where unnecessary spans were created
- Add option to the `Document.show()` method to style the document text div
1.0.2 (2021-02-09)
- Fix issue #56: Rename GateSlave to GateWorker
1.0.1 (2021-02-07)
- Initial release