Annotations

Annotations are objects that provide information about a span of text. In gatenlp, annotations are used to identify tokens, entities, sentences, paragraphs, and other things. Unlike in other NLP libraries, the same abstraction is used for everything that refers to an offset span in a document. This is the same abstraction that Java GATE uses.

Each annotation carries a start offset, an end offset, a type, an id, and a set of features.

The normal way to create an annotation is by using the AnnotationSet method add:

from gatenlp import Document, Annotation
doc = Document("Some test document")
annset = doc.annset()
ann = annset.add(2,4,"Token",{"lemma": "is"})
ann
# Out: Annotation(2,4,Token,features=Features({'lemma': 'is'}),id=0)

This creates an annotation of type “Token” that starts with the character at offset 2 and ends with the character at offset 3 (the end offset is always one past the last character). The annotation is initialized with a single feature “lemma” whose value is “is”.
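Because the end offset is exclusive, a span behaves like a Python slice. The sketch below uses plain string slicing rather than any gatenlp call, and assumes that annotation offsets index the document text the same way Python string indices do:

```python
text = "Some test document"
start, end = 2, 4

covered = text[start:end]   # characters at offsets 2 and 3
length = end - start        # number of characters in the span

print(repr(covered), length)
```

With these offsets the covered text is "me" and the length is 2, so the end offset minus the start offset always gives the number of annotated characters.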

Once an annotation has been created, everything except its features is immutable. For example, trying to assign ann.start = 12 raises an exception.

To set, change, or remove a feature, use the methods provided by Features.

An annotation can also be directly created:

ann2 = Annotation(2,4,"Token",annid=1,features={"lemma": "is"})
ann2
# Out: Annotation(2,4,Token,features=Features({'lemma': 'is'}),id=1)

However, such a “free floating” annotation is probably not of much use, and there is no way to add it directly to an annotation set. The method annset.add_ann(ann) can be used to add an annotation that is a copy of ann.

Annotation span methods

There can be as many annotations for as many arbitrary spans as needed, and they can overlap arbitrarily. There are several annotation methods which can be used to find out how exactly they overlap or are contained within each other.

ann_tok1 = annset.add(0,4,"Token")
ann_tok2 = annset.add(5,13,"Token")
ann_all = annset.add(0,13,"Document")
ann_vowel1 = annset.add(1,2,"Vowel")
ann_vowel2 = annset.add(3,4,"Vowel")

Annotations have a “length” which is the number of characters annotated, i.e. the length of the annotated span:

assert ann_tok1.length == 4

The ordering of annotations is by start offset, then annotation id.

# does one annotation come before the other?
assert ann_tok1 < ann_tok2
# True
assert ann_tok1 < ann_vowel1
# True
assert ann_tok1 < ann_all
# True (annotations added later have a higher annotation id)
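The ordering rule can be pictured as sorting by a (start offset, annotation id) key. The sketch below models annotations as plain tuples instead of gatenlp objects; the tuple layout and the ids are assumptions chosen to mirror the annotations created above:

```python
# Hypothetical stand-ins for annotations: (start, end, type, id)
anns = [
    (5, 13, "Token", 2),
    (0, 13, "Document", 3),
    (1, 2, "Vowel", 4),
    (0, 4, "Token", 1),
    (3, 4, "Vowel", 5),
]

# Order by start offset first, then by annotation id as the tie-breaker
ordered = sorted(anns, key=lambda a: (a[0], a[3]))

for start, end, anntype, annid in ordered:
    print(start, end, anntype, annid)
```

The two entries that start at offset 0 are ordered by id, so the "Token" with id 1 comes before the "Document" with id 3 even though the latter covers a longer span.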

Checking for overlaps:

assert not ann_tok1.isoverlapping(ann_tok2)

assert not ann_tok1.iscoextensive(ann_tok2)

assert ann_tok1.isoverlapping(ann_vowel1)

assert ann_tok1.iswithin(ann_all)

assert ann_tok1.iscovering(ann_vowel2)

assert ann_tok1.isbefore(ann_tok2)

assert not ann_tok1.isbefore(ann_tok2, immediately=True)

assert ann_tok1.gap(ann_tok2) == 1
# show the document with those annotations in the notebook:
doc
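To make the meaning of these checks concrete, here is a minimal sketch of the same relations on half-open offset pairs. It is not gatenlp's implementation: the Offsets class and the helper functions are hypothetical, and gatenlp may treat boundary cases such as zero-length spans differently.

```python
from typing import NamedTuple


class Offsets(NamedTuple):
    """Hypothetical half-open span [start, end); not a gatenlp class."""
    start: int
    end: int


def overlapping(a, b):
    # The two spans share at least one character
    return a.start < b.end and b.start < a.end


def within(a, b):
    # a lies entirely inside b (equal boundaries allowed)
    return b.start <= a.start and a.end <= b.end


def covering(a, b):
    # a contains b entirely
    return within(b, a)


def coextensive(a, b):
    # Both spans have identical start and end offsets
    return a.start == b.start and a.end == b.end


def before(a, b, immediately=False):
    # a ends at or before the start of b; immediately=True allows no gap
    if immediately:
        return a.end == b.start
    return a.end <= b.start


def gap(a, b):
    # Number of characters between a and b, assuming a is before b
    return b.start - a.end


tok1, tok2 = Offsets(0, 4), Offsets(5, 13)
everything, vowel2 = Offsets(0, 13), Offsets(3, 4)
print(overlapping(tok1, tok2), within(tok1, everything), gap(tok1, tok2))
```

The offsets match the annotations created earlier, so each helper should agree with the corresponding assertion above for these particular spans.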

Spans

Span objects represent just an offset span. Sometimes it is necessary to keep around only the spans, without the additional information carried by an Annotation. This is what Span objects are for.

Span objects have the same methods for checking overlap, coextensiveness etc. as Annotations.

In addition, a Span can often be passed to methods in place of separate start and end offsets.

from gatenlp import Span

span1 = Span(0,3)
span2 = Span(2,3)
span3 = Span(2,3)
span4 = Span(3,5)

# Create a span from an Annotation
span5 = Span(ann_tok1)
assert span1.isoverlapping(span2)
assert span1.iscovering(span2)
assert span2.iscoextensive(span3)
assert span2.iswithin(span1)
# Create a (detached) Annotation using a span
ann1 = Annotation(span1, "Type")
print(ann1)

# Create a document Annotation using a span
ann2 = doc.annset().add(span1, "Type")
print(ann2)
# Out: Annotation(0,3,Type,features=Features({}),id=0)
# Out: Annotation(0,3,Type,features=Features({}),id=6)