Visualizing Documents and Corpora

GateNLP provides interactive visualization of Documents and Corpora in notebooks.

For documents, the viewer shows the text, the document features, and the annotation types in all annotation sets. Selecting a type displays its individual annotations.

For corpora, the document viewer is shown with interactive controls to move through the corpus and reload a document from the corpus.

Document viewer

In a notebook, the default for showing a document is the document viewer:

In [1]:

from gatenlp import Document
doc = Document.load("https://gatenlp.github.io/python-gatenlp/testdocument1.bdocjs", fmt="bdocjs")
doc

Out[1]:

This is equivalent to calling doc.show(). However, the show() method lets you customize how the document is shown in various ways. The following are the most important parameters (see the PythonDoc for details):

annspec: specify which annotations to include in the viewer. If specified, this must be a list where each element is either the name of a whole annotation set to include, or a tuple whose first element is the annotation set name and whose second element lists the annotation type names to include from that set.
preselect: which annotation types in which annotation sets to preselect for viewing. Same format as annspec
palette: a list of strings which are Javascript colors to use instead of the default palette
cols4types: a dictionary which maps tuples of the form (annsetname, anntypename) to Javascript colors to use for those annotations
doc_style: CSS style to use for the part of the viewer that shows the document text
row1_style: CSS style to use for the first row of the viewer which shows the document text and the annotation selection
row2_style: CSS style to use for the second row of the viewer, which shows the document features, or the annotation features if an annotation is selected
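
To make the annspec format concrete, the following sketch walks over an annspec list and reports what each element selects. The helper describe_annspec is hypothetical and not part of GateNLP; it only illustrates the two element forms described above (a set name, or a tuple of set name and type names), assuming the type names are given as a list.

```python
def describe_annspec(annspec):
    """Hypothetical helper (not part of GateNLP): expand an annspec list
    into (set name, types) pairs. types is None when the whole
    annotation set is included."""
    described = []
    for entry in annspec:
        if isinstance(entry, str):
            # A plain string names a whole annotation set.
            described.append((entry, None))
        else:
            # A tuple restricts the set to the listed annotation types.
            setname, types = entry
            described.append((setname, list(types)))
    return described


annspec = [
    "Spacy",                          # the whole "Spacy" set
    ("Stanza", ["Token", "PERSON"]),  # only two types from "Stanza"
]

for setname, types in describe_annspec(annspec):
    print(setname, "->", "all types" if types is None else ", ".join(types))
```

The same list can then be passed to the viewer as doc.show(annspec=annspec).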

In [2]:

# Show the document, but include only the Token and PERSON annotations from the Stanza and Spacy sets,
# and preselect the PERSON annotations in both sets.
# Also, show the document text in blue and make the font in the feature display pane smaller.
# Finally, use cyan ("#00FFFF") as the color for the Stanza Token annotations.

doc.show(
    annspec=[("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
    preselect=[("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    doc_style="color: blue;",
    row2_style="font-size: 70%;",
    cols4types={("Stanza", "Token"): "#00FFFF"},
)
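
The palette parameter takes a list of color strings. As a minimal sketch, such a list could be generated rather than written by hand. The helper make_palette is hypothetical, and the sketch assumes that hex strings of the form "#rrggbb" are acceptable as Javascript colors here.

```python
import colorsys


def make_palette(n, saturation=0.6, value=0.95):
    """Hypothetical helper: return n evenly spaced hues as '#rrggbb' strings."""
    colors = []
    for i in range(n):
        # Spread the hues evenly around the color wheel.
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


palette = make_palette(6)
print(palette)
```

In a notebook, the result would be used as doc.show(palette=palette).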

Corpus Viewer

In [3]:

from gatenlp import Document
from gatenlp.corpora import ListCorpus
from gatenlp.visualization import CorpusViewer 

texts = [
    "Text for the first document",
    "text for the second document",
    "And here is another document",
]
docs = [Document(t) for t in texts]
docs.append(doc)
corpus = ListCorpus(docs)
corpus

Out[3]:

<gatenlp.corpora.memory.ListCorpus at 0x7f65de16b8d0>

To interactively browse the corpus, create a CorpusViewer and call its show method to display the corpus in the notebook. Note that this works ONLY in a notebook: if the notebook is converted to HTML or Markdown, the corpus viewer neither works nor is displayed properly.

The CorpusViewer constructor accepts all the parameters that doc.show accepts, which lets you configure how the documents are shown in the corpus viewer.

In [4]:

cviewer = CorpusViewer(
    corpus, 
    preselect=[("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    annspec=[("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
)
cviewer.show()
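
Because the CorpusViewer constructor takes the same parameters as doc.show, one way to keep both views consistent is to hold the settings in a single dictionary and unpack it in both places. The following is only a sketch: the names VIEWER_SETTINGS, show_document and show_corpus are hypothetical and not part of GateNLP.

```python
# Settings shared by the document viewer and the corpus viewer.
VIEWER_SETTINGS = {
    "annspec": [("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
    "preselect": [("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    "doc_style": "color: blue;",
}


def show_document(doc, settings=VIEWER_SETTINGS):
    """Hypothetical helper: show one document with the shared settings."""
    doc.show(**settings)


def show_corpus(corpus, settings=VIEWER_SETTINGS):
    """Hypothetical helper: browse a corpus with the same settings."""
    # Imported lazily so that defining the helper does not require gatenlp.
    from gatenlp.visualization import CorpusViewer

    viewer = CorpusViewer(corpus, **settings)
    viewer.show()
    return viewer
```

In a notebook, show_document(doc) and show_corpus(corpus) would then render a single document and the whole corpus with identical annotation selection and styling.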

In [5]:

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)

NB last updated with gatenlp version 1.0.8a1