Visualizing Documents and Corpora
GateNLP provides interactive visualization of Documents and Corpora in notebooks.
For documents, the viewer shows the text, the document features, and the annotation types in all annotation sets; selecting a type displays its individual annotations.
For corpora, the document viewer is shown with interactive controls to move through the corpus and reload a document from the corpus.
Document viewer
In a notebook, the default for showing a document is the document viewer:
from gatenlp import Document
doc = Document.load("https://gatenlp.github.io/python-gatenlp/testdocument1.bdocjs", fmt="bdocjs")
doc
This is equivalent to calling doc.show(). However, the show() method lets you customize how the document is shown in various ways. The following are the most important parameters (see PythonDoc for details):
annspec: which annotations to include in the viewer. If specified, it must be a list where each element is either the name of a whole annotation set to include, or a tuple whose first element is the annotation set name and whose second element lists the annotation type names to include from that set.
preselect: which annotation types in which annotation sets to preselect for viewing. Same format as annspec.
palette: a list of strings which are Javascript colors to use instead of the default palette.
cols4types: a dictionary which maps tuples of the form (annsetname, anntypename) to Javascript colors to use for those annotations.
doc_style: CSS style to use for the part of the viewer that shows the document text.
row1_style: CSS style to use for the first row of the viewer, which shows the document text and the annotation selection.
row2_style: CSS style to use for the second row of the viewer, which shows the document features, or the annotation features if an annotation is selected.
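The two element forms accepted by annspec and preselect (a plain set name, or a tuple of set name and type names) can be made concrete with a small sketch. The helper normalize_annspec below is hypothetical and not part of gatenlp; it only mirrors the format described above, and accepting a single type name as a string is an extra assumption of this sketch.

```python
def normalize_annspec(annspec):
    """Expand an annspec-style list into (set_name, type_names) pairs.

    type_names is None when the whole annotation set is requested.
    Hypothetical helper for illustration only; not part of gatenlp.
    """
    normalized = []
    for element in annspec:
        if isinstance(element, str):
            # A bare string names a whole annotation set.
            normalized.append((element, None))
        elif isinstance(element, tuple) and len(element) == 2:
            set_name, type_names = element
            # Assumption of this sketch: a single type name may be given as a string.
            if isinstance(type_names, str):
                type_names = [type_names]
            normalized.append((set_name, list(type_names)))
        else:
            raise ValueError(f"Unexpected annspec element: {element!r}")
    return normalized


# Whole "Stanza" set, but only Token and PERSON from "Spacy".
annspec = ["Stanza", ("Spacy", ["Token", "PERSON"])]
for set_name, type_names in normalize_annspec(annspec):
    shown = "all types" if type_names is None else ", ".join(type_names)
    print(f"{set_name}: {shown}")
```

A list in this shape is what would be passed as annspec or preselect to doc.show().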
# Show the document, but include only the Token and PERSON annotations from the Stanza and Spacy sets,
# and preselect the PERSON annotations in both sets.
# Also, show the document text in blue and make the font in the feature display pane smaller.
# Finally, use cyan ("#00FFFF") as the color for the Stanza Token annotations.
doc.show(
    annspec=[("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
    preselect=[("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    doc_style="color: blue;",
    row2_style="font-size: 70%;",
    cols4types={("Stanza", "Token"): "#00FFFF"},
)
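When the same annotation type should get the same color in several annotation sets, the cols4types dictionary can be built programmatically instead of being written out by hand. The following is a minimal sketch in plain Python; the set names match the example above, while the variable names and the color for PERSON are illustrative assumptions.

```python
from itertools import product

# Annotation sets that should share the same color per type.
set_names = ["Stanza", "Spacy"]
# Illustrative CSS hex colors, chosen for this sketch only.
type_colors = {"Token": "#00FFFF", "PERSON": "#FFD700"}

# One entry per (annsetname, anntypename) pair, as described for cols4types.
cols4types = {
    (set_name, type_name): type_colors[type_name]
    for set_name, type_name in product(set_names, type_colors)
}

for key, color in cols4types.items():
    print(key, color)
```

The resulting dictionary has the shape expected by the cols4types parameter, so it could be passed as doc.show(cols4types=cols4types).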
Corpus Viewer
from gatenlp import Document
from gatenlp.corpora import ListCorpus
from gatenlp.visualization import CorpusViewer
texts = [
    "Text for the first document",
    "text for the second document",
    "And here is another document",
]
docs = [Document(t) for t in texts]
docs.append(doc)
corpus = ListCorpus(docs)
corpus
<gatenlp.corpora.memory.ListCorpus at 0x7f65de16b8d0>
To interactively browse the corpus, create a CorpusViewer and use its show method to show the corpus in the notebook. Note that this works ONLY in a notebook: if the notebook is converted to HTML or Markdown, the corpus viewer neither works nor is displayed properly.
The constructor of CorpusViewer accepts all the parameters that doc.show accepts, which makes it possible to configure how the documents are shown in the corpus viewer.
cviewer = CorpusViewer(
    corpus,
    preselect=[("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    annspec=[("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
)
cviewer.show()
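Because the corpus viewer only works inside a live notebook, shared code may want to check for a notebook kernel before showing it. The sketch below uses a common heuristic based on the IPython shell class name; the function in_notebook is hypothetical and not part of gatenlp, and the check can also match other ZMQ-based front ends such as the Qt console.

```python
def in_notebook():
    """Return True if the code appears to run in a Jupyter-style kernel.

    Heuristic only: checks the class name of the active IPython shell.
    Hypothetical helper for illustration; not part of gatenlp.
    """
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook kernel.
        return False
    shell = get_ipython()
    # get_ipython() returns None outside of any IPython session.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


if in_notebook():
    print("Notebook kernel detected: the interactive corpus viewer can be shown.")
else:
    print("No notebook kernel detected: skip the interactive corpus viewer.")
```

In a notebook, the first branch is where a call such as cviewer.show() would go.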
import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1