Visualizing Documents and Corpora
GateNLP provides interactive visualization of Documents and Corpora in notebooks.
For a document, the viewer shows the text, the document features, and the annotation types in every annotation set; selecting a type highlights its individual annotations in the text.
For corpora, the document viewer is shown with interactive controls to move through the corpus and reload a document from the corpus.
Document viewer
In a notebook, the default for showing a document is the document viewer:
from gatenlp import Document
doc = Document.load("https://gatenlp.github.io/python-gatenlp/testdocument1.bdocjs", fmt="bdocjs")
doc
It contains just a few sentences.
Here is a sentence that mentions a few named entities like
the persons Barack Obama or Ursula von der Leyen, locations
like New York City, Vienna or Beijing or companies like
Google, UniCredit or Huawei.
Here we include a URL https://gatenlp.github.io/python-gatenlp/
and a fake email address john.doe@hiscoolserver.com as well
as #some #cool #hastags and a bunch of emojis like 😽 (a kissing cat),
👩🏫 (a woman teacher), 🧬 (DNA),
🧗 (a person climbing),
💩 (a pile of poo).
Here we test a few different scripts, e.g. Hangul 한글 or
simplified Hanzi 汉字 or Farsi فارسی which goes from right to left.
This is equivalent to calling doc.show(). However, the show() method also lets you customize how the document is displayed. The following are the most important parameters (see the PythonDoc for details):
annspec: specifies which annotations to include in the viewer. If specified, it must be a list where each element is either the name of a whole annotation set to include, or a tuple whose first element is the annotation set name and whose second element gives the annotation type names to include from that set.
preselect: which annotation types in which annotation sets to preselect for viewing. Same format as annspec.
palette: a list of strings which are Javascript colors to use instead of the default palette.
cols4types: a dictionary which maps tuples of the form (annsetname, anntypename) to the Javascript colors to use for those annotations.
doc_style: CSS style to use for the part of the viewer that shows the document text.
row1_style: CSS style to use for the first row of the viewer, which shows the document text and the annotation selection.
row2_style: CSS style to use for the second row of the viewer, which shows the document features, or the annotation features if an annotation is selected.
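The annspec and preselect format described above mixes bare set names with (set name, type names) tuples, which can be easier to grasp when expanded into a plain dictionary. The following is a minimal sketch in pure Python; expand_annspec is a hypothetical helper written for illustration and is not part of gatenlp, and treating a single type name given as a string like a one-element list is an assumption of this sketch.

```python
def expand_annspec(annspec):
    """Hypothetical helper (not part of gatenlp).

    Expand an annspec-style list into a dict that maps each annotation set
    name to the list of type names to include, or to None when the whole
    set is requested.
    """
    expanded = {}
    for entry in annspec:
        if isinstance(entry, str):
            # A bare string names a whole annotation set.
            expanded[entry] = None
        else:
            # A tuple pairs a set name with the type names to include.
            setname, types = entry
            if isinstance(types, str):
                # Assumption of this sketch: accept a single type name too.
                types = [types]
            expanded[setname] = list(types)
    return expanded


# The whole "Stanza" set, but only two types from the "Spacy" set.
annspec = ["Stanza", ("Spacy", ["Token", "PERSON"])]
print(expand_annspec(annspec))
```

The list assigned to annspec here has the shape that the text above describes for both the annspec and the preselect parameter.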
# Show the document, but include only the Token and PERSON annotations from
# the Stanza and Spacy sets, and preselect the PERSON annotations in both sets.
# Also show the document text in blue and make the font in the feature display pane smaller.
# Finally, use cyan ("#00FFFF") as the color for the Stanza Token annotations.
doc.show(
    annspec=[("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
    preselect=[("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    doc_style="color: blue;",
    row2_style="font-size: 70%;",
    cols4types={("Stanza", "Token"): "#00FFFF"},
)
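When many annotation types need explicit colors, writing the cols4types dictionary by hand gets tedious. As a rough sketch, the dictionary can be built from a list of (annsetname, anntypename) pairs and a small palette. The helper colors_for_types and the specific color values are assumptions made for illustration; only the dictionary shape comes from the parameter description above.

```python
from itertools import cycle


def colors_for_types(pairs, palette):
    """Hypothetical helper (not part of gatenlp).

    Assign one color per (annsetname, anntypename) pair, in order, reusing
    the palette from the start when there are more pairs than colors.
    """
    if not palette:
        raise ValueError("palette must contain at least one color")
    return dict(zip(pairs, cycle(palette)))


pairs = [
    ("Stanza", "Token"),
    ("Stanza", "PERSON"),
    ("Spacy", "Token"),
    ("Spacy", "PERSON"),
]
# Illustrative CSS color strings, chosen arbitrarily for this sketch.
palette = ["#00FFFF", "#FFD700", "#90EE90"]

cols4types = colors_for_types(pairs, palette)
print(cols4types)
```

The resulting dictionary has the form expected by the cols4types parameter, so in a notebook it could be passed as doc.show(cols4types=cols4types).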
Corpus Viewer
from gatenlp import Document
from gatenlp.corpora import ListCorpus
from gatenlp.visualization import CorpusViewer
texts = [
    "Text for the first document",
    "text for the second document",
    "And here is another document",
]
docs = [Document(t) for t in texts]
docs.append(doc)
corpus = ListCorpus(docs)
corpus
<gatenlp.corpora.memory.ListCorpus at 0x7f65de16b8d0>
To browse the corpus interactively, create a CorpusViewer and call its show method to display the corpus in the notebook.
Note that this works ONLY in a live notebook: if the notebook is converted to HTML or Markdown, the corpus viewer neither works nor is displayed properly.
The constructor of CorpusViewer accepts all the parameters that doc.show accepts, which makes it possible to configure how the documents are shown in the corpus viewer.
cviewer = CorpusViewer(
    corpus,
    preselect=[("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    annspec=[("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
)
cviewer.show()
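Since the CorpusViewer constructor takes the same display parameters as doc.show, one possible way to keep the single-document view and the corpus view consistent is to store the options in one dictionary and unpack it in both places. This is only a sketch: the name view_options is made up for illustration, and the gatenlp calls are left as comments so that the snippet also runs outside a notebook.

```python
# Display options shared by the document viewer and the corpus viewer.
# The keys are parameter names from the list of show() parameters above.
view_options = {
    "annspec": [("Stanza", ["Token", "PERSON"]), ("Spacy", ["Token", "PERSON"])],
    "preselect": [("Stanza", ["PERSON"]), ("Spacy", ["PERSON"])],
    "doc_style": "color: blue;",
}

# In a notebook with gatenlp installed, and with doc and corpus defined as
# in the examples above, the same options could then be reused like this:
#
#   doc.show(**view_options)
#   CorpusViewer(corpus, **view_options).show()

for name, value in view_options.items():
    print(f"{name} = {value!r}")
```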
import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1