Tokenizers
Tokenizers identify the tokens/words in a text. gatenlp allows the use of tokenizers from other NLP libraries like NLTK, Spacy or Stanza and provides the tools to implement your own.
Tokenization is often the first step in an annotation pipeline: it creates the initial set of annotations to work on. Later steps usually process existing annotations and add new features to them or create new annotations from them (e.g. add part-of-speech (POS) features to existing Tokens, or create noun phrase (NP) annotations from Token annotations).
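As an illustration of this pipeline idea, here is a minimal sketch (not part of this notebook's examples) where a later step adds a feature to Token annotations, the way a POS tagger would:
from gatenlp import Document
d = Document("A tiny example")
anns = d.annset()                 # the default annotation set
anns.add(0, 1, "Token")           # pretend a tokenizer created these
anns.add(2, 6, "Token")
for tok in anns.with_type("Token"):
    tok.features["pos"] = "XX"    # a real step would compute the POS tag here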
import os
from gatenlp import Document
Use NLTK or your own classes/methods for tokenization
The NLTK tokenizers can be used from gatenlp via the gatenlp NLTKTokenizer annotator. This annotator can take any NLTK tokenizer or any object that has a span_tokenize(str) or tokenize(str) method. Objects that support span_tokenize(str) are preferred, as this method directly returns the spans of tokens, not a list of token strings like tokenize(str) or a passed tokenize function. With tokenize(str) the spans have to be determined by aligning the tokens to the original text. For this reason, tokenize methods/functions must not produce token strings that are modified in any way from the original text (e.g. the default NLTK word tokenizer converts beginning double quotes to two backquotes and cannot be used for this reason).
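For illustration, plain NLTK shows the problem (TreebankWordTokenizer performs the quote conversion mentioned above):
from nltk.tokenize import TreebankWordTokenizer
TreebankWordTokenizer().tokenize('He said "hello" to her.')
# ['He', 'said', '``', 'hello', "''", 'to', 'her', '.']
# '``' and "''" never occur in the original text, so these tokens
# cannot be aligned back to character offsets.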
Some tokenize methods need to run on sentences instead of full documents; for this it is possible to specify an object/function that splits the document into sentences first. If a sentence tokenizer is specified, the tokenize method will always be used, even if a span_tokenize method exists.
from gatenlp.processing.tokenizer import NLTKTokenizer
# Text used for the examples below
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and
was followed by Donald Trump. Before Bush, Bill Clinton was president.
Also, lets include a sentence about South Korea which is called 대한민국 in Korean.
And a sentence with the full name of Iran in Farsi: جمهوری اسلامی ایران and also with
just the word "Iran" in Farsi: ایران
Also barack obama in all lower case and SOUTH KOREA in all upper case
"""
doc0 = Document(text)
doc0
Use the NLTK Whitespace Tokenizer
from nltk.tokenize.regexp import WhitespaceTokenizer
tok1 = NLTKTokenizer(nltk_tokenizer=WhitespaceTokenizer())
doc1 = Document(text)
doc1 = tok1(doc1)
doc1
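Note that the whitespace tokenizer keeps punctuation attached to the tokens. It also supports span_tokenize, so gatenlp gets exact offsets without any alignment step; a quick plain-NLTK check:
list(WhitespaceTokenizer().span_tokenize("Before Bush, Bill Clinton."))
# [(0, 6), (7, 12), (13, 17), (18, 26)] -- "Bush," keeps the comma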
Use the NLTK WordPunctTokenizer
from nltk.tokenize.regexp import WordPunctTokenizer
tok2 = NLTKTokenizer(nltk_tokenizer=WordPunctTokenizer())
doc2 = Document(text)
doc2 = tok2(doc2)
doc2
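As the heading above suggests, your own classes/methods work as well: NLTKTokenizer accepts any object with a span_tokenize(str) or tokenize(str) method. A minimal sketch with a made-up class that treats every run of non-whitespace characters as a token:
import re
class MySpanTokenizer:
    # Toy tokenizer: yields (start, end) offsets into the original text,
    # so no token string ever differs from the document text
    def span_tokenize(self, txt):
        for m in re.finditer(r"\S+", txt):
            yield m.span()
tok_own = NLTKTokenizer(nltk_tokenizer=MySpanTokenizer())
doc_own = tok_own(Document(text))
doc_own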
Use Spacy for tokenization
The AnnSpacy annotator can be used to run Spacy on a document and convert the Spacy annotations to gatenlp annotations (see the section on lib_spacy).
If the Spacy pipeline only includes the tokenizer, this can be used for just performing tokenization as well. The following example only uses the tokenizer and adds the sentencizer to also create sentence annotations.
from gatenlp.lib_spacy import AnnSpacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp_spacy = English()
nlp_spacy.add_pipe('sentencizer')
tok3 = AnnSpacy(nlp_spacy, add_nounchunks=False, add_deps=False, add_entities=False)
doc3 = Document(text)
doc3 = tok3(doc3)
doc3
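To check the result programmatically, the created annotations can be inspected; this sketch assumes AnnSpacy wrote the Token and Sentence annotations to the default annotation set (as shown in the viewer above):
anns3 = doc3.annset()
print("tokens:", len(anns3.with_type("Token")))
print("sentences:", len(anns3.with_type("Sentence")))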
Use Stanza for Tokenization
Similar to Spacy, the Stanza library can be used for tokenization (see the lib_stanza documentation) by using a Stanza pipeline that only includes the tokenizer.
from gatenlp.lib_stanza import AnnStanza
import stanza
nlp_stanza = stanza.Pipeline("en", processors="tokenize")
doc4 = Document(text)
tok4 = AnnStanza(nlp_stanza)
doc4 = tok4(doc4)
doc4
2022-11-09 22:03:09,169|INFO|stanza|Loading these models for language: en (English):
========================
| Processor | Package |
------------------------
| tokenize | combined |
========================
2022-11-09 22:03:09,170|INFO|stanza|Use device: gpu
2022-11-09 22:03:09,170|INFO|stanza|Loading: tokenize
2022-11-09 22:03:12,013|INFO|stanza|Done loading processors!
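As with Spacy, the resulting annotations can be inspected directly; indexing a document with an annotation returns the covered text:
for sent in doc4.annset().with_type("Sentence"):
    print(doc4[sent])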
Use Java GATE for Tokenization
The gatenlp GateWorker can be used to run arbitrary Java GATE pipelines on documents; see the GateWorker documentation for how to do this.
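A minimal sketch of what this looks like, assuming a local Java/GATE installation and an existing GATE pipeline file (the file name is made up, and the exact GateWorkerAnnotator parameters should be checked against the GateWorker documentation):
from gatenlp.gateworker import GateWorker, GateWorkerAnnotator
gw = GateWorker()                              # starts a local Java GATE process
pipe = GateWorkerAnnotator("annie.xgapp", gw)  # hypothetical pipeline file
doc5 = pipe(Document(text))
gw.close()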
Notebook last updated
import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1