Gazetteers

Gazetteers make it easy to find matches in a document from a large list of gazetteer entries. Entries can be associated with arbitrary features, and when a match is found, an annotation is created with the features of the matching gazetteer entry. This section covers two gatenlp gazetteer annotators, StringGazetteer and TokenGazetteer.

import os
from gatenlp import Document
from gatenlp.processing.gazetteer import TokenGazetteer, StringGazetteer

StringGazetteer

The main features of the StringGazetteer

Create a gazetteer from a Python list

Each gazetteer entry is a tuple, where the first element is the string to match and the second element is a dictionary with arbitrary features. By default, leading and trailing whitespace is removed from an entry and runs of whitespace characters within the entry are replaced by a single space internally. This can be disabled with the ws_clean=False parameter if the gazetteer entries are already properly cleaned.

gazlist1 = [
    ("Barack Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama")),
    ("Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama")),
    ("Donald Trump", dict(url="https://en.wikipedia.org/wiki/Donald_Trump")),
    ("Trump", dict(url="https://en.wikipedia.org/wiki/Donald_Trump")),
    ("George W. Bush", dict(url="https://en.wikipedia.org/wiki/George_W._Bush")),
    ("George Bush", dict(url="https://en.wikipedia.org/wiki/George_W._Bush")),
    ("Bush", dict(url="https://en.wikipedia.org/wiki/George_W._Bush")),
    ("    Bill        Clinton   ", dict(url="https://en.wikipedia.org/wiki/Bill_Clinton")),
    ("Clinton", dict(url="https://en.wikipedia.org/wiki/Bill_Clinton")),
]
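
The default whitespace cleanup described above can be pictured with a small standalone sketch. This is not gatenlp's implementation: `clean_entry` is a hypothetical helper that only illustrates the effect on an entry such as the padded "Bill Clinton" string in the list.

```python
def clean_entry(entry: str) -> str:
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    # str.split() without arguments splits on any run of whitespace
    # and drops the empty strings at both ends.
    return " ".join(entry.split())


print(repr(clean_entry("    Bill        Clinton   ")))
```

With cleanup of this kind, the padded entry can later be looked up as plain "Bill Clinton".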

# Document with some text mentioning some of the names in the gazetteer, for testing
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and
  was followed by Donald Trump. Before Bush, Bill Clinton was president.
  Also, lets include a sentence about South Korea which is called 대한민국 in Korean.
  And a sentence with the full name of Iran in Farsi: جمهوری اسلامی ایران and also with 
  just the word "Iran" in Farsi: ایران 
  Also barack obama in all lower case and SOUTH KOREA in all upper case
  """
doc0 = Document(text)
doc0

Create the StringGazetteer annotator

In the following example we create the StringGazetteer and specify the source and the format of the source, so that some gazetteer entries are loaded right away. This is not required: gazetteer entries can also be added later (see below).

gaz1 = StringGazetteer(source=gazlist1, source_fmt="gazlist")

The StringGazetteer instance is a gatenlp annotator, but it can also be used to look up the information for an entry or to check whether an entry is in the gazetteer.

print("Entries:     ", len(gaz1))
print("Entry 'Trump': ", gaz1["Trump"])
print("Entry 'Bill Clinton': ", gaz1.get("Bill Clinton"))
print("Contains 'Bush':", "Bush" in gaz1)
Entries:      9
Entry 'Trump':  [{'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}]
Entry 'Bill Clinton':  [{'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}]
Contains 'Bush': True

Gazetteer entries can also be added with the add and append methods. That way the gazetteer can be created from several different sources.

Every time gazetteer entries are loaded, it is possible to specify features which should get added to all entries of that list.

Let us create a new list and specify some features common to all entries of this list and add it to the gazetteer:

gazlist2 = [
    ("United States", dict(url="https://en.wikipedia.org/wiki/United_States")),
    ("US", dict(url="https://en.wikipedia.org/wiki/United_States")),
    ("United Kingdom", dict(url="https://en.wikipedia.org/wiki/United_Kingdom")),
    ("UK", dict(url="https://en.wikipedia.org/wiki/United_Kingdom")),    
    ("Austria", dict(url="https://en.wikipedia.org/wiki/Austria")),
    ("South Korea", dict(url="https://en.wikipedia.org/wiki/South_Korea")),
    ("대한민국", dict(url="https://en.wikipedia.org/wiki/South_Korea")),
    ("Iran", dict(url="https://en.wikipedia.org/wiki/Iran")),
    ("جمهوری اسلامی ایران", dict(url="https://en.wikipedia.org/wiki/Iran")),
    ("ایران", dict(url="https://en.wikipedia.org/wiki/Iran")),
]

# Note: if this cell gets executed several times, the data stored with each gazetteer entry gets
# extended by a new dictionary of features each time!
# In general, there can be arbitrarily many feature dictionaries for each entry, which can be used to
# store the different sets of information for different entities which share the same name.
gaz1.append(source=gazlist2, source_fmt="gazlist", list_features=dict(type="country"))

print("Entries:     ", len(gaz1))
print("Entry 'ایران': ", gaz1["ایران"])
print("Entry 'South Korea': ", gaz1["South Korea"])
Entries:      19
Entry 'ایران':  [{'url': 'https://en.wikipedia.org/wiki/Iran', 'type': 'country'}]
Entry 'South Korea':  [{'url': 'https://en.wikipedia.org/wiki/South_Korea', 'type': 'country'}]
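
The way list-level features end up on every entry, and the way repeated loading accumulates feature dictionaries, can be sketched in plain Python. This is a simplified model and not gatenlp code: `store` and `append_list` are hypothetical names, and letting list features win on a key collision is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical stand-in for the gazetteer's internal store:
# each entry string maps to a list of feature dictionaries.
store = defaultdict(list)


def append_list(store, gazlist, list_features=None):
    """Add every (string, features) pair, merging in the list-level features."""
    for entry, feats in gazlist:
        merged = dict(feats)                # copy, so the source list stays untouched
        merged.update(list_features or {})  # list-level features apply to every entry
        store[entry].append(merged)


countries = [("Iran", {"url": "https://en.wikipedia.org/wiki/Iran"})]
append_list(store, countries, list_features={"type": "country"})
# Loading the same list a second time simulates re-running the cell.
append_list(store, countries, list_features={"type": "country"})

print(len(store["Iran"]))
print(store["Iran"][0])
```

After the second call the entry carries two feature dictionaries, which mirrors the note in the comment above about executing the cell several times.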

There are also methods to check if there is a match at some specific position in some text, to find the next match in some text, and to find all matches in some text:

# The methods match and find return a tuple with a list of GazetteerMatch objects describing all matches
# as the first element and the length of the longest of the matches as the second element. The find method
# additionally returns the location of the match as the third element of the tuple.
print("Check for a match in the document text at position 0: ", gaz1.match(text, start=0))
print("Check for a match in the document text at position 1: ", gaz1.match(text, start=1))
print("Find the next match from position 3", gaz1.find(text, start=3))
# The find_all method does not return a tuple, but a generator of match objects:
print("Find all matches from position 340", list(gaz1.find_all(text, start=340)))
Check for a match in the document text at position 0:  ([GazetteerMatch(start=0, end=12, match='Barack Obama', features={'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}, type='Lookup')], 12)
Check for a match in the document text at position 1:  ([], 0)
Find the next match from position 3 ([GazetteerMatch(start=7, end=12, match='Obama', features={'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}, type='Lookup')], 5, 7)
Find all matches from position 340 [GazetteerMatch(start=342, end=346, match='Iran', features={'type': 'country', 'url': 'https://en.wikipedia.org/wiki/Iran'}, type='Lookup'), GazetteerMatch(start=358, end=363, match='ایران', features={'type': 'country', 'url': 'https://en.wikipedia.org/wiki/Iran'}, type='Lookup')]
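
To make the shape of these return values concrete, here is a minimal dictionary-based sketch of "match at a position" and "find the next match". It is an illustration only and says nothing about how gatenlp implements the lookup: `entries`, `match_at` and `find_next` are hypothetical names, matches are plain tuples rather than GazetteerMatch objects, and returning None as the position when nothing is found is an assumption of this sketch.

```python
# Hypothetical, simplified lookup table: entry string -> list of feature dicts.
entries = {
    "Barack Obama": [{"url": "https://en.wikipedia.org/wiki/Barack_Obama"}],
    "Obama": [{"url": "https://en.wikipedia.org/wiki/Barack_Obama"}],
}
max_len = max(len(e) for e in entries)


def match_at(text, start=0):
    """Return (matches, longest_length) for entries that begin exactly at start."""
    matches = []
    longest = 0
    for length in range(1, max_len + 1):
        candidate = text[start:start + length]
        if len(candidate) < length:
            break  # ran past the end of the text
        for feats in entries.get(candidate, []):
            matches.append((start, start + length, candidate, feats))
            longest = max(longest, length)
    return matches, longest


def find_next(text, start=0):
    """Scan forward and return (matches, longest_length, position) of the first hit."""
    for pos in range(start, len(text)):
        matches, longest = match_at(text, pos)
        if matches:
            return matches, longest, pos
    return [], 0, None


sample = "Barack Obama was the 44th president"
print(match_at(sample, 0))
print(match_at(sample, 1))
print(find_next(sample, 3))
```

As in the output above, position 0 yields a match of length 12, position 1 yields no match, and searching from position 3 finds "Obama" at offset 7.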

To annotate a document with the matches found in the gazetteer, the StringGazetteer instance can be used as an annotator. By default, matches can occur anywhere in the document, non-whitespace characters must match exactly, and no special split characters are recognized (so matches can occur across newline characters and sentence boundaries).

By default, annotations of type “Lookup” are created in the default annotation set. The features of each annotation are taken from the gazetteer entry and from its list. If a gazetteer entry was added several times, a separate annotation is created for each set of features stored for that gazetteer string.

doc1 = Document(text)
doc1 = gaz1(doc1)
doc1

StringGazetteer parameters

The parameters for the StringGazetteer constructor can be used to change the behaviour of the gazetteer in many ways. The parameters related to loading gazetteer entries can also be specified with the append method.

Parameters to influence how annotations for matches are created:

Parameters to influence how annotations in the document constrain matching. If a parameter is None, matching is not influenced by that kind of annotation, but it may still be influenced by other parameters (see below):

Other parameters to influence how matches are carried out:

Parameters that influence how gazetteer data is loaded:

# Create a new StringGazetteer which creates "Person" annotations for the person list, "Country" annotations
# for the country list, and ignores case when matching
# Because the gazetteer by default matches anywhere, the lower case "us" now matches inside several words
gaz2 = StringGazetteer(map_chars="lower")
gaz2.append(source=gazlist1, source_fmt="gazlist", list_type="Person")
gaz2.append(source=gazlist2, source_fmt="gazlist", list_type="Country")
doc2 = Document(text)
doc2 = gaz2(doc2)
doc2

# Create a new StringGazetteer which matches case-insensitively and creates annotations as above,
# but limits matches to positions where Token annotations start/end. For this we first have to
# annotate the document with a tokenizer.
# Matches are now restricted so that their start/end coincides with the start/end of a Token
# annotation, so the lower-case matches inside words no longer occur.

# create a tokenizer based on the NLTK WordPunctTokenizer. 
from gatenlp.processing.tokenizer import NLTKTokenizer
from nltk.tokenize.regexp import WordPunctTokenizer
tokenizer = NLTKTokenizer(
    nltk_tokenizer=WordPunctTokenizer(), 
    token_type="Token", outset_name="")

gaz3 = StringGazetteer(map_chars="lower", start_type="Token", end_type="Token")
gaz3.append(source=gazlist1, source_fmt="gazlist", list_type="Person")
gaz3.append(source=gazlist2, source_fmt="gazlist", list_type="Country")
doc3 = Document(text)
doc3 = tokenizer(doc3)
doc3 = gaz3(doc3)
doc3

for person in doc3.annset().with_type("Person"):
    print(doc3[person], person)

Barack Obama Annotation(0,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=89)
Obama Annotation(7,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=90)
George W. Bush Annotation(62,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=92)
Bush Annotation(72,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=93)
Donald Trump Annotation(99,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=94)
Trump Annotation(106,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=95)
Bush Annotation(120,124,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=96)
Bill Clinton Annotation(126,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=97)
Clinton Annotation(131,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=98)
barack obama Annotation(372,384,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=106)
obama Annotation(379,384,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=107)
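
The effect of restricting matches to token boundaries can be reproduced without gatenlp in a few lines. The sketch below is an illustration under stated assumptions: `find_entry` is a hypothetical helper, tokens are found with the regular expression that NLTK's WordPunctTokenizer is based on, and lower-casing is assumed not to change string length (true for this ASCII sample, not for every Unicode character).

```python
import re

# Pattern behind NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def find_entry(text, entry, token_bounded):
    """Return (start, end) spans of entry in text, compared in lower case."""
    low_text, low_entry = text.lower(), entry.lower()
    starts = {m.start() for m in TOKEN_RE.finditer(text)}
    ends = {m.end() for m in TOKEN_RE.finditer(text)}
    spans = []
    pos = low_text.find(low_entry)
    while pos != -1:
        end = pos + len(low_entry)
        # Unrestricted: accept every hit. Restricted: the hit must begin
        # where a token begins and end where a token ends.
        if not token_bounded or (pos in starts and end in ends):
            spans.append((pos, end))
        pos = low_text.find(low_entry, pos + 1)
    return spans


sample = "just the US and us"
print(find_entry(sample, "US", token_bounded=False))
print(find_entry(sample, "US", token_bounded=True))
```

Without the restriction, the entry also matches inside "just"; with it, only the two standalone occurrences remain.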

TokenGazetteer

Unlike the StringGazetteer, which matches gazetteer strings against the document text, the TokenGazetteer matches token string sequences generated from the gazetteer strings against the sequences of tokens in the document. This is usually done on the Token annotations, but the gazetteer can be used on any sequence of annotations of some type.

Since what gets matched is a sequence of token strings, the gazetteer strings also need to be converted to sequences of token strings when loading from a file. This can be achieved by a simple split-on-whitespace approach (the default) or by specifying a tokenizer or splitter to be used when loading the gazetteer entries. When loading a prepared gazetteer list, the splitting into token strings must already have been done.
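
The choice of splitting method matters because the entry tokens have to agree with the document tokens. The following standalone sketch compares plain whitespace splitting with the regular expression that NLTK's WordPunctTokenizer is based on, for one of the entries used in this notebook; it does not call gatenlp.

```python
import re

entry = "George W. Bush"

# Whitespace splitting: break on runs of whitespace only.
by_whitespace = entry.split()

# Tokenizer-style splitting, using the pattern behind NLTK's WordPunctTokenizer.
by_wordpunct = re.findall(r"\w+|[^\w\s]+", entry)

print(by_whitespace)
print(by_wordpunct)
```

The first variant keeps "W." as one token while the second yields "W" and "." separately, so an entry split on whitespace would not line up with a document tokenized by the WordPunctTokenizer.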

Use an NLTK tokenizer for the gazetteer strings and document

# first create new gazetteer lists from the string-based gazetteer lists we already have
def text2tokenstrings(text):
    tmpdoc = Document(text)
    tokenizer(tmpdoc)
    tokens = list(tmpdoc.annset().with_type("Token"))
    return [tmpdoc[tok] for tok in tokens]

tok_gazlist1 = [(text2tokenstrings(txt), feats) for txt, feats in gazlist1]
tok_gazlist2 = [(text2tokenstrings(txt), feats) for txt, feats in gazlist2]

tok_gazlist1, tok_gazlist2
([(['Barack', 'Obama'], {'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),
  (['Obama'], {'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),
  (['Donald', 'Trump'], {'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),
  (['Trump'], {'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),
  (['George', 'W', '.', 'Bush'],
   {'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),
  (['George', 'Bush'],
   {'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),
  (['Bush'], {'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),
  (['Bill', 'Clinton'], {'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),
  (['Clinton'], {'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'})],
 [(['United', 'States'],
   {'url': 'https://en.wikipedia.org/wiki/United_States'}),
  (['US'], {'url': 'https://en.wikipedia.org/wiki/United_States'}),
  (['United', 'Kingdom'],
   {'url': 'https://en.wikipedia.org/wiki/United_Kingdom'}),
  (['UK'], {'url': 'https://en.wikipedia.org/wiki/United_Kingdom'}),
  (['Austria'], {'url': 'https://en.wikipedia.org/wiki/Austria'}),
  (['South', 'Korea'], {'url': 'https://en.wikipedia.org/wiki/South_Korea'}),
  (['대한민국'], {'url': 'https://en.wikipedia.org/wiki/South_Korea'}),
  (['Iran'], {'url': 'https://en.wikipedia.org/wiki/Iran'}),
  (['جمهوری', 'اسلامی', 'ایران'],
   {'url': 'https://en.wikipedia.org/wiki/Iran'}),
  (['ایران'], {'url': 'https://en.wikipedia.org/wiki/Iran'})])
# Create the token gazetteer and load the two lists, then apply it to the document

tok_gaz1 = TokenGazetteer(longest_only=False,
                          skip_longest=False, outset_name="", ann_type="Lookup",
                          annset_name="", token_type="Token")
tok_gaz1.append(source=tok_gazlist1, source_fmt="gazlist", list_type="Person")
tok_gaz1.append(source=tok_gazlist2, source_fmt="gazlist", list_type="Country")

doc5 = Document(text)
doc5 = tokenizer(doc5)
tokens = doc5.annset().with_type("Token")
doc5 = tok_gaz1(doc5)
doc5

for person in doc5.annset().with_type("Person"):
    print(doc5[person], person)

Barack Obama Annotation(0,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=89)
Obama Annotation(7,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=90)
George W. Bush Annotation(62,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=92)
Bush Annotation(72,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=93)
Donald Trump Annotation(99,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=94)
Trump Annotation(106,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=95)
Bush Annotation(120,124,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=96)
Bill Clinton Annotation(126,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=97)
Clinton Annotation(131,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=98)
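
The idea of matching token sequences rather than characters can be sketched in plain Python. This is a simplified illustration, not the TokenGazetteer implementation: `match_token_sequences` is a hypothetical function, the `kind` feature is made up for the example, and every match at every position is reported, comparable to running with longest_only=False and skip_longest=False.

```python
def match_token_sequences(doc_tokens, entries):
    """Return (start_index, end_index, features) for every entry found in doc_tokens."""
    # Index the entries by their token tuple; one tuple can carry several feature dicts.
    index = {}
    for tokens, feats in entries:
        index.setdefault(tuple(tokens), []).append(feats)
    max_len = max(len(key) for key in index)

    found = []
    for start in range(len(doc_tokens)):
        # Try every window length at this position, so that shorter entries
        # are reported in addition to longer ones (no longest-only filtering).
        for length in range(1, max_len + 1):
            window = tuple(doc_tokens[start:start + length])
            if len(window) < length:
                break  # window would run past the last token
            for feats in index.get(window, []):
                found.append((start, start + length, feats))
    return found


entries = [
    (["Barack", "Obama"], {"kind": "Person"}),
    (["Obama"], {"kind": "Person"}),
    (["George", "W", ".", "Bush"], {"kind": "Person"}),
]
doc_tokens = ["Barack", "Obama", "followed", "George", "W", ".", "Bush", "."]
for start, end, feats in match_token_sequences(doc_tokens, entries):
    print(start, end, doc_tokens[start:end], feats)
```

The indices here are token positions; an annotator would map them back to the character offsets of the first and last matched token, as in the Person annotations above.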

Notebook last updated

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1