Processing Resource ExtendedGazetteer

This processing resource works similarly to the DefaultGazetteer included in the GATE distribution: it reads one or more gazetteer list files, which may or may not contain feature values, and uses those lists to identify and annotate text spans in a document that match one or more of the list entries. However, this implementation also differs from the DefaultGazetteer in several important ways.

Init Time Parameters

NOTE: the parameter gazetteerFeatureSeparator, which was available in previous versions of this plugin, has been removed. All gazetteer list files must now use the tab character for field separation.
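To illustrate what tab-only field separation means for a list file line, here is a minimal sketch in Java. It is not the plugin's parser. The class name GazetteerLine is hypothetical, and the line layout it assumes (the entry text in the first field, followed by optional name=value feature fields) is an assumption made for this example only.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the plugin: splits one list file line on tabs.
final class GazetteerLine {
    final String entry;
    final Map<String, String> features;

    private GazetteerLine(String entry, Map<String, String> features) {
        this.entry = entry;
        this.features = Collections.unmodifiableMap(features);
    }

    // ASSUMPTION: field 0 is the entry text, every further field is name=value.
    static GazetteerLine parse(String line) {
        // The limit -1 keeps trailing empty fields instead of silently dropping them.
        String[] fields = line.split("\t", -1);
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            // Fields without a name before '=' are skipped in this sketch.
            if (eq > 0) {
                features.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
            }
        }
        return new GazetteerLine(fields[0], features);
    }
}
```

Because only the tab character separates fields, an entry such as "New York" may contain spaces without any quoting.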

Runtime Parameters

Configuration and List Files

Caching

When a gazetteer is first loaded from a config file (a .def or .yaml file, which in turn describes the gazetteer list files to load), the ExtendedGazetteer PR creates a new gazetteer cache file. This cache file has the same name as the config file but with the file extension replaced by “.gazbin”. Whenever the gazetteer is loaded and such a cache file exists, the cache file is loaded instead of the original files.

NOTE: if a cache file exists, it will always be used, even if the config or gazetteer list files have been changed in the meantime! This behavior is intentional: it lets users choose explicitly when to update the cache file. No automatic staleness check (e.g. modification dates or checksums) is performed. Instead, if the cache file should be re-created, simply delete it.
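Since the cache has to be refreshed by hand, a small helper can make that step scriptable. The following Java sketch derives the cache file location from a config file path using the naming rule described above and deletes the cache file if present. The class and method names are hypothetical and not part of the plugin; handling a config file name without any extension by appending “.gazbin” is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the plugin.
final class GazetteerCacheFiles {

    // Same directory and base name as the config file, extension replaced by ".gazbin".
    static Path cacheFileFor(Path configFile) {
        String name = configFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // ASSUMPTION: a name without an extension simply gets ".gazbin" appended.
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return configFile.resolveSibling(base + ".gazbin");
    }

    // Deletes the cache file so that it is re-created on the next load.
    // Returns true if a cache file existed and was removed.
    static boolean invalidateCache(Path configFile) throws IOException {
        return Files.deleteIfExists(cacheFileFor(configFile));
    }
}
```

Calling invalidateCache for a config file after editing its list files, and before the gazetteer is loaded again, has the same effect as deleting the “.gazbin” file manually.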

Multithreading, Custom Duplication

If a gazetteer has been loaded from a config file with a particular case-sensitivity setting, then any new instance of a gazetteer PR (either ExtendedGazetteer or FeatureGazetteer) that uses the same files and the same case-sensitivity setting will automatically share the loaded data. This avoids using up memory for something that can be shared between several PRs. This is especially useful in situations where the same pipeline (with identical gazetteers) is loaded several times in order to process documents in parallel.

Both the ExtendedGazetteer and FeatureGazetteer PRs share their data when they are duplicated by the GATE Factory.
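The sharing behavior described in this section can be pictured as a registry keyed by config file and case-sensitivity setting. The Java sketch below is only an illustration of that idea under stated assumptions: it is not the plugin's actual implementation, all names in it are hypothetical, and the loaded data is represented by a plain Object.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration of sharing loaded data between PR instances.
final class SharedGazetteerData {

    // Two PRs share data only if config location and case sensitivity both match.
    private static final class Key {
        final String configUrl;
        final boolean caseSensitive;

        Key(String configUrl, boolean caseSensitive) {
            this.configUrl = configUrl;
            this.caseSensitive = caseSensitive;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return caseSensitive == other.caseSensitive && configUrl.equals(other.configUrl);
        }

        @Override
        public int hashCode() {
            return Objects.hash(configUrl, caseSensitive);
        }
    }

    private static final Map<Key, Object> LOADED = new ConcurrentHashMap<>();

    // Runs the loader only for the first request with a given key;
    // later requests with the same key get the already loaded object.
    static Object getOrLoad(String configUrl, boolean caseSensitive, Supplier<Object> loader) {
        return LOADED.computeIfAbsent(new Key(configUrl, caseSensitive), k -> loader.get());
    }
}
```

With such a registry, loading the same pipeline several times for parallel processing costs the memory of one gazetteer per distinct combination of config file and case-sensitivity setting, rather than one per pipeline copy.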

Using from the GUI