Using Neural Networks
The LearningFramework allows the use of Neural Networks for classification and sequence tagging through two different backends:
- Pytorch - based on the Python Pytorch library. This backend provides the algorithms `PytorchWrapper_CL_DR` for processing classification instances and `PytorchWrapper_SEQ_DR` for processing sequence tagging instances.
- Keras - based on the Python Keras library. This backend provides the algorithms `KerasWrapper_CL_DR` for processing classification instances and `KerasWrapper_SEQ_DR` for processing sequence tagging instances.
Documentation for Using Neural Networks
- Installation/Preparation
- Overview (below)
- PytorchWrapper – how to use the PytorchWrapper (algorithms `PytorchWrapper_CL_DR` and `PytorchWrapper_SEQ_DR`)
- KerasWrapper – (NOT YET) how to use the KerasWrapper (algorithms `KerasWrapper_CL_DR` and `KerasWrapper_SEQ_DR`)
- gate-lf-python-data – a Python library used by both wrappers to access the JSON-format instance data generated by the LearningFramework
- gate-lf-pytorch-json – the Python library that implements Pytorch support
- gate-lf-keras-json – the Python library that implements Keras support
- NOTE: the `gate-lf-` libraries are included in the plugin and do NOT need to be installed separately!
Overview
Support for neural networks through the Pytorch and Keras wrappers follows the same basic design and is based on the same representation of the training data: an on-disk (out-of-memory) file that contains one JSON representation of an instance or sequence per line.
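Because each line of this file is a self-contained JSON record, the data can be streamed one line at a time rather than loaded as a whole. The following is a minimal Python sketch of that idea; the record layout itself is not shown or assumed here, since reading this format is handled by the gate-lf-python-data library.

```python
# Minimal sketch, assuming only what is described above: crvd.data.json holds
# one JSON-encoded instance (or sequence) per line, so it can be read line by
# line without keeping the whole corpus in memory. The actual record layout
# is handled by the gate-lf-python-data library and is not assumed here.
import json

with open("crvd.data.json", "rt", encoding="utf-8") as data_file:
    for line in data_file:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)   # one instance or one sequence of instances
        # ... inspect the record here, e.g. count targets or sequence lengths ...
```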
When a PR for training classification or chunking is used in GATE with the Pytorch or Keras wrapper,
- each document is processed and the feature specification is used to extract the necessary features
- the features and the target for the instance are written as a new line to a data file in JSON format. This file is located in the `dataDirectory` and has the name `crvd.data.json`. If a sequence is specified, a whole sequence of instances is extracted and saved as one line to the data file.
- If the PR is used with GCP, documents are processed and features extracted in parallel
- Unlike with all other algorithms (the ones having names that end in `_MR`), the corpus is not kept in memory and can thus be larger than the available memory would allow
- Once all documents have been processed, a “metafile” is created. This file contains statistics about the features used (e.g. for a feature that contains the strings of tokens, the number of different strings, the frequency of each string etc. are written to the file). The metafile contains all information in JSON format and is stored in the `dataDirectory` with the name `crvd.meta.json`
- Additional files are created in the `dataDirectory` for use by the LearningFramework at application time
- A directory is created inside the `dataDirectory` that contains the necessary Python software for the wrapper:
  - for the PytorchWrapper, a directory `FileJsonPytorch` is created. This directory contains script files for running the training (`train.sh`) and application (`apply.sh`) and contains the subdirectories `gate-lf-python-data` and `gate-lf-pytorch-json` with the two Python libraries required by the wrapper.
  - for the KerasWrapper, a directory `FileJsonKeras` is created. This directory contains script files for running the training (`train.sh`) and application (`apply.sh`) and contains the subdirectories `gate-lf-python-data` and `gate-lf-keras-json` with the two Python libraries required by the wrapper.
  - NOTE: once the directory has been created it is never overwritten! This means that you can change or add to the backend-specific library, e.g. add a new network module to `gate-lf-pytorch-json` specifically for your project.
- The training script is run, passing the name of the meta file (always `crvd.meta.json`) and a fixed name prefix for the model files (always `FileJsonPytorch` for the PytorchWrapper and `FileJsonKeras.model` for the KerasWrapper) as well as the algorithmParameters specified for the PR.
- The training script uses the `gate-lf-python-data` library to prepare the JSON-format data and convert it to a JSON representation that only contains float and integer numbers. This also prepares the actual training file and a validation file.
- The training script then uses the information in the meta file to build a default neural network for the task. Depending on the wrapper, a number of details can be overridden by parameters, and the details for embeddings are based on what has been specified, if anything, in the feature specification file.
- Once the neural network has been built, the actual training of the network starts. For this, the prepared training file is run through the network in training mode in small chunks, “batches”, of only a few instances (usually a few dozen to a few hundred, the “batch size”), optimizing the network parameters after each batch (a generic Pytorch sketch of this procedure is shown after this list).
- After a certain number of batches/instances, the optimization criterion, the “loss”, is shown together with the accuracy of the model on that batch of data.
- This is repeated until the whole training file has been processed, completing one “epoch”. Training continues for many epochs (the maximum can be specified by a parameter).
- After every epoch (or some other configurable amount of data), the model is evaluated on the validation data.
- Training ends once the maximum number of epochs has been reached or some specific stopping criterion has been met (for the PytorchWrapper, the default is that two consecutive evaluations on the validation data did not show any improvement).
- The model gets saved (either the model at the time training was terminated or the best model encountered until then).
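The following is a generic, self-contained Pytorch sketch of the batch/epoch scheme just described, including validation after every epoch and stopping after two validations without improvement. It is only an illustration, not the wrapper's code: the network, data and hyperparameters here are made up, whereas the wrapper builds the actual network from the meta file and the algorithm parameters.

```python
# Generic illustration of mini-batch training with validation-based early
# stopping; the data and network below are toy stand-ins, not the wrapper's.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data standing in for the converted, all-numeric training and validation files.
X_train, y_train = torch.randn(1000, 20), torch.randint(0, 3, (1000,))
X_val, y_val = torch.randn(200, 20), torch.randint(0, 3, (200,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

# Stand-in network; the wrapper builds the real one from the meta file.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

best_val_acc, patience, bad_validations = 0.0, 2, 0
for epoch in range(50):                          # maximum number of epochs
    model.train()
    for xb, yb in train_loader:                  # one "batch" at a time
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)            # the "loss" on this batch
        loss.backward()
        optimizer.step()                         # optimize after each batch
    model.eval()
    with torch.no_grad():                        # validate after every epoch
        val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
    print(f"epoch {epoch}: validation accuracy {val_acc:.3f}")
    if val_acc > best_val_acc:
        best_val_acc, bad_validations = val_acc, 0
        torch.save(model.state_dict(), "best-model.pt")  # keep the best model seen so far
    else:
        bad_validations += 1
        if bad_validations >= patience:          # stop after two validations without improvement
            break
```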
When a PR for application is run:
- The `apply.sh` script is run, passing the default name prefix of the model. This first loads and activates the saved model, then starts reading data from the GATE process and sending response data back to the GATE process until the GATE process stops sending data (a sketch of such a request/response loop is shown after this list).
- Each document in the corpus is processed, and for each instance the features are extracted; if a sequence annotation is specified, features for a sequence of instances are extracted.
- The instance or sequence is converted to a JSON representation and sent to the apply script process.
- The apply script process converts the JSON data, runs it through the trained model, generates the predictions and converts them back to JSON format, which is then sent back to the GATE process.
- The GATE process converts the JSON and uses the data to annotate the instance or sequence of instances.
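Below is a minimal Python sketch of what such a line-based JSON request/response loop looks like. The field names and the predict() placeholder are hypothetical; the real protocol and model handling are implemented in the gate-lf-pytorch-json and gate-lf-keras-json libraries.

```python
# Minimal sketch of a line-based JSON request/response loop like the one
# described above. The field names ("features", "prediction") and predict()
# are placeholders, NOT the real protocol of the wrapper libraries; the real
# apply script first restores the trained Pytorch/Keras model.
import json
import sys

def predict(features):
    """Placeholder standing in for running the restored network on one
    instance or sequence of instances."""
    return "O"  # dummy label

for line in sys.stdin:              # the loop ends when the GATE process closes the stream
    line = line.strip()
    if not line:
        continue
    request = json.loads(line)
    response = {"prediction": predict(request.get("features"))}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()              # respond immediately so the GATE process is not left waiting
```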
Training when using a Neural Network backend
Note that training a neural network on a large corpus can take a very long time. In some cases, and with very complex network architectures, it may be necessary to train on a different computer from the one on which the GATE LearningFramework was run. Exploring different architectures or hyperparameters may require many training runs, and sometimes it is convenient to run those in parallel on several computers.
Note also that re-training variations of the network, or re-training with different hyperparameters, does not actually need the step in which the data and meta files are created from the original GATE document corpus: unless the feature specification is changed, these files will be identical.
The way the neural network backends are implemented makes it easy to concentrate on just re-running the actual training step (running `train.sh`) directly from the command line, either on the same computer on which the content of the `dataDirectory` was created, or on a different computer that has Python and the required Python packages installed, by copying the whole directory to that computer. That way the user can experiment with modified networks or hyperparameters until the validation accuracy looks good. The model created that way can then be transferred back to the `dataDirectory` where the application to new documents should be carried out with GATE.
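As a rough illustration of this workflow, the following Python sketch copies a `dataDirectory` to a working location and re-runs the training script there. The paths are hypothetical and the arguments expected by `train.sh` are deliberately left as a placeholder, since they depend on the wrapper and the algorithm parameters used; consult the generated scripts themselves for the exact calling convention.

```python
# Rough sketch only: the paths are hypothetical and the train.sh arguments are
# a placeholder -- the exact calling convention depends on the wrapper and the
# algorithm parameters, so check the generated scripts before running this.
import shutil
import subprocess

data_dir = "/path/to/dataDirectory"             # created by the training PR (hypothetical path)
work_dir = "/scratch/experiment/dataDirectory"  # copy on the machine used for training (hypothetical)

shutil.copytree(data_dir, work_dir)             # copy everything, including FileJsonPytorch

train_args = []                                 # placeholder: meta file name, model name prefix, parameters
subprocess.run(["./train.sh", *train_args],
               cwd=f"{work_dir}/FileJsonPytorch",  # assumption: run the script from its own directory
               check=True)

# Afterwards, copy the saved model files back into the original dataDirectory
# so that the application PR in GATE can pick them up.
```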