Using Neural Networks
The LearningFramework allows the use of Neural Networks for classification and sequence tagging through two different backends:
- Pytorch - based on the Python Pytorch library. This backend provides the algorithms `PytorchWrapper_CL_DR` for processing classification instances and `PytorchWrapper_SEQ_DR` for processing sequence tagging instances.
- Keras - based on the Python Keras library. This backend provides the algorithms `KerasWrapper_CL_DR` for processing classification instances and `KerasWrapper_SEQ_DR` for processing sequence tagging instances.
Documentation for Using Neural Networks
- Installation/Preparation
- Overview (below)
- PytorchWrapper – how to use the PytorchWrapper (algorithms `PytorchWrapper_CL_DR` and `PytorchWrapper_SEQ_DR`)
- KerasWrapper – (NOT YET) how to use the KerasWrapper (algorithms `KerasWrapper_CL_DR` and `KerasWrapper_SEQ_DR`)
- gate-lf-python-data – a Python library used by both wrappers to access the JSON-format instance data generated by the LearningFramework
- gate-lf-pytorch-json – the Python library that implements Pytorch support
- gate-lf-keras-json – the Python library that implements Keras support
- NOTE: the `gate-lf-` libraries are included in the plugin and do NOT need to be installed separately!
Overview
Support for neural networks through the Pytorch and Keras wrappers follows the same basic design and is based on the same representation of the training data: an on-disk (out-of-memory) file that contains one JSON representation of an instance or sequence per line.
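Because each line of this file is a self-contained JSON record, the data can be streamed one line at a time rather than loaded as a whole. The following is a minimal Python sketch of that idea; the record layout itself is not shown or assumed here, since reading this format is handled by the gate-lf-python-data library.

```python
# Minimal sketch, assuming only what is described above: crvd.data.json holds
# one JSON-encoded instance (or sequence) per line, so it can be read line by
# line without keeping the whole corpus in memory. The actual record layout
# is handled by the gate-lf-python-data library and is not assumed here.
import json

with open("crvd.data.json", "rt", encoding="utf-8") as data_file:
    for line in data_file:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)   # one instance or one sequence of instances
        # ... inspect the record here, e.g. count targets or sequence lengths ...
```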
When a PR for training classification or chunking is used in GATE with the Pytorch or Keras wrapper,
- each document is processed and the feature specification is used to extract the necessary features
- the features and the target for the instance are written as a new line to a data file in JSON format. This file is located in the `dataDirectory` and has the name `crvd.data.json`. If a sequence is specified, a whole sequence of instances is extracted and saved as one line to the data file.
- If the PR is used with GCP, documents are processed and features extracted in parallel
- Unlike with all other algorithms (the ones having names that end in `_MR`), the corpus is not kept in memory and can thus be larger than the available memory would allow
- Once all documents have been processed, a “metafile” is created. This file contains statistics about the features used (e.g. for a feature that contains the strings of tokens, the number of different strings, the frequency of each string etc. are written to the file). The metafile contains all information in JSON format and is stored in the `dataDirectory` with the name `crvd.meta.json`
- Additional files are created in the `dataDirectory` for use by the LearningFramework at application time
- A directory is created inside the `dataDirectory` that contains the necessary Python software for the wrapper:
  - for the PytorchWrapper, a directory `FileJsonPytorch` is created. This directory contains script files for running the training (`train.sh`) and application (`apply.sh`) and contains the subdirectories `gate-lf-python-data` and `gate-lf-pytorch-json` with the two Python libraries required by the wrapper.
  - for the KerasWrapper, a directory `FileJsonKeras` is created. This directory contains script files for running the training (`train.sh`) and application (`apply.sh`) and contains the subdirectories `gate-lf-python-data` and `gate-lf-keras-json` with the two Python libraries required by the wrapper.
  - NOTE: once the directory has been created it is never overwritten! This means that you can change or add to the backend-specific library, e.g. add a new network module to `gate-lf-pytorch-json` specifically for your project.
- The training script is run, passing the name of the meta file (always `crvd.meta.json`) and a fixed name prefix for the model files (always `FileJsonPytorch` for the PytorchWrapper and `FileJsonKeras.model` for the KerasWrapper) as well as the algorithmParameters specified for the PR.
- The training script uses the `gate-lf-python-data` library to prepare the JSON-format data and convert it to a JSON representation that only contains float and integer numbers. This also prepares the actual training file and a validation file.
- The training script then uses the information in the meta file to build a default neural network for the task. Depending on the wrapper, a number of details can be overridden by parameters, and the details for embeddings are based on what has been specified, if anything, in the feature specification file.
- Once the neural network has been built, the actual training of the network starts. For this, the prepared training file is run through the network in training mode in small chunks, “batches”, of only a few instances (usually a few dozen to a few hundred, the “batch size”), optimizing the network parameters after each batch (a generic Pytorch sketch of this procedure is shown after this list).
- After a certain number of batches/instances, the optimization criterion, the “loss”, is shown together with the accuracy of the model on that batch of data.
- This is repeated until the whole training file has been processed, completing one “epoch”. Training continues for many epochs (the maximum can be specified by a parameter).
- After every epoch (or some other configurable amount of data), the model is evaluated on the validation data.
- Training ends once the maximum number of epochs has been reached or some specific stopping criterion has been met (for the PytorchWrapper, the default is that two consecutive evaluations on the validation data did not show any improvement).
- The model gets saved (either the model at the time training was terminated or the best model encountered until then).
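The following is a generic, self-contained Pytorch sketch of the batch/epoch scheme just described, including validation after every epoch and stopping after two validations without improvement. It is only an illustration, not the wrapper's code: the network, data and hyperparameters here are made up, whereas the wrapper builds the actual network from the meta file and the algorithm parameters.

```python
# Generic illustration of mini-batch training with validation-based early
# stopping; the data and network below are toy stand-ins, not the wrapper's.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data standing in for the converted, all-numeric training and validation files.
X_train, y_train = torch.randn(1000, 20), torch.randint(0, 3, (1000,))
X_val, y_val = torch.randn(200, 20), torch.randint(0, 3, (200,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

# Stand-in network; the wrapper builds the real one from the meta file.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

best_val_acc, patience, bad_validations = 0.0, 2, 0
for epoch in range(50):                          # maximum number of epochs
    model.train()
    for xb, yb in train_loader:                  # one "batch" at a time
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)            # the "loss" on this batch
        loss.backward()
        optimizer.step()                         # optimize after each batch
    model.eval()
    with torch.no_grad():                        # validate after every epoch
        val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
    print(f"epoch {epoch}: validation accuracy {val_acc:.3f}")
    if val_acc > best_val_acc:
        best_val_acc, bad_validations = val_acc, 0
        torch.save(model.state_dict(), "best-model.pt")  # keep the best model seen so far
    else:
        bad_validations += 1
        if bad_validations >= patience:          # stop after two validations without improvement
            break
```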
When a PR for application is run:
- The `apply.sh` script is run, passing the default name prefix of the model. This first loads and activates the saved model, then starts reading data from the GATE process and sending response data back to the GATE process until the GATE process stops sending data (a sketch of such a request/response loop is shown after this list).
- Each document in the corpus is processed, and for each instance the features are extracted; if a sequence annotation is specified, features for a sequence of instances are extracted.
- The instance or sequence is converted to a JSON representation and sent to the apply script process.
- The apply script process converts the JSON data, runs it through the trained model, generates the predictions and converts them back to JSON format, which is then sent back to the GATE process.
- The GATE process converts the JSON and uses the data to annotate the instance or sequence of instances.
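Below is a minimal Python sketch of what such a line-based JSON request/response loop looks like. The field names and the predict() placeholder are hypothetical; the real protocol and model handling are implemented in the gate-lf-pytorch-json and gate-lf-keras-json libraries.

```python
# Minimal sketch of a line-based JSON request/response loop like the one
# described above. The field names ("features", "prediction") and predict()
# are placeholders, NOT the real protocol of the wrapper libraries; the real
# apply script first restores the trained Pytorch/Keras model.
import json
import sys

def predict(features):
    """Placeholder standing in for running the restored network on one
    instance or sequence of instances."""
    return "O"  # dummy label

for line in sys.stdin:              # the loop ends when the GATE process closes the stream
    line = line.strip()
    if not line:
        continue
    request = json.loads(line)
    response = {"prediction": predict(request.get("features"))}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()              # respond immediately so the GATE process is not left waiting
```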
Training when using a Neural Network backend
Note that training a neural network on a large corpus can take a very long time. In some cases, and with very complex network architectures, it may be necessary to train on a different computer from the one on which the GATE LearningFramework was run. Exploring different architectures or hyperparameters may require many training runs, and sometimes it is convenient to run those in parallel on several computers.
Note also that re-training variations of the network, or re-training with different hyperparameters, does not actually need the step in which the data and meta files are created from the original GATE document corpus: unless the feature specification is changed, these files will be identical.
The way the neural network backends are implemented makes it easy to concentrate on just re-running the actual training step (running `train.sh`) directly from the command line, either on the same computer on which the content of the `dataDirectory` was created, or on a different computer that has Python and the required Python packages installed, by copying the whole directory to that computer. That way the user can experiment with modified networks or hyperparameters until the validation accuracy looks good. The model created that way can then be transferred back to the `dataDirectory` where the application to new documents should be carried out with GATE.
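As a rough illustration of this workflow, the following Python sketch copies a `dataDirectory` to a working location and re-runs the training script there. The paths are hypothetical and the arguments expected by `train.sh` are deliberately left as a placeholder, since they depend on the wrapper and the algorithm parameters used; consult the generated scripts themselves for the exact calling convention.

```python
# Rough sketch only: the paths are hypothetical and the train.sh arguments are
# a placeholder -- the exact calling convention depends on the wrapper and the
# algorithm parameters, so check the generated scripts before running this.
import shutil
import subprocess

data_dir = "/path/to/dataDirectory"             # created by the training PR (hypothetical path)
work_dir = "/scratch/experiment/dataDirectory"  # copy on the machine used for training (hypothetical)

shutil.copytree(data_dir, work_dir)             # copy everything, including FileJsonPytorch

train_args = []                                 # placeholder: meta file name, model name prefix, parameters
subprocess.run(["./train.sh", *train_args],
               cwd=f"{work_dir}/FileJsonPytorch",  # assumption: run the script from its own directory
               check=True)

# Afterwards, copy the saved model files back into the original dataDirectory
# so that the application PR in GATE can pick them up.
```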