Overview
The Learning Framework is GATE’s most recent machine learning plugin. It is still under active development, but stable enough to use. However, future versions may introduce changes that are not backwards compatible, meaning that pipelines may only work with the version they were created with, or that saved models may not be compatible between versions.
It offers a wider variety of more up-to-date ML algorithms than the earlier machine learning plugins. Currently, the following are supported natively (directly integrated in the plugin code):
- most Mallet classification algorithms
- Mallet’s CRF implementation
- LibSVM for classification and regression, using the Java implementation of the original LibSVM code.
The following libraries and tools are available in the LearningFramework through a wrapper (see below):
- Weka through the weka-wrapper, see Using Weka
- SciKit-Learn through the sklearn-wrapper, see Using SciKit Learn
- CostSensitiveClassification through the sklearn-wrapper, see Using CostCla
- PytorchJson: this is a built-in wrapper to use Pytorch neural networks, see Using Neural Networks
- KerasJson: this is a built-in wrapper to use Keras neural networks, see Using Neural Networks
Wrappers are software which runs the machine learning library or tool in a separate process; the LearningFramework communicates with the wrapper software for training and application by providing a file or by sending and receiving data. This solution is used for one or both of two reasons:
- the license of the machine learning library or tool is not compatible with the license of the LearningFramework (e.g. Weka) and it therefore cannot be distributed with it
- the machine learning tool is written in a different language, e.g. Python (Keras, Pytorch, SciKit-Learn).
Finally, the application of a trained model can also be performed through an HTTP model application server. The LearningFramework supports a very simple HTTP protocol: feature vectors are sent to the server in JSON format, the model predictions are returned, and they are then applied to the document that is being processed. See ServerForApplication; a rough sketch of such a server is shown below.
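The following sketch (plain Java, using the JDK’s built-in HTTP server) only illustrates the general shape of such a server. The endpoint path, the JSON payload layout and the stub prediction are illustrative assumptions, not the actual protocol, which is specified on the ServerForApplication page.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch of a model application server: it accepts a POST request whose
// body contains feature vectors in JSON and returns predictions in JSON.
// The path "/predict", the payload layout and the constant prediction are
// illustrative assumptions; see the ServerForApplication page for the real protocol.
public class SketchApplicationServer {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/predict", exchange -> {
      // Read the JSON request body (the feature vectors sent by the application PR)
      byte[] request;
      try (InputStream in = exchange.getRequestBody()) {
        request = in.readAllBytes();
      }
      // A real server would load the trained model once at startup, parse the JSON,
      // run the model on each feature vector and serialise the predictions;
      // here we just return a stub response.
      String response = "{\"predictions\": [\"label1\"]}";
      byte[] body = response.getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().set("Content-Type", "application/json");
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
  }
}
```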
Supported Machine Learning Tasks
The Learning Framework supports the following tasks:
- Classification, which simply assigns a class to each instance annotation. For example, each sentence might be classified as having positive or negative sentiment, each word may get assigned a part-of-speech tag, or a document may be classified as being relevant to some topic or not. With classification, the parts of text are known in advance and assigned one out of several possible class labels.
- Sequence tagging, also called chunking, which finds mentions, such as locations or persons, within the text, i.e. the relevant parts of text are not known in advance and the task is to find them.
- Topic modelling, an unsupervised learning method where the algorithm processes chunks of text / documents and tries to infer “topics” (distributions over words), and then tries to assign a distribution over topics to each text/document.
- Regression, which assigns a numerical target, and might be used to rank disambiguation candidates, for example. This is similar to classification in that the relevant parts of text (sentences, words, …) are known in advance, but instead of a nominal class label, a numeric value is assigned to those parts.
- Exporting of training data in various formats, including ARFF, CSV, TSV, MatrixMarket, and a dense JSON format
- Evaluation (only for some algorithms)
These tasks are provided as separate processing resources (PRs), with separate PRs for training and application, and evaluation PRs for classification and regression. Get started here!
In addition, the plugin contains PRs that help with the creation of features to use for a machine learning task:
- Affixes: given a string, e.g. a token string, generate features that contain the length-k suffixes and/or prefixes of the string (see LF_GenFeatures_Affixes); a toy sketch after this list illustrates the idea
- WordShape: given a string, e.g. a token string, generate a more generic string that represents the kind of characters present in the string (see LF_GenFeatures_Misc)
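As a toy illustration of what the Affixes PR computes (this is not the plugin code, and the feature names are made up for the example), length-k prefixes and suffixes of a token string can be generated like this:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not the plugin code): generate length-k prefixes and
// suffixes of a token string, as the Affixes PR does for feature generation.
public class AffixSketch {
  static List<String> affixes(String token, int maxK) {
    List<String> feats = new ArrayList<>();
    for (int k = 1; k <= Math.min(maxK, token.length()); k++) {
      feats.add("prefix" + k + "=" + token.substring(0, k));
      feats.add("suffix" + k + "=" + token.substring(token.length() - k));
    }
    return feats;
  }
  public static void main(String[] args) {
    // For "walking" and maxK=3: prefix1=w, suffix1=g, prefix2=wa, suffix2=ng, ...
    System.out.println(affixes("walking", 3));
  }
}
```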
Note that PRs from other plugins can also be very useful to generate features:
- StringAnnotation plugin: the FeatureGazetteer PR can be used to add features to an annotation by looking up the string of one of its features in a gazetteer and retrieving the features stored there. This can be used e.g. to add cluster ids (embedding cluster ids, Brown cluster ids, distsim cluster ids) or other features.
- CorpusStats plugin: this can be used to calculate corpus-based word/term statistics like term frequency, document frequency, TF*IDF and others and to assign those statistics as features.
- JdbcLookup plugin: can be used to add features from a JDBC database
- Java plugin and Groovy plugin for general-purpose coding
Feature Overview
- Supports classification, regression, sequence tagging, topic modelling
- Supports learning algorithms from: LibSVM, Mallet, Weka (using a wrapper software), Scikit-Learn (using a wrapper software), Keras, Pytorch
- Supports various ways of handling missing values
- Supports sparse coding of nominal values as one-of-k or as “value number”
- Supports instance weights (limited support depending on the algorithm used)
- Supports per-instance classification cost vectors instead of a class label for classification (only works with algorithms that support per-instance costs)
- Supports limiting attribute lists to only those annotations which are within another containing annotation
- Supports using pre-calculated scores for one-of-k coded nominal values, e.g. pre-calculated TF*IDF scores for terms or ngrams (for n-grams with n>1 the final score is calculated as the product of the individual pre-calculated gram scores)
- Supports multi-valued annotation features for one-of-k coded nominal attributes: for example, if the annotation feature is a List, a dimension / feature is created for each element in the list
- Supports multi-valued annotation features for numeric attributes: in this case the elements (which must be doubles or must be convertible to doubles) are “spliced” into the final feature vector, e.g. for making use of pre-calculated word embeddings (see the toy sketch after this list)
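As a toy illustration of the last two points (this is not the plugin’s actual internal representation, and the feature names are invented for the example), one-of-k coding of a list-valued nominal feature creates one dimension per list element, while a numeric list such as a pre-calculated embedding is spliced into the final vector as consecutive dimensions:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (not the plugin's internal code) of how list-valued
// annotation features could end up in a feature vector.
public class FeatureCodingSketch {
  public static void main(String[] args) {
    Map<String, Double> vector = new LinkedHashMap<>();
    // One-of-k coding of a multi-valued nominal feature: one dimension per element.
    List<String> categories = List.of("sports", "politics");
    for (String value : categories) {
      vector.put("category=" + value, 1.0);
    }
    // Numeric list feature, e.g. a pre-calculated word embedding: the values
    // are "spliced" into the vector as consecutive dimensions.
    double[] embedding = {0.12, -0.03, 0.57};
    for (int i = 0; i < embedding.length; i++) {
      vector.put("embedding_" + i, embedding[i]);
    }
    System.out.println(vector);
  }
}
```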
Processing Resources:
- LF_TrainClassification train a classification model
- LF_ApplyClassification apply a trained classification model
- LF_TrainRegression train a regression model
- LF_ApplyRegression apply a trained regression model
- LF_TrainChunking train a model for sequence tagging / chunking
- LF_ApplyChunking apply a trained model for sequence tagging / chunking
- LF_TrainTopicModel train an LDA topic model
- LF_ApplyTopicModel find topic distribution for new documents/texts
- LF_Export export a training set to an external file
- LF_EvaluateClassification estimate classification accuracy
- LF_EvaluateRegression estimate regression quality
- LF_GenFeatures_Affixes generate features from prefixes and suffixes
- LF_GenFeatures_Misc generate other features like word shape
Example pipelines, tutorials etc
- Pipeline_LF_TrainTopicModel_Mallet_EN prepared pipeline for filtering tokens and training a Mallet topic model
Other important documentation pages:
- FeatureSpecification all about the feature specification file and what it can contain as well as how machine learning features are created from the original document annotations
- AlgorithmParameters some general notes about algorithm parameters. Most parameters are documented on the wiki page of the PR where they can be used
- DNN Preparation how to install Python and prepare for using the Pytorch/Keras backends
- DNN WrapperConfig documents the wrapper configuration file for the Pytorch and Keras backends
- UsingWeka all about how to use Weka with the LearningFramework plugin.
- UsingSklearn all about how to use SciKit Learn with the LearningFramework plugin.
- UsingCostCla all about how to use CostCla (https://github.com/albahnsen/CostSensitiveClassification) with the LearningFramework plugin
- VectorValues how to use pre-calculated dense vectors like embeddings and other vector-valued features
- SavedFiles the files that get saved as a result of training or exporting
- ServerForApplication describes the interaction with an HTTP server for carrying out the application of trained models
- API how to use the LearningFramework classes from Java/Scala/Groovy code and examples of how to use it with the GATE Java Plugin; a rough sketch of creating one of the PRs from Java code is given at the end of this page
- FAQs
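As a rough sketch of using the plugin from Java code (see the API page for the authoritative examples): the Maven coordinates, plugin version, resource class name and parameter names used below are assumptions for illustration only and should be checked against the plugin’s documentation and creole metadata.

```java
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.Plugin;

// Rough sketch of creating an LF PR from Java code. The Maven coordinates,
// resource class name and parameter names are assumptions for illustration;
// consult the API page and the plugin's creole metadata for the real values.
public class LfApiSketch {
  public static void main(String[] args) throws Exception {
    Gate.init();
    // Assumed coordinates and version of the LearningFramework plugin.
    Gate.getCreoleRegister().registerPlugin(
        new Plugin.Maven("uk.ac.gate.plugins", "learningframework", "4.2"));
    FeatureMap params = Factory.newFeatureMap();
    // Assumed parameter names; see the LF_TrainClassification page for the real ones.
    params.put("instanceType", "Sentence");
    params.put("featureSpecURL", new java.io.File("featurespec.xml").toURI().toURL());
    ProcessingResource trainPr = (ProcessingResource) Factory.createResource(
        "gate.plugin.learningframework.LF_TrainClassification", params);
    // The PR can then be added to a SerialAnalyserController and run over a corpus.
  }
}
```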