Overview
The Learning Framework is GATE’s most recent machine learning plugin. It is still under active development, but stable enough to use. However, future versions may introduce changes that are not backwards compatible, meaning that pipelines may only work with the version they were created with, or that saved models may not be compatible between versions.
It offers a wider variety of more up-to-date ML algorithms than the earlier machine learning plugins. Currently, the following are supported natively (directly integrated in the plugin code):
- most Mallet classification algorithms
- Mallet’s CRF implementation
- LibSVM for classification and regression, using the Java implementation of the original LibSVM code.
The following libraries and tools are available in the LearningFramework through a wrapper (see below):
- Weka through the weka-wrapper, see Using Weka
- SciKit-Learn through the sklearn-wrapper, see Using SciKit Learn
- CostSensitiveClassification through the sklearn-wrapper, see Using CostCla
- PytorchJson: this is a built-in wrapper to use Pytorch neural networks, see Using Neural Networks
- KerasJson: this is a built-in wrapper to use Keras neural networks, see Using Neural Networks
Wrappers are software which runs the machine learning library or tool in a separate process; the LearningFramework communicates with the wrapper software for training and application by providing a file or by sending and receiving data. This solution is used for one or both of two reasons:
- the license of the machine learning library or tool is not compatible with the license of the LearningFramework (e.g. Weka) and it therefore cannot be distributed with it
- the machine learning tool is written in a different language, e.g. Python (Keras, Pytorch, SciKit-Learn).
Finally, the application of a trained model can also be performed through an HTTP model application server. The LearningFramework supports a very simple HTTP protocol: feature vectors are sent to the server in JSON format, the model predictions are returned, and they are then applied to the document that is being processed. See ServerForApplication; a rough sketch of such a server is shown below.
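The following sketch (plain Java, using the JDK’s built-in HTTP server) only illustrates the general shape of such a server. The endpoint path, the JSON payload layout and the stub prediction are illustrative assumptions, not the actual protocol, which is specified on the ServerForApplication page.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch of a model application server: it accepts a POST request whose
// body contains feature vectors in JSON and returns predictions in JSON.
// The path "/predict", the payload layout and the constant prediction are
// illustrative assumptions; see the ServerForApplication page for the real protocol.
public class SketchApplicationServer {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/predict", exchange -> {
      // Read the JSON request body (the feature vectors sent by the application PR)
      byte[] request;
      try (InputStream in = exchange.getRequestBody()) {
        request = in.readAllBytes();
      }
      // A real server would load the trained model once at startup, parse the JSON,
      // run the model on each feature vector and serialise the predictions;
      // here we just return a stub response.
      String response = "{\"predictions\": [\"label1\"]}";
      byte[] body = response.getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().set("Content-Type", "application/json");
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
  }
}
```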
Supported Machine Learning Tasks
The Learning Framework supports the following tasks:
- Classification, which simply assigns a class to each instance annotation. For example, each sentence might be classified as having positive or negative sentiment, each word may get assigned a part-of-speech tag, or a document may be classified as being relevant to some topic or not. With classification, the parts of text are known in advance and assigned one out of several possible class labels.
- Sequence tagging, also called chunking, which finds mentions, such as locations or persons, within the text, i.e. the relevant parts of text are not known in advance and the task is to find them.
- Topic modelling, an unsupervised learning method where the algorithm processes chunks of text / documents and tries to infer “topics” (distributions over words), and then tries to assign a distribution over topics to each text/document.
- Regression, which assigns a numerical target, and might be used to rank disambiguation candidates, for example. This is similar to classification in that the relevant parts of text (sentences, words, …) are known in advance, but instead of a nominal class label, a numeric value is assigned to those parts.
- Exporting of training data in various formats, including ARFF, CSV, TSV, MatrixMarket, and a dense JSON format
- Evaluation (only for some algorithms)
These tasks are provided as separate processing resources (PRs), with separate PRs for training and application, and evaluation PRs for classification and regression. Get started here!
In addition, the plugin contains PRs that help with the creation of features to use for a machine learning task:
- Affixes: given a string, e.g. a token string, generate features that contain the length-k suffixes and/or prefixes of the string (see LF_GenFeatures_Affixes); a toy sketch after this list illustrates the idea
- WordShape: given a string, e.g. a token string, generate a more generic string that represents the kind of characters present in the string (see LF_GenFeatures_Misc)
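As a toy illustration of what the Affixes PR computes (this is not the plugin code, and the feature names are made up for the example), length-k prefixes and suffixes of a token string can be generated like this:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not the plugin code): generate length-k prefixes and
// suffixes of a token string, as the Affixes PR does for feature generation.
public class AffixSketch {
  static List<String> affixes(String token, int maxK) {
    List<String> feats = new ArrayList<>();
    for (int k = 1; k <= Math.min(maxK, token.length()); k++) {
      feats.add("prefix" + k + "=" + token.substring(0, k));
      feats.add("suffix" + k + "=" + token.substring(token.length() - k));
    }
    return feats;
  }
  public static void main(String[] args) {
    // For "walking" and maxK=3: prefix1=w, suffix1=g, prefix2=wa, suffix2=ng, ...
    System.out.println(affixes("walking", 3));
  }
}
```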
Note that PRs from other plugins can also be very useful to generate features:
- StringAnnotation plugin: the FeatureGazetteer PR can be used to add features to an annotation by looking up the string of one of its features in a gazetteer and retrieving the features stored there. This can be used e.g. to add cluster ids (embedding cluster ids, Brown cluster ids, distsim cluster ids) or other features.
- CorpusStats plugin: this can be used to calculate corpus-based word/term statistics like term frequency, document frequency, TF*IDF and others and to assign those statistics as features.
- JdbcLookup plugin: can be used to add features from a JDBC database
- Java plugin and Groovy plugin for general-purpose coding
Feature Overview
- Supports classification, regression, sequence tagging, topic modelling
- Supports learning algorithms from: LibSVM, Mallet, Weka (using a wrapper software), Scikit-Learn (using a wrapper software), Keras, Pytorch
- Supports various ways of handling missing values
- Supports sparse coding of nominal values as one-of-k or as “value number”
- Supports instance weights (limited support depending on the algorithm used)
- Supports per-instance classification cost vectors instead of a class label for classification (only works with algorithms that support per-instance costs)
- Supports limiting attribute lists to only those annotations which are within another containing annotation
- Supports using pre-calculated scores for one-of-k coded nominal values, e.g. pre-calculated TF*IDF scores for terms or ngrams (for n-grams with n>1 the final score is calculated as the product of the individual pre-calculated gram scores)
- Supports multi-valued annotation features for one-of-k coded nominal attributes: for example, if the annotation feature is a List, a dimension / feature is created for each element in the list
- Supports multi-valued annotation features for numeric attributes: in this case the elements (which must be doubles or must be convertible to doubles) are “spliced” into the final feature vector, e.g. for making use of pre-calculated word embeddings (see the toy sketch after this list)
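As a toy illustration of the last two points (this is not the plugin’s actual internal representation, and the feature names are invented for the example), one-of-k coding of a list-valued nominal feature creates one dimension per list element, while a numeric list such as a pre-calculated embedding is spliced into the final vector as consecutive dimensions:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (not the plugin's internal code) of how list-valued
// annotation features could end up in a feature vector.
public class FeatureCodingSketch {
  public static void main(String[] args) {
    Map<String, Double> vector = new LinkedHashMap<>();
    // One-of-k coding of a multi-valued nominal feature: one dimension per element.
    List<String> categories = List.of("sports", "politics");
    for (String value : categories) {
      vector.put("category=" + value, 1.0);
    }
    // Numeric list feature, e.g. a pre-calculated word embedding: the values
    // are "spliced" into the vector as consecutive dimensions.
    double[] embedding = {0.12, -0.03, 0.57};
    for (int i = 0; i < embedding.length; i++) {
      vector.put("embedding_" + i, embedding[i]);
    }
    System.out.println(vector);
  }
}
```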
Processing Resources:
- LF_TrainClassification train a classification model
- LF_ApplyClassification apply a trained classification model
- LF_TrainRegression train a regression model
- LF_ApplyRegression apply a trained regression model
- LF_TrainChunking train a model for sequence tagging / chunking
- LF_ApplyChunking apply a trained model for sequence tagging / chunking
- LF_TrainTopicModel train an LDA topic model
- LF_ApplyTopicModel find topic distribution for new documents/texts
- LF_Export export a training set to an external file
- LF_EvaluateClassification estimate classification accuracy
- LF_EvaluateRegression estimate regression quality
- LF_GenFeatures_Affixes generate features from prefixes and suffixes
- LF_GenFeatures_Misc generate other features like word shape
Example pipelines, tutorials etc
- Pipeline_LF_TrainTopicModel_Mallet_EN prepared pipeline for filtering tokens and training a Mallet topic model
Other important documentation pages:
- FeatureSpecification all about the feature specification file and what it can contain as well as how machine learning features are created from the original document annotations
- AlgorithmParameters some general notes about algorithm parameters. Most parameters are documented on the wiki page of the PR where they can be used
- DNN Preparation how to install Python and prepare for using the Pytorch/Keras backends
- DNN WrapperConfig documents the wrapper configuration file for the Pytorch and Keras backends
- UsingWeka all about how to use Weka with the LearningFramework plugin.
- UsingSklearn all about how to use SciKit Learn with the LearningFramework plugin.
- UsingCostCla all about how to use CostCla (https://github.com/albahnsen/CostSensitiveClassification) with the LearningFramework plugin
- VectorValues how to use pre-calculated dense vectors like embeddings and other vector-valued features
- SavedFiles the files that get saved as a result of training or exporting
- ServerForApplication describes the interaction with an HTTP server for carrying out the application of trained models
- API how to use the LearningFramework classes from Java/Scala/Groovy code and examples of how to use it with the GATE Java Plugin; a rough sketch of creating one of the PRs from Java code is given at the end of this page
- FAQs
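As a rough sketch of using the plugin from Java code (see the API page for the authoritative examples): the Maven coordinates, plugin version, resource class name and parameter names used below are assumptions for illustration only and should be checked against the plugin’s documentation and creole metadata.

```java
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.Plugin;

// Rough sketch of creating an LF PR from Java code. The Maven coordinates,
// resource class name and parameter names are assumptions for illustration;
// consult the API page and the plugin's creole metadata for the real values.
public class LfApiSketch {
  public static void main(String[] args) throws Exception {
    Gate.init();
    // Assumed coordinates and version of the LearningFramework plugin.
    Gate.getCreoleRegister().registerPlugin(
        new Plugin.Maven("uk.ac.gate.plugins", "learningframework", "4.2"));
    FeatureMap params = Factory.newFeatureMap();
    // Assumed parameter names; see the LF_TrainClassification page for the real ones.
    params.put("instanceType", "Sentence");
    params.put("featureSpecURL", new java.io.File("featurespec.xml").toURI().toURL());
    ProcessingResource trainPr = (ProcessingResource) Factory.createResource(
        "gate.plugin.learningframework.LF_TrainClassification", params);
    // The PR can then be added to a SerialAnalyserController and run over a corpus.
  }
}
```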