Getting Started

Some familiarity with GATE is assumed here. To get started with GATE, check out the user guide and video tutorials.

Overview of using the plugin

To get started using machine learning in GATE, you need to decide which type of task you want to perform. Are you interested in finding things in the text, such as names of people or locations? If so, you have a named entity recognition or “chunking” task. If, instead, you want to label each sentence or each document with some class (“positive” or “negative” would be common in sentiment analysis), then you have a classification task. If you want to learn a numeric value, for example predicting the star rating a product reviewer assigns on the basis of the text of their review, then you have a regression task. There are separate PRs for training a model for each of these tasks, and separate PRs for applying a trained model to new documents.

A pre-annotated corpus is central to any attempt to apply machine learning, so you will need to obtain and prepare a good quality corpus. Ideally your corpus should contain at least thousands of training instances; for simple tasks, some users have been successful with just a few hundred, but a more complex task may require tens or hundreds of thousands to produce a good result. The corpus needs to contain the annotations or feature values that the trained model should later assign to new documents.

In most cases, the corpus needs to be annotated by humans to produce a “gold standard”. These “right answers” are what the machine learning system will attempt to replicate on unseen data. For named entity recognition, this involves going through the corpus and putting annotations of a particular type over every occurrence of the target entity; for example, annotating all person names with an annotation of type “Person”, perhaps in the default annotation set. For classification tasks, the annotations cover each span of interest, for example every sentence or every document, and the human annotator applies a feature value to each one. For example, you might have “Sentence” annotations in the default annotation set, and the human annotator might set the “type” feature of each one to “positive” or “negative”.
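
If you work from the GATE Embedded API (or a Groovy console) rather than GATE Developer’s annotation editor, the sketch below shows what these gold-standard annotations amount to. It is purely illustrative: the class name, offsets and labels are made up, and only the GATE core API (Document, AnnotationSet, Factory) is used, not anything from the Learning Framework itself.

```java
import gate.*;

// Illustrative sketch only: the shape of gold-standard annotations for the two
// task types described above. Assumes GATE has been initialised and "doc" is an
// already-loaded gate.Document; offsets and labels are made up.
public class GoldStandardSketch {
  public static void markUp(Document doc) throws Exception {
    AnnotationSet defaultSet = doc.getAnnotations(); // the default annotation set

    // NER/chunking-style target: a "Person" annotation spanning a name.
    defaultSet.add(0L, 10L, "Person", Factory.newFeatureMap());

    // Classification-style target: the class label lives in a feature on each
    // instance annotation, e.g. "type" = "positive" on every Sentence.
    for (Annotation sentence : defaultSet.get("Sentence")) {
      sentence.getFeatures().put("type", "positive");
    }
  }
}
```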

Additionally, you need some useful features annotated on your corpus. For example, when finding city names, the string value of each token is likely to be useful: being told that a token’s string is “London” obviously helps the learner decide whether it belongs to a city name or not! For many tasks, however, the inference process is more subtle and complex than that. Much of the value of the machine learning approach lies in the learner’s ability to draw inferences that a human being simply writing JAPE rules would not. Our job at this stage, then, is to provide a variety of potentially useful features and let the learner figure out what it can use. A common approach to providing features is to run ANNIE over the corpus, which gives us access to information such as the string and type of each token (see the sketch below). You may also have additional, task-specific resources you can use, such as a gazetteer of positive or negative opinion words.
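
As a concrete illustration of what ANNIE makes available, the sketch below prints some of the features ANNIE puts on each Token (“string”, “orth” and “kind” from the tokeniser, “category” from the POS tagger). It assumes ANNIE has already been run over the document; it is not part of the Learning Framework API, just a way of looking at the raw material your feature specification will refer to.

```java
import gate.*;

// Sketch: inspect the Token features ANNIE has produced. Assumes ANNIE has
// already run and "doc" is a processed gate.Document.
public class TokenFeatureSketch {
  public static void listTokenFeatures(Document doc) {
    for (Annotation token : doc.getAnnotations().get("Token")) {
      FeatureMap f = token.getFeatures();
      System.out.println(
          "string=" + f.get("string")        // e.g. "London"
        + " category=" + f.get("category")   // POS tag, e.g. "NNP"
        + " orth=" + f.get("orth")           // e.g. "upperInitial"
        + " kind=" + f.get("kind"));         // e.g. "word"
    }
  }
}
```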

Next, you will write a feature specification file describing the features you want to use. A guide to the feature file format is available here. After that, you can create the appropriate training PR. It has no init-time parameters; once it is created, you set its runtime parameters. These are described more fully in the pages for each PR, but in essence, for training, you need to tell the PR which algorithm you want to use, where your feature specification is, where the model should be saved, what your instance annotation is (“Token” is common for NER tasks; “Sentence” would be common for classification), and how the learner will find the correct classes. For NER, the correct classes are given by the annotation type named in the “classAnnotationType” parameter; for classification, it is the value of the “targetFeature” on each instance annotation that will be learned.
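
To make the feature specification more concrete, here is a minimal sketch of the kind of file involved, declaring the string and POS tag of each Token as nominal features. The element names are given from memory of the format; treat this as illustrative and consult the feature file format guide linked above for the authoritative syntax.

```xml
<ML-CONFIG>
  <!-- Use the string of the Token at the instance position as a nominal feature -->
  <ATTRIBUTE>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
    <DATATYPE>nominal</DATATYPE>
  </ATTRIBUTE>
  <!-- Also use its part-of-speech tag -->
  <ATTRIBUTE>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <DATATYPE>nominal</DATATYPE>
  </ATTRIBUTE>
</ML-CONFIG>
```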

Now you have a trained model that you can apply. Create the appropriate application PR and tell it where to find the model (the “dataDirectory” runtime parameter). By default, output annotations go into the “LearningFramework” annotation set, so you can look there to see what annotations the PR created. If the result isn’t what you expect, check your parameters and input annotations! The most common sources of error are a wrongly set parameter (for example, the instance annotation type) and annotations that are not present as you specified (for example, not all of them being in the input annotation set).
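
If you are scripting rather than browsing in GATE Developer, a quick way to sanity-check the output is to look inside the “LearningFramework” set, as in this sketch. The “Person” type here is just a stand-in for whatever your model was trained to produce.

```java
import gate.*;

// Sketch: print what the application PR produced. By default its output goes
// into the "LearningFramework" annotation set; "Person" stands in for whatever
// annotation type your model was trained to create.
public class OutputCheckSketch {
  public static void show(Document doc) {
    AnnotationSet out = doc.getAnnotations("LearningFramework");
    System.out.println("Annotations produced: " + out.size());
    for (Annotation ann : out.get("Person")) {
      System.out.println(Utils.stringFor(doc, ann) + "  " + ann.getFeatures());
    }
  }
}
```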

You might also like to try the evaluation PRs, which give you a quick indication of how well an algorithm and feature set perform on your data. These are available for classification (providing an accuracy figure from cross-fold or hold-out evaluation) and regression (providing a root mean squared error). Evaluation for chunking tasks isn’t currently provided, so you’ll have to run the training and application PRs on separate corpora and evaluate on the test corpus using Corpus QA or the Evaluation plugin.