Feature Specification File

The feature specification file describes how machine learning features should be constructed from a GATE document and the annotations present in it. “Machine learning features” are different from the features in a GATE feature map. Roughly, machine learning features are key/value pairs which are created for each machine learning example/instance that is added to the training set at training time or classified at application time. These key/value pairs can be created from the original document text, from annotation features in the feature map of the instance annotation, from features in the feature maps of annotations that precede, follow or overlap with the instance annotation, and in various other ways. The feature specification file describes most of how this is done, while the instance annotation parameter of the PR determines which annotation is used to start with (see more about this below).

Machine learning features are sometimes also called “attributes” or “independent variables”. This can be a bit confusing because annotation features are converted to machine learning features and, depending on the type of value, one annotation feature can be converted to and represented by several different attributes or variables for the machine learning algorithm. Some of the details of how this is done are included in the description of the feature specification settings below.

In addition to the examples in this page, example feature specification files for the two task types are included in the tutorials.

A rough overview of how machine learning features / attributes are created

At training time, the training PR processes the training corpus document by document. For each document, the PR processes all “instance annotations”, i.e. all annotations which come from the input annotation set and have the type specified as the instance annotation type. The PR then gets the “target” value from the instance annotation; this is the value that the trained model should learn to predict. In addition, the PR extracts the machine learning features: these are the values for each training example from which the model should learn to predict the target. The machine learning features can come from the original document text, from the feature map of the instance annotation itself, or from the feature maps of annotations which precede, follow or overlap with the instance annotation.

The original values from which machine learning features are created can be numeric, boolean, string, list or map values. Each of these values gets converted into one or more machine learning features (attributes), and the way this is done can be influenced by the settings in the feature specification file.

At application time, the machine learning features are extracted in almost the same way; however, there is no target value to extract, because that is what the trained model predicts from the extracted features.

Note that very often the quality of the trained model depends on which annotation features are available for the machine learning algorithm and how exactly they get converted into machine learning features.

Not using a Feature Specification File

If no file is specified, a default is assumed that is equivalent to the following:

If sequence annotation is provided and a sequence classifier is used:

<ML-CONFIG>
  <NGRAM>
    <NUMBER>1</NUMBER>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
  </NGRAM>
</ML-CONFIG>

If a normal classification algorithm is used and no sequence annotation type is specified:

<ML-CONFIG>
  <ATTRIBUTE>
    <DATATYPE>nominal</DATATYPE>
    <FEATURE>string</FEATURE>
  </ATTRIBUTE>
</ML-CONFIG>

XML representation of the Feature Specification File

The Feature Specification file is an XML file that should be encoded in UTF-8. It can have any root element, but something like <LearningFramework> is recommended for clarity.

Nested within the root element there must be one or more attribute specification elements. An attribute specification is one of the elements described below and in turn contains nested elements that further describe it. The case of the element names does not matter (e.g. <ATTRIBUTE>, <attribute> and <aTTriButE> all work equally well).
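
For illustration, a minimal skeleton of a feature specification file using the recommended root element and a single attribute specification might look like this (any root element name and any number of attribute specifications would work equally well):

<LearningFramework>
  <ATTRIBUTE>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
    <DATATYPE>nominal</DATATYPE>
  </ATTRIBUTE>
</LearningFramework>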

<ATTRIBUTE>

This describes a feature that is taken from the instance annotation or, if the <TYPE> element is present, from an overlapping annotation of the specified type. If the <TYPE> element is present and identical to the instance annotation type specified as a PR parameter, then the feature value is taken from the original instance annotation. If the <TYPE> element is present and different from the instance annotation type specified for the PR, then the first annotation of that type that overlaps with the instance annotation is used. If there are several such annotations, the longest is chosen; if there are several of those, a random one is selected. If there is no overlapping annotation, no feature is created for that instance.

NOTE: usually the annotations overlapping with the instance annotation should be contained within it, or even be coextensive with it, and the user should make sure that all annotations are created in a way that is useful for the learning algorithm.

The ATTRIBUTE element can contain nested elements, including <TYPE>, <FEATURE>, <DATATYPE> and <CODEAS> (see below for a more detailed explanation of some of these).

The <CODEAS> element (relevant for non-dense representations only) controls how nominal values are coded. A value of one_of_k means that for each possible value of the annotation feature a separate attribute (machine learning feature) is created, and that feature is set to 1.0 if the value is present and 0.0 if it is not. If number is used instead, then every possible value is mapped to a different integer value. one_of_k coding is normally used to represent words, tokens or other nominal features derived from words.

Note that if the datatype is nominal and the coding is one_of_k, then if the value of a feature is an array, an indicator feature is generated for each element of the array; if the value of a feature is a map, an indicator feature is generated for each key/value combination in the map.

If the datatype is numeric, the LF attempts to convert any scalar type to a number: Boolean values are converted as expected and a String is converted to a number if possible (if not possible, 0.0 is used). If the value is an array of doubles or an Iterable, then each element is converted to its own feature and “spliced” into the final feature vector. This can be used to add values from dense vectors like embeddings to the machine learning features.

Here is an example of an attribute specification, in which the value of the feature “string” of annotations of type “Token” is used:

<ATTRIBUTE>
  <TYPE>Token</TYPE>
  <FEATURE>string</FEATURE>
  <DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>
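
To illustrate the settings discussed above, here are two further sketches: one which explicitly requests number coding for a nominal feature, and one which treats a feature as numeric. The feature names category and vector are only illustrative assumptions (e.g. a part-of-speech tag and a dense embedding vector stored on the Token annotations):

<ATTRIBUTE>
  <TYPE>Token</TYPE>
  <FEATURE>category</FEATURE>
  <DATATYPE>nominal</DATATYPE>
  <CODEAS>number</CODEAS>
</ATTRIBUTE>

<ATTRIBUTE>
  <TYPE>Token</TYPE>
  <FEATURE>vector</FEATURE>
  <DATATYPE>numeric</DATATYPE>
</ATTRIBUTE>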

<ATTRIBUTELIST>

This represents a whole ordered list of features, one list for each instance annotation. The annotations of the specified type are ordered by offset. The one overlapping with the instance is element “0” in the attribute list, the one before it is element “-1”, the one after it is element “1”, and so on. The <FROM> and <TO> elements give the start and end element indices and thus determine how many annotations are used.

As with the <ATTRIBUTE> specification, if the <TYPE> element is missing or is identical to the instance annotation type, then the instance annotation type specified in the PR is used. Whichever type is used, the annotations of this type should not overlap with each other. The feature extraction code does not check whether the annotations actually form a strict sequence in the document!

All settings from the <ATTRIBUTE> specification can be used, plus the <FROM> and <TO> elements described above.

The element numbers are zero-based and increase for annotations which start to the right of the start offset of the instance annotation, and are -1-based and decrease for annotations which start to the left of the start offset of the instance annotation.
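
For example, a sketch of an attribute list which uses the string feature of the two Token annotations before the instance annotation, the one overlapping it and the two after it could look like the following (the window from -2 to 2 is just an illustrative choice):

<ATTRIBUTELIST>
  <TYPE>Token</TYPE>
  <FEATURE>string</FEATURE>
  <DATATYPE>nominal</DATATYPE>
  <FROM>-2</FROM>
  <TO>2</TO>
</ATTRIBUTELIST>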

<NGRAM>

This creates features for all annotations contained within the span of the instance annotation and optionally combines sequences of N successive values into N-grams.
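
As an illustration, the following sketch is similar to the default configuration shown earlier but sets <NUMBER> to 2; assuming <NUMBER> specifies N, this would create bigrams over the string values of the Token annotations contained in each instance annotation:

<NGRAM>
  <NUMBER>2</NUMBER>
  <TYPE>Token</TYPE>
  <FEATURE>string</FEATURE>
</NGRAM>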

For all nominal attributes (ATTRIBUTE, ATTRIBUTELIST or NGRAM), an EMBEDDINGS block can be specified (this will get ignored for sparse representations and may get ignored for some dense representations). Such a block looks like this:

<EMBEDDINGS>
  <ID>token</ID>
  <DIMS>100</DIMS>
  <FILE>embs/glove6b.txt.gz</FILE>
  <TRAIN>mapping</TRAIN>
  <MINFREQ>5</MINFREQ>
</EMBEDDINGS>

Within the EMBEDDINGS block, settings such as <ID>, <DIMS>, <FILE>, <TRAIN> and <MINFREQ> are possible, as shown in the example above.

NOTES

“Not creating a feature for an instance”: machine learning features are created dynamically as they are encountered. For example, if an <ATTRIBUTE> specification specifies a type but the training set contains no annotation of that type that overlaps with any of the instance annotations, then the feature or features from that specification are never created and do not appear in the training set. However, if some instances have an overlapping annotation and others do not, then a feature which is created for one instance may not be created for another instance. In that case, the feature is in the training set but is treated as a “missing value” for the instances that lack it; what happens exactly then depends on the missing value treatment specified for that attribute and on the learning algorithm (because some missing value treatments are not supported by some learning algorithms). In most cases, the default is that a missing value is treated like the value zero (0.0).

Machine Learning Feature Names

The internal feature names used for machine learning are generated from the annotation type, the feature map feature name, the attribute specification type, the details specific to an attributelist or ngram attribute, and possibly the actual value when encoded in a one-of-k fashion. The names follow this scheme: