LF_TrainClassification Processing Resource

The classification training PR allows you to train a classifier for problems such as language identification, genre identification, or choosing which type to assign to named entities already located in the text. The PR has no init-time parameters.

Runtime parameters

Training a model

Set the instance annotation type to the type of annotation you wish to classify, for example “Sentence”. These annotations must already be present in the input annotation set. The features used for each instance can come directly from the instance annotation, from annotations overlapping with or contained in the span of the instance annotation, or from instance annotations preceding or following the current one.

For example, with the following feature specification, all Token.string values of “Token” annotations contained in the instance annotation (e.g. a “Sentence” annotation) are used:

<ML-CONFIG>    
  <NGRAM>
    <NUMBER>1</NUMBER>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
  </NGRAM>    
</ML-CONFIG>

The following indicates that the features “string”, “orth” and “length” should be taken from the instance annotation (which is assumed to be “Token”), and that the feature “category” should also be taken from the preceding, the current and the following instance annotation (note that no type is specified, so the type of the instance annotation as configured for the PR is used):

<ML-CONFIG>    
  <ATTRIBUTE>
    <FEATURE>string</FEATURE>
  </ATTRIBUTE>
  <ATTRIBUTE>
    <FEATURE>orth</FEATURE>
  </ATTRIBUTE>
  <ATTRIBUTE>
    <FEATURE>length</FEATURE>
    <DATATYPE>numeric</DATATYPE>
  </ATTRIBUTE>
  <ATTRIBUTELIST>
    <FROM>-1</FROM>
    <TO>1</TO>
    <FEATURE>category</FEATURE>
  </ATTRIBUTELIST>    
</ML-CONFIG>

See FeatureSpecification for more information on the content of the feature specification file.
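
The different kinds of entries can also be combined in a single specification. The following is only an illustrative sketch based on the two examples above (it is not taken from the FeatureSpecification documentation): it uses both the strings and the part-of-speech categories of the “Token” annotations contained in each instance annotation (e.g. a “Sentence”):

<ML-CONFIG>
  <NGRAM>
    <NUMBER>1</NUMBER>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
  </NGRAM>
  <NGRAM>
    <NUMBER>1</NUMBER>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
  </NGRAM>
</ML-CONFIG>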

Algorithms and their Parameters

LibSVM_CL_MR

This uses the LibSVM Support Vector Machine training algorithm (see https://www.csie.ntu.edu.tw/~cjlin/libsvm/).

The algorithm parameters that can be used are identical to those accepted by the svm-train command, except that the only allowed values for the -s parameter are “0” (C-SVC) and “1” (nu-SVC). If no parameters are specified, the defaults are the same as for the svm-train command, except for the -b (probability_estimates) parameter, which is set to 1 by default.
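
For illustration only (these values are not from the original documentation), a C-SVC with an RBF kernel, a cost of 10 and probability estimates enabled could be requested with a parameter string such as the following, passed via the PR's runtime parameter for algorithm parameters:

-s 0 -c 10 -t 2 -b 1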

Mallet Algorithms

The following algorithms are all from the Mallet Machine Learning Toolkit (see http://mallet.cs.umass.edu/).

MalletBalancedWinnow_CL_MR

This uses the algorithm cc.mallet.classify.BalancedWinnowTrainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/BalancedWinnowTrainer.html

Parameters:

MalletC45_CL_MR

This uses the algorithm cc.mallet.classify.C45Trainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/C45Trainer.html

Parameters:

MalletDecisionTree_CL_MR

This uses the algorithm cc.mallet.classify.DecisionTreeTrainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/DecisionTreeTrainer.html

NOTE: This algorithm is limited and according to the documentation does not split on continuous attributes!

Parameters:

MalletMaxEnt_CL_MR (Multivariate Logistic Regression)

This uses the algorithm cc.mallet.classify.MaxEntTrainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/MaxEntTrainer.html

Parameters:

MalletNaiveBayesEM_CL_MR

This uses the algorithm cc.mallet.classify.NaiveBayesEMTrainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/NaiveBayesEMTrainer.html

This algorithm does not have any parameters to set.

MalletNaiveBayes_CL_MR

This uses the algorithm cc.mallet.classify.NaiveBayesTrainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/NaiveBayesTrainer.html

This algorithm does not have any parameters to set.

MalletWinnow_CL_MR

This uses the algorithm cc.mallet.classify.WinnowTrainer, see http://mallet.cs.umass.edu/api/cc/mallet/classify/WinnowTrainer.html

Parameters:

MALLET*_SEQ_MR

The algorithms with names matching MALLET*_SEQ_MR are all sequence tagging algorithms: these algorithms can make use of the way instances occur in sequence and also of the predictions made for preceding instances.

These algorithms are primarily used for chunking but are included here so they can be applied to classification tasks as well. The algorithms and their parameters are all documented in LF_TrainChunking.

PytorchWrapper_CL_DR and PytorchWrapper_SEQ_DR

These use the Python-based wrapper for the PyTorch back-end. In order to use them, the Python environment must first be prepared on your machine. See the documentation for the PyTorch back-end for more information.

KerasWrapper_CL_DR and KerasWrapper_SEQ_DR

These use the Python-based wrapper for the Keras/TensorFlow back-end. In order to use them, the Python environment must first be prepared on your machine. See the documentation for the Keras back-end for more information.

WekaWrapper_CL_MR

See UsingWeka for how to use the external weka-wrapper software to train a Weka model.