LF_TrainClassification Processing Resource
The classification training PR allows you to train a classifier for problems such as language identification, genre identification, choosing which type to assign to named entities already located in the text, and so forth. The PR has no init-time parameters.
Runtime parameters
algorithmParameters
(String, no default) parameters influencing the training algorithm (see below)
dataDirectory
(URL, no default, required) the directory where all the files generated by the algorithm are saved (model file, dataset description file, information file etc.). The file names are always the same, so a different directory MUST be used to keep separate models apart.
featureSpecURL
(URL, no default, required) the XML file describing the features to use, see FeatureSpecification
inputASName
(String, default is the empty String for the default annotation set) input annotation set containing the instance annotations, the annotations specified in the feature specification and the sequenceSpan annotations, if used.
instanceType
(String, default “Token”, required) the annotation type of instance annotations.
instanceWeightFeature
(String, no default, optional) the name of a feature on the instance annotation that contains the instance weight. If this is not specified, no instance weights are collected. If it is specified and the feature exists, its value is converted to an instance weight if possible, otherwise an error occurs; if the feature does not exist, the weight 1.0 is used. This is only relevant for training algorithms that can use instance weights.
scaleFeatures
(Enumeration, default NONE) how to scale features, if at all. Possible values:
NONE
do not do any scaling at all
MEANVARIANCE_ALL_FEATURES
normalize all features to have mean 0 and variance 1. [NOTE: this is not implemented properly yet and may change in the future!] See FeatureScaling
sequenceSpan
(String, no default) this must be used for sequence tagging algorithms only! For such algorithms, it specifies the span across which to learn a sequence; for example, a sentence is a meaningful sequence of words. Used like this, a sequence algorithm can be applied to classification, although this is not normally what one wants to do.
targetFeature
(String, no default, required) the feature on the instance annotation that contains the nominal value which represents the class label. All instance annotations should have a class label.
trainingAlgorithm
the classification training algorithm to use (see below). See UsingWeka for how to use WekaWrapper_CL_MR to train a Weka model.
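To illustrate what the MEANVARIANCE_ALL_FEATURES option does conceptually, here is a minimal sketch in plain Python. It is not part of the Learning Framework, and the function name is made up for illustration; it just shows the standard transformation of each feature column to mean 0 and variance 1:

```python
def scale_mean_variance(columns):
    """Scale each feature column to mean 0 and variance 1.

    `columns` is a list of feature columns, each a list of numeric values.
    Constant columns (variance 0) are mapped to all zeros.
    """
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        # population variance of the column
        var = sum((x - mean) ** 2 for x in col) / n
        # avoid division by zero for constant features
        std = var ** 0.5 or 1.0
        scaled.append([(x - mean) / std for x in col])
    return scaled
```

A scaled column then has mean 0 and variance 1, which prevents features with large numeric ranges from dominating distance- or margin-based algorithms.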
Training a model
Set the instanceType parameter to the annotation type that you wish to classify, for example “Sentence”. These annotations must already be present in the input annotation set. The features used for each instance can come directly from the instance annotation, from annotations overlapping with or contained in the span of the instance annotation, or from instance annotations preceding or following the current instance annotation.
For example, with the following feature specification, all Token.string values of “Token” annotations contained in the instance annotation (e.g. a “Sentence” annotation) are used:
<ML-CONFIG>
<NGRAM>
<NUMBER>1</NUMBER>
<TYPE>Token</TYPE>
<FEATURE>string</FEATURE>
</NGRAM>
</ML-CONFIG>
The following indicates to take the features “string”, “orth” and “length” from the instance annotation (which is assumed to be “Token”) and to also take the feature “category” from the preceding, the current and the next instance annotation (note that the type is not specified, so the type of the instance annotation as specified for the PR is used):
<ML-CONFIG>
<ATTRIBUTE>
<FEATURE>string</FEATURE>
</ATTRIBUTE>
<ATTRIBUTE>
<FEATURE>orth</FEATURE>
</ATTRIBUTE>
<ATTRIBUTE>
<FEATURE>length</FEATURE>
<DATATYPE>numeric</DATATYPE>
</ATTRIBUTE>
<ATTRIBUTELIST>
<FROM>-1</FROM>
<TO>1</TO>
<FEATURE>category</FEATURE>
</ATTRIBUTELIST>
</ML-CONFIG>
See FeatureSpecification for more information on the content of the feature specification file.
Algorithms and their Parameters
LibSVM_CL_MR
This uses the LibSVM Support Vector Machine training algorithm (see https://www.csie.ntu.edu.tw/~cjlin/libsvm/).
The algorithm parameters which can be used are identical to those accepted by the command svm-train, except that the only allowed values for the parameter -s are “0” (C-SVC) and “1” (nu-SVC). If no parameters are specified, the defaults used are the same as for the svm-train command, except for the -b (probability_estimates) parameter, which is set to 1 by default.
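For example, the algorithmParameters runtime parameter could be set to a string like the following (the specific values are only an illustration, not recommended settings):

```
-s 0 -c 10 -t 0 -b 0
```

Here -s 0 selects C-SVC, -c 10 sets the cost parameter, -t 0 selects a linear kernel, and -b 0 disables the probability estimates that this PR otherwise enables by default.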
Mallet Algorithms
The following algorithms are all from the Mallet Machine Learning Toolkit (see http://mallet.cs.umass.edu/).
MalletBalancedWinnow_CL_MR
This uses the algorithm cc.mallet.classify.BalancedWinnowTrainer
, see
http://mallet.cs.umass.edu/api/cc/mallet/classify/BalancedWinnowTrainer.html
Parameters:
-epsilon / -e (Double, default: 0.5)
-delta / -d (Double, default: 0.1)
-maxIter / -i (Integer, default: 30)
-coolingRate / -d (Double, default: 0.5)
MalletC45_CL_MR
This uses the algorithm cc.mallet.classify.C45Trainer
, see http://mallet.cs.umass.edu/api/cc/mallet/classify/C45Trainer.html
Parameters:
-maxDepth / -m (Integer) the maximum depth to grow the decision tree to, default is unlimited (0)
-prune / -b (Boolean) this requires an explicit value “true” or “false”; when “false” is specified, pruning is disabled (enabled is the default)
-minNumInsts / -n (Integer) the minimum number of instances in each node (default is 2)
MalletDecisionTree_CL_MR
This uses the algorithm cc.mallet.classify.DecisionTreeTrainer
, see
http://mallet.cs.umass.edu/api/cc/mallet/classify/DecisionTreeTrainer.html
NOTE: This algorithm is limited and according to the documentation does not split on continuous attributes!
Parameters:
-minInfoGainSplit / -i (Double, default: 0.001)
-maxDepth / -m (Integer, default: 5) the maximum depth to grow the decision tree to.
MalletMaxEnt_CL_MR
(Multivariate Logistic Regression)
This uses the algorithm cc.mallet.classify.MaxEntTrainer
, see
http://mallet.cs.umass.edu/api/cc/mallet/classify/MaxEntTrainer.html
Parameters:
-gaussianPriorVariance / -v (Double, default: 1.0)
-l1Weight / -l (Double, default: 0.0) use an L1 prior
-numIterations / -n (Integer, default: unlimited) (according to the javadoc, currently not functional)
MalletNaiveBayesEM_CL_MR
This uses the algorithm cc.mallet.classify.NaiveBayesEMTrainer
, see
http://mallet.cs.umass.edu/api/cc/mallet/classify/NaiveBayesEMTrainer.html
This algorithm does not have any parameters to set.
MalletNaiveBayes_CL_MR
This uses the algorithm cc.mallet.classify.NaiveBayesTrainer
, see
http://mallet.cs.umass.edu/api/cc/mallet/classify/NaiveBayesTrainer.html
This algorithm does not have any parameters to set.
MalletWinnow_CL_MR
This uses the algorithm cc.mallet.classify.WinnowTrainer
, see
http://mallet.cs.umass.edu/api/cc/mallet/classify/WinnowTrainer.html
Parameters:
-alpha / -a (Double, default: 2.0)
-beta / -b (Double, default: 2.0)
-nfact / -n (Double, default: 0.5)
MALLET*_SEQ_MR
The algorithms with names matching MALLET*_SEQ_MR are all sequence tagging algorithms: they can make use of the way instances occur in sequence and of which predictions were made for preceding instances. These algorithms are primarily used for chunking but are included here so they can be applied to classification tasks as well. The algorithms and their parameters are all documented in LF_TrainChunking.
PytorchWrapper_CL_DR and PytorchWrapper_SEQ_DR
These use the Python-based wrapper for the Pytorch back-end. In order to use them, the Python environment must first be prepared on your machine. See the documentation for the Pytorch back-end for more.
KerasWrapper_CL_DR and KerasWrapper_SEQ_DR
These use the Python-based wrapper for the Keras/Tensorflow back-end. In order to use them, the Python environment must first be prepared on your machine. See the documentation for the Keras back-end for more.
WekaWrapper_CL_MR
See UsingWeka for how to use the external weka-wrapper software to train a Weka model.