LF_TrainTopicModel Processing Resource

This PR can be used to train an LDA topic model from a training corpus. For this algorithm, the instanceType parameter specifies the annotation type that identifies what counts as a “document” as far as LDA is concerned. If the parameter is left empty, the whole GATE document is used as a single LDA document. If a GATE document contains more than one instance annotation, the LDA algorithm treats the text covered by each instance annotation as a separate document. The tokens are identified by annotations of type tokenAnnotationType, and each token string comes either from the feature named by tokenFeature or, if tokenFeature is left empty, from the underlying cleaned document text.
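
The way instance and token annotations determine the LDA input can be illustrated with a small sketch. This is plain Python and does not use the GATE API: the `Ann` class and the `lda_documents` function are hypothetical stand-ins, the annotation type names and the `lemma` feature are example values only, and the raw text is used where the PR would use the cleaned document text.

```python
from dataclasses import dataclass, field


@dataclass
class Ann:
    """Hypothetical stand-in for a GATE annotation (not the GATE API)."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


def lda_documents(text, anns, token_type, instance_type=None, token_feature=None):
    """Return one list of token strings per LDA "document"."""
    tokens = sorted((a for a in anns if a.type == token_type), key=lambda a: a.start)
    if instance_type:
        # One LDA document per instance annotation.
        spans = sorted((a.start, a.end) for a in anns if a.type == instance_type)
    else:
        # Empty instanceType: the whole GATE document is a single LDA document.
        spans = [(0, len(text))]
    docs = []
    for start, end in spans:
        words = []
        for t in tokens:
            if start <= t.start and t.end <= end:
                if token_feature:
                    # Assumption: every token carries the named feature.
                    words.append(t.features[token_feature])
                else:
                    # No tokenFeature: fall back to the covered text.
                    words.append(text[t.start:t.end])
        docs.append(words)
    return docs


text = "Cats purr. Dogs bark."
anns = [
    Ann("Sentence", 0, 10), Ann("Sentence", 11, 21),
    Ann("Token", 0, 4, {"lemma": "cat"}), Ann("Token", 5, 9, {"lemma": "purr"}),
    Ann("Token", 11, 15, {"lemma": "dog"}), Ann("Token", 16, 20, {"lemma": "bark"}),
]

per_sentence = lda_documents(text, anns, "Token", "Sentence", "lemma")
whole_doc = lda_documents(text, anns, "Token")
print(per_sentence)
print(whole_doc)
```

With an instance type set, each sentence becomes its own LDA document and the token strings come from the feature; with both left empty, the whole text is one LDA document built from the covered strings.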

Parameters

The algorithm has no init parameters and the following run-time parameters:

Algorithms

Algorithm MalletLDA_CLUS_MR

This algorithm is part of Mallet and is included directly with the plugin. It uses an in-memory representation of the training data, so the size of the training corpus is limited by the amount of available RAM.

The following parameters can be specified in the algorithmParameters field:

If the applyAfterTraining parameter is true and all conditions for application are met, then the topic distributions are applied to each document after the model has been trained. This is done by adding features to the instance annotations specified through the instanceType parameter. If no instance type is specified, any “Document” annotation in the inputAS set is used instead, or, if none is found, one that spans the whole GATE document is added. The following features are added:

In addition to the annotations and annotation features created by the PR, the following files are written to the data directory:

Algorithm GensimWrapper_CLUS_DR

NOTE: NOT YET IMPLEMENTED!!

This is a wrapper around the LDA implementation in the Python Gensim package and uses an out-of-memory representation of the training data, so it can scale to very large corpora.
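
The general idea behind an out-of-memory representation can be sketched as follows. This is not the wrapper itself (which is not yet implemented) and it does not call Gensim: it is a standard-library illustration, assuming a hypothetical corpus file with one whitespace-tokenized document per line. The `stream_bow` function is an invented name for this example.

```python
import os
import tempfile
from collections import Counter


def stream_bow(path):
    """Yield one bag-of-words Counter per non-empty line, reading lazily."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield Counter(tokens)


# Write a tiny example corpus to disk (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("cat purr cat\n\ndog bark\n")
    corpus_path = tmp.name

# Only one document is held in memory at a time while iterating.
vocabulary = Counter()
n_docs = 0
for bow in stream_bow(corpus_path):
    vocabulary.update(bow)
    n_docs += 1

os.remove(corpus_path)
print(n_docs, vocabulary["cat"])
```

Because the generator reads the file line by line, memory use depends on the largest single document rather than on the size of the corpus, which is the property that distinguishes this approach from the in-memory Mallet algorithm.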