Pipeline `LF_TrainTopicModel_Mallet_EN`

This is a ready-made pipeline that illustrates a simple way of preparing documents for topic modelling and then runs the PR (processing resource) that trains a Mallet topic model.

IMPORTANT: this pipeline expects the default annotation set to already contain Token annotations as created by ANNIE, with the POS tag in the feature “category” and the lemma/root in the feature “root”.

Steps:

1. Use a Groovy script to copy only those Token annotations which are of kind “word” and which do not have a POS tag starting with “V” into the annotation set “LDA” as annotation type “TokenWord”.
2. Use a FeatureGazetteer to remove TokenWord annotations whose lemma matches a stop word.
3. Use a FeatureGazetteer to remove TokenWord annotations whose lower-cased string matches a stop word.
4. Use a JavaRegexpAnnotator to find strings in the document that should not be included in topic modelling (currently only URLs and email addresses).
5. Remove TokenWord annotations contained within the spans found in the previous step.
6. Run the LF PR for training a topic model, using the remaining TokenWord annotations as input.
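
The filtering logic of the preparation steps above can be sketched outside GATE in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline's actual Groovy script or gazetteer configuration: tokens are modelled as dictionaries, and the stop word list, the regular expression, and the helper names `make_tokens` and `select_topic_tokens` are all hypothetical.

```python
import re

# Hypothetical resources: the pipeline's real stop word list and patterns are not shown here.
STOP_WORDS = {"the", "at", "of", "and"}
UNWANTED = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # URLs, email addresses


def make_tokens(text, tagged):
    """Build toy Token records (offsets, kind, POS category, root) from (string, POS, root) triples."""
    tokens, pos = [], 0
    for string, category, root in tagged:
        start = text.index(string, pos)
        pos = start + len(string)
        kind = "word" if string.isalpha() else "punctuation"
        tokens.append({"start": start, "end": pos, "string": string,
                       "kind": kind, "category": category, "root": root})
    return tokens


def select_topic_tokens(text, tokens):
    """Return the tokens that would remain as topic model input after the filtering steps."""
    # Keep word tokens whose POS tag does not start with "V".
    words = [t for t in tokens
             if t["kind"] == "word" and not t["category"].startswith("V")]
    # Drop tokens whose lemma is a stop word.
    words = [t for t in words if t["root"] not in STOP_WORDS]
    # Drop tokens whose lower-cased string is a stop word.
    words = [t for t in words if t["string"].lower() not in STOP_WORDS]
    # Find unwanted spans (URLs, email addresses) in the document text.
    spans = [m.span() for m in UNWANTED.finditer(text)]
    # Drop tokens that lie entirely inside one of those spans.
    return [t for t in words
            if not any(s <= t["start"] and t["end"] <= e for s, e in spans)]


TEXT = "The parser reads pages at http://example.org daily"
TOKENS = make_tokens(TEXT, [
    ("The", "DT", "the"), ("parser", "NN", "parser"), ("reads", "VBZ", "read"),
    ("pages", "NNS", "page"), ("at", "IN", "at"), ("http", "NN", "http"),
    (":", ":", ":"), ("example", "NN", "example"), ("org", "NN", "org"),
    ("daily", "RB", "daily"),
])

remaining = [t["string"] for t in select_topic_tokens(TEXT, TOKENS)]
print(remaining)  # ['parser', 'pages', 'daily']
```

In the real pipeline the surviving tokens are TokenWord annotations in the “LDA” annotation set, and they are passed to the training PR rather than printed.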
