CorpusStatsTfIdfPR Processing Resource

This PR calculates the term frequencies (Tf, often also called total term frequencies, TTF) and document frequencies (Df) of terms over a whole corpus. Terms are typically words identified by Token annotations, but any annotation type can be used: the string for the term is either the document text covered by the annotation or the value of some feature of that annotation.
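The difference between the two counts can be illustrated with a minimal sketch. This is not the PR's implementation; the function name corpus_stats and the toy corpus are made up for illustration, and each document is assumed to be already reduced to a list of term strings.

```python
from collections import Counter

def corpus_stats(docs):
    """Return (tf, df, ndocs) for a corpus given as a list of term lists.

    Hypothetical helper for illustration only, not part of the plugin.
    """
    tf = Counter()  # total occurrences of each term over the whole corpus
    df = Counter()  # number of documents containing each term at least once
    for terms in docs:
        tf.update(terms)
        df.update(set(terms))  # a term counts once per document for Df
    return tf, df, len(docs)

corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "and", "the", "dog"],
    ["a", "dog"],
]
tf, df, ndocs = corpus_stats(corpus)
print(tf["the"], df["the"], ndocs)  # 3 2 3
```

Here "the" occurs three times in total (Tf) but only in two of the three documents (Df).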

Runtime parameters

If the keyFeature parameter is empty, the document text covered by the annotation is used, as returned by the gate.Utils.cleanStringFor method. If a feature is specified, the value of that feature is used unchanged unless the feature is missing. Any removal of white space, case-folding or similar normalization must therefore already have been carried out before this PR runs.

CAUTION: if the value of the keyFeature contains a tab character, it is also used unchanged, which will corrupt the tab-separated output file.
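Why a tab in the term string is a problem can be seen in a small sketch. It assumes a plain TSV writer that joins fields with tabs and does no escaping; the helper names tsv_row and clean_key are hypothetical, and clean_key only shows one possible way to normalize feature values before they reach this PR.

```python
def tsv_row(fields):
    # Naive TSV row: fields joined by tabs, no escaping. This is an
    # assumption for illustration, not the PR's actual writer.
    return "\t".join(str(f) for f in fields)

def clean_key(value):
    # Hypothetical pre-processing step: collapse tabs, newlines and runs
    # of white space into single spaces and trim both ends.
    return " ".join(value.split())

bad = tsv_row(["New\tYork", 5, 2])
good = tsv_row([clean_key("New\tYork"), 5, 2])

print(len(bad.split("\t")))   # 4 columns where 3 were intended
print(len(good.split("\t")))  # 3
```

A reader of the file has no way to tell that the extra column in the first row belongs to the term, so every following field is shifted by one.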

Files created

The PR creates each of the files described below for which a URL is specified:

The dataFile: a compressed serialization of the internal data structure. It is meant to be used when data needs to be pre-loaded for this PR, or when the AssignStatsPR is to be used subsequently.

The sumsFile: a TSV-format (tab-separated values) file which contains only a header row and one row with the following fields for the whole corpus:

The tfDfFile: a TSV-format file which contains a header row and one row for each distinct term in the corpus, with the following fields:

NOTE: the last three fields, tfidf, ntfidf and wtfidf, may be removed or made optional in future versions, as they are redundant and can easily be calculated from the remaining fields. They are included for now for convenience, but there is a convenience/space trade-off. idf is included although it could easily be calculated from the ndocs field of the summary file and the df field of this file.
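As a rough sketch of how such derived values can be recomputed from the two files, the following reads ndocs from a summary excerpt and tf and df from a term excerpt. Several things here are assumptions: the column name "term", the sample values, the helper name derive, and in particular the formula idf = ln(ndocs / df), which is a common textbook definition and may differ from the one this PR uses. The definitions of ntfidf and wtfidf are not given here, so they are not reproduced.

```python
import csv
import io
import math

# Invented excerpts for illustration; real files would be opened from disk.
sums_tsv = "ndocs\n3\n"
tfdf_tsv = "term\ttf\tdf\nthe\t3\t2\ndog\t2\t2\na\t1\t1\n"

def read_tsv(text):
    # QUOTE_NONE: treat quote characters in terms as ordinary text.
    return list(csv.DictReader(io.StringIO(text), delimiter="\t",
                               quoting=csv.QUOTE_NONE))

def derive(ndocs, rows):
    """Map each term to (idf, tfidf) under the assumed idf definition."""
    result = {}
    for row in rows:
        tf, df = int(row["tf"]), int(row["df"])
        idf = math.log(ndocs / df)  # assumed formula, see lead-in
        result[row["term"]] = (idf, tf * idf)
    return result

ndocs = int(read_tsv(sums_tsv)[0]["ndocs"])
derived = derive(ndocs, read_tsv(tfdf_tsv))
for term, (idf, tfidf) in derived.items():
    print(term, round(idf, 4), round(tfidf, 4))
```

Before relying on recomputed values, it is worth comparing them against the idf and tfidf columns the PR itself writes, to confirm which formula is actually in use.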

Multi-Threaded Operation

This PR can safely be used in a pipeline that is run in multi-threaded mode, e.g. in GCP, by duplicating the PR using GATE’s duplication mechanism.