AssignStatsPR Processing Resource
This PR loads a data file previously created by the CorpusStatsPR processing resource and uses the loaded corpus statistics to assign features holding term-specific statistics, such as tf*idf, to annotations.
Runtime Parameters
containingAnnotationType (String, no default): if this is specified, only input annotations contained in an annotation of this type in the input annotation set will be used for the statistics.
dataFileUrl (URL, optional): the URL of a binary file to load pre-calculated corpus statistics from.
featurePrefix (String, optional, default is “cs_”): a string to prepend to the names of the features added by this PR. The feature name is made from this prefix plus the name of the statistic (see below).
inputAnnotationSet (String, default empty for the default annotation set): the annotation set that should contain the input annotations and, if specified, the containing annotations.
inputAnnotationType (String, required, default is “Token”): the type of the annotations that represent terms.
keyFeature (String, default empty for using the document content covered by the annotation): if this is specified, the value of this feature is used instead of the document content covered by the annotation.
statsList (String, default is “tfidf,wtfidf,ltfidf”): a comma-, semicolon-, or whitespace-separated list of names of statistics to add to each annotation. The feature name is generated from the featurePrefix, if any, concatenated with the name of the statistic. See below for a list of possible names and how their values are calculated.
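As an illustration of how statsList and featurePrefix combine into feature names, here is a minimal Python sketch. It is not the plugin's actual code: the function name feature_names is hypothetical, and the splitting rule is simply the "comma, semicolon, or whitespace" description above turned into a regular expression.

```python
import re

def feature_names(stats_list, feature_prefix="cs_"):
    """Hypothetical helper: turn a statsList value into feature names."""
    # Split on any run of commas, semicolons, or whitespace and
    # drop empty pieces caused by leading or trailing separators.
    names = [n for n in re.split(r"[,;\s]+", stats_list) if n]
    # Each feature name is the prefix followed by the statistic name.
    return [feature_prefix + n for n in names]

print(feature_names("tfidf,wtfidf,ltfidf"))
# prints ['cs_tfidf', 'cs_wtfidf', 'cs_ltfidf']
```

With an empty featurePrefix, the feature names are just the statistic names.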
Available Term-Statistics
The available statistics are calculated from the loaded corpus statistics data (previously calculated using the CorpusStatsPR processing resource), from the current document, or from both.
The corpus from which the statistics were calculated is called the “stats corpus” below; it can be the same corpus on which the AssignStatsPR is run or a different one.
The following statistics are supported:
nDocs: the total number of documents the corpus statistics were calculated from.
nWords: the total number of words (not distinct terms) the stats corpus contained.
nTerms: the total number of distinct terms the stats corpus contained.
idf: Inverse Document Frequency, looked up from the loaded corpus statistics for the term. This is calculated as (1 + log((nDocs+1)/(df+1))). Calculating idf this way guarantees that the value is at least 1 and that there is no division by zero if the term did not occur in the stats corpus and df is therefore 0.
df: Document Frequency: the number of documents in the stats corpus the term occurred in.
tf: Term Frequency: the number of times the term occurs in the current document.
ntf: Normalized Term Frequency: (tf / maxTf) where maxTf is the number of times the most frequent term occurs in the current document.
wtf: Weighted Term Frequency: (tf / sumTf) where sumTf is the total number of words (not distinct terms) in the current document.
ltf: Logarithmic Term Frequency: (1 + log(tf))
tfidf: (tf * idf)
ntfidf: (ntf * idf)
wtfidf: (wtf * idf); this is probably closest to what is normally meant by “TF*IDF”
ltfidf: (ltf * idf)
ctf: Corpus Term Frequency: the number of times the term occurred in the stats corpus.
cntf: Corpus Normalized Frequency: the sum of all ntf values of the term in the stats corpus.
cwtf: Corpus Weighted Frequency: the sum of all wtf values of the term in the stats corpus.
actf: Average Corpus Term Frequency: (ctf/nDocs)
acntf: Average Corpus Normalized Frequency: (cntf/nDocs)
acwtf: Average Corpus Weighted Frequency: (cwtf/nDocs)
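The formulas above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the plugin's implementation: the function names (idf, document_stats, corpus_stats) are hypothetical, documents are represented as plain lists of term strings, and log is assumed to be the natural logarithm, since the base is not stated here.

```python
import math
from collections import Counter, defaultdict

def idf(n_docs, df):
    # 1 + log((nDocs+1)/(df+1)); natural log is an assumption.
    return 1.0 + math.log((n_docs + 1) / (df + 1))

def document_stats(terms, n_docs, df_table):
    """Per-term statistics for one non-empty document (list of terms)."""
    counts = Counter(terms)
    max_tf = max(counts.values())   # count of the most frequent term
    sum_tf = len(terms)             # total words, not distinct terms
    stats = {}
    for term, tf in counts.items():
        df = df_table.get(term, 0)  # 0 if unseen in the stats corpus
        term_idf = idf(n_docs, df)
        ntf, wtf, ltf = tf / max_tf, tf / sum_tf, 1.0 + math.log(tf)
        stats[term] = {
            "tf": tf, "df": df, "idf": term_idf,
            "ntf": ntf, "wtf": wtf, "ltf": ltf,
            "tfidf": tf * term_idf, "ntfidf": ntf * term_idf,
            "wtfidf": wtf * term_idf, "ltfidf": ltf * term_idf,
        }
    return stats

def corpus_stats(docs):
    """Corpus-level sums over a stats corpus (list of term lists)."""
    n_docs = len(docs)
    df, ctf = Counter(), Counter()
    cntf, cwtf = defaultdict(float), defaultdict(float)
    for terms in docs:
        if not terms:               # assumption: skip empty documents
            continue
        counts = Counter(terms)
        max_tf, sum_tf = max(counts.values()), len(terms)
        for term, tf in counts.items():
            df[term] += 1
            ctf[term] += tf
            cntf[term] += tf / max_tf
            cwtf[term] += tf / sum_tf
    return n_docs, df, ctf, cntf, cwtf

# Toy stats corpus and current document, invented for illustration.
n_docs, df, ctf, cntf, cwtf = corpus_stats([["a", "b", "a"], ["b", "c"]])
current = document_stats(["a", "a", "d"], n_docs, df)
print(current["d"]["df"], current["a"]["ntf"])   # prints 0 1.0
print(ctf["a"] / n_docs)                         # actf for "a": prints 1.0
```

The term "d" does not occur in the toy stats corpus, so its df is 0 and its idf is 1 + log(3), the largest value possible for a two-document corpus. A term occurring in every document gets an idf of exactly 1.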
Multi-Threaded Operation
This PR can be safely used in a pipeline that is run in multi-threaded mode, e.g. in GCP, by duplicating the PR using GATE’s duplication mechanism.