CorpusStatsCollocationsPR Processing Resource

This PR calculates various counts and association measures for pairs of terms within contexts in a corpus. These are used to find words that occur together (in the same context) more frequently than would be expected by chance. Contexts can be whole documents, sentences, paragraphs, or sliding windows within a document, within a sentence, etc. Contexts can be restricted further by excluding pairs that have some kind of “split” annotation between them. Together, these options make the calculation of pair statistics very flexible.
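To make the notion of per-context pair counting concrete, here is a minimal Python sketch. It is not the PR's implementation: the function names `count_contexts` and `sliding_windows` are hypothetical, pairs are assumed to be unordered, and each term or pair is assumed to count at most once per context.

```python
from collections import Counter
from itertools import combinations


def count_contexts(contexts):
    """Count in how many contexts each term and each unordered pair occurs.

    Assumption for illustration: a term or pair counts once per context,
    no matter how often it is repeated inside that context.
    """
    term_counts = Counter()
    pair_counts = Counter()
    n_contexts = 0
    for tokens in contexts:
        terms = sorted(set(tokens))  # distinct terms of this context
        n_contexts += 1
        term_counts.update(terms)
        # combinations() over the sorted terms yields each pair once, as (a, b) with a < b
        pair_counts.update(combinations(terms, 2))
    return n_contexts, term_counts, pair_counts


def sliding_windows(tokens, size):
    """Yield overlapping windows of `size` tokens; a short input yields one window."""
    if len(tokens) <= size:
        yield list(tokens)
        return
    for start in range(len(tokens) - size + 1):
        yield tokens[start:start + size]


# Three sentence contexts
sentences = [["strong", "tea", "please"], ["strong", "coffee"], ["strong", "tea"]]
n, term_counts, pair_counts = count_contexts(sentences)
```

To use sliding windows as contexts instead, the windows produced by `sliding_windows` for each sentence or document could be passed to `count_contexts` in place of the sentences themselves.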

Runtime parameters

If the stringFeature parameter is empty, the document text is used, as returned by the gate.Utils.cleanStringFor method. If a feature is specified, the value of that feature is used unchanged, unless the feature is missing. Any white space removal, case folding or similar normalization must therefore already have been carried out!

CAUTION: if the value of the stringFeature contains a tab character, it is also used unchanged, which will corrupt the tab-separated output file.

Files created

The PR creates each of the files described below if a URL is specified for it:

The dataFile: This file is a compressed serialization of the internal data structure and is meant to be used when data needs to be pre-loaded for this PR.

The sumsFile: This is a TSV-format (Tab-Separated-Values) file which contains only the header row and one row with the following values/fields for the whole corpus:

The pairStatsFile: This is a TSV-format file which contains a header row and one row for each distinct pair in the corpus, with the following fields:

Probability estimation

By default the unsmoothed maximum-likelihood (ML) estimate is used:

$$p(a) = \frac{c(a)}{N}$$

where $c(a)$ is the number of contexts the term $a$ occurs in and $N$ is the total number of contexts considered.

If the parameter laplaceCoefficient is set to a value > 0.0, then the probability estimates are calculated using Laplace smoothing. For a smoothing parameter $\alpha$ the estimate is:

$$p(a) = \frac{c(a) + \alpha}{N + \alpha d}$$

where $d$ is the number of different values (terms, pairs) counted. As $\alpha$ grows very large, this estimate approaches the uniform probability $1/d$.
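The two estimates can be sketched in a few lines of Python. This is an illustration, not the PR's code: the function name `estimate_probability` is hypothetical, and `alpha` stands in for the laplaceCoefficient parameter.

```python
def estimate_probability(count, total, n_distinct, alpha=0.0):
    """Estimate the probability of a term or pair from context counts.

    count      -- number of contexts the term/pair occurs in, c(a)
    total      -- total number of contexts considered, N
    n_distinct -- number of different values counted, d
    alpha      -- Laplace smoothing parameter; 0.0 gives the unsmoothed ML estimate
    """
    if alpha > 0.0:
        # Laplace smoothing: add alpha to every one of the d counts
        return (count + alpha) / (total + alpha * n_distinct)
    return count / total
```

For example, with `count=2`, `total=4` and `n_distinct=10`, the unsmoothed estimate is 0.5, while a very large `alpha` pushes the estimate towards 1/10.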

NOTE: Laplace smoothing is not used for, and does not influence, the calculation of the chi-squared statistic, which is based on the raw counts. However, the Laplace-smoothed estimates are used for the t-statistic!

Association measures

The following association measures are calculated:

PMI (point-wise mutual information) and NPMI (normalized PMI)

PMI is calculated as

$$pmi(a,b) = \log \frac{p(a,b)}{p(a)\,p(b)}$$

Note: negative values of pmi are not very useful, which is why ppmi (positive pmi), which is simply $max(pmi,0)$, is often used instead. We output pmi since it is trivial to calculate ppmi from it later.

Normalized PMI is calculated as

$$npmi(a,b) = \frac{pmi(a,b)}{-\log p(a,b)}$$

Normalized PMI ranges from -1 to 1: the value -1 roughly corresponds to “no co-occurrences”, the value 0 to co-occurrence at chance level, and the value 1 to “always co-occurring”.

The exception is when there is only one pair in the corpus: the logarithm of 1 is 0 and npmi would involve a division by zero, so it is set to -1 instead.
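The following Python sketch shows how pmi, ppmi and npmi relate to each other, including the special case where the pair probability is 1. The function names are hypothetical, the natural logarithm is an assumption (the base does not affect npmi), and all probabilities are assumed to be greater than 0.

```python
import math


def pmi(p_ab, p_a, p_b):
    """Point-wise mutual information; natural log is an assumption."""
    return math.log(p_ab / (p_a * p_b))


def ppmi(p_ab, p_a, p_b):
    """Positive PMI: negative values are clipped to 0."""
    return max(pmi(p_ab, p_a, p_b), 0.0)


def npmi(p_ab, p_a, p_b):
    """Normalized PMI: pmi divided by -log p(a,b)."""
    if p_ab >= 1.0:
        # log(1) == 0 would cause a division by zero, so -1 is used instead
        return -1.0
    return pmi(p_ab, p_a, p_b) / -math.log(p_ab)
```

With `p_a = p_b = 0.5`, a pair probability of 0.25 is exactly what independence predicts, so pmi and npmi are both 0; a pair probability of 0.5 means the two terms always occur together and npmi reaches 1.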

Chi-squared statistic and p-value

The chi-squared statistic is provided in order to use the idea of Pearson’s chi-squared test to find out about the association between pairs of terms. For this the number of times a pair (a,b) occurs is used together with the number of times a appears in a context without b, (a,¬b), the number of times b appears in a context without a, (¬a,b), and the number of contexts without either a or b in them, (¬a,¬b). The test statistic chi-squared is calculated as:

The p-value is obtained from the chi-squared distribution with 1 degree of freedom since we have a 2 by 2 table of counts.
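As an illustration, Pearson's chi-squared statistic for the 2 by 2 table can be computed from the raw counts as sketched below. This is a sketch under assumptions, not the PR's code: the function names are hypothetical, the closed form for 2 by 2 tables is used without any continuity correction, and the p-value relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math


def chi_squared_2x2(n_ab, n_a, n_b, n):
    """Pearson's chi-squared statistic for the association of a and b.

    n_ab -- contexts containing both a and b
    n_a  -- contexts containing a
    n_b  -- contexts containing b
    n    -- total number of contexts
    """
    o11 = n_ab                    # (a, b)
    o12 = n_a - n_ab              # (a, not b)
    o21 = n_b - n_ab              # (not a, b)
    o22 = n - n_a - n_b + n_ab    # (not a, not b)
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        # a row or column of the table is empty: the statistic is undefined
        return float("nan")
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def chi_squared_p_value_1df(chi2):
    """Upper-tail p-value of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))
```

In 100 contexts where a and b each occur in 50 contexts, 25 joint contexts give a statistic of 0 (independence), while 50 joint contexts give a statistic of 100.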

Student-T statistic and p-value

The Student-T statistic is provided in order to use the idea behind the t-test for comparing means of two Bernoulli distributions to find out about the association between pairs of terms.

Given the probability $p(a,b)$ of a pair, and the expected probability $p(a)p(b)$ if independent, the test statistic is calculated as

The p-value is obtained from Student's t-distribution with N-1 degrees of freedom.
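One common way to form such a statistic in the collocation literature is sketched below in Python. It is an assumption that the PR uses this exact variance estimate: here the sample variance is taken to be the Bernoulli variance $p(a,b)(1-p(a,b))$ of the observed pair probability, and the function name `t_statistic` is hypothetical. The p-value lookup is omitted because it needs the CDF of Student's t-distribution, which is not part of the Python standard library.

```python
import math


def t_statistic(p_ab, p_a, p_b, n):
    """t-statistic comparing the observed pair probability with p(a) * p(b).

    p_ab, p_a, p_b -- probability estimates (possibly Laplace-smoothed)
    n              -- total number of contexts, N

    Assumption: the variance is estimated as p_ab * (1 - p_ab).
    """
    variance = p_ab * (1.0 - p_ab)
    if variance <= 0.0:
        # p_ab is 0 or 1: the statistic is undefined
        return float("nan")
    expected = p_a * p_b  # pair probability under independence
    return (p_ab - expected) / math.sqrt(variance / n)
```

For `p_a = p_b = 0.5` and `n = 100`, an observed pair probability of 0.25 yields 0, and an observed pair probability of 0.5 yields approximately 5.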

Minimum number of contexts and minTf

Multi-Threaded Operation

This PR can be safely used in a pipeline which is run in multi-threaded mode, e.g. in GCP, by duplicating the PR using GATE’s duplication mechanism.

TODOs or things currently not implemented but maybe later