Feature selection: finding distinctive words

We often want to know what words distinguish one group of texts from another group of texts. For instance, we might be working with an archive of two city newspapers, say, the Frankfurter Allgemeine Zeitung and the Frankfurter Rundschau, and want to know which words tend to appear in one newspaper rather than the other. Or we might be interested in comparing word usage in US Presidents’ State of the Union addresses given during recessions with addresses given during periods of economic growth. Or we might be comparing the style of several novelists and want to know if one author tends to use words not found in the works of others.

This section illustrates how distinctive words can be identified using a corpus of novels containing works by two authors: Jane Austen and Charlotte Brontë.

  • Austen, Emma
  • Austen, Pride and Prejudice
  • Austen, Sense and Sensibility
  • C. Brontë, Jane Eyre
  • C. Brontë, The Professor
  • C. Brontë, Villette

This corpus of six novels consists of the following text files:

In [1]: filenames
Out[1]: 
['Austen_Emma.txt',
 'Austen_Pride.txt',
 'Austen_Sense.txt',
 'CBronte_Jane.txt',
 'CBronte_Professor.txt',
 'CBronte_Villette.txt']

We will find that among the words that reliably distinguish Austen from Brontë are “such”, “could”, and “any”. This tutorial demonstrates how we arrived at these words.

                    all  they  been  miss  every  mrs  very  such  must   so  any
--keyness--         1.0   1.0   1.0   1.0    1.0  1.0   1.0   1.0   1.0  1.0  1.0
Austen_Emma         5.5   3.5   4.9   3.9    2.8  4.5   7.8   3.2   3.7  6.3  4.2
Austen_Pride        5.3   5.1   4.4   2.4    1.7  2.9   4.1   3.3   2.6  5.0  2.3
Austen_Sense        5.7   4.5   3.8   1.8    3.3  4.6   4.3   3.1   2.4  5.6  3.4
CBronte_Jane        3.6   2.9   2.2   1.8    0.6  1.4   2.1   1.5   1.8  3.6  1.2
CBronte_Professor   3.0   2.6   2.2   0.0    0.6  0.3   2.3   1.4   1.0  3.9  1.2
CBronte_Villette    3.9   2.4   1.9   0.8    0.5  0.8   2.1   1.6   1.5  3.8  1.0

Note

The following offers an introduction to the concepts underlying feature selection. Readers working with a very large corpus who are already familiar with statistics may wish to skip ahead to the section Bayesian group comparison or to the section Log likelihood ratio and \(\chi^2\) feature selection.

Since we are concerned with words, we begin by extracting word frequencies from each of the texts in our corpus and construct a document-term matrix that records the rate per 1,000 words for each word appearing in the corpus. Using rates rather than counts will allow us to account for differences in the length of the novels. Accounting for differences in document lengths when dealing with word counts is essential. For example, a text using “whence” ten times in a 1,000 word article uses the word at a rate of 10 per 1,000 words, while a 100,000 word novel that uses “whence” 20 times uses it at a rate of 0.2 per 1,000 words. While the word occurs more in absolute terms in the second text, the rate is higher in the first text. While there are other ways to account for document length—a procedure called “normalization”—considering the rate per 1,000 words will serve us well. An appealing feature of word rates per 1,000 words is that readers are familiar with documents of this length (e.g., a newspaper article).
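The arithmetic of this normalization can be sketched directly. The counts below are the made-up “whence” figures from the paragraph above, not values from our corpus:

```python
import numpy as np

# hypothetical counts: "whence" plus two filler entries per document
counts = np.array([[10, 490, 500],        # a 1,000-word article
                   [20, 49980, 50000]])   # a 100,000-word novel

lengths = counts.sum(axis=1, keepdims=True)  # total words per document
rates = 1000 * counts / lengths              # rate per 1,000 words

print(rates[0, 0])  # 10.0 -- "whence" in the article
print(rates[1, 0])  # 0.2  -- "whence" in the novel
```

Although the novel uses “whence” twice as often in absolute terms, the rate per 1,000 words is fifty times higher in the article.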

In [2]: import os

In [3]: import nltk

In [4]: import numpy as np

In [5]: from sklearn.feature_extraction.text import CountVectorizer

In [6]: filenames_with_path = [os.path.join(CORPUS_PATH, fn) for fn in filenames]

# these texts have underscores ('_') that indicate italics; remove them.
In [7]: raw_texts = []

In [8]: for fn in filenames_with_path:
   ...:     with open(fn) as f:
   ...:         text = f.read()
   ...:         text = text.replace('_', '')  # remove underscores (italics)
   ...:         raw_texts.append(text)
   ...: 

In [9]: vectorizer = CountVectorizer(input='content')

In [10]: dtm = vectorizer.fit_transform(raw_texts)

In [11]: vocab = np.array(vectorizer.get_feature_names_out())  # use get_feature_names() with scikit-learn < 1.0

# fit_transform returns a sparse matrix (which uses less memory)
# but we want to work with a normal numpy array.
In [12]: dtm = dtm.toarray()

# normalize counts to rates per 1000 words
In [13]: rates = 1000 * dtm / np.sum(dtm, axis=1, keepdims=True)

# just examine a sample, those at offsets 100 to 105
In [14]: rates[:, 100:105]
Out[14]: 
array([[ 0.  ,  0.01,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.01,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.01,  0.01,  0.01]])

In [15]: vocab[100:105]
Out[15]: 
array(['abdiel', 'abdy', 'abed', 'aberdeen', 'aberration'], 
      dtype='<U20')
                   abdiel  abdy  abed  aberdeen  aberration
Austen_Emma          0.00  0.01  0.00      0.00        0.00
Austen_Pride         0.00  0.00  0.00      0.00        0.00
Austen_Sense         0.00  0.00  0.00      0.00        0.00
CBronte_Jane         0.00  0.00  0.00      0.00        0.00
CBronte_Professor    0.01  0.00  0.00      0.00        0.00
CBronte_Villette     0.00  0.00  0.01      0.01        0.01

Measuring “distinctiveness”

Finding distinctive words requires a decision about what “distinctive” means. As we will see, there are a variety of definitions that we might use. It seems reasonable to expect that all definitions of distinctive would identify as distinctive words found exclusively in texts associated with a single author (or group). For example, if Brontë uses the word “access” and Austen never does, we should count “access” as distinctive. A more challenging question is how to treat words that occur in both groups of texts but do so with different rates.

Finding words that are unique to a group is a simple exercise. Indeed, it is worth treating these words as a special case so they will not clutter our work later on. We will quickly identify these words and remove them. (They tend not to be terribly interesting words.)

A simple way of identifying words unique to one author would be to calculate the average rate of word use across all texts for each author and then to look for cases where the average rate is zero for one author.

# indices so we can refer to the rows for the relevant author
In [16]: austen_indices, cbronte_indices = [], []

In [17]: for index, fn in enumerate(filenames):
   ....:     if "Austen" in fn:
   ....:         austen_indices.append(index)
   ....:     elif "CBronte" in fn:
   ....:         cbronte_indices.append(index)
   ....: 

# this kind of slicing should be familiar if you've used R or Octave/Matlab
In [18]: austen_rates = rates[austen_indices, :]

In [19]: cbronte_rates = rates[cbronte_indices, :]

# np.mean(..., axis=0) calculates the column-wise mean
In [20]: austen_rates_avg = np.mean(austen_rates, axis=0)

In [21]: cbronte_rates_avg = np.mean(cbronte_rates, axis=0)

# since zero times any number is zero, this identifies words for which
# either author's average rate is zero
In [22]: distinctive_indices = (austen_rates_avg * cbronte_rates_avg) == 0

# examine words that are unique, ranking by rates
In [23]: np.count_nonzero(distinctive_indices)
Out[23]: 14414

In [24]: ranking = np.argsort(austen_rates_avg[distinctive_indices] + cbronte_rates_avg[distinctive_indices])[::-1]  # from highest to lowest; [::-1] reverses order.

In [25]: vocab[distinctive_indices][ranking]
Out[25]: 
array(['elinor', 'emma', 'marianne', ..., 'incautious', 'incedingly',
       'kitten'], 
      dtype='<U20')
         elinor  emma  marianne  darcy  weston  bennet  bingley  monsieur  knightley  elton  dashwood
Austen     1.97  1.86      1.63   1.18    0.95    0.91     0.86      0.00       0.84   0.83      0.72
Brontë     0.00  0.00      0.00   0.00    0.00    0.00     0.00      0.86       0.00   0.00      0.00

Now that we have identified these words, we will remove them from our corpus in order to focus on identifying distinctive words that appear in texts associated with every author.

In [26]: dtm = dtm[:, np.invert(distinctive_indices)]

In [27]: rates = rates[:, np.invert(distinctive_indices)]

In [28]: vocab = vocab[np.invert(distinctive_indices)]

# recalculate variables that depend on rates
In [29]: austen_rates = rates[austen_indices, :]

In [30]: cbronte_rates = rates[cbronte_indices, :]

In [31]: austen_rates_avg = np.mean(austen_rates, axis=0)

In [32]: cbronte_rates_avg = np.mean(cbronte_rates, axis=0)

Differences in averages

How can we identify a distinctive word? One approach would compare the average rate at which authors use a word. A simple quantitative comparison would calculate the difference between the rates. If one author uses a word often across his or her oeuvre and another barely uses the word at all, then we suspect the difference in rates will be large. This will be the first definition of distinctiveness (sometimes called “keyness”) we will consider. Using this measure we can calculate the top ten distinctive words in the Austen-Brontë comparison as follows:

In [33]: import numpy as np

# calculate absolute value because we only care about the magnitude of the difference
In [34]: keyness = np.abs(austen_rates_avg - cbronte_rates_avg)

In [35]: ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences

# print the top 10 words along with their rates and the difference
In [36]: vocab[ranking][0:10]
Out[36]: 
array(['the', 'to', 'her', 'my', 'me', 'be', 'and', 'she', 'mr', 'very'], 
      dtype='<U20')
                     the     to    her     my     me     be    and    she    mr  very   mrs
--keyness--         9.87   7.07   6.98   6.66   6.23   6.03   4.94   4.80  3.70  3.26  3.14
Austen_Emma        33.65  33.90  16.10   4.73   3.77  12.78  31.67  15.29  7.46  7.84  4.53
Austen_Pride       36.67  35.24  18.86   6.04   3.79  10.49  30.35  14.48  6.66  4.12  2.90
Austen_Sense       35.42  35.52  22.01   5.42   3.83  11.22  30.13  13.92  1.54  4.31  4.57
CBronte_Jane       44.44  29.64   9.71  12.65  11.63   5.82  37.53   8.37  3.08  2.14  1.42
CBronte_Professor  45.63  27.91  14.74  13.18   9.07   5.17  34.93  10.45  1.23  2.26  0.32
CBronte_Villette   45.29  25.91  11.59  10.33   9.39   5.39  34.50  10.46  0.25  2.11  0.84

This is a start. The problem with this measure is that it tends to highlight differences in very frequent words. For example, this method gives greater attention to a word that occurs 30 times per 1,000 words in Austen and 25 times per 1,000 in Brontë than it does to a word that occurs 5 times per 1,000 words in Austen and 0.1 times per 1,000 words in Brontë. This does not seem right. It seems important to recognize cases when one author uses a word frequently and another author barely uses it.

As this initial attempt suggests, identifying distinctive words will be a balancing act. When comparing two groups of texts, differences in the rates of frequent words will tend to be large relative to differences in the rates of rarer words. Human language is variable; some words occur more frequently than others regardless of who is writing. We need to find a way of adjusting our definition of distinctive in light of this.

One adjustment that is easy to make is to divide the difference in authors’ average rates by the average rate across all authors. Since dividing a quantity by a large number will make that quantity smaller, our new distinctiveness score will tend to be lower for words that occur frequently. While this is merely a heuristic, it does move us in the right direction.

# we have already calculated the following quantities
# austen_rates_avg
# cbronte_rates_avg
In [37]: rates_avg = np.mean(rates, axis=0)

In [38]: keyness = np.abs(austen_rates_avg - cbronte_rates_avg) / rates_avg

In [39]: ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order.

# print the top 10 words along with their rates and the difference
In [40]: vocab[ranking][0:10]
Out[40]: 
array(['madame', 'elizabeth', 'graham', 'lizzy', 'flowers', 'french',
       'lessons', 'smallest', 'clean', 'movement'], 
      dtype='<U20')
                   madame  elizabeth  graham  lizzy  flowers  french  lessons  smallest  clean  movement  schoolroom
--keyness--          1.99       1.99    1.98   1.97     1.96    1.95     1.95      1.94   1.94      1.94        1.94
Austen_Emma          0.01       0.05    0.01   0.00     0.01    0.01     0.01      0.13   0.01      0.01        0.01
Austen_Pride         0.00       5.38    0.00   0.81     0.00    0.01     0.00      0.09   0.00      0.00        0.00
Austen_Sense         0.00       0.00    0.00   0.00     0.00    0.00     0.00      0.17   0.00      0.00        0.00
CBronte_Jane         0.07       0.01    0.00   0.01     0.18    0.23     0.06      0.01   0.16      0.11        0.19
CBronte_Professor    0.56       0.00    0.00   0.00     0.21    0.71     0.31      0.00   0.14      0.19        0.14
CBronte_Villette     1.92       0.01    1.22   0.00     0.21    0.32     0.15      0.00   0.12      0.12        0.06

This method improves on our initial attempt. It has the virtue of being simple and easy to implement. Yet it has its flaws. For example, the method tends to overemphasize very rare words.

Just as there are many definitions of “similarity” or “distance” available to compare two texts (see Working with text), there are many definitions of distinctive that can be used to identify words that characterize a group of texts.

Note

While we used the absolute value of the difference in average rates, \(|x-y|\), we might just as easily have used the squared difference, \((x-y)^2\), as it has similar properties (always non-negative, increasing as the difference increases).
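Because squaring is monotone for non-negative quantities, the two choices produce the same ranking of words. A quick check with synthetic rates (not corpus data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)  # hypothetical average rates for one author
y = rng.random(100)  # hypothetical average rates for the other

abs_keyness = np.abs(x - y)
sq_keyness = (x - y) ** 2

# both keyness measures order the "words" identically
assert np.array_equal(np.argsort(abs_keyness), np.argsort(sq_keyness))
```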

Bayesian group comparison

Note

The following sections assume some familiarity with statistics and probability. Introductory texts include [CB01], [Hof09], and [Lee04].

Note

The following excursion into the world of Bayesian inference and Gibbs sampling is closely related to topic modeling and Latent Dirichlet Allocation (LDA). The inference for the model discussed below proceeds using a Gibbs sampler that draws from the full conditional distribution of each variable of interest; precisely the same procedure is used in LDA.

A more nuanced comparison of word use in two groups takes account of the variability in word use. Consider for instance the word “green” in Austen and Brontë. In Austen the word occurs with the following rates: 0.01, 0.03, and 0.06 (0.03 on average). In Brontë the word is consistently more frequent: 0.16, 0.36, and 0.22 (0.24 on average). These two groups of rates look different. But consider how our judgment might change if the rates observed in Brontë’s novels were much more variable, say, 0.03, 0.04, and 0.66 (0.24 on average). Although the averages remain the same, the difference does not seem so pronounced; with only one observation (0.66) noticeably greater than we find in Austen, we might reasonably doubt that there is evidence of a systematic difference between the authors. [1]
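One informal way to see this, using the made-up rates from the example above, is to scale the difference in group means by the within-group spread (a t-statistic-like ratio; this is an illustration, not part of the method developed below):

```python
import numpy as np

def scaled_difference(y1, y2):
    """Difference in group means relative to the within-group spread."""
    pooled_sd = np.sqrt((np.var(y1, ddof=1) + np.var(y2, ddof=1)) / 2)
    return (np.mean(y2) - np.mean(y1)) / pooled_sd

austen = np.array([0.01, 0.03, 0.06])
bronte_consistent = np.array([0.16, 0.36, 0.22])
bronte_variable = np.array([0.03, 0.04, 0.66])

# same difference in averages, but the evidence is much weaker
# when the Brontë rates are highly variable
print(scaled_difference(austen, bronte_consistent))
print(scaled_difference(austen, bronte_variable))
```

The first ratio is several times larger than the second, even though the raw difference in averages is the same in both cases.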

One way of formalizing a comparison of two groups that takes account of the variability of word usage comes from Bayesian statistics. To describe our beliefs about the word frequencies we observe, we use a probability distribution, which we will call our sampling model. Under the model we will use, the rates are assumed to come from two different normal distributions. The question we will be asking is how confident we are that the means of the two normal distributions are different. How confident we are (expressed as a probability) that the means are indeed different will stand in as our measure of distinctiveness.

We will use the parameterization below for our two normal sampling distributions. Group 1 corresponds to Austen and group 2 corresponds to Brontë:

\[Y_{i,1} = \mu + \delta + \epsilon_{i,1}\]\[Y_{i,2} = \mu - \delta + \epsilon_{i,2}\]\[\{\epsilon_{i,j}\} \sim \textrm{i.i.d.} \; \textrm{Normal}(0, \sigma^2)\]\[i = 1, 2, 3\]

(i.i.d. stands for independently and identically distributed)

It is easy to relate this parameterization back to two normal distributions. Austen’s texts come from a normal distribution with mean parameter \(\theta_1 = \mu + \delta\) and variance \(\sigma^2\), whereas Brontë’s novels come from a distribution with the same variance and with mean parameter \(\theta_2 = \mu - \delta\). \(\delta\) corresponds to half the difference between the two means and it is through this parameter that we will judge how confident we are of a difference between the two distributions.
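A brief simulation can confirm this reading of the parameterization; the values of \(\mu\), \(\delta\), and \(\sigma\) below are arbitrary illustrative choices, not estimates from the corpus:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, delta, sigma = 2.0, 0.5, 0.3  # illustrative values only

# simulate many observations from each group
y1 = mu + delta + rng.normal(0, sigma, size=100_000)  # group 1 (Austen)
y2 = mu - delta + rng.normal(0, sigma, size=100_000)  # group 2 (Brontë)

# the group means recover theta_1 = mu + delta and theta_2 = mu - delta,
# and half the difference of the means recovers delta
print(y1.mean())                    # ~2.5
print(y2.mean())                    # ~1.5
print((y1.mean() - y2.mean()) / 2)  # ~0.5
```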

As we consider the question of what prior distributions to assign to \(\mu\), \(\delta\), and \(\sigma^2\), we need to keep in mind that the word rates must be positive even though we are using normal distributions (which will always assign some, potentially quite small, probability to negative values). A compromise that will allow us to make use of computationally convenient conjugate prior distributions will be to use normal prior distributions that favor positive values in most cases. As we will be modeling more than ten thousand vocabulary items, computational speed will be important. These are the prior distributions that we will use:

\[\mu \sim \textrm{Normal}(\mu_0, \gamma_0^2)\]\[\delta \sim \textrm{Normal}(\delta_0, \tau_0^2)\]\[\sigma^2 \sim \textrm{Inverse-Gamma}(\nu_0/2, \nu_0\sigma_0^2/2)\]

We need to determine suitable values for the priors’ parameters (called hyperparameters): \(\mu_0\), \(\gamma_0^2\), \(\delta_0\), \(\tau_0^2\), \(\nu_0\), and \(\sigma_0^2\). Let us consider \(\mu_0\) and \(\gamma_0^2\) first. While words like “the” and “she” occur quite frequently, almost all words (>99%) occur less than four times per 1,000 words:

In [41]: np.mean(rates < 4)
Out[41]: 0.99578709412248256

In [42]: np.mean(rates > 1)
Out[42]: 0.015556925606247432

In [43]: from scipy.stats.mstats import mquantiles  # analogous to R's quantile()

In [44]: mquantiles(rates, prob=[0.01, 0.5, 0.99])
Out[44]: array([ 0.  ,  0.01,  1.63])

In keeping with this observation we will set \(\mu_0\) to be 3 and \(\gamma_0^2\) to be \(1.5^2\), with the reasoning that when drawing from a normal distribution, the great majority (.95) of observations will fall within two standard deviations of the mean. There isn’t tremendous variability in rates across the works of a single author, so we will set \(\sigma_0^2\) to be 1 and \(\nu_0\) to be 1. (If we were to use non-conjugate priors we could model our prior beliefs about rates more realistically.) We know there is considerable variability in the rates between authors, so we will center the prior for \(\delta\) at \(\delta_0 = 0\) and choose \(\tau_0^2\) to be \(1.5^2\): since \(\delta\) represents half the difference between the means, its value is unlikely to be greater than 3 in absolute value.
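We can sanity-check this reasoning with scipy: a Normal(3, 1.5²) prior places roughly 95% of its mass between 0 and 6, and a Normal(0, 1.5²) prior on \(\delta\) makes values greater than 3 in absolute value unlikely:

```python
from scipy.stats import norm

# prior on mu: Normal(loc=3, scale=1.5); mass between 0 and 6
mass_mu = norm(3, 1.5).cdf(6) - norm(3, 1.5).cdf(0)
print(mass_mu)  # ~0.954

# prior on delta: Normal(loc=0, scale=1.5); probability that |delta| > 3
tail_delta = 2 * norm(0, 1.5).sf(3)
print(tail_delta)  # ~0.046
```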

With these conjugate priors it is possible to use a Gibbs sampler to sample efficiently from the posterior distribution, using the full conditional distributions for the parameters of interest [Hof09]:

\[\begin{split}\{\mu|\mathbf{y_1}, \mathbf{y_2}, \delta, \sigma^2\} &\sim \textrm{Normal}(\mu_n, \gamma_n^2)\\ \mu_n &= \gamma_n^2 \times [\mu_0/\gamma_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - \delta)/\sigma^2 + \sum_{i=1}^{n_2} (y_{i,2} + \delta)/\sigma^2 ] \\ \gamma_n^2 &= [1/\gamma_0^2 + (n_1+n_2)/\sigma^2]^{-1} \\\end{split}\]\[\begin{split}\{\delta|\mathbf{y_1}, \mathbf{y_2}, \mu, \sigma^2\} &\sim \textrm{Normal}(\delta_n, \tau_n^2)\\ \delta_n &= \tau_n^2 \times [ \delta_0/\tau_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - \mu)/\sigma^2 - \sum_{i=1}^{n_2} (y_{i,2} - \mu)/\sigma^2 ]\\ \tau_n^2 &= [1/\tau_0^2 + (n_1+n_2)/\sigma^2]^{-1} \\\end{split}\]\[\begin{split}\{\sigma^2|\mathbf{y_1}, \mathbf{y_2}, \delta, \mu\} &\sim \textrm{Inverse-Gamma}(\nu_n/2, \nu_n\sigma_n^2/2)\\ \nu_n &= \nu_0 + n_1 + n_2 \\ \nu_n\sigma_n^2 &= \nu_0\sigma_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - (\mu+\delta))^2 + \sum_{i=1}^{n_2} (y_{i,2} - (\mu - \delta))^2 \\\end{split}\]

In Python, we can wrap the Gibbs sampler in a single function and use it to obtain a distribution of posterior values for \(\delta\), which is the variable we care about in this context, as it characterizes our belief about the difference in authors’ word usage.

In [45]: def sample_posterior(y1, y2, mu0, sigma20, nu0, delta0, gamma20, tau20, S):
   ....:     """Draw samples from posterior distribution using Gibbs sampling
   ....:     Parameters
   ....:     ----------
   ....:     `S` is the number of samples
   ....:     Returns
   ....:     -------
   ....:     chains : dict of array
   ....:         Dictionary has keys: 'mu', 'delta', and 'sigma2'.
   ....:     """
   ....:     n1, n2 = len(y1), len(y2)
   ....:     mu = (np.mean(y1) + np.mean(y2))/2
   ....:     delta = (np.mean(y1) - np.mean(y2))/2
   ....:     chains = {key: np.empty(S) for key in ['mu', 'delta', 'sigma2']}
   ....:     for s in range(S):
   ....:         a = (nu0+n1+n2)/2
   ....:         b = (nu0*sigma20 + np.sum((y1-mu-delta)**2) + np.sum((y2-mu+delta)**2))/2
   ....:         sigma2 = 1 / np.random.gamma(a, 1/b)
   ....:         mu_var = 1/(1/gamma20 + (n1+n2)/sigma2)
   ....:         mu_mean = mu_var * (mu0/gamma20 + np.sum(y1-delta)/sigma2 +
   ....:                             np.sum(y2+delta)/sigma2)
   ....:         mu = np.random.normal(mu_mean, np.sqrt(mu_var))
   ....:         delta_var = 1/(1/tau20 + (n1+n2)/sigma2)
   ....:         delta_mean = delta_var * (delta0/tau20 + np.sum(y1-mu)/sigma2 -
   ....:                                 np.sum(y2-mu)/sigma2)
   ....:         delta = np.random.normal(delta_mean, np.sqrt(delta_var))
   ....:         chains['mu'][s] = mu
   ....:         chains['delta'][s] = delta
   ....:         chains['sigma2'][s] = sigma2
   ....:     return chains
   ....: 
# data
In [46]: word = "green"

In [47]: y1, y2 = austen_rates[:, vocab == word], cbronte_rates[:, vocab == word]

# prior parameters
In [48]: mu0 = 3

In [49]: tau20 = 1.5**2

In [50]: nu0 = 1

In [51]: sigma20 = 1

In [52]: delta0 = 0

In [53]: gamma20 = 1.5**2

# number of samples
In [54]: S = 2000

In [55]: chains = sample_posterior(y1, y2, mu0, sigma20, nu0, delta0, gamma20, tau20, S)

In [56]: delta = chains['delta']

These samples reflect what our belief about \(\delta\) ought to be, given our prior specification and the observed rates. Our interest is in \(\delta\), which represents half the difference between the population means for the distributions characterizing word rates in Austen and Brontë. We aren’t concerned with whether it is negative or positive, but we do care whether it is likely to be zero (implying no difference in means), so we need a measure of how confident we are that \(\delta\) is something other than zero. If, for instance, the samples of \(\delta\) tend to be negative, we want to know the posterior probability that \(\delta\) is less than zero, \(\textrm{p}(\delta < 0)\). This probability can be estimated from the output of the Gibbs sampler: it is simply the proportion of sampled values that fall below zero. The following demonstrates the calculation of this probability for two different words, ‘green’ and ‘dark’, both words more characteristic of the Brontë novels than the Austen novels.

In [57]: y1 = austen_rates[:, vocab == 'green']

In [58]: y2 = cbronte_rates[:, vocab == 'green']

In [59]: chains = sample_posterior(y1, y2, mu0, sigma20, nu0, delta0, gamma20, tau20, S)

In [60]: delta_green = chains['delta']

In [61]: y1 = austen_rates[:, vocab == 'dark']

In [62]: y2 = cbronte_rates[:, vocab == 'dark']

In [63]: chains = sample_posterior(y1, y2, mu0, sigma20, nu0, delta0, gamma20, tau20, S)

In [64]: delta_dark = chains['delta']

# estimate of p(delta < 0)
In [65]: np.mean(delta_dark < 0)
Out[65]: 0.86499999999999999

In [66]: words = ['dark', 'green']

In [67]: ix = np.in1d(vocab, words)

In [68]: keyness = np.asarray([np.mean(delta_dark < 0), np.mean(delta_green < 0)])
                dark  green
p(delta<0)      0.86   0.71
Austen average  0.03   0.03
Brontë average  0.47   0.24

As ‘dark’ is more distinctive of Brontë than ‘green’ is, the probabilities (our measure of distinctiveness or keyness) reflect this.

If we want to apply this “feature selection” method en masse to every word occurring in the corpus, we need only write one short loop and make an adjustment for the fact that we don’t care whether or not \(\delta\) is positive or negative:

# fewer samples to speed things up, this may take several minutes to run
In [69]: S = 200

In [70]: def delta_confidence(rates_one_word):
   ....:     austen_rates = rates_one_word[0:3]
   ....:     bronte_rates = rates_one_word[3:6]
   ....:     chains = sample_posterior(austen_rates, bronte_rates, mu0, sigma20, nu0,
   ....:                               delta0, gamma20, tau20, S)
   ....:     delta = chains['delta']
   ....:     return np.max([np.mean(delta < 0), np.mean(delta > 0)])
   ....: 
# apply the function over all columns
In [117]: keyness = np.apply_along_axis(delta_confidence, axis=0, arr=rates)

In [71]: ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order.

# print the top 10 words along with their rates and the difference
In [72]: vocab[ranking][0:10]
Out[72]: 
array(['all', 'they', 'been', 'miss', 'every', 'mrs', 'very', 'such',
       'must', 'so'], 
      dtype='<U20')
                    all  they  been  miss  every  mrs  very  such  must   so  any
--keyness--         1.0   1.0   1.0   1.0    1.0  1.0   1.0   1.0   1.0  1.0  1.0
Austen_Emma         5.5   3.5   4.9   3.9    2.8  4.5   7.8   3.2   3.7  6.3  4.2
Austen_Pride        5.3   5.1   4.4   2.4    1.7  2.9   4.1   3.3   2.6  5.0  2.3
Austen_Sense        5.7   4.5   3.8   1.8    3.3  4.6   4.3   3.1   2.4  5.6  3.4
CBronte_Jane        3.6   2.9   2.2   1.8    0.6  1.4   2.1   1.5   1.8  3.6  1.2
CBronte_Professor   3.0   2.6   2.2   0.0    0.6  0.3   2.3   1.4   1.0  3.9  1.2
CBronte_Villette    3.9   2.4   1.9   0.8    0.5  0.8   2.1   1.6   1.5  3.8  1.0

This produces a useful ordering of characteristic words. Unlike the frequentist methods discussed below (the \(\chi^2\) test and the log likelihood ratio), this approach takes into account the variability of observations within groups. It will also work for small corpora, provided useful prior information is available. To the extent that we are interested in a close reading of differences in vocabulary use, the Bayesian method should be preferred. [2]

Log likelihood ratio and \(\chi^2\) feature selection

We can recast our discussions about measuring distinctiveness in terms of hypothesis testing. This turns out to be a satisfying way of thinking about the problem and it also allows us to introduce a range of feature selection methods, including the log likelihood test and the \(\chi^2\) test.

One hypothesis that we might test comes as no surprise: rather than two groups of texts characterized by different word rates, this hypothesis claims that there is, in fact, a single group. Words are examined one at a time; those words for which this hypothesis seems most wrong will be counted as distinctive (classical statistics is always a workout in counterfactual language).

Consider again the word “green”. Taking all the Austen texts together, the word “green” occurs 11 times out of ~370,000 words (0.03 per 1,000 words). In the novels by Brontë, “green” occurs 96 times out of ~400,000 (0.24 per 1,000 words). We do not really need statistics to tell us that this is a large difference: picking a word at random from each author-specific corpus, one is roughly eight times more likely to draw “green” from the Brontë corpus. To summarize the appearance of the word “green” we may assemble a contingency table with the following code:

In [73]: green_austen = np.sum(dtm[austen_indices, vocab == "green"])

In [74]: nongreen_austen = np.sum(dtm[austen_indices, :]) - green_austen

In [75]: green_cbronte = np.sum(dtm[cbronte_indices, vocab == "green"])

In [76]: nongreen_cbronte = np.sum(dtm[cbronte_indices, :]) - green_cbronte

In [77]: green_table = np.array([[green_austen, nongreen_austen],
   ....:                         [green_cbronte, nongreen_cbronte]])
   ....: 

In [78]: green_table
Out[78]: 
array([[    11, 374596],
       [    96, 404720]], dtype=int64)
           "green"  not "green"
Austen          11       374596
C. Brontë       96       404720

The hypothesis being tested is that the grouping of the counts by author is unnecessary, that \(P(word = "green" | author = "Austen") = P(word = "green" | author != "Austen")\). If this were the case, the rate of “green” would be the same in both groups, namely the corpus-wide rate of 0.14 per 1,000 words, and we would anticipate seeing the following frequencies given the total number of words in each group of texts:

In [79]: prob_green = np.sum(dtm[:, vocab == "green"]) / np.sum(dtm)

In [80]: prob_notgreen = 1 - prob_green

# an array (rather than a list) lets us use the elementwise comparisons below
In [81]: labels = np.array(["Austen" if "Austen" in fn else "CBrontë" for fn in filenames])

In [83]: n_austen = np.sum(dtm[labels == "Austen", :])

In [84]: n_cbronte = np.sum(dtm[labels != "Austen", :])

In [85]: expected_table = np.array([[prob_green * n_austen, prob_notgreen * n_austen],
   ....:                            [prob_green * n_cbronte, prob_notgreen * n_cbronte]])
   ....: 

In [86]: expected_table
Out[86]: 
array([[  5.14e+01,   3.75e+05],
       [  5.56e+01,   4.05e+05]])

# the same contingency table, computed more concisely and more generally
# (this mirrors the internal computation of sklearn.feature_selection.chi2)
In [87]: from sklearn.preprocessing import LabelBinarizer

In [88]: X = dtm[:, vocab == "green"]

In [89]: X = np.append(X, np.sum(dtm[:, vocab != "green"], axis=1, keepdims=True), axis=1)

In [90]: y = LabelBinarizer().fit_transform(labels)

In [91]: y = np.append(1 - y, y, axis=1)

In [92]: green_table = np.dot(y.T, X)

In [93]: green_table
Out[93]: 
array([[    11, 374596],
       [    96, 404720]], dtype=int64)

In [94]: feature_count = np.sum(X, axis=0, keepdims=True)

In [95]: class_prob = np.mean(y, axis=0, keepdims=True)

In [96]: expected_table = np.dot(class_prob.T, feature_count)

In classical statistics, hypothesis tests typically have a quantity called a test statistic associated with them. If the test statistic is greater than a critical value the hypothesis is rejected. In this case, the test statistic is identical with our measure of distinctiveness. The test commonly used to analyze the present hypothesis (that two distinct groups are unnecessary) is the log likelihood ratio test, and its statistic is called the log likelihood ratio (alternatively a G-test statistic or Dunning log likelihood [Dun93]). Various symbols are associated with this statistic, including \(G\), \(G^2\), \(l\), and \(\lambda\). (The theoretical underpinnings of the log likelihood ratio test and its application to corpus analysis are covered in chapter 8 of Casella and Berger (2001) and Dunning (1993) [CB01] [Dun93].)

The log likelihood ratio is calculated as follows:

\[G = 2 \sum_i O_i \ln \frac{O_i}{E_i}\]

where \(i\) indexes the cells of the table, \(O_i\) is the observed count, and \(E_i\) the expected count. (Note the similarity of this formula to the calculation of mutual information.) In Python:

In [97]: G = 2 * np.sum(green_table * np.log(green_table / expected_table))

The higher the value of the test statistic, the more pronounced the deviation from the hypothesis is, and, for our purposes, the more “distinctive” the word.
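As a cross-check (not part of the original session), scipy implements the same statistic: scipy.stats.chi2_contingency with lambda_="log-likelihood" performs the G-test. Using the observed counts for “green” from the table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[11, 374596],
                     [96, 404720]])

# expected counts under the single-group hypothesis:
# (row total x column total) / grand total
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True) / observed.sum())

G = 2 * np.sum(observed * np.log(observed / expected))

# scipy's G-test; correction=False disables the Yates continuity correction
G_scipy, p, dof, expected_scipy = chi2_contingency(
    observed, lambda_="log-likelihood", correction=False)

print(G, G_scipy)  # the two calculations agree
```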

Pearson’s \(\chi^2\) test statistic, \(\sum_i (O_i - E_i)^2 / E_i\), approximates the log likelihood ratio statistic (\(\chi^2\) is read chi-squared) and is computationally cheaper to calculate. The Python library scikit-learn provides a function sklearn.feature_selection.chi2 that allows us to use this test statistic as a feature selection method:

In [98]: from sklearn.feature_selection import chi2

In [99]: labels = []

In [100]: for fn in filenames:
   .....:     label = "Austen" if "Austen" in fn else "CBrontë"
   .....:     labels.append(label)
   .....: 

# chi2 returns two arrays, the chi2 test statistic and an
# array of "p-values", which we'll ignore
In [101]: keyness, _ = chi2(dtm, labels)

In [102]: ranking = np.argsort(keyness)[::-1]

In [103]: vocab[ranking][0:10]
Out[103]: 
array(['me', 'my', 'the', 'mr', 'mrs', 'elizabeth', 'be', 'and', 'very',
       'every'], 
      dtype='<U20')
                       me      my     the     mr    mrs  elizabeth     be    and   very  every  harriet
--keyness--        1570.7  1379.0  1220.0  722.4  648.7      634.1  606.1  563.0  489.7  469.5    458.2
Austen_Emma           3.8     4.7    33.7    7.5    4.5        0.1   12.8   31.7    7.8    2.8      3.3
Austen_Pride          3.8     6.0    36.7    6.7    2.9        5.4   10.5   30.4    4.1    1.7      0.0
Austen_Sense          3.8     5.4    35.4    1.5    4.6        0.0   11.2   30.1    4.3    3.3      0.0
CBronte_Jane         11.6    12.7    44.4    3.1    1.4        0.0    5.8   37.5    2.1    0.6      0.0
CBronte_Professor     9.1    13.2    45.6    1.2    0.3        0.0    5.2   34.9    2.3    0.6      0.0
CBronte_Villette      9.4    10.3    45.3    0.2    0.8        0.0    5.4   34.5    2.1    0.5      0.1

Note

Logarithms are expensive. Calculating the log likelihood ratio over a vocabulary of 10,000 words involves taking 40,000 logarithms (one per cell of each word’s 2×2 table). The \(\chi^2\) test statistic, by contrast, involves taking the same number of squares. On my computer, taking the logarithms takes roughly twenty-five times longer than taking the squares (simple multiplication):

In [104]: import timeit

In [105]: time_log = timeit.timeit("np.log(np.arange(40000))", setup="import numpy as np", number=100)

In [106]: time_square = timeit.timeit("np.square(np.arange(40000))", setup="import numpy as np", number=100)

In [107]: time_log / time_square
Out[107]: 25.161728622535968

Mutual information feature selection

Feature selection based on mutual information also delivers good results. Good introductions to the method can be found in Cosma Shalizi’s Data Mining course (Finding Informative Features) and in section 13.5 in [MRS08].
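As a minimal sketch (assuming scikit-learn, already used above), sklearn.feature_selection.mutual_info_classif scores each feature by its estimated mutual information with the class labels; the counts below are invented for illustration:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# toy document-term matrix (counts invented for illustration):
# word 0 separates the two groups; word 1 is uniform and uninformative
X = np.array([[5, 2],
              [6, 2],
              [0, 2],
              [1, 2]])
y = np.array(["Austen", "Austen", "CBronte", "CBronte"])

# treat the raw counts as discrete (categorical) features
mi = mutual_info_classif(X, y, discrete_features=True)
```

The uninformative second word receives a score of (approximately) zero, while the first word’s score approaches the entropy of the labels.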

Feature selection as exploratory data analysis

If nothing else, studying methods of feature selection forces us to think critically about what we mean when we say some characteristic is “distinctive”.

In practice, these methods let us quickly identify features (when they exist) that appear more or less often in one group of texts. As such, these methods are useful for dimensionality reduction and exploratory data analysis. For example, if we suspect that there is a meaningful partition of a collection of texts, we can use one of the methods described above to pull out features that characterize the proposed groups of texts and explore whether those features make sense given other information. Or we may be confronted with a massive dataset—perhaps all 1-, 2-, and 3-grams in the corpus—and need to reduce the space of features so that our analyses can run on a computer with limited memory.
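scikit-learn also packages this keep-the-top-k-features pattern as sklearn.feature_selection.SelectKBest; a minimal sketch with an invented stand-in for the document-term matrix (in the session above one would pass dtm and labels with, say, k=50):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# invented stand-ins for the dtm and labels built earlier
dtm_toy = np.array([[4, 0, 7],
                    [5, 1, 6],
                    [0, 6, 7],
                    [1, 5, 8]])
labels_toy = ["Austen", "Austen", "CBronte", "CBronte"]

# keep only the k highest-scoring columns by the chi2 statistic
selector = SelectKBest(chi2, k=2)
dtm_reduced = selector.fit_transform(dtm_toy, labels_toy)
```

dtm_reduced has shape (4, 2); selector.get_support(indices=True) reports which columns survived (here the first two words, since the third is spread almost evenly across groups).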

Feature selection needs to be used with care when working with a small number of observations and a relatively large number of features—e.g., a corpus of a small number of documents and a very large vocabulary. Feature selection is perfectly capable of pulling out features that appear characteristic of any division of texts, even an arbitrary one.

Note

The shorthand \(n \ll p\) is used to describe situations where the number of variables greatly exceeds the number of observations; \(n\) is the customary label for the number of observations and \(p\) for the number of covariates.
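This caution is easy to demonstrate: applied to pure noise, \(\chi^2\) will still rank some features as apparently characteristic of an arbitrary split. A minimal sketch using simulated counts (all values below are invented/simulated):

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(1)

# 6 "documents" of simulated word counts over a 1,000-word vocabulary
X = rng.poisson(5, size=(6, 1000))
# an arbitrary split into two groups, carrying no real information
y = ["a", "a", "a", "b", "b", "b"]

keyness, _ = chi2(X, y)
top = np.argsort(keyness)[::-1][:10]
```

Even though the split is meaningless, keyness[top] contains clearly nonzero scores—words that look “characteristic” of the groups purely by chance.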

A brief demonstration that feature selection “works” as expected can be had by plotting the cosine distances among texts in the corpus before and after feature selection is applied. chi2 is the feature selection method used for the bottom figure, keeping the top 50 words.

In [108]: import matplotlib.pyplot as plt

In [109]: from sklearn.metrics.pairwise import cosine_similarity

In [110]: from sklearn.manifold import MDS

In [111]: dist = 1 - cosine_similarity(dtm)

In [112]: mds = MDS(n_components=2, dissimilarity="precomputed")

In [113]: pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

In [114]: xs, ys = pos[:, 0], pos[:, 1]

In [115]: names = [fn.replace('.txt', '') for fn in filenames]

In [116]: for x, y, name in zip(xs, ys, names):
   .....:     color = 'orange' if "Austen" in name else 'skyblue'
   .....:     plt.scatter(x, y, c=color)
   .....:     plt.text(x, y, name)
   .....: 

In [117]: plt.title("Before feature selection")
Out[117]: <matplotlib.text.Text at 0x2b10c10049b0>
_images/plot_feature_selection_mds_before.png
In [118]: keyness, _ = chi2(dtm, labels)

In [119]: selected = np.argsort(keyness)[::-1][0:50]

In [120]: dtm_chi2 = dtm[:, selected]

In [121]: dist = 1 - cosine_similarity(dtm_chi2)

In [122]: mds = MDS(n_components=2, dissimilarity="precomputed")

In [123]: pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

In [124]: xs, ys = pos[:, 0], pos[:, 1]

In [125]: for x, y, name in zip(xs, ys, names):
   .....:     color = 'orange' if "Austen" in name else 'skyblue'
   .....:     plt.scatter(x, y, c=color)
   .....:     plt.text(x, y, name)
   .....: 

In [126]: plt.title("After feature selection")
Out[126]: <matplotlib.text.Text at 0x2b10c10cd860>
_images/plot_feature_selection_mds_after.png

Exercises

  1. Using the two groups of texts (Austen and C. Brontë), find the top 40 characteristic words by the \(\chi^2\) statistic. Feel free to use scikit-learn’s chi2.
  2. The following is a random partition of the texts. Find the top 40 characteristic words by the \(\chi^2\) statistic. How do these compare with those you found in exercise 1?
In [127]: import random

In [128]: random.seed(1)

In [129]: shuffled = filenames.copy()

In [130]: random.shuffle(shuffled)

In [131]: group_a = shuffled[:len(filenames)//2]

In [132]: group_b = shuffled[len(filenames)//2:]

In [133]: group_a
Out[133]: ['Austen_Sense.txt', 'CBronte_Jane.txt', 'CBronte_Villette.txt']

In [134]: group_b
Out[134]: ['Austen_Emma.txt', 'CBronte_Professor.txt', 'Austen_Pride.txt']
  3. Reconstruct the corpus using only these 40 words. Find the cosine distances between pairs of texts and visualize these using multi-dimensional scaling (see Working with text for a refresher). Compare this plot to the MDS plot of the distances between texts using the full vocabulary.
[1] Unexpected spikes in word use happen all the time. Word usage in a large corpus is notoriously bursty [CG95]. Consider, for example, ten French novels, one of which is set in Lyon. While “Lyon” might appear in all the novels, it would appear much (much) more frequently in the novel set in that city.
[2] Ted Underwood has written a blog post discussing some of the drawbacks of using the log likelihood and chi-squared test statistics in the context of literary studies.