forked from ariddell/tatom
Commit
Allen Riddell committed Jan 19, 2014 (1 parent 491805e, commit 8b8ce6c)
Showing 1 changed file with 344 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,344 @@ | ||
.. _classification-machine-learning: | ||
|
||
=========================================================== | ||
Classification, Machine Learning, and Logistic Regression | ||
=========================================================== | ||
|
||
Previous tutorials have illustrated how probabilistic topic models can be used
to navigate a large corpus. As general-purpose tools for identifying recurrent
themes in a corpus, topic models and non-negative matrix factorization are
useful. They perform better than methods previously used for similar
purposes, such as principal component analysis (PCA) and latent semantic
analysis (LSA). For tasks such as classifying texts into a known set of
categories, however, there exist methods that are better suited to the problem.
One family of such methods goes under the heading of neural networks (or, more
recently, "deep learning"). An essential conceptual and practical building
block for these methods is logistic regression, which we will review briefly in
this tutorial.
|
||
.. note:: Discussion of the role of logistic regression in neural networks may
   be found in section 5.1 of Bishop (2007) :cite:`bishop_pattern_2007`.
|
||
Predicting genre classifications | ||
================================ | ||
|
||
The bag-of-words model is a horrible model of a text. Its failure to distinguish | ||
word order ('the cat ate the fish' from 'the fish ate the cat') is the least of | ||
its failings. In most cases, knowing the frequency with which a word occurs in | ||
a text tells us very little. Without additional context it is | ||
difficult to know how to interpret a word's frequency. For example, the word | ||
'heart' might occur in a discussion of courtly love, of physical exercise, or in | ||
a cookbook (e.g., "heart of palm"). And even when a word seems to have a single | ||
interpretation, its meaning may depend on words occurring around it. | ||
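
The word-order problem mentioned above is easy to see in a few lines of code.
The following is a minimal sketch using only the Python standard library; the
helper ``bag_of_words`` is a hypothetical name introduced here for illustration
and is not part of the tutorial's pipeline (which uses ``CountVectorizer``).

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each whitespace-separated token occurs in text."""
    return Counter(text.lower().split())


a = bag_of_words("the cat ate the fish")
b = bag_of_words("the fish ate the cat")

# Both sentences reduce to the same counts, so the representation
# cannot tell who ate whom.
print(a == b)
```

Once the sentences are reduced to counts, nothing remains that could
distinguish them.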
|
||
Nevertheless, sometimes the frequency of words appears to be correlated with | ||
useful information, such as pre-existing classifications (or classifications in | ||
which we happen to believe). Consider the word "ennemis" ("enemies") in the | ||
context of a corpus of :ref:`French classical theatre <datasets>`. This corpus | ||
includes only plays classified as tragedy or comedy. The word "ennemis" is not, | ||
at first glance, a word particularly troubled by problems of polysemy. | ||
Considered as an indicator of whether or not a play is a tragedy or a comedy, | ||
the frequency of "ennemis" seems to be a reliable guide; the word tends to occur | ||
more often in tragedies. | ||
|
||
The first way we can verify this is simply to calculate the percentage of plays | ||
classified as tragedy in which the word 'ennemis' occurs and compare that | ||
percentage with the corresponding percentage for comedies. As usual, in order to | ||
have a better sense of the variability of language in French classical theatre, | ||
we have split the plays into approximately 1,000-word sections. | ||
|
||
.. ipython:: python

    import os
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    data_dir = 'data/french-tragedies-and-comedies-split/'
    filenames = np.array(os.listdir(data_dir))
    filenames_with_path = [os.path.join(data_dir, fn) for fn in filenames]
    # tragedies and comedies are coded with 'TR' or 'CO',
    # e.g., PCorneille_TR-V-1647-Heraclius0001.txt
    genre = []
    for fn in filenames:
        genre.append('tragedy' if '_TR-' in fn else 'comedy')

    genre = np.array(genre)
    # .strip() removes the trailing newline '\n' from each line in the file
    with open('data/stopwords/french.txt') as f:
        french_stopwords = [l.strip() for l in f]

    vectorizer = CountVectorizer(input='filename', min_df=15, max_df=.95, stop_words=french_stopwords, max_features=3000)
    dtm = vectorizer.fit_transform(filenames_with_path)
    dtm = dtm.toarray()
    # recent scikit-learn releases name this method get_feature_names_out();
    # older releases used get_feature_names()
    vocab = np.array(vectorizer.get_feature_names_out())
    # texts are split into documents of approximately equal length, so we will
    # skip the normalization step and deal directly with counts

Having assembled the corpus, it is easy to calculate the number of play sections | ||
in which 'ennemis' occurs. | ||
|
||
.. ipython:: python

    word = "ennemis"
    tragedy_counts = dtm[genre == 'tragedy', vocab == word]
    comedy_counts = dtm[genre == 'comedy', vocab == word]
    # tragedy percentage
    np.count_nonzero(tragedy_counts) / len(tragedy_counts)
    # comedy percentage
    np.count_nonzero(comedy_counts) / len(comedy_counts)
    # overall percentage
    np.count_nonzero(dtm[:, vocab == word]) / len(dtm)
    # text in which "ennemis" appears the most
    filenames[np.argmax(dtm[:, vocab == word])], np.max(dtm[:, vocab == word])

In our sample, about a third of the tragedy sections feature the word
'ennemis'. Among comedy sections, the word appears in only five percent.
(Recall, however, that in the majority of play sections the word *does not
appear* at all.) While this gives us a rough sense of the relationship between
the word 'ennemis' and genre, we may want to describe the relationship more
precisely. First, we would like to consider the relationship between the word's
frequency (rather than just its presence or absence) and a text's
classification. Second, we want to predict the classification of a section of
a play for which we do not have a classification ready at hand. Logistic
regression accomplishes both of these tasks.
|
||
Like linear regression, logistic regression will happily make predictions based
on aleatory patterns in our data. It is therefore important to make sure we have
some additional basis for believing there might be a correlation between the
frequency of the word 'ennemis' and a genre classification. Our intuition tells
us that the word (particularly in its plural form) does not belong in a comedy
(or at least not in any great frequency), whereas we can imagine a variety of
sentences using the word appearing in a tragedy. Consider, for example, the
section of Racine's *Thebaide* which features six occurrences of the word
(and plenty of 'ennemi' as well):
|
||
::

    Plus qu'à mes ennemis la guerre m'est mortelle,
    Et le courroux du ciel me la rend trop cruelle ;
    Il s'arme contre moi de mon propre dessein,
    Il se sert de mon bras pour me percer le sein.
    La guerre s'allumait, lorsque pour mon supplice,
    Hémon m'abandonna pour servir Polynice ;
    Les deux frères par moi devinrent ennemis,
    Et je devins, Attale, ennemi de mon fils.
    ...
|
||
In quantitative text analysis, a common way to represent a classification is as | ||
a binary outcome, e.g., 0 for comedy or 1 for tragedy. Whereas linear regression | ||
relates some quantity ``x`` to another quantity ``y``, logistic regression | ||
relates a quantity ``x`` to the *probability* of something being a member of one | ||
of two groups, that is, the probability of ``y`` having a value of 1. | ||
|
||
For reasons covered in greater detail at the :ref:`end of this section
<logistic-regression>`, the model expresses the likelihood of a classification
not as a probability (from 0 to 1) but in log `odds
<https://en.wikipedia.org/wiki/Odds>`_. This is not a mysterious transformation.
Indeed, in certain countries (and among individuals involved in
gambling) expressing the likelihood of an event in terms of odds is common.
Moving between probability, odds, and log odds is somewhat tedious but not
difficult. For example, an event that occurs with probability 0.75 has odds of
0.75 / 0.25 = 3 (often expressed 3:1) and log odds of log(3), or about 1.1.
Logistic regression delivers, for any value of ``x``, here the frequency of the
word 'ennemis', the log odds of a play section being from a tragedy. Typically
we immediately convert the log odds into a probability, as the latter is more
familiar.
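
The conversions between probability, odds, and log odds described above can be
sketched in a few lines. The function names ``odds``, ``log_odds``, and
``probability`` are hypothetical helpers introduced here for illustration; only
the standard library is assumed.

```python
import math


def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def log_odds(p):
    """Convert a probability to log odds (the logit)."""
    return math.log(odds(p))


def probability(log_odds_value):
    """Convert log odds back to a probability (the inverse logit)."""
    return 1 / (1 + math.exp(-log_odds_value))


p = 0.75
print(odds(p))                              # 3.0
print(round(log_odds(p), 2))                # 1.1
print(round(probability(log_odds(p)), 2))   # 0.75
```

The last line converts the log odds back to the original probability, which is
the step applied to the output of a fitted logistic regression.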
|
||
.. note:: For very rare or very probable events using odds (and even log
   odds) can be preferable to using probabilities. Consider the
   `Intergovernmental Panel on Climate Change's <https://en.wikipedia.org/wiki/Intergovernmental_Panel_on_Climate_Change>`_
   `guidance on addressing uncertainties <https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note_ar4.pdf>`_.

   ====================== ====================== ============ ============
   Terminology            Likelihood             Odds         Log odds
   ====================== ====================== ============ ============
   Virtually certain      > 99% probability      > 99         > 4.6
   Very likely            > 90% probability      > 9          > 2.2
   Likely                 > 66% probability      > 2          > 0.7
   About as likely as not 33 to 66% probability  0.5 to 2     -0.7 to 0.7
   Unlikely               < 33% probability      < 0.5        < -0.7
   Very unlikely          < 10% probability      < 0.1        < -2.2
   Exceptionally unlikely < 1% probability       < 0.01       < -4.6
   ====================== ====================== ============ ============

   Note that whereas moving from a likelihood of 33% to 66% corresponds to
   moving from 0.5 to 2 on the odds scale, moving from 90% to 99% entails
   moving from 9 to 99 on the odds scale. The odds scale expresses better
   the difference between an event that happens 9 out of 10 times versus an
   event that happens 99 times out of 100.
|
||
First we will fit the logistic regression model using the ``statsmodels`` | ||
package and then, converting from log odds to the more familiar scale of | ||
probability, we will plot this estimated relationship. | ||
|
||
.. ipython:: python

    import statsmodels.api as sm
    wordfreq = dtm[:, vocab == "ennemis"]
    # we need to add an intercept (whose coefficient is related to the
    # probability that a play section will be classified as a tragedy when
    # the predictor is zero). This is done automatically in R and by
    # sklearn's LogisticRegression.
    X = sm.add_constant(wordfreq)
    # code the outcome as 1 (tragedy) or 0 (comedy)
    y = (genre == 'tragedy').astype(int)
    model = sm.GLM(y, X, family=sm.families.Binomial())
    fit = model.fit()
    fit.params

For those accustomed to fitting regression models in R, the following code | ||
produces precisely the same results: | ||
|
||
.. code-block:: r

    data = data.frame(wordfreq = wordfreq, genre = genre == 'tragedy')
    fit = glm(genre ~ wordfreq, data = data, family = binomial(link="logit"))
    coef(fit)
    # note that R is implicitly adding a constant term. We can make this
    # term explicit in our model if we choose (the results should be the same)
    fit = glm(genre ~ 1 + wordfreq, data = data, family = binomial(link="logit"))
    coef(fit)

Using the fitted parameters of the model we can make a prediction for any given
word frequency. For example, the probability that a section in which 'ennemis'
occurs twice is from a tragedy is given by
|
||
.. ipython:: python

    def invlogit(x):
        """Convert from log odds to probability"""
        return 1/(1+np.exp(-x))

    x = 2
    invlogit(fit.params[0] + fit.params[1] * x)

The following code plots the relationship between a section's word frequency and | ||
the model's estimate of the probability of a section being from a tragedy. The | ||
points on the figure mark the observations in the corpus. (The points have been | ||
jittered to improve readability.) | ||
|
||
.. ipython:: python

    import matplotlib.pyplot as plt
    xs = np.arange(wordfreq.min(), wordfreq.max() + 1, 0.1)
    # apply the fitted coefficients to get log odds, then convert to probability
    ys = invlogit(fit.params[0] + fit.params[1] * xs)
    plt.plot(xs, ys, linewidth=2)
    # jitter the outcomes (0 or 1) a bit
    jitter = np.random.random(len(genre)) / 5
    ys_outcomes = np.abs((genre == 'tragedy') - 0.01 - jitter)
    alpha = 0.7
    # use different colors for the different classes
    plt.plot(wordfreq[genre == 'tragedy'], ys_outcomes[genre == 'tragedy'], 'b.', alpha=alpha)
    plt.plot(wordfreq[genre != 'tragedy'], ys_outcomes[genre != 'tragedy'], 'y.', alpha=alpha)
    plt.xlabel("Word frequency")
    plt.ylabel("Predicted probability of play section being a tragedy")
    @suppress
    assert np.max(wordfreq) == 6
    plt.title("Predicting genre by the frequency of 'ennemis'")
    # make some final aesthetic adjustments of the plot boundary
    @savefig plot_logistic_ennemis.png width=7in
    plt.xlim(-0.1, wordfreq.max() + 0.2); plt.tight_layout()

The figure illustrates what the model infers: a section in which 'ennemis'
appears more than three times is, with high probability, from a tragedy.
|
||
As an experiment and an illustration of `cross validation | ||
<https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#K-fold_cross-validation>`_ | ||
(also called out-of-sample validation), consider the task of predicting the | ||
classification of a section of text based on the frequency of 'ennemis' alone. | ||
From the 3,429 play sections in our corpus we will take one third of them at | ||
random and ask the model to predict their classification with the model | ||
fitted on the remaining sections. We will do this three times (once for each | ||
held-out third). The scikit-learn package makes this procedure embarrassingly | ||
easy, provided we use its version of logistic regression, which is designed for | ||
large datasets and differs slightly from the version provided by R and | ||
statsmodels. [#fn_sklearn_logisticregression]_ | ||
|
||
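.. ipython:: python

    # cross_val_score lives in sklearn.model_selection in current scikit-learn
    # releases (the older sklearn.cross_validation module has been removed)
    from sklearn.model_selection import cross_val_score
    from sklearn import linear_model
    clf = linear_model.LogisticRegression()
    # cv=3 requests the three folds described above
    cross_val_score(clf, wordfreq, genre == 'tragedy', cv=3)
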
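
To make the procedure less of a black box, the three-way split described above
can be sketched by hand. This is only an illustration of the idea, using the
standard library: ``three_fold_splits`` is a hypothetical helper, and
scikit-learn's own splitting may differ in details such as stratification or
shuffling.

```python
import random


def three_fold_splits(n_items, seed=0):
    """Yield (train, test) index lists for three-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # assign items to folds at random
    folds = [indices[k::3] for k in range(3)]  # three roughly equal folds
    for k, test in enumerate(folds):
        # train on every fold except the one held out for testing
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


for train, test in three_fold_splits(9):
    print(len(train), len(test))  # prints "6 3" for each of the three folds
```

Each item is held out exactly once, so every prediction is made by a model that
never saw that item during fitting.
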
Since 'ennemis' only appears in 20% of the sections and appears more than once | ||
in only 5% of the sections, the model will only have useful information to work | ||
with in a fraction of the cases presented to it. Nevertheless, it does | ||
considerably better than a baseline of simply picking 'tragedy' every time, which | ||
would be expected to achieve 52% accuracy, as sections from tragedies make up 52% of the sections. | ||
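
The baseline mentioned above is simply the share of the most common class. A
minimal sketch follows; ``majority_baseline_accuracy`` and the toy label list
are hypothetical, constructed only to mirror the 52/48 split described in the
text.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# hypothetical labels with the same class balance as described in the text
toy_labels = ['tragedy'] * 52 + ['comedy'] * 48
print(majority_baseline_accuracy(toy_labels))  # 0.52
```

Any classifier worth using should beat this number on held-out data.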
|
||
Of course, if we give the model access to all the word frequencies in the corpus | ||
(not just 'ennemis') and ask it to make predictions it does much better: | ||
|
||
.. ipython:: python

    from sklearn.model_selection import cross_val_score
    clf = linear_model.LogisticRegression()
    cross_val_score(clf, dtm, genre == 'tragedy', cv=3)

.. note:: Those interested in using a large number of predictors (such as
   a matrix with 3,000 features) should use the implementation of logistic
   regression found in scikit-learn. Unlike the default version provided by
   R or statsmodels, scikit-learn's version includes a `penalty or
   regularization term
   <https://en.wikipedia.org/wiki/Regularization_%28mathematics%29>`_, which
   tends to help prevent `overfitting
   <https://en.wikipedia.org/wiki/Overfitting>`_ that can occur in models using
   a large number of predictors.
|
||
.. _logistic-regression: | ||
|
||
Logistic regression | ||
=================== | ||
|
||
.. note:: Resources for those interested in learning about logistic (and linear)
   regression include Gelman and Hill (2006) :cite:`gelman_data_2006` and
   Bishop (2007) :cite:`bishop_pattern_2007`. Stanford's OpenClassroom also has
   a `series of lectures devoted to logistic regression
   <http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning>`_.
|
||
Linear regression is one way of thinking about the relationship between two | ||
variables. Logistic regression is a linear model as well; it assumes a linear, | ||
additive relationship between the predictors and the *log odds* of a classification. | ||
With a single predictor and an intercept term, the relationship between | ||
a classification and a predictor has the following symbolic expression: | ||
|
||
.. math::

    P(y_i = \mathrm{tragedy}) &= \mathrm{logit}^{-1}(\beta_0 + \beta_1 x_i)\\
    &= \frac{e^{\beta_0 + \beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}}\\
    &= \frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}\\
    &= \sigma(\beta_0 + \beta_1 x_i)

Typically we have more than one observation. Letting :math:`\sigma(x_i\beta)`
stand in for :math:`\frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}`, and coding
:math:`y_i` as 1 for tragedy and 0 for comedy, the `maximum likelihood estimate
<https://en.wikipedia.org/wiki/Maximum_likelihood_estimate>`_ for :math:`\beta`
is the value of :math:`\beta` which maximizes the log
likelihood of the observations:
|
||
.. math::

    \log \prod_{i=1}^n P(y_i \mid x_i, \beta) = \sum_{i=1}^n \left( y_i \log \sigma(x_i \beta) + (1 - y_i) \log (1 - \sigma(x_i \beta)) \right)

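
The log likelihood above can be evaluated directly for any candidate pair of
coefficients. The sketch below uses only the standard library; the function
names and the small toy dataset are hypothetical and are not drawn from the
theatre corpus.

```python
import math


def sigmoid(z):
    """Inverse logit: convert log odds to a probability."""
    return 1 / (1 + math.exp(-z))


def log_likelihood(beta0, beta1, xs, ys):
    """Bernoulli log likelihood of outcomes ys (0 or 1) given predictors xs."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# hypothetical toy data: word counts and 0/1 genre labels
xs = [0, 0, 1, 2, 3, 0, 1, 4]
ys = [0, 0, 0, 1, 1, 1, 0, 1]

# with both coefficients at zero every probability is 0.5,
# so the log likelihood is 8 * log(0.5), about -5.545
print(log_likelihood(0.0, 0.0, xs, ys))
# coefficients that track the data score higher
print(log_likelihood(-1.0, 1.0, xs, ys) > log_likelihood(0.0, 0.0, xs, ys))
```

Fitting the model amounts to searching for the coefficients that make this
quantity as large as possible.
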
While for linear regression there is frequently a closed-form solution for the
maximum, logistic regression lacks a tidy solution. The solution (the log
likelihood is concave and, unless the classes can be perfectly separated, has
a unique maximum) is typically found using `iteratively reweighted least
squares <https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares>`_.
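
For the single-predictor model, the iterative search can be sketched with
Newton's method, which for logistic regression takes the same steps as
iteratively reweighted least squares. This is an illustrative sketch under
stated assumptions, not the routine used by ``statsmodels`` or R:
``fit_logistic`` is a hypothetical helper, the toy data are invented, and no
safeguards (step control, convergence checks, handling of perfectly separated
data) are included.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit intercept b0 and slope b1 by Newton's method; ys are 0 or 1."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log likelihood
        h00 = h01 = h11 = 0.0  # entries of the (negated) Hessian
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = p * (1 - p)    # the "weight" in reweighted least squares
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # solve the 2x2 linear system for the Newton step
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# hypothetical toy data: word counts and 0/1 genre labels
xs = [0, 0, 1, 2, 3, 0, 1, 4]
ys = [0, 0, 0, 1, 1, 1, 0, 1]
b0, b1 = fit_logistic(xs, ys)
print(b0, b1)
```

At the maximum the gradient is zero, which means the fitted probabilities sum
to the number of observed tragedies; checking this is a quick sanity test of
any fitting routine.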
|
||
The solution may be found in Python using ``statsmodels.api.GLM`` or in R using | ||
the built-in ``glm`` function. The two functions should yield identical results. | ||
|
||
.. FOOTNOTES | ||
.. [#fn_sklearn_logisticregression] Scikit-learn's ``LogisticRegression``
   includes a penalty term which helps prevent overfitting, something that is
   a major concern when the number of predictors exceeds the number of
   observations. Those wishing for a logistic regression model that mirrors
   R's ``glm()`` should use ``statsmodels``'s ``GLM``.