.. _classification-machine-learning:

===========================================================
Classification, Machine Learning, and Logistic Regression
===========================================================

Previous tutorials have illustrated how probabilistic topic models can be used
to navigate a large corpus. As general purpose tools for identifying recurrent
themes in a corpus, topic models and non-negative matrix factorization are
useful. They perform better than methods previously used for similar
purposes, such as principal component analysis (PCA) and latent semantic
analysis (LSA). For tasks such as classifying texts into a known set of categories, however,
there exist methods that are better suited to the problem. One family of such methods
goes under the heading of neural networks (or, more recently, "deep learning").
An essential conceptual and practical building block for these methods is
logistic regression, which we will review briefly in this tutorial.

.. note:: Discussion of the role of logistic regression in neural networks may
   be found in section 5.1 of Bishop (2007) :cite:`bishop_pattern_2007`.

Predicting genre classifications
================================

The bag-of-words model is a horrible model of a text. Its failure to distinguish
word order ('the cat ate the fish' from 'the fish ate the cat') is the least of
its failings. In most cases, knowing the frequency with which a word occurs in
a text tells us very little. Without additional context it is
difficult to know how to interpret a word's frequency. For example, the word
'heart' might occur in a discussion of courtly love, of physical exercise, or in
a cookbook (e.g., "heart of palm"). And even when a word seems to have a single
interpretation, its meaning may depend on words occurring around it.

Nevertheless, sometimes the frequency of words appears to be correlated with
useful information, such as pre-existing classifications (or classifications in
which we happen to believe). Consider the word "ennemis" ("enemies") in the
context of a corpus of :ref:`French classical theatre <datasets>`. This corpus
includes only plays classified as tragedy or comedy. The word "ennemis" is not,
at first glance, a word particularly troubled by problems of polysemy.
Considered as an indicator of whether or not a play is a tragedy or a comedy,
the frequency of "ennemis" seems to be a reliable guide; the word tends to occur
more often in tragedies.

The first way we can verify this is simply to calculate the percentage of plays
classified as tragedy in which the word 'ennemis' occurs and compare that
percentage with the corresponding percentage for comedies. As usual, in order to
have a better sense of the variability of language in French classical theatre,
we have split the plays into approximately 1,000-word sections.
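
The corpus used below has already been divided into such sections. For readers
preparing their own corpus, the following sketch shows one way a play might be
split into roughly 1,000-word chunks (the helper ``split_text`` is illustrative
and not part of the tutorial's own preprocessing):

.. code-block:: python

    def split_text(text, n_words=1000):
        """Split a string into chunks of approximately n_words words."""
        words = text.split()
        return [' '.join(words[i:i + n_words])
                for i in range(0, len(words), n_words)]

    # for example, applied to the raw text of a single play:
    # sections = split_text(open(path_to_play).read())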

.. ipython:: python

    import os
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    data_dir = 'data/french-tragedies-and-comedies-split/'
    filenames = np.array(os.listdir(data_dir))
    filenames_with_path = [os.path.join(data_dir, fn) for fn in filenames]
    # tragedies and comedies are coded with 'TR' or 'CO',
    # e.g., PCorneille_TR-V-1647-Heraclius0001.txt
    genre = []
    for fn in filenames:
        genre.append('tragedy' if '_TR-' in fn else 'comedy')
    genre = np.array(genre)
    # .strip() removes the trailing newline '\n' from each line in the file
    french_stopwords = [l.strip() for l in open('data/stopwords/french.txt')]
    vectorizer = CountVectorizer(input='filename', min_df=15, max_df=.95, stop_words=french_stopwords, max_features=3000)
    dtm = vectorizer.fit_transform(filenames_with_path)
    dtm = dtm.toarray()
    vocab = np.array(vectorizer.get_feature_names())
    # texts are split into documents of approximately equal length, so we will
    # skip the normalization step and deal directly with counts

Having assembled the corpus, it is easy to calculate the number of play sections
in which 'ennemis' occurs.

.. ipython:: python

    word = "ennemis"
    tragedy_counts = dtm[genre == 'tragedy', vocab == word]
    comedy_counts = dtm[genre == 'comedy', vocab == word]
    # tragedy percentage
    np.count_nonzero(tragedy_counts) / len(tragedy_counts)
    # comedy percentage
    np.count_nonzero(comedy_counts) / len(comedy_counts)
    # overall percentage
    np.count_nonzero(dtm[:, vocab == word]) / len(dtm)
    # text in which "ennemis" appears the most
    filenames[np.argmax(dtm[:, vocab == word])], np.max(dtm[:, vocab == word])

In our sample, if a play section is from a tragedy, it features the word 'ennemis'
about a third of the time. Among comedy sections, the word appears in only five
percent of them. (Recall, however,
that in the majority of play sections the word *does not appear* at all.) While this
gives us a rough sense of the relationship between the word 'ennemis' and genre,
we may want to describe the relationship more precisely. First, we would like to
consider the relationship between the word's frequency (rather than just its
presence or absence) and a text's classification. Second, we want to
predict the classification of a section of a play for which we do not have
a classification ready at hand. Logistic regression accomplishes both of these
tasks.

Like linear regression, logistic regression will happily make predictions based
on aleatory patterns in our data. It is therefore important to make sure we have
some additional basis for believing there might be a correlation between the
frequency of the word 'ennemis' and a genre classification. Our intuition tells
us that the word (particularly in its plural form) does not belong in a comedy
(or at least not in any great frequency), whereas we can imagine a variety of
sentences using the word appearing in a tragedy. Consider, for example, the
section of Racine's *Thebaide* which features the six occurrences of the word
(and plenty of 'ennemi' as well):

::

    Plus qu'à mes ennemis la guerre m'est mortelle,
    Et le courroux du ciel me la rend trop cruelle ;
    Il s'arme contre moi de mon propre dessein,
    Il se sert de mon bras pour me percer le sein.
    La guerre s'allumait, lorsque pour mon supplice,
    Hémon m'abandonna pour servir Polynice ;
    Les deux frères par moi devinrent ennemis,
    Et je devins, Attale, ennemi de mon fils.
    ...

In quantitative text analysis, a common way to represent a classification is as
a binary outcome, e.g., 0 for comedy or 1 for tragedy. Whereas linear regression
relates some quantity ``x`` to another quantity ``y``, logistic regression
relates a quantity ``x`` to the *probability* of something being a member of one
of two groups, that is, the probability of ``y`` having a value of 1.
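
Concretely, the genre labels assembled above can be encoded as such a binary
outcome with NumPy (the array ``y`` is introduced here only for illustration;
the code below passes the boolean comparison ``genre == 'tragedy'`` directly):

.. code-block:: python

    y = (genre == 'tragedy').astype(int)  # 1 for tragedy, 0 for comedy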

For reasons covered in greater detail at the :ref:`end of this section
<logistic-regression>`, the probability of classification is expressed not in
terms of probability (from 0 to 1) but in log `odds
<https://en.wikipedia.org/wiki/Odds>`_. This is not a mysterious transformation.
Indeed, in certain countries (and among individuals involved in
gambling) expressing the likelihood of an event in terms of odds is common.
Moving between probability, odds, and log odds is somewhat tedious but not
difficult---e.g., an event occurring with probability 0.75 occurs with odds
3 (often expressed 3:1) and with log odds of about 1.1. Logistic regression delivers, for
any value of ``x``, here the frequency of the word 'ennemis', the log odds of
a play section being from a tragedy. Typically we immediately convert the log
odds into probability as the latter is more familiar.
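
The conversions themselves are easy to write down. The helper functions below
are purely illustrative (they are not used elsewhere in this tutorial); they
only make the arithmetic of the 0.75 example concrete:

.. code-block:: python

    import numpy as np

    def prob_to_odds(p):
        """Probability -> odds, e.g., 0.75 -> 3."""
        return p / (1 - p)

    def prob_to_logodds(p):
        """Probability -> log odds, e.g., 0.75 -> roughly 1.1."""
        return np.log(prob_to_odds(p))

    def logodds_to_prob(a):
        """Log odds -> probability (the inverse logit)."""
        return 1 / (1 + np.exp(-a))

    prob_to_odds(0.75)       # 3.0
    prob_to_logodds(0.75)    # approximately 1.1
    logodds_to_prob(1.0986)  # approximately 0.75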

.. note:: For very rare or very probable events using odds (and even log
   odds) can be preferable to using probabilities. Consider the
   `Intergovernmental Panel on Climate Change's <https://en.wikipedia.org/wiki/Intergovernmental_Panel_on_Climate_Change>`_
   `guidance on addressing uncertainties <https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note_ar4.pdf>`_.

====================== ====================== ============ ============
Terminology            Likelihood             Odds         Log odds
====================== ====================== ============ ============
Virtually certain      > 99% probability      > 99         > 4.6
Very likely            > 90% probability      > 9          > 2.2
Likely                 > 66% probability      > 2          > 0.7
About as likely as not 33 to 66% probability  0.5 to 2     -0.7 to 0.7
Unlikely               < 33% probability      < 0.5        < -0.7
Very unlikely          < 10% probability      < 0.1        < -2.2
Exceptionally unlikely < 1% probability       < 0.01       < -4.6
====================== ====================== ============ ============

Note that whereas moving from a likelihood of 33% to 66% corresponds to
moving from 0.5 to 2 on the odds scale, moving from 90% to 99% entails
moving from 9 to 99 on the odds scale. The odds scale better expresses
the difference between an event that happens 9 times out of 10 and an
event that happens 99 times out of 100.

First we will fit the logistic regression model using the ``statsmodels``
package and then, converting from log odds to the more familiar scale of
probability, we will plot this estimated relationship.

.. ipython:: python

    import statsmodels.api as sm
    wordfreq = dtm[:, vocab == "ennemis"]
    # we need to add an intercept (whose coefficient is related to the
    # probability that a play section will be classified as a tragedy
    # when the predictor is zero). This is done automatically in R and
    # by sklearn's LogisticRegression.
    X = sm.add_constant(wordfreq)
    model = sm.GLM(genre == 'tragedy', X, family=sm.families.Binomial())
    fit = model.fit()
    fit.params

For those accustomed to fitting regression models in R, the following code
produces precisely the same results:

.. code-block:: r

    data = data.frame(wordfreq = wordfreq, genre = genre == 'tragedy')
    fit = glm(genre ~ wordfreq, data = data, family = binomial(link="logit"))
    coef(fit)
    # note that R implicitly adds a constant (intercept) term. We can make this
    # term explicit in our model if we choose (the results should be the same)
    fit = glm(genre ~ 1 + wordfreq, data = data, family = binomial(link="logit"))
    coef(fit)

Using the fitted parameters of the model we can make a prediction for any given
word frequency. For example, the probability that a section in which 'ennemis'
occurs twice is a tragedy is given by:

.. ipython:: python

    def invlogit(x):
        """Convert from log odds to probability"""
        return 1/(1+np.exp(-x))

    x = 2
    invlogit(fit.params[0] + fit.params[1] * x)

The following code plots the relationship between a section's word frequency and
the model's estimate of the probability of a section being from a tragedy. The
points on the figure mark the observations in the corpus. (The points have been
jittered to improve readability.)

.. ipython:: python

    import matplotlib.pyplot as plt
    xs = np.arange(min(wordfreq), max(wordfreq) + 1, 0.1)
    # evaluate the fitted model along the range of observed word frequencies
    ys = invlogit(fit.params[0] + fit.params[1] * xs)
    plt.plot(xs, ys, linewidth=2)
    # jitter the outcomes (0 or 1) a bit to improve readability
    jitter = np.random.random(len(genre)) / 5
    ys_outcomes = np.abs((genre == 'tragedy') - 0.01 - jitter)
    alpha = 0.7
    # use different colors for the different classes
    plt.plot(wordfreq[genre == 'tragedy'], ys_outcomes[genre == 'tragedy'], 'b.', alpha=alpha)
    plt.plot(wordfreq[genre != 'tragedy'], ys_outcomes[genre != 'tragedy'], 'y.', alpha=alpha)
    plt.xlabel("Word frequency")
    plt.ylabel("Predicted probability of play section being a tragedy")
    @suppress
    assert np.max(wordfreq) == 6
    plt.title("Predicting genre by the frequency of 'ennemis'")
    # make some final aesthetic adjustments to the plot boundary
    @savefig plot_logistic_ennemis.png width=7in
    plt.xlim(-0.1, max(wordfreq) + 0.2); plt.tight_layout()

The figure illustrates what the model infers: if 'ennemis' appears more than
three times in a section, the model assigns a high probability that the section
is from a tragedy.
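
To read the fitted curve off at the observed counts, we can evaluate the model
at every frequency from zero up to the maximum of six. This loop relies only on
the ``fit`` and ``invlogit`` objects defined above:

.. code-block:: python

    for count in range(7):
        p = invlogit(fit.params[0] + fit.params[1] * count)
        print("'ennemis' appears {} time(s): P(tragedy) = {:.2f}".format(count, p))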

As an experiment and an illustration of `cross validation
<https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#K-fold_cross-validation>`_
(also called out-of-sample validation), consider the task of predicting the
classification of a section of text based on the frequency of 'ennemis' alone.
From the 3,429 play sections in our corpus we will hold out one third at
random and ask a model fitted on the remaining sections to predict the
classifications of the held-out sections. We will do this three times (once for each
held-out third). The scikit-learn package makes this procedure embarrassingly
easy, provided we use its version of logistic regression, which is designed for
large datasets and differs slightly from the version provided by R and
statsmodels. [#fn_sklearn_logisticregression]_

.. ipython:: python

    from sklearn import cross_validation
    from sklearn import linear_model
    clf = linear_model.LogisticRegression()
    cross_validation.cross_val_score(clf, wordfreq, genre == 'tragedy')

Since 'ennemis' only appears in 20% of the sections and appears more than once
in only 5% of the sections, the model will only have useful information to work
with in a fraction of the cases presented to it. Nevertheless, it does
considerably better than a baseline of simply picking 'tragedy' every time, which
would be expected to achieve 52% accuracy, as sections from tragedies make up 52% of the sections.
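
To make explicit what ``cross_val_score`` is doing behind the scenes, here is
a hand-rolled sketch of the same three-fold procedure. It uses only NumPy and
``LogisticRegression``; the fold construction via a random permutation is our
own simplification rather than scikit-learn's splitting machinery:

.. code-block:: python

    import numpy as np
    from sklearn import linear_model

    y = (genre == 'tragedy')
    folds = np.array_split(np.random.permutation(len(y)), 3)

    scores = []
    for k in range(3):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(3) if j != k])
        clf = linear_model.LogisticRegression()
        clf.fit(wordfreq[train_idx], y[train_idx])
        # proportion of held-out sections classified correctly
        scores.append(clf.score(wordfreq[test_idx], y[test_idx]))

    scores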

Of course, if we give the model access to all the word frequencies in the corpus
(not just 'ennemis') and ask it to make predictions, it does much better:

.. ipython:: python

    clf = linear_model.LogisticRegression()
    cross_validation.cross_val_score(clf, dtm, genre == 'tragedy')

.. note:: Those interested in using a large number of predictors---such as
   a matrix with 3,000 features---should use the implementation of logistic
   regression found in scikit-learn. Unlike the default version provided by
   R or statsmodels, scikit-learn's version includes a `penalty or
   regularization term
   <https://en.wikipedia.org/wiki/Regularization_%28mathematics%29>`_, which
   tends to help prevent `overfitting
   <https://en.wikipedia.org/wiki/Overfitting>`_ that can occur in models using
   a large number of predictors.
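
In scikit-learn the strength of this penalty is controlled by the parameter
``C``, the inverse of the regularization strength (smaller values mean stronger
regularization). A brief sketch of comparing a few settings follows; the
particular values of ``C`` are arbitrary:

.. code-block:: python

    from sklearn import cross_validation, linear_model

    for C in (0.01, 1.0, 100.0):
        clf = linear_model.LogisticRegression(C=C)
        scores = cross_validation.cross_val_score(clf, dtm, genre == 'tragedy')
        print("C = {}: mean accuracy = {:.3f}".format(C, scores.mean()))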

.. _logistic-regression:

Logistic regression
===================

.. note:: Resources for those interested in learning about logistic (and linear)
regression include Gelman and Hill (2006) :cite:`gelman_data_2006` and
Bishop (2007) :cite:`bishop_pattern_2007`. Stanford's OpenClassroom also has
a `series of lectures devoted to logistic regression
<http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning>`_.

Linear regression is one way of thinking about the relationship between two
variables. Logistic regression is a linear model as well; it assumes a linear,
additive relationship between the predictors and the *log odds* of a classification.
With a single predictor and an intercept term, the relationship between
a classification and a predictor has the following symbolic expression:

.. math::

   P(y_i = \mathrm{tragedy}) &= \mathrm{logit}^{-1}(\beta_0 + \beta_1 x_i)\\
   &= \frac{e^{\beta_0 + \beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}}\\
   &= \frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}\\
   &= \sigma(\beta_0 + \beta_1 x_i)

Typically we have more than one observation. Letting :math:`\sigma(x_i\beta)`
stand in for :math:`\frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}` the `maximum
likelihood estimate
<https://en.wikipedia.org/wiki/Maximum_likelihood_estimate>`_ for :math:`\beta`
is the value of :math:`\beta` which maximizes the log
likelihood of the observations:

.. math::

   \log \prod_{i=1}^n P(y_i \mid x_i, \beta) = \sum_{i=1}^n \left( y_i \log \sigma(x_i \beta) + (1 - y_i) \log (1 - \sigma(x_i \beta)) \right)

While for linear regression there is frequently a closed-form solution for the
maximum, logistic regression lacks a tidy solution. The solution (there is
indeed a unique maximum) is typically found using `iteratively reweighted least
squares <https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares>`_.

The solution may be found in Python using ``statsmodels.api.GLM`` or in R using
the built-in ``glm`` function. The two functions should yield identical results.
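
As a check on the expression above, the maximum can also be found by handing
the negative log likelihood to a general-purpose optimizer. The sketch below
assumes SciPy is available and that ``X`` and ``genre`` from earlier are still
in scope; it only illustrates the objective being maximized and is not
a recommended fitting procedure:

.. code-block:: python

    import numpy as np
    from scipy.optimize import minimize

    y = (genre == 'tragedy').astype(float)

    def neg_log_likelihood(beta):
        """Negative Bernoulli log likelihood for the logistic model."""
        p = 1 / (1 + np.exp(-X.dot(beta)))
        p = np.clip(p, 1e-10, 1 - 1e-10)  # guard against log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]))
    result.x  # should be close to fit.params from statsmodels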

.. FOOTNOTES
.. [#fn_sklearn_logisticregression] Scikit-learn's ``LogisticRegression``
   includes a penalty term which helps prevent overfitting, something that is
   a major concern when the number of predictors exceeds the number of
   observations. Those wishing for a logistic regression model that mirrors
   R's ``glm()`` should use ``statsmodels``'s ``GLM``.
