Fixes minor errors and footnote placement
Allen Riddell committed Feb 15, 2014
1 parent 4978af1 commit 9eaf1d3
Showing 4 changed files with 32 additions and 27 deletions.
42 changes: 21 additions & 21 deletions source/feature_selection.rst
@@ -37,18 +37,6 @@ of novels containing works by two authors: Jane Austen and Charlotte Brontë.
This :ref:`corpus of six novels <datasets>` consists of the following text
files:

-.. ipython:: python
-filenames
-.. raw:: html
-:file: generated/feature_selection_bayesian.txt
-
-
-We will find that among the words that reliably distinguish Austen from Brontë
-are "such", "could", and "any". This tutorial demonstrates how we arrived at
-these words.
-
.. ipython:: python
:suppress:
@@ -66,6 +54,18 @@ these words.
CBRONTE_FILENAMES = ['CBronte_Jane.txt', 'CBronte_Professor.txt', 'CBronte_Villette.txt']
filenames = AUSTEN_FILENAMES + CBRONTE_FILENAMES
+.. ipython:: python
+filenames
+We will find that among the words that reliably distinguish Austen from Brontë
+are "such", "could", and "any". This tutorial demonstrates how we arrived at
+these words.
+
+.. raw:: html
+:file: generated/feature_selection_bayesian.txt
+
+
.. note:: The following features an introduction to the concepts underlying
feature selection. Those who are working with a very large corpus and are
familiar with statistics may wish to skip ahead to the section on
@@ -345,13 +345,7 @@ in Brontë's novels were much more variable, say, 0.03, 0.04, and 0.66 (0.24 on
average). Although the averages remain the same, the difference does not seem
so pronounced; with only one observation (0.66) noticeably greater than we find in Austen, we
might reasonably doubt that there is evidence of a systematic difference between
-the authors. [#fnlyon]_
-
-.. [#fnlyon] Unexpected spikes in word use happen all the time. Word usage in a large corpus
-is notoriously "bursty" (a technical term!) :cite:`church_poisson_1995`.
-Consider, for example, ten French novels, one of which is set in Lyon.
-While "Lyon" might appear in all novels, it would appear much (much) more
-frequently in the novel set in the city.]
+the authors. [#fn_lyon]_

One way of formalizing a comparison of two groups that takes account of the
variability of word usage comes from Bayesian statistics. To describe our
@@ -640,7 +634,7 @@ This produces a useful ordering of characteristic words. Unlikely `frequentist
observations within groups. This method will also work for small corpora
provided useful prior information is available. To the extent that we are
interested in a close reading of differences of vocabulary use, the Bayesian
-method should be preferred. [#fnunderwood]_
+method should be preferred. [#fn_underwood]_

.. _chi2:

@@ -936,7 +930,13 @@ Exercises

.. FOOTNOTES
-.. [#fnunderwood] Ted Underwood has written a `blog post discussing some of the
+.. [#fn_lyon] Unexpected spikes in word use happen all the time. Word usage in a large corpus
+is notoriously *bursty* :cite:`church_poisson_1995`.
+Consider, for example, ten French novels, one of which is set in Lyon.
+While "Lyon" might appear in all novels, it would appear much (much) more
+frequently in the novel set in the city.]
+.. [#fn_underwood] Ted Underwood has written a `blog post discussing some of the
drawbacks of using the log likelihood and chi-squared test statistic in the
context of literary studies <http://tedunderwood.com/2011/11/09/identifying-the-terms-that-characterize-an-author-or-genre-why-dunnings-may-not-be-the-best-method/>`_.]
13 changes: 8 additions & 5 deletions source/topic_model_mallet.rst
@@ -128,11 +128,6 @@ documentation in the Python library `itertools
<http://docs.python.org/dev/library/itertools.html>`_ describes a function
called ``grouper`` using ``itertools.izip_longest`` that solves our problem.

-.. [#fnmapreduce] Those familiar with
-`MapReduce <https://en.wikipedia.org/wiki/MapReduce>`_ may recognize the pattern of
-splitting a dataset into smaller pieces and then (re)ordering them.
.. ipython:: python
:suppress:
@@ -483,3 +478,11 @@ to be associated more strongly with Austen's novels than with Brontë's.
.. raw:: html
:file: generated/topic_model_distinctive_avg_diff.txt

+.. FOOTNOTES
+.. [#fnmapreduce] Those familiar with
+`MapReduce <https://en.wikipedia.org/wiki/MapReduce>`_ may recognize the pattern of
+splitting a dataset into smaller pieces and then (re)ordering them.
2 changes: 2 additions & 0 deletions source/topic_model_visualization.rst
@@ -438,6 +438,8 @@ This shows us that a greater diversity of vocabulary items are associated with
topic 3 (likely many of the French words that appear only in Brontë's *The
Professor*) than with topic 0.

+.. FOOTNOTES
.. [#fnpritchard] The topic model now familiar as LDA was independently
discovered and published in 2000 by Pritchard et al.
:cite:`pritchard_inference_2000`.
2 changes: 1 addition & 1 deletion source/working_with_text.rst
@@ -5,7 +5,7 @@
Working with text
===================

-.. note:: This tutorial is also available in download for interactive use
+.. note:: This tutorial is available for interactive use
with `IPython Notebook <http://ipython.org/notebook.html>`_: :download:`Working with text.ipynb <Working with text.ipynb>`.

Creating a document-term matrix