diff --git a/source/feature_selection.rst b/source/feature_selection.rst
index efdd31e..b23ffec 100644
--- a/source/feature_selection.rst
+++ b/source/feature_selection.rst
@@ -37,18 +37,6 @@ of novels
 containing works by two authors: Jane Austen and Charlotte Brontë. This
 :ref:`corpus of six novels ` consists of the following text files:
 
-.. ipython:: python
-
-   filenames
-
-.. raw:: html
-   :file: generated/feature_selection_bayesian.txt
-
-
-We will find that among the words that reliably distinguish Austen from Brontë
-are "such", "could", and "any". This tutorial demonstrates how we arrived at
-these words.
-
 .. ipython:: python
    :suppress:
 
@@ -66,6 +54,18 @@ these words.
    CBRONTE_FILENAMES = ['CBronte_Jane.txt', 'CBronte_Professor.txt',
                         'CBronte_Villette.txt']
    filenames = AUSTEN_FILENAMES + CBRONTE_FILENAMES
+.. ipython:: python
+
+   filenames
+
+We will find that among the words that reliably distinguish Austen from Brontë
+are "such", "could", and "any". This tutorial demonstrates how we arrived at
+these words.
+
+.. raw:: html
+   :file: generated/feature_selection_bayesian.txt
+
+
 .. note:: The following features an introduction to the concepts underlying
    feature selection. Those who are working with a very large corpus and are
    familiar with statistics may wish to skip ahead to the section on
@@ -345,13 +345,7 @@ in Brontë's
 novels were much more variable, say, 0.03, 0.04, and 0.66 (0.24 on average). Although
 the averages remain the same, the difference does not seem so pronounced; with only one
 observation (0.66) noticeably greater than we find in Austen, we might reasonably doubt that there is evidence of a systematic difference between
-the authors. [#fnlyon]_
-
-.. [#fnlyon] Unexpected spikes in word use happen all the time. Word usage in a large corpus
-   is notoriously "bursty" (a technical term!) :cite:`church_poisson_1995`.
-   Consider, for example, ten French novels, one of which is set in Lyon.
-   While "Lyon" might appear in all novels, it would appear much (much) more
-   frequently in the novel set in the city.]
+the authors. [#fn_lyon]_
 
 One way of formalizing a comparison of two groups that takes account of the
 variability of word usage comes from Bayesian statistics. To describe our
@@ -640,7 +634,7 @@ This produces
 a useful ordering of characteristic words. Unlikely `frequentist observations within groups.
 This method will also work for small corpora provided useful prior information is available.
 To the extent that we are interested in a close reading of differences of vocabulary use, the Bayesian
-method should be preferred. [#fnunderwood]_
+method should be preferred. [#fn_underwood]_
 
 .. _chi2:
 
@@ -936,7 +930,13 @@ Exercises
 
 .. FOOTNOTES
 
-.. [#fnunderwood] Ted Underwood has written a `blog post discussing some of the
+.. [#fn_lyon] Unexpected spikes in word use happen all the time. Word usage in a large corpus
+   is notoriously *bursty* :cite:`church_poisson_1995`.
+   Consider, for example, ten French novels, one of which is set in Lyon.
+   While "Lyon" might appear in all novels, it would appear much (much) more
+   frequently in the novel set in the city.
+
+.. [#fn_underwood] Ted Underwood has written a `blog post discussing some of the
    drawbacks of using the log likelihood and chi-squared test statistic in the
    context of literary studies `_.]
 
diff --git a/source/topic_model_mallet.rst b/source/topic_model_mallet.rst
index def33d7..5ec8e81 100644
--- a/source/topic_model_mallet.rst
+++ b/source/topic_model_mallet.rst
@@ -128,11 +128,6 @@ documentation in the Python library
 `itertools `_ describes a function called ``grouper`` using
 ``itertools.izip_longest`` that solves our problem.
 
-.. [#fnmapreduce] Those familiar with
-   `MapReduce `_ may recognize the pattern of
-   splitting a dataset into smaller pieces and then (re)ordering them.
-
-
 .. ipython:: python
    :suppress:
 
@@ -483,3 +478,11 @@ to be associated more strongly with Austen's novels than with Brontë's.
 
 .. raw:: html
    :file: generated/topic_model_distinctive_avg_diff.txt
+
+.. FOOTNOTES
+
+.. [#fnmapreduce] Those familiar with
+   `MapReduce `_ may recognize the pattern of
+   splitting a dataset into smaller pieces and then (re)ordering them.
+
+
diff --git a/source/topic_model_visualization.rst b/source/topic_model_visualization.rst
index 892fa23..cd90d51 100644
--- a/source/topic_model_visualization.rst
+++ b/source/topic_model_visualization.rst
@@ -438,6 +438,8 @@ This shows us that a greater diversity of vocabulary items are associated with
 topic 3 (likely many of the French words that appear only in Brontë's *The
 Professor*) than with topic 0.
 
+.. FOOTNOTES
+
 .. [#fnpritchard] The topic model now familiar as LDA was independently
    discovered and published in 2000 by Pritchard et al.
    :cite:`pritchard_inference_2000`.
diff --git a/source/working_with_text.rst b/source/working_with_text.rst
index a647de6..6ddc628 100644
--- a/source/working_with_text.rst
+++ b/source/working_with_text.rst
@@ -5,7 +5,7 @@ Working with text
 ===================
 
-.. note:: This tutorial is also available in download for interactive use
+.. note:: This tutorial is available for interactive use
    with `IPython Notebook `_:
    :download:`Working with text.ipynb `.
 
 Creating a document-term matrix
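For reviewers: the ``topic_model_mallet.rst`` hunk above refers to the ``grouper`` recipe from the itertools documentation, which splits a sequence into fixed-length chunks. A minimal sketch of that recipe follows, using Python 3's ``zip_longest`` (the tutorial's ``itertools.izip_longest`` is the Python 2 name for the same function); the example data is illustrative, not taken from the corpus.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Split ``iterable`` into tuples of length ``n``, padding the
    last tuple with ``fillvalue`` if it comes up short."""
    # Reuse the *same* iterator n times; zip_longest then pulls n
    # consecutive items per output tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Seven items in groups of three; the last group is padded with None.
print(list(grouper("ABCDEFG", 3)))
# → [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', None, None)]
```

The same pattern applies when splitting a novel's word list into equal-sized passages before passing them to MALLET, as the tutorial does.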