.. _visualizing-trends:

====================
Visualizing trends
====================

.. ipython:: python
   :suppress:

   import numpy as np; np.set_printoptions(precision=3)

Texts often have a sequence. Newspapers and periodicals have volumes.
Novels have chapters. Personal diaries have dated entries. Visualizations of
topic models may benefit from incorporating information about where a text falls
in a sequence.

As a motivating example, consider Victor Hugo's *Les Misérables*. Over 500,000
words long, the book counts as a lengthy text by any
standard. [#fn_les_mis]_ The novel comes in five volumes ("Fantine", "Cosette",
"Marius", "The Idyll in the Rue Plumet and the Epic in the Rue St. Denis", and
"Jean Valjean"). And within each volume we have a sequence of chapters. (And
within each chapter we have a sequence of paragraphs, ...). In this section we
will address how to visualize topic shares in sequence.

To whet your appetite, consider the rise and fall of a topic associated with
revolutionary activity in *Les Misérables*:

.. figure:: _static/plot_topics_over_time_series_les_misérables.png
   :scale: 60 %
   :alt: Les Misérables, Topic #35 ("barricade enjolras ...")

(`Enjolras <https://en.wikipedia.org/wiki/Enjolras>`_ is the leader of the
revolutionary *Les Amis de l'ABC*.)

.. note:: Probabilistic models such as topic models often benefit from
   incorporating information about where an individual text falls in a larger
   sequence of texts :cite:`blei_dynamic_2006`.


Plotting trends
===============

As always, we first need to fit a topic model to the corpus. As MALLET has no
built-in French stopword list we need to provide one. We will use the `French
stopword list
<http://svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt>`_
from the Snowball stemmer package. Additionally, because we are dealing with
non-English text we need to use an alternate regular expression for
tokenization. MALLET helpfully suggests ``--token-regex '[\p{L}\p{M}]+'``.

.. code-block:: bash

   mallet-2.0.7/bin/mallet import-dir --input data/hugo-les-misérables-split/ --output /tmp/topic-input-hugo.mallet --keep-sequence --remove-stopwords --stoplist-file data/stopwords/french.txt --token-regex '[\p{L}\p{M}]+'
   mallet-2.0.7/bin/mallet train-topics --input /tmp/topic-input-hugo.mallet --num-topics 50 --output-doc-topics /tmp/doc-topics-hugo.txt --output-topic-keys /tmp/topic-keys-hugo.txt --word-topic-counts-file /tmp/word-topic-hugo.txt

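As a quick illustration of what this token pattern keeps, here is a rough
Python equivalent. This is a sketch that assumes the third-party ``regex``
module is available (MALLET itself uses Java regular expressions; the
standard-library ``re`` module does not understand ``\p{L}``):

.. code-block:: python

   import regex  # third-party module; not part of this tutorial's pipeline

   # keep runs of letters (\p{L}) and combining marks (\p{M}),
   # dropping punctuation and digits
   regex.findall(r'[\p{L}\p{M}]+', "Jean Valjean s'était évadé.")
   # -> ['Jean', 'Valjean', 's', 'était', 'évadé']
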
.. ipython:: python
   :suppress:

   import os
   import shutil
   import subprocess

   N_TOPICS = 50
   MALLET_INPUT = 'source/cache/topic-input-hugo-les-misérables-split.mallet'
   MALLET_TOPICS = 'source/cache/doc-topic-hugo-les-misérables-{}topics.txt'.format(N_TOPICS)
   MALLET_WORD_TOPIC_COUNTS = 'source/cache/doc-topic-hugo-les-misérables-{}topics-word-topic.txt'.format(N_TOPICS)
   MALLET_KEYS = 'source/cache/doc-topic-hugo-les-misérables-{}topics-keys.txt'.format(N_TOPICS)

   if not os.path.exists(MALLET_INPUT):
       subprocess.check_call("""mallet-2.0.7/bin/mallet import-dir --input data/hugo-les-misérables-split/ --output {} --keep-sequence --remove-stopwords --stoplist-file data/stopwords/french.txt --token-regex '[\p{{L}}\p{{M}}]+'""".format(MALLET_INPUT), shell=True)

.. ipython:: python
   :suppress:

   shutil.copy(MALLET_INPUT, '/tmp/topic-input-hugo.mallet')

   if not os.path.exists(MALLET_TOPICS):
       subprocess.check_call('mallet-2.0.7/bin/mallet train-topics --input /tmp/topic-input-hugo.mallet --num-iterations 5000 --num-topics {} --output-doc-topics {} --output-topic-keys {} --word-topic-counts-file {} --random-seed 1'.format(N_TOPICS, MALLET_TOPICS, MALLET_KEYS, MALLET_WORD_TOPIC_COUNTS), shell=True)

   shutil.copy(MALLET_TOPICS, '/tmp/doc-topics-hugo.txt')
   shutil.copy(MALLET_KEYS, '/tmp/topic-keys-hugo.txt')
   shutil.copy(MALLET_WORD_TOPIC_COUNTS, '/tmp/word-topic-hugo.txt')

As usual, we post-process the MALLET output in order to get a matrix of topic
proportions. Each row of the matrix holds the topic proportions associated with
a document.

.. ipython:: python

   import numpy as np
   import itertools
   import operator
   import os

   def grouper(n, iterable, fillvalue=None):
       "Collect data into fixed-length chunks or blocks"
       # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
       args = [iter(iterable)] * n
       return itertools.zip_longest(*args, fillvalue=fillvalue)

   doctopic_triples = []

   with open("/tmp/doc-topics-hugo.txt") as f:
       f.readline()  # read one line in order to skip the header
       for line in f:
           docnum, docname, *values = line.rstrip().split('\t')
           for topic, share in grouper(2, values):
               triple = (docname, int(topic), float(share))
               doctopic_triples.append(triple)

   # sort the triples by document name, then by topic number
   doctopic_triples.sort(key=operator.itemgetter(0, 1))
   docnames = sorted(set([triple[0] for triple in doctopic_triples]))
   docnames_base = np.array([os.path.splitext(os.path.basename(n))[0] for n in docnames])
   num_topics = len(doctopic_triples) // len(docnames)
   doctopic = np.empty((len(docnames), num_topics))

   for i, (doc_name, triples) in enumerate(itertools.groupby(doctopic_triples, key=operator.itemgetter(0))):
       doctopic[i, :] = np.array([share for _, _, share in triples])

   docnames = docnames_base

   # get the topic words
   with open('/tmp/topic-keys-hugo.txt') as f:
       topic_keys_lines = f.readlines()

   topic_words = []

   for line in topic_keys_lines:
       _, _, words = line.split('\t')  # tab-separated
       words = words.rstrip().split(' ')  # remove the trailing '\n'
       topic_words.append(words)

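Before hunting for interesting topics, a quick sanity check is worthwhile.
Each row of ``doctopic`` records a distribution over topics for one segment
of the novel, so every row should sum to (approximately) one. A minimal
check, sketched here rather than run as part of the tutorial:

.. code-block:: python

   # one row per document, one column per topic
   doctopic.shape

   # rows are topic proportions and should sum to ~1
   # (loose tolerance allows for rounding in the MALLET output file)
   np.allclose(doctopic.sum(axis=1), 1, atol=1e-3)
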
Among the fifty topics there is one topic (#35 using 0-based indexing) that
jumps out as characteristic of events towards the close of the novel. The words
most strongly connected with this topic include "barricade", "fusil", and
"cartouches" ("barricade", "rifle", and "cartridges").

.. ipython:: python

   ','.join(topic_words[35])

Because the documents are ordered in a sequence, we can plot the fate, so to
speak, of this topic over time with the following lines of code:

.. ipython:: python

   import matplotlib.pyplot as plt

   series = doctopic[:, 35]

   @savefig plot_topics_over_time_series_simple.png width=7in
   plt.plot(series, '.')  # '.' specifies the type of mark to use on the graph

While this visualization communicates the essential information about the
prevalence of a topic in the corpus, it is not perfect. We can improve it. It
would, for instance, be useful to include an indication of where the various
volumes start and end. Another enhancement would add some kind of "smoothing" to
the time series in order to better communicate the underlying trend.

A rolling average of the topic shares turns out to be a useful form of
smoothing in this case. We are interested in the prevalence of a topic over
time; whether a topic disappears completely in one 500-word chunk of text
(only to reappear in the next) does not interest us. We want to visualize the
underlying trend, that is, we need some model or heuristic capable of
capturing the idea that the topic (or any similar feature) has an underlying
propensity to appear at varying points of the novel and that, while this
propensity may change over time, it does not fluctuate wildly. [#fn_lowess]_

Recall that a rolling or moving average of a time series associates with each
point in the series the average of some fixed number of previous
observations (including the current observation). This fixed number of
observations is often
called a "window". The idea of a rolling mean (conveniently implemented in
``pandas.rolling_mean()``) is effectively communicated visually:

.. ipython:: python

   import pandas as pd

   z = np.array([ 3.,  2.,  3.,  6.,  2.,  3.,  1.,  3.,  8.,  3.,  5.,
                  8.,  7.,  8.,  7.,  6.,  8.,  7.,  7.,  5.,  8.,  6.,
                 11.,  6.,  7.,  8.,  8.,  6.,  9., 15., 13., 10.,  9.])
   pd.rolling_mean(z, 3)

.. ipython:: python

   plt.plot(z, '.', alpha=0.5)

   @savefig plot_topics_over_time_rolling_mean.png width=5in
   plt.plot(pd.rolling_mean(z, 5), '-', linewidth=2)

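Note that ``pd.rolling_mean()`` leaves the first ``window - 1`` entries
undefined (``NaN``), since a full window of observations is not yet
available there. If the lag this introduces is a concern, the function also
accepts a ``center`` keyword (an assumption here: that the keyword is present
in the pandas release being used) which aligns each average with the middle
of its window rather than the right edge:

.. code-block:: python

   # right-aligned (the default): the first two values are NaN
   pd.rolling_mean(z, 3)[:5]

   # centered: the NaN padding is split between the two ends of the series
   pd.rolling_mean(z, 3, center=True)[:5]
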
After making these two improvements---marking the volume boundaries and adding
a trend line based on a rolling average---the time series for our topic does
a better job of orienting us in the novel and communicating the points in the
novel where the topic appears:

.. ipython:: python

   import pandas as pd

   # the values on the x-axis (xs) are simply a sequence of integers
   # corresponding to the texts (also the rows in the document topic matrix)
   xs = np.arange(len(series))
   series_smooth = pd.rolling_mean(series, 15)  # 15 seems to work well here

   # now we need to calculate at what index each volume starts;
   # there are many ways to do this, two methods are shown below

   # method #1
   volume_names = ["tome-1-fantine", "tome-2-cosette", "tome-3-marius", "tome-4", "tome-5-jean-valjean"]
   volume_indexes = []

   for volname in volume_names:
       for i, docname in enumerate(docnames):
           if volname in docname:
               volume_indexes.append(i)
               break

   @suppress
   volume_indexes_prev = volume_indexes

   # method #2, use NumPy functions
   volume_indexes = []

   for volname in volume_names:
       volume_indexes.append(np.min(np.nonzero([volname in docname for docname in docnames])))

   @suppress
   assert volume_indexes == volume_indexes_prev

   # now we can assemble the plot
   plt.plot(series, '.', alpha=0.3)
   plt.plot(series_smooth, '-', linewidth=2)
   plt.vlines(volume_indexes, ymin=0, ymax=np.max(series))
   text_xs = np.array(volume_indexes) + np.diff(np.array(volume_indexes + [max(xs)])) / 2
   text_ys = np.repeat(max(series), len(volume_names)) - 0.05

   for x, y, s in zip(text_xs, text_ys, volume_names):
       plt.text(x, y, s, horizontalalignment='center')

   plt.title('Les Misérables, Topic #35 (barricade enjolras ...)')
   plt.ylabel("Topic share")
   plt.xlabel("Novel segment")
   plt.ylim(0, max(series))

   @savefig plot_topics_over_time_series_les_misérables.png width=7in
   plt.tight_layout()

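With ``docnames`` at hand we can also ask *where* in the novel the topic
peaks. A small sketch: ``np.nanargmax`` finds the index of the largest
smoothed share while skipping the ``NaN`` padding that the rolling mean
leaves at the start of the series.

.. code-block:: python

   # index of the segment where the smoothed topic share is largest
   peak = np.nanargmax(series_smooth)
   docnames[peak]  # the segment where Topic #35 is most prominent
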
There are many other topics that appear in our fit of the corpus. Looping
over the topics and saving an image for each topic is straightforward:

.. ipython:: python
for i in range(num_topics):
plt.clf() # clears the current plot
series = doctopic[:, i]
xs = np.arange(len(series))
series_smooth = pd.rolling_mean(series, 15)
plt.plot(series, '.')
plt.plot(series_smooth, '-', linewidth=2)
plt.title("Topic {}: {}".format(i, ','.join(topic_words[i])))
savefig_fn = "/tmp/hugo-topic{}.pdf".format(i)
plt.savefig(savefig_fn, format='pdf')
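One PDF per topic is convenient for browsing. For a single overview image we
might instead draw all fifty series in one grid of subplots; a sketch,
assuming ten rows of five panels remain legible at the chosen figure size:

.. code-block:: python

   # one small panel per topic: dots for raw shares, a line for the trend
   fig, axes = plt.subplots(10, 5, figsize=(15, 20), sharex=True)
   for i, ax in enumerate(axes.ravel()):
       ax.plot(doctopic[:, i], '.', alpha=0.3)
       ax.plot(pd.rolling_mean(doctopic[:, i], 15), '-', linewidth=2)
       ax.set_title("Topic {}".format(i), fontsize=8)
   fig.tight_layout()
   fig.savefig("/tmp/hugo-topics-all.pdf")
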
.. FOOTNOTES

.. [#fn_les_mis] The text of Les Misérables has been used in a variety of
   (interactive) visualization projects, including `Les Misérables
   Co-occurrence <http://bost.ocks.org/mike/miserables/>`_ and `Novel Views:
   Les Miserables <http://neoformix.com/2013/NovelViews.html>`_.

.. [#fn_lowess] For generic smoothing, those accustomed to using R will be
   familiar with the function ``loess()``, which implements the most common
   form of scatterplot smoothing. In Python a similar function
   (``statsmodels.nonparametric.lowess()``) is available in the
   ``statsmodels`` package. While we might be tempted to use such a function
   to communicate the basic trend visually, we will be better served if we
   think of the sequence of topic shares as a proper time series rather than
   (merely) a sequence of dependent and independent variables suitable for
   visualization in a scatter plot.
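   For the curious, a minimal sketch of that scatterplot-smoothing approach,
   assuming ``statsmodels`` is installed (it is not used elsewhere in this
   tutorial):

   .. code-block:: python

      import statsmodels.api as sm

      # lowess returns an array of (x, fitted) pairs sorted by x;
      # frac controls the span of the smoother (0.1 is an arbitrary choice)
      smoothed = sm.nonparametric.lowess(series, np.arange(len(series)), frac=0.1)
      plt.plot(smoothed[:, 0], smoothed[:, 1], '-')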
