forked from ariddell/tatom
Allen Riddell committed Jan 19, 2014 (1 parent 0b7c709, commit 77c21f4)
Showing 1 changed file with 273 additions and 0 deletions.
.. _visualizing-trends:

====================
Visualizing trends
====================

.. ipython:: python
    :suppress:

    import numpy as np; np.set_printoptions(precision=3)

Texts often come in a sequence. Newspapers and periodicals have volumes.
Novels have chapters. Personal diaries have dated entries. Visualizations of
topic models may benefit from incorporating information about where a text falls
in a sequence.

As a motivating example, consider Victor Hugo's *Les Misérables*. Over 500,000
words long, the book counts as a lengthy text by any
standard. [#fn_les_mis]_ The novel comes in five volumes ("Fantine", "Cosette",
"Marius", "The Idyll in the Rue Plumet and the Epic in the Rue St. Denis", and
"Jean Valjean"). Within each volume there is a sequence of chapters, and within
each chapter a sequence of paragraphs. In this section we will address how to
visualize topic shares in sequence.
|
||
To whet your appetite, consider the rise and fall of a topic associated with | ||
revolutionary activity in *Les Misérables*: | ||
|
||
.. figure:: _static/plot_topics_over_time_series_les_misérables.png
    :scale: 60 %
    :alt: Les Misérables, Topic #35 ("barricade enjolras ...")
|
||
(`Enjolras <https://en.wikipedia.org/wiki/Enjolras>`_ is the leader of the | ||
revolutionary *Les Amis de l'ABC*.) | ||
|
||
.. note:: Probabilistic models such as topic models often benefit from
    incorporating information about where an individual text falls in a larger
    sequence of texts :cite:`blei_dynamic_2006`.
|
||
|
||
Plotting trends | ||
=============== | ||
|
||
As always, we first need to fit a topic model to the corpus. As MALLET has no
built-in French stopword list, we need to provide one. We will use the `French
stopword list
<http://svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt>`_
from the Snowball stemmer package. Additionally, because we are dealing with
non-English text, we need to use an alternate regular expression for
tokenization. MALLET helpfully suggests ``--token-regex '[\p{L}\p{M}]+'``, which
treats any run of Unicode letters and combining marks as a token.
|
||
.. code-block:: bash

    mallet-2.0.7/bin/mallet import-dir --input data/hugo-les-misérables-split/ --output /tmp/topic-input-hugo.mallet --keep-sequence --remove-stopwords --stoplist-file data/stopwords/french.txt --token-regex '[\p{L}\p{M}]+'
    mallet-2.0.7/bin/mallet train-topics --input /tmp/topic-input-hugo.mallet --num-topics 50 --output-doc-topics /tmp/doc-topics-hugo.txt --output-topic-keys /tmp/topic-keys-hugo.txt --word-topic-counts-file /tmp/word-topic-hugo.txt

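To get a feel for what the suggested token pattern does, the following sketch approximates it in Python. It is an illustration under stated assumptions, not MALLET's own tokenizer: the standard-library ``re`` module does not support ``\p{L}``, so the pattern ``[^\W\d_]+`` (runs of Unicode letters) stands in for it and ignores combining marks (``\p{M}``). The helper name ``tokenize`` and the sample sentence are made up for this example.

```python
import re

# Stand-in for MALLET's '[\p{L}\p{M}]+': runs of Unicode letters.
# Assumption: accented characters are precomposed (e.g. 'é' is one code point),
# so leaving out combining marks does not split any word.
LETTERS = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return lowercase tokens made up of Unicode letters only."""
    return LETTERS.findall(text.lower())


sample = "L'élève était à l'école en 1832."
tokens = tokenize(sample)
print(tokens)
```

Accented letters stay inside their words, while the apostrophes, the digits, and the final period act as separators.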
.. ipython:: python
    :suppress:

    import os
    import shutil
    import subprocess
    N_TOPICS = 50
    MALLET_INPUT = 'source/cache/topic-input-hugo-les-misérables-split.mallet'
    MALLET_TOPICS = 'source/cache/doc-topic-hugo-les-misérables-{}topics.txt'.format(N_TOPICS)
    MALLET_WORD_TOPIC_COUNTS = 'source/cache/doc-topic-hugo-les-misérables-{}topics-word-topic.txt'.format(N_TOPICS)
    MALLET_KEYS = 'source/cache/doc-topic-hugo-les-misérables-{}topics-keys.txt'.format(N_TOPICS)
    # raw string: '\p' is not a valid Python escape sequence
    if not os.path.exists(MALLET_INPUT):
        subprocess.check_call(r"""mallet-2.0.7/bin/mallet import-dir --input data/hugo-les-misérables-split/ --output {} --keep-sequence --remove-stopwords --stoplist-file data/stopwords/french.txt --token-regex '[\p{{L}}\p{{M}}]+'""".format(MALLET_INPUT), shell=True)

.. ipython:: python
    :suppress:

    shutil.copy(MALLET_INPUT,'/tmp/topic-input-hugo.mallet')
    if not os.path.exists(MALLET_TOPICS):
        subprocess.check_call('mallet-2.0.7/bin/mallet train-topics --input /tmp/topic-input-hugo.mallet --num-iterations 5000 --num-topics {} --output-doc-topics {} --output-topic-keys {} --word-topic-counts-file {} --random-seed 1'.format(N_TOPICS, MALLET_TOPICS, MALLET_KEYS, MALLET_WORD_TOPIC_COUNTS), shell=True)
    shutil.copy(MALLET_TOPICS,'/tmp/doc-topics-hugo.txt')
    shutil.copy(MALLET_KEYS,'/tmp/topic-keys-hugo.txt')
    shutil.copy(MALLET_WORD_TOPIC_COUNTS,'/tmp/word-topic-hugo.txt')

As usual, we post-process the MALLET output in order to get a matrix of topic | ||
proportions. Each row of the matrix holds the topic proportions associated with | ||
a document. | ||
|
||
.. ipython:: python

    import numpy as np
    import itertools
    import operator
    import os

    def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return itertools.zip_longest(*args, fillvalue=fillvalue)

    doctopic_triples = []

    with open("/tmp/doc-topics-hugo.txt") as f:
        f.readline()  # read one line in order to skip the header
        for line in f:
            docnum, docname, *values = line.rstrip().split('\t')
            # the remaining fields alternate: topic, share, topic, share, ...
            for topic, share in grouper(2, values):
                triple = (docname, int(topic), float(share))
                doctopic_triples.append(triple)

    # sort the triples by document name, then by topic number
    doctopic_triples.sort(key=operator.itemgetter(0,1))
    docnames = sorted(set([triple[0] for triple in doctopic_triples]))
    docnames_base = np.array([os.path.splitext(os.path.basename(n))[0] for n in docnames])
    num_topics = len(doctopic_triples) // len(docnames)

    # one row per document, one column per topic
    doctopic = np.empty((len(docnames), num_topics))

    for i, (doc_name, triples) in enumerate(itertools.groupby(doctopic_triples, key=operator.itemgetter(0))):
        doctopic[i, :] = np.array([share for _, _, share in triples])

    docnames = docnames_base

    # get the topic words
    with open('/tmp/topic-keys-hugo.txt') as f:
        topic_keys_lines = f.readlines()

    topic_words = []

    for line in topic_keys_lines:
        _, _, words = line.split('\t')  # tab-separated
        words = words.rstrip().split(' ')  # remove the trailing '\n'
        topic_words.append(words)

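To make the post-processing step concrete, here is a minimal sketch that runs the same idea on a tiny, made-up input. It assumes the layout the parsing code above relies on (a header line, then ``docnum``, ``docname``, and alternating topic/share fields separated by tabs). The file names and numbers are invented, and plain lists stand in for the NumPy matrix.

```python
import io

# Hypothetical two-document, three-topic output in the assumed layout:
# docnum <TAB> docname <TAB> topic <TAB> share <TAB> topic <TAB> share ...
fake_output = (
    "#doc name topic proportion ...\n"
    "0\tfile:/data/ch01.txt\t2\t0.6\t0\t0.3\t1\t0.1\n"
    "1\tfile:/data/ch02.txt\t1\t0.5\t2\t0.4\t0\t0.1\n"
)

toy_docnames = []
toy_doctopic = []  # one row per document, one column per topic

with io.StringIO(fake_output) as f:
    f.readline()  # skip the header line
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        topics = [int(t) for t in values[0::2]]
        shares = [float(s) for s in values[1::2]]
        row = [0.0] * len(topics)
        for topic, share in zip(topics, shares):
            row[topic] = share  # place each share in its topic's column
        toy_docnames.append(docname)
        toy_doctopic.append(row)

print(toy_docnames)
print(toy_doctopic)
```

Each row now lists the shares in topic order (topic 0 first), regardless of the order in which the topics appeared on the input line.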
Among the fifty topics there is one topic (#35 using 0-based indexing) that | ||
jumps out as characteristic of events towards the close of the novel. The words | ||
most strongly connected with this topic include "barricade", "fusil", and | ||
"cartouches" ("barricade", "rifle", and "cartridges"). | ||
|
||
.. ipython:: python

    ','.join(topic_words[35])

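Topic numbers are arbitrary labels and may differ from one fit to the next unless the random seed is fixed, so it can be more robust to look a topic up by one of its characteristic words than to hard-code its index. The sketch below shows one way to do that. The helper ``topics_containing`` and the toy word lists are hypothetical; in this section the real lists live in ``topic_words``.

```python
# Hypothetical top words for three topics; in the tutorial, `topic_words`
# is read from the MALLET topic-keys file instead.
toy_topic_words = [
    ["évêque", "monseigneur", "bienvenu"],
    ["barricade", "enjolras", "fusil"],
    ["cosette", "marius", "jardin"],
]


def topics_containing(word, topic_words):
    """Return the (0-based) indexes of topics whose top words include `word`."""
    return [i for i, words in enumerate(topic_words) if word in words]


matches = topics_containing("barricade", toy_topic_words)
print(matches)
```

A word may appear among the top words of several topics, which is why the helper returns a list rather than a single index.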
Because the documents are ordered in a sequence, we can plot the fate, so to | ||
speak, of this topic over time with the following lines of code: | ||
|
||
.. ipython:: python

    import matplotlib.pyplot as plt
    series = doctopic[:, 35]
    @savefig plot_topics_over_time_series_simple.png width=7in
    plt.plot(series, '.')  # '.' specifies the type of mark to use on the graph

While this visualization communicates the essential information about the | ||
prevalence of a topic in the corpus, it is not perfect. We can improve it. It | ||
would, for instance, be useful to include an indication of where the various | ||
volumes start and end. Another enhancement would add some kind of "smoothing" to | ||
the time series in order to better communicate the underlying trend. | ||
|
||
A rolling average of the topic shares turns out to be a useful form of smoothing
in this case. We are interested in the prevalence of a topic over time; whether
a topic disappears completely in one 500-word chunk of text (only to reappear in
the next) does not interest us. We want to visualize the underlying trend. That
is, we need some model or heuristic capable of capturing the idea that the topic
(or any similar feature) has an underlying propensity to appear at varying
points of the novel, and that while this propensity may change over time, it
does not fluctuate wildly. [#fn_lowess]_
|
||
Recall that a rolling or moving average of a time series associates with each
point in the series the average of some fixed number of previous
observations (including the current observation). This fixed number of
observations is often called a "window". The idea of a rolling mean (available
in pandas as ``Series.rolling(window).mean()``, which replaces the older
``pandas.rolling_mean()`` function) is effectively communicated visually:
|
||
.. ipython:: python

    import pandas as pd
    z = np.array([ 3.,  2.,  3.,  6.,  2.,  3.,  1.,  3.,  8.,  3.,  5.,
                   8.,  7.,  8.,  7.,  6.,  8.,  7.,  7.,  5.,  8.,  6.,
                  11.,  6.,  7.,  8.,  8.,  6.,  9., 15., 13., 10.,  9.])
    # pd.rolling_mean() is gone from current pandas; Series.rolling() replaces it
    pd.Series(z).rolling(3).mean()

.. ipython:: python

    plt.plot(z, '.', alpha=0.5)
    @savefig plot_topics_over_time_rolling_mean.png width=5in
    plt.plot(pd.Series(z).rolling(5).mean(), '-', linewidth=2)

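For readers who prefer to see the arithmetic spelled out, the following sketch computes a trailing rolling mean by hand. The function name ``trailing_mean`` is hypothetical. It is meant to mirror the definition given above (and the default behaviour of pandas, which reports ``NaN`` until a full window of observations is available), not to replace the pandas implementation.

```python
import math


def trailing_mean(values, window):
    """Mean of each observation and the `window - 1` observations before it."""
    out = []
    for i in range(len(values)):
        if i < window - 1:
            out.append(math.nan)  # undefined until a full window exists
        else:
            chunk = values[i - window + 1:i + 1]
            out.append(sum(chunk) / window)
    return out


smoothed = trailing_mean([3., 2., 3., 6., 2., 3.], window=3)
print(smoothed)
```

With a window of 3 the first two positions have no value, and the third is the mean of the first three observations, (3 + 2 + 3) / 3. A larger window yields a smoother line at the cost of more undefined points at the start of the series.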
After making these two improvements---marking the volume boundaries and adding | ||
a trend line based on a rolling average---the time series for our topic does | ||
a better job of orienting us in the novel and communicating the points in the | ||
novel where the topic appears: | ||
|
||
.. ipython:: python

    import pandas as pd
    # the values on the x-axis (xs) are simply a sequence of integers
    # corresponding to the texts (also the rows in the document topic matrix)
    xs = np.arange(len(series))
    # a window of 15 seems to work well here
    series_smooth = pd.Series(series).rolling(15).mean()

    # now we need to calculate at what index each volume starts
    # there are many ways to do this, two methods are shown below
    # method #1, find the first document whose name contains the volume name
    volume_names = ["tome-1-fantine", "tome-2-cosette", "tome-3-marius", "tome-4", "tome-5-jean-valjean"]
    volume_indexes = []
    for volname in volume_names:
        for i, docname in enumerate(docnames):
            if volname in docname:
                volume_indexes.append(i)
                break

    @suppress
    volume_indexes_prev = volume_indexes

    # method #2, use NumPy functions
    volume_indexes = []
    for volname in volume_names:
        volume_indexes.append(np.min(np.nonzero([volname in docname for docname in docnames])))

    @suppress
    assert volume_indexes == volume_indexes_prev

    # now we can assemble the plot
    plt.plot(series, '.', alpha=0.3)
    plt.plot(series_smooth, '-', linewidth=2)
    plt.vlines(volume_indexes, ymin=0, ymax=np.max(series))
    # place each volume label halfway between its start and the next volume's start
    text_xs = np.array(volume_indexes) + np.diff(np.array(volume_indexes + [max(xs)]))/2
    text_ys = np.repeat(max(series), len(volume_names)) - 0.05
    for x, y, s in zip(text_xs, text_ys, volume_names):
        plt.text(x, y, s, horizontalalignment='center')

    plt.title('Les Misérables, Topic #35 (barricade enjolras ...)')
    plt.ylabel("Topic share")
    plt.xlabel("Novel segment")
    plt.ylim(0, max(series))
    @savefig plot_topics_over_time_series_les_misérables.png width=7in
    plt.tight_layout()

There are, of course, many other topics that appear in our fit of the corpus.
Looping over the topics and saving an image for each topic is straightforward:
|
||
.. ipython:: python

    for i in range(num_topics):
        plt.clf()  # clears the current plot
        series = doctopic[:, i]
        xs = np.arange(len(series))
        series_smooth = pd.Series(series).rolling(15).mean()
        plt.plot(series, '.')
        plt.plot(series_smooth, '-', linewidth=2)
        plt.title("Topic {}: {}".format(i, ','.join(topic_words[i])))
        savefig_fn = "/tmp/hugo-topic{}.pdf".format(i)
        plt.savefig(savefig_fn, format='pdf')

.. FOOTNOTES

.. [#fn_les_mis] The text of Les Misérables has been used in a variety of
   (interactive) visualization projects, including `Les Misérables
   Co-occurrence <http://bost.ocks.org/mike/miserables/>`_ and `Novel Views:
   Les Miserables <http://neoformix.com/2013/NovelViews.html>`_.

.. [#fn_lowess] For generic smoothing, those accustomed to using R will be
   familiar with the function ``loess()``, which implements the most common form
   of scatterplot smoothing. In Python a similar function
   (``statsmodels.nonparametric.smoothers_lowess.lowess()``) is available in the
   ``statsmodels`` package. While we might be tempted to use such a function to
   communicate visually the basic trend, we will be better served if we think of
   the sequence of topic shares as a proper time series rather than (merely)
   a set of paired dependent and independent variables suitable for
   visualization in a scatter plot.