Topic modeling with MALLET

This section illustrates how to use MALLET to model a corpus of texts using a topic model and how to analyze the results using Python.

A topic model is a probabilistic model of the words appearing in a corpus of documents. (There are a number of general introductions to topic models available, such as [Ble12].) The particular topic model used in this section is Latent Dirichlet Allocation (LDA), a model introduced in the context of text analysis in 2003 [BNJ03]. LDA is an instance of a more general class of models called mixed-membership models. While LDA involves a greater number of distributions and parameters than the Bayesian model introduced in the section on group comparison, both are instances of a Bayesian probabilistic model. In fact, posterior inference for both models is typically performed in precisely the same manner, using Gibbs sampling with conjugate priors.
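To make the mixed-membership idea concrete, here is a minimal sketch of LDA's generative story in NumPy. The sizes and the symmetric Dirichlet hyperparameters below are illustrative assumptions, not MALLET's settings, and the sketch shows the model's generative assumptions rather than the Gibbs sampling procedure used to fit it:

import numpy as np

rng = np.random.default_rng(1)

# illustrative sizes and hyperparameters (assumptions, not MALLET defaults)
num_topics, vocab_size, doc_length = 20, 1000, 500
alpha, beta = 0.1, 0.01

# each topic is a discrete distribution over the vocabulary
topic_word = rng.dirichlet([beta] * vocab_size, size=num_topics)

# each document has its own mixture of topics (its "topic shares")
theta = rng.dirichlet([alpha] * num_topics)

# generate a document: draw a topic for each token, then a word from that topic
topics = rng.choice(num_topics, size=doc_length, p=theta)
words = [rng.choice(vocab_size, p=topic_word[z]) for z in topics]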

This section assumes prior exposure to topic modeling and proceeds as follows:

  1. MALLET is downloaded and used to fit a topic model of six novels, three by Brontë and three by Austen. Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably (a brief sketch of this kind of splitting follows the list).
  2. The output of MALLET is loaded into Python as a document-topic matrix (a 2-dimensional array) of topic shares.
  3. Topics, discrete distributions over the vocabulary, are analyzed.
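
For reference, splitting a long text into sections takes only a few lines of Python. The sketch below is illustrative; the 500-word chunk size is an assumption, not necessarily the value used to produce data/austen-brontë-split.

def split_text(filename, n_words=500):
    """Split the text in filename into chunks of at most n_words words."""
    with open(filename) as f:
        words = f.read().split()
    return [' '.join(words[i:i + n_words])
            for i in range(0, len(words), n_words)]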

Note that an entire section is devoted to visualizing topic models. This section focuses on using MALLET and processing the results.

This section uses six novels by Brontë and Austen. These novels are divided into parts as follows:

In [1]: import os

In [2]: CORPUS_PATH = os.path.join('data', 'austen-brontë-split')

In [3]: filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])
# files are located in data/austen-brontë-split
In [4]: len(filenames)
Out[4]: 813

In [5]: filenames[:5]
Out[5]: 
['data/austen-brontë-split/Austen_Emma0000.txt',
 'data/austen-brontë-split/Austen_Emma0001.txt',
 'data/austen-brontë-split/Austen_Emma0002.txt',
 'data/austen-brontë-split/Austen_Emma0003.txt',
 'data/austen-brontë-split/Austen_Emma0004.txt']

Running MALLET

Note

The nltk package provides a thin wrapper for MALLET which may be worth investigating. See nltk.classify.mallet.

On Linux and BSD-based systems (such as OS X), the following commands should download and extract MALLET:

# alternatively: wget http://mallet.cs.umass.edu/dist/mallet-2.0.7.tar.gz
curl --remote-name http://mallet.cs.umass.edu/dist/mallet-2.0.7.tar.gz
tar zxf mallet-2.0.7.tar.gz

We will run MALLET using the default parameters. Using the option --random-seed 1 should guarantee that the results produced match those appearing below.

mallet-2.0.7/bin/mallet import-dir --input data/austen-brontë-split/ --output /tmp/topic-input-austen-brontë.mallet --keep-sequence --remove-stopwords
mallet-2.0.7/bin/mallet train-topics --input /tmp/topic-input-austen-brontë.mallet --num-topics 20 --output-doc-topics /tmp/doc-topics-austen-brontë.txt --output-topic-keys /tmp/topic-keys-austen-brontë.txt --random-seed 1

Under Windows the commands are similar. For detailed instructions see the article “Getting Started with Topic Modeling and MALLET”. The MALLET homepage also has instructions on how to install and run the software under Windows.

Processing MALLET output

We have already seen that a document-term matrix is a convenient way to represent the word frequencies associated with each document. Similarly, as each document is associated with a set of topic shares, it will be useful to gather these features into a document-topic matrix.

Note

Topic shares are also referred to as topic weights, mixture weights, or component weights. Different communities favor different terms.

Manipulating the output of MALLET into a document-topic matrix is not entirely intuitive. Fortunately the tools required for the job are available in Python and the procedure is similar to that reviewed in the previous section on grouping texts.

MALLET delivers the topic shares for each document into a file specified by the --output-doc-topics option. In this case we have provided the output filename /tmp/doc-topics-austen-brontë.txt. The first lines of this file should look something like this:

#doc name topic proportion ...
0    file:/.../austen-brontë-split/Austen_Pride0103.txt      3       0.2110215053763441      14      0.13306451612903225
1    file:/.../austen-brontë-split/Austen_Pride0068.txt      17      0.19915254237288135     3       0.14548022598870056
...

The first two columns of doc-topics-austen-brontë.txt record the document number (0-based indexing) and the full path to the filename. The rest of the columns are best considered as (topic-number, topic-share) pairs; there are as many of these pairs as there are topics. All columns are separated by tabs (there’s even a trailing tab at the end of the line). With the exception of the header (the first line), this file records data using tab-separated values.

There are two challenges in parsing this file into a document-topic matrix. The first is sorting: the texts do not appear in a consistent order in the file, and the (topic-number, topic-share) pairs appear in different columns depending on the document, so we will need to reorder these pairs before assembling them into a matrix.[1] The second challenge is that the number of columns varies with the number of topics specified (--num-topics). Fortunately, the documentation for the Python itertools library describes a recipe called grouper, built on itertools.zip_longest, that solves our problem.

In [6]: import numpy as np

In [7]: import itertools

In [8]: import operator

In [9]: import os

In [10]: def grouper(n, iterable, fillvalue=None):
   ....:     "Collect data into fixed-length chunks or blocks"
   ....:     args = [iter(iterable)] * n
   ....:     return itertools.zip_longest(*args, fillvalue=fillvalue)
   ....: 
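
A quick check shows what grouper does with a flat sequence of alternating topic numbers and shares:

# pairs up a flat sequence two items at a time
list(grouper(2, ['3', '0.211', '14', '0.133']))
# -> [('3', '0.211'), ('14', '0.133')]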

In [11]: doctopic_triples = []

In [12]: mallet_docnames = []

In [13]: with open("/tmp/doc-topics-austen-brontë.txt") as f:
   ....:     f.readline()  # read one line in order to skip the header
   ....:     for line in f:
   ....:         docnum, docname, *values = line.rstrip().split('\t')
   ....:         mallet_docnames.append(docname)
   ....:         for topic, share in grouper(2, values):
   ....:             triple = (docname, int(topic), float(share))
   ....:             doctopic_triples.append(triple)
   ....: 

# sort the triples
# triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
# sorts on (docname, topicnum) which is what we want
In [14]: doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

# sort the document names rather than relying on MALLET's ordering
In [15]: mallet_docnames = sorted(mallet_docnames)

# collect into a document-topic matrix
In [16]: num_docs = len(mallet_docnames)

In [17]: num_topics = len(doctopic_triples) // len(mallet_docnames)
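
A quick sanity check: every section should contribute exactly one row of shares, and the number of topics should match the --num-topics option used above.

# one row of shares per section; 20 topics, as requested with --num-topics
assert num_docs == len(filenames)
assert num_topics == 20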

# assemble the matrix: look up each document's row by name and fill in its shares
In [18]: doctopic = np.zeros((num_docs, num_topics))

In [19]: for triple in doctopic_triples:
   ....:     docname, topic, share = triple
   ....:     row_num = mallet_docnames.index(docname)
   ....:     doctopic[row_num, topic] = share
   ....: 
# The following method is considerably faster. It uses itertools.groupby, which
# requires its input to be sorted by the grouping key (here the document name,
# which is how doctopic_triples was sorted above).
In [20]: import itertools

In [21]: import operator

In [22]: doctopic = np.zeros((num_docs, num_topics))

In [23]: for i, (doc_name, triples) in enumerate(itertools.groupby(doctopic_triples, key=operator.itemgetter(0))):
   ....:     doctopic[i, :] = np.array([share for _, _, share in triples])
   ....: 

Now we will calculate the average of the topic shares associated with each novel. Recall that we have been working with small sections of novels. The following step combines the topic shares for sections associated with the same novel.

In [24]: novel_names = []

In [25]: for fn in filenames:
   ....:     basename = os.path.basename(fn)
   ....:     name, ext = os.path.splitext(basename)
   ....:     name = name.rstrip('0123456789')  # e.g. 'Austen_Emma0001' -> 'Austen_Emma'
   ....:     novel_names.append(name)
   ....: 

# turn this into an array so we can use NumPy functions
In [26]: novel_names = np.asarray(novel_names)

In [27]: doctopic_orig = doctopic.copy()

# use method described in preprocessing section
In [28]: num_groups = len(set(novel_names))

In [29]: doctopic_grouped = np.zeros((num_groups, num_topics))

In [30]: for i, name in enumerate(sorted(set(novel_names))):
   ....:     doctopic_grouped[i, :] = np.mean(doctopic[novel_names == name, :], axis=0)
   ....: 

In [31]: doctopic = doctopic_grouped

                    Topic 0   Topic 1   Topic 2   Topic 3   Topic 4   Topic 5   Topic 6   Topic 7   Topic 8   Topic 9   Topic 10  Topic 11  Topic 12  Topic 13  Topic 14
Austen_Emma            0.02      0.01      0.06      0.02      0.07      0.05      0.04      0.01      0.02      0.02      0.02      0.26      0.08      0.10      0.02
Austen_Pride           0.03      0.01      0.06      0.02      0.07      0.05      0.04      0.01      0.02      0.02      0.02      0.02      0.09      0.11      0.24
Austen_Sense           0.23      0.01      0.07      0.02      0.08      0.05      0.04      0.01      0.02      0.02      0.02      0.02      0.07      0.11      0.02
CBronte_Jane           0.02      0.02      0.05      0.09      0.05      0.05      0.05      0.05      0.11      0.08      0.09      0.02      0.03      0.03      0.02
CBronte_Professor      0.01      0.06      0.05      0.06      0.05      0.06      0.06      0.06      0.04      0.07      0.04      0.01      0.04      0.03      0.01
CBronte_Villette       0.01      0.09      0.04      0.06      0.04      0.06      0.05      0.10      0.03      0.07      0.04      0.01      0.04      0.03      0.01

In order to fit into the space available, the table above displays the first 15 of 20 topics.
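
As a sanity check, each row of doctopic (taken over all 20 topics, not just the 15 shown) is a distribution and should sum to approximately one:

# averaging distributions yields a distribution, so each row sums to ~1.0
np.round(doctopic.sum(axis=1), 2)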

Inspecting the topic model

The first thing we should appreciate about our topic model is that its twenty topic shares do a remarkably good job of summarizing our corpus. For example, they preserve the distances between novels (see figures below). By this measure, LDA is good at dimensionality reduction: we have taken a matrix of dimensions 813 by 22854 (occupying almost three megabytes of memory even when stored as a sparse matrix) and fashioned a representation that preserves important features in a matrix that is 813 by 20 (less than 5% of the original's memory footprint).

In [32]: from sklearn.feature_extraction.text import CountVectorizer

In [33]: CORPUS_PATH_SPLIT = os.path.join('data', 'austen-brontë-split')

In [34]: filenames = [os.path.join(CORPUS_PATH_SPLIT, fn) for fn in sorted(os.listdir(CORPUS_PATH_SPLIT))]

In [35]: vectorizer = CountVectorizer(input='filename')

In [36]: dtm = vectorizer.fit_transform(filenames)  # a sparse matrix

In [37]: dtm.shape
Out[37]: (813, 22854)

In [38]: dtm.data.nbytes  # number of bytes dtm takes up
Out[38]: 2996776

In [39]: dtm.toarray().data.nbytes  # number of bytes dtm as array takes up
Out[39]: 148642416

In [40]: doctopic_orig.shape
Out[40]: (813, 20)

In [41]: doctopic_orig.data.nbytes  # number of bytes document-topic shares take up
Out[41]: 130080
[Figures: MDS plots of the texts based on topic shares, using cosine distances (plot_topic_model_cosine_mds.png) and Euclidean distances (plot_topic_model_doctopic_euclidean_mds.png)]
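
The next section treats visualization in detail. For reference, a plot along these lines can be produced with scikit-learn and matplotlib; the following is a minimal sketch under those assumptions, not the exact code used to generate the figures above:

from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# pairwise cosine distances between sections in topic space
dist = 1 - cosine_similarity(doctopic_orig)

# embed the sections in two dimensions, preserving distances where possible
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=1)
pos = mds.fit_transform(dist)

# plot each section, colored by the novel it comes from
for name in sorted(set(novel_names)):
    mask = novel_names == name
    plt.scatter(pos[mask, 0], pos[mask, 1], label=name)
plt.legend()
plt.show()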

Even though a topic model “discards” the “fine-grained” information recorded in the matrix of word frequencies, it preserves salient details of the underlying matrix. That is, the topic shares associated with a document have an interpretation in terms of word frequencies. This is best illustrated by examining the present topic model.

First let us identify the most significant topics for each text in the corpus. This procedure does not differ in essence from the procedure for identifying the most frequent words in each text.

In [42]: novels = sorted(set(novel_names))

In [43]: print("Top topics in...")
Top topics in...

In [44]: for i in range(len(doctopic)):
   ....:     top_topics = np.argsort(doctopic[i,:])[::-1][0:3]
   ....:     top_topics_str = ' '.join(str(t) for t in top_topics)
   ....:     print("{}: {}".format(novels[i], top_topics_str))
   ....: 
Austen_Emma: 11 13 12
Austen_Pride: 14 13 12
Austen_Sense: 0 13 4
CBronte_Jane: 8 3 10
CBronte_Professor: 17 9 5
CBronte_Villette: 7 19 1

Note

Recall that, like everything else in Python (and C, Java, and many other languages), the topics use 0-based indexing; the first topic is topic 0.

Each topic in the topic model can be inspected. A topic is a distribution over the vocabulary that captures, in probabilistic terms, the words associated with the topic and the strength of that association (the posterior probability of finding a word associated with a topic). Sometimes this distribution is called a topic-word distribution (in contrast to the document-topic distribution). Again, this is best illustrated by inspecting the topic-word distributions provided by MALLET for our Austen-Brontë corpus. MALLET places (a subset of) the topic-word distribution for each topic in a file specified by the command-line option --output-topic-keys. For the run of MALLET used in this section, this file is /tmp/topic-keys-austen-brontë.txt. The first line of this file should resemble the following:

0    2.5     long room looked day eyes make voice head till girl morning feel called table turn continued times appeared breakfast

We need to parse this file into something we can work with. Fortunately this task is not difficult.

In [45]: with open('/tmp/topic-keys-austen-brontë.txt') as f:
   ....:     topic_keys_lines = f.readlines()
   ....: 

In [46]: topic_words = []

In [47]: for line in topic_keys_lines:
   ....:     _, _, words = line.split('\t')  # tab-separated
   ....:     words = words.rstrip().split(' ')  # remove the trailing '\n'
   ....:     topic_words.append(words)
   ....: 

# now we can get a list of the top words for topic 0 with topic_words[0]
In [48]: topic_words[0]
Out[48]: 
['elinor',
 'mrs',
 'marianne',
 'sister',
 'mother',
 'edward',
 'dashwood',
 'colonel',
 'jennings',
 'willoughby',
 'john',
 'thing',
 'lucy',
 'great',
 'miss',
 'brandon',
 'day',
 'dear',
 'happy']

Now we have everything we need to list the words associated with each topic.

In [49]: N_WORDS_DISPLAY = 10

In [50]: for t in range(len(topic_words)):
   ....:     print("Topic {}: {}".format(t, ' '.join(topic_words[t][:N_WORDS_DISPLAY])))
   ....: 
Topic 0: elinor mrs marianne sister mother edward dashwood colonel jennings willoughby
Topic 1: madame monsieur paul de mademoiselle vous est emanuel la hand
Topic 2: man good make years life woman wife suppose father young
Topic 3: jane god st john heart mine felt put hand strange
Topic 4: time morning long left found return felt days wished leave
Topic 5: made moment looked eyes silence voice smile sat man gave
Topic 6: house looked good thought place found small fine large asked
Topic 7: madame beck dress pale knew light stood dark blue full
Topic 8: mr sir rochester don hall back heard night master ll
Topic 9: door long round house garden air black high rose great
Topic 10: room miss mrs chair table eyes long head hands hair
Topic 11: mr emma mrs miss harriet thing weston knightley elton jane
Topic 12: young lady evening general people pleasure party pretty attention ladies
Topic 13: feelings happiness opinion friend regard situation spirits affection ill give
Topic 14: mr elizabeth darcy jane bennet mrs bingley miss sister wickham
Topic 15: letter give make word told till heard cried knew speak
Topic 16: day night hour hand evening life thought heart sweet long
Topic 17: school english french hunsden frances mdlle pelet crimsworth read time
Topic 18: love mind felt good feel heart thought world sense feeling
Topic 19: bretton dr graham lucy home good john papa don child

There are many ways to inspect and to visualize topic models. Some of the more common methods are covered in the next section.

Distinctive topics

Finding distinctive topics is analogous to the task of finding distinctive words. The topic model does an excellent job of focusing attention on recurrent patterns (of co-occurrence) in the word frequencies appearing in a corpus. To the extent that we are interested in these kinds of patterns (rather than rare or isolated features of texts), working with topics tends to be easier than working with word frequencies.

Consider the task of finding the distinctive topics in Austen’s novels. Here the simple difference-in-averages provides an easy way of finding topics that tend to be associated more strongly with Austen’s novels than with Brontë’s.

In [51]: austen_indices, cbronte_indices = [], []

In [52]: for index, fn in enumerate(sorted(set(novel_names))):
   ....:     if "Austen" in fn:
   ....:         austen_indices.append(index)
   ....:     elif "CBronte" in fn:
   ....:         cbronte_indices.append(index)
   ....: 

In [53]: austen_avg = np.mean(doctopic[austen_indices, :], axis=0)

In [54]: cbronte_avg = np.mean(doctopic[cbronte_indices, :], axis=0)

In [55]: keyness = np.abs(austen_avg - cbronte_avg)

In [56]: ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences

# distinctive topics:
In [57]: ranking[:10]
Out[57]: array([11, 14,  0, 13, 17,  7,  9,  3,  1, 12])
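
The tables below collect the topic shares and the top words for these ten most distinctive topics. The second table can be reproduced from the topic_words list built earlier:

# print the top words for the ten most distinctive topics
for t in sorted(ranking[:10]):
    print("Topic {}: {}".format(t, ' '.join(topic_words[t][:10])))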
                    Topic 0   Topic 1   Topic 3   Topic 7   Topic 9   Topic 11  Topic 12  Topic 13  Topic 14  Topic 17
Austen_Emma            0.02      0.01      0.02      0.01      0.02      0.26      0.08      0.10      0.02      0.02
Austen_Pride           0.03      0.01      0.02      0.01      0.02      0.02      0.09      0.11      0.24      0.01
Austen_Sense           0.23      0.01      0.02      0.01      0.02      0.02      0.07      0.11      0.02      0.01
CBronte_Jane           0.02      0.02      0.09      0.05      0.08      0.02      0.03      0.03      0.02      0.04
CBronte_Professor      0.01      0.06      0.06      0.06      0.07      0.01      0.04      0.03      0.01      0.16
CBronte_Villette       0.01      0.09      0.06      0.10      0.07      0.01      0.04      0.03      0.01      0.04

Topic 0   elinor mrs marianne sister mother edward dashwood colonel jennings willoughby
Topic 1   madame monsieur paul de mademoiselle vous est emanuel la hand
Topic 3   jane god st john heart mine felt put hand strange
Topic 7   madame beck dress pale knew light stood dark blue full
Topic 9   door long round house garden air black high rose great
Topic 11  mr emma mrs miss harriet thing weston knightley elton jane
Topic 12  young lady evening general people pleasure party pretty attention ladies
Topic 13  feelings happiness opinion friend regard situation spirits affection ill give
Topic 14  mr elizabeth darcy jane bennet mrs bingley miss sister wickham
Topic 17  school english french hunsden frances mdlle pelet crimsworth read time
[1] Those familiar with MapReduce may recognize the pattern of splitting a dataset into smaller pieces and then (re)ordering them.