topic model, non-negative matrix factorization, NMF

Topic modeling in Python

python

import numpy as np; np.set_printoptions(precision=2)

This section illustrates how to do approximate topic modeling in Python. We will use a technique called non-negative matrix factorization (NMF), which strongly resembles Latent Dirichlet Allocation (LDA), covered in the previous section, topic-model-mallet. [1] Whereas LDA is a probabilistic model capable of expressing uncertainty about the placement of topics across texts and the assignment of words to topics, NMF is a deterministic algorithm that arrives at a single representation of the corpus. For this reason, NMF is often characterized as a machine learning algorithm. Like LDA, NMF arrives at its representation of a corpus in terms of something resembling "latent topics".

Note

The name "Non-negative matrix factorization" has the virtue of being transparent. A "non-negative matrix" is a matrix containing non-negative values (here zero or positive word frequencies). And factorization refers to the familiar kind of mathematical factorization. Just as a polynomial $x^2 + 3x + 2$ may be factored into a simple product $(x + 2)(x + 1)$, so too may a matrix $\bigl(\begin{smallmatrix} 6&2&4\\ 9&3&6 \end{smallmatrix} \bigr)$ be factored into the product of two smaller matrices $\bigl(\begin{smallmatrix} 2\\ 3 \end{smallmatrix} \bigr) \bigl(\begin{smallmatrix} 3&1&2 \end{smallmatrix} \bigr)$.
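The factorization in the note can be checked numerically. The following is a minimal sketch using NumPy only; the variable names `V`, `W` and `H` are chosen here for illustration and are not used elsewhere in this section.

```python
import numpy as np

# The 2x3 non-negative matrix from the note.
V = np.array([[6, 2, 4],
              [9, 3, 6]])

# Two smaller non-negative factors: a 2x1 column and a 1x3 row.
W = np.array([[2],
              [3]])
H = np.array([[3, 1, 2]])

# Multiplying the factors should give back the original matrix.
product = W @ H
factorization_holds = np.array_equal(product, V)
print(product)
print(factorization_holds)
```

In a real corpus the factorization is only approximate, and the number of columns in `W` (rows in `H`) plays the role of the number of topics.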

This section follows the procedures described in topic-model-mallet, making the substitution of NMF for LDA where appropriate.

This section uses the novels by Brontë and Austen. These novels are divided into parts as follows:

python

import os
CORPUS_PATH = os.path.join('data', 'austen-brontë-split')
filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])

python

# files are located in data/austen-brontë-split
len(filenames)
filenames[:5]

Using Non-negative matrix factorization

As always we need to give Python access to our corpus. In this case we will work with our familiar document-term matrix.

python

import numpy as np  # a conventional alias
import sklearn.feature_extraction.text as text

vectorizer = text.CountVectorizer(input='filename', stop_words='english', min_df=20)
dtm = vectorizer.fit_transform(filenames).toarray()
# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
vocab = np.array(vectorizer.get_feature_names_out())

dtm.shape
len(vocab)

By analogy with LDA, we will use NMF to get a document-topic matrix (topics here will also be referred to as "components") and a list of top words for each topic. We will make the analogy clear by using the same variable names: doctopic and topic_words.

python

from sklearn import decomposition

num_topics = 20
num_top_words = 20

clf = decomposition.NMF(n_components=num_topics, random_state=1)

# this next step may take some time

python

# suppress this

import os
import pickle

NMF_TOPICS = 'source/cache/nmf-austen-brontë-doc-topic.pkl'
NMF_CLF = 'source/cache/nmf-austen-brontë-clf.pkl'

# the ipython directive seems to have trouble with multi-line indented blocks
if not os.path.exists(NMF_CLF):
    doctopic = clf.fit_transform(dtm)
    pickle.dump(doctopic, open(NMF_TOPICS, 'wb'))
    pickle.dump(clf, open(NMF_CLF, 'wb'))

clf = pickle.load(open(NMF_CLF, 'rb'))
doctopic = pickle.load(open(NMF_TOPICS, 'rb'))

doctopic = clf.fit_transform(dtm)

python

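# print words associated with topics
topic_words = []
# components_ (note the trailing underscore) has one row per topic and one column per vocabulary item
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])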
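To see what the two outputs of the fit look like, here is a minimal sketch on a hypothetical toy corpus of four documents and six words. The data, the `toy_` names, and the choices `init='nndsvd'` and `max_iter=500` are assumptions made for this illustration, not settings used in this section.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy corpus: 4 documents (rows) by 6 vocabulary items (columns).
toy_vocab = np.array(['love', 'heart', 'marriage', 'ship', 'sea', 'captain'])
toy_dtm = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 1, 0],
    [0, 0, 0, 6, 5, 4],
    [0, 1, 0, 5, 6, 3],
], dtype=float)

toy_model = NMF(n_components=2, init='nndsvd', max_iter=500, random_state=1)
toy_doctopic = toy_model.fit_transform(toy_dtm)  # shape (n_documents, n_topics)
toy_components = toy_model.components_           # shape (n_topics, n_vocab)

# For each component, list the three words with the largest weights.
for k, component in enumerate(toy_components):
    top_idx = np.argsort(component)[::-1][:3]
    print("Topic {}: {}".format(k, ' '.join(toy_vocab[top_idx])))
```

The document-topic matrix returned by `fit_transform` has one row per document, while `components_` has one row per topic; both contain only non-negative values.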

To make the analysis and visualization of NMF components similar to that of LDA's topic proportions, we will scale the document-component matrix such that the component values associated with each document sum to one.

python

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
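The effect of this scaling is easiest to see on a small example. The sketch below uses made-up component weights (`weights` is a hypothetical name); `keepdims=True` keeps the row totals as a column so that the division broadcasts row by row.

```python
import numpy as np

# Hypothetical component weights for two documents and three components.
weights = np.array([[2.0, 1.0, 1.0],
                    [0.5, 0.0, 1.5]])

# Row totals with shape (2, 1) rather than (2,), so each row is divided by its own total.
row_totals = np.sum(weights, axis=1, keepdims=True)
shares = weights / row_totals

print(shares)
print(shares.sum(axis=1))
```

A document whose component weights are all zero would produce a division by zero here, so it may be worth checking for empty rows before scaling.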

Now we will average the topic shares associated with the same novel, just as we did with the topic shares from MALLET.

python

novel_names = []
for fn in filenames:
    basename = os.path.basename(fn)
    # splitext splits the extension off, 'novel.txt' -> ('novel', '.txt')
    name, ext = os.path.splitext(basename)
    # remove trailing numbers identifying chunk
    name = name.rstrip('0123456789')
    novel_names.append(name)
# turn this into an array so we can use NumPy functions
novel_names = np.asarray(novel_names)

@suppress
assert len(set(novel_names)) == 6
@suppress
doctopic_orig = doctopic.copy()

# use method described in preprocessing section
num_groups = len(set(novel_names))
doctopic_grouped = np.zeros((num_groups, num_topics))
for i, name in enumerate(sorted(set(novel_names))):
    doctopic_grouped[i, :] = np.mean(doctopic[novel_names == name, :], axis=0)

doctopic = doctopic_grouped

@suppress
docnames = sorted(set(novel_names))
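The grouping step can be traced on a toy example. This is a sketch with hypothetical filenames and topic shares (all `toy_` names are made up for illustration); it follows the same pattern of stripping the chunk number and averaging with a boolean mask.

```python
import os
import numpy as np

# Hypothetical chunk filenames: two chunks of one novel, one chunk of another.
toy_filenames = ['data/Austen_Emma0000.txt',
                 'data/Austen_Emma0001.txt',
                 'data/CBronte_Jane0000.txt']

# Strip the directory, the extension, and the trailing chunk number.
toy_names = np.asarray([
    os.path.splitext(os.path.basename(fn))[0].rstrip('0123456789')
    for fn in toy_filenames
])

# Hypothetical topic shares, one row per chunk.
toy_shares = np.array([[0.2, 0.8],
                       [0.4, 0.6],
                       [0.9, 0.1]])

groups = sorted(set(toy_names))
grouped = np.zeros((len(groups), toy_shares.shape[1]))
for i, name in enumerate(groups):
    # toy_names == name is a boolean mask selecting the chunks of one novel
    grouped[i, :] = np.mean(toy_shares[toy_names == name, :], axis=0)

print(groups)
print(grouped)
```

Note that `rstrip('0123456789')` removes every trailing digit, so a title that itself ends in a number would be shortened as well; this is not an issue for the filenames used in this section.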

python

import pandas as pd
OUTPUT_HTML_PATH = os.path.join('source', 'generated')
rownames = sorted(set(novel_names))
colnames = ["NMF Topic " + str(i + 1) for i in range(doctopic.shape[1])][0:15]
html = pd.DataFrame(np.round(doctopic[:,0:15], 2), index=rownames, columns=colnames).to_html()
with open(os.path.join(OUTPUT_HTML_PATH, 'NMF_doctopic.txt'), 'w') as f:
    f.write(html)

In order to fit into the space available, the table above displays the first 15 of 20 topics.

Inspecting the NMF fit

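The topics (or components) of the NMF fit largely preserve the relative distances between novels: novels that are close to each other in terms of word frequencies remain close in terms of NMF components (see the figures below).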

python

# COSINE SIMILARITY
import os  # for os.path.basename
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

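# dtm has one row per chunk; sum the chunks of each novel so that every point plotted below is a novel
names = sorted(set(novel_names))
dtm_grouped = np.array([dtm[novel_names == name, :].sum(axis=0) for name in names])
dist = 1 - cosine_similarity(dtm_grouped)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_samples, n_components)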

python

assert dtm.shape[0] == doctopic_orig.shape[0]
# NOTE: the IPython directive seems less prone to errors when these blocks
# are split up.
xs, ys = pos[:, 0], pos[:, 1]
names = sorted(set(novel_names))
for x, y, name in zip(xs, ys, names):
    color = 'orange' if "Austen" in name else 'skyblue'
    plt.scatter(x, y, c=color)
    plt.text(x, y, name)

plt.title("Distances calculated using word frequencies")
@savefig plot_nmf_section_austen_brontë_cosine_mds.png width=7in
plt.show()
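For readers who want to see what the cosine distance measures, the following sketch computes it by hand with NumPy on hypothetical word counts (the `counts` matrix is made up). It is meant as an illustration of the quantity, not as a replacement for `cosine_similarity`.

```python
import numpy as np

# Hypothetical word counts: row 1 is row 0 doubled, row 2 shares no words with either.
counts = np.array([[2.0, 0.0, 1.0],
                   [4.0, 0.0, 2.0],
                   [0.0, 3.0, 0.0]])

# Scale each row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
unit = counts / norms
cosine_sim = unit @ unit.T
cosine_dist = 1 - cosine_sim

print(np.round(cosine_dist, 2))
```

Rows 0 and 1 have the same proportions of words and are therefore at distance (approximately) zero even though one text is twice as long, while row 2 is at distance one from both. This insensitivity to text length is one reason cosine distance is a common choice for word frequencies.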

python

# NMF
import os  # for os.path.basename
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import euclidean_distances

dist = euclidean_distances(doctopic)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

python

# NOTE: the IPython directive seems less prone to errors when these blocks are split up
xs, ys = pos[:, 0], pos[:, 1]
names = sorted(set(novel_names))
for x, y, name in zip(xs, ys, names):
    color = 'orange' if "Austen" in name else 'skyblue'
    plt.scatter(x, y, c=color)
    plt.text(x, y, name)

plt.title("Distances calculated using NMF components")
@savefig plot_NMF_euclidean_mds.png width=7in
plt.show()

Even though the NMF fit "discards" the fine-grained detail recorded in the matrix of word frequencies, the matrix factorization performed allows us to reconstruct the salient details of the underlying matrix.
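The idea of reconstruction can be made concrete on synthetic data. The sketch below builds a hypothetical non-negative matrix from two known factors, fits NMF with two components, and measures how closely the product of the fitted factors matches the original. The data and the settings `init='nndsvd'` and `max_iter=1000` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative matrix of rank 2: 8 "documents" by 12 "words".
rng = np.random.RandomState(1)
true_W = rng.rand(8, 2)
true_H = rng.rand(2, 12)
V = true_W @ true_H

model = NMF(n_components=2, init='nndsvd', max_iter=1000, random_state=1)
W = model.fit_transform(V)  # shape (8, 2)
H = model.components_       # shape (2, 12)

# Relative reconstruction error in the Frobenius norm (0 would be a perfect fit).
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

print("values stored: {} in V, {} in W and H".format(V.size, W.size + H.size))
print("relative reconstruction error: {:.4f}".format(relative_error))
```

The two factors hold far fewer numbers than the original matrix, which is the sense in which detail is "discarded". Note that in this section `doctopic` is subsequently rescaled and averaged by novel, so reconstructing `dtm` would require the unscaled output of `fit_transform`.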

As we did in the previous section, let us identify the most significant topics for each text in the corpus. This procedure does not differ in essence from the procedure for identifying the most frequent words in each text.

python

novels = sorted(set(novel_names))
print("Top NMF topics in...")
for i in range(len(doctopic)):
    top_topics = np.argsort(doctopic[i,:])[::-1][0:3]
    top_topics_str = ' '.join(str(t) for t in top_topics)
    print("{}: {}".format(novels[i], top_topics_str))

And we already have lists of words (topic_words) most strongly associated with the components. For reference, we will display them again:

python

# show the top 15 words
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))

There are many ways to inspect and to visualize topic models. Some of the most common methods are covered in topic-model-visualization.

Distinctive topics

Consider the task of finding the topics that are distinctive of Austen using the NMF "topics". Using a simple difference-in-averages, we can find topics that tend to be associated with Austen's novels rather than Brontë's.

python

austen_indices, cbronte_indices = [], []
for index, fn in enumerate(sorted(set(novel_names))):
    if "Austen" in fn:
        austen_indices.append(index)
    elif "CBronte" in fn:
        cbronte_indices.append(index)

austen_avg = np.mean(doctopic[austen_indices, :], axis=0)
cbronte_avg = np.mean(doctopic[cbronte_indices, :], axis=0)
keyness = np.abs(austen_avg - cbronte_avg)
ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences

# distinctive topics:
ranking[:10]
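A small worked example may help in reading the ranking. The sketch below applies the same difference-in-averages to hypothetical topic shares for four novels and three topics (all `toy_` names and values are made up). It also keeps the signed difference, which is an addition for illustration and not part of the procedure above.

```python
import numpy as np

# Hypothetical topic shares: one row per novel, one column per topic.
toy_names = ['Austen_Emma', 'Austen_Pride', 'CBronte_Jane', 'CBronte_Villette']
toy_shares = np.array([[0.6, 0.3, 0.1],
                       [0.5, 0.3, 0.2],
                       [0.1, 0.3, 0.6],
                       [0.2, 0.4, 0.4]])

austen_rows = [i for i, name in enumerate(toy_names) if 'Austen' in name]
cbronte_rows = [i for i, name in enumerate(toy_names) if 'CBronte' in name]

austen_avg = np.mean(toy_shares[austen_rows, :], axis=0)
cbronte_avg = np.mean(toy_shares[cbronte_rows, :], axis=0)

# Positive values lean towards Austen, negative values towards Brontë.
signed_diff = austen_avg - cbronte_avg
keyness = np.abs(signed_diff)
ranking = np.argsort(keyness)[::-1]  # most distinctive topic first

print(np.round(signed_diff, 2))
print(ranking)
```

Because `keyness` is an absolute value, the ranking contains topics that are distinctive in either direction. Here topic 0 is characteristic of the Austen rows and topic 2 of the Brontë rows, and both rank ahead of topic 1, which the two groups share almost equally.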

python

N_WORDS_DISPLAY = 10
N_TOPICS_DISPLAY = 10
topics_display = sorted(ranking[0:N_TOPICS_DISPLAY])
# table 1: topic shares of the most distinctive topics, one row per novel
arr = doctopic[:, topics_display]
colnames = ["Topic {}".format(t) for t in topics_display]
rownames = sorted(set(novel_names))
html = pd.DataFrame(np.round(arr,2), index=rownames, columns=colnames).to_html()
# table 2: top words of the same topics, one row per topic
# (np.vstack is used because np.row_stack is a deprecated alias for it in recent NumPy releases)
arr = np.vstack([topic_words[t][:N_WORDS_DISPLAY] for t in topics_display])
rownames = ["Topic {}".format(t) for t in topics_display]
colnames = ['']*N_WORDS_DISPLAY
html += pd.DataFrame(arr, index=rownames, columns=colnames).to_html()
with open(os.path.join(OUTPUT_HTML_PATH, 'topic_model_distinctive_avg_diff.txt'), 'w') as f:
    f.write(html)


[1] While there are significant differences between NMF and LDA, there are also similarities. Indeed, if the texts in a corpus have certain properties, NMF and LDA will arrive at the same representation of the corpus (see arora_practical_2013).