diff --git a/source/preliminaries.rst b/source/preliminaries.rst
index 56eeb5b..f237d1d 100644
--- a/source/preliminaries.rst
+++ b/source/preliminaries.rst
@@ -8,13 +8,6 @@
 These tutorials make use of a number of Python packages. This section describes
 how to install these packages. (If you do not already have Python 3 installed
 on your system you may wish to skip to the section :ref:`installing-python`.)
 
-.. note::
-
-    An IPython Notebook with all the required libraries is available at the
-    following address: http://ipython.machinereadings.org:5555/. The password is
-    "wuerzburg".
-
-
 Required Python packages
 ========================
 The tutorials collected here assume that Python version 3.3 or higher is
@@ -184,13 +177,10 @@
 Python is handled in homebrew.
 
 Installing packages on Windows
 ------------------------------
+
 There are a number of distributions of Python for Windows that come pre-packaged
 with packages relevant to scientific computing such as NumPy and SciPy. They
-include, for example, `Anaconda Python `_.
-
-Installing packages from source under Windows requires some patience. As the
-packages used in this tutorial are widely used, instructions on installing
-specific packages are not difficult to find. For example, instructions on how to
-install ``scikit-learn`` (which includes installing NumPy and SciPy) in
-a Windows environment may be found at the ``scikit-learn`` website: `Installing
-scikit-learn `_.
+include, for example, `Anaconda Python
+`_. Anaconda includes almost all the
+packages used here. Also available are `instructions on how to use Python 3 with
+Anaconda `_.
diff --git a/source/topic_model_python.rst b/source/topic_model_python.rst
new file mode 100644
index 0000000..3ee78ee
--- /dev/null
+++ b/source/topic_model_python.rst
@@ -0,0 +1,316 @@
+.. _topic-model-python:
+
+==========================
+ Topic modeling in Python
+==========================
+
+.. ipython:: python
+    :suppress:
+
+    import numpy as np; np.set_printoptions(precision=2)
+
+This section illustrates how to do approximate topic modeling in Python. We will
+use a technique called `non-negative matrix factorization (NMF)
+`_ that
+strongly resembles Latent Dirichlet Allocation (LDA), which we covered in the
+previous section, :ref:`topic-model-mallet`. [#fn_nmf]_ Whereas LDA is
+a probabilistic model capable of expressing uncertainty about the placement of
+topics across texts and the assignment of words to topics, NMF is
+a deterministic algorithm which arrives at a single representation of the
+corpus. For this reason, NMF is often characterized as a machine learning
+technique rather than a probabilistic model. Like LDA, NMF arrives at its
+representation of a corpus in terms of something resembling "latent topics".
+
+.. note:: The name "non-negative matrix factorization" has the virtue of being
+    transparent. A "non-negative matrix" is a matrix containing non-negative
+    values (here zero or positive word frequencies), and "factorization" refers
+    to the familiar kind of mathematical factorization. Just as the polynomial
+    :math:`x^2 + 3x + 2` may be factored into a simple product
+    :math:`(x+2)(x+1)`, so too may a matrix
+    :math:`\bigl(\begin{smallmatrix} 6&2&4\\ 9&3&6 \end{smallmatrix} \bigr)` be
+    factored into the product of two smaller matrices
+    :math:`\bigl(\begin{smallmatrix} 2\\ 3 \end{smallmatrix} \bigr)
+    \bigl(\begin{smallmatrix} 3&1&2 \end{smallmatrix} \bigr)`.
+
+This section follows the procedures described in :ref:`topic-model-mallet`,
+substituting NMF for LDA where appropriate.
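+
+As a quick check that the toy factorization in the note above is exact, we can
+multiply the two factors back together with NumPy. This is only a sketch to
+make the note concrete; it is not part of the tutorial's pipeline:
+
+.. code-block:: python
+
+    import numpy as np
+
+    A = np.array([[6, 2, 4],
+                  [9, 3, 6]])
+    # the two factors from the note above
+    left = np.array([[2], [3]])     # shape (2, 1)
+    right = np.array([[3, 1, 2]])   # shape (1, 3)
+    # their product reproduces A exactly
+    np.allclose(left @ right, A)    # True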
+
+This section uses the novels by Brontë and Austen. These novels are divided
+into parts as follows:
+
+.. ipython:: python
+
+    import os
+    CORPUS_PATH = os.path.join('data', 'austen-brontë-split')
+    filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])
+
+.. ipython:: python
+
+    # files are located in data/austen-brontë-split
+    len(filenames)
+    filenames[:5]
+
+Using Non-negative matrix factorization
+=======================================
+
+As always, we need to give Python access to our corpus. In this case we will
+work with our familiar document-term matrix.
+
+.. ipython:: python
+
+    import numpy as np  # a conventional alias
+    import sklearn.feature_extraction.text as text
+
+    vectorizer = text.CountVectorizer(input='filename', stop_words='english', min_df=20)
+    dtm = vectorizer.fit_transform(filenames).toarray()
+    vocab = np.array(vectorizer.get_feature_names())
+
+    dtm.shape
+    len(vocab)
+
+By analogy with LDA, we will use NMF to get a document-topic matrix (topics here
+will also be referred to as "components") and a list of top words for each
+topic. We will make the analogy clear by using the same variable names:
+``doctopic`` and ``topic_words``.
+
+.. ipython:: python
+
+    from sklearn import decomposition
+
+    num_topics = 20
+    num_top_words = 20
+
+    clf = decomposition.NMF(n_components=num_topics, random_state=1)
+
+    # this next step may take some time
+
+.. ipython:: python
+    :suppress:
+
+    # suppress this
+
+    import os
+    import pickle
+
+    NMF_TOPICS = 'source/cache/nmf-austen-brontë-doc-topic.pkl'
+    NMF_CLF = 'source/cache/nmf-austen-brontë-clf.pkl'
+
+    # the ipython directive seems to have trouble with multi-line indented blocks
+    if not os.path.exists(NMF_CLF):
+        doctopic = clf.fit_transform(dtm)
+        pickle.dump(doctopic, open(NMF_TOPICS, 'wb'))
+        pickle.dump(clf, open(NMF_CLF, 'wb'))
+
+    clf = pickle.load(open(NMF_CLF, 'rb'))
+    doctopic = pickle.load(open(NMF_TOPICS, 'rb'))
+
+.. code-block:: python
+
+    doctopic = clf.fit_transform(dtm)
+
+.. ipython:: python
+
+    # collect the top words associated with each topic
+    topic_words = []
+    for topic in clf.components_:
+        word_idx = np.argsort(topic)[::-1][0:num_top_words]
+        topic_words.append([vocab[i] for i in word_idx])
+
+To make the analysis and visualization of NMF components similar to that of
+LDA's topic proportions, we will scale the document-component matrix such that
+the component values associated with each document sum to one.
+
+.. ipython:: python
+
+    doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
+
+Now we will average together the topic shares associated with the same novel
+--- just as we did with the topic shares from MALLET.
+
+.. ipython:: python
+
+    novel_names = []
+    for fn in filenames:
+        basename = os.path.basename(fn)
+        # splitext splits the extension off, 'novel.txt' -> ('novel', '.txt')
+        name, ext = os.path.splitext(basename)
+        # remove the trailing numbers identifying the chunk
+        name = name.rstrip('0123456789')
+        novel_names.append(name)
+    # turn this into an array so we can use NumPy functions
+    novel_names = np.asarray(novel_names)
+
+    @suppress
+    assert len(set(novel_names)) == 6
+
+    # use the method described in the preprocessing section
+    num_groups = len(set(novel_names))
+    doctopic_grouped = np.zeros((num_groups, num_topics))
+    for i, name in enumerate(sorted(set(novel_names))):
+        doctopic_grouped[i, :] = np.mean(doctopic[novel_names == name, :], axis=0)
+
+    doctopic = doctopic_grouped
+
+    @suppress
+    docnames = sorted(set(novel_names))
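+
+If pandas is installed, the grouping loop above can be written more compactly.
+The following is an optional sketch, not part of the tutorial's pipeline; it
+assumes the ungrouped ``doctopic`` (one row per chunk) as input:
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # average together the rows that share a novel name; groupby sorts its
+    # group labels, so the row order matches sorted(set(novel_names))
+    df = pd.DataFrame(doctopic, index=novel_names)
+    doctopic_grouped = df.groupby(level=0).mean().values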
+
+The novel-level topic shares now look like this:
+
+.. ipython:: python
+    :suppress:
+
+    import pandas as pd
+    OUTPUT_HTML_PATH = os.path.join('source', 'generated')
+    rownames = sorted(set(novel_names))
+    colnames = ["NMF Topic " + str(i) for i in range(doctopic.shape[1])]
+    html = pd.DataFrame(np.round(doctopic, 2), index=rownames, columns=colnames).to_html()
+    with open(os.path.join(OUTPUT_HTML_PATH, 'NMF_doctopic.txt'), 'w') as f:
+        f.write(html)
+
+.. raw:: html
+    :file: generated/NMF_doctopic.txt
+
+Inspecting the NMF fit
+======================
+
+The topics (or components) of the NMF fit preserve the distances between novels
+(see the figures below).
+
+.. ipython:: python
+    :suppress:
+
+    # COSINE SIMILARITY
+    import os  # for os.path.basename
+    import matplotlib.pyplot as plt
+    from sklearn.manifold import MDS
+    from sklearn.metrics.pairwise import cosine_similarity
+
+    # group the word frequencies by novel so that each novel is a single
+    # point, matching the NMF plot below
+    names = sorted(set(novel_names))
+    dtm_grouped = np.zeros((len(names), dtm.shape[1]))
+    for i, name in enumerate(names):
+        dtm_grouped[i, :] = np.sum(dtm[novel_names == name, :], axis=0)
+
+    dist = 1 - cosine_similarity(dtm_grouped)
+    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
+    pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
+
+.. ipython:: python
+    :suppress:
+
+    assert pos.shape[0] == doctopic.shape[0]
+    # NOTE: the IPython directive seems less prone to errors when these blocks
+    # are split up.
+    xs, ys = pos[:, 0], pos[:, 1]
+    names = sorted(set(novel_names))
+    for x, y, name in zip(xs, ys, names):
+        color = 'orange' if "Austen" in name else 'skyblue'
+        plt.scatter(x, y, c=color)
+        plt.text(x, y, name)
+
+    plt.title("Distances calculated using word frequencies")
+    @savefig plot_nmf_section_austen_brontë_cosine_mds.png width=7in
+    plt.show()
+
+.. ipython:: python
+    :suppress:
+
+    # NMF
+    import os  # for os.path.basename
+    import matplotlib.pyplot as plt
+    from sklearn.manifold import MDS
+    from sklearn.metrics.pairwise import euclidean_distances
+
+    dist = euclidean_distances(doctopic)
+    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
+    pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
+
+.. ipython:: python
+    :suppress:
+
+    # NOTE: the IPython directive seems less prone to errors when these blocks
+    # are split up
+    xs, ys = pos[:, 0], pos[:, 1]
+    names = sorted(set(novel_names))
+    for x, y, name in zip(xs, ys, names):
+        color = 'orange' if "Austen" in name else 'skyblue'
+        plt.scatter(x, y, c=color)
+        plt.text(x, y, name)
+
+    plt.title("Distances calculated using NMF components")
+    @savefig plot_NMF_euclidean_mds.png width=7in
+    plt.show()
+
+Even though the NMF fit "discards" the fine-grained detail recorded in the
+matrix of word frequencies, the factorization allows us to reconstruct the
+salient details of the underlying matrix.
+
+As we did in the previous section, let us identify the most significant topics
+for each text in the corpus. This procedure does not differ in essence from the
+procedure for identifying the most frequent words in each text.
+
+.. ipython:: python
+
+    novels = sorted(set(novel_names))
+    print("Top NMF topics in...")
+    for i in range(len(doctopic)):
+        top_topics = np.argsort(doctopic[i, :])[::-1][0:3]
+        top_topics_str = ' '.join(str(t) for t in top_topics)
+        print("{}: {}".format(novels[i], top_topics_str))
+
+And we already have lists of the words (``topic_words``) most strongly
+associated with the components. For reference, we will display them again:
+
+.. ipython:: python
+
+    # show the top 15 words
+    for t in range(len(topic_words)):
+        print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))
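+
+Earlier we claimed that the factorization preserves the salient details of the
+underlying matrix. One rough way to check this is to multiply the two factors
+back together and measure how far the product is from the original
+document-term matrix. This is an optional sketch; note that the *unnormalized*
+document-topic weights are needed, so we recompute them with ``transform``:
+
+.. code-block:: python
+
+    # reconstruct the document-term matrix from the two NMF factors
+    doctopic_raw = clf.transform(dtm)            # shape (n_docs, n_topics)
+    dtm_approx = np.dot(doctopic_raw, clf.components_)
+
+    # relative reconstruction error (0 would be a perfect reconstruction)
+    error = np.linalg.norm(dtm - dtm_approx) / np.linalg.norm(dtm)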
+
+There are many ways to inspect and to visualize topic models. Some of the most
+common methods are covered in :ref:`topic-model-visualization`.
+
+Distinctive topics
+------------------
+
+Consider the task of finding the topics that distinguish Austen from Brontë
+using the NMF "topics". Using a simple difference in averages, we can find the
+topics that tend to be associated with one novelist's texts rather than the
+other's.
+
+.. ipython:: python
+
+    austen_indices, cbronte_indices = [], []
+    for index, fn in enumerate(sorted(set(novel_names))):
+        if "Austen" in fn:
+            austen_indices.append(index)
+        elif "CBronte" in fn:
+            cbronte_indices.append(index)
+
+    austen_avg = np.mean(doctopic[austen_indices, :], axis=0)
+    cbronte_avg = np.mean(doctopic[cbronte_indices, :], axis=0)
+    keyness = np.abs(austen_avg - cbronte_avg)
+    ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences
+
+    # the most distinctive topics:
+    ranking[:10]
+
+The table below shows the shares of these distinctive topics in each novel,
+followed by the top words associated with each of them:
+
+.. ipython:: python
+    :suppress:
+
+    N_WORDS_DISPLAY = 10
+    N_TOPICS_DISPLAY = 10
+    topics_display = sorted(ranking[0:N_TOPICS_DISPLAY])
+    arr = doctopic[:, topics_display]
+    colnames = ["Topic {}".format(t) for t in topics_display]
+    rownames = sorted(set(novel_names))
+    html = pd.DataFrame(np.round(arr, 2), index=rownames, columns=colnames).to_html()
+    arr = np.row_stack([topic_words[t][:N_WORDS_DISPLAY] for t in topics_display])
+    rownames = ["Topic {}".format(t) for t in topics_display]
+    colnames = [''] * N_WORDS_DISPLAY
+    html += pd.DataFrame(arr, index=rownames, columns=colnames).to_html()
+    with open(os.path.join(OUTPUT_HTML_PATH, 'topic_model_distinctive_avg_diff.txt'), 'w') as f:
+        f.write(html)
+
+.. raw:: html
+    :file: generated/topic_model_distinctive_avg_diff.txt
+
+.. FOOTNOTES
+
+.. [#fn_nmf] While there are significant differences between NMF and LDA, there
+   are also similarities. Indeed, if the texts in a corpus have certain
+   properties, NMF and LDA will arrive at the same representation of the corpus
+   :cite:`arora_practical_2013`.