More updates

DARIAH-DE · Jan 4, 2014 · c0eaf94
1 parent cb9b0a0
commit c0eaf94

Showing 2 changed files with 321 additions and 15 deletions.
diff --git a/source/preliminaries.rst b/source/preliminaries.rst
@@ -8,13 +8,6 @@ These tutorials make use of a number of Python packages. This section describes
 how to install these packages. (If you do not already have Python 3 installed on
 your system you may wish to skip to the section :ref:`installing-python`.)
 
-.. note::
-
-    An IPython Notebook with all the required libraries is available at the
-    following address: http://ipython.machinereadings.org:5555/. The password is
-    "wuerzburg".
-
-
 Required Python packages
 ========================
 The tutorials collected here assume that Python version 3.3 or higher is
@@ -184,13 +177,10 @@ Python is handled in homebrew.
 
 Installing packages on Windows
 ------------------------------
+
 There are a number of distributions of Python for Windows that come pre-packaged
 with packages relevant to scientific computing such as NumPy and SciPy. They
-include, for example, `Anaconda Python <https://store.continuum.io/cshop/anaconda>`_.
-
-Installing packages from source under Windows requires some patience. As the
-packages used in this tutorial are widely used, instructions on installing
-specific packages are not difficult to find. For example, instructions on how to
-install ``scikit-learn`` (which includes installing NumPy and SciPy) in
-a Windows environment may be found at the ``scikit-learn`` website: `Installing
-scikit-learn <http://scikit-learn.org/stable/install.html>`_.
+include, for example, `Anaconda Python
+<https://store.continuum.io/cshop/anaconda>`_. Anaconda includes almost all the
+packages used here. Also available are `instructions on how to use Python 3 with
+Anaconda <http://continuum.io/blog/anaconda-python-3>`_.
diff --git a/source/topic_model_python.rst b/source/topic_model_python.rst
@@ -0,0 +1,316 @@
+.. _topic-model-python:
+
+==========================
+ Topic modeling in Python
+==========================
+
+.. ipython:: python
+    :suppress:
+
+    import numpy as np; np.set_printoptions(precision=2)
+
+This section illustrates how to do approximate topic modeling in Python. We will
+use a technique called `non-negative matrix factorization (NMF)
+<https://en.wikipedia.org/wiki/Non-negative_matrix_factorization>`_ that
+strongly resembles Latent Dirichlet Allocation (LDA), which we covered in the
+previous section, :ref:`topic-model-mallet`. [#fn_nmf]_ Whereas LDA is
+a probabilistic model capable of expressing uncertainty about the placement of
+topics across texts and the assignment of words to topics, NMF is
+a deterministic algorithm which arrives at a single representation of the
+corpus. For this reason, NMF is often characterized as a machine learning
+algorithm. Like LDA, NMF arrives at its representation of a corpus in terms of
+something resembling "latent topics".
+
+.. note:: The name "Non-negative matrix factorization" has the virtue of being
+   transparent. A "non-negative matrix" is a matrix containing non-negative
+   values (here zero or positive word frequencies). And
+   factorization refers to the familiar kind of mathematical factorization.
+   Just as a polynomial :math:`x^2 + 3x + 2` may be factored into a simple
+   product :math:`(x+2)(x+1)`, so too may a matrix
+   :math:`\bigl(\begin{smallmatrix} 6&2&4\\ 9&3&6 \end{smallmatrix} \bigr)` be
+   factored into the product of two smaller matrices
+   :math:`\bigl(\begin{smallmatrix} 2\\ 3 \end{smallmatrix} \bigr)
+   \bigl(\begin{smallmatrix} 3&1&2 \end{smallmatrix} \bigr)`.
+
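A rank-one factorization of the matrix in the note can be checked by hand. The sketch below multiplies a column vector by a row vector in plain Python and prints the result. The names ``w``, ``h`` and ``product`` are illustrative only and do not appear in the tutorial.

```python
# Outer product of a column vector w and a row vector h, using only
# built-in Python lists (no NumPy required).
w = [2, 3]       # column factor, shape 2 x 1
h = [3, 1, 2]    # row factor, shape 1 x 3

# Entry (i, j) of the product is w[i] * h[j].
product = [[wi * hj for hj in h] for wi in w]
print(product)   # [[6, 2, 4], [9, 3, 6]]
```

NMF searches for factors of this kind, with every entry constrained to be non-negative. For a real document-term matrix the factors usually reproduce the matrix only approximately.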
+This section follows the procedures described in :ref:`topic-model-mallet`,
+making the substitution of NMF for LDA where appropriate.
+
+This section uses the novels by Brontë and Austen. These novels are divided into
+parts as follows:
+
+.. ipython:: python
+
+    import os
+    CORPUS_PATH = os.path.join('data', 'austen-brontë-split')
+    filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])
+
+.. ipython:: python
+
+    # files are located in data/austen-brontë-split
+    len(filenames)
+    filenames[:5]
+
+Using Non-negative matrix factorization
+=======================================
+
+As always we need to give Python access to our corpus. In this case we will work
+with our familiar document-term matrix.
+
+.. ipython:: python
+
+    import numpy as np  # a conventional alias
+    import sklearn.feature_extraction.text as text
+
+    vectorizer = text.CountVectorizer(input='filename', stop_words='english', min_df=20)
+    dtm = vectorizer.fit_transform(filenames).toarray()
+    vocab = np.array(vectorizer.get_feature_names())
+
+    dtm.shape
+    len(vocab)
+
+By analogy with LDA, we will use NMF to get a document-topic matrix (topics here
+will also be referred to as "components") and a list of top words for each
+topic. We will make the analogy clear by using the same variable names:
+``doctopic`` and ``topic_words``.
+
+.. ipython:: python
+
+    from sklearn import decomposition
+
+    num_topics = 20
+    num_top_words = 20
+
+    clf = decomposition.NMF(n_components=num_topics, random_state=1)
+
+    # this next step may take some time
+
+.. ipython:: python
+    :suppress:
+
+    # suppress this
+
+    import os
+    import pickle
+
+    NMF_TOPICS = 'source/cache/nmf-austen-brontë-doc-topic.pkl'
+    NMF_CLF = 'source/cache/nmf-austen-brontë-clf.pkl'
+
+    # the ipython directive seems to have trouble with multi-line indented blocks
+    if not os.path.exists(NMF_CLF):
+        doctopic = clf.fit_transform(dtm)
+        pickle.dump(doctopic, open(NMF_TOPICS, 'wb'))
+        pickle.dump(clf, open(NMF_CLF, 'wb'))
+
+
+    clf = pickle.load(open(NMF_CLF, 'rb'))
+    doctopic = pickle.load(open(NMF_TOPICS, 'rb'))
+
+.. code-block:: python
+
+   doctopic = clf.fit_transform(dtm)
+
+.. ipython:: python
+
+    # print words associated with topics
+    topic_words = []
+    for topic in clf.components_:
+        word_idx = np.argsort(topic)[::-1][0:num_top_words]
+        topic_words.append([vocab[i] for i in word_idx])
+
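The ``np.argsort(topic)[::-1][0:num_top_words]`` idiom above sorts indices by weight, reverses them so the largest weight comes first, and keeps the leading entries. A minimal plain-Python sketch of the same idea follows. The vocabulary and weights are invented for illustration and are not taken from the corpus.

```python
# Hypothetical four-word vocabulary and the weights of one component.
vocab = ["house", "love", "letter", "moor"]
weights = [0.1, 2.5, 0.0, 1.2]
num_top_words = 2

# Indices ordered from the largest weight to the smallest,
# the plain-Python counterpart of np.argsort(weights)[::-1].
order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

top_words = [vocab[i] for i in order[:num_top_words]]
print(top_words)   # ['love', 'moor']
```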
+To make the analysis and visualization of NMF components similar to that of
+LDA's topic proportions, we will scale the document-component matrix such that
+the component values associated with each document sum to one.
+
+.. ipython:: python
+
+    doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
+
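To see what the scaling step does, here is a small sketch in plain Python with made-up component values (the tutorial itself uses NumPy with ``keepdims=True`` so that each row is divided by its own sum). The variable names ``raw`` and ``scaled`` are assumptions for this example.

```python
# Two hypothetical documents with three component values each.
raw = [
    [2.0, 1.0, 1.0],
    [0.5, 0.0, 1.5],
]

# Divide every value by the sum of its row so that each row sums to one.
# Note: a row consisting only of zeros would raise ZeroDivisionError here.
scaled = [[value / sum(row) for value in row] for row in raw]

for row in scaled:
    print(row, sum(row))
```

After scaling, the values in a row can be read as shares of the document, which is what makes them comparable to LDA topic proportions.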
+Now we will average those topic shares associated with the same novel together
+--- just as we did with the topic shares from MALLET.
+
+.. ipython:: python
+
+    novel_names = []
+    for fn in filenames:
+        basename = os.path.basename(fn)
+        # splitext splits the extension off, 'novel.txt' -> ('novel', '.txt')
+        name, ext = os.path.splitext(basename)
+        # remove trailing numbers identifying chunk
+        name = name.rstrip('0123456789')
+        novel_names.append(name)
+    # turn this into an array so we can use NumPy functions
+    novel_names = np.asarray(novel_names)
+
+    @suppress
+    assert len(set(novel_names)) == 6
+
+    # use method described in preprocessing section
+    num_groups = len(set(novel_names))
+    doctopic_grouped = np.zeros((num_groups, num_topics))
+    for i, name in enumerate(sorted(set(novel_names))):
+        doctopic_grouped[i, :] = np.mean(doctopic[novel_names == name, :], axis=0)
+
+    doctopic = doctopic_grouped
+
+    @suppress
+    docnames = sorted(set(novel_names))
+
+
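The grouping logic can be tried on a toy example. The sketch below uses three hypothetical chunk filenames and invented topic shares. It strips the trailing chunk number from each filename and then averages the rows that belong to the same novel, using plain Python in place of the NumPy boolean mask.

```python
import os

# Hypothetical chunk filenames and their (already scaled) topic shares.
filenames = [
    "data/Austen_Emma0001.txt",
    "data/Austen_Emma0002.txt",
    "data/CBronte_Jane0001.txt",
]
shares = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]

# Recover the novel name by dropping the extension and the trailing digits.
names = []
for fn in filenames:
    name, _ext = os.path.splitext(os.path.basename(fn))
    names.append(name.rstrip("0123456789"))

# Average, column by column, the rows that share a novel name.
grouped = {}
for name in sorted(set(names)):
    rows = [s for n, s in zip(names, shares) if n == name]
    grouped[name] = [sum(col) / len(rows) for col in zip(*rows)]

print(grouped)
```

One caveat: ``rstrip('0123456789')`` removes every trailing digit, so a title that itself ends in a digit would be trimmed along with the chunk number.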
+.. ipython:: python
+    :suppress:
+
+    import pandas as pd
+    OUTPUT_HTML_PATH = os.path.join('source', 'generated')
+    rownames = sorted(set(novel_names))
+    colnames = ["NMF Topic " + str(i + 1) for i in range(doctopic.shape[1])]
+    html = pd.DataFrame(np.round(doctopic, 2), index=rownames, columns=colnames).to_html()
+    with open(os.path.join(OUTPUT_HTML_PATH, 'NMF_doctopic.txt'), 'w') as f:
+        f.write(html)
+
+.. raw:: html
+    :file: generated/NMF_doctopic.txt
+
+Inspecting the NMF fit
+======================
+
+The topics (or components) of the NMF fit preserve the distances between novels (see the figures below).
+
+.. ipython:: python
+    :suppress:
+
+    # COSINE SIMILARITY
+    import os  # for os.path.basename
+    import matplotlib.pyplot as plt
+    from sklearn.manifold import MDS
+    from sklearn.metrics.pairwise import cosine_similarity
+
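+    # average chunk-level word counts by novel so there is one row per novel
+    dtm_grouped = np.array([dtm[novel_names == name].mean(axis=0) for name in sorted(set(novel_names))])
+    dist = 1 - cosine_similarity(dtm_grouped)
+    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
+    pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
+
+.. ipython:: python
+    :suppress:
+
+    assert dtm_grouped.shape[0] == doctopic.shape[0]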
+    # NOTE: the IPython directive seems less prone to errors when these blocks
+    # are split up.
+    xs, ys = pos[:, 0], pos[:, 1]
+    names = sorted(set(novel_names))
+    for x, y, name in zip(xs, ys, names):
+        color = 'orange' if "Austen" in name else 'skyblue'
+        plt.scatter(x, y, c=color)
+        plt.text(x, y, name)
+
+    plt.title("Distances calculated using word frequencies")
+    @savefig plot_nmf_section_austen_brontë_cosine_mds.png width=7in
+    plt.show()
+
+.. ipython:: python
+    :suppress:
+
+    # NMF
+    import os  # for os.path.basename
+    import matplotlib.pyplot as plt
+    from sklearn.manifold import MDS
+    from sklearn.metrics.pairwise import euclidean_distances
+
+    dist = euclidean_distances(doctopic)
+    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
+    pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
+
+.. ipython:: python
+    :suppress:
+
+    # NOTE: the IPython directive seems less prone to errors when these blocks are split up
+    xs, ys = pos[:, 0], pos[:, 1]
+    names = sorted(set(novel_names))
+    for x, y, name in zip(xs, ys, names):
+        color = 'orange' if "Austen" in name else 'skyblue'
+        plt.scatter(x, y, c=color)
+        plt.text(x, y, name)
+
+    plt.title("Distances calculated using NMF components")
+    @savefig plot_NMF_euclidean_mds.png width=7in
+    plt.show()
+
+Even though the NMF fit "discards" the fine-grained detail recorded in the
+matrix of word frequencies, the matrix factorization performed allows us to
+reconstruct the salient details of the underlying matrix.
+
+As we did in the previous section, let us identify the most significant topics
+for each text in the corpus.  This procedure does not differ in essence from the
+procedure for identifying the most frequent words in each text.
+
+.. ipython:: python
+
+    novels = sorted(set(novel_names))
+    print("Top NMF topics in...")
+    for i in range(len(doctopic)):
+        top_topics = np.argsort(doctopic[i,:])[::-1][0:3]
+        top_topics_str = ' '.join(str(t) for t in top_topics)
+        print("{}: {}".format(novels[i], top_topics_str))
+
+And we already have lists of words (``topic_words``) most strongly associated
+with the components. For reference, we will display them again:
+
+.. ipython:: python
+
+    # show the top 15 words
+    for t in range(len(topic_words)):
+        print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))
+
+
+There are many ways to inspect and to visualize topic models. Some of the most
+common methods are covered in :ref:`topic-model-visualization`.
+
+Distinctive topics
+------------------
+
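+Consider the task of finding the topics that are distinctive of Austen using the
+NMF "topics". Using a simple difference-in-averages we can find the topics that
+most sharply separate Austen's novels from Brontë's. (Because the code below
+ranks topics by the absolute difference, a high-ranking topic may be
+characteristic of either author.)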
+
+.. ipython:: python
+
+    austen_indices, cbronte_indices = [], []
+    for index, fn in enumerate(sorted(set(novel_names))):
+        if "Austen" in fn:
+            austen_indices.append(index)
+        elif "CBronte" in fn:
+            cbronte_indices.append(index)
+
+    austen_avg = np.mean(doctopic[austen_indices, :], axis=0)
+    cbronte_avg = np.mean(doctopic[cbronte_indices, :], axis=0)
+    keyness = np.abs(austen_avg - cbronte_avg)
+    ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences
+
+    # distinctive topics:
+    ranking[:10]
+
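The difference-in-averages ranking can be illustrated with a few invented numbers. In the plain-Python sketch below, the average shares for three topics are hypothetical. The signed difference is kept alongside the absolute one to show which author a topic leans toward.

```python
# Hypothetical average topic shares for each author (three topics).
austen_avg = [0.30, 0.05, 0.20]
cbronte_avg = [0.10, 0.06, 0.45]

# Signed difference: positive means the topic is more prominent in Austen.
signed = [a - b for a, b in zip(austen_avg, cbronte_avg)]

# "Keyness" as used above: the size of the difference, ignoring direction.
keyness = [abs(d) for d in signed]

# Topic indices ordered from most to least distinctive.
ranking = sorted(range(len(keyness)), key=lambda t: keyness[t], reverse=True)

for t in ranking:
    leaning = "Austen" if signed[t] > 0 else "Bronte"
    print("Topic {}: keyness {:.2f}, leans {}".format(t, keyness[t], leaning))
```

With these numbers the topic at index 2 ranks first even though it is more prominent in the Brontë average, which is why the sign of the difference is worth inspecting alongside the ranking.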
+.. ipython:: python
+    :suppress:
+
+    N_WORDS_DISPLAY = 10
+    N_TOPICS_DISPLAY = 10
+    topics_display = sorted(ranking[0:N_TOPICS_DISPLAY])
+    arr = doctopic[:, topics_display]
+    colnames = ["Topic {}".format(t) for t in topics_display]
+    rownames = sorted(set(novel_names))
+    html = pd.DataFrame(np.round(arr,2), index=rownames, columns=colnames).to_html()
+    arr = np.row_stack([topic_words[t][:N_WORDS_DISPLAY] for t in topics_display])
+    rownames = ["Topic {}".format(t) for t in topics_display]
+    colnames = ['']*N_WORDS_DISPLAY
+    html += pd.DataFrame(arr, index=rownames, columns=colnames).to_html()
+    with open(os.path.join(OUTPUT_HTML_PATH, 'topic_model_distinctive_avg_diff.txt'), 'w') as f:
+        f.write(html)
+
+.. raw:: html
+    :file: generated/topic_model_distinctive_avg_diff.txt
+
+.. FOOTNOTES
+
+.. [#fn_nmf] While there are significant differences between NMF and LDA, there
+   are also similarities. Indeed, if the texts in a corpus have certain
+   properties, NMF and LDA will arrive at the same representation of a corpus
+   :cite:`arora_practical_2013`.
+