diff --git a/source/case_study_racine.rst b/source/case_study_racine.rst new file mode 100644 index 0000000..5d412d4 --- /dev/null +++ b/source/case_study_racine.rst @@ -0,0 +1,470 @@ +.. _case-study-racine: + +=============================================== + Case study: Racine's early and late tragedies +=============================================== + +This tutorial describes two stages of an analysis of a collection of French +tragedies: (1) exploratory data analysis and (2) a closer quantitative reading +of an aspect of interest in the corpus: `Jean Racine's +`_ early and late plays. + +Corpus: sixty tragedies +======================= + +The corpus being explored is a collection of 60 French tragedies gathered from +Paul Fièvre's `„Théâtre classique” collection of French seventeenth and +eighteenth-century plays `_ (available in +:ref:`datasets`). + +Following the maxim "know your data", we first want to gain a rough sense of the +length (in words) of the texts in a corpus. Are we dealing with 100,000 word +tomes or 1,000 word newspaper articles? Even though we know we are dealing with +French tragedies inspecting text lengths is always worthwhile. For example, +checking the length of the texts in a corpus is an easy way to spot problems +such as incomplete texts. In the present case, we might also consider checking +the number of lines (`alexandrines +`_) as this is the standard measure +for compositions in verse. Regardless of what unit is used, the easiest way +to visualize the distribution of lengths in a corpus is a histogram. In +order to create a histogram we need first to calculate the length of all the +texts in our corpus. If we have a document-term matrix available, the length in +words may be calculated by summing the rows of the document-term matrix, as each +row corresponds to one document. + +.. ipython:: python + + import os + + import numpy as np + import sklearn.feature_extraction.text as text + + data_dir = 'data/french-tragedy/' + + filenames = np.array(sorted(os.listdir(data_dir))) + filenames_with_path = [os.path.join(data_dir, fn) for fn in filenames] + + # check the first few filenames + filenames_with_path[0:4] + + vectorizer = text.CountVectorizer(input='filename') + dtm = vectorizer.fit_transform(filenames_with_path) + dtm = dtm.toarray() + vocab = np.array(vectorizer.get_feature_names()) + + # sum over rows to calculate lengths + lengths = np.sum(dtm, axis=1) + +Now that we have the texts' lengths stored in the array ``lengths`` we can +create a histogram with the matplotlib function ``hist``. + +.. ipython:: python + + import matplotlib.pyplot as plt + + plt.hist(lengths) + @savefig case_study_voltaire_hist.png width=7in + plt.title("Play length in words") + + # find the longest and the shortest texts + (np.max(lengths), filenames[np.argmax(lengths)]) + (np.min(lengths), filenames[np.argmin(lengths)]) + +Turning to the contents of the corpus, one way to explore a collection is to +plot the distances between texts. We have already seen how to do this in +:ref:`working-with-text`. As we have not normalized our texts by play +length---and the histogram of lengths shows us that there is considerable +variation---cosine distance is an appropriate choice for a measure of distance. + +.. ipython:: python + + from sklearn.manifold import MDS + from sklearn.metrics.pairwise import cosine_similarity + + # we need distance, not similarity + dist = 1 - cosine_similarity(dtm) + + mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1) + pos = mds.fit_transform(dist) # shape (n_components, n_samples) + + # create very short names for plotting + # filenames have form: Voltaire_TR-V-1724-Mariamne.txt + names = [] + authors = [] + for fn in filenames: + author = fn.split('_')[0] + year = fn.split('-')[2] + authors.append(author) + names.append(author + year) + +.. ipython:: python + + plt.figure(figsize=(11.3, 7)) # use a bigger canvas than usual + xs, ys = pos[:, 0], pos[:, 1] + + authors_unique = sorted(set(authors)) + colors = [authors_unique.index(a) for a in authors] + plt.scatter(xs, ys, c=colors, cmap='spring') + for x, y, name in zip(xs, ys, names): + plt.text(x, y, name, alpha=0.5, fontsize=10) + + @savefig plot_voltaire_mds.png width=11.3in height=7in + plt.tight_layout() + +This kind of plot can get overwhelming. A dendrogram plot offers an alternative +means of representing the same information (i.e., the distance between texts). +It is important, however, not to take the implied hierarchy too seriously. +While two texts paired together in the dendrogram are indeed nearest neighbors +in terms of distance, there are a variety of methods of hierarchical clustering +each often yielding different hierarchies. Right now we are interested in the +dendrogram as a convenient summary of the multi-dimensional scaling plot shown +above. + +.. ipython:: python + + from scipy.cluster.hierarchy import ward, dendrogram + + linkage_matrix = ward(dist) + + plt.figure(figsize=(11.3, 11.3)) # we need a tall figure + # match dendrogram to that returned by R's hclust() + dendrogram(linkage_matrix, orientation="right", labels=names, leaf_font_size=5); + @savefig plot_voltaire_ward_dendrogram.png width=11.3in + plt.tight_layout() # fixes margins + +It should come as no surprise that texts by the same author tend to be adjacent +in the dendrogram. It is well documented that authors frequently leave stylistic +"signatures" that are detectable at the level of word frequency. [#fn_authorship_attribution]_ +There are, however, a number of plays that do not follow the rule and are paired +with texts by other writers. A number of these plays are attributed to Racine +and it is to these plays we will turn our attention. + +Racine's atypical plays +======================= + +Racine's atypical plays are easiest to detect on the dendrogram. They include: + +- `La Thébaïde `_ (1664) +- `Phèdre `_ (1677) +- `Esther `_ (1689) +- `Athalie `_ (1691) + +Considering these outliers in the context of the chronology of Racine's works as +a whole is helpful. These plays include the first and the final three plays +written by Racine. To display this chronology visually we may use a raster +graph, coloring the outliers a distinct color. (This visualization has the added +benefit of showing the nearly ten year gap between plays in the 1680s.) + +.. ipython:: python + + @suppress + plt.figure(figsize=(11.3, 5)) # reset figure size + + outliers = [1664, 1677, 1689, 1691] + racine_years = [] + for fn in filenames: + author = fn.split('_')[0] + year = int(fn.split('-')[2]) + if author == "Racine": + racine_years.append(year) + racine_years = np.array(racine_years) + colors = [] + for year in racine_years: + colors.append('orange' if year in outliers else 'cyan') + plt.vlines(racine_years, 0, 1, linewidth=3, color=colors) + + plt.title("Year of publication of Racine's plays") + # gca() stands for get current axes. Axes are a matplotlib primitive. + # See http://matplotlib.org/users/pyplot_tutorial.html#working-with-multiple-figures-and-axes + ax = plt.gca() + @savefig plot_racine_voltaire_rastergram.png width=7in + ax.yaxis.set_visible(False) + +A provisional explanation for why the late plays stand out might draw on +Racine's religious turn. In 1679 he married Catherine de +Romanet and his `Jansenism `_ grew more +pronounced. The title *Esther* refers to the biblical book of the same name and +*Athalie*, Racine's final play, stages events from the Bible. + +Features +-------- + +A useful way to explore the atypical plays further is to directly compare the +outliers with a fictitious "average" Racine tragedy. To do this we will first +decompose our documents into an ersatz "topic model" using non-negative matrix +factorization and then we will average the topic shares of the "normal" Racine +plays and compare those shares with the shares of the atypical plays. + +We will fit the NMF model using the corpus of tragedies split into approximately +1,000-word sections. Recall that before feeding your document-term matrix into +NMF it is helpful to normalize each document by length. Here we will +normalize and, additionally, use tf-idf weighting as the invocation is simple: +we use ``TfidfVectorizer`` instead of ``CountVectorizer``. + +.. ipython:: python + + data_dir = 'data/french-tragedy-split/' + + filenames = np.array(sorted(os.listdir(data_dir))) + filenames_with_path = [os.path.join(data_dir, fn) for fn in filenames] + + # check the first few filenames + filenames_with_path[0:4] + + vectorizer = text.TfidfVectorizer(input='filename', min_df=15) + dtm = vectorizer.fit_transform(filenames_with_path) + dtm = dtm.toarray() + vocab = np.array(vectorizer.get_feature_names()) + +.. ipython:: python + + # fit NMF model + + from sklearn import decomposition + + num_topics = 15 + + clf = decomposition.NMF(n_components=num_topics, random_state=1) + + # this next step may take some time + +.. ipython:: python + :suppress: + + # suppress this + + import os + import pickle + + NMF_TOPICS = 'source/cache/nmf-racine-doc-topic-{}topics.pkl'.format(num_topics) + NMF_CLF = 'source/cache/nmf-racine-clf-{}topics.pkl'.format(num_topics) + + # the ipython directive seems to have trouble with multi-line indented blocks + if not os.path.exists(NMF_CLF): + doctopic_chunks = clf.fit_transform(dtm) + pickle.dump(doctopic_chunks, open(NMF_TOPICS, 'wb')) + pickle.dump(clf, open(NMF_CLF, 'wb')) + + clf = pickle.load(open(NMF_CLF, 'rb')) + doctopic_chunks = pickle.load(open(NMF_TOPICS, 'rb')) + +.. code-block:: python + + doctopic_chunks = clf.fit_transform(dtm) + +In order to interpret and visualize the NMF components in a manner analogous to +LDA topic proportions, we will scale the document-component matrix such that +the component values associated with each document sum to one. + +.. ipython:: python + + doctopic_chunks = doctopic_chunks / np.sum(doctopic_chunks, axis=1, keepdims=True) + +As we did in :ref:`previous sections `, we will aggregate +the text sections associated with a single play together and average their topic +proportions. + +.. ipython:: python + + import itertools + import re + import operator + + # Play sections have filenames like: Racine_TR-V-1677-Phedre0000.txt. We can split + # the last part "0000.txt" off using string slicing since we know that the part of + # the filename we do not want is always 8 characters in width. For example, + 'Racine_TR-V-1677-Phedre0000.txt'[:-8] + # alternatively, we could use a regular expression: + import re + re.sub(r'[0-9]+\.txt$','', 'Racine_TR-V-1677-Phedre0000.txt') + + play_names_chunks = [] + for fn in filenames: + play_names_chunks.append(fn[:-8]) + + num_plays = len(set(play_names_chunks)) + doctopic = np.zeros((num_plays, num_topics)) + play_row_pairs = zip(play_names_chunks, doctopic_chunks) + play_names = [] + for i, (name, pairs) in enumerate(itertools.groupby(play_row_pairs, key=operator.itemgetter(0))): + rows = [row for _, row in pairs] + doctopic[i, :] = sum(rows) / len(rows) + play_names.append(name) + +While we have used all the other plays to fit the NMF model---in deriving the +topic components and the word-topic associations---we care principally about +Racine's atypical plays and the synthetic "average" play that will serve as a proxy for +a "normal" Racine play. We will construct the average play by averaging the +shares of the typical plays (i.e., all those that are not atypical): + +.. ipython:: python + + racine_plays = [name for name in play_names if name.startswith('Racine')] + racine_atypical = ['Racine_TR-V-1664-Thebaide', 'Racine_TR-V-1677-Phedre', 'Racine_TR-V-1689-Esther', 'Racine_TR-V-1691-Athalie'] + racine_typical = [name for name in racine_plays if name not in racine_atypical] + # alternatively, an opportunity to use set difference + # racine_typical = list(set(racine_plays) - set(racine_atypical)) + + # examine the list of typical plays, making sure we have the right ones + racine_typical + + doctopic_racine_typical = np.mean(doctopic[np.in1d(play_names, racine_typical)], axis=0) + doctopic_racine_atypical = doctopic[np.in1d(play_names, racine_atypical)] + + # stack the typical and the atypical plays by row + doctopic_of_interest = np.row_stack([doctopic_racine_typical, doctopic_racine_atypical]) + + # as a last and final step we need to keep track of the names + # note that some of the manipulation of names and rows is fragile and relies on the names + # being sorted alphabetically. If this were a concern we might use a pandas DataFrame + # instead, as row and column names can be explicitly assigned + play_names_of_interest = ['Racine-1666-1674-AVERAGE'] + racine_atypical + +Now that we have our matrix of document-topic proportions for the atypical plays +and the composite Racine play, we can visualize the topic shares using +a heatmap, a procedure which should be familiar from +:ref:`topic-model-visualization`. + +.. ipython:: python + + @suppress + plt.figure(figsize=(11.3, 7)) # reset figure size + + plt.pcolor(doctopic_of_interest, norm=None, cmap='Blues') + + topic_labels = ['Topic #{}'.format(k) for k in range(num_topics)] + plt.xticks(np.arange(doctopic_of_interest.shape[1]) + 0.5, topic_labels); + plt.yticks(np.arange(doctopic_of_interest.shape[0]) + 0.5, play_names_of_interest); + + # flip the y-axis so the texts are in the order we anticipate + plt.gca().invert_yaxis() + + # rotate the ticks on the x-axis + plt.xticks(rotation=90) + + # add a legend + plt.colorbar(cmap='Blues') + + @savefig plot_racine_doctopic_heatmap.png width=10in + plt.tight_layout() # fixes margins + +Looking at this heatmap, a number of topics stand out as ones which we might +wish to examine. In this case there is no harm in visually identifying the +topics that vary the most (using our eyes). However, were we +confronted with a greater number of topics (say, 100 or 200 topics), such +a procedure would be tedious and error prone. We may as well come up with +a systematic way of identifying topics that vary substantially across texts of +interest. One way of doing this would be to calculate the standard deviation of +the document-topic shares across the topics. (Calculating the `entropy +`_ for topic-document associations would +also be a useful measure.) + +.. ipython:: python + + # examine topics of interest by ranking them by standard deviation + # reminder: NumPy's standard deviation differs from R's standard deviation. If you + # want them to return identical results include the argument ``ddof=1``. + # Essentially, NumPy's standard deviation divides the variance by ``n`` whereas R + # uses ``n-1`` (which is preferable as it gives an unbiased estimate of the variance). + # Using ``ddof=1`` makes NumPy use ``n-1``. + topics_by_std = np.argsort(np.std(doctopic_of_interest, axis=0, ddof=1))[::-1] + topics_by_std[0:10] + + # First we gather the words most associated with each topic + num_top_words = 25 + topic_words = [] + @suppress + assert len(clf.components_[0]) == len(vocab) + for topic in clf.components_: + word_idx = np.argsort(topic)[::-1][0:num_top_words] + topic_words.append([vocab[i] for i in word_idx]) + + # Now we examine the topic-word distributions for the topics that vary the most + for t in topics_by_std[0:5]: + topic_words_str = ' '.join(str(t) for t in topic_words[t]) + print("Topic {}: {}".format(t, topic_words_str)) + +As our ranking indicates, most of the mystery about the atypical plays is +resolved by inspecting topics #5 and #9. (Only *Phèdre* (1677) needs +additional scrutiny.) Given what we know about Racine's biography, topic #5 +(dieu, temple, chrétiens) does not require a great deal of additional +explanation. Topic #9 is more strongly associated with *Thébaïde* (1664) than +with any other play. Inspecting the words associated with topic #9 we see it +features words such as "et" and "un". If we read the text of the play it +appears that these words do indeed appear comparatively frequently. While we +will leave it to Racine scholars to provide a detailed account of the +difference, we may venture two provisional explanations: first, this was +Racine's first play and his style had yet to mature, and second, there is strong +evidence that Molière contributed to the editing of the play and this fact may +have something to do with the stylistic difference. + +To verify that the Topic #9 does indeed capture a salient difference, we may +compare the rates of the words "et" and "un" across all Racine +plays. The rate of "et" in *Thébaïde* does indeed stand out: + +.. ipython:: python + + # reassemble the document-term matrix + data_dir = 'data/french-tragedy/' + filenames = np.array(sorted(os.listdir(data_dir))) + filenames_with_path = [os.path.join(data_dir, fn) for fn in filenames] + vectorizer = text.CountVectorizer(input='filename') + dtm = vectorizer.fit_transform(filenames_with_path) + dtm = dtm.toarray() + vocab = np.array(vectorizer.get_feature_names()) + authors = np.array([fn.split('_')[0] for fn in filenames]) + + # convert to rates per 1000 words as this is easier to interpret + dtm = 1000 * dtm / np.sum(dtm, axis=1, keepdims=True) + +.. breaking up the ipython blocks appears to help avoid problems + +.. ipython:: python + + for word in ['et', 'un']: + print("Rate per 1,000 words of {}".format(word)) + filenames_racine = filenames[authors == 'Racine'] + rates_racine = dtm[authors == 'Racine', vocab == word] + for filename, rate in zip(filenames_racine, rates_racine): + # use some fancy formatting, see http://docs.python.org/3.3/library/string.html#formatspec + print("{:>40s}: {:.1f}".format(filename, rate)) + +In addition to the Christian vocabulary associated with topic #5, *Esther* and +*Athalie* also distinguish themselves through an *absence* of topic #10. Looking +at the words associated with these topics a pattern emerges: the words are +associated with narration or dialogue. Topic #10 includes the first person +singular pronouns "je" and "me" along with the first person singular forms of +the verbs "être" and "pouvoir" ("suis" and "puis"). Do Racine's final plays +perhaps feature dialogue to a lesser degree than Racine's other plays? + +Again, to validate the suspicion that the words "je" and "me" do indeed appear +more frequently in the final plays we will look directly at their word rates. +The low rates of "je" and "me" in the final two plays certainly do stand out. + +.. ipython:: python + + for word in ['je', 'me']: + print("Rate per 1,000 words of {}".format(word)) + filenames_racine = filenames[authors == 'Racine'] + rates_racine = dtm[authors == 'Racine', vocab == word] + for filename, rate in zip(filenames_racine, rates_racine): + # use some fancy formatting, see http://docs.python.org/3.3/library/string.html#formatspec + print("{:>40s}: {:.1f}".format(filename, rate)) + print() # print a blank line + +Finally, we turn back to *Phèdre* (1677). In terms of topic proportions, +*Phèdre* (1677) looks similar to the composite "average" Racine play. Inspecting +the dendrogram and the multidimensional scaling plot, we observe that the play +is, in fact, not so different from Racine's other plays; it stands out not +because it is as atypical as those discussed above but because it happens to be +similar (in terms of cosine distance) to several of Voltaire's plays. +Investigating why the works of a radical Enlightenment figure like Voltaire +should so strongly resemble Racine's is left as an exercise for the reader. + +.. FOOTNOTES + +.. [#fn_authorship_attribution] Such signatures do not always appear. + They can be eliminated with some modest effort on the part of the writer + :cite:`brennan_adversarial_2011`. There are also many instances of writers + changing their style over time---Henry James is an excellent example + :cite:`hoover_corpus_2007`.)