Minor changes suggested by Fotis Jannidis
Allen Riddell committed Mar 27, 2015
1 parent 9c01a43 commit b7f9c68
Showing 4 changed files with 16 additions and 18 deletions.
1 change: 1 addition & 0 deletions requirements.txt
@@ -11,3 +11,4 @@ scipy>=0.13.3
sphinxcontrib-bibtex>=0.3.1
sphinxcontrib-tikz>=0.4.1
statsmodels>=0.6.0
sphinx-rtd-theme>=0.1.6
5 changes: 2 additions & 3 deletions source/conf.py
@@ -41,7 +41,7 @@
'IPython.sphinxext.ipython_console_highlighting',
'matplotlib.sphinxext.only_directives',
'sphinxcontrib.tikz',
'sphinxcontrib.bibtex'
'sphinxcontrib.bibtex',
]

# Add any paths that contain templates here, relative to this directory.
@@ -112,9 +112,8 @@

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# html_theme = 'sphinxdoc'
html_theme = "nature"

html_theme = 'sphinxdoc'
#html_theme_path = ['./themes/']

# Theme options are theme-specific and customize the look and feel of a theme
6 changes: 1 addition & 5 deletions source/topic_model_mallet.rst
@@ -490,8 +490,4 @@ to be associated more strongly with Austen's novels than with Brontë's.

.. FOOTNOTES
.. [#fnmapreduce] Those familiar with
`MapReduce <https://en.wikipedia.org/wiki/MapReduce>`_ may recognize the pattern of
splitting a dataset into smaller pieces and then (re)ordering them.
.. [#fnmapreduce] Those familiar with `MapReduce <https://en.wikipedia.org/wiki/MapReduce>`_ may recognize the pattern of splitting a dataset into smaller pieces and then (re)ordering them.
22 changes: 12 additions & 10 deletions source/working_with_text.rst
@@ -54,10 +54,10 @@ parameter. Other important parameters include:
- ``min_df`` (default ``1``) remove terms from the vocabulary that occur in
fewer than ``min_df`` documents (in a large corpus this may be set to
``15`` or higher to eliminate very rare words)
- ``vocabulary`` ignore words that do not appear in the provided list of words
- ``vocabulary`` ignore words that do not appear in the provided list of words
- ``strip_accents`` remove accents
- ``token_pattern`` (default ``u'(?u)\b\w\w+\b'``) regular expression
identifying tokens–by default words that consist of a single character
identifying tokens–by default words that consist of a single character
(e.g., 'a', '2') are ignored, setting ``token_pattern`` to ``'(?u)\b\w+\b'``
will include these tokens
- ``tokenizer`` (default unused) use a custom function for tokenizing
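
Note on the ``token_pattern`` entry in this hunk: the difference between the two regular expressions can be checked with the standard library alone. This is a minimal sketch, not part of the commit. It assumes the pattern is applied with ``re.findall``, which approximates the tokenizing step but skips the lowercasing and other preprocessing the vectorizer performs. The sample sentence and variable names are made up for illustration.

```python
import re

# The two patterns quoted in the hunk above.
default_pattern = r"(?u)\b\w\w+\b"   # requires at least two word characters
inclusive_pattern = r"(?u)\b\w+\b"   # also accepts single-character tokens

# Hypothetical sample text, chosen to contain one-character tokens.
text = "a cat ate 2 fish"

default_tokens = re.findall(default_pattern, text)
inclusive_tokens = re.findall(inclusive_pattern, text)

print(default_tokens)    # tokens of two or more characters only
print(inclusive_tokens)  # every run of word characters
```

With the default pattern the single-character tokens ``a`` and ``2`` are dropped. The second pattern keeps them.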
@@ -88,7 +88,7 @@ into a NumPy array, as an array supports a greater variety of operations than
a list.

.. ipython:: python
# for reference, note the current class of `dtm`
type(dtm)
dtm = dtm.toarray() # convert to a regular array
@@ -195,7 +195,7 @@ avail ourselves of the ``scikit-learn`` function ``euclidean_distances``.
for j in range(n):
x, y = dtm[i, :], dtm[j, :]
dist[i, j] = np.sqrt(np.sum((x - y)**2))
from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(dtm)
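
Note on the distance computation in this hunk: the nested loop computes, for every pair of rows, the square root of the summed squared differences. Below is a minimal standard-library sketch of the same calculation, not part of the commit. It assumes a tiny hand-made document-term matrix. The function name ``pairwise_euclidean`` and the toy counts are hypothetical.

```python
import math

# Hypothetical toy document-term matrix: 3 documents, 4 vocabulary terms.
dtm = [
    [2, 0, 1, 0],
    [0, 1, 1, 3],
    [2, 0, 1, 0],
]


def pairwise_euclidean(rows):
    """Return the symmetric matrix of Euclidean distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The distance is symmetric, so compute each pair once.
        for j in range(i + 1, n):
            squared = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            dist[i][j] = dist[j][i] = math.sqrt(squared)
    return dist


dist = pairwise_euclidean(dtm)
for row in dist:
    print(row)
```

Documents with identical term counts (the first and third rows here) are at distance zero, and the diagonal is zero by construction.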
@@ -360,7 +360,7 @@ produces a hierarchical clustering of texts via the following procedure:
#. Start with each text in its own cluster

#. Until only a single cluster remains,

- Find the closest clusters and merge them. The distance between two clusters
is the change in the sum of squared distances when they are merged.

@@ -372,17 +372,19 @@ this algorithm and returns a tree of cluster-merges. The hierarchy of clusters
can be visualized using ``scipy.cluster.hierarchy.dendrogram``.

.. ipython:: python
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist)
# match dendrogram to that returned by R's hclust()
dendrogram(linkage_matrix, orientation="right", labels=names);
dendrogram(linkage_matrix, orientation="right", labels=names)
@savefig plot_getting_started_ward_dendrogram.png width=7in
plt.tight_layout() # fixes margins
@savefig plot_getting_started_ward_dendrogram.png width=7in
plt.show()
For those familiar with R, the procedure is performed as follows:

.. code-block:: r
@@ -409,7 +411,7 @@ Exercises
text1 = "Indeed, she had a rather kindly disposition."
text2 = "The real evils, indeed, of Emma's situation were the power of having rather too much her own way, and a disposition to think a little too well of herself;"
text3 = "The Jaccard distance is a way of measuring the distance from one set to another set."
3. Using the document-term matrix just created, calculate the Euclidean
distance, `Jaccard distance <http://en.wikipedia.org/wiki/Jaccard_index>`_,
and cosine distance between each pair of documents. Make sure to calculate
