Commit
Additional clarifications
Make additional clarifications kindly suggested by Fotis Jannidis.
Allen Riddell committed Mar 27, 2015
1 parent 5f04651 commit 774a05c
Showing 2 changed files with 49 additions and 3 deletions.
16 changes: 16 additions & 0 deletions source/preprocessing.rst
@@ -181,6 +181,16 @@ shows the first plays in the corpus:
# rather than just 'Crebillon_TR-V-1703-Idomenee.txt' alone.
tragedy_filenames = [os.path.join(corpus_path, fn) for fn in sorted(os.listdir(corpus_path))]
@suppress
tragedy_filenames_orig = tragedy_filenames.copy()
# alternatively, using the Python standard library package 'glob'
# (glob's result order is arbitrary, so we sort to match the listing above)
import glob
tragedy_filenames = sorted(glob.glob(os.path.join(corpus_path, '*.txt')))
@suppress
assert sorted(tragedy_filenames) == sorted(tragedy_filenames_orig)
Every 1,000 words
-----------------
@@ -225,6 +235,8 @@ a number for the chunk, and the text of the chunk.
.. ipython:: python
tragedy_filenames = [os.path.join(corpus_path, fn) for fn in sorted(os.listdir(corpus_path))]
# alternatively, using glob (sorted, since glob's order is arbitrary)
tragedy_filenames = sorted(glob.glob(os.path.join(corpus_path, '*.txt')))
chunk_length = 1000
chunks = []
for filename in tragedy_filenames:
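The body of this loop is elided in the diff above. A minimal sketch of the chunking step, assuming plain whitespace tokenization (the tutorial's own chunking code may differ in detail), might look like:

```python
def split_into_chunks(text, chunk_length=1000):
    """Split a text into consecutive chunks of roughly chunk_length words.

    A sketch assuming whitespace tokenization; a hypothetical helper,
    not the tutorial's own (elided) implementation.
    """
    words = text.split()
    return [' '.join(words[i:i + chunk_length])
            for i in range(0, len(words), chunk_length)]

# e.g., a 2,500-word text yields chunks of 1000, 1000, and 500 words
chunks = split_into_chunks('word ' * 2500)
```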
@@ -411,6 +423,10 @@ will benefit from reviewing the introductions to NumPy mentioned in
@suppress
dtm_authors_method_numpy = dtm_authors.copy()
.. note:: Recall that gathering together the sum of the entries along columns is
performed with ``np.sum(X, axis=0)`` or ``X.sum(axis=0)``. This is
the NumPy equivalent of R's ``apply(X, 2, sum)`` (or ``colSums(X)``).
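To make the note concrete, here is a tiny illustration (array values chosen arbitrarily):

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])
# column sums, the NumPy counterpart of R's colSums(X)
print(np.sum(X, axis=0))  # [ 9 12]
print(X.sum(axis=0))      # same result
```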

Grouping data together in this manner is such a common problem in data analysis
that there are packages devoted to making the work easier. For example, if you
have the `pandas library <http://pandas.pydata.org>`_ installed, you can
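For instance, a sketch of author-level aggregation with pandas, using a hypothetical miniature document-term matrix (names and counts invented for illustration), might look like:

```python
import numpy as np
import pandas as pd

# hypothetical word counts: one row per text, one column per word
dtm = np.array([[1, 2],
                [3, 4],
                [5, 6]])
authors = ['Racine', 'Voltaire', 'Racine']
df = pd.DataFrame(dtm, index=authors)
# sum together the rows belonging to each author
dtm_authors = df.groupby(level=0).sum()
print(dtm_authors)
```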
36 changes: 33 additions & 3 deletions source/working_with_text.rst
@@ -100,8 +100,13 @@ a list.
matrix by default, consider a 4000 by 50000 matrix of word frequencies that
is 60% zeros. A NumPy integer entry takes up four bytes, so using a sparse
matrix saves almost 500 MB of memory, a considerable amount by the standards
of the 2010s. (Recall that Python objects such as arrays are stored in
memory, not on disk.) If you are working with a very large collection
of texts, you may encounter memory errors after issuing the commands above.
Provided your corpus is not truly massive, it may be advisable to locate
a machine with more memory. For example, these days it is possible to
rent a machine with 64 GB of memory by the hour. Conducting experiments
on a random subsample (small enough to fit into memory) is also recommended.
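The savings are easy to verify on a small scale. The following sketch (a toy matrix with arbitrary fill values, requiring scipy) compares dense and sparse storage sizes:

```python
import numpy as np
import scipy.sparse

dense = np.zeros((400, 500), dtype=np.int32)
dense[::3, ::4] = 7                       # fill only a small fraction of entries
sparse = scipy.sparse.csr_matrix(dense)   # store only the nonzero entries

print(dense.nbytes)        # 800000 bytes for the dense array
print(sparse.data.nbytes)  # far fewer bytes for the nonzero values alone
```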

With this preparatory work behind us, querying the document-term matrix is
simple. For example, the following demonstrate two ways of finding how many times
@@ -130,7 +135,32 @@ matrix product :math:`XY`.)
<http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html>`_
data structure which can be useful if you are doing lots of matrix
operations such as matrix product, inverse, and so forth. In general,
however, it is best to stick to NumPy arrays. In fact, if you are
using Python 3.5 or later, you can make use of the matrix multiplication
operator ``@`` and dispense with any need for the ``matrix`` type.
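A quick illustration of the ``@`` operator with plain arrays (values arbitrary):

```python
import numpy as np

X = np.array([[1, 0],
              [0, 2]])
Y = np.array([[3],
              [4]])
# matrix product on plain arrays, no matrix type needed (Python 3.5+)
print(X @ Y)         # same result as np.dot(X, Y)
print(np.dot(X, Y))
```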

To get a sense of what we have created, here is a section of the
document-term matrix for a handful of selected words:

.. ipython:: python
:suppress:
import os
import pandas as pd
OUTPUT_HTML_PATH = os.path.join('source', 'generated')
OUTPUT_FILENAME = 'working_with_text_dtm.txt'
names = [os.path.basename(fn) for fn in filenames]
vocab_oi = sorted(['house', 'of', 'and', 'the', 'home', 'emma'])
vocab_oi_indicator = np.in1d(vocab, vocab_oi)
ARR, ROWNAMES, COLNAMES = dtm[:, vocab_oi_indicator], names, vocab[vocab_oi_indicator]
html = pd.DataFrame(ARR, index=ROWNAMES, columns=COLNAMES).to_html()
with open(os.path.join(OUTPUT_HTML_PATH, OUTPUT_FILENAME), 'w') as f:
    f.write(html)
.. raw:: html
:file: generated/working_with_text_dtm.txt


Comparing texts
===============
