Uni Würzburg

Responsible DARIAH-DE developer: Steffen Pielström



With the increasing availability of large volumes of text and data, quantitative methods have found their way into a number of humanities disciplines, where they increasingly supplement and reformulate qualitative approaches in order to exploit the properties of large data resources. Computer literacy is an important sub-discipline of computer science. Recent advances in statistical approaches, such as topic models for identifying literary themes, have been applied successfully by scholars in fields such as history, literary studies, and linguistics. The training materials "TAToM - Text Analysis with Topic Models for the Humanities and Social Sciences" consist of a series of tutorials covering basic methods of quantitative text analysis. Topics, in turn, is a topic-modeling library that offers several implementations of LDA (Latent Dirichlet Allocation).

The tutorials cover the preparation of a text corpus for analysis and the exploration of text collections using methods such as topic modeling and machine learning. They deal with both basic and advanced topics, and they primarily use the Python programming language to organize, analyze, and visualize the text data.
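As a rough illustration of what preparing a text corpus for analysis can involve, the following minimal sketch turns a few documents into a document-term matrix of word counts. It is not taken from the tutorials: the function names ``tokenize`` and ``document_term_matrix`` and the two toy documents are assumptions made up for this example, and only the Python standard library is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a document.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Toy documents, invented for illustration only.
documents = [
    "Emma walked to the village.",
    "The village was quiet, and Emma was glad.",
]
vocabulary, matrix = document_term_matrix(documents)

for doc, row in zip(documents, matrix):
    print(doc, row)
```

A matrix of this kind, with one row per document and one column per vocabulary word, is a common input format for topic-modeling methods such as LDA. Real corpora usually call for more careful tokenization and for sparse matrices.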

Overview of the contents:

The tutorials were written by Allen Riddell for DARIAH-DE and released as version 1.0 in March 2014. The work was coordinated by Christof Schöch at the Chair of Computer Philology at the University of Würzburg.

For example, topic models can be visualized in this way:

Words associated with topics in the ``austen-brontë`` corpus. See :ref:`topic_model_visualization`.

To the Tutorial

Further information, along with a list of the available topic-modeling tools and direct links to the tutorials for the individual tools, can be found here. Click here to go to the source code on GitHub.


