 
In partnership with:
- University of Virginia, School of Data Science
- University of Virginia, Library
- University of Arizona, University Libraries
Funding for this 24-month project provided by: The National Endowment for the Humanities
Funding received: September 2020
We are now planning TAP Institute 2023. For updates, join our mailing list.
- TAP Institute Participants: A public directory of teachers and researchers from the TAP Institute who have chosen to share their information publicly for networking. (Add your information!)
- 2021-2022 TAP Institute Whitepaper: A paper describing our findings for the project.
- Even more DH Text Analysis Teaching/Learning Materials
Open Educational Resources
Beginner Courses
Python Basics 1-5
This course is appropriate for complete beginners who have never programmed or done text analysis before.
If you've never programmed before, this course is a great introduction. Taught from a humanist perspective, this course will help you start writing your first code and unlock the potential of text analysis.
Introduction to R Programming
This course is appropriate for complete beginners who have never programmed or done text analysis before.
This course is a gentle introduction to R programming. With an emphasis on text analysis, this course will help you begin your adventures in programming.
A Gentle Introduction to Optical Character Recognition with PyTesseract
Python Basics required
This course will introduce the concept of “Optical Character Recognition” (OCR), various tools available for performing OCR, and important considerations for successfully OCRing digitized text. Using Tesseract in Python, we’ll walk through the entire process using a variety of examples to show the range of challenges scholars can face when performing OCR. By the end of the course, participants should be able to use the course’s Jupyter Notebooks to perform OCR on their own; they should be able to identify possible technical challenges presented by specific texts and propose potential solutions; and they should be able to assess the degree of accuracy they have achieved in performing OCR.
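One common way to assess OCR accuracy is the character error rate (CER): the edit distance between a hand-corrected reference transcription and the OCR output, divided by the length of the reference. The sketch below is illustrative only and is not taken from the course notebooks; the function names and the sample strings are assumptions, and it uses only the Python standard library rather than Tesseract itself.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a hand-typed reference and an imagined OCR result.
reference_text = "The quick brown fox"
ocr_text = "Tbe quick hrown fox"
cer = character_error_rate(reference_text, ocr_text)
```

A lower CER means the OCR output is closer to the reference. In practice the reference would be a small, manually transcribed sample of the digitized text.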
A Practical Guide to Text Data Curation
Python Basics required
No matter how exciting your research question is or how fancy your models are, all text analysis projects depend on having text data that is tidy enough to analyze. This course surveys some practices of text data curation to filter out irrelevant text, refine a corpus vocabulary, and identify text artifacts in real-world text collections. We will explore how to approach these tasks using Python libraries such as NLTK and spaCy, as well as explore how some text models, like LDA topic models, can actually serve as a tool for diagnosing recurring corpus issues.
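To give a rough sense of what refining a corpus vocabulary can involve, the sketch below drops stopwords and keeps only words that occur at least twice across a tiny made-up corpus. It is an illustration under stated assumptions, not course material: the stopword list, the function names, and the sample documents are all hypothetical, and it uses the standard library where the course itself uses NLTK and spaCy.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; NLTK and spaCy ship fuller ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters, dropping digits and punctuation."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_vocabulary(documents: list[str], min_count: int = 2) -> set[str]:
    """Keep non-stopword tokens that appear at least min_count times in the corpus."""
    counts = Counter(
        token
        for document in documents
        for token in tokenize(document)
        if token not in STOPWORDS
    )
    return {token for token, count in counts.items() if count >= min_count}


# Made-up sample corpus; the last "document" imitates a page-number artifact.
documents = [
    "The whale is the subject of the voyage.",
    "A voyage to find the whale.",
    "Page 12 of 300",
]
vocabulary = build_vocabulary(documents)
```

Here the frequency threshold removes both rare words and the leftover word from the page-number line, which is one simple way artifacts can be filtered out of a vocabulary.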
Web Scraping and Text Analysis in Bilingual Social Media
Requires Facebook account. No prior programming experience required.
This course teaches attendees how to scrape social media posts from the web, download the data in CSV format, clean it, and perform basic analyses such as word frequency counts. To achieve this, we will work through exercises with posts in Spanish, English, or Spanglish taken from Facebook pages belonging to organizations of migrants returned to Mexico. We will use tools such as Facepager, Notepad, Word, and RStudio.
Intermediate Courses
Python Intermediate 1-4
Nathan Kelber and Zhuo Chen

Python Basics required
An introduction to intermediate Python skills, including comprehensions; working with .txt, .csv, and .json files; navigating file paths with pathlib; and object-oriented programming (OOP).
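The skills listed above fit together naturally. The following sketch, which is not drawn from the course itself, reads a small CSV file, filters rows with a list comprehension, and writes the result as JSON, using pathlib for file paths and a simple class for each record. The `Book` class, the `csv_to_json` function, the file names, and the sample data are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


class Book:
    """A minimal class holding one row of catalog data."""

    def __init__(self, title: str, year: int):
        self.title = title
        self.year = year

    def to_dict(self) -> dict:
        return {"title": self.title, "year": self.year}


def csv_to_json(csv_path: Path, json_path: Path, after_year: int) -> list[Book]:
    """Read books from a CSV file and save those published after after_year as JSON."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        # List comprehension: one Book object per CSV row.
        books = [Book(row["title"], int(row["year"])) for row in csv.DictReader(handle)]
    recent = [book for book in books if book.year > after_year]
    json_path.write_text(
        json.dumps([book.to_dict() for book in recent], indent=2),
        encoding="utf-8",
    )
    return recent


# Work in a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "books.csv"
    json_path = Path(folder) / "recent.json"
    csv_path.write_text("title,year\nMoby-Dick,1851\nBeloved,1987\n", encoding="utf-8")
    recent_books = csv_to_json(csv_path, json_path, after_year=1900)
    saved = json.loads(json_path.read_text(encoding="utf-8"))
```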
Data Analysis with Pandas
Python Basics required
This workshop will introduce students to a popular Python package known as Pandas, a tool for data analysis and manipulation that is widely used among data scientists. Participants will learn how to work with CSV files and JSON files, how to filter and aggregate data, how to make bar charts and time series plots, how to merge datasets with common values, and more. All case studies and examples will feature data relevant to the humanities, such as (potentially) library circulation data, screenplay data, and social media data.
Visualizing Humanities Data
Python Basics required
This course will introduce participants to some of the foundations and horizons of visualizing humanities data. To help us generate datasets we will lightly explore some text analysis methods, and then focus on some of the possibilities and pitfalls of visualizing data derived from these methods. In particular, this course will introduce participants to the principles of the grammar of graphics and exploratory data analysis through using the Python library Altair and Jupyter Notebooks. The goal of this course is to help participants learn how to incorporate visualizing humanities data into their research workflows, for both sharing aggregated information and making arguments.
Text Analysis in Ancient/Medieval Languages
Python Basics required
This workshop will introduce students to natural language processing (NLP) and text analysis in ancient and medieval languages. We will use Latin as a case study. Day 1 will focus on the basics of NLP and spaCy, one of the leading NLP libraries for Python. Day 2 will address the textual problems of working with ancient/medieval languages, including how to handle highly inflected languages; lemmatization without a lemmatizer; and accounting for textual, geographical, and temporal variances of the language. Day 3 will address a single text analysis problem: named entity recognition (NER) in Latin. On this final day, we will develop a workflow for solving this problem. Students will leave this workshop with a strong understanding of NLP and NER. They will also have an understanding of how to solve text analysis problems in highly inflected or dead languages. Students will be provided with the resources for further learning. Finally, students will leave the workshop with a working NER model that they can use and improve in the future.
Working with Twitter Data
Python Basics and command line experience recommended.
This course will prepare students to collect, analyze, and visualize Twitter data. Students will learn how to work with the Twitter API and with the Python library twarc, one of the most popular tools for Twitter data. We will also introduce basic text analysis methods that are appropriate for short documents like tweets. Participants who are eligible for the Academic Research Track of the Twitter API will have the opportunity to work with the entire historical archive of tweets (2006-2022).
Introduction to Natural Language Processing with spaCy
Python Basics required
This course will introduce the key concepts of natural language processing (NLP) and an NLP Python library, spaCy. spaCy allows users to build robust pipelines for text analysis. On Day 1, we will learn about NLP concepts and how to install and use the spaCy library generally. On Day 2, we will learn how to use spaCy to identify linguistic features within a document. On Day 3, we will learn how to apply those features to solve real-world information extraction problems.
Multilingual Newspaper Data and Visualizations
No prior programming experience required.
This course teaches attendees close-reading text analysis with bilingual (Spanish and English) newspapers hosted in various digital repositories. Attendees will create and clean bilingual datasets, select and edit images from the newspapers, and adapt these datasets for visualizations (maps, timelines, and networks) that approach the material through time, space, and cultural and historical context. We will use tools such as Excel, OpenRefine, Carto, TimelineJS, and Graph Commons.
Introduction to Pandas (William Mattingly)
Python Basics required.
This course introduces students to working with tabular data in Python through the Pandas library. On Day 1, you will learn how to install and import Pandas; you will also learn about some of its basic features, such as the DataFrame. Day 2 will focus on finding, organizing, and sorting data. Day 3 will focus on advanced searching methods, such as filtering, querying, grouping, and GroupBy. A few additional lessons will be provided on plotting data in Pandas.
Advanced Courses
Intro to Machine Learning
Python Basics required. Introduction to NLP with spaCy is recommended.
This workshop will introduce students to machine learning (ML), from its early beginnings to its modern applications; students will also be introduced to a branch of ML known as deep learning. We will specifically address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML, the key concepts and terms that practitioners must know. Day 2 will be dedicated to a common ML problem: text classification. Day 3 will focus on an adjacent problem: topic modeling. On both days, students will be provided a workflow for solving these problems. Students will leave this workshop with a firm understanding of ML conceptually and a basic understanding of how to engage in ML via Python. Finally, students will be provided with the resources for further learning.
Intro to Machine Learning
Python Basics required. Knowledge of Pandas recommended.
This course will introduce you to a range of Machine Learning techniques for analyzing textual data in Python. You will be introduced to the theory and method of Machine Learning and given some practical skills on how to write and execute machine learning code in Python. Some basic experience with Python will be required for participation in the class coding projects, but feel free to join us if you want to have a better understanding of what Machine Learning techniques can do for humanists. Generally speaking, this class will help you think about humanities problems through the lens of Machine Learning.
Named Entity Recognition
Python Basics required
This course will introduce participants to one of the core areas of natural language processing: named entity recognition. While annotating datasets with set standards is one of the oldest areas of DH research (particularly with the Text Encoding Initiative), this course will focus on some of the newer approaches for identifying and annotating objects of interest in any given text. The course will focus on using the Python library spaCy, both with its built-in functionality and by learning how to expand upon it for more specific uses. While this course is taught in English, participants are encouraged to bring sources in multiple languages. Ultimately, participants will learn both how to leverage NER in their research and how to tailor NER to their specific textual sources.
Machine Learning for Humanists
Python Basics required. Knowledge of Pandas recommended.
This course will introduce students to the variety of machine learning (ML) algorithms available for textual analysis. Throughout the three days of the course, we will address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML, and students will use supervised learning to work through a research question. Day 2 will be dedicated to a common ML technique: topic modeling. Day 3 will focus on more advanced techniques such as using language models to classify text. Every day, students will be provided a workflow for using these techniques on their own research questions.
Introduction to Multilingual Named Entity Recognition
Python Basics required. Introduction to NLP with spaCy is recommended.
This course will introduce students to named entity recognition with emphasis placed on multilingual documents. On Day 1, we will address some of the common issues one faces in handling multilingual documents, such as inconsistent text encoding and text standardization, and some of the current state-of-the-art transformer-based language models. We will also meet some of the key features of spaCy’s NER pipelines. On Day 2, we will jump into rules-based NER with spaCy. On Day 3, we will explore machine learning (ML) based NER in spaCy. Here, we will learn the essentials of creating good datasets for training NER models.
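One frequent text standardization issue in multilingual documents is that the same accented character can be stored in more than one way, so two strings that look identical on screen do not compare as equal. The sketch below, which is illustrative and not taken from the course, normalizes text to Unicode NFC form and collapses whitespace with the standard library. The `standardize` function name and the sample strings are assumptions.

```python
import unicodedata


def standardize(text: str) -> str:
    """Normalize text to Unicode NFC form and collapse runs of whitespace."""
    normalized = unicodedata.normalize("NFC", text)
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join(normalized.split())


composed = "Jos\u00e9"      # "é" stored as a single code point
decomposed = "Jose\u0301"   # "e" followed by a combining acute accent

# The raw strings differ, but their standardized forms match.
same_before = composed == decomposed
same_after = standardize(composed) == standardize(decomposed)
```

Standardizing text this way before matching names or training a model helps ensure that variant encodings of the same name are treated as a single entity.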
How to do Things with Topic Models
Python Basics and Python Intermediate recommended
This workshop will introduce students to the concept of topic models and how they have been used to advance humanistic research. Topics to be covered include topic models as a general task in text analytics, creating topic models from scratch using Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), visualizing their results, evaluating their performance, and interpreting their results. In addition, students will be exposed to examples of how topic models have been used in humanistic and social science research. Work will be conducted using Python 3 and Jupyter Notebooks.
Even More DH Text Analysis Teaching/Learning Materials
- PythonHumanities.com by William Mattingly
- Programming Historian by various authors
- The Carpentries by various authors
- Digital Humanities Research Institutes by various authors
- Computational Humanities Research 
- YaleDHLab Lab Workshops 
- Jupyter notebooks for digital humanities curated by Quinn Dombrowski
- Data Sitter's Club by various authors
- HathiTrust Digital Library Collections and Tools
- Documenting the Now
Books on Python, Text Analysis, and DH
- Automate the Boring Stuff with Python: Practical Programming for Total Beginners (2019) by Al Sweigart
- Python Crash Course: A Hands-On, Project-Based Introduction to Programming (2019) by Eric Matthes
- Machine Learning with Python Cookbook (2018) by Chris Albon
- Natural Language Processing in Action (2019) by Hobson Lane, Cole Howard, and Hannes Max Hapke
- Humanities Data Analysis: Case Studies with Python by Folgert Karsdorp, Mike Kestemont, and Allen Riddell
- Technical Textbooks List by Scott B. Weingart
- Introduction to Named Entity Recognition by William Mattingly
Books on Data Ethics
- Algorithms of Oppression (2018) by Safiya Noble
- Race After Technology (2019) by Ruha Benjamin
- Data Feminism (2020) by Catherine D'Ignazio and Lauren F. Klein
Instructional Video
- DH, Coding, and Book History by Paul Vierthaler
- Python Tutorials for Digital Humanities by William Mattingly
Course Examples
- Humanities Analytics by Matt Lavin
- Introduction to Cultural Analytics and Python by Melanie Walsh
- CodeLab by Shane Lin, Zoe LeBlanc, and Brandon Walsh
- Computational and Inferential Thinking: The Foundations of Data Science by Ani Adhikari, John DeNero, and David Wagner