Foundations and Trends® in Accounting > Vol 14 > Issue 3–4

Using Python for Text Analysis in Accounting Research

By Vic Anand, University of Illinois at Urbana-Champaign, USA, vanand@illinois.edu | Khrystyna Bochkay, University of Miami, USA, kbochkay@bus.miami.edu | Roman Chychyla, University of Miami, USA, rchychyla@bus.miami.edu | Andrew Leone, Northwestern University, USA, andrew.leone@kellogg.northwestern.edu

 
Suggested Citation
Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020), "Using Python for Text Analysis in Accounting Research", Foundations and Trends® in Accounting: Vol. 14: No. 3–4, pp 128-359. http://dx.doi.org/10.1561/1400000062

Publication Date: 03 Dec 2020
© 2020 Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone
 
In this article:
1. Introduction
2. Installing Python on Your Computer
3. Jupyter Notebooks
4. A Brief Introduction to the Python Programming Language
5. Working with Tabular Data: The Pandas Package
6. Introduction to Regular Expressions
7. Dictionary-Based Textual Analysis
8. Quantifying Text Complexity
9. Sentence Structure and Classification
10. Measuring Text Similarity
11. Identifying Specific Information in Text
12. Collecting Data from the Internet
Acknowledgements
References

Abstract

The prominence of textual data in accounting research has increased dramatically. To assist researchers in understanding and using textual data, this monograph defines and describes common measures of textual data and then demonstrates the collection and processing of textual data using the Python programming language. The monograph is replete with sample code that replicates textual analysis tasks from recent research papers.

In the first part of the monograph, we provide guidance on getting started in Python. We first describe Anaconda, a distribution of Python that provides the requisite libraries for textual analysis, and its installation. We then introduce the Jupyter notebook, a programming environment that improves research workflows and promotes replicable research. Next, we teach the basics of Python programming and demonstrate the basics of working with tabular data in the Pandas package.

The second part of the monograph focuses on specific textual analysis methods and techniques commonly used in accounting research. We first introduce regular expressions, a sophisticated language for finding patterns in text. We then show how to use regular expressions to extract specific parts from text. Next, we introduce the idea of transforming text data (unstructured data) into numerical measures representing variables of interest (structured data). Specifically, we introduce dictionary-based methods of (1) measuring document sentiment, (2) computing text complexity, (3) identifying forward-looking sentences and risk disclosures, (4) collecting informative numbers in text, and (5) computing the similarity of different pieces of text. For each of these tasks, we cite relevant papers and provide code snippets to implement the relevant metrics from these papers.
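To make the idea of turning text into numerical measures concrete, here is a minimal sketch of two of the measures listed above: a dictionary-based tone score and a bag-of-words cosine similarity. It is not the monograph's code. The word lists are toy examples invented for illustration (not a published sentiment dictionary), and the function names `tokenize`, `tone`, and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

# Toy word lists for illustration only (hypothetical, not a published dictionary).
POSITIVE = {"growth", "improved", "strong", "gain"}
NEGATIVE = {"loss", "decline", "weak", "impairment"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tone(text):
    """Net tone: (positive count - negative count) / total word count."""
    words = tokenize(text)
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two texts' word-count vectors."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With these toy lists, `tone("Strong growth offset a small loss")` counts two positive words and one negative word among six tokens, giving a net tone of 1/6. Research applications would typically substitute an established word list and may weight terms differently.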

Finally, the third part of the monograph focuses on automating the collection of textual data. We introduce web scraping and provide code for downloading filings from EDGAR.
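As a rough illustration of what automating EDGAR collection can involve, the sketch below builds the URL of a quarterly filing index and parses one index row. It is not the monograph's code. It assumes the pipe-delimited `master.idx` layout (CIK, company name, form type, filing date, file path) under `https://www.sec.gov/Archives/edgar/full-index/`, and it assumes the SEC expects an identifying `User-Agent` header on automated requests. The function names and the sample row in the usage note are hypothetical.

```python
import urllib.request

EDGAR_ARCHIVES = "https://www.sec.gov/Archives/"


def index_url(year, quarter):
    """URL of the quarterly master index (assumed layout of the EDGAR archive)."""
    return f"{EDGAR_ARCHIVES}edgar/full-index/{year}/QTR{quarter}/master.idx"


def parse_index_line(line):
    """Parse one pipe-delimited index row; return None for header or malformed lines."""
    parts = line.strip().split("|")
    if len(parts) != 5 or not parts[0].isdigit():
        return None
    cik, company, form_type, date_filed, path = parts
    return {
        "cik": int(cik),
        "company": company,
        "form_type": form_type,
        "date_filed": date_filed,
        "url": EDGAR_ARCHIVES + path,
    }


def fetch(url, user_agent):
    """Download a URL as text, sending a User-Agent that identifies the requester."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("latin-1")
```

For example, `parse_index_line("1234567|EXAMPLE CORP|10-K|2020-03-01|edgar/data/1234567/0001234567-20-000001.txt")` returns a dictionary whose `form_type` is `"10-K"`; the row is made up for illustration. Filtering the parsed rows by form type and passing each `url` to `fetch` is one way to assemble a sample of filings.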

DOI: 10.1561/1400000062
Paperback: ISBN 978-1-68083-760-5, 248 pp., $99.00
E-book (PDF): ISBN 978-1-68083-761-2, 248 pp., $280.00

Using Python for Text Analysis in Accounting Research

Using Python for Text Analysis in Accounting Research provides faculty and PhD students in the social sciences with an interactive, step-by-step framework for analyzing spoken or written language. The goal is to demonstrate how textual analysis can enhance research by automatically extracting previously unknown information from voluminous disclosures, news articles, and social media posts. The material is presented so that readers can learn a textual analysis concept or technique and then replicate it themselves.

The monograph begins by showing how to install and use Python, a popular general-purpose programming language, and reviews Python's basic syntax, operators, data types, and functions so that readers can first familiarize themselves with the programming environment. It discusses the Jupyter notebook, an open-source web application for creating, running, and testing Python code interactively. It then introduces the Pandas package for working with tabular data, which helps researchers convert unstructured textual data into structured, tabular data. The authors introduce regular expressions, which represent patterns for matching different elements in text. They then discuss and code the textual analysis methods used in accounting and finance studies. Finally, the monograph provides an overview of web scraping and file processing features in Python, with a focus on downloading EDGAR filings and identifying specific sections in them.
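As a small illustration of how a regular expression can turn unstructured text into structured rows, the sketch below pulls dollar amounts out of a sentence and returns them as a list of dictionaries, one per match. It is not taken from the monograph. The pattern, the `UNIT_SCALE` table, and the `extract_amounts` function are assumptions made for this example, and it uses only the standard library, although the same rows could be passed to Pandas to build a table.

```python
import re

# Dollar sign, a number with optional thousands separators and decimals,
# and an optional scale word. This pattern is an illustrative assumption.
AMOUNT_PATTERN = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)\s*(million|billion)?",
    re.IGNORECASE,
)

UNIT_SCALE = {"": 1.0, "million": 1e6, "billion": 1e9}


def extract_amounts(text):
    """Return one row (dict) per dollar amount found in the text."""
    rows = []
    for match in AMOUNT_PATTERN.finditer(text):
        number = float(match.group(1).replace(",", ""))
        unit = (match.group(2) or "").lower()
        rows.append({
            "matched_text": match.group(0).strip(),
            "unit": unit,
            "dollars": number * UNIT_SCALE[unit],
        })
    return rows


sample = "Revenue rose to $1,250.5 million, while net income was $98 million."
for row in extract_amounts(sample):
    print(row)
```

Each dictionary is one record with the same keys, which is the row-and-column shape that tabular tools expect. A production pattern would need to handle more cases, such as amounts in parentheses or written-out numbers.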

Taken together, the first five chapters of this monograph will help readers get started with Python and prepare for writing their own code.

 
ACC-062

Supplementary material | 1400000062_supp.zip (ZIP).

This file contains the supplementary material code referred to in the monograph.

DOI: 10.1561/1400000062_supp