Choosing a Dataset

The University of North Carolina – Chapel Hill (UNC – Chapel Hill) maintains an online project titled “Documenting the American South.” This project contains several collections of documents, images, and literature on the history of the southern states. The collection used as the main dataset in this project is “The Church in the Southern Black Community,” which can be found here. This collection covers several different topics within a corpus of 144 texts of varying size. Without digital tools, this collection would take years to work through. Even with them, it is still a sizable amount of information to tackle. Because of the sheer size of the collection, I chose to focus my project on one specific aspect of it that could also draw in the others. In some of the documents, the authors included hymns that they felt were important to their story. I found that the hymns covered themes that were important to the African American experience during and after slavery, which made them well suited to analysis in this project.

Data Curation

To make a separate dataset of the hymns, I needed to go through the documents in the “Church in the Southern Black Community” collection and pick out the hymns they contained. To do this efficiently, I used the search function ([Ctrl + F]) to search for tab characters in the text.

Example of how a computer’s search function was used to quickly pick out tabs within large amounts of text; the tabs usually indicated hymns.
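I did this step by hand, but the same idea could be scripted. The following is a minimal sketch in Python, not the procedure used in this project. It assumes that hymn lines begin with a tab character, as they usually did in these documents; the function name find_tabbed_blocks and the sample text are made up for illustration and do not come from the collection.

```python
def find_tabbed_blocks(text):
    """Return groups of consecutive lines that start with a tab character."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.startswith("\t") and line.strip():
            # A tab-indented, non-empty line: treat it as part of a hymn.
            current.append(line.strip())
        elif current:
            # The indented run has ended, so close the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Placeholder text, not taken from the collection.
sample = (
    "He then sang the following lines:\n"
    "\tFirst line of a hymn\n"
    "\tSecond line of a hymn\n"
    "After the singing, the meeting closed.\n"
)

for block in find_tabbed_blocks(sample):
    print("\n".join(block))
```

Each block found this way would still need to be read and checked, since a tab can also mark a quotation or a list rather than a hymn.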

Once I found a hymn within the text, I copied it into a .txt file and saved it as an individual hymn within a folder of the collected hymns. I named each file based on the author’s last name and the main ideas of the hymn. By the time I made it through the documents, I had accumulated 144 (or one gross) hymns. Fortunately, the “Documenting the American South” collection had created a .csv table of contents containing the title, author, and year of publication of each text. Using this .csv as a foundation for the metadata, I then added information such as the denomination of the author, the location where the author was born, and the geographic coordinates of that location. The program OpenRefine made it quick and simple to add this information. During the data curation process, I also sorted the hymns into individual folders based on the denomination of the author so that I could more easily analyze the hymns by denomination.
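The folder sorting could also be scripted once the metadata exists. The sketch below is only an illustration, not the procedure used in this project. It assumes file names of the form lastname_theme.txt and a metadata table with columns named author_last_name and denomination; the column names, the underscore separator, the folder name "hymns," and every value in the sample table are hypothetical placeholders.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical metadata; column names and values are placeholders,
# not rows from the actual table of contents.
METADATA_CSV = """author_last_name,denomination
Smith,Denomination A
Jones,Denomination B
"""


def load_denominations(csv_text):
    """Map each author's last name to a denomination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["author_last_name"]: row["denomination"] for row in reader}


def destination(filename, denominations, root="hymns"):
    """Build the folder path a hymn file would be sorted into."""
    # Assumes the file name starts with the author's last name,
    # followed by an underscore.
    last_name = filename.split("_", 1)[0]
    folder = denominations.get(last_name, "unknown")
    return PurePosixPath(root) / folder / filename


denominations = load_denominations(METADATA_CSV)
print(destination("Smith_deliverance.txt", denominations))
```

The sketch only computes where each file would go. Moving the files is left out so that nothing is changed on disk until the paths have been checked.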

Data Analysis

Since most of this dataset is text, my main method of analyzing the data was text mining. I used the program Voyant Tools extensively to find recurring words, common themes, collocated words, repeated phrases and terms, links between words, and several other forms of text analysis. Most of these tools appear as visualizations throughout this project and are embedded within the website, allowing users to interact with each tool. A heatmap was also made to show which denomination’s hymns were more prominent in which areas.
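To give a sense of what finding recurring words involves, here is a minimal sketch in Python. It is not how Voyant Tools works internally and it was not part of this project; it simply counts words after dropping a short list of common ones. The stopword list, the function name word_frequencies, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real tools use much longer ones.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "my", "i"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword appears in the text."""
    # Lowercase the text and keep runs of letters, allowing one
    # internal apostrophe (as in "o'er").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(word for word in words if word not in stopwords)


# Placeholder sentence, not a line from any hymn in the dataset.
sample = "placeholder words repeat and repeat in the placeholder text"

freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Run over each denomination’s folder of hymns in turn, counts like these would allow a rough comparison of vocabulary between denominations.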


Conducting research on this topic was tricky. On one hand, there is a plethora of information on the church in the South, including African Americans’ contributions to the different communities. The MSU library had access to a majority of these sources. On the other hand, there is little literature about black hymnody. This was an obstacle because it made it difficult to understand why African Americans in the South would sing certain hymns. However, after some research on the different black church communities of the South, the hymnological aspect was easier to understand.


To begin, it can be assumed that the texts UNC – Chapel Hill collected are only a small fraction of the written documents related to the church in the southern black community, since many have been stored away, lost, or destroyed. As a result, the dataset can appear to be incomplete or inaccurate. Additionally, many of the texts and documents collected by UNC – Chapel Hill’s project were digitized using optical character recognition (OCR), which is known to be inconsistent in its accuracy.

Regarding shortcomings specific to my project, there are no doubt several that were unavoidable. First, the names of the authors of the hymns are not always accurate. For the most part, the author I listed was simply the author of the document in UNC – Chapel Hill’s collection, unless the text explicitly stated who created the hymn. Second, the denomination of the author was taken from the text under the assumption that the author would only include hymns from his own church. It is possible that a hymn was shared between several denominations, which is not reflected in my data. Third, the locations used in the metadata are based on where the person singing the hymn, or the author, was born, assuming that the author traveled little and that their denomination was the norm in that area. Lastly, I had trouble creating visualizations from my data, since it contains few concrete numbers that could support meaningful ones.