VILLA Latin Text Database

What is VILLA?
VILLA Is the Latin Literature Archive

VILLA aims to make Latin texts easier to study for students and academics across a variety of fields. Resources such as The Latin Library and the Perseus Project already exist, but they leave much to be desired for the interdisciplinary researcher.

Available resources lack modern, machine-accessible formats that enable powerful analysis. The Latin Library is great for grabbing texts for a homework assignment, but its inconsistent file structure makes it a challenge to scrape, although it can be done. The Perseus site presents similar issues: its reader tools are great for reading, understanding, and analyzing individual passages, but they fall short when you try to analyze a whole corpus. Perseus does make all of its corpora available for download as compressed XML files, which are the starting point for this project. They’re only a starting point, though, because the files sit in a deeply nested file tree and are encoded in XML with TEI EpiDoc annotation. The information encoded is interesting and useful, but wrangling a file tree and an XML parser before conducting any useful research is a significant barrier to entry, and one that VILLA aims to break down.

The goal of VILLA is to increase access to and understanding of ancient Latin literature by making the texts available in more usable formats. With cumbersome TEI EpiDoc files converted into modern, intuitive formats like JSON and served through REST APIs, interdisciplinary students can extend the analysis of Latin literature using quantitative techniques such as natural language processing and computational semantics. There is precedent for such study in projects such as Open Source Shakespeare and the Latin Wordnet, from which VILLA draws some inspiration.

The Database
After downloading the zip file from Perseus, I looked through some example files to get a feeling for the data and metadata that I would have to work with. The schema below is an aid to help me think about how I should start cleaning and re-organizing the data. By no means should it be used as documentation for the coming database, but it could help prospective users think about the data that we’re aiming to deliver in the same way that it helps me to decide how to begin transforming the data into a more usable format.

I still need to examine more files to determine which TEI EpiDoc information is consistent across all the documents and which depends on the transcribers who worked on a given text. Based on initial analysis, though, the information in the schema should be available for the vast majority of the documents I have.

Missing Texts
An interesting anomaly I discovered while familiarizing myself with the corpus downloaded from Perseus is that some authors have no works associated with them in the dataset. For example, Augustus and Boethius both have top-level folders with nothing inside. A quick search shows that Perseus does have Augustus, so it’s hard to see why his work isn’t included in the downloadable texts. One possibility is that the EpiDoc annotation isn’t complete and Perseus didn’t want to release incomplete texts. Another is that Perseus obtained permission to transcribe the text from a volume under copyright, since all of its texts appear to be copied from print works, on the condition that it not redistribute the transcription. The first explanation is unsatisfying because the Perseus Project was undertaken 35 years ago, so it seems unlikely that nobody has had the chance to annotate Augustus yet. The second is also unsatisfying because these texts have been around for millennia, so I have a hard time believing that Perseus was unable to find an authoritative edition that has passed into the public domain. Of course, there are other places to find these texts so they can be added to VILLA, but that will take additional time and effort after the initial batch from Perseus has been cleaned and ingested.
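A check like this can be automated instead of done by browsing folders. The following is a minimal sketch, assuming the unzipped corpus has one top-level folder per author. The function name authors_without_works and the corpus_root argument are hypothetical and are not part of any Perseus tooling.

```python
from pathlib import Path


def authors_without_works(corpus_root):
    """Return the names of top-level folders that contain no files at any depth.

    Assumption: each top-level folder under corpus_root represents one author.
    """
    root = Path(corpus_root)
    empty = []
    for author_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # rglob("*") walks the whole subtree, so nested empty folders
        # still count as "no works".
        has_file = any(child.is_file() for child in author_dir.rglob("*"))
        if not has_file:
            empty.append(author_dir.name)
    return empty
```

Calling authors_without_works on the unzipped download directory would return a sorted list of folder names to follow up on, which could then be matched against other sources for the missing texts.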

Next Steps
While I’m still working out a database schema and learning more about the documents I’m working with, I’ll organize the documents into JSON files containing all data from the file tree and TEI tags. JSON maps more directly onto Python’s built-in dictionaries and lists than XML does, and it’s more intuitive and widely used. Information held in the file structure is usually found somewhere in the XML as well, and having three levels of file hierarchy containing author and language information needlessly hinders cross-work analysis. So I’ll flatten the file structure and ensure that all document information is contained within the document itself, instead of in a file name or path.
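The conversion described here could look roughly like the sketch below. It is an illustration under stated assumptions, not the project’s actual pipeline: it uses only Python’s standard library, it assumes each file has TEI title, author, and body elements, and it matches elements by local name so that files with or without an XML namespace are handled the same way. The function names and the record keys (title, author, text, source_path_parts) are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def first_text(root, name):
    """Return whitespace-normalized text of the first element with this
    local name, or None if no such element exists."""
    for elem in root.iter():
        if local_name(elem.tag) == name:
            return " ".join("".join(elem.itertext()).split())
    return None


def tei_to_record(xml_text, source_path):
    """Build one flat, JSON-serializable record from a TEI document.

    The path components are stored inside the record so that nothing
    depends on where the file used to live in the directory tree.
    """
    root = ET.fromstring(xml_text)
    return {
        "title": first_text(root, "title"),
        "author": first_text(root, "author"),
        "text": first_text(root, "body"),
        "source_path_parts": list(PurePosixPath(source_path).parts),
    }


def record_to_json(record):
    """Serialize a record, keeping non-ASCII characters readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)
```

This sketch deliberately throws away most of the TEI markup. A fuller conversion would need to decide which tags to keep as structured fields, which is the open question about consistency across documents noted earlier.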

After I have JSON documents that contain all the information currently held in the XML documents and file tree, I’ll move on to creating the database. It will likely resemble the schema laid out above, but I haven’t yet decided whether it will be a relational database, a document database, or a managed NoSQL service such as DynamoDB. A relational database offers familiarity in structure and SQL, but a document database may be a better fit for this project since the data will initially be organized into JSON. Lastly, without institutional backing, this project has financial concerns to take into account, and non-relational databases may be more cost-effective.
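To make the trade-off concrete, here is a small sketch of a middle path: a relational table with a few queryable columns plus the full JSON document stored alongside them. SQLite is used only because it ships with Python; it is a stand-in for illustration, not a decision about the eventual database. The table name works, its columns, and both function names are hypothetical.

```python
import json
import sqlite3


def load_records(conn, records):
    """Insert JSON-style records into a simple 'works' table.

    author and title become ordinary columns for SQL queries, and the
    whole record is kept as JSON text in the doc column.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS works ("
        "id INTEGER PRIMARY KEY, author TEXT, title TEXT, doc TEXT)"
    )
    conn.executemany(
        "INSERT INTO works (author, title, doc) VALUES (?, ?, ?)",
        [
            (r.get("author"), r.get("title"), json.dumps(r, ensure_ascii=False))
            for r in records
        ],
    )
    conn.commit()


def titles_by_author(conn, author):
    """Return the titles stored for one author, sorted alphabetically."""
    rows = conn.execute(
        "SELECT title FROM works WHERE author = ? ORDER BY title", (author,)
    )
    return [title for (title,) in rows]
```

A connection from sqlite3.connect(":memory:") is enough to try this out. The same records could instead be written unchanged to a document store, which is the appeal of settling on the JSON shape first.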

Ultimately, I’d like to create a user interface that allows students and researchers to create accounts where they can save annotations, commentary, online dictionary references, and in-depth quantitative analyses, but a front end is not within the scope of the first phase of this project.