Computing Melville is a small-scale annotation campaign on a collection of Herman Melville's manuscripts: his last novella, "Billy Budd", left unfinished in 1891, and "Rip Van Winkle's Lilac", an experimental combination of prose and poetry.
The campaign has been carried out using Transkribus, a platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents.
Our aim is to develop a Machine Learning system able to perform a transcription task on Melville's handwritten documents: the model is trained on a selection of chapters from "Billy Budd" manually annotated by the team, and tested on a selection of leaves from the "Rip Van Winkle's Lilac" manuscript.
The project has been developed for the "Semantic Digital Libraries" course of the Master's Degree in Digital Humanities and Digital Knowledge of the University of Bologna, under Prof. Giovanni Colavizza.
In this project's pipeline, the task definition step was fundamental: to perform supervised training on an ML model such as ours, it was crucial to first provide it with an annotated corpus to be used as a training set and a raw corpus for the evaluation phase.
When selecting our starting corpus, we decided to reuse data from the Melville Electronic Library (MEL). As our training corpus we selected 16 chapters from "Versions of Billy Budd", an edition comprising not only the reading text but also MEL's TextLab platform, which displays all manuscript leaves with their diplomatic transcriptions; digital images of the "Rip Van Winkle's Lilac" manuscript were adopted as the validation set.
Given the difficulty of interpreting the text of the manuscripts and the amount of data involved, we decided to rely on the transcriptions made available by MEL for the development of our training corpus.
However, we have adopted specific solutions for the challenges posed by the peculiarities of Melville's last novella: the original manuscript comprises many leaves and leaf fragments, a clear sign of the author's numerous revisions, which complicates the ML task:
| Issue | Solution |
|---|---|
| Mounts | When leaves of the manuscript display mounts covering large parts of strikethrough text, we have decided to consider and transcribe only the leaf closest to the final authorial version (that is, the leaf with the mount) |
| Additions | As we are more interested in showcasing an "analytic" transcription of Melville's manuscripts, every addition to the text has been left in its original position on the page |
| Numbers | All the numbers present on the pages, both at the top and the bottom, have been ignored, as the original source of the images does not clearly state their provenance and we are interested in Melville's handwriting only |
| Breaks | The glyphs marking section breaks between chapters have not been transcribed, to avoid any confusion in character recognition |
| Marks | Any other mark present on the page but not related to the content of the novella (e.g. circles, pencil smears) has been discarded |
We have followed Transkribus' Transcription Conventions, adapting them to our data when needed.
In particular, strikethrough and superscript text passages have proved to be a challenge.
For strikethrough text:

- Always tag the passage as strikethrough when it appears inline with the rest of the text line.
- When strikethrough co-occurs with superscript, the superscript tag is to be preferred.

For superscript text:

- When the x-height of the characters falls within the main line area, tag the passage as superscript.
- When the x-height of the characters falls outside the main line area, add an extra line just for the superscript and treat it as normal text (in edge cases, adding an extra line is preferred).
- With more than one superscript passage spaced out across the length of the main line and no interruptions, consider the superscript words as part of the same extra line and add 5 blank spaces between them.
- With more than one superscript passage spaced out across the length of the main line and interruptions (e.g. another superscript word falls within the main line area), add a separate extra line for each superscript word.
We have also made our own decisions regarding underlined text passages (which we have ignored) and additions (see the Annotation Model section).
First, each of us processed 10 leaves of the "Billy Budd" manuscript on Transkribus, performing both the layout parsing and the transcription. Then, we swapped the datasets and checked whether we agreed with the other annotator's decisions. In case of disagreement, we discussed and adjusted our annotation and transcription parameters accordingly. This was also our first pilot campaign, whose results are described further on.
Overall, we performed three different pilot campaigns.
Once the two annotators reached an agreement on how to follow the guidelines derived from the previous pilots, the proper annotation campaign finally began, with the annotation of the remainder of the selected corpus. Chapters 1 to 16 of Melville's "Billy Budd" and the last 7 pages of "Rip Van Winkle's Lilac", containing the homonymous poem, were fully annotated using the online software Transkribus Lite, reaching a total of 175 pages and more than 17,000 words.
The annotated data were then used to train two recognition models on the same software: Melville Handwriting and Melville Handwriting w/ Base Model (the latter built on top of an existing base model).
Both models were trained relying on the PyLaia HTR engine supported by Transkribus, and the parameters were defined according to the Transkribus guidelines: "How To Train and Apply Handwritten Text Recognition Models in Transkribus".
For both models we stuck to the default early stopping of 50 epochs and a learning rate of 0.0001, with the only difference being that, for Melville Handwriting w/ Base Model, the total number of epochs was lowered from 250 to 150 to avoid overfitting.
| Parameters | Melville Handwriting | Melville Handwriting w/ Base Model |
|---|---|---|
| Number of Epochs | 250 | 150 |
| Early Stopping | 50 | 50 |
| Learning Rate | 0.0001 | 0.0001 |
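To make the role of these parameters concrete, the sketch below shows, in plain Python, how an early-stopping criterion with a patience of 50 epochs interacts with the maximum number of epochs and the learning rate. It is an illustration of the concept only, not Transkribus' or PyLaia's actual code, and the function names (`train_one_epoch`, `evaluate_cer`) are hypothetical stand-ins for the engine's training and validation steps.

```python
# Minimal sketch of early stopping with a patience of 50 epochs.
# NOT the actual Transkribus/PyLaia implementation.

MAX_EPOCHS = 250       # 150 for Melville Handwriting w/ Base Model
PATIENCE = 50          # stop if the validation CER does not improve for 50 epochs
LEARNING_RATE = 1e-4   # 0.0001, as in the Transkribus guidelines

def train_with_early_stopping(train_one_epoch, evaluate_cer):
    best_cer = float("inf")
    epochs_without_improvement = 0
    for epoch in range(1, MAX_EPOCHS + 1):
        train_one_epoch(lr=LEARNING_RATE)   # one pass over the training set
        cer = evaluate_cer()                # CER on the validation set
        if cer < best_cer:
            best_cer = cer                  # keep track of the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            break                           # early stopping triggered
    return best_cer
```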
The accuracy of the two resulting models can be compared by analysing their Learning Curves (shown in the graphs below), which indicate the variation of the Character Error Rate (CER, i.e. the percentage of characters transcribed incorrectly by the text recognition model) over the number of epochs.
In the graphs, the blue line represents the progress made by the model on the training set, whereas the red one represents the progress of the evaluations on the validation set, on which the model is tested after training.
The trend and final value of the CER on the validation set are of course the most significant, as they show how well the model generalises, i.e. how it performs on pages it has not been trained on.
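For readers unfamiliar with these metrics, the sketch below shows how CER (and the word-level WER reported in the results table further down) can be computed as a normalised edit distance between a reference transcription and a model output. It is a self-contained illustration, not the routine Transkribus uses internally, and the short example line in the final comment is invented.

```python
# Illustrative computation of CER and WER as normalised edit distances.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / characters in the reference."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits / words in the reference."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# Example with a short invented line:
# cer("Billy Budd, Sailor", "Billy Bud, Sailer")  ->  0.111...  (~11% CER)
```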
The results of Melville Handwriting w/ Base Model are slightly better than those of the other model, and a CER of 10% or below can be considered very good for automated transcription; however, even Melville Handwriting, which stays below a CER of 20%, is more than sufficient to work with and could definitely be improved with further work.
| Results | Melville Handwriting | Melville Handwriting w/ Base Model |
|---|---|---|
| CER on Train Set | 13.11% | 9.00% |
| CER on Validation Set | 15.30% | 8.80% |
| Best WER | 46.7% | 28.1% |
Comparing the transcriptions made on the validation set by the annotators and by the trained models, some criticalities emerged that are worth highlighting.
As shown in the comparison below (annotators' transcription in green, trained model without base model in red) on the first page of "Rip Van Winkle's Lilac", it is clear that, despite being correctly tagged during the annotation campaign, neither strikethroughs nor superscripts were correctly recognised by the model during training.
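A simple way to surface such discrepancies, assuming the two transcriptions have been exported from Transkribus as plain text files, is a line-by-line diff; the file names in the sketch below are hypothetical.

```python
# Minimal sketch: line-by-line comparison between the annotators' ground-truth
# transcription and the model's output, assuming both were exported as plain text.
import difflib

with open("rip_van_winkle_p1_ground_truth.txt", encoding="utf-8") as f:
    reference_lines = f.read().splitlines()
with open("rip_van_winkle_p1_model_output.txt", encoding="utf-8") as f:
    model_lines = f.read().splitlines()

# Lines prefixed with "- " come from the annotators' transcription, lines prefixed
# with "+ " from the model; "? " lines pinpoint the differing characters.
for line in difflib.ndiff(reference_lines, model_lines):
    print(line)
```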
Certainly, this is only the beginning of what could be a much more extensive campaign on Melville's original manuscripts. The limited size of our team and, consequently, of the corpus we annotated prevented us from tackling more in-depth research on the topic. Some improvements could certainly be made by expanding the training corpus and by using higher-quality images (we could only take screenshots from MEL's website, as no download tool is made available).
However, we are fairly convinced that this project could stand as an inspiring push towards authorial annotation campaigns by means of AI and ML systems.
In designing our annotation campaign, we have tried to apply the FAIR principles for data publication, making the results of our research findable, accessible, interoperable and reusable.
In addition to the documentation provided on this website, you can freely download the full report about this transcription and annotation campaign.