Computing Melville is a small-scale annotation campaign on a collection of Herman Melville's manuscripts: his last novella, "Billy Budd", left unfinished in 1891, and "Rip Van Winkle's Lilac", an experimental combination of prose and poetry.
The campaign has been carried out using Transkribus, a platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents.
Our aim is to develop a Machine Learning system able to perform a transcription task on Melville's handwritten documents: the model is trained on a selection of chapters from "Billy Budd" manually annotated by the team, and tested on a selection of leaves from the "Rip Van Winkle's Lilac" manuscript.
The project has been developed for the "Semantic Digital Libraries" course of the Master's Degree in Digital Humanities and Digital Knowledge of the University of Bologna, under Prof. Giovanni Colavizza.
In this project's pipeline, the task definition step was fundamental: to perform supervised training on an ML model such as ours, it was crucial to first provide it with an annotated corpus to be used as a training set and a raw corpus for the evaluation phase.
When selecting our starting corpus, we decided to reuse data from the Melville Electronic Library (MEL). As our training corpus we selected 16 chapters from "Versions of Billy Budd", an edition comprising not only the reading text but also MEL's TextLab platform, which displays all manuscript leaves with their diplomatic transcriptions; digital images of the "Rip Van Winkle's Lilac" manuscript were adopted as the validation set.
Given the difficulty of interpreting the text of the manuscripts and the amount of data involved, we decided to rely on the transcriptions made available by MEL for the development of our training corpus.
However, we have adopted specific solutions for the challenges posed by the peculiarities of Melville's last novella: the original manuscript comprises many leaves and leaf fragments, a clear sign of the author's numerous revisions, which complicates the ML task:
| Issue | Solution |
|---|---|
| Mounts | When leaves of the manuscript display mounts covering large parts of strikethrough text, we have decided to consider and transcribe only the leaf closest to the final authorial version (that is, the leaf with the mount) |
| Additions | As we are more interested in showcasing an "analytic" transcription of Melville's manuscripts, every addition to the text has been left in its original position on the page |
| Numbers | All the numbers present on the pages, both at the top and the bottom, have been ignored, as the original source of the images does not clearly state their provenance and we are interested in Melville's handwriting only |
| Breaks | The glyphs marking section breaks between chapters have not been transcribed, to avoid any confusion in character recognition |
| Marks | Any other mark present on the page but not related to the content of the novella (e.g. circles, pencil smears) has been discarded |
We have followed Transkribus' Transcription Conventions, adapting them to our data when needed.
In particular, strikethrough and superscript text passages have proved to be a challenge.
For strikethrough text:

- Always tag the passage as strikethrough when it appears inline with the rest of the text line.
- When strikethrough co-occurs with superscript, the superscript tag is to be preferred.

For superscript text:

- When the x-height of the characters falls within the main line area, tag the passage as superscript.
- When the x-height of the characters falls outside the main line area, add an extra line just for the superscript and treat it as normal text (in edge cases, adding an extra line is preferred).
- With more than one superscript passage spaced out across the length of the main line and no interruptions, consider the superscript words as part of the same extra line and add 5 blank spaces between them.
- With more than one superscript passage spaced out across the length of the main line and interruptions (e.g. another superscript word falls within the main line area), add a separate extra line for each superscript word.
We have also made our own decisions regarding underlined text passages (which we have ignored) and additions (see the Annotation Model section).
First, each of us processed 10 leaves of the "Billy Budd" manuscript on Transkribus, performing both the layout parsing and the transcription. Then, we swapped the datasets and checked whether we agreed with the other annotator's decisions. In case of disagreement, we discussed and adjusted our annotation and transcription parameters accordingly. This was also our first pilot campaign, whose results are described further on.
Overall, we performed three different pilot campaigns.
Once the two annotators reached an agreement on how to follow the guidelines derived from the previous pilots, the proper annotation campaign finally began, with the annotation of the remainder of the selected corpus. Chapters 1 to 16 of Melville's "Billy Budd" and the last 7 pages of "Rip Van Winkle's Lilac", containing the homonymous poem, were fully annotated using the online software Transkribus Lite, reaching a total of 175 pages and more than 17,000 words.
The annotated data were then used to train two recognition models on the same software: Melville Handwriting and Melville Handwriting w/ Base Model (the latter built on top of an existing base model).
Both models were trained relying on the PyLaia HTR engine supported by Transkribus, and the parameters were defined according to the Transkribus guidelines: "How To Train and Apply Handwritten Text Recognition Models in Transkribus".
For both models we stuck to the default early stopping of 50 epochs and a learning rate of 0.0001, with the only difference being that, for Melville Handwriting w/ Base Model, the total number of epochs was lowered from 250 to 150 to avoid overfitting.
| Parameters | Melville Handwriting | Melville Handwriting w/ Base Model |
|---|---|---|
| Number of Epochs | 250 | 150 |
| Early Stopping | 50 | 50 |
| Learning Rate | 0.0001 | 0.0001 |
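To make the role of these parameters concrete, the sketch below shows, in plain Python, how an early-stopping criterion with a patience of 50 epochs interacts with the maximum number of epochs and the learning rate. It is an illustration of the concept only, not Transkribus' or PyLaia's actual code, and the function names (`train_one_epoch`, `evaluate_cer`) are hypothetical stand-ins for the engine's training and validation steps.

```python
# Minimal sketch of early stopping with a patience of 50 epochs.
# NOT the actual Transkribus/PyLaia implementation.

MAX_EPOCHS = 250       # 150 for Melville Handwriting w/ Base Model
PATIENCE = 50          # stop if the validation CER does not improve for 50 epochs
LEARNING_RATE = 1e-4   # 0.0001, as in the Transkribus guidelines

def train_with_early_stopping(train_one_epoch, evaluate_cer):
    best_cer = float("inf")
    epochs_without_improvement = 0
    for epoch in range(1, MAX_EPOCHS + 1):
        train_one_epoch(lr=LEARNING_RATE)   # one pass over the training set
        cer = evaluate_cer()                # CER on the validation set
        if cer < best_cer:
            best_cer = cer                  # keep track of the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            break                           # early stopping triggered
    return best_cer
```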
The accuracy of the two resulting models can be compared by analysing their Learning Curves (shown in the graphs below), which indicate the variation of the Character Error Rate (CER, i.e. the percentage of characters transcribed incorrectly by the text recognition model) over the number of epochs.
In the graphs, the blue line represents the progress made by the model on the training set, whereas the red one represents the progress of the evaluations on the validation set, on which the model is tested after training.
The trend and final value of the CER on the validation set are of course the most significant, as they show how well the model generalises, i.e. how it performs on pages it has not been trained on.
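For readers unfamiliar with these metrics, the sketch below shows how CER (and the word-level WER reported in the results table further down) can be computed as a normalised edit distance between a reference transcription and a model output. It is a self-contained illustration, not the routine Transkribus uses internally, and the short example line in the final comment is invented.

```python
# Illustrative computation of CER and WER as normalised edit distances.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / characters in the reference."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits / words in the reference."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# Example with a short invented line:
# cer("Billy Budd, Sailor", "Billy Bud, Sailer")  ->  0.111...  (~11% CER)
```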
The results of Melville Handwriting w/ Base Model are slightly better than those of the other model, and a CER of 10% or below can be considered very good for automated transcription; however, even Melville Handwriting, which stays below a CER of 20%, is more than sufficient to work with and could definitely be improved with further work.
| Results | Melville Handwriting | Melville Handwriting w/ Base Model |
|---|---|---|
| CER on Train Set | 13.11% | 9.00% |
| CER on Validation Set | 15.30% | 8.80% |
| Best WER | 46.7% | 28.1% |
Comparing the transcriptions made on the validation set by the annotators and by the trained models, some criticalities emerged that are worth highlighting.
As shown in the comparison below (annotators' transcription in green, trained model without base model in red) on the first page of "Rip Van Winkle's Lilac", it is clear that, despite being correctly tagged during the annotation campaign, neither strikethroughs nor superscripts were correctly recognised by the model during training.
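A simple way to surface such discrepancies, assuming the two transcriptions have been exported from Transkribus as plain text files, is a line-by-line diff; the file names in the sketch below are hypothetical.

```python
# Minimal sketch: line-by-line comparison between the annotators' ground-truth
# transcription and the model's output, assuming both were exported as plain text.
import difflib

with open("rip_van_winkle_p1_ground_truth.txt", encoding="utf-8") as f:
    reference_lines = f.read().splitlines()
with open("rip_van_winkle_p1_model_output.txt", encoding="utf-8") as f:
    model_lines = f.read().splitlines()

# Lines prefixed with "- " come from the annotators' transcription, lines prefixed
# with "+ " from the model; "? " lines pinpoint the differing characters.
for line in difflib.ndiff(reference_lines, model_lines):
    print(line)
```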
Certainly, this is only the beginning of what could be a much more extensive campaign on Melville's original manuscripts. The limited size of our team and, consequently, of the corpus we annotated prevented us from tackling more in-depth research on the topic. Some improvements could certainly be made by expanding the training corpus and by using higher-quality images (we could only take screenshots from MEL's website, as no download tool is made available).
However, we are fairly convinced that this project could stand as an inspiring push towards authorial annotation campaigns by means of AI and ML systems.
In designing our annotation campaign, we have tried to apply the FAIR principles for data publication, making the results of our research findable, accessible, interoperable and reusable.
In addition to the documentation provided on this website, you can freely download the full report about this transcription and annotation campaign.