Curation and Quality Assurance

The texts of the Text Creation Partnership co-exist in various states of (im)perfection. They contain known as well as unknown defects. Known defects are individual characters, words, lines, chunks or pages that were explicitly marked as illegible or missing by the professional typists who transcribed the texts from digital scans of microfilm images. The quality of their work is largely a function of the quality of the digital image they had before them.

Defects cluster heavily in texts transcribed from poor images. Texts in the top quartile have no defects. Texts in the bottom quartile account for 81% of all defects. Texts in the interquartile range have between 1 and 35 defects per 10,000 words and account for 19% of all defects. The following table uses quartiles, the conventional academic grading scheme, and everyday language to show the distribution of textual corruption in the TCP Archive.

Defect rates per 10,000 words for 25,000 TCP texts

Grade	Percentile	Min	Median	Max	% of all defects
A	0-25	0	0	0	0
B	25-50	1	8	10	3
C	50-75	10	18	35	16
D	75-90	35	57	100	35
F	90-100	100	168	2857	46
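
As a rough illustration of how the grades in the table could be assigned, the sketch below computes a defect rate per 10,000 words and maps it to a letter grade. The cut-offs come from the Min and Max columns above. The function names are hypothetical, and the handling of boundary values is an assumption: a rate of exactly 10, 35, or 100 is given the better grade, and any rate above zero but below 1 is treated as a B.

```python
def defect_rate(defects: int, words: int) -> float:
    """Return the number of defects per 10,000 words."""
    if words <= 0:
        raise ValueError("words must be positive")
    return defects * 10_000 / words


def grade(rate: float) -> str:
    """Map a defect rate to a letter grade using the table's cut-offs.

    Assumption: boundary values (10, 35, 100) receive the better grade.
    """
    if rate == 0:
        return "A"
    if rate <= 10:
        return "B"
    if rate <= 35:
        return "C"
    if rate <= 100:
        return "D"
    return "F"


# A 20,000-word text with 28 marked defects has a rate of 14 and grades as C.
print(grade(defect_rate(28, 20_000)))
```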

Humans have a deeply ingrained habit of judging a barrel by its worst apples. The reputation of the TCP archive in some scholarly circles has suffered from that habit. The 18th-century Shakespeare editor Edmond Malone said somewhere something like “the text of our author is not as corrupt as people think.” Something similar could be said of the TCP texts. That said, very few TCP texts have been proofread from cover to cover, and many of them require some editorial attention before they can be certified as good enough for most scholarly purposes.

Three generations of undergraduates at Amherst, Northwestern, and Washington University in St. Louis demonstrated that undergraduates are very good at performing the essential tasks of “textkeeping”. They worked on the texts of some 500 plays from before 1642 and corrected ~50,000 defects. Playbooks from before 1642 on average have higher defect rates than TCP texts as a whole, partly because earlier texts have more defects and partly because many playbooks were poorly printed to begin with. The table below shows the difference between TCP texts in general and playbooks in particular, as well as the difference that undergraduate curators made to the quality of the texts they curated:

Defect rates per 10,000 words for 510 playbooks before and after curation

Corpus	25th percentile	Median	75th percentile	90th percentile
25,000 uncurated TCP texts	1	8	35	100
510 playtexts before curation	5	14	62	126
510 playtexts after curation	0	1.3	6.4	47.2

This reduction of textual defects by an order of magnitude (from a median value of 14 to a median value of 1.3 per 10,000 words) is visible to the average reader. Most plays run to between 20,000 and 25,000 words, so the number of known defects in a typical play drops from ~30 to ~3.
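
The arithmetic behind those per-play figures can be checked with a few lines of Python. This is only a sketch: it applies the median rates from the table to the assumed play lengths of 20,000 and 25,000 words, and the function name is hypothetical.

```python
def defects_per_play(rate_per_10k: float, words: int) -> float:
    """Convert a rate per 10,000 words into a defect count for one text."""
    return rate_per_10k * words / 10_000


# Median rates for the 510 playtexts, taken from the table above.
for label, rate in [("before curation", 14), ("after curation", 1.3)]:
    low = defects_per_play(rate, 20_000)
    high = defects_per_play(rate, 25_000)
    print(f"{label}: {low:.2f} to {high:.2f} defects per play")
```

Before curation the range is 28 to 35 defects per play; after curation it is about 2.6 to 3.25, which matches the rounded figures of ~30 and ~3.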

Unknown defects

However, these numbers don’t tell the full story. We can measure the rates of known defects (missing letters, words, punctuation marks, lines, and pages) in TCP texts because the transcribers marked these gaps, but we have no immediate data about two kinds of “unknown” defects: transcriber errors and printer errors. Typical errors include:

typographical errors, whether the printer’s or transcriber’s: ‘aſſliction’ => ‘affliction’; ‘hnsband’ => ‘husband’
words wrongly joined: ‘thyspels’ => ‘thy spels’
words wrongly split: ‘neeren esse’ => ‘neerenesse’
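
To make the last two categories concrete, here is a minimal sketch of how wrongly joined and wrongly split words might be flagged for human review. It is not a description of the project's actual tooling. It assumes a lexicon of attested spellings is available (the tiny `lexicon` set below is a stand-in), and the function name is hypothetical.

```python
def flag_split_and_joined(tokens, lexicon):
    """Return (index, kind, suggestion) candidates for a human to review."""
    candidates = []
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            continue  # known spelling, nothing to flag
        # Wrongly split: this token plus the next one forms a known word.
        if i + 1 < len(tokens) and tok + tokens[i + 1] in lexicon:
            candidates.append((i, "split", tok + tokens[i + 1]))
            continue
        # Wrongly joined: some cut point yields two known words.
        for cut in range(1, len(tok)):
            left, right = tok[:cut], tok[cut:]
            if left in lexicon and right in lexicon:
                candidates.append((i, "joined", f"{left} {right}"))
                break
    return candidates


# Stand-in lexicon and tokens built from the examples above.
lexicon = {"thy", "spels", "neerenesse", "of"}
tokens = ["thyspels", "of", "neeren", "esse"]
print(flag_split_and_joined(tokens, lexicon))
```

A check of this kind only proposes candidates. Early modern spelling varies widely, so every suggestion would still need to be confirmed against the page image.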

Printer errors occur in the process of printing the original book and would have been corrected by writers or printers had they caught them. It is part of our editorial policy to correct such errors when we find them. The TCP corpus contains many instances of printers testifying to the shortcomings of their trade. Quite often they ask for the reader’s help, as in the following plea from the Errata section of Harding’s Sicily and Naples, a mid-seventeenth-century play:

— Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.

Samuel Garey’s Great Brittans little calendar concludes with the terse and elegant Latin epigraph:

— Candido lectori: Humanum eſt errare, errata hic corrige (lector) quae penna, aut praelo lapſa fuiſſe vides.

That is an appeal to the gentle reader to correct “lapses of the pen or press”, since to err is human.

Transcriber errors are mistakes made by the TCP transcribers. While the TCP spot-checked a sample of pages for transcription accuracy, the transcriptions have not been fully proofread against high-quality images. To date, four texts in SHC have been proofread against good facsimile images, but most texts will require further editorial attention before they can be certified as good enough for most scholarly purposes. In our digital environment, most unknown defects can be easily spotted and corrected, and those corrections are reviewed, approved, and logged in the same way as changes to known defects. In our experience, there is roughly one unknown defect lurking for every five known defects.