The URL of EarlyPrint takes you to its home page. Clicking on Browse takes you inside, where you find your way around by using two search tools separately or in combination. One of them is a list of texts with a filter form for restricting which texts are listed, hereafter called the Text Filter, and the other is a Text Search form. For our purposes, filtering texts refers to choosing which texts are listed based on one or more items of bibliographical information or other attributes of entire texts, whereas searching refers to looking for words within the texts.
The Text Filter
By default, all available texts will be listed; to restrict that list, use the form near the top left of your screen that looks like this:
Select one or more criteria and click the filter button; the list will change to display only texts that match all the criteria you specified. Click clear to remove the filter and return to the default list of all texts. The table below describes the filter criteria in detail.
|Author||Enter all or part of an author's name.|
|Title||Enter all or part of a title.|
|Year||Refers to the date of composition or first performance, not the publication date of the printed work. To filter by a single year (1595), simply enter it in the first of the two year fields. To filter by a range of years (1640 to 1660), enter the starting and ending years in the two year fields.|
|Identifier||Each text on this site has a TCP identifier, which consists of five digits preceded by 'A' or 'B' and corresponds very roughly to the order in which the texts were transcribed by the Text Creation Partnership. The familiar STC and ESTC identifiers are supported in addition to TCP; enter whichever of these identifiers you happen to have handy. To find a single text by an ESTC identifier that carries a prefix such as "Wing" or "Thomason", omit the prefix and enter only the identifier. If, however, you want to find all texts having one of those designations, simply enter "Wing" or "Thomason" in the identifier field.|
A keyword from a Library of Congress subject heading. For example, A21238 The Queen's Entertainment at Woodstock has the subject headings:
A category describing the kind of text. Currently includes:
|Curator||Enter the name of a curator to see texts curated by that person.|
|Proofreader||Enter the name of a proofreader to see texts proofread by that person. Not all texts have yet been proofread.|
|Grade||A grade from A through F representing the completeness of the transcription and its state of correction. Texts graded "A" have no known defects. Texts graded "F" have more than 100 defects per 10,000 words. For more detail, see Curation and Quality Assurance. After selecting a grade, use the adjacent drop-down to refine the match: "exactly" chooses only texts having the specified grade, "or better" chooses texts having a grade at least as good as the specified grade, and "or worse" chooses texts having a grade as bad as or worse than the specified grade.|
|Page images?||Select "Yes" or "No". "Yes" returns texts for which we provide fresh images for side-by-side display and "No" excludes those texts. Leave unselected to display all texts regardless of whether images are available. Images are from a copy of the same edition (but not necessarily the same copy) as the transcription's source.|
|Corpus||A corpus in this context is simply a grouping of texts falling within a particular area of interest. A text may be a member of any number of corpora. Currently "Drama" and "English Civil War" are available; other areas of interest will certainly emerge as we add more texts.|
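The way the criteria combine can be illustrated with a short Python sketch. It is not the site's implementation: the `TextRecord` class, the function names, the sample records, and the choice of case-insensitive partial matching for author and title are all assumptions made for illustration. It shows only three rules described in the table: every criterion must match, a single year matches exactly while two years define a range, and grades can be compared as "exactly", "or better", or "or worse".

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Hypothetical stand-in for one row of the text list."""
    identifier: str
    author: str
    title: str
    year: int
    grade: str  # "A" is the best grade, "F" the worst


def year_matches(year, start=None, end=None):
    """One year means an exact match; two years mean an inclusive range."""
    if start is None:
        return True
    if end is None:
        return year == start
    return start <= year <= end


def grade_matches(grade, selected=None, mode="exactly"):
    """Compare letter grades; earlier letters are better grades."""
    if selected is None:
        return True
    if mode == "exactly":
        return grade == selected
    if mode == "or better":
        return grade <= selected
    if mode == "or worse":
        return grade >= selected
    raise ValueError(f"unknown grade mode: {mode}")


def filter_texts(texts, author=None, title=None, start_year=None,
                 end_year=None, grade=None, grade_mode="exactly"):
    """Keep only the texts that satisfy every criterion that was supplied."""
    matches = []
    for text in texts:
        # Assumption: partial, case-insensitive matching on names and titles.
        if author and author.lower() not in text.author.lower():
            continue
        if title and title.lower() not in text.title.lower():
            continue
        if not year_matches(text.year, start_year, end_year):
            continue
        if not grade_matches(text.grade, grade, grade_mode):
            continue
        matches.append(text)
    return matches


# Invented records, used only to exercise the sketch.
SAMPLE = [
    TextRecord("A00001", "Alpha Author", "First Play", 1595, "A"),
    TextRecord("A00002", "Beta Writer", "Second Play", 1645, "C"),
    TextRecord("A00003", "Alpha Author", "Third Tract", 1650, "F"),
]

civil_war_era = filter_texts(SAMPLE, start_year=1640, end_year=1660,
                             grade="C", grade_mode="or better")
print([t.identifier for t in civil_war_era])
```

With no arguments, `filter_texts(SAMPLE)` returns every record, which mirrors the default unfiltered list.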
If you have not entered any filter criteria in the Text Filter, the application shows you all the documents in no particular order. Once you filter the list as described above, you can then open a particular document and work with it, or you can run Text Search across the subset of texts retrieved by Text Filter.
Text Search is a complex tool, and takes a little practice to become familiar with its tricks. It shows up as a tall and narrow image on the right side of the screen, and its top looks like this:
You enter your search term(s) on the line with the greyed-out text “search for”. If you click the line with the legend “Any Search Term”, it expands to a dropdown menu from which you choose the type of search you will use. The line with the legend “Search Sections” expands to a choice between searching in text or in headings and other forms of paratext.
Text Search lets you search for words or phrases in various ways, none of which is case sensitive. Here are the different ways of looking for ‘love’ and ‘death’:
- Entering ‘love’ and ‘death’ with Any Search Term retrieves 7268 hits in documents that contain either ‘love’ or ‘death’.
- Entering ‘love’ and ‘death’ with All Search Terms retrieves 2494 hits in documents that contain both ‘love’ and ‘death’.
- Entering ‘love’ and ‘death’ with Phrase Search retrieves 11 hits in documents that contain the sequence ‘love death’ with no intervening words but ignoring punctuation.
- Entering ‘love’ and ‘death’ with Proximity Search (ordered) retrieves 141 hits in documents where ‘death’ follows ‘love’ with or without intervening words.
- Entering ‘love’ and ‘death’ with Proximity Search (unordered) retrieves 222 hits in documents where both ‘death’ and ‘love’ occur in any order and with no or some words intervening.
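The differences between these search types can be made concrete with a small Python sketch that works on a single passage. It is an illustration under stated assumptions, not a description of the site's search engine: the function names are hypothetical, and the `window` parameter (the maximum distance in words for a proximity match) is an assumption, since the size of the window the site uses is not stated here.

```python
import re


def tokenize(text):
    """Lower-case the text and drop punctuation, keeping only the words."""
    return re.findall(r"\w+", text.lower())


def any_term(tokens, terms):
    """Any Search Term: at least one of the terms occurs."""
    return any(term in tokens for term in terms)


def all_terms(tokens, terms):
    """All Search Terms: every term occurs somewhere."""
    return all(term in tokens for term in terms)


def phrase(tokens, terms):
    """Phrase Search: the terms occur in sequence with nothing in between."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def proximity(tokens, first, second, window=5, ordered=True):
    """Proximity Search: both terms occur within `window` words of each other.

    `window` is an assumed parameter for this sketch.
    """
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and 0 < gap <= window:
                return True
            if not ordered and 0 < abs(gap) <= window:
                return True
    return False


adjacent = tokenize("Love, death and honour")
reversed_order = tokenize("Death comes before love")

print(phrase(adjacent, ["love", "death"]))                         # punctuation is ignored
print(proximity(reversed_order, "love", "death", ordered=True))    # 'death' does not follow 'love'
print(proximity(reversed_order, "love", "death", ordered=False))   # order does not matter
```

Because the tokenizer lower-cases everything and discards punctuation, the sketch reproduces two behaviours mentioned above: searches are not case sensitive, and a phrase match ignores intervening punctuation.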
In addition to the search types listed above, Text Search offers ‘fuzzy’, ‘regex’ and ‘Ngram’ search options, each of which deserves a little explanation.
Fuzzy searching. ‘Edit distance’ is a useful concept for establishing how close words are to each other. ‘Cat’ is one edit away from ‘Hat’, two from ‘Hit’, and three from ‘Hid’. ‘Fuzzy’ searching is a way of looking for spellings whose edit distance from the search term stays within some range. Fuzzy searching works much better with long words than with short ones. Most of the time, however, you will be better off with the regex search described in the next paragraph.
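For readers who want to see how edit distance is computed, here is a minimal Python sketch of the standard Levenshtein distance, which counts single-character insertions, deletions and substitutions. Whether the site's fuzzy search uses exactly this measure is an assumption; the function names `edit_distance` and `fuzzy_matches` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two words, ignoring case."""
    a, b = a.lower(), b.lower()
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, candidates, max_distance=1):
    """Return the candidates within `max_distance` edits of the term."""
    return [word for word in candidates
            if edit_distance(term, word) <= max_distance]


print(edit_distance("Cat", "Hat"), edit_distance("Cat", "Hit"), edit_distance("Cat", "Hid"))
print(fuzzy_matches("cat", ["hat", "cot", "car", "cut", "dog"]))
```

The second call hints at why short words are troublesome: at a distance of one, ‘cat’ already matches several unrelated words, whereas a long word has far fewer accidental neighbours.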
Regex searching. “Regular expression” is the name of a very powerful and widely used pattern-matching language. If you know how to use it, you will often wonder how you ever got along without it. If you don’t know it and you often look for words or phrases, it is well worth learning, and even a little knowledge of it goes a long way. Microsoft’s Regular Expression Language-Quick Reference offers a clear introduction.
The word ‘love’ is typically spelled ‘loue’ before 1630. The search term ‘lo[uv]e’ lets you look for either spelling. The square brackets are ‘metacharacters’ in regular expressions. The characters inside them are alternatives. So the regular expression ‘lo[uv]e’ will retrieve any spelling that begins with ‘lo’, ends with ‘e’ and has ‘u’ or ‘v’ in the middle. This search retrieves 7570 hits. Having half a dozen regex tricks up your sleeve will make you a much more sophisticated searcher.
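The pattern can be tried outside the site as well. The sketch below uses Python's `re` module, whose syntax may differ in details from the regex engine behind Text Search; the word list is invented for illustration, and matching against whole words is an assumption based on the description above.

```python
import re

# 'lo', then either 'u' or 'v', then 'e'; case is ignored, as in Text Search.
pattern = re.compile(r"lo[uv]e", re.IGNORECASE)

# Invented word list for illustration.
words = ["loue", "love", "lone", "loved", "Love", "lose"]

# fullmatch requires the whole word to fit the pattern.
hits = [word for word in words if pattern.fullmatch(word)]
print(hits)
```

Only the three spellings of ‘love’ pass: ‘lone’ and ‘lose’ have the wrong middle letter, and ‘loved’ is longer than the pattern allows.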
The regex engine on this site has some limitations. It does not recognize the non-alphabetic Unicode characters that EarlyPrint uses to mark unknown characters.
Ngram searching. An Ngram is a sequence of n words or characters: there are unigrams, bigrams, trigrams, tetragrams, and so on. In this environment, ‘Ngram’ refers to a sequence of characters, whether or not they form complete words. In most circumstances, regex searches will be more useful, but because of limitations in the regex engine you must use the Ngram search if you want to look for words that contain the non-alphabetic characters. Use the following Ngram searches for different categories of defective words:
- The black circle ‘●’ (\u25cf) is the symbol for an untranscribed alphanumerical character. The Ngram search with that search term will return 2617 hits.
- The black small square ‘▪’ (\u25aa) is the symbol for an undefined punctuation mark. The Ngram search with that search term will return 1659 hits.
- The lozenge ‘◊’ (\u25ca) is the symbol for a missing word. The Ngram search with that search term will return 976 hits.
- A horizontal ellipsis within mathematical angle brackets ‘⟨…⟩’ (\u27e8, \u2026, \u27e9) is the symbol for a missing ‘span’ or short passage of undefined length. The Ngram search with that search term returns only one hit.
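The four symbols above can also be located in a plain-text copy of a transcription with ordinary substring matching, which is essentially what a character Ngram search does. The Python sketch below is an illustration only: the `DEFECT_SYMBOLS` table, the function names, and the sample line are made up for the example and do not come from the site.

```python
# Code points as listed above, keyed by a descriptive label.
DEFECT_SYMBOLS = {
    "untranscribed character": "\u25cf",         # black circle
    "undefined punctuation mark": "\u25aa",      # black small square
    "missing word": "\u25ca",                    # lozenge
    "missing span": "\u27e8\u2026\u27e9",        # ellipsis in angle brackets
}


def words_containing(text, ngram):
    """Return the whitespace-separated words that contain the character sequence."""
    return [word for word in text.split() if ngram in word]


def count_defects(text):
    """Count how often each defect symbol occurs in the text."""
    return {label: text.count(symbol) for label, symbol in DEFECT_SYMBOLS.items()}


# Invented line containing one example of each kind of defect.
sample = "the k\u25cfng sayd \u25ca to his lo\u25cfds\u25aa \u27e8\u2026\u27e9"

print(words_containing(sample, DEFECT_SYMBOLS["untranscribed character"]))
print(count_defects(sample))
```

Because the match is on raw character sequences rather than on word patterns, it is not affected by whether a regex engine treats these symbols as word characters.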