Overview

Correcting texts through the EarlyPrint Library is convenient and rewarding, but it’s sometimes necessary to correct thousands of similar words at a time. There are many repeated errors and similar missing characters in the EarlyPrint corpus, and a majority of these can be fixed using the surrounding context of the word, without the need to consult page images.

This kind of text correction requires a different set of tools. In this short tutorial, you’ll learn how to correct texts in a database that we at EarlyPrint have made available. Since this data is available in a database system called PostgreSQL, we recommend using a free desktop application called pgAdmin. In the sections below, you’ll learn:

  • how to download, install, and work with pgAdmin
  • how to access EarlyPrint’s remote database
  • how to select specific rows and columns to review
  • how to edit and update selected records

Setup

To begin, you’ll need to install pgAdmin. Go to their website, click “Download,” and select the installer appropriate to your computer and/or operating system. Once downloaded, follow the instructions to install and open pgAdmin.

When you first open pgAdmin, it will ask you to create a master password. This is unrelated to the EarlyPrint database and unique to you, so please choose any password you like. But make sure you remember it! You’ll need it every time you reopen pgAdmin.

Once you’ve installed the application and logged in, you should see a screen like this:

The pgAdmin login screen.

To set up the EarlyPrint database, click “Add New Server.” You’ll be prompted to enter a name for the database, which can be anything at all (we recommend “earlyprint”).

Enter database Name.

Once you’ve entered a name in the “General” tab, you can click the “Connection” tab to enter the information about the EarlyPrint database. Enter the following:

Host name/addresses: eumaios.ads.northwestern.edu
Port: 5432
Maintenance database: earlyprint

You will also need to enter a username and password for the database. If you’re interested in correcting texts this way, contact Martin Mueller for a valid username and password. When you’re done filling out the form, it will look like this:

Enter database connection.

You can select “Save password” so that pgAdmin will remember your credentials later, and you can ignore the rest of the tabs in this window (SSL, SSH Tunnel, etc.). When you’re ready, click “Save” and pgAdmin will set up your connection. When you’re done, you will see this main page:

pgAdmin dashboard.

On the right is a dashboard showing the database activity and currently logged on users, and you can mostly ignore this screen. On the left is a series of dropdown menus containing all of the tables and data that are part of the “earlyprint” database. It’s from here that you’ll be able to select and edit information in the database.

n.b. When you log in to the database, you may get a warning message that the database software isn’t up to date with your version of pgAdmin. This only refers to the software running the database on our server and the software that sits behind pgAdmin, but it is unlikely to cause any problems for this workflow. You can safely ignore this message.

Selecting Words and Texts

Viewing Data

Now you’re ready to work with the data itself. Servers contain databases, and databases contain tables where data is stored in rows and columns. This is why pgAdmin’s dropdown menus appear so complicated. To see a list of available tables in the earlyprint database, navigate the dropdown like this:

Servers > earlyprint > Databases > earlyprint > Schemas > public > Tables

You should see a list like this containing the tables currently available in the database:

Tables in earlyprint database.

The list of tables shown above is what is currently in the database; you may encounter a slightly different list. The main table you should care about is currently called eebo1623 and contains words from the first 150 years of the EarlyPrint corpus. To view the data, right-click on the table and select View/Edit Data > First 100 Rows. (Selecting “All Rows” may be very slow, and at this stage you’re just trying to get a sense of what’s in the database.) The results will look like this:

The first 100 rows of eebo1623.

At the bottom of the screen, under “Data Output” you’ll see a scrollable table of the first 100 rows of data. This is a “tabularized” output of the EarlyPrint texts, in which every row represents one word. The columns include all sorts of contextual information, including the word’s part of speech, standard spelling, and lemma; IDs for the individual word as well as for the text in which that word appears; “kwic” columns that show the keyword-in-context text that appears to the right and left of that word; and columns to contain corrected or updated information, which begin with “cor.”

Selecting Data

In the image above, the bottom half of the screen shows the data, and the top half of the screen shows the command that retrieved the data. This command is written in SQL, Structured Query Language, which is the most common way of interacting with relational databases. Our database system, PostgreSQL, uses SQL as its main language. (Though there are some commands unique to Postgres, the vast majority are part of the standard SQL syntax.)

SQL can do a lot of things, but in this tutorial we’ll just focus on a few simple commands that will get you very far. The SQL tutorials from W3Schools are a great resource if you’d like to learn more.

To write a new SQL query, click the “Query Tool” button at the top left of this screen (it looks like a little server with a play button on top of it). This will open a new, blank Query Tool window:

Blank Query Tool.

Now you’re ready to type and run a SQL command. To display data, you’ll use a SELECT query, in which you select specific columns and rows according to certain conditions. You can use the following command as an example:

SELECT xmlid,spelling,standard,lemma,corstandard FROM eebo1623
WHERE lemma = 'for' LIMIT 50;

In the command above, you are asking SQL to SELECT five columns (xmlid, spelling, standard, lemma, and corrected standard) FROM the table “eebo1623”. Rather than showing all rows, you’ve asked SQL to only show those rows WHERE the lemma column’s value is equal to “for”, and you’d like to LIMIT it further to only show 50 records.

[A few notes: It’s convention to put all of the SQL command words in all caps, but pgAdmin will understand you if you put them in lowercase, too. In some SQL applications, you must put the whole command on a single line, but pgAdmin lets you split a command onto several lines. pgAdmin will let you skip the semicolon at the end of the command, but it’s good to get into the habit of using it since other programs require it. Finally, it’s best to include xmlid in every SELECT statement that you write. Xmlid is the “primary key” of the database, and including it in all output will make sure that you can edit columns.]

You can execute the command by hitting the triangle “play” button at the center of the toolbar at the top of the screen. The results will look like this:

An example SELECT query.

You can see that the Data Output section has given you everything you asked for: 50 records where the lemma is “for”, showing just the five columns we selected. Note that in the “corstandard” column some of the standard spellings have already been fixed, but some haven’t yet been worked on (their value is “null”). These would be ones you could update with a corrected standard spelling.

Now you’re ready for a more complicated selection example:

SELECT xmlid, textid, spelldocfreq, spellcount, kwicl, spelling, kwicr, standard, cortoken,  corstandard, comment
FROM eebo1623
WHERE spelling IN (SELECT spelling FROM eebo1623 WHERE textid = '158-azb-A14467');

In this command, you are SELECTing more columns than you did last time, but it works in exactly the same way. You SELECT a subset of columns FROM the table eebo1623. Then you limit the rows you want by asking only for the rows WHERE the spelling is IN some other list. In this case that list comes from a separate, nested SQL command in parentheses. This commands asks for all spellings that occur in a single text (by matching the textid column to a specific text). The results will look like this:

An more complex example SELECT query.

Why write a SQL command like this? In the scenario above, you’re able to retrieve not only every word in that text, but also every time those same words appear in other texts too. If you fix the word in the text you care about, you can also fix it in all other places in the corpus at the exact same time. Convenient!

Updating Records

Now that you know how to select and view data, you’re ready to edit that same data. In SQL, editing records is referred to as “updating” them.

As a database user for EarlyPrint, you’re only allowed to upload certain columns: any of the columns that begin with “cor” (which stands for “corrected”), as well as a column for comments and an “interim” column for temporary values. By only updating the “cor” columns, we ensure that there’s no way to overwrite project data. This way you can improve data without worrying about replacing the original fields.

Single Records

From any Query Tool window in pgAdmin, you can update a single record directly in the Data Output section. Looking at the previous SELECT example, let’s say you found a word with a missing black-dot character, “●reater”, and you see from context that it should be changed to “greater”. In this case you would want to update the “cortoken” column to read “greater”, and you can do this by simply double-clicking on the cell you’d like to change:

Updating a single record.

An editing box will pop up above the cell, as you can see in the screenshot. Simply delete what’s already there and replace it with your correction. When you’re done click okay.

You’ll see the new value in bold. This means that it’s been edited, but the changes have not yet been saved to the database. To save your changes and sync them with the database, click the “Save Data Changes” button in the toolbar: it looks like a server with a little save disk in front, and it’s directly to the right of the normal Save button. Once you’ve saved the change to the database, the record will no longer be bold!

You can make a lot of changes and then save them all at once, or make a few changes and save more often. It’s up to you, but remember that none of the changes are in the database until you click “Save Data Changes.”

Multiple Records

This approach works great for one record at a time, but sometimes you want to edit a bunch of records at once. Say you’ve looked over all the instances of “●reater”, and every one of them should be “greater”. Updating them one at a time might be tedious, but where you used a SELECT query to view lots of records at once, you can use an UPDATE query to update lots of records at once.

WARNING: BE CAREFUL WHEN UPDATING RECORDS! THERE’S OFTEN NO WAY TO UNDO, AND IT’S EASY TO MAKE SWEEPING CHANGES BY ACCIDENT. CHECK AND DOUBLE CHECK ALL YOUR UPDATE QUERIES BEFORE YOU RUN THEM. YOU MUST ALWAYS INCLUDE A WHERE CONDITION OR ANOTHER CONDITION AS PART OF THE UPDATE QUERY. OTHERWISE, YOU’LL UPDATE ALL THE RECORDS IN THE COLUMN, ERASING PREVIOUS WORK!

An UPDATE statement is kind of like a SELECT statement in reverse. First you put the name of the table you’d like to update, followed by how you’d like to change the column value, and then the conditions for the rows you’d like to replace. Here’s the UPDATE statement to change all the instances of “●reater”:

UPDATE eebo1623
SET cortoken = 'greater'
WHERE spelling = '●reater';

In the command above, you UPDATE the “eebo1623” table and SET the value of the cortoken column to “greater”, but only WHERE the value of the spelling column is “●reater”. Remember, you must always include a WHERE condition: otherwise you’ll update every single row of the table!

Updating multiple records.

You can enter this query into the same window that you put the SELECT command, and click the same “play” button to execute the command. UPDATE queries are immediately executed in the database, so there’s no need to click “Save Data Changes” afterwards. However that also means that all UPDATE commands are final, and there’s no going back! But it’s easy to get the hang of it, and soon you’ll be selecting and updating records with ease.