Text Similarity

EarlyPrint’s Discovery Engine allows you to find a set of texts similar to any text in our corpus. It does this by using some basic measures of text similarity, and it’s easy to use if you’re interested in finding similar texts across the entire early modern corpus.

But you might be interested in finding similarity across a smaller subset of the corpus. In this tutorial, we’ll calculate similarity across the same set of 1666 texts that we used in the TF-IDF tutorial. You could easily do the same with any subset of texts that you’ve gathered using the Metadata tutorial.

This tutorial is meant as a companion to an explanation of text similarity that I wrote for The Programming Historian.

The article uses the same 1666 corpus as its example, but here we’ll work directly with the EarlyPrint XML instead of with plaintext files. For full explanations of the different similarity measures and how they’re used, please use that piece as a guide.

First, we’ll import the necessary libraries. [n.b. In the Programming Historian tutorial, I use SciPy’s implementation of pairwise distances. For simplicity’s sake, here we’re using scikit-learn’s built-in pairwise_distances function.]

import glob
import pandas as pd
from lxml import etree
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import pairwise_distances
from collections import Counter

Next we use glob to get our list of files and isolate the filekeys to use later. This is the complete list of texts we’re working with in this example. You may have a different directory or filepath for your own files.

# Use the glob library to create a list of file names
filenames = glob.glob("1666_texts_full/*.xml")
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]
print(filekeys)
['B02845', 'A32444', 'A51130', 'A89320', 'A36358', 'A51877', 'A46196', 'A60482', 'B04513', 'A61956', 'A46152', 'B05117', 'A32566', 'B02144', 'A35206', 'A35114', 'A32314', 'A32207', 'A29439', 'A39345', 'B03164', 'A25743', 'A86466', 'A61929', 'B03114', 'B05479', 'A53272', 'A32916', 'A70852', 'A39482', 'A41058', 'B01661', 'A61594', 'A35608', 'A76131', 'A67420', 'A64861', 'A61503', 'A66579', 'A34136', 'A77996', 'A79302', 'A62436', 'A32313', 'A38556', 'B03489', 'A32751', 'A63370', 'B05916', 'B23164', 'B09634', 'A57484', 'A92820', 'A28339', 'A52005', 'A57421', 'A39246', 'B05319', 'A52396', 'A59608', 'A64918', 'A66752', 'A26426', 'A51442', 'A26249', 'B03672', 'A56039', 'A32503', 'A55410', 'A46087', 'A54027', 'B08665', 'B03352', 'A37096', 'A31237', 'A61867', 'A61891', 'A95297', 'A50777', 'A71231', 'A28989', 'A49061', 'A27397', 'A31124', 'A63952', 'A80818', 'A65702', 'A65296', 'A30203', 'A55387', 'A59325', 'A45620', 'B06022', 'A53911', 'B03891', 'A60250', 'A93278', 'B04360', 'A96435', 'A56381', 'A61600', 'A66777', 'A39714', 'A51369', 'A48909', 'A44801', 'A71109', 'A49213', 'A63951', 'A32233', 'A43020', 'A51346', 'A45206', 'A48218', 'A95690', 'A60606', 'A23770', 'A41053', 'A52519', 'A44938', 'A64258', 'A35851', 'A56390', 'B02572', 'A91186', 'A59229', 'A46193', 'B05875', 'B05308', 'A30143', 'A47951', 'A75822', 'A46046', 'A35574', 'A29694', 'B03376', 'B03317', 'A47095', 'B01318', 'B03106', 'A44879', 'B05318', 'A54070', 'A54418', 'A32967', 'A70287', 'A75960', 'A29110', 'A50520', 'A47546', 'A37291', 'A28209', 'B02089', 'B04153', 'A59168', 'A29017', 'A47367', 'A44334', 'A81069', 'A35538', 'A46108', 'B03109', 'B02123', 'A39466', 'A96936', 'A43741', 'A55322', 'A42533', 'A42537', 'A63571', 'A87330', 'A44627', 'A92373', 'B04154', 'B20017', 'A32612', 'A93280', 'A79623', 'A38792', 'B06375', 'A47545', 'A67572', 'A46030', 'A32581', 'A44478', 'A47379', 'A41072', 'B01399', 'A26496', 'A32557', 'A37237', 'A32614', 'A39839', 'B04338', 'A48797', 'B03631', 'A45529', 'A46137', 'A58750', 
'A53307', 'A41266', 'A32484', 'A50075', 'A25198', 'A42820', 'A39442', 'B02051', 'A63431', 'A77803', 'A38741', 'A49793', 'A67762', 'A45552', 'A52328', 'A97379', 'A63849', 'A23851', 'A39938', 'A44061', 'A93281', 'A67335', 'A46743', 'B04701', 'A40254', 'A40151', 'A44594', 'A80816', 'B06473', 'A39974', 'A26482', 'B05591', 'A85898', 'B06427', 'A61206', 'B06872', 'A43177', 'A32555', 'A49471', 'A47547', 'A23912', 'A36329', 'A41527', 'B04364', 'A61207', 'A41958', 'A31229', 'A96485', 'A32288', 'A59614', 'A53049', 'A36272', 'A46071', 'A42544', 'B06591', 'A32567', 'A64060', 'B05057', 'A38630', 'A63767', 'A32559', 'A33006', 'A60948', 'A53818', 'A49697', 'A57156', 'A57634', 'A65985', 'B03763', 'A41955']
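Splitting on '/' assumes POSIX-style paths. If your paths might use a different separator, a minimal alternative is the standard library's pathlib, sketched here on two hypothetical filenames standing in for the glob output:

```python
from pathlib import Path

# Hypothetical paths standing in for the output of glob.glob
example_filenames = ["1666_texts_full/A53049.xml", "1666_texts_full/B02845.xml"]

# Path.stem drops the directory and the final file extension
example_filekeys = [Path(f).stem for f in example_filenames]
print(example_filekeys)
```

Either approach gives the same file keys for filenames like these, so choose whichever reads more clearly to you.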

Get Features

In order to measure similarity between texts, you need features of those texts to measure. The Discovery Engine calculates similarity across three distinct sets of features for the same texts: TF-IDF weights for word counts, LDA Topic Modeling results, and XML tag structures. As our example here, we’ll use TF-IDF.

The code below is taken directly from the TF-IDF Tutorial, where you’ll find a full explanation of what it does. We loop through each text, extract words, count them, and convert those counts to TF-IDF values.

n.b. There are two key differences between the TF-IDF tutorial and this one. First, below I am counting lemmas (dictionary headwords) rather than regularized spellings, which groups plurals and verb forms under a single term. Second, here we’ll use L2 normalization in our TF-IDF transformation. Normalizing the values helps account for very long or very short texts that might otherwise skew our similarity results.

# Create an empty list to put all our texts into
all_tokenized = []

# Then you can loop through the files
for f in filenames:
    parser = etree.XMLParser(collect_ids=False) # Create a parser object that skips XML IDs (in this case they just slow things down)
    tree = etree.parse(f, parser) # Parse each file into an XML tree
    xml = tree.getroot() # Get the XML from that tree
    
    # Now we can use lxml to find all the w tags       
    word_tags = xml.findall(".//{*}w")
    # In this next line you'll do several things at once to create a list of words for each text
    # 1. Loop through each word: for word in word_tags
    # 2. Make sure the tag has a word at all: if word.text is not None
    # 3. Get the lemmatized form of the word, falling back to the original text: word.get('lemma', word.text)
    # 4. Make sure all the words are in lowercase: .lower()
    words = [word.get('lemma', word.text).lower() for word in word_tags if word.text is not None]
    # Then we add these results to a master list
    all_tokenized.append(words)
    
# We can count all the words in each text in one line of code
all_counted = [Counter(a) for a in all_tokenized]

# To prepare this data for TF-IDF transformation, we need to put it into a different form, a DataFrame, using pandas.
df = pd.DataFrame(all_counted, index=filekeys).fillna(0)

# First we need to create an "instance" of the transformer, with the proper settings.
# Normalization is set to 'l2'
tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
# I am choosing to turn on sublinear term frequency scaling, which takes the log of
# term frequencies and can help to de-emphasize function words like pronouns and articles. 
# You might make a different choice depending on your corpus.

# Once we've created the instance, we can "transform" our counts
results = tfidf.fit_transform(df)

# Make results readable using Pandas
readable_results = pd.DataFrame(results.toarray(), index=df.index, columns=df.columns) # Convert information back to a DataFrame
readable_results
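To see what L2 normalization does, here is a small sketch on made-up counts (the numbers are purely illustrative, not drawn from the corpus). Whatever the raw counts are, each row of the transformed matrix ends up with a vector length of 1, which is what keeps very long texts from dominating the distance calculations:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy counts: the second "text" uses the same words as the first, ten times as often
toy_counts = np.array([
    [3, 0, 1],
    [30, 0, 10],
    [0, 4, 2],
])

# Same settings as the transformer used in this tutorial
toy_tfidf = TfidfTransformer(norm='l2', sublinear_tf=True).fit_transform(toy_counts)

# With norm='l2', the Euclidean length of every row is 1
row_lengths = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_lengths)
```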

Calculate Distance

Below we’ll calculate three different distance metrics (Euclidean distance, “cityblock” distance, and cosine distance) and create DataFrames for each one. For explanations of each metric, and for a discussion of the difference between similarity and distance, refer to the Programming Historian tutorial, which covers these topics in detail.
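As a quick reference, the three metrics can be worked out by hand for a pair of toy vectors and compared with what pairwise_distances returns. The vectors a and b are invented for illustration only:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Two made-up feature vectors
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 3.0, 2.0])

# Euclidean: square root of the summed squared differences
euclid_manual = np.sqrt(np.sum((a - b) ** 2))
# Cityblock (Manhattan): sum of the absolute differences
city_manual = np.sum(np.abs(a - b))
# Cosine distance: 1 minus the cosine of the angle between the vectors
cos_manual = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

toy = np.vstack([a, b])
for name, manual in [('euclidean', euclid_manual),
                     ('cityblock', city_manual),
                     ('cosine', cos_manual)]:
    library_value = pairwise_distances(toy, metric=name)[0, 1]
    print(name, manual, library_value)
```

The hand-computed value and the library value should agree for each metric, up to floating-point rounding.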

Euclidean distance is first, because it’s the default in sklearn:

euclidean = pairwise_distances(results)
euclidean_df = pd.DataFrame(euclidean, index=df.index, columns=df.index)
euclidean_df
B02845 A32444 A51130 A89320 A36358 A51877 A46196 A60482 B04513 A61956 ... A32559 A33006 A60948 A53818 A49697 A57156 A57634 A65985 B03763 A41955
B02845 0.000000 1.330296 1.252074 1.343080 1.272442 1.381007 1.318771 1.320236 1.305981 1.310723 ... 1.330669 1.283922 1.290638 1.331735 1.315239 1.296993 1.295608 1.278693 1.288136 1.299440
A32444 1.330296 0.000000 1.330771 1.340797 1.337868 1.370927 1.240466 1.357205 1.334426 1.345882 ... 1.190446 1.317433 1.315577 1.280773 1.346539 1.333417 1.315543 1.348345 1.321553 1.321963
A51130 1.252074 1.330771 0.000000 1.316638 1.161953 1.376152 1.297971 1.226153 1.246694 1.213826 ... 1.324561 1.166884 1.179731 1.335540 1.186073 1.197500 1.227000 1.167605 1.233122 1.195988
A89320 1.343080 1.340797 1.316638 0.000000 1.325708 1.237975 1.342863 1.295708 1.330615 1.311224 ... 1.349694 1.314192 1.304918 1.349898 1.301733 1.305777 1.314975 1.316260 1.323349 1.298748
A36358 1.272442 1.337868 1.161953 1.325708 0.000000 1.369564 1.322658 1.258650 1.278494 1.261418 ... 1.328167 1.190528 1.208084 1.347102 1.199426 1.217275 1.262931 1.187049 1.225716 1.243505
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
A57156 1.296993 1.333417 1.197500 1.305777 1.217275 1.372673 1.316086 1.200557 1.246743 1.218061 ... 1.327030 1.110908 1.161417 1.335515 1.116767 0.000000 1.245640 1.187006 1.177661 1.204760
A57634 1.295608 1.315543 1.227000 1.314975 1.262931 1.381444 1.327350 1.243357 1.255162 1.215802 ... 1.324041 1.246059 1.204422 1.330255 1.238637 1.245640 0.000000 1.252065 1.267028 1.239778
A65985 1.278693 1.348345 1.167605 1.316260 1.187049 1.384899 1.339453 1.218992 1.254460 1.230040 ... 1.345416 1.129940 1.201704 1.352158 1.154239 1.187006 1.252065 0.000000 1.182711 1.213980
B03763 1.288136 1.321553 1.233122 1.323349 1.225716 1.371066 1.327480 1.257738 1.272967 1.263826 ... 1.327534 1.118590 1.214992 1.324017 1.188736 1.177661 1.267028 1.182711 0.000000 1.246016
A41955 1.299440 1.321963 1.195988 1.298748 1.243505 1.367297 1.330342 1.195975 1.219998 1.155711 ... 1.323801 1.196622 1.195257 1.328967 1.173217 1.204760 1.239778 1.213980 1.246016 0.000000

269 rows × 269 columns

Next is cityblock distance:

cityblock = pairwise_distances(results, metric='cityblock')
cityblock_df = pd.DataFrame(cityblock, index=df.index, columns=df.index)
cityblock_df
B02845 A32444 A51130 A89320 A36358 A51877 A46196 A60482 B04513 A61956 ... A32559 A33006 A60948 A53818 A49697 A57156 A57634 A65985 B03763 A41955
B02845 0.000000 24.650119 49.203414 38.780499 46.542462 27.171730 25.745305 78.787493 39.161498 55.773086 ... 25.947184 43.515562 45.353246 24.779750 75.313625 49.781847 38.587606 57.332514 36.210966 55.782928
A32444 24.650119 0.000000 47.669264 32.272188 43.999466 19.227561 16.057823 74.819409 34.544138 51.772394 ... 14.635631 39.542522 40.899367 15.875381 71.247461 45.793170 33.946812 55.328566 31.591160 51.816649
A51130 49.203414 47.669264 0.000000 56.477056 54.576265 49.983710 48.170012 84.538490 53.736563 63.487868 ... 48.639787 52.842010 54.400805 47.839720 77.494767 58.105044 53.010386 62.705428 52.478075 62.159071
A89320 38.780499 32.272188 56.477056 0.000000 53.704863 30.989173 33.793924 78.858523 45.477574 58.163339 ... 34.014387 49.995994 50.250660 32.622379 76.028412 54.251285 44.669043 62.598209 43.635359 58.310558
A36358 46.542462 43.999466 54.576265 53.704863 0.000000 45.860743 45.077440 85.634146 52.849591 65.225453 ... 44.822800 51.441112 54.023617 44.417045 76.936334 57.401979 52.306068 62.482273 48.969791 63.811401
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
A57156 49.781847 45.793170 58.105044 54.251285 57.401979 47.997987 46.659926 81.483471 51.999108 62.647620 ... 46.760935 47.783912 52.272594 45.851017 71.074501 0.000000 52.634620 63.482708 47.621520 61.908947
A57634 38.587606 33.946812 53.010386 44.669043 52.306068 37.110046 35.953995 78.607434 43.045998 54.840913 ... 35.485991 47.947224 46.251104 34.297440 74.725987 52.634620 0.000000 61.035598 42.686829 57.116200
A65985 57.332514 55.328566 62.705428 62.598209 62.482273 57.373788 56.423529 88.513195 60.766621 70.246288 ... 56.428125 56.711215 62.424108 55.515185 78.896892 63.482708 61.035598 0.000000 56.004211 68.587934
B03763 36.210966 31.591160 52.478075 43.635359 48.969791 34.287528 33.433897 79.047064 42.843973 57.316811 ... 33.233848 39.478019 45.686447 31.738186 71.259116 47.621520 42.686829 56.004211 0.000000 56.758533
A41955 55.782928 51.816649 62.159071 58.310558 63.811401 54.011823 53.286440 82.167830 55.554180 60.213976 ... 52.834872 58.372266 58.938908 52.005737 76.817342 61.908947 57.116200 68.587934 56.758533 0.000000

269 rows × 269 columns

And finally cosine distance, which is usually (but not always) preferable for text similarity:

cosine = pairwise_distances(results, metric='cosine')
cosine_df = pd.DataFrame(cosine, index=df.index, columns=df.index)
cosine_df
B02845 A32444 A51130 A89320 A36358 A51877 A46196 A60482 B04513 A61956 ... A32559 A33006 A60948 A53818 A49697 A57156 A57634 A65985 B03763 A41955
B02845 0.000000 0.884843 0.783845 0.901932 0.809554 0.953590 0.869578 0.871511 0.852793 0.858997 ... 0.885339 0.824228 0.832873 0.886759 0.864926 0.841096 0.839300 0.817528 0.829647 0.844272
A32444 0.884843 0.000000 0.885475 0.898868 0.894945 0.939720 0.769378 0.921002 0.890346 0.905700 ... 0.708580 0.867815 0.865371 0.820190 0.906584 0.889000 0.865327 0.909017 0.873252 0.873793
A51130 0.783845 0.885475 0.000000 0.866768 0.675067 0.946897 0.842365 0.751725 0.777123 0.736687 ... 0.877231 0.680809 0.695882 0.891833 0.703385 0.717004 0.752764 0.681651 0.760295 0.715194
A89320 0.901932 0.898868 0.866768 0.000000 0.878750 0.766291 0.901640 0.839429 0.885268 0.859654 ... 0.910837 0.863551 0.851406 0.911112 0.847254 0.852526 0.864579 0.866271 0.875627 0.843373
A36358 0.809554 0.894945 0.675067 0.878750 0.000000 0.937853 0.874712 0.792100 0.817274 0.795588 ... 0.882013 0.708679 0.729733 0.907342 0.719311 0.740879 0.797497 0.704543 0.751190 0.773152
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
A57156 0.841096 0.889000 0.717004 0.852526 0.740879 0.942116 0.866042 0.720669 0.777184 0.741836 ... 0.880504 0.617059 0.674445 0.891801 0.623584 0.000000 0.775809 0.704492 0.693443 0.725723
A57634 0.839300 0.865327 0.752764 0.864579 0.797497 0.954194 0.880929 0.772968 0.787716 0.739087 ... 0.876542 0.776331 0.725316 0.884790 0.767111 0.775809 0.000000 0.783833 0.802680 0.768525
A65985 0.817528 0.909017 0.681651 0.866271 0.704543 0.958973 0.897067 0.742971 0.786835 0.756500 ... 0.905072 0.638383 0.722046 0.914166 0.666133 0.704492 0.783833 0.000000 0.699403 0.736874
B03763 0.829647 0.873252 0.760295 0.875627 0.751190 0.939911 0.881102 0.790952 0.810223 0.798628 ... 0.881173 0.625622 0.738102 0.876510 0.706546 0.693443 0.802680 0.699403 0.000000 0.776278
A41955 0.844272 0.873793 0.715194 0.843373 0.773152 0.934750 0.884905 0.715178 0.744197 0.667834 ... 0.876224 0.715953 0.714320 0.883077 0.688219 0.725723 0.768525 0.736874 0.776278 0.000000

269 rows × 269 columns
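One thing worth noticing: because our TF-IDF rows are L2-normalized, Euclidean and cosine distance are tied together. For vectors of length 1, the squared Euclidean distance equals twice the cosine distance. The tables above are consistent with this: for B02845 and A32444, 1.330296 squared and divided by 2 is roughly 0.884843. So on normalized data the two metrics rank texts in the same order. A minimal check on random toy vectors (not corpus data), using scikit-learn's normalize helper:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

# Random toy rows, scaled so that each has length 1
rng = np.random.default_rng(0)
toy_rows = normalize(rng.random((4, 6)), norm='l2')

toy_euclidean = pairwise_distances(toy_rows, metric='euclidean')
toy_cosine = pairwise_distances(toy_rows, metric='cosine')

# For unit-length rows: euclidean ** 2 == 2 * cosine distance (up to rounding)
relationship_holds = np.allclose(toy_euclidean ** 2, 2 * toy_cosine, atol=1e-6)
print(relationship_holds)
```

Cityblock distance has no such tie to the other two, which is why its table looks so different.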

Reading Results

Now that we have DataFrames of all our distance results, we can easily look at the texts that are most similar (i.e. closest in distance) to a text of our choice. We’ll use the same example as in the TF-IDF tutorial: Margaret Cavendish’s The Blazing World.

top5_cosine = cosine_df.nsmallest(6, 'A53049')['A53049'][1:]
print(top5_cosine)
A29017    0.558313
A61207    0.575897
A59608    0.599391
A56381    0.611200
A66752    0.611516
Name: A53049, dtype: float64
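The [1:] slice above relies on the text itself coming first in the sorted list, since its distance to itself is 0. If you would rather exclude the text explicitly by its ID, one possible sketch is a small helper. The name most_similar and the toy distance table are hypothetical, for illustration only:

```python
import pandas as pd

def most_similar(distance_df, key, n=5):
    """Return the n smallest distances to `key`, excluding `key` itself."""
    return distance_df[key].drop(key).nsmallest(n)

# A made-up symmetric distance table for four imaginary texts
toy_ids = ['T1', 'T2', 'T3', 'T4']
toy_distances = pd.DataFrame(
    [[0.0, 0.7, 0.2, 0.9],
     [0.7, 0.0, 0.5, 0.4],
     [0.2, 0.5, 0.0, 0.6],
     [0.9, 0.4, 0.6, 0.0]],
    index=toy_ids, columns=toy_ids)

closest = most_similar(toy_distances, 'T1', n=2)
print(closest)
```

With the real data, the equivalent call would be most_similar(cosine_df, 'A53049').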

We now have a list of text IDs and their cosine distances, but this list is hard to interpret without more information. We can use the techniques from the Metadata tutorial to get a DataFrame of metadata for all the 1666 texts:

# Get the full list of metadata files
# (You'll change this line based on where the files are on your computer)
metadata_files = glob.glob("../../epmetadata/header/*.xml")
nsmap={'tei': 'http://www.tei-c.org/ns/1.0'}

all_metadata = [] # Empty list for data
index = [] # Empty list for TCP IDs
for f in metadata_files: # Loop through each file
    tcp_id = f.split("/")[-1].split("_")[0] # Get TCP ID from filename
    if tcp_id in filekeys:
        metadata = etree.parse(f, parser) # Create lxml tree for metadata
        title = metadata.find(".//tei:sourceDesc//tei:title", namespaces=nsmap).text # Get title

        # Get author (if there is one)
        try:
            author = metadata.find(".//tei:sourceDesc//tei:author", namespaces=nsmap).text
        except AttributeError:
            author = None

        # Get date (if there is one that isn't a range)
        try:
            date = metadata.find(".//tei:sourceDesc//tei:date", namespaces=nsmap).get("when")
        except AttributeError:
            date = None

        # Add dictionary of data to data list
        all_metadata.append({'title':title,'author':author,'date':date})

        # Add TCP ID to index list
        index.append(tcp_id)


# Create DataFrame with data and indices
metadata_df = pd.DataFrame(all_metadata, index=index)
metadata_df
title author date
A48797 Wonders no miracles, or, Mr. Valentine Greatra... Lloyd, David, 1635-1692. 1666
A44938 A fast-sermon, preached to the Lords in the Hi... Hall, George, 1612?-1668. 1666
B02089 His Majestie's most gracious speech to both Ho... England and Wales. Sovereign (1660-1685 : Char... 1666
B02144 Seasonable thoughts of divine providence affor... Chishull, John. 1666
A35608 The Case of Cornelius Bee and his partners Ric... None 1666
... ... ... ...
A39938 Experimented proposals how the King may have m... Ford, Edward, Sir, 1605-1670. 1666
A50075 Several lavvs and orders made at the General C... Massachusetts. 1666
B06473 Vox civitatis: or, Londons call to her natural... None 1666
A50777 Exaltatio alæ The ex-ale-tation of ale / done ... Mews, Peter, 1619-1706. 1666
A75822 Avaritia coram tribunali: or, the miser arraig... Gentleman that loves men more than money. 1666

269 rows × 3 columns
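The try/except blocks above handle fields that are missing from a header: find() returns None when nothing matches, and asking None for .text raises an AttributeError. The same idea can be written as a small helper that checks for None directly. This is only a sketch: find_text is a hypothetical name, and the XML string is a made-up, heavily simplified header rather than a real EarlyPrint file:

```python
from lxml import etree

NSMAP = {'tei': 'http://www.tei-c.org/ns/1.0'}

def find_text(tree, path):
    """Return the text of the first element matching path, or None if absent."""
    element = tree.find(path, namespaces=NSMAP)
    return element.text if element is not None else None

# A made-up, minimal header with a title but no author
sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>
<sourceDesc><biblFull><titleStmt><title>A sample title</title></titleStmt></biblFull></sourceDesc>
</fileDesc></teiHeader></TEI>"""

root = etree.fromstring(sample)
sample_title = find_text(root, ".//tei:sourceDesc//tei:title")
sample_author = find_text(root, ".//tei:sourceDesc//tei:author")
print(sample_title, sample_author)
```

Unlike the loop above, this version would also return None for a missing title instead of raising an error, which may or may not be what you want.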

And we can combine this with our cosine distance results to see the metadata for the texts most similar to The Blazing World:

metadata_df.loc[top5_cosine.index, ['author','title','date']]
author title date
A29017 Boyle, Robert, 1627-1691. The origine of formes and qualities, (accordin... 1666
A61207 Spurstowe, William, 1605?-1666. The spiritual chymist, or, Six decads of divin... 1666
A59608 Shaw, Samuel, 1635-1696. The voice of one crying in a wilderness, or, T... 1666
A56381 Parker, Samuel, 1640-1688. An account of the nature and extent of the div... 1666
A66752 Wither, George, 1588-1667. Ecchoes from the sixth trumpet. The first part... 1666
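It can also help to keep the distance scores next to the metadata. Because pandas aligns on the index, one way to do this is with assign. The sketch below uses a made-up metadata table and made-up distances, so every name and value in it is illustrative:

```python
import pandas as pd

# Made-up metadata for three imaginary texts
toy_metadata = pd.DataFrame(
    {'author': ['Author One', None, 'Author Three'],
     'title': ['First title', 'Second title', 'Third title'],
     'date': ['1666', '1666', '1666']},
    index=['T1', 'T2', 'T3'])

# Made-up distances from text T2 to its two nearest neighbors
toy_top = pd.Series({'T3': 0.41, 'T1': 0.58}, name='T2')

# Select the matching metadata rows and attach the distances as a new column
combined = toy_metadata.loc[toy_top.index].assign(cosine_distance=toy_top)
print(combined)
```

With the real data, this would be metadata_df.loc[top5_cosine.index].assign(cosine_distance=top5_cosine).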

You now have all the tools you need to create your own mini Discovery Engine, one focused on exactly the texts you care most about. For more on how to interpret these results and things to watch out for when calculating similarity, refer again to The Programming Historian.

Visualizing Results

Now that we’ve calculated similarity among all the 1666 texts, it’s helpful to explore further by visualizing those results in different ways. The first thing we need to do is import some simple graphing libraries.

from matplotlib import pyplot as plt
import seaborn as sns

Visualizing Words

In our results above, we found the text most similar to Cavendish’s Blazing World: Boyle’s The Origin of Forms and Qualities. (We know this similarity makes sense because Cavendish’s Blazing World also includes a scientific treatise: Observations upon Experimental Philosophy.)

We might want to know which features—in this case individual words—“drive” the similarity between these two texts. We can do this by graphing all the words that appear in both texts according to their TF-IDF values.

Luckily, pandas lets us do so easily by selecting for the IDs of each text:

# We need to "transpose" our results so that the texts are the columns and the words are the rows.
transformed_results = readable_results.T 

# Then we can graph by selecting our two texts
transformed_results.plot.scatter('A53049','A29017')
[Figure: scatter plot of each word’s TF-IDF value in A53049 (x-axis) against its value in A29017 (y-axis)]

You can see there are lots of words along the x- and y-axes that only appear in one text or the other. But there are plenty of words that appear in both, with varying TF-IDF scores.

The words we’re interested in will have high TF-IDF scores in both texts—those are the words that most account for the high similarity score between these two books. We’d like to label those words on this graph.

First, we can subselect a set of words based on their TF-IDF scores in the two columns we care about. This will create a new, much smaller DataFrame:

# Pull out the two columns we care about so the filter is easier to read
cavendish = transformed_results['A53049']
boyle = transformed_results['A29017']
filtered_results = transformed_results[
    ((cavendish > 0.04) & (boyle > 0.005))
    | ((boyle > 0.04) & (cavendish > 0.005))
    | ((boyle > 0.03) & (cavendish > 0.03))
]
filtered_results
B02845 A32444 A51130 A89320 A36358 A51877 A46196 A60482 B04513 A61956 ... A32559 A33006 A60948 A53818 A49697 A57156 A57634 A65985 B03763 A41955
chemist 0.0 0.0 0.02112 0.000000 0.00000 0.0 0.0 0.008434 0.0 0.032280 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
figure 0.0 0.0 0.00000 0.065961 0.01499 0.0 0.0 0.017960 0.0 0.015254 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.010337 0.0 0.000000
vegetable 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.008294 0.0 0.031744 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
perception 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.022225 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.014548
production 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.019591 0.0 0.012041 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.026699 0.000000 0.0 0.000000
sensitive 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.019479 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.023907 0.0 0.000000 0.000000 0.0 0.000000
immaterial 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.008745 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
mineral 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.020672 0.0 0.037429 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
experiment 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.006660 0.0 0.000000 ... 0.0 0.000000 0.020265 0.0 0.007475 0.0 0.000000 0.000000 0.0 0.024826
inanimate 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.018352 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
fluid 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.021281 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.016947 0.0 0.000000 0.000000 0.0 0.000000
texture 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.009313 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
vitriol 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.014606 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
particle 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.017683 0.0 0.000000 0.000000 0.0 0.000000
local 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
ice 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.020307 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.020657
finite 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.015512 0.0 0.000000 0.025434 0.0 0.000000
phaenomena 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
atom 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.015263 0.0 0.000000
corpuscle 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
volatile 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.033120 0.000000 0.0 0.000000
saline 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
camphire 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
corporeal 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
acid 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.015303
incorporeal 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
perceptive 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
nitre 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
microscope 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000
charcoal 0.0 0.0 0.00000 0.000000 0.00000 0.0 0.0 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000

30 rows × 269 columns

These are the 30 words that drive the similarity between Cavendish and Boyle. You could adjust the threshold values in the above filter to get a bigger or smaller list of words.
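An alternative to hand-tuned thresholds, assuming the rows are L2-normalized as they are here: the cosine similarity of two unit-length texts (1 minus their cosine distance) is the sum, over all words, of the product of the two TF-IDF values. Each word's product is therefore its share of the similarity, and sorting by it ranks the words directly. A sketch on invented values:

```python
import pandas as pd

# Made-up TF-IDF values for two imaginary texts; each column has length 1
toy_tfidf = pd.DataFrame(
    {'text_a': [0.1, 0.7, 0.1, 0.7],
     'text_b': [0.1, 0.7, 0.7, 0.1]},
    index=['sermon', 'corpuscle', 'texture', 'perception'])

# Per-word contribution to the cosine similarity of the two texts
contributions = (toy_tfidf['text_a'] * toy_tfidf['text_b']).sort_values(ascending=False)

# The contributions add up to the cosine similarity itself
toy_similarity = contributions.sum()
print(contributions)
print(toy_similarity)
```

With the real data, the analogous calculation would multiply transformed_results['A53049'] by transformed_results['A29017'] and take the largest values with nlargest.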

And we can add these words as labels to our graph in order to see their relative TF-IDF weights:

ax = transformed_results.plot.scatter('A53049','A29017')
for txt in filtered_results.index:
    x = transformed_results.A53049.loc[txt]
    y = transformed_results.A29017.loc[txt]
    ax.annotate(txt, (x,y))
plt.show()
[Figure: the same scatter plot, with the 30 filtered words labeled]

This graph tells us quite a bit about the similarity between these two texts. Words like “texture” and “corpuscle” have very high TF-IDF scores in Boyle and somewhat high scores in Cavendish. Words like “perception” and “sensitive” have very high scores in Cavendish and only somewhat high in Boyle. And a few select terms, like “microscope,” “mineral,” and “corporeal,” have high scores in both texts. This scientific vocabulary is exactly what we might expect to see driving similarity between two early science texts.

Visualizing Texts

In addition to visualizing the words in just two texts, it can also be helpful to visualize all of our texts at once. We can create a visualization of our entire similarity matrix by making a heatmap: a chart where values are expressed as colors.

Using the seaborn library, this is as easy as inputting our cosine distance DataFrame into a single function:

f, ax = plt.subplots(figsize=(15, 10)) # This line just makes our heatmap a little bigger
sns.heatmap(cosine_df, cmap='coolwarm_r') # This function creates the heatmap
[Figure: heatmap of cosine distances among all 269 texts]

Like the word-based visualization above, this heatmap of texts is also showing us a lot that we couldn’t see just by looking at a table of numbers.

Mainly, we can see that most of the texts are not all that similar! Most of the values show up as blue, on the coolest end of our heatmap spectrum. [Look at the key on the right, and remember that when measuring distance, higher values mean that two texts are farther apart.] This makes sense, as a group of texts published in just one year won’t necessarily share much vocabulary.

Down the center diagonal of our heatmap is a solid red line. This is where a text matches with itself in our matrix, and texts are always perfectly similar to themselves.

But all is not lost: notice that some of the points are much lighter blue. These texts are more similar than the dark blue intersections, so there is some variation in our graph. And a few points that are not along the diagonal are dark red, indicating quite low distance, i.e. very high similarity. You would need to use the metadata techniques we learned above to get more information, but it’s possible that these very similar texts were written by the same author or are about the same topics.
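To find out which pairs those off-diagonal points correspond to, one option is to flatten the matrix into a sorted list of pairs. Since the matrix is symmetric, it's enough to keep the cells above the diagonal. The sketch below uses a made-up four-text distance table in place of cosine_df:

```python
import numpy as np
import pandas as pd

# A made-up symmetric distance table for four imaginary texts
toy_ids = ['T1', 'T2', 'T3', 'T4']
toy_cosine = pd.DataFrame(
    [[0.0, 0.7, 0.2, 0.9],
     [0.7, 0.0, 0.5, 0.4],
     [0.2, 0.5, 0.0, 0.6],
     [0.9, 0.4, 0.6, 0.0]],
    index=toy_ids, columns=toy_ids)

# Boolean mask that is True only above the diagonal (k=1 skips the diagonal itself)
upper = np.triu(np.ones(toy_cosine.shape, dtype=bool), k=1)

# Blank out everything else, flatten to (row, column) pairs, and sort by distance
closest_pairs = toy_cosine.where(upper).stack().dropna().sort_values()
print(closest_pairs.head(3))
```

The resulting index holds pairs of IDs, which you could then look up in metadata_df to see whether the closest texts share an author or a topic.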

Visualization doesn’t answer all our questions, but it allows us to view similarity measures in a few different ways. And by seeing our data anew, we can generate more research questions that require further digging: a generative cycle.