Shakespearean Wordpieces

I used common natural language processing (NLP) and machine learning (ML) frameworks in Python3 to match twenty lexias from Shakespeare’s works to a hundred intertextual references from early modern English drama to these lexias. The principal methods used are encodings from a transformer-based language model (Bert, Devlin et al. 2018) and the textual similarity measure Word Mover’s Distance (WMD, Kusner et al. 2015). The masked language model was further fine-tuned on an early modern English corpus (VEP).

Liebl and Burghardt (2020) present a search engine for intertextual references. In contrast to their all-purpose engine, this project aims to find intertextual references between Shakespeares work and that of his peers by adapting the encoder to 17th century text. Their system uses fasttext (Joulin et al., 2016), which is replaced by transformer-based encodings for this project.
Word Mover’s Distance is already heavily employed for plagiarism detection, so successful employment in text reuse, which examines similar, if less malignant, references, would seem reasonable (Chang et al., 2020).

Code & Libraries

The bulk of the NLP tasks were done with spacy (v3, Honnibal et al. 2020), although using the newest version has caused incompatibilities along the way. I worked with the Huggingface transformers library (Wolf et al., 2020) to import and fine-tune Bert. The important computations ran on Google Colaboratory notebooks, the VEP corpus was segmented and split on a short script and NLTK (Loper and Bird, 2020) on my computer. Code snippets were borrowed from Stackoverflow users Aviade and winwin, the evaluation script and gold standard were provided by my supervisor. The Word Mover’s Distance was calculated using the wmd Python library, the gold data was handled with pandas.

Design & Methodology


The fine-tuning corpus consists of 479322 unlabeled and automatically segmented sentenced from the Expanded Drama 1660 collection of the VEP corpus.
There are 100 context phrases and 20 Shakespearean phrases in the gold standard provided, matching to one context phrase each. An example of such a tuple is given in the Table below.

lexia Shakespeare phrase context
madness and method though this be madness, yet there is method in it [Rowland:] Methought there was a Method in her madness, she did not know herself i’th Glass.

Bert & VEP-Bert

I used Huggingface’s off-the-shelf pretrained Bert model (12 heads, 12 layers, 768 hidden states, 110M parameters), which uses an attention-based transformer architecture (Vaswani et al., 2017) as a starting point for fine-tuning. Bert is trained on a masking task, predicting a masked token in a sentence from its context over the vocabulary. For fine-tuning, sentences from the target domain are selected and fed the model during further training steps. The model fine-tuned on the VEP data set will be referred to as VEP-Bert.


Bert uses a wordpiece tokenizer, meaning Bert’s vocabulary does not necessarily consists of natural tokens, e.g. words, but also word segments or other arbitrary character sequences from the input string.
I draw the embeddings from the last hidden layer (by convention, and my class notes from a Bachelor statistics course; I could not find the original source for this practice). Each wordpiece tensor thus has a uniform length of 768 values.
No further alignment is performed.


Word Mover’s Distance aims to find a minimal alignment between tokens of two documents in a shared vector space. Since the complexity of such an operation is \(O(n^3 log n)\), a relaxed Word Mover’s distance is employed (and implemented in the wmd library), where a single token can only move to another token in the target sentence, not split between them (from Kusner et al. 2015):
\(\text{subject to: }\sum_{j=1}^n\mathbf{T}_{ij} = d_i \forall i \in \{1,...,n\}\)
\(\mathbf{T}_{ij}^{*}= \begin{cases} d_i & \text{if } j=\text{argmin}_jc(ij)\\ 0 & \text{otherwise.} \end{cases}\)
with flow matrix \(\mathbf{T}\in\mathcal{R}^{n\times n}\) , word vectors \(i\in d\) and \(j\in d'\), nBOW sentence representations \(d\) and \(d'\) and travel cost \(c(i,j) = ||x_i-x_j||_2\).
This is implemented by iterating over the source sentence and matching each token to the closest target token, minimizing the complexity to \(O(n^2)\). Instead of sentence tokens, wordpieces were aligned for the Bert-based experiments.
Cosine similarity between two 1-dimensional tensors \(v\) and \(w\) with length \(n\) can be defined as \(sim(v,w) = \frac{\sum_{i=1}^nv_iw_i}{\sqrt{\sum_{i=1}^nv_i^1}\sqrt{\sum_{i=1}^nw_i^1}}.\) To calculate the cosine similarity between to the query and a quote, their wordpiece tensors are pooled into a single vector each. For the experiments presented below, the mean of the set of wordpiece vectors was chosen.
Contexts are ranked by their similarity to the query.


I ran experiments with Spacy’s Tok2Vec, Huggingface’s pretrained Bert model and Bert model fine-tuned on the VEP corpus.
The fine-tuning was initialized with a learning rate of 1e-5, a chunk size of 20 and a maximum gradient norm of 5. The loss converged and training was stopped early at about 5.44e-05 after 4185 iterations, before the first epoch ran through. Because of the necessary time for a run parameters were not optimized beyond some test runs with different learning rates (a learning rate of 1e-4 produced comparable results, 1e-2 much worse).
Each encoder is run with Word Mover’s Distance and cosine distance on the gold data. The formal task is to retrieve an unordered set of correct contexts for a Shakespeare phrase (query) from a list of 100 contexts. Results are reported in normalized discounted cumulative gain (nDCG, Järvelin and Kekäläinen 2020) for \(p=50\) over \(N\) correct contexts:
\(nDCG_p = \frac{DCG_p}{IDCG_p},\)
\(IDCG_p = \sum_{i=1}^{\min(p,N)} \frac{1}{\log_2(i+1)},\)
\(DCG_p = \sum_{i=1}^p \frac{rel_i}{\log_2(i+1)},\)
\(rel_i = \begin{cases} 1 & \text{if the $i$-th context retrieved is correct,}\\ 0 & \text{otherwise.} \end{cases}\)
The normalized discounted cumulative gain for each query is finally averaged over all 20 queries. The all results from the experiments are listed in the Table below.

Word Mover's Distance Cosine Distance
lexia Tok2Vec Bert VEP-Bert Tok2Vec Bert VEP-Bert
bear a brain 1.000 0.000 0.000 0.090 0.822 0.215
carry coals 0.817 0.050 0.048 0.393 0.320 0.688
hell gapes 0.575 0.360 0.499 0.449 0.834 0.691
Hieronimo go by 0.624 0.339 0.339 0.906 0.558 0.798
hillo ho ho 0.557 0.667 0.855 0.949 0.598 0.670
host of heaven 0.099 0.098 0.141 0.186 0.296 0.284
madness and method 0.727 0.142 0.150 0.310 0.327 0.688
mind's eye 0.721 0.449 0.402 0.756 0.352 0.399
my kingdom for a horse 0.852 0.426 0.365 0.695 0.782 0.806
old man twice child 0.922 0.432 0.488 0.804 0.838 0.745
pampered jades of Asia 0.736 0.435 0.786 0.982 0.886 0.820
planet strike 1.000 0.797 0.836 0.772 0.668 0.426
sea of troubles 1.000 0.594 0.655 0.854 0.618 0.693
springes to catch woodcocks 0.344 0.451 0.829 0.959 0.731 0.898
the rest is silence 0.698 0.273 0.213 0.751 0.102 0.515
thereby hangs a tale 0.967 0.107 0.088 0.343 0.375 0.671
to be or not to be 0.971 0.718 0.767 0.894 0.890 0.983
white liver indicates cowardice 0.560 0.123 0.123 0.469 0.223 0.380
woman's frailty 0.585 0.720 0.819 0.444 0.537 0.617
world as stage 0.920 0.000 0.000 0.265 0.553 0.329
Avg. 0.734 0.359 0.420 0.614 0.566 0.616
Scores for each lexia, similarity measure and encoding, values in normalized discounted cumulative gain (0 to 1, higher is better). Best overall score in bold.


The fine-tuned model outperformed the pretrained model for both distance measures. This was an expected outcome, since the Bert model used was pretrained on text data from Wikipedia and Google’s book corpus, which may include some early modern English data, but certainly only negligible amounts. No domain-adaptation was performed for the Tok2Vec encoder, which was taken “as is”. Fine-tuning of Tok2Vec might also lead to further improvements.
The Bert-based models seem to profit from cosine measures, while the Tok2Vec embeddings did perform better with WMD. This may be due to the fact that no alignment was applied after the fact to the wordpiece tokens produced by Bert’s tokenizer.


Since little hyper-parameter optimization was done before fine-tuning, the VEP-Bert encodings could very well outperform the other encodings in practice. Because Spacy released a new version half-way through this project, which introduced a host of new incompatibilities between libraries, I was largely restricted to the Spacy/Huggingface stack. The wmd library is one of those dependencies that worked in theory, but retroactively I would have been more happy if I wrote that from scratch with proper transformer hooks. I also struggled with alignments of the wordpiece tokeniser, the abandonment of which might be partially to blame for the worse results from Bert and VEP-Bert with Word Mover’s Distance. All in all, the experiments run do not discourage from the viability of domain-adaptation and the application of Word Mover’s distance for text-reuse, leaving room for improvement by further optimization and experimentation with different hyperparameters or models.


Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. 2015. From word embeddings to document distances. In International conference on machine learning (pp. 957–966).

Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Honnibal, M., Montani, I., Van Landeghem, S., and Boyd, A. 2020. spaCy: Industrial-strength Natural Language Processing in Python. In undefined. Zenodo.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. 2015. From word embeddings to document distances. In International conference on machine learning (pp. 957–966).

Järvelin, K., and Kekäläinen, J. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4), p.422–446.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., and Polosukhin, I. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38–45). Association for Computational Linguistics.

Bernhard Liebl, and Manuel Burghardt 2020. The Vectorian - Eine parametrisierbare Suchmaschine für intertextuelle Referenzen. Book of Abstracts, DHd.

Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., and Mikolov, T. 2016. FastText. zip: Compressing text classification models. arXiv, p.arXiv–1612.

Molz, J., 2019. A close and distant reading of Shakespearean intertextuality. (Doctoral dissertation, lmu).

Chang, C.M., Chang, C.H., and Hwang, S.Y. 2020. Employing word mover’s distance for cross-lingual plagiarized text detection. Proceedings of the Association for Information Science and Technology, 57(1), p.e229.

Loper, E., and Bird, S. 2002. NLTK: the Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1 (pp. 63–70).