This January I gave a short talk at PyData Berlin on best practices in ethical data collection. Behind that title, which in retrospect could have used some spicing up, was a summary of three years’ worth of thinking, discussions and readings on ethics and on how to incorporate and really cement them in a professional environment. As is abundantly obvious, I am not an ethicist. Still, I hope this write-up of the ideas that fed into that PyData talk will move one or two readers to reflect on their own role in their profession and to seek out the people who really do know what they are talking about.
Victor, 31 March 2021 English
I used common natural language processing and machine learning frameworks in Python 3 to match twenty lexias from Shakespeare's works with a hundred intertextual references to them from early modern English drama. The principal methods are encodings from a transformer-based language model and the textual similarity measure Word Mover's Distance. The masked language model was further fine-tuned on an early modern English corpus.
Victor, 12 February 2021 English
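To give a feel for the similarity measure mentioned in the teaser above, here is a minimal sketch of how a lexia might be ranked against candidate passages. It is not the code from the post: it uses the relaxed lower bound on Word Mover's Distance (each word simply moves all its weight to the nearest word in the other text) instead of solving the full optimal transport problem, and the two-dimensional vectors in `TOY_VECTORS` are made up for illustration rather than taken from a transformer model. The names `relaxed_wmd` and `rank_candidates` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-d word vectors, invented for this illustration only.
# A real pipeline would take vectors from a trained embedding or language model.
TOY_VECTORS = {
    "king": (1.0, 0.0),
    "queen": (0.9, 0.1),
    "crown": (0.8, 0.3),
    "sword": (0.0, 1.0),
    "dagger": (0.1, 0.9),
}


def nbow(tokens):
    """Normalised bag of words: each distinct token gets its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(source, target):
    """Move every source word's weight to its nearest target word."""
    target_words = set(target)
    return sum(
        weight * min(math.dist(TOY_VECTORS[word], TOY_VECTORS[t]) for t in target_words)
        for word, weight in nbow(source).items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed Word Mover's Distance: a lower bound on the exact distance.

    Both token lists must be non-empty and contain only words in TOY_VECTORS.
    """
    return max(one_sided_cost(tokens_a, tokens_b), one_sided_cost(tokens_b, tokens_a))


def rank_candidates(lexia, candidates):
    """Sort candidate passages from most to least similar to the lexia."""
    return sorted(candidates, key=lambda candidate: relaxed_wmd(lexia, candidate))


lexia = ["king", "dagger"]
candidates = [["sword", "sword"], ["queen", "dagger"]]
best_match = rank_candidates(lexia, candidates)[0]
```

The relaxation keeps the sketch free of external dependencies. An exact Word Mover's Distance needs a linear programming or optimal transport solver, which is why library implementations are normally used in practice.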