This talk presents a mediating perspective on computational phrase composition, positioned between naïve additive composition and the complex latent representations produced by regression models. The aim of the work is to reduce the unrestricted transformation of latent embeddings to a classification problem over convolutional filters at the phrase level. The classifier is embedded within a Tree-LSTM architecture (Tai et al. 2015) and trained on truth-grounded sister node prediction. Preliminary results suggest that a restricted selection of filters suffices for phrase construction in high-dimensional spaces, but also that composition involves more than simple addition. This talk will introduce the distributional compositional semantics framework (Baroni 2014), outline previous computational work on the texture of (semantic) merge, and present preliminary results from my Master's thesis.
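To make the idea concrete, here is a minimal, purely illustrative sketch of what "composition as filter classification" could look like: instead of learning a free transformation of the two child embeddings, a small scorer picks one filter from a fixed bank (the first of which reduces to plain addition) and applies it to the child pair. All names, dimensions, and the random filter bank are assumptions for illustration, not the trained model from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_filters = 50, 8

# Fixed bank of candidate composition filters, each mapping the concatenated
# child pair (2*dim) back to a phrase vector (dim). Filter 0 is wired up so
# that choosing it reproduces plain additive composition.
filters = rng.standard_normal((n_filters, dim, 2 * dim)) * 0.1
filters[0, :, :dim] = np.eye(dim)
filters[0, :, dim:] = np.eye(dim)

# Hypothetical classifier weights that score each filter for a given pair.
selector = rng.standard_normal((n_filters, 2 * dim)) * 0.1

def compose(left_vec, right_vec):
    """Choose one filter per phrase node instead of a free transformation."""
    pair = np.concatenate([left_vec, right_vec])
    scores = selector @ pair          # one score per candidate filter
    chosen = int(np.argmax(scores))   # the classification step
    return filters[chosen] @ pair     # apply the selected filter to the pair

left, right = rng.standard_normal(dim), rng.standard_normal(dim)
print(compose(left, right).shape)     # (50,)
```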
Absinth provides a novel unsupervised, graph-based approach to word sense induction. This work combines small-world co-occurrence networks with a graph propagation algorithm to induce per-word sense assignment vectors over a lexicon, which can be aggregated to classify whole snippets.
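As a rough illustration of the graph-based idea (not the Absinth implementation itself), the sketch below builds a co-occurrence network from a handful of toy snippets around a target word and reads each connected component of that neighbourhood as one induced sense; the actual system replaces this crude step with a proper graph propagation algorithm over small-world networks.

```python
import itertools
import networkx as nx

# Toy snippets for the ambiguous target word "bank".
snippets = [
    "the bank approved the loan and the mortgage",
    "the river bank was muddy after the flood",
    "the bank raised the interest rate on the loan",
    "we walked along the river near the muddy bank",
]
target = "bank"

# Build a word co-occurrence graph: one edge per within-snippet word pair.
graph = nx.Graph()
for snippet in snippets:
    words = {w for w in snippet.split() if w != target and len(w) > 3}
    graph.add_edges_from(itertools.combinations(sorted(words), 2))

# Crude stand-in for the propagation step: each connected component of the
# target's co-occurrence neighbourhood is read as one induced sense.
for i, component in enumerate(nx.connected_components(graph)):
    print(f"sense {i}: {sorted(component)}")
```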
The reuse of text has a longstanding history in science. In qualitative research, besides verbatim quotations, the techniques of paraphrasing, translation, and summarization are instrumental both to teaching and learning scientific writing and to gaining new scientific insights. In quantitative research, the use of templates as an efficient way of reporting new results on otherwise standardized workflows is common. Especially in the context of economization and quality control in scientific writing, the detection and analysis of text reuse plays a central role in understanding the writing process, characterizing different notions of authorship, and reflecting on editorial practices. Yet, while text reuse has been quantitatively studied on a small scale and for many scientific disciplines in isolation, few studies assess the phenomenon at scale or in an interdisciplinary fashion. To provide a solid new foundation for the investigation of scientific text reuse and its role in the process of scientific writing, we curate the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains more than 91 million cases of reused text passages found in 4.2 million unique open-access publications. Featuring a high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most salient shortcomings of previous datasets on scientific writing. Webis-STEREO-21 allows for tackling a wide range of research questions from different scientific backgrounds. This talk presents first insights gained from ongoing quantitative research based on the data and introduces a web-based analysis tool that enables interactive access, faceting, and exploration. The tool is designed to facilitate the qualitative analysis of the phenomenon and to give access to researchers who have not yet worked with scalable data analytics. Overall, Webis-STEREO-21 aims to provide a first-time grounding of the occurrence, importance, and function of reused text in scientific publications.
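As a taste of the kind of faceted analysis the tool supports, the following sketch groups a handful of mock reuse cases by discipline and reuse type with pandas; the column names and values are invented for illustration and do not reflect the actual Webis-STEREO-21 schema.

```python
import pandas as pd

# Mock reuse cases; the real dataset holds over 91 million with richer metadata.
cases = pd.DataFrame({
    "discipline": ["medicine", "medicine", "physics", "sociology", "physics"],
    "reuse_type": ["template", "paraphrase", "template", "quotation", "template"],
    "year":       [2014, 2017, 2015, 2019, 2016],
})

# Facet the cases by discipline and type of reuse.
print(cases.groupby(["discipline", "reuse_type"]).size().unstack(fill_value=0))
```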
I present preliminary research on compositional phrase embeddings using a novel Tree-LSTM autoencoder. I will outline the necessary theoretical background for grammar-theoretical, neurolinguistic, and NLP notions of (non-)compositionality and introduce the Tree-LSTM architecture. Finally, I will go over my experimental set-up and the work ahead.
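For a head start on the architecture part, below is a minimal sketch of a binary Tree-LSTM cell in the spirit of Tai et al. (2015), written in PyTorch; it composes two child states into a parent state and is only meant to convey the mechanics, not my actual autoencoder.

```python
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    """Composes the hidden/cell states of two children into a parent node."""

    def __init__(self, dim):
        super().__init__()
        self.iou = nn.Linear(2 * dim, 3 * dim)  # input, output, update gates
        self.f_left = nn.Linear(2 * dim, dim)   # forget gate for the left child
        self.f_right = nn.Linear(2 * dim, dim)  # forget gate for the right child

    def forward(self, h_left, c_left, h_right, c_right):
        children = torch.cat([h_left, h_right], dim=-1)
        i, o, u = torch.chunk(self.iou(children), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f_l = torch.sigmoid(self.f_left(children))
        f_r = torch.sigmoid(self.f_right(children))
        c = i * u + f_l * c_left + f_r * c_right
        h = o * torch.tanh(c)
        return h, c

cell = BinaryTreeLSTMCell(dim=50)
h, c = cell(*(torch.zeros(1, 50) for _ in range(4)))
print(h.shape)  # torch.Size([1, 50])
```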
In their influential 2016 paper on biases in common language identification systems, Su Lin Blodgett et al. use demographic information and geo-tagged tweets to adjust those classifiers to produce more accurate language predictions. Yet despite relying heavily on geographic information, the paper is completely devoid of maps. This presentation draws on two projects from 2018 and 2021 that use static and interactive maps to explain Blodgett et al.'s findings, and gives advice on when and how best to use maps for linguistics and science communication.
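A minimal sketch of the interactive-map side, assuming folium and a handful of invented tweet coordinates; it is only meant to show how little code a first map takes, not to reproduce either project.

```python
import folium

# Invented geo-tagged tweets: (latitude, longitude, predicted language).
tweets = [
    (33.75, -84.39, "en"),
    (29.76, -95.37, "en"),
    (40.71, -74.01, "es"),
]

# One interactive map of the US, one marker per tweet, coloured by prediction.
fmap = folium.Map(location=[39.8, -98.6], zoom_start=4)
for lat, lon, lang in tweets:
    folium.CircleMarker(
        location=[lat, lon],
        radius=6,
        color="blue" if lang == "en" else "red",
        popup=lang,
    ).add_to(fmap)

fmap.save("tweet_map.html")  # open in a browser to pan and zoom
```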
Python is an easy-to-learn programming language with a wealth of tools for linguistic research, from simple string manipulation and data cleaning to more advanced parsers and complex statistical algorithms. We will go over the basics of Python's features, look at spaCy and scikit-learn, and finally work through an example project. Some basic programming knowledge is advisable, but not strictly necessary to follow along.
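A small foretaste of the example project, assuming spaCy's en_core_web_sm model is installed (python -m spacy download en_core_web_sm): spaCy handles the linguistic preprocessing and scikit-learn fits a toy text classifier on a few invented sentences.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")

def lemmas(text):
    """Reduce a raw string to lowercase lemmas, dropping punctuation and digits."""
    return " ".join(token.lemma_.lower() for token in nlp(text) if token.is_alpha)

texts = ["I loved this film", "What a wonderful read",
         "Terrible, boring plot", "I hated every minute"]
labels = ["pos", "pos", "neg", "neg"]

classifier = make_pipeline(CountVectorizer(), LogisticRegression())
classifier.fit([lemmas(t) for t in texts], labels)
print(classifier.predict([lemmas("a truly wonderful film")]))  # ['pos']
```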
It seems to have become tradition to confront the ethics of data science mostly through reflection on controversy and righting past wrongs. While this approach might lead to greater awareness, and continual growth in the right direction is certainly to be applauded, the question lingers whether it is enough in a field that changes so rapidly from day to day and year to year. Framed in the major philosophical schools of ethics, from consequentialism to virtue ethics, this talk walks through best practices in ethical data collection that not only lead to more ethical project management, but also guide institutions and individuals towards lasting and preemptive change in the face of a field without clear role models.
The lack of an editing function in many instant messaging services gave rise to asterisk corrections: messages which contain only the correction intended for an earlier message, often preceded by an asterisk. These corrections take many forms, from simple substitutions to extensive clarifications, and can be approached from just as many computational perspectives. I will present several of these approaches and attempt to define the task of asterisk correction resolution within a bridging resolution framework.
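As a first, deliberately naive computational perspective, the sketch below resolves a single-token asterisk correction by swapping the corrected form in for the most similar token of the preceding message (using difflib from the standard library); real corrections, of course, go well beyond such substitutions.

```python
import difflib

# Toy chat log: the second message is an asterisk correction of the first.
original = "see you at the park tomorow at 6"
correction = "*tomorrow"

def resolve(original, correction):
    """Swap the corrected form in for the most similar token of the original.
    A naive single-token baseline, nowhere near full correction resolution."""
    corrected = correction.lstrip("*")
    tokens = original.split()
    target = difflib.get_close_matches(corrected, tokens, n=1, cutoff=0.0)[0]
    return " ".join(corrected if t == target else t for t in tokens)

print(resolve(original, correction))  # see you at the park tomorrow at 6
```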
With the advent of ever more complicated machine learning algorithms employed in systems used by millions of users every day, recent research has shown that the biases and stereotypes we put into these "black boxes" are reflected in the finished product. From recommendation systems and image labelling to language classification and recidivism prediction, there is no easy way to deal with data that is, often subliminally, racist or sexist. So far, solutions are, if available at all, tailored only to specific and clearly outlined domains and problems. Furthermore, one type of bias is not like the other, making it difficult to translate approaches for sexism to racism or vice versa. A basic understanding of implicit bias and how it shapes the data we use is key to predicting and avoiding the amplification of prejudice within an increasingly connected online world.
Adversarial examples have been making headlines in the computer vision community for a few years now, but did not seem to have much impact on natural language processing until very recently. Small changes to an image, mostly invisible to the human eye, can fool a neural network into classifying a turtle as a gun, or a stop sign as a green light. Of course, a single sentence has significantly fewer features to perturb than a 512x512 colour image, yet machines can still be fooled by slight rephrasing and by exploiting real-world biases that have crept into the system. This talk gives a brief introduction to the technology and dangers of adversarial attacks and delves into the possible implications for testing and deploying natural language processing systems.
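To make the "fewer features to perturb" point tangible, here is a toy sketch (data and model invented for illustration): a bag-of-words sentiment classifier loses the one feature it relies on when a single character of the decisive word is swapped for a look-alike, and its confidence drops back towards chance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data for a tiny sentiment classifier.
train = ["great movie", "wonderful acting", "awful movie", "boring and awful"]
labels = ["pos", "pos", "neg", "neg"]
model = make_pipeline(CountVectorizer(), LogisticRegression()).fit(train, labels)

original = "that was awful"
perturbed = original.replace("awful", "awfu1")  # '1' for 'l', barely visible

# The misspelt token falls out of the model's vocabulary, so the classifier
# loses the only evidence it had and its confidence drifts back towards 50/50.
for text in (original, perturbed):
    print(text, dict(zip(model.classes_, model.predict_proba([text])[0].round(2))))
```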
One of the oldest human inventions, writing has been around for a very long time. For the longest time, the only governing body of what could be written was one's own wrist and writing utensil. Starting with the printing press, this changed, as people had to agree on what would be part of a character set and what would not. For computers, this task has mostly been taken over by the Unicode standard, just one in a long string of international rulebooks for graphemic standardisation. But what makes a good international standard when it comes to writing systems? Is Unicode the be-all and end-all of what we can expect of alphabets in the digital age, or is there still more to come? If we were to create a new standard, could emojis, of all things, be the way forward? In this workshop we will dive deep into some very diverse alphabets, explore the cultural and historic significance of the Unicode standard, and examine its advantages and shortcomings.
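For a taste of what we will poke at, here is a short Python detour through the standard library's unicodedata module: named code points, the difference between composed and decomposed forms, and an emoji that is really five code points stitched together with zero-width joiners.

```python
import unicodedata

# Every character the standard covers has a code point and a name.
print(hex(ord("ß")), unicodedata.name("ß"))  # 0xdf LATIN SMALL LETTER SHARP S

# The same visible character can be encoded in two different ways...
composed = "\u00e9"       # é as a single precomposed code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent
print(composed == decomposed)                                  # False
# ...which is why normalisation forms like NFC exist.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True

# A single emoji glyph can be several code points joined by zero-width joiners.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"  # woman + ZWJ + woman + ZWJ + girl
print(len(family))  # 5 code points render as one family emoji
```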