Natural Language Processing

Academic Year 2022/2023

Visit the UniBO website of the lecture for official and administrative details.

Prerequisites

A gentle introduction to Python

This topic won't be covered in class.

if you are a student of TraTec:
  you learned the intro to Python in PBR
else: 
  check the slides, notebooks, and 2021 video recordings

All materials are available at https://github.com/TinfFoil/learning_dit_python .

Regardless of whether you attended either of the introductions, I suggest that you do (or revisit) all the exercises ASAP, before the next session.

Course contents

Although the contents may be (slightly) adapted to the students' skills and interests, the general structure of the course is as follows.

1. Introduction to Natural Language Processing

[29/09/22] Slides

7. Intro to NN

[27/10/22] Slides on the perceptron
[27/10/22] Notebook on the perceptron
[03/11/22] Slides introducing neural networks and keras
[03/11/22] Notebook introducing neural networks and keras

8. Word Embeddings

[08/11/22] Slides on word2vec
[10/11/22] Slides on word embeddings
[10/11/22] Notebook

9. Doc2Vec

[15/11/22] Slides
[15/11/22] Notebook
[15/11/22] Project reminder

10. Convolutions for text

[17/11/22] Slides
[17/11/22] Notebook
[22/11/22] Lesson cancelled
[24/11/22] Lesson cancelled
[29/11/22] CNN

11. Text is Sequential / LSTM

[01/12/22] Slides on RNN
[01/12/22] Notebook on RNN
[06/12/22] Slides on BiRNN and LSTM
[06/12/22] Notebook on BiRNN
[06/12/22] Notebook on LSTM

12. Text generation

[15/12/22] Slides on characters and generation
[15/12/22] Notebook on characters
[15/12/22] Notebook on generation

13. Intro to Seq2Seq and Transformers; Closing Remarks

[20/12/22] Slides for part one
[22/12/22] Slides for part two

Projects

The evaluation is based on a project. Look for inspiration in the projects presented in previous years.

Some project ideas

Although you are expected to apply the acquired knowledge to a problem of your own interest, here are some ideas in case you find yourself lost.

Hate Speech

Sarcasm identification in implicit misogyny

This project consists of four main tasks. (a) Train a model for implicit hate speech identification using the corpus from ElSherief et al. (b) Train a model on sarcasm/humour using (for instance) this Kaggle dataset. (c) Apply both resulting models to the AMI 2018 dataset. (d) Analyse whether the AMI dataset contains cases of implicit misogyny, sarcasm/humour, or both.
Dataset detective: where does this instance come from?

In this project, you will train a model to identify the dataset an instance comes from, rather than to solve the task the dataset was intended for. (a) Train and evaluate one-vs-the-rest models for each of the four datasets. (b) Train and evaluate a multi-class model for all instances in all four datasets. (c) Repeat (a) and (b), but this time consider only the positive (or only the negative) instances. (d) Analyse the outcome: can you build a classifier that differentiates datasets? The four datasets are on hate speech and offensive language (2 partitions), aggression, and toxicity.
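For the dataset-detective project, the first practical step is relabelling: every instance gets the name of its source dataset as its new label. The sketch below is one possible way to build the multi-class, one-vs-the-rest, and positive-only (or negative-only) views. The function names, the dataset keys, and the toy instances are all invented for illustration; they are not part of the actual corpora.

```python
from typing import Dict, List, Optional, Tuple

# Each dataset is a list of (text, task_label) pairs; task_label is assumed
# to be 1 for the positive class (e.g. hateful) and 0 for the negative class.
Dataset = List[Tuple[str, int]]


def multiclass_view(datasets: Dict[str, Dataset],
                    keep_label: Optional[int] = None) -> List[Tuple[str, str]]:
    """Relabel every instance with the name of its source dataset.

    If keep_label is given, only instances with that task label are kept.
    """
    return [
        (text, name)
        for name, instances in datasets.items()
        for text, task_label in instances
        if keep_label is None or task_label == keep_label
    ]


def one_vs_rest_view(datasets: Dict[str, Dataset],
                     target: str,
                     keep_label: Optional[int] = None) -> List[Tuple[str, int]]:
    """Binary view: 1 if the instance comes from `target`, 0 otherwise."""
    return [
        (text, int(source == target))
        for text, source in multiclass_view(datasets, keep_label)
    ]


# Toy placeholder data, invented for illustration only.
toy = {
    "hate_a": [("text one", 1), ("text two", 0)],
    "hate_b": [("text three", 1)],
    "aggression": [("text four", 0)],
    "toxicity": [("text five", 1), ("text six", 0)],
}

print(multiclass_view(toy)[:2])
print(one_vs_rest_view(toy, "toxicity", keep_label=1))
```

The resulting (text, label) pairs can then be fed to any of the classifiers seen in class. Filtering by task label before relabelling keeps steps (a) and (b) unchanged when repeating them for step (c).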

Do not translate

Identifying translated fragments in Wikipedia and press

Given a pair of Wikipedia or newspaper articles, spot those text fragments that they have in common, because one is a translation of the other or because they come from a common source.
Identifying translations by non-native speakers

Given a text that has been presumably translated, identify whether the person that carried out the work is a native speaker of the target language.
Identifying the L2 level for a text

Create a model to score the complexity/readability/comprehensibility for a student at a given L2 level. The scores are A1, A2, B1, B2, C1 and C2. This is an ordinal regression task.
Spot definitions within Wikipedia articles in multiple languages

Given a Wikipedia article, spot the sentence with a definitional context. Do this in Wikipedia articles about the same topic in multiple languages.
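For the L2-level project above, what makes the task ordinal is that the six levels are ordered, so confusing B1 with B2 should cost less than confusing B1 with C2. The sketch below shows one common way to encode this, assuming a cumulative binary encoding (one yes/no question per threshold) and mean absolute error in levels as the evaluation measure. The helper names are hypothetical, and other formulations of ordinal regression are equally valid.

```python
from typing import List

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]  # ordered from easiest to hardest
RANK = {level: i for i, level in enumerate(LEVELS)}


def to_cumulative_targets(level: str) -> List[int]:
    """Encode a level as K-1 binary answers to 'is the text above level k?'."""
    rank = RANK[level]
    return [int(rank > k) for k in range(len(LEVELS) - 1)]


def from_cumulative_scores(scores: List[float], threshold: float = 0.5) -> str:
    """Decode K-1 binary scores back to a level by counting the 'yes' answers."""
    return LEVELS[sum(score > threshold for score in scores)]


def mean_absolute_error(gold: List[str], predicted: List[str]) -> float:
    """Average distance in levels between gold and predicted labels."""
    return sum(abs(RANK[g] - RANK[p]) for g, p in zip(gold, predicted)) / len(gold)


print(to_cumulative_targets("B1"))
print(from_cumulative_scores([0.9, 0.8, 0.3, 0.1, 0.0]))
print(mean_absolute_error(["A1", "B2", "C2"], ["A2", "B2", "B2"]))
```

With this encoding, a model with five sigmoid outputs can be trained on the cumulative targets, and its scores decoded back into a single level at prediction time.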

Analysis of TV series (with DAR)

The main object of study is medical TV dramas, in particular Grey’s Anatomy.

Available resources: we have a manual segmentation of the audiovisual text (segment = portion of video characterised by spatio-temporal-action continuity) and codifications of each segment with the kind of narrative (SP = sentimental plot, PP = professional plot, MC = medical case plot).

Narrative identification

Identify the narrative line of short and long summaries from the Grey’s Anatomy fandom site in order to trace the evolution of the different storylines.

Shared Tasks

SemEval 2023

This year there are 12 tasks at SemEval. I recommend having a look at:

Task 2 MultiCoNER II. Multilingual Complex Named Entity Recognition

Identify medical terms, locations, creative works, groups, persons and products in multiple languages (incl. Italian, French, Spanish, German, and English).
Task 3 Detecting the Genre, the Framing, and the Persuasion Techniques in Online News in a Multi-lingual Setup

Determine whether a news article is an opinion piece, aims at objective news reporting, or is a satire piece (there are two other tasks). The task runs in English, French, German, Italian, Polish and Russian.
Task 4 Human Value Detection

Given a textual argument and a human value category (e.g., humility, concern), classify whether or not the argument draws on that category.
Task 10 Explainable Detection of Online Sexism

A. Predict whether a post is sexist or not sexist. B. Identify whether a sexist post is a (1) threat, (2) derogation, (3) animosity, or (4) prejudiced discussion. (There is a third task.)

CLEF 2023

CLEF focuses on information retrieval and access in multilingual and multimodal settings. Consider:

CheckThat! lab on Check-Worthiness, Subjectivity, Political Bias, Factuality, and Authority of News Articles and their Sources

It offers tasks on identifying the checkworthiness of a text (+image) item, the level of subjectivity of a news article (incl. Italian), the political bias of a news article, etc.
eRisk lab on Early Risk Prediction on the Internet

The challenge consists of ranking sentences from a collection of user writings according to their relevance to one of 21 depression symptoms. It also includes a task on pathological gambling and eating disorders.
EXIST lab on sEXism Identification in Social neTworks

Decide whether a tweet is sexist (or describes a sexist behaviour), the intention of the tweet and the type of sexism.
JOKER lab on Automatic Wordplay Analysis

Detect, interpret, and translate puns (all three tasks are offered in English, French and Spanish).

Standard research

Performing research on propaganda identification in languages other than English. For inspiration, see this IPM paper, this EMNLP paper, or this SemEval shared task.
Estimating the complexity of a text for a non-native speaker. For inspiration, see READ-IT
Analysing the quality of Wikipedia articles across languages

Readings/Bibliography

Core

Hobson Lane, Cole Howard, Hannes Hapke (2019). Natural Language Processing in Action: Understanding, analyzing, and generating text with Python. Manning Publications.

Optional

Dan Jurafsky and James H. Martin (2020). Speech and Language Processing. 3rd ed. draft, December 2020.
Dirk Hovy (2021). Text Analysis in Python for Social Scientists. Cambridge University Press.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python
Yoav Goldberg. (2017). Neural Network Methods for Natural Language Processing (G. Hirst, ed.). Morgan & Claypool Publishers.
Emily M. Bender (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Kenneth Ward Church. Unix for poets .

Teaching methods

The course is a combination of seminar and practical sessions. In either case, active participation of the students is expected. Assuming you know the basics of programming (e.g., by completing the Python course in Topic 0), we will cover a (practical) description of diverse models and tasks.

In order to succeed, the student has to carry out regular homework, which comes in the form of small exercises.

Evaluation

The student will work on addressing a problem within her own research interests with the knowledge acquired during the course. Upon agreement of the topic, the student will work on solving the problem and will produce a written report.

The final evaluation will be computed as a combination of the report and the oral exam on it.

Important points

Your project should be submitted 1 week before the appello to be considered valid.
Do you want to target 30L? Conference quality is a good way (but it is not the only one!). Talk to me well in advance if you aim for this, as it would require a heavy commitment on my side to reach the necessary quality.

Teaching tools

Seminars will be carried out with slides, and coding will be carried out with Jupyter notebooks. Exercises will be assigned continuously throughout the course.

Office hours

See my UniBO website

Previous final projects (submitted by September 2022)

2021-2022

Hate Speech Detection in Incel Online Spaces
Student: Gajo, P.
Fishing for catfishes: using a model trained on Twitter data to predict author gender in Reddit posts
Student: Kovacs, M.

2020-2021

Assessing Semantic Similarity between Original Texts and Machine Translations
Student: Hopkins, D.
Definition extraction on food-related Wikipedia articles
Student: Martinelli, M.

Identifying Characters’ Lines in Original and Translated Plays. The case of Golden and Horan’s Class
Student: Galletti, E.
Classifying An Imbalanced Dataset with CNN, RNN, and LSTM
Student: Yu, X.

2019-2020

AriEmozione: Identifying Emotions in Opera Verses
Students: Fernicola F. and Zhang S.
Developed under CRICC ; published in CLiC-it 2020
UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo
Student: Muti, A.
Top-performing model in Evalita’s 2020 AMI shared task

For the record

You can visit previous editions of the course:

2021-2022 - Computational Linguistics - 6 cfu
2020-2021 - Computational Linguistics - 6 cfu
2019-2020 - Computational Linguistics - 6 cfu