Academic Year 2021/2022
Visit the UniBO website for official and administrative details.
Although the contents may be (slightly) adapted to the students' skills and interests, the general structure of the course will be as follows.
The evaluation is based on a project. Look for inspiration in the projects presented in previous years.
Although you are expected to apply the acquired knowledge to a problem of your own interest, here are some ideas in case you find yourself lost.
Sarcasm identification in implicit misogyny
This project consists of four main tasks. (a) Train a model for implicit hate speech identification using the corpus from ElSherief et al. (b) Train a model for sarcasm/humour detection using (for instance) this Kaggle dataset. (c) Apply both resulting models to the AMI 2018 dataset. (d) Analyse whether the AMI dataset contains cases of implicit misogyny and/or sarcasm/humour.
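Steps (c) and (d) amount to running both classifiers over the same texts and counting how often their decisions co-occur. The following is only a minimal sketch of that bookkeeping: `toy_hate_model` and `toy_sarcasm_model` are hypothetical stand-ins for the models trained in (a) and (b), and the placeholder texts with `<H>` and `<S>` markers are invented for illustration, not taken from the AMI data.

```python
from collections import Counter
from typing import Callable, Iterable

# A classifier here is any function mapping a text to True/False.
Classifier = Callable[[str], bool]


def cross_tabulate(
    texts: Iterable[str], hate_model: Classifier, sarcasm_model: Classifier
) -> Counter:
    """Count texts per (hate decision, sarcasm decision) pair."""
    table: Counter = Counter()
    for text in texts:
        table[(hate_model(text), sarcasm_model(text))] += 1
    return table


# Hypothetical stand-ins: they only look for an artificial marker.
def toy_hate_model(text: str) -> bool:
    return "<H>" in text


def toy_sarcasm_model(text: str) -> bool:
    return "<S>" in text


# Invented placeholder texts, one per combination of markers.
texts = ["<H> <S> example", "<S> example", "<H> example", "plain example"]

table = cross_tabulate(texts, toy_hate_model, toy_sarcasm_model)
for (is_hate, is_sarcasm), count in sorted(table.items()):
    print(f"implicit hate={is_hate!s:5} sarcasm={is_sarcasm!s:5} count={count}")
```

The cell where both decisions are `True` is the one of interest for (d); with real models its size should be read alongside each model's error rate on its own test set.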
Dataset detective: where does this instance come from?
In this project, you will train a model to identify the dataset an instance comes from, rather than to solve the task the dataset was intended for.
(a) Train and evaluate one-vs-the-rest models for each of the four datasets.
(b) Train and evaluate a multi-class model for all instances in all four datasets.
(c) Repeat (a) and (b), but this time consider only positive (negative) instances.
(d) Analyse the outcome: can you build a classifier that differentiates datasets?
The four datasets are on hate speech and offensive language (2 partitions), aggression, and toxicity.
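Before any model is trained, the instances have to be relabelled with their dataset of origin. A minimal sketch of that preparation follows, assuming each dataset has already been loaded as a list of texts. The dataset names and toy texts are placeholders, and `pool_with_origin` and `one_vs_rest_labels` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def pool_with_origin(datasets: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten {dataset_name: texts} into (text, dataset_name) pairs."""
    return [(text, name) for name, texts in datasets.items() for text in texts]


def one_vs_rest_labels(pool: List[Tuple[str, str]], target: str) -> List[int]:
    """Binary labels for step (a): 1 if the instance comes from `target`."""
    return [1 if origin == target else 0 for _, origin in pool]


# Placeholder data standing in for the four datasets.
toy = {
    "hate_a": ["text 1", "text 2"],
    "hate_b": ["text 3"],
    "aggression": ["text 4"],
    "toxicity": ["text 5", "text 6"],
}

pool = pool_with_origin(toy)

# Step (b): the multi-class target is simply the dataset of origin.
multiclass_labels = [origin for _, origin in pool]

# Step (a): one binary target vector per dataset.
ovr = {name: one_vs_rest_labels(pool, name) for name in toy}
```

For step (c), the same helpers can be applied after filtering each dataset down to its positive (or negative) instances.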
Identifying the L2 level for a text
Create a model to score the complexity/readability/comprehensibility of a text for a student at a given L2 level. The scores are B1, B2, C1 and C2.
Spot definitions within Wikipedia articles in multiple languages
Given a Wikipedia article, spot the sentence with a definitional context. Do this in Wikipedia articles about the same topic in multiple languages.
The main object of study is medical TV dramas, in particular Grey’s Anatomy.
Available resources: we have a manual segmentation of the audiovisual text (segment = portion of video characterised by spatio-temporal-action continuity) and codifications of each segment with the kind of narrative (SP = sentimental plot, PP = professional plot, MC = medical case plot).
Subtitles and audiovisual text
Analysing the subtitles and their relationship with the segmentation and codification of the audiovisual text.
Linguistic analysis of subtitles
Analysing the subtitles (for which we lack character tags) and exploring them from a linguistic point of view.
Narrative identification
Identify the narrative line of short and long summaries from the Grey’s Anatomy fandom site to trace the evolution of the different storylines.
Text Complexity DE Challenge 2022
Developing machine-learning-based regression models to predict the complexity of a German sentence for learners of German at the B level. More information at the DE challenge website. Release of the test set on 20th June; deadline on 4th July.
More GermEval tasks should come soon.
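For the sentence-complexity regression task, a first baseline could start from surface features and a constant prediction. This is only a sketch under assumptions: the two features are illustrative, whitespace tokenisation is a simplification, and RMSE is used here as a generic regression metric, not necessarily the one adopted by the challenge.

```python
import math
from typing import Dict, List


def surface_features(sentence: str) -> Dict[str, float]:
    """Two illustrative features: token count and mean token length."""
    tokens = sentence.split()  # whitespace tokenisation, a simplification
    n = len(tokens)
    mean_len = sum(len(t) for t in tokens) / n if n else 0.0
    return {"n_tokens": float(n), "mean_word_len": mean_len}


def rmse(gold: List[float], predicted: List[float]) -> float:
    """Root mean squared error between gold and predicted scores."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    squared = sum((g - p) ** 2 for g, p in zip(gold, predicted))
    return math.sqrt(squared / len(gold))


def mean_baseline(train_scores: List[float], n_test: int) -> List[float]:
    """Predict the mean training score for every test sentence."""
    mean = sum(train_scores) / len(train_scores)
    return [mean] * n_test


features = surface_features("Das ist ein Haus")
baseline_error = rmse([1.0, 3.0], mean_baseline([1.0, 3.0], 2))
```

Any regression model built on such features should at least beat the error of `mean_baseline` on held-out data.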
Irony and stereotypes spreaders
Given a Twitter feed in English, determine whether its author spreads irony and stereotypes. Input: timelines of authors sharing irony and stereotypes towards, for instance, women or the LGTB community. More details at the PAN website; deadline around 24 May.
Detection of Aggressive and Violent INCIdents from Social Media in Spanish
Violent event identification: determine whether a given tweet is associated with a violent incident or not (binary classification). Violent event category recognition: recognize the crime category (see above) to which a given tweet belongs (multi-class classification). More information at the DA-VINCIS website. The test set will be released on 10th May; deadline around 21st May.
PoliticEs: Spanish Author Profiling for Political Ideology
Task 1: identifying political ideology from a text collection (binary). Task 2: identifying the gender and the profession as demographic traits from a set of tweets of a user. More information at the IBERLEF 2022 website. Test set announced for April 18th. Deadline around May 4th.
Recommendation System, Sentiment Analysis and Covid Semaphore Prediction for Mexican Tourist Texts
Recommendation system: given a TripAdvisor tourist and a Mexican tourist place, predict the degree of satisfaction in range [1, 5] that the tourist will have when visiting that place. Sentiment analysis: given an opinion about a Mexican tourist place, determine the polarity in range [1, 5] and the type of opinion among hotel, restaurant or attraction. Epidemiological semaphore prediction: given the news related to covid of a Mexican region, determine the semaphore color of the weeks 0, 2, 4 and 8 in the future. More information at the Rest-Mex 2022 website. Test set released on 13th April. Deadline around May 4th.
sEXism Identification in Social neTworks
Sexism Identification: decide whether a tweet contains sexist expressions or behaviours. Sexism Categorization: once a message has been classified as sexist, categorize the message according to the type of sexism (5 classes). More information at the EXIST 2022 website. Test set release on 22 March; deadline on 12 April.
Many others
Consider the problems proposed in EVALITA 2020. There are age and gender profiling, misogyny identification, and complexity evaluation.
Dan Jurafsky and James H. Martin. Speech and Language Processing. 3rd ed. draft. Dec. 2020.
Dirk Hovy. Text Analysis in Python for Social Scientists. Cambridge University Press. 2021.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python.
Yoav Goldberg. Neural Network Methods for Natural Language Processing (G. Hirst, ed.). Morgan & Claypool Publishers. 2017.
Emily M. Bender. Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. 2013.
Kenneth Ward Church. Unix for Poets.
The course is a combination of seminar and practical sessions. In both cases, active participation of the students is expected. Assuming you know the basics of programming (e.g., by completing the Python course in Topic 0), we will cover a (practical) description of diverse models and tasks.
The student will work on addressing a problem within her own research interests with the knowledge acquired during the course. Upon agreement of the topic, the student will work on solving the problem and will produce a written report.
The final evaluation will be computed as a combination of the report and the oral exam on it.
Seminars will be carried out with slides and coding will be carried out with Jupyter notebooks. Exercises will be assigned continuously throughout the course.
See my UniBO website
Identifying Characters’ Lines in Original and Translated Plays. The case of Golden and Horan’s Class
Student: Galletti, E.
[pdf]
Classifying An Imbalanced Dataset with CNN, RNN, and LSTM
Student: Yu, X.
[pdf]
AriEmozione: Identifying Emotions in Opera Verses
Students: Fernicola F. and Zhang S.
Developed under CRICC; published in CLiC-it 2020
[pdf] [video]
UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo
Student: Muti, A.
Top-performing model in Evalita’s 2020 AMI shared task
[pdf] [video]
You can visit previous editions of the course:
2021 - Computational Linguistics - 6 cfu
2020 - Computational Linguistics - 6 cfu