Academic Year 2023/2024
Visit the UniBO website of the lecture for official and administrative details.
This topic won't be covered in class.
if you are a student of TraTec:
you learned the intro to Python in PBR
elif you are a student of SpecTra (1st year):
wait for the Advanced Professional Skills Lab, next semester
else:
check the slides, notebooks, and 2021 video recordings
The materials are available at https://github.com/TinfFoil/learning_dit_python.
Regardless of whether you attended either of the introductions, I suggest that you do (or revisit) all the exercises ASAP.
Although the contents may be (slightly) adapted to the students' skills and interests, the general structure of the course is as follows.
* [05/12/23] Slides for part one
* [11/12/23] CLiC-it 2023 tutorial (we will pay a visit to the cool materials from D. Croce and C.D. Hromei)
The evaluation is based on a project. Look for inspiration in the projects presented in previous years.
Although you are expected to apply the acquired knowledge to a problem of your own interest, here are some ideas in case you find yourself lost (consider them sketches for you to elaborate further).
[If you saw a project proposal that is no longer here, it belongs to a previous edition of the lesson and has already been (partially) addressed. You can visit previous editions of the lesson’s website for reference.]
What do restaurants in Romagna serve?
We have a dataset of menus from 200+ restaurants, for a total of 7000+ entries. We would like to cluster dishes by their main item or ingredient (e.g. ‘cappelletti’ vs. ‘lasagne’, regardless of what else is in the dish), obtain the frequency of each item and, if possible, navigate or list all the variations. The concept of “ingredient” requires extracting and normalizing words (e.g. lasagna = { lasagna, lasagne, lasagnetta }).
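The normalization step can be sketched roughly as follows. This is only an illustration: the variant table `VARIANTS`, the helper `main_items` and the menu entries are made up for the example and are not part of the actual dataset, where variants would have to be discovered (e.g. by lemmatization or string similarity) rather than listed by hand.

```python
from collections import Counter

# Hypothetical variant table: canonical item -> surface forms seen in menus.
VARIANTS = {
    "lasagna": {"lasagna", "lasagne", "lasagnetta"},
    "cappelletto": {"cappelletto", "cappelletti"},
}

# Invert the table so that every surface form points to its canonical item.
CANONICAL = {form: canon for canon, forms in VARIANTS.items() for form in forms}


def main_items(entry):
    """Return the canonical items mentioned in one menu entry."""
    tokens = entry.lower().replace(",", " ").split()
    return [CANONICAL[t] for t in tokens if t in CANONICAL]


# Made-up menu entries, for illustration only.
menu = [
    "Lasagne al forno",
    "Cappelletti in brodo",
    "Lasagnetta verde al ragù",
    "Cappelletti al ragù",
]

# Frequency of each canonical item, regardless of what else is in the dish.
counts = Counter(item for entry in menu for item in main_items(entry))
print(counts.most_common())
```

Grouping the original entries by their canonical item (instead of just counting them) would give the list of variations for each dish.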
After this step, a language model should be built to perform auto-completion. Given the dataset of menus, build a language model and a demo command-line app that uses it to predict or suggest the next word(s).
Cooking is hard: predicting the difficulty level of a recipe.
Given a recipe, including the ingredients and the process to cook a dish, automatically determine its difficulty of preparation (and possibly the preparation time).
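As a starting point, next-word suggestion can be sketched with a simple bigram count model. This is a minimal sketch, not the expected solution: the functions `train_bigrams` and `suggest`, the `<s>` start marker and the toy entries are assumptions made for the example, and a real project would likely need smoothing, longer contexts or a neural model.

```python
from collections import Counter, defaultdict


def train_bigrams(entries):
    """Count, for every word, which words follow it in the menu entries."""
    model = defaultdict(Counter)
    for entry in entries:
        tokens = ["<s>"] + entry.lower().split()  # <s> marks the entry start
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def suggest(model, word, k=3):
    """Return up to k most frequent continuations of `word`."""
    followers = model.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


# Made-up menu entries, for illustration only.
menu = [
    "tagliatelle al ragù",
    "lasagne al forno",
    "cappelletti al ragù",
    "cappelletti in brodo",
]

model = train_bigrams(menu)
print(suggest(model, "al"))
```

A command-line demo could wrap `suggest` in a loop that reads the text typed so far with `input()` and prints the suggestions for its last word.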
Dataset detective: where does this instance come from?
In this project, you will train a model to identify the dataset an instance comes from, rather than to solve the task the dataset was intended for.
(a) Train and evaluate one-vs-the-rest models for each of the four datasets.
(b) Train and evaluate a multi-class model for all instances in all four datasets.
(c) Repeat (a) and (b), but this time consider only positive (or only negative) instances.
(d) Analyse the outcome: can you build a classifier that differentiates datasets?
The four datasets are on hate speech and offensive language (2 partitions), aggression, and toxicity.
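The difference between settings (a) and (b) lies in how the labels are built: the target is the dataset of origin, not the original annotation. A minimal sketch, in which the dataset names and the helper functions are hypothetical placeholders:

```python
# Hypothetical names for the four datasets; the real ones may differ.
DATASETS = ["hate_offensive_1", "hate_offensive_2", "aggression", "toxicity"]


def one_vs_rest_labels(sources, target):
    """Setting (a): 1 if the instance comes from `target`, 0 otherwise."""
    return [1 if s == target else 0 for s in sources]


def multiclass_labels(sources):
    """Setting (b): one integer class per dataset of origin."""
    index = {name: i for i, name in enumerate(DATASETS)}
    return [index[s] for s in sources]


# Dataset of origin for five made-up instances.
sources = ["toxicity", "aggression", "hate_offensive_1", "toxicity",
           "hate_offensive_2"]

print(one_vs_rest_labels(sources, "toxicity"))
print(multiclass_labels(sources))
```

For setting (c), the instances would first be filtered by their original annotation (only positive or only negative) before building the same labels.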
Identifying translated fragments in Wikipedia and the press
Given a pair of Wikipedia or newspaper articles, spot the text fragments they have in common, either because one is a translation of the other or because they come from a common source.
Identifying translations by non-native speakers
Given a text that has presumably been translated, identify whether the person who carried out the work is a native speaker of the target language.
Identifying the L2 level for a text
Create a model to score the complexity/readability/comprehensibility of a text for a student at a given L2 level. The scores are A1, A2, B1, B2, C1 and C2. This is an ordinal regression task.
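What makes the task ordinal is that the levels are ordered, so predicting B2 for a B1 text is a smaller error than predicting C2. One common way to exploit this (an option, not a requirement) is to encode each level as a series of binary "is it above this level?" targets and to evaluate with the mean absolute error over level ranks. The helper names below are made up for this sketch:

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def to_threshold_targets(level):
    """Encode a level as 5 binary targets: is it above A1, A2, ..., C1?"""
    rank = RANK[level]
    return [1 if rank > i else 0 for i in range(len(LEVELS) - 1)]


def from_threshold_predictions(bits):
    """Decode by counting how many thresholds the text passes."""
    return LEVELS[sum(bits)]


def mean_absolute_error(gold, predicted):
    """Average distance, in levels, between gold and predicted scores."""
    errors = [abs(RANK[g] - RANK[p]) for g, p in zip(gold, predicted)]
    return sum(errors) / len(errors)


print(to_threshold_targets("B1"))
print(from_threshold_predictions([1, 1, 0, 0, 0]))
print(mean_absolute_error(["A1", "B2", "C2"], ["A2", "B2", "C1"]))
```

With this encoding, one binary classifier can be trained per threshold; a plain multi-class classifier is also a reasonable baseline, but it ignores the order of the levels.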
SemEval 2024
This year there are 12 tasks at SemEval. I recommend having a look at:
Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection
It offers tasks on the identification of machine-generated text with different levels of complexity.
Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)
It offers two tasks on identifying the emotion of an utterance and of different utterances within a dialogue.
CLEF 2024
CLEF focuses on information retrieval and access in multilingual and multimodal settings. Consider:
CheckThat! lab on Checkworthiness, Subjectivity, Persuasion, Roles, Authorities and Adversarial Robustness
It offers tasks on identifying the checkworthiness of a text (+image) item, the level of subjectivity of a sentence (incl. Italian), and the use of persuasion in text, among others.
EXIST lab on sEXism Identification in Social neTworks
Decide whether a tweet is sexist (or describes sexist behaviour), and identify the intention of the tweet and the type of sexism.
(the tasks are still in the shaping stage)
Dan Jurafsky and James H. Martin. 2020. Speech and Language Processing. 3rd ed. draft (December).
Dirk Hovy. 2021. Text Analysis in Python for Social Scientists. Cambridge University Press.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python.
Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing (G. Hirst, ed.). Morgan & Claypool Publishers.
Emily M. Bender. 2013. Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Kenneth Ward Church. Unix for Poets.
The course is a combination of seminar and practical sessions. In both cases, active participation of the students is expected. Assuming you know the basics of programming (e.g., by completing the Python course in Topic 0), we will cover a (practical) description of diverse models and tasks.
In order to succeed, the student has to carry out regular homework, which comes in the form of small exercises.
The student will work on addressing a problem within her own research interests with the knowledge acquired during the course. Upon agreement on the topic, the student will work on solving the problem and will produce a written report.
The final evaluation will be computed as a combination of the report and the oral exam on it.
Seminars will use slides and coding will be done in Jupyter notebooks. Exercises will be assigned throughout the course.
See my UniBO website
Hate Speech Detection in Incel Online Spaces
Student: Gajo, P.
Fishing for catfishes: using a model trained on Twitter data to predict author gender in Reddit posts
Student: Kovacs, M.
Assessing Semantic Similarity between Original Texts and Machine Translations
Student: Hopkins, D.
Definition extraction on food-related Wikipedia articles
Student: Martinelli, M.
Identifying Characters’ Lines in Original and Translated Plays. The case of Golden and Horan’s Class
Student: Galletti, E.
Classifying An Imbalanced Dataset with CNN, RNN, and LSTM
Student: Yu, X.
AriEmozione: Identifying Emotions in Opera Verses
Students: Fernicola F. and Zhang S.
Developed under CRICC; published in CLiC-it 2020
UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo
Student: Muti, A.
Top-performing model in Evalita’s 2020 AMI shared task
You can visit previous editions of the course:
2022-2023 - Natural Language Processing - 6 cfu
2021-2022 - Computational Linguistics - 6 cfu
2020-2021 - Computational Linguistics - 6 cfu
2019-2020 - Computational Linguistics - 6 cfu