Natural Language Processing
Academic Year 2024/2025
(Front-page illustration produced with DeepAI’s tool in October 2024, using the prompt "natural language processing class for translation and technology masters".)
Visit the UniBO website of the lecture for official and administrative details.
Prerequisites
A gentle introduction to Python
This topic won't be covered in class.
if you are a student of TraTec:
    you had the intro to Python in PBR
elif you are a student of SpecTra:
    you had the intro to Python in APS
else:
    check the slides, notebooks, and 2021 video recordings
Regardless, you can find the materials on virtuale.
Whether or not you attended either of the introductions, I suggest that you do (or revisit) all the exercises as soon as possible.
Homework
Homework is handled through virtuale. No other content is expected to be shared there. On 09/10, you should have received the access password from me; if you did not, ping me. Homework has a hard deadline.
Course contents
Although the contents may be (slightly) adapted to the students' skills and interests, the general structure of the course is as follows.
1. Introduction to Natural Language Processing
- MO 30/09/24 Slides 1. Introduction
2. Words and the vector space model
- WE 02/10/24 Slides 2. Tokens and normalisation
- WE 02/10/24 Notebook 2. Tokens and normalisation
- WE 09/10/24 Slides 3. Vector Space Model
- WE 09/10/24 Notebook on VSM 3. Vector Space Model
3. Rule-based and Naïve Bayes
- TH 10/10/24 Slides 4. Rule-based sentiment analysis and Naive Bayes
- TH 10/10/24 Notebook on RB sentiment 4. Rule-based sentiment analysis
- WE 16/10/24 Notebook on Naïve Bayes 5. Naive Bayes
4. Word vectors
- TH 17/10/24 Slides 6. Term Frequency–Inverse Document Frequency
- WE 23/10/24 (originally TH 17/10/24) Notebook 7. Term Frequency–Inverse Document Frequency
5. From Word Counts to Meaning
- TH 24/10/24 (originally WE 23/10/24) Slides 8. From word counts to meaning (introducing topic modelling)
- TH 24/10/24 (originally WE 23/10/24) Notebook on topic modelling 8. From word counts to meaning (introducing topic modelling)
6. Training and Evaluation
7. Intro to NN
Intermezzo
8. Word Embeddings
- 14/11/24 Slides 12. Word2vec
- 18/11/24 Slides 13. Hands on word embeddings
- 18/11/24 Notebook 13. Hands on word embeddings
9. Doc2Vec
- 20/11/24 Slides 14. From word back to document representations (doc2vec)
- 20/11/24 Notebook 14. From word back to document representations (doc2vec)
10. Convolutions for text
11. Text is Sequential / LSTM
- 28/11/24 Slides 16. RNNs
- 28/11/24 Notebook 16. RNNs
- 02/12/24 Slides 17. BiRNNs and LSTMs
- 02/12/24 Notebook 17. BiRNNs
- 02/12/24 Notebook 17. LSTMs
- CLiC-it 2024
- Poster 1 Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription
- Poster 2 On Cross-Language Entity Label Projection and Recognition
12. Text generation
- 09/12/24 Slides 18. LSTM: characters and generation
- 09/12/24 Notebook 18. LSTM: characters
- 11/12/24 (originally 09/12/24) Notebook 19. LSTM: generation
13. Intro to Seq2Seq and Transformers
- 16/12/24 Slides 20. Into Transformers
- 16/12/24 Slides 20. Beyond; attention gif
14. A brief intro to LLMs + Closing Remarks
This section was not covered during the lessons and was left for further study.
- CLiC-it 2023 tutorial (we will pay a visit to the cool materials from D. Croce and C.D. Hromei)
FIN
Calendars
Table 1 shows the calendar for the 20 NLP lessons.
This year, NLP has two sibling lessons:
- Selected Topics in Natural Language Processing is an optional course (with credits). Further information about it is available on the UniBO website. Table 2 shows the calendar of its 8 lessons.
- Tutorato of NLP is meant to support you on the programming side of NLP. Table 3 shows the calendar of its 10 lessons.
Lesson | Date | Time | Location | Lesson | Date | Time | Location |
---|---|---|---|---|---|---|---|
1 | MO 30 Sep | 15:15 | lab 10 | 11 | WE 13 Nov | 13:30 | lab 16 |
2 | WE 2 Oct | 13:30 | lab 16 | 12 | TH 14 Nov | 13:30 | lab 16 |
3 | WE 9 Oct | 13:30 | lab 16 | 13 | MO 18 Nov | 15:15 | lab 10 |
4 | TH 10 Oct | 13:30 | lab 10 | 14 | WE 20 Nov | 13:30 | lab 16 |
5 | WE 16 Oct | 13:30 | lab 16 | 15 | WE 27 Nov | 13:30 | lab 16 |
6 | TH 17 Oct | 13:30 | lab 10 | 16 | TH 28 Nov | 13:30 | lab 10 |
7 | WE 23 Oct | 13:30 | lab 16 | 17 | MO 2 Dec | 15:15 | lab 10 |
8 | TH 24 Oct | 13:30 | lab 10 | 18 | MO 9 Dec | 15:15 | lab 10 |
9 | WE 30 Oct | 13:30 | lab 16 | 19 | WE 11 Dec | 13:30 | lab 16 |
10 | TH 31 Oct | 13:30 | lab 10 | 20 | MO 16 Dec | 15:15 | lab 10 |
Lesson | Date | Time | Location | Lesson | Date | Time | Location |
---|---|---|---|---|---|---|---|
1 | WE 9 Oct | 17:00 | lab 9 | 5 | WE 13 Nov | 17:00 | lab 9 |
2 | WE 16 Oct | 17:00 | lab 9 | 6 | WE 20 Nov(1) | 17:00 | lab 9 |
3 | WE 23 Oct | 17:00 | lab 9 | 7 | WE 27 Nov | 17:00 | lab 9 |
4 | WE 30 Oct | 17:00 | lab 9 | 8 | WE 15 Dec | 17:00 | lab 9 |
(1) ImprenDITori di se stessi. TH, room 10
Lesson | Date | Time | Location | Lesson | Date | Time | Location |
---|---|---|---|---|---|---|---|
1 | TH 3 Oct | 17:00 | lab 10 | 6 | TH 7 Nov | 17:00 | lab 10 |
2 | TH 10 Oct | 17:00 | lab 10 | 7 | TH 14 Nov | 17:00 | lab 10 |
3 | TH 17 Oct | 17:00 | lab 10 | 8 | TH 21 Nov | 17:00 | lab 10 |
4 | TH 24 Oct | 17:00 | lab 10 | 9 | TH 28 Nov | 17:00 | lab 10 |
5 | TH 31 Oct | 17:00 | lab 10 | 10 | TH 5 Dec | 17:00 | lab 10 |
Projects
The final project accounts for 80% of your final mark. Look for inspiration in the projects presented in previous years.
Some project ideas
Eventually, I will post some ideas for final projects here.
Previous final projects
2024-2025
yours will be here
2023-2024
- Cupin, E., Galiero, L., and Ciminari, D. (2023). Back to the Roots: Tracing Source Languages in Wikipedia with LABSE 🗎
2022-2023
- Mainardi, P. (2023). Identifying masculine generics in Italian 🗎
2021-2022
- Gajo, P. (2022). Hate Speech Detection in Incel Online Spaces 🗎
- Kovacs, M. (2022). Fishing for catfishes: using a model trained on Twitter data to predict author gender in Reddit posts 🗎
2020-2021
- Hopkins, D. (2022). Assessing Semantic Similarity between Original Texts and Machine Translations 🗎
- Galletti, E. (2021). Identifying Characters’ Lines in Original and Translated Plays. The case of Golden and Horan’s Class 🗎
- Yu, X. (2021). Classifying An Imbalanced Dataset with CNN, RNN, and LSTM 🗎
2019-2020
- Fernicola, F. and Zhang, S. (2020). AriEmozione: Identifying Emotions in Opera Verses (developed under CRICC; published in CLiC-it 2020) 🗎 🎦
- Muti, A. (2020). UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo (top-performing model in Evalita’s 2020 AMI shared task) 🗎 🎦