Natural Language Processing

Academic Year 2024/2025

(Front-page illustration produced with DeepAI's tool in October 2024, using the prompt "natural language processing class for translation and technology masters".)

Visit the UniBO website of the lecture for official and administrative details.

Prerequisites

A gentle introduction to Python

This topic won't be covered in class.

if you are a student of TraTec:
  you had the intro to Python in PBR
elif you are a student of SpecTra:
  you had the intro to Python in APS
else: 
  check the slides, notebooks, and 2021 video recordings

In any case, you can find the materials on Virtuale.

Regardless of whether you attended either of the introductions, I suggest that you do (or revisit) all the exercises ASAP.

Homework

Homework will be handled through Virtuale. No further content is expected to be shared there. On 09/10, you should have received the access password from me. If you did not, ping me. Homework has a hard deadline.

Course contents

Although the contents may be (slightly) adapted to the students' skills and interests, the general structure of the course is as follows.

1. Introduction to Natural Language Processing

  • MO 30/09/24 Slides 1. Introduction

2. Words and the vector space model

  • WE 02/10/24 Slides 2. Tokens and normalisation
  • WE 02/10/24 Notebook 2. Tokens and normalisation
  • WE 09/10/24 Slides 3. Vector Space Model
  • WE 09/10/24 Notebook on VSM 3. Vector Space Model

3. Rule-based and Naïve Bayes

4. Word vectors

  • TH 17/10/24 Slides 6. Term Frequency–Inverse Document Frequency
  • TH 17/10/24 WE 23/10/24 Notebook 7. Term Frequency–Inverse Document Frequency

5. From Word Counts to Meaning

  • WE 23/10/24 TH 24/10/24 Slides 8. From word counts to meaning (introducing topic modelling)
  • WE 23/10/24 TH 24/10/24 Notebook on topic modelling 8. From word counts to meaning (introducing topic modelling)

6. Training and Evaluation

  • 30/10/24 Slides 9. Training and evaluation
  • 30/10/24 Notebook 9. Training and evaluation

7. Intro to NN

  • 31/10/24 Slides 10. One neuron (perceptron)
  • 31/10/24 Notebook 10. One neuron (perceptron)

Intermezzo

  • 13/11/24 Slides 11. Neural networks and keras
  • 13/11/24 Notebook 11. Neural networks and keras

8. Word Embeddings

  • 14/11/24 Slides 12. Word2vec
  • 18/11/24 Slides 13. Hands on word embeddings
  • 18/11/24 Notebook 13. Hands on word embeddings

9. Doc2Vec

  • 20/11/24 Slides 14. From word back to document representations (doc2vec)
  • 20/11/24 Notebook 14. From word back to document representations (doc2vec)

10. Convolutions for text

11. Text is Sequential / LSTM

- CLiC-it 2024

  • Poster 1 Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription
  • Poster 2 On Cross-Language Entity Label Projection and Recognition

12. Text generation

  • 09/12/24 Slides 18. LSTM: characters and generation
  • 09/12/24 Notebook 18. LSTM: characters
  • 09/12/24 11/12/24 Notebook 19. LSTM: generation

13. Intro to Seq2Seq and Transformers

14. A brief intro to LLMs + Closing Remarks

This section was not covered during the lesson and was left for further study.

FIN

Calendars

Table 1 shows the calendar for the 20 NLP lessons.

This year, NLP has two sibling lessons:

  • Selected Topics in Natural Language Processing is an optional course (with credits). Further information about it is available on the UniBO website. Table 2 shows the calendar of the 8 lessons.
  • Tutorato of NLP is meant to support you on the programming side of NLP. Table 3 shows the calendar of the 10 lessons.

Lesson  Date       Time   Location    Lesson  Date       Time   Location
1       MO 30 Sep  15:15  lab 10      11      WE 13 Nov  13:30  lab 16
2       WE 2 Oct   13:30  lab 16      12      TH 14 Nov  13:30  lab 16
3       WE 9 Oct   13:30  lab 16      13      MO 18 Nov  15:15  lab 10
4       TH 10 Oct  13:30  lab 10      14      WE 20 Nov  13:30  lab 16
5       WE 16 Oct  13:30  lab 16      15      WE 27 Nov  13:30  lab 16
6       TH 17 Oct  13:30  lab 10      16      TH 28 Nov  13:30  lab 10
7       WE 23 Oct  13:30  lab 16      17      MO 2 Dec   15:15  lab 10
8       TH 24 Oct  13:30  lab 10      18      MO 9 Dec   15:15  lab 10
9       WE 30 Oct  13:30  lab 16      19      WE 11 Dec  13:30  lab 16
10      TH 31 Oct  13:30  lab 10      20      MO 16 Dec  15:15  lab 10
Table 1: Calendar of the 20 planned NLP lessons.

Lesson  Date           Time   Location    Lesson  Date           Time   Location
1       WE 9 Oct       17:00  lab 9       5       WE 13 Nov      17:00  lab 9
2       WE 16 Oct      17:00  lab 9       6       WE 20 Nov (1)  17:00  lab 9
3       WE 23 Oct      17:00  lab 9       7       WE 27 Nov      17:00  lab 9
4       WE 30 Oct      17:00  lab 9       8       WE 15 Dec      17:00  lab 9
Table 2: Calendar of the 8 planned Selected Topics in NLP lessons.

(1) ImprenDITori di se stessi. TH aula 10

Lesson  Date       Time   Location    Lesson  Date       Time   Location
1       TH 3 Oct   17:00  lab 10      6       TH 7 Nov   17:00  lab 10
2       TH 10 Oct  17:00  lab 10      7       TH 14 Nov  17:00  lab 10
3       TH 17 Oct  17:00  lab 10      8       TH 21 Nov  17:00  lab 10
4       TH 24 Oct  17:00  lab 10      9       TH 28 Nov  17:00  lab 10
5       TH 31 Oct  17:00  lab 10      10      TH 5 Dec   17:00  lab 10
Table 3: Calendar of the 10 NLP tutorato lessons.

Projects

The final project accounts for 80% of your final mark. Look for inspiration in the projects presented in previous years.

Some project ideas

I will eventually post some ideas for final projects here.

Previous final projects

2024-2025

Yours will be here.

2023-2024

  • Cupin, E., Galiero, L., and Ciminari, D. (2023). Back to the Roots: Tracing Source Languages in Wikipedia with LABSE
    🗎

2022-2023

  • Mainardi, P. (2023). Identifying masculine generics in Italian
    🗎

2021-2022

  • Gajo, P. (2022). Hate Speech Detection in Incel Online Spaces
    🗎

  • Kovacs, M. (2022). Fishing for catfishes: using a model trained on Twitter data to predict author gender in Reddit posts
    🗎

2020-2021

  • Hopkins, D. (2022). Assessing Semantic Similarity between Original Texts and Machine Translations
    🗎
  • Galletti, E. (2021). Identifying Characters’ Lines in Original and Translated Plays. The case of Golden and Horan’s Class
    🗎

  • Yu, X. (2021). Classifying An Imbalanced Dataset with CNN, RNN, and LSTM
    🗎

2019-2020

  • Fernicola, F. and Zhang, S. (2020). AriEmozione: Identifying Emotions in Opera Verses
    (developed under CRICC; published in CLiC-it 2020)
    🗎 🎦

  • Muti, A. (2020). UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo
    (top-performing model in Evalita’s 2020 AMI shared task)
    🗎 🎦