Natural Language Processing | Alberto Barrón-Cedeño

Natural Language Processing

Academic Year 2024/2025

(Frontpage illustration produced with deepai’s tool in October 2024, using the prompt "natural language processing class for translation and technology masters".)

Visit the UniBO website of the course for official and administrative details.

Prerequisites

A gentle introduction to Python

This topic won't be covered in class.

if you are a student of TraTec:
  you had the intro to Python in PBR
elif you are a student of SpecTra:
  you had the intro to Python in APS
else: 
  check the slides, notebooks, and 2021 video recordings

Regardless, you can find the materials on virtuale.

Whether or not you attended either of the introductions, I suggest that you do (or revisit) all the exercises ASAP.

Homework

Homework is handled through virtuale. No further contents are expected to be shared there. You should have received the access password from me on 09/10; if you did not, ping me. Each homework has a hard deadline.

Course contents

Although the contents may be (slightly) adapted to the students' skills and interests, the general structure of the course is as follows.

1. Introduction to Natural Language Processing

MO 30/09/24 Slides 1. Introduction

2. Words and the vector space model

WE 02/10/24 Slides 2. Tokens and normalisation
WE 02/10/24 Notebook 2. Tokens and normalisation
WE 09/10/24 Slides 3. Vector Space Model
WE 09/10/24 Notebook on VSM 3. Vector Space Model

3. Rule-based and Naïve Bayes

TH 10/10/24 Slides 4. Rule-based sentiment analysis and Naïve Bayes
TH 10/10/24 Notebook on RB sentiment 4. Rule-based sentiment analysis
WE 16/10/24 Notebook on Naïve Bayes 5. Naïve Bayes

4. Word vectors

TH 17/10/24 Slides 6. Term Frequency–Inverse Document Frequency
~~TH 17/10/24~~ WE 23/10/24 Notebook 7. Term Frequency–Inverse Document Frequency

5. From Word Counts to Meaning

~~WE 23/10/24~~ TH 24/10/24 Slides 8. From word counts to meaning (introducing topic modelling)
~~WE 23/10/24~~ TH 24/10/24 Notebook on topic modelling 8. From word counts to meaning (introducing topic modelling)

6. Training and Evaluation

30/10/24 Slides 9. Training and evaluation
30/10/24 Notebook 9. Training and evaluation

7. Intro to NN

31/10/24 Slides 10. One neuron (perceptron)
31/10/24 Notebook 10. One neuron (perceptron)

Intermezzo

13/11/24 Slides 11. Neural networks and keras
13/11/24 Notebook 11. Neural networks and keras

8. Word Embeddings

14/11/24 Slides 12. Word2vec
18/11/24 Slides 13. Hands on word embeddings
18/11/24 Notebook 13. Hands on word embeddings

9. Doc2Vec

20/11/24 Slides 14. From word back to document representations (doc2vec)
20/11/24 Notebook 14. From word back to document representations (doc2vec)

10. Convolutions for text

27/11/24 Slides 15. CNNs
27/11/24 Notebook 15. CNNs

11. Text is Sequential / LSTM

28/11/24 Slides 16. RNNs
28/11/24 Notebook 16. RNNs
02/12/24 Slides 17. BiRNNs and LSTMs
02/12/24 Notebook 17. BiRNNs
02/12/24 Notebook 17. LSTMs

- CLiC-it 2024

Poster 1 Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription
Poster 2 On Cross-Language Entity Label Projection and Recognition

12. Text generation

09/12/24 Slides 18. LSTM: characters and generation
09/12/24 Notebook 18. LSTM: characters
~~09/12/24~~ 11/12/24 Notebook 19. LSTM: generation

13. Intro to Seq2Seq and Transformers

16/12/24 Slides 20. Into Transformers
16/12/24 Slides 20. Beyond; attention gif

14. A brief intro to LLMs + Closing Remarks

This section was not covered during the lesson and was left for further study.

CLiC-it 2023 tutorial (we will pay a visit to the cool materials from D. Croce and C.D. Hromei)

FIN

Calendars

Table 1 shows the calendar for the 20 NLP lessons.

This year, NLP has two sibling courses:

Selected Topics in Natural Language Processing is an optional course (with credits). Further information about it is available on the UniBO website. Table 2 shows the calendar of its 8 lessons.
Tutorato of NLP is meant to support you with the programming side of NLP. Table 3 shows the calendar of its 10 lessons.

Table 1: Calendar of all 20 planned NLP lessons.
Lesson	Date	Time	Location	Lesson	Date	Time	Location
1	MO 30 Sep	15:15	lab 10	11	WE 13 Nov	13:30	lab 16
2	WE 2 Oct	13:30	lab 16	12	TH 14 Nov	13:30	lab 16
3	WE 9 Oct	13:30	lab 16	13	MO 18 Nov	15:15	lab 10
4	TH 10 Oct	13:30	lab 10	14	WE 20 Nov	13:30	lab 16
5	WE 16 Oct	13:30	lab 16	15	WE 27 Nov	13:30	lab 16
6	TH 17 Oct	13:30	lab 10	16	TH 28 Nov	13:30	lab 10
7	WE 23 Oct	13:30	lab 16	17	MO 2 Dec	15:15	lab 10
8	TH 24 Oct	13:30	lab 10	18	MO 9 Dec	15:15	lab 10
9	WE 30 Oct	13:30	lab 16	19	WE 11 Dec	13:30	lab 16
10	TH 31 Oct	13:30	lab 10	20	MO 16 Dec	15:15	lab 10

Table 2: Calendar of all 8 planned Selected Topics in NLP lessons.
Lesson	Date	Time	Location	Lesson	Date	Time	Location
1	WE 9 Oct	17:00	lab 9	5	WE 13 Nov	17:00	lab 9
2	WE 16 Oct	17:00	lab 9	6	WE 20 Nov(1)	17:00	lab 9
3	WE 23 Oct	17:00	lab 9	7	WE 27 Nov	17:00	lab 9
4	WE 30 Oct	17:00	lab 9	8	WE 15 Dec	17:00	lab 9

(1) ImprenDITori di se stessi. TH, room 10

Table 3: Calendar of all 10 planned NLP *tutorato* lessons.
Lesson	Date	Time	Location	Lesson	Date	Time	Location
1	TH 3 Oct	17:00	lab 10	6	TH 7 Nov	17:00	lab 10
2	TH 10 Oct	17:00	lab 10	7	TH 14 Nov	17:00	lab 10
3	TH 17 Oct	17:00	lab 10	8	TH 21 Nov	17:00	lab 10
4	TH 24 Oct	17:00	lab 10	9	TH 28 Nov	17:00	lab 10
5	TH 31 Oct	17:00	lab 10	10	TH 5 Dec	17:00	lab 10

Projects

The final project accounts for 80% of your final mark. Look for inspiration in the projects presented in previous years.

Some project ideas

I will eventually post some ideas for final projects here.

Previous final projects

2024-2025

yours will be here

2023-2024

Cupin, E., Galiero, L., and Ciminari, D. (2023). Back to the Roots: Tracing Source Languages in Wikipedia with LABSE
🗎

2022-2023

Mainardi, P. (2023). Identifying masculine generics in Italian
🗎

2021-2022

Gajo, P. (2022). Hate Speech Detection in Incel Online Spaces
🗎
Kovacs, M. (2022). Fishing for catfishes: using a model trained on Twitter data to predict author gender in Reddit posts
🗎

2020-2021

Hopkins, D. (2022). Assessing Semantic Similarity between Original Texts and Machine Translations
🗎

Galletti, E. (2021). Identifying Characters’ Lines in Original and Translated Plays. The case of Golden and Horan’s Class
🗎
Yu, X. (2021). Classifying An Imbalanced Dataset with CNN, RNN, and LSTM
🗎

2019-2020

Fernicola, F. and Zhang, S. (2020). AriEmozione: Identifying Emotions in Opera Verses
(developed under CRICC; published in CLiC-it 2020)
🗎 🎦
Muti, A. (2020). UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo
(top-performing model in Evalita’s 2020 AMI shared task)
🗎 🎦

Last updated on Dec 11, 2024
