# DIT Natural Language Processing 2025

### Neural networks


## Hands on word embeddings

Pre-trained embeddings are available from many companies and organisations. Adopting them can save both time and resources.

Have a look at...
- [Gensim's documentation](https://radimrehurek.com/gensim/models/word2vec.html)
- [Google's word2vec project](https://code.google.com/archive/p/word2vec/)

There are many pre-trained word2vec models available. Consider...
- [Gensim's](https://github.com/RaRe-Technologies/gensim-data)
- [University of Oslo's](http://vectors.nlpl.eu/repository/)

The next two cells are alternative ways of doing (almost) the same thing:

1. Downloading the Google w2v embeddings using the book library
2. Using w2v embeddings you have downloaded in advance

In [None]:
! pip install gensim

In [None]:
import os.path

from gensim.models.keyedvectors import KeyedVectors
from urllib import request

In [None]:
# Downloading a word2vec pre-trained model
# (run it only once; it takes a long time)

# This is the nlpia way. It does not work
# from nlpia.data.loaders import get_data
# import pandas as pd
# word_vectors = get_data('word2vec')


PATH_TO_GOOGLENEWS_VECTORS = "https://www.dropbox.com/s/965dir4dje0hfi4/GoogleNews-vectors-negative300.bin.gz?dl=1"
OUTPUT_FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"

# Download only if the file is not there yet (the model is big!)

if not os.path.isfile(OUTPUT_FILE_NAME):
  request.urlretrieve(PATH_TO_GOOGLENEWS_VECTORS, OUTPUT_FILE_NAME)

In [None]:
from gensim.models.keyedvectors import KeyedVectors

# GOOGLE_VECTORS = "/some/reasonable/path/GoogleNews-vectors-negative300.bin.gz"
GOOGLE_VECTORS = OUTPUT_FILE_NAME
word_vectors = KeyedVectors.load_word2vec_format(GOOGLE_VECTORS,
     binary=True, limit=200000)

# limit=200000 restricts loading to the first 200k vectors in the file
# The aim is to speed up loading and save some memory
# (handy if not many resources are available)
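
To get a feel for what `limit` saves, the arithmetic can be sketched as follows. This assumes that every value is stored as a 32-bit float and ignores the memory taken by the vocabulary itself; the full GoogleNews model has about 3 million entries with 300 dimensions each. `embedding_megabytes` is a hypothetical helper, not part of Gensim.

```python
def embedding_megabytes(n_vectors, dimensions, bytes_per_value=4):
    """Approximate size of a dense embedding matrix in megabytes (10**6 bytes)."""
    return n_vectors * dimensions * bytes_per_value / 1e6


full_model = embedding_megabytes(3_000_000, 300)   # all the GoogleNews vectors
limited = embedding_megabytes(200_000, 300)        # what limit=200000 keeps

print(f"full: {full_model:.0f} MB, limited: {limited:.0f} MB")
```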

**Back to the slides**

## Retrieving the most similar vectors
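Roughly speaking, `most_similar` averages the length-normalised vectors of the `positive` words and returns the vocabulary entries with the highest cosine similarity to that average, leaving out the query words themselves. The following is a simplified pure-Python sketch of that idea, not Gensim's actual implementation; `unit`, `most_similar_sketch` and the 2-D `toy_vectors` are made up for illustration.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar_sketch(vectors, positive, topn=3):
    """Rank words by cosine similarity to the mean of the `positive` words."""
    # Average the unit-length vectors of the query words
    query_vectors = [unit(vectors[word]) for word in positive]
    mean = unit([sum(column) / len(query_vectors) for column in zip(*query_vectors)])
    scores = []
    for word, vector in vectors.items():
        if word in positive:  # the query words themselves are skipped
            continue
        cosine = sum(a * b for a, b in zip(unit(vector), mean))
        scores.append((word, cosine))
    # Highest cosine similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made 2-D vectors, for illustration only (real ones have 300 dimensions)
toy_vectors = {
    "cooking": [0.9, 0.1],
    "potatoes": [0.7, 0.3],
    "baking": [0.8, 0.2],
    "football": [0.1, 0.9],
}

print(most_similar_sketch(toy_vectors, ["cooking", "potatoes"], topn=2))
```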

In [None]:
word_vectors.most_similar(positive=['cooking', 'potatoes'], topn=5)

In [None]:
word_vectors.most_similar(positive=['cooking'], topn=5)

In [None]:
word_vectors.most_similar(positive=['bush', 'clinton'], topn=1) # not there with 200k

In [None]:
word_vectors.most_similar(positive=['Bush', 'Clinton'], topn=1)

In [None]:
word_vectors.most_similar(positive=['Bush', 'president'], topn=1)

In [None]:
word_vectors.most_similar(positive=['Biden', 'president'], topn=1)

In [None]:
word_vectors.most_similar(positive=['Meloni', 'president'], topn=1)

In [None]:
word_vectors.most_similar(positive=['bologna', 'pasta'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Bologna', 'pasta'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Kentucky', 'chicken'], topn=3)

Why Tennessee? Headquarters: https://en.wikipedia.org/wiki/History_of_KFC

(Sanders: "This ain't no goddam Tennessee Fried Chicken, no matter what some slick, silk-suited son-of-a-bitch says")

Why West Virginia? Maybe [location](https://maps.app.goo.gl/E2nxYSqL15UmL1zo7)

In [None]:
word_vectors.most_similar(positive=['Atlanta', 'baseball'], topn=3)

In [None]:
# Something else? Replace the None placeholders with words of your choice
word_vectors.most_similar(positive=[None, None] , topn=3)

## Retrieving the most similar vectors, after subtraction
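When a `negative` list is given, the vectors of those words are subtracted rather than added before the ranking by cosine similarity takes place. The following is a minimal pure-Python sketch of that arithmetic, not Gensim's actual code; `unit`, `analogy_sketch` and the hand-made 3-D `toy_vectors` (whose axes loosely stand for "American", "Italian" and "city") are assumptions for illustration.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def analogy_sketch(vectors, positive, negative, topn=1):
    """Add the `positive` vectors, subtract the `negative` ones, rank by cosine."""
    dimensions = len(next(iter(vectors.values())))
    target = [0.0] * dimensions
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    excluded = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(a * b for a, b in zip(unit(vector), target)))
        for word, vector in vectors.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made vectors: [American, Italian, city]
toy_vectors = {
    "America": [1.0, 0.0, 0.0],
    "Italy": [0.0, 1.0, 0.0],
    "New_York": [1.0, 0.0, 1.0],
    "Chicago": [0.9, 0.0, 1.1],
    "Rome": [0.0, 1.0, 1.0],
}

print(analogy_sketch(toy_vectors, ["New_York", "Italy"], ["America"], topn=1))
```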

In [None]:
# Let us load a bigger model (if you went for the second alternative)
# binary=True indicates that the file is in the binary word2vec format
word_vectors = KeyedVectors.load_word2vec_format(GOOGLE_VECTORS,
    binary=True, limit=400000)

In [None]:
word_vectors.most_similar(positive=['New_York', 'Italy'], negative=['America'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Chicago', 'Italy'], negative=['America'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Washington', 'Italy'], negative=['America'], topn=3)

In [None]:
# Not Germany with 200k
word_vectors.most_similar(positive=['Germany', 'France'], negative=['Europe'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Spain', 'America'], negative=['Europe'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Spain', 'America'], negative=['Europe', 'language'], topn=3)

**Back to the slides**

## Finding the "outlier" (or indeed the least similar word)
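One way to picture `doesnt_match`: compute the mean of the length-normalised vectors of all the given words and return the word that is least similar (by cosine) to that mean. The following is a pure-Python sketch of this idea under that assumption; `unit`, `doesnt_match_sketch` and the 2-D `toy_vectors` are made up for illustration and are not Gensim's code.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def doesnt_match_sketch(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    normalised = {word: unit(vectors[word]) for word in words}
    mean = unit([sum(column) / len(words) for column in zip(*normalised.values())])
    # The outlier is the word with the lowest cosine similarity to the mean
    return min(words, key=lambda word: sum(a * b for a, b in zip(normalised[word], mean)))


# Hand-made 2-D vectors, for illustration only
toy_vectors = {
    "potatoes": [0.9, 0.1],
    "milk": [0.8, 0.3],
    "cake": [0.85, 0.2],
    "computer": [0.1, 0.95],
}

print(doesnt_match_sketch(toy_vectors, "potatoes milk cake computer".split()))
```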

In [None]:
word_vectors.doesnt_match("potatoes milk cake computer".split())

In [None]:
word_vectors.doesnt_match("Spanish Italian French".split())

In [None]:
word_vectors.doesnt_match("beer wine spritz water".split())

In [None]:
word_vectors.doesnt_match("linguistics semantics pragmatics speech".split())

In [None]:
word_vectors.doesnt_match("dog cat snake fish".split())

In [None]:
word_vectors.doesnt_match("cow sheep goat camel".split())

In [None]:
# US sport
word_vectors.doesnt_match("Bears Eagles Giants Titans Braves".split())

In [None]:
# Italian sport
word_vectors.doesnt_match("Atalanta Fiorentina Juventus Inter Udinese".split())

In [None]:
# Italian sport
word_vectors.doesnt_match("Atalanta Fiorentina Juventus Inter Udinese Ajax".split())

In [None]:
word_vectors.doesnt_match("fries pizza taco sushi".split())

**Back to the slides**

## Adding and subtracting

In [None]:
word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)

In [None]:
word_vectors.most_similar(positive=['pizza', 'mozzarella'], negative=['pineapple'], topn=3)

In [None]:
word_vectors.most_similar(positive=['Italy', 'mafia'], negative=['New_York'], topn=3) # italy not present

In [None]:
word_vectors.most_similar(positive=['black'], topn=10)

In [None]:
# Some other interesting example? Replace None with lists of words of your choice
word_vectors.most_similar(positive=None, negative=None, topn=2)

**Back to the slides**

## Similarity between two words
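The score returned by `similarity` is the cosine similarity of the two word vectors: their dot product divided by the product of their lengths. It ranges from -1 (opposite directions) to 1 (same direction). The following is a small worked sketch with made-up 2-D vectors; `cosine_similarity` is a hypothetical helper written for this notebook.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```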

In [None]:
word_vectors.similarity('princess', 'queen')

In [None]:
word_vectors.similarity('prince', 'king')

In [None]:
word_vectors.similarity('prince', 'frog')

In [None]:
word_vectors.similarity('god', 'monster')

In [None]:
word_vectors.similarity('gaze', 'watch')

In [None]:
word_vectors.similarity('frog', 'toad')

In [None]:
word_vectors.similarity('headache', 'flu')

In [None]:
word_vectors.similarity('Aztec', 'Mayan')

In [None]:
word_vectors.similarity('Rome', 'Athens')

In [None]:
word_vectors.similarity('automobile', 'car')

In [None]:
word_vectors.similarity('rail', 'train')

In [None]:
word_vectors.similarity('ragu', 'pesto')

In [None]:
word_vectors.similarity('Toscana', 'Lombardia') # just the translation

In [None]:
word_vectors.similarity('Toscana', 'Lazio')

In [None]:
word_vectors.similarity('pizza', 'taco')

In [None]:
word_vectors.similarity('piadina', 'taco')

In [None]:
# Some other interesting example? Replace None with two words of your choice
word_vectors.similarity(None, None)

**Back to the slides**

## Accessing the actual vectors

In [None]:
word_vectors['phone']

In [None]:
len(word_vectors['phone'])

**Back to the slides**

# Training a word2vec model
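Two of the hyperparameters used in this section, the minimum word count and the window size, can be pictured with a toy corpus: rare words are dropped first, and every remaining word is then paired with its neighbours inside the window. The following is a simplified sketch with a fixed window (real implementations may vary the window during training); `prune_rare_words`, `context_pairs` and `toy_corpus` are hypothetical names, not Gensim functions.

```python
from collections import Counter


def prune_rare_words(sentences, min_count):
    """Drop every token that occurs fewer than `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]


def context_pairs(sentence, window):
    """List (target, context) pairs with at most `window` tokens on each side."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        pairs.extend((target, sentence[j]) for j in range(start, end) if j != i)
    return pairs


toy_corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
pruned = prune_rare_words(toy_corpus, min_count=2)

print(pruned)
print(context_pairs(pruned[0], window=1))
```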

In [None]:
# Setup
from gensim.models.word2vec import Word2Vec
import nltk
from nltk.corpus import brown

num_features = 300   # Dimensionality of the embedding space
min_word_count = 3   # Words appearing fewer times will be discarded (depends on the size of the corpus)
num_workers = 2      # Number of CPU cores to be used (depends on hardware)
window_size = 6      # Size of the context
subsampling = 1e-3   # Threshold for configuring which higher-frequency words are randomly downsampled
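
The subsampling threshold can be read through the formula given in the original word2vec paper, where one occurrence of a word with relative frequency f is discarded with probability 1 - sqrt(t / f), t being the threshold. Implementations may use a slightly different variant, so take the following only as a sketch of the idea; `discard_probability` is a hypothetical helper.

```python
import math


def discard_probability(relative_frequency, threshold=1e-3):
    """Chance of dropping one occurrence of a word, following the paper's formula."""
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))


frequent = discard_probability(0.05)    # a word making up 5% of all tokens
rare = discard_probability(0.0005)      # a word below the threshold is never dropped

print(round(frequent, 3), rare)
```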

In [None]:
# Loading some data
nltk.download('brown')
sentence_list = brown.sents()
len(sentence_list)

In [None]:
sentence_list

In [None]:
# Model initialisation
# I ran this earlier. I won't do it now, as it takes some time
model = Word2Vec(
    sentence_list,
    workers=num_workers,
    vector_size=num_features,   # Notice that this parameter was called size before Gensim 4
    #min_count=min_word_count,  # Left commented out, so Gensim's default min_count (5) applies
    window=window_size,
    sample=subsampling)

In [None]:
# Discarding the unneeded output weights and freezing the rest
# This is not necessary since gensim 4: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
# model.init_sims(replace=True)

In [None]:
# Saving the model
model_name = "my_domain_specific_word2vec_model"
model.save(model_name)

In [None]:
# Loading a model

model = Word2Vec.load(model_name)
model.wv.most_similar('brown')
# Notice that the older model.most_similar('brown') is deprecated; always go through model.wv

**Back to the slides**


## fastText
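The `.vec` file used in this section is plain text in the word2vec format: a header line with the number of vectors and their dimensionality, followed by one line per word with the word and its values separated by spaces. The following is a toy reader for that layout, only to show what the file looks like inside; `read_vec` and the three-word `toy_file` are made up, and the actual loading is better left to Gensim.

```python
import io


def read_vec(lines, limit=None):
    """Parse word2vec text format: a 'count dimensions' header, then one word per line."""
    lines = iter(lines)
    count, dimensions = (int(value) for value in next(lines).split())
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break  # same spirit as the `limit` argument: keep only the first entries
        word, *values = line.rstrip().split(" ")
        if len(values) != dimensions:
            raise ValueError(f"expected {dimensions} values for {word!r}")
        vectors[word] = [float(value) for value in values]
    return vectors


# A made-up file with 3 words and 2 dimensions
toy_file = io.StringIO("3 2\nfootball 0.1 0.9\ncalcio 0.2 0.8\npizza 0.9 0.1\n")

print(read_vec(toy_file, limit=2))
```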

In [None]:
# Downloading the model
URL_TO_FASTTEXT = "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec"
filename_fasttext = "wiki.en.vec"

# Download the file only if it does not exist
if not os.path.isfile(filename_fasttext):
  request.urlretrieve(URL_TO_FASTTEXT, filename_fasttext)

In [None]:
# from gensim.models import Word2Vec
print(filename_fasttext)
# binary=False because the .vec file is in the plain-text word2vec format
ft_model = KeyedVectors.load_word2vec_format(filename_fasttext, binary=False)
# ft_model = Word2Vec.load(filename_fasttext)

In [None]:
ft_model.most_similar('calcio')

In [None]:
ft_model.most_similar('football')

**End of the notebook**