# Natural Language Processing 2025.
## LM Specialised Translation


### From bag-of-words to term frequency-inverse document frequency

These materials are derived from [Lane et al. (2019)](https://www.manning.com/books/natural-language-processing-in-action).

In [None]:
from collections import Counter
# from matplotlib import pyplot as plt
# from nlpia.loaders import clean_columns
from nltk.tokenize import TreebankWordTokenizer

import copy
import pandas as pd

In [None]:
# We are not going to use this dependency as of now
# ! pip3 install nlpia

## The difference between a binary and a _counting_ BoW

In [None]:
sentence = "The faster Harry got to the store, the faster Harry, the faster, would get home."
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())

tokens

In [None]:
# Getting the vocabulary
set(tokens)

In [None]:
# Counting the tokens in the sentence
bag_of_words = Counter(tokens)
bag_of_words

In [None]:
bag_of_words.most_common(4)

Getting the term frequency (**tf**) for a specific word in a document

In [None]:
times_harry_appears = bag_of_words['harry']
num_unique_words = len(bag_of_words)         # what is this doing?

print("Times harry appears:", times_harry_appears)
print("Size of the vocabulary:", num_unique_words)

Back to the slides

### Normalising $tf$: dividing by the total number of words

In [None]:
tf = times_harry_appears / num_unique_words
tf

**Note**: I have usually seen:

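$tf(w') = \frac{freq(w')}{\sum_{w \in W} freq(w)}$

Look at the denominator: it is the sum of the frequencies of all the words, that is, the total number of tokens in the document.

I'm sticking to the book, which divides by the size of the vocabulary (the number of unique tokens) instead. The two denominators are not the same number: the vocabulary is never larger than the token count, so this variant gives values that are at least as large. Either way, the ranking of the words within a document stays the same.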
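
To make the two denominators concrete, the sketch below computes both for the Harry sentence. It uses a plain regular-expression tokenizer, an assumption made here to keep the example self-contained, rather than the `TreebankWordTokenizer` used above. All `demo_` names are hypothetical and chosen so that they do not overwrite the notebook's variables.

```python
import re
from collections import Counter

demo_sentence = ("The faster Harry got to the store, the faster Harry, "
                 "the faster, would get home.")

# Simplified tokenizer (assumption): runs of word characters, or single
# punctuation marks, each count as one token
demo_tokens = re.findall(r"\w+|[^\w\s]", demo_sentence.lower())
demo_bow = Counter(demo_tokens)

freq_harry = demo_bow["harry"]
total_tokens = sum(demo_bow.values())    # same as len(demo_tokens)
vocabulary_size = len(demo_bow)          # number of unique tokens

tf_by_tokens = freq_harry / total_tokens     # the formula in the note
tf_by_vocab = freq_harry / vocabulary_size   # the variant used in the book

print(f"total tokens: {total_tokens}, vocabulary size: {vocabulary_size}")
print(f"tf (total tokens): {tf_by_tokens:.4f}")
print(f"tf (vocabulary):   {tf_by_vocab:.4f}")
```

The gap between the two values grows with the amount of repetition in the text, since repetition adds tokens without adding vocabulary.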

## $tf$ on a slightly longer text

In [None]:
# Wikipedia article about the Israeli-Palestinian conflict as of fall 2025

txt = """The Israeli–Palestinian conflict is an ongoing military 
and political conflict about land and self-determination within 
the territory of the former Mandatory Palestine. Key aspects of 
the conflict include the Israeli occupation of the West Bank 
and Gaza Strip, the status of Jerusalem, Israeli settlements, 
borders, security, water rights, the permit regime in the West 
Bank and in the Gaza Strip, Palestinian freedom of movement, 
and the Palestinian right of return.

The conflict has its origins in the rise of Zionism in the late 
19th century in Europe, a movement which aimed to establish a 
Jewish state through the colonization of Palestine, 
synchronously with the first arrival of Jewish settlers to 
Ottoman Palestine in 1882. The Zionist movement garnered the 
support of an imperial power in the 1917 Balfour Declaration 
issued by Britain, which promised to support the creation of 
a "Jewish homeland" in Palestine. Following British occupation 
of the formerly Ottoman region during World War I, Mandatory 
Palestine was established as a British mandate. Increasing 
Jewish immigration led to tensions between Jews and Arabs, 
which grew into intercommunal conflict. In 1936, an Arab revolt 
erupted, demanding independence and an end to British support 
for Zionism, which was suppressed by the British. Eventually, 
tensions led to the United Nations adopting a partition plan 
in 1947, triggering a civil war.

During the ensuing 1948 Palestine war, more than half of the 
mandate's predominantly Palestinian Arab population fled or 
were expelled by Israeli forces. By the end of the war, Israel 
was established on most of the former mandate's territory, and 
the Gaza Strip and the West Bank were controlled by Egypt and 
Jordan respectively. Since the 1967 Six-Day War, Israel has been 
occupying the West Bank and the Gaza Strip, known collectively 
as the Palestinian territories. Two Palestinian uprisings 
against Israel and its occupation erupted in 1987 and 2000, the 
first and second intifadas respectively. Israel's occupation 
resulted in Israel constructing illegal settlements there, 
creating a system of institutionalized discrimination against 
Palestinians under its occupation called Israeli apartheid. 
This discrimination includes Israel's denial of Palestinian 
refugees from their right of return and right to their lost 
properties. Israel has also drawn international condemnation 
for violating the human rights of the Palestinians.
"""

kk2 = """The Russo-Ukrainian War is an ongoing war primarily
involving Russia, pro-Russian forces, and Belarus on one side,
and Ukraine and its international supporters on the other.
Conflict began in February 2014 following the Revolution of
Dignity, and focused on the status of Crimea and parts of the
Donbas, internationally recognised as part of Ukraine. The
conflict includes the Russian annexation of Crimea (2014),
the war in Donbas (2014–present), naval incidents, cyberwarfare,
and political tensions. Intentionally concealing its involvement,
Russia gave military backing to separatists in the Donbas from
2014 onwards. Having built up a large military presence on the
border from late 2021, Russia launched a full-scale invasion of
Ukraine on 24 February 2022, which is ongoing.

Following the Euromaidan protests and a revolution resulting in
the removal of pro-Russian President Viktor Yanukovych on 22
February 2014, pro-Russian unrest erupted in parts of Ukraine.
Russian soldiers without insignia took control of strategic
positions and infrastructure in the Ukrainian territory of
Crimea. Unmarked Russian troops seized the Crimean Parliament
and Russia organized a widely-criticised referendum, whose
outcome was for Crimea to join Russia. It then annexed Crimea.
In April 2014, demonstrations by pro-Russian groups in the
Donbas region of Ukraine escalated into a war between the
Ukrainian military and Russian-backed separatists of the
self-declared Donetsk and Luhansk republics.

In August 2014, unmarked Russian military vehicles crossed the
border into the Donetsk republic. An undeclared war began between
Ukrainian forces and separatists intermingled with Russian troops,
although Russia denied the presence of its troops in the Donbas.
The war settled into a stalemate, with repeated failed attempts
at ceasefire. In 2015, a package of agreements called Minsk II
were signed by Russia and Ukraine, but a number of disputes
prevented them from being fully implemented. By 2019, 7% of
Ukraine's territory was classified by the Ukrainian government
as temporarily occupied territories, while the Russian government
had indirectly acknowledged the presence of its troops in Ukraine.

In 2021 and early 2022, there was a major Russian military
build-up around Ukraine's borders. NATO accused Russia of planning
an invasion, which it denied. Russian President Vladimir Putin
criticized the enlargement of NATO as a threat to his country and
demanded Ukraine be barred from ever joining the military
alliance. He also expressed Russian irredentist views, questioned
Ukraine's right to exist, and stated Ukraine was wrongfully
created by Soviet Russia. On 21 February 2022, Russia officially
recognised the two self-proclaimed separatist states in the Donbas,
and sent troops to the territories. Three days later, Russia
invaded Ukraine after Putin announced a "special military
operation". Much of the international community and organizations
such as Amnesty International have condemned Russia for its actions
in post-revolutionary Ukraine, accusing it of breaking
international law and violating Ukrainian sovereignty. Many
countries implemented economic sanctions against Russia, Russian
individuals, or companies, especially after the 2022 invasion.
"""

kk = """Coronavirus disease 2019 (COVID-19) is an infectious disease caused by
severe acute respiratory syndrome coronavirus 2 (SARS coronavirus 2,
or SARS-CoV-2), a virus closely related to the SARS virus. The disease
was discovered and named during the 2019–20 coronavirus outbreak.
Those affected may develop a fever, dry cough, fatigue, and shortness
of breath. A sore throat, runny nose or sneezing is less common. While
the majority of cases result in mild symptoms, some can progress to
pneumonia and multi-organ failure.

The infection is spread from one person to others via respiratory droplets
produced from the airways, often during coughing or sneezing. Time from exposure
to onset of symptoms is generally between 2 and 14 days, with an average of 5
days. The standard method of diagnosis is by reverse transcription polymerase
chain reaction (rRT-PCR) from a nasopharyngeal swab or sputum sample, with
results within a few hours to 2 days. Antibody assays can also be used, using a
blood serum sample, with results within a few days. The infection can also be
diagnosed from a combination of symptoms, risk factors and a chest CT scan
showing features of pneumonia.

Correct handwashing technique, maintaining distance from people who are coughing
and not touching one's face with unwashed hands are measures recommended to
prevent the disease. It is also recommended to cover one's nose and mouth with a
tissue or a bent elbow when coughing. Masks are also recommended for those who
are taking care of someone with a suspected infection but not for the general
public. There is no vaccine or specific antiviral treatment, with management
involving treatment of symptoms, supportive care and experimental measures.
The case fatality rate is estimated at between 1% and 3%.

The World Health Organization (WHO) has declared the 2019–20 coronavirus
outbreak a Public Health Emergency of International Concern (PHEIC). As of 7
March 2020, evidence of local transmission of the disease has been found in
multiple countries across all six WHO regions.
"""

In [None]:
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(txt.lower())
token_counts = Counter(tokens)

# display the most common tokens in the document
token_counts.most_common(10)

Notice: the most frequent words are (almost) all **stopwords**

In [None]:
# Run only the first time
import nltk
nltk.download('stopwords', quiet=True)

In [None]:
stopwords = set(nltk.corpus.stopwords.words('english'))
# pay attention to what is going on here (do we need to unfold it?)
tokens = [x for x in tokens if x not in stopwords]
counts = Counter(tokens)
counts.most_common(10)
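
The list comprehension in the cell above can be "unfolded" into an explicit loop. The sketch below shows both forms side by side on a tiny, made-up token list and stopword set (the `demo_` names are hypothetical, so the notebook's own `tokens` and `stopwords` stay untouched).

```python
# Made-up data, only for illustration
demo_stopwords = {"the", "of", "and", "is", "an"}
demo_tokens = ["the", "conflict", "is", "an", "ongoing", "conflict"]

# List-comprehension version, as in the cell above
filtered_short = [x for x in demo_tokens if x not in demo_stopwords]

# Equivalent "unfolded" version with an explicit loop
filtered_long = []
for x in demo_tokens:
    if x not in demo_stopwords:
        filtered_long.append(x)

print(filtered_short)
print(filtered_long == filtered_short)
```

Storing the stopwords in a `set` keeps each membership test cheap, which matters once the token list gets long.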

**Homework**:

1. Do the same exercise, but using spaCy for the pre-processing
2. Do the same exercise, but grab an article in Italian
3. Play a bit with texts _kk_ and _kk2_
4. Normalise these frequencies

back to the slides

## Vectorising

Not a **dictionary** of counts, but a **vector** of counts

In [None]:
document_vector = []
doc_length = len(tokens)
for key, value in counts.most_common():
    document_vector.append(value / doc_length)

document_vector

In [None]:
docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")
print(docs, "\n")
print("\n".join(docs))

In [None]:
# Getting the full lexicon
doc_tokens = []
for doc in docs:
    #doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]
    doc_tokens.append(sorted(tokenizer.tokenize(doc.lower())))

print(len(doc_tokens[0]))
doc_tokens

In [None]:
# concatenating all lists into one
all_doc_tokens = sum(doc_tokens, [])
print(len(all_doc_tokens))
all_doc_tokens

In [None]:
lexicon = sorted(set(all_doc_tokens))
print(len(lexicon))

# Notice that usually we call the lexicon "types" (tokens vs types)
lexicon

Our vectors must have 18 values

Creating the initial zero vectors

In [None]:
from collections import OrderedDict

# Creating a template dictionary with all zeros ("base vector")
zero_vector = OrderedDict((token, 0) for token in lexicon)
zero_vector

In [None]:
# 1. Make copies of the base vector
# 2. Update the values of the vector for each document
# 3. Store them in an array
#
# We use copy.copy to avoid reference copies and do independent copies

doc_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        vec[key] = value / len(lexicon)
    doc_vectors.append(vec)

doc_vectors
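
A small sketch of why the base vector is copied rather than simply assigned. The three-word lexicon is made up for illustration, and a shallow copy is assumed to be enough here because the values are plain numbers.

```python
import copy
from collections import OrderedDict

# Hypothetical three-word lexicon, only for illustration
base = OrderedDict((token, 0) for token in ["faster", "harry", "jill"])

alias = base                    # same object: changes show up in `base` too
independent = copy.copy(base)   # new dictionary with the same keys and values

alias["harry"] = 2
independent["jill"] = 1

print(base)
print(independent)
```

Without the copy, every document would write into the same dictionary and all the entries of `doc_vectors` would end up identical.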

Back to the slides

## Cosine similarity to compare two vectors

In [None]:
import math

def cosine_sim(vec1, vec2):
    """
    cosine_sim computes the cosine similarity between two vectors
    :param vec1: Dictionary with vector (Counter)
    :param vec2: Dictionary with vector (Counter)
    :return: cosine(vec1, vec2)
    """

    #Convert our dictionaries into lists for easier matching
    vec1 = [val for val in vec1.values()]
    vec2 = [val for val in vec2.values()]

    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v * vec2[i]

    mag_1 = math.sqrt(sum([x**2 for x in vec1]))
    mag_2 = math.sqrt(sum([x**2 for x in vec2]))
    # this parenthesis is important. Why?
    return dot_prod / (mag_1 * mag_2)
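
On the parenthesis in the last line of `cosine_sim`: division and multiplication have the same precedence in Python and are evaluated left to right. The sketch below uses made-up numbers to show what changes when the parentheses are dropped.

```python
# Made-up values, only for illustration
dot_prod, mag_1, mag_2 = 6.0, 2.0, 3.0

with_parentheses = dot_prod / (mag_1 * mag_2)    # 6 / (2 * 3)
without_parentheses = dot_prod / mag_1 * mag_2   # (6 / 2) * 3

print(with_parentheses, without_parentheses)
```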

In [None]:
print("cosine(0, 1)=", cosine_sim(doc_vectors[0], doc_vectors[1]))
print("cosine(0, 2)=", cosine_sim(doc_vectors[0], doc_vectors[2]))
print("cosine(1, 2)=", cosine_sim(doc_vectors[1], doc_vectors[2]))
print("cosine(1, 1)=", cosine_sim(doc_vectors[1], doc_vectors[1]))

**Homework**: what happens if vec1 and vec2 have different cardinalities? Be defensive and make sure to avoid that
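
One possible way to approach it, as a sketch rather than the reference solution: check that both dictionaries are defined over the same keys before computing anything, and guard against a zero-length vector. The name `safe_cosine_sim` and the choice of returning 0.0 for a zero vector are assumptions made here.

```python
import math

def safe_cosine_sim(vec1, vec2):
    """Cosine similarity for two dictionaries that share the same keys.

    `safe_cosine_sim` is a hypothetical name used only in this sketch.
    """
    if vec1.keys() != vec2.keys():
        raise ValueError("vectors must be defined over the same lexicon")

    # Pair the values by key instead of relying on insertion order
    dot_prod = sum(vec1[key] * vec2[key] for key in vec1)
    mag_1 = math.sqrt(sum(v ** 2 for v in vec1.values()))
    mag_2 = math.sqrt(sum(v ** 2 for v in vec2.values()))

    if mag_1 == 0 or mag_2 == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot_prod / (mag_1 * mag_2)

print(safe_cosine_sim({"harry": 1, "jill": 0}, {"harry": 2, "jill": 0}))
```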

back to the slides

## Zipf's Law

Let us play with the Brown corpus to see Zipf's law. The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is one of the classical corpora in English, with 1M+ words, and a good alternative for testing.

In [None]:
# Run only the first time
nltk.download('brown')

In [None]:
from nltk.corpus import brown
# tokens
brown.words()[:10]

In [None]:
# Including part of speech
brown.tagged_words()[:5]

In [None]:
# Size of the Brown corpus (number of tokens)
len(brown.words())

In [None]:
# We define a set of punctuation marks to filter them out
puncts = set((',', '.', '--', '-', '!', '?', ':', ';', '``', "''", '(', ')', '[', ']'))

word_list = (x.lower() for x in brown.words() if x not in puncts)
token_counts = Counter(word_list)
token_counts.most_common(20)

#Does this follow Zipf's distribution?
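
A quick way to eyeball the question: under Zipf's law the frequency of a token is inversely proportional to its rank, so rank times frequency should stay roughly constant. The helper below is a hypothetical sketch (`rank_times_frequency` is not part of NLTK), demonstrated on synthetic counts that follow the law exactly.

```python
from collections import Counter

def rank_times_frequency(counts, top_n=10):
    """Return (rank, token, frequency, rank * frequency) for the top_n tokens."""
    rows = []
    for rank, (token, freq) in enumerate(counts.most_common(top_n), start=1):
        rows.append((rank, token, freq, rank * freq))
    return rows

# Synthetic counts, made up for illustration:
# the token at rank r occurs 1200 / r times
ideal_counts = Counter({f"w{r}": 1200 // r for r in range(1, 7)})

for row in rank_times_frequency(ideal_counts):
    print(row)
```

With the Brown counts from the cell above, the call would be `rank_times_frequency(token_counts, 20)`. On real text the products are only approximately constant.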

Back to the slides

## Inverse Document Frequency

In [None]:
# Loading two sections of document "kite" (from NLP in Action)
# from nlpia.data.loaders import kite_text, kite_history

kite_text = """A kite is traditionally a tethered heavier-than-air craft with wing
  surfaces that react against the air to create lift and drag. A kite consists of
  wings, tethers, and anchors. Kites often have a bridle to guide the face of the
  kite at the correct angle so the wind can lift it. A kite's wing also may be so
  designed so a bridle is not needed; when kiting a sailplane for launch, the
  tether meets the wing at a single point. A kite may have fixed or moving anchors.
  Untraditionally in technical kiting, a kite consists of tether-set-coupled wing
  sets; even in technical kiting, though, a wing in the system is still often
  called the kite.

  The lift that sustains the kite in flight is generated when air flows around the
  kite's surface, producing low pressure above and high pressure below the wings.
  The interaction with the wind also generates horizontal drag along the direction
  of the wind. The resultant force vector from the lift and drag force components
  is opposed by the tension of one or more of the lines or tethers to which the
  kite is attached. The anchor point of the kite line may be static or moving
  (e.g., the towing of a kite by a running person, boat, free-falling anchors as
  in paragliders and fugitive parakites or vehicle).

  The same principles of fluid flow apply in liquids and kites are also used
  under water.

  A hybrid tethered craft comprising both a lighter-than-air balloon as well as
  a kite lifting surface is called a kytoon.

  Kites have a long and varied history and many different types are flown
  individually and at festivals worldwide. Kites may be flown for recreation,
  art or other practical uses. Sport kites can be flown in aerial ballet,
  sometimes as part of a competition. Power kites are multi-line steerable kites
  designed to generate large forces which can be used to power activities such
  as kite surfing, kite landboarding, kite fishing, kite buggying and a new
  trend snow kiting. Even Man-lifting kites have been made.
"""

kite_history = """Kites were invented in China, where materials ideal for kite
  building were readily available: silk fabric for sail material; fine,
  high-tensile-strength silk for flying line; and resilient bamboo for a strong,
  lightweight framework.

  The kite has been claimed as the invention of the 5th-century BC Chinese
  philosophers Mozi (also Mo Di) and Lu Ban (also Gongshu Ban). By 549 AD paper
  kites were certainly being flown, as it was recorded that in that year a paper
  kite was used as a message for a rescue mission. Ancient and medieval Chinese
  sources describe kites being used for measuring distances, testing the wind,
  lifting men, signaling, and communication for military operations. The
  earliest known Chinese kites were flat (not bowed) and often rectangular.
  Later, tailless kites incorporated a stabilizing bowline. Kites were decorated
  with mythological motifs and legendary figures; some were fitted with strings
  and whistles to make musical sounds while flying. From China, kites were
  introduced to Cambodia, Thailand, India, Japan, Korea and the western world.

  After its introduction into India, the kite further evolved into the fighter
  kite, known as the patang in India, where thousands are flown every year on
  festivals such as Makar Sankranti.

  Kites were known throughout Polynesia, as far as New Zealand, with the
  assumption being that the knowledge diffused from China along with the people.
  Anthropomorphic kites made from cloth and wood were used in religious
  ceremonies to send prayers to the gods. Polynesian kite traditions are used by
  anthropologists get an idea of early "primitive" Asian traditions that are
  believed to have at one time existed in Asia.
"""

kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)

kite_history = kite_history.lower()
history_tokens = tokenizer.tokenize(kite_history)

print(intro_tokens)
print(history_tokens)

In [None]:
intro_total = len(intro_tokens)
intro_total

In [None]:
history_total = len(history_tokens)
history_total

In [None]:
intro_tf = {}
history_tf = {}

intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite'] / intro_total

history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite'] / history_total

'Term Frequency of "kite" in intro is: {:.4f}'.format(intro_tf['kite'])

In [None]:
'Term Frequency of "kite" in history is: {:.4f}'.format(history_tf['kite'])

$freq(kite, intro) \approx 2 \times freq(kite, history)$

Is _kite_intro_ more about kites than _kite_history_?

In [None]:
# giving perspective by looking at the frequency of other words
intro_tf['and'] = intro_counts['and'] / intro_total
history_tf['and'] = history_counts['and'] / history_total
print('Term Frequency of "and" in intro is: {:.4f}'.format(intro_tf['and']))
print('Term Frequency of "and" in history is: {:.4f}'.format(history_tf['and']))

$freq(and, \cdot)$ is quite similar to $freq(kite, \cdot)$

So, the document is about **kites** and about **ands**?

**Homework:** Write a function to compute the _tf_ and call with both intro_tokens and history_tokens instead of writing the same code twice

(back to the slides)

In [None]:
# Number of documents containing each term ('and', 'kite', 'china')
num_docs_containing_and = 0
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

num_docs_containing_kite = 0
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1

num_docs_containing_china = 0
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

print("and:", num_docs_containing_and)
print("kite:", num_docs_containing_kite)
print("china:", num_docs_containing_china)

In [None]:
# tf(china)
intro_tf['china'] = intro_counts['china'] / intro_total
history_tf['china'] = history_counts['china'] / history_total

print(intro_tf)
print(history_tf)

In [None]:
# idf for all 3
num_docs = 2.0
idf = {}
idf['and'] = num_docs / num_docs_containing_and
idf['kite'] = num_docs / num_docs_containing_kite
idf['china'] = num_docs / num_docs_containing_china

print(idf)
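
The cell above uses the raw ratio between the number of documents and the document frequency. A common variant takes the logarithm of that ratio, so that a term occurring in every document gets a weight of 0 instead of 1. The sketch below compares the two. The document frequencies are assumed values for the two kite texts and should be replaced with the counts computed earlier. The names `n_docs`, `raw_idf` and `log_idf` are hypothetical.

```python
import math

n_docs = 2
# Assumed document frequencies for the two kite texts (illustration only)
doc_freq = {"and": 2, "kite": 2, "china": 1}

raw_idf = {term: n_docs / df for term, df in doc_freq.items()}
log_idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

for term in doc_freq:
    print(f"{term:>6}  raw: {raw_idf[term]:.2f}  log: {log_idf[term]:.4f}")
```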

In [None]:
# tf-idf for the intro

intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and'] * idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * idf['kite']
intro_tfidf['china'] = intro_tf['china'] * idf['china']

In [None]:
# tfidf for the history
history_tfidf = {}
history_tfidf['and'] = history_tf['and'] * idf['and']
history_tfidf['kite'] = history_tf['kite'] * idf['kite']
history_tfidf['china'] = history_tf['china'] * idf['china']

In [None]:
print(intro_tfidf)
print(history_tfidf)

**Homework:** Write a function to compute the tf-idf for all the documents in a (small) collection

Back to the slides

## Searching on "Harry"

In [None]:
docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")

In [None]:
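# not an extremely efficient implementation (quite verbose as well)
# Tokenise every document once, so that membership is tested on lowercased
# tokens rather than on substrings of the raw (capitalised) text
docs_token_sets = [set(tokenizer.tokenize(_doc.lower())) for _doc in docs]

document_tfidf_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)

    for key, value in token_counts.items():
        docs_containing_key = 0
        for _doc_tokens in docs_token_sets:
            if key in _doc_tokens:
                docs_containing_key += 1
        tf = value / len(lexicon)
        if docs_containing_key:
            idf = len(docs) / docs_containing_key
        else:
            idf = 0
        vec[key] = tf * idf
    document_tfidf_vectors.append(vec)

document_tfidf_vectors
# Notice what happened to Harry: it occurs in all three documents,
# so its idf is 1, the lowest possible value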

In [None]:
query = "How long does it take to get to the hairy store?"
query_vec = copy.copy(zero_vector)

tokens = tokenizer.tokenize(query.lower())
token_counts = Counter(tokens)

# Compare against the tokens of each document, not substrings of its text
docs_token_sets = [set(tokenizer.tokenize(_doc.lower())) for _doc in docs]

for key, value in token_counts.items():
    docs_containing_key = 0
    for _doc_tokens in docs_token_sets:
        if key in _doc_tokens:
            docs_containing_key += 1
    if docs_containing_key == 0:
        continue
    tf = value / len(tokens)
    idf = len(docs) / docs_containing_key
    query_vec[key] = tf * idf
query_vec

In [None]:
cosine_sim(query_vec, document_tfidf_vectors[0])

In [None]:
cosine_sim(query_vec, document_tfidf_vectors[1])

In [None]:
cosine_sim(query_vec, document_tfidf_vectors[2])

In [None]:
print(docs)
print(query)

back to the slides

## Using sklearn to build tf-idf matrices

Now, rather than implementing everything ourselves, we will use a well-known Python library to compute it for us.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Same small corpus as before
docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")
corpus = docs

In [None]:
# min_df: ignore terms that have a document frequency
# strictly lower than the given threshold (aka cut-off)
vectorizer = TfidfVectorizer(min_df=1)

# model is a sparse tf-idf matrix (mostly zeros);
# sklearn does not store the zeros, to save resources
model = vectorizer.fit_transform(corpus)

print(model)

In [None]:
# We can convert it into a dense matrix in one line!
print("\n--\n".join(corpus), "\n")
print(model.todense().round(2))

## End of the notebook