# Natural Language Processing 2025.
## LM Specialised Translation

# Extracting the tokens from a corpus

This Jupyter notebook is intended for lesson 3 of DIT's Natural Language Processing course.

We are going to set up a toy corpus and compute some _similarities_.

The libraries to be used today are:

* [numpy](https://numpy.org/)
* [pandas](https://pandas.pydata.org/)
* [nltk](https://www.nltk.org/)

## 0. Prerequisites

Since we need *non-standard* libraries, we have to install them first if we are working in an ephemeral environment (e.g., Colab).


In [None]:
! pip3 install nltk

## 1. Importing the necessary libraries

In [None]:
import nltk
import numpy as np
import pandas as pd

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer

## 2. Setting up the _corpus_

In [None]:
tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()

sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

sentence = sentences.lower()
sentence      # how does this differ from print(sentence)?

## 3. Bag-of-words representation

Let us compute the bag-of-words (BoW) representation for our toy corpus.

In [None]:
# Loading the corpus into a dictionary
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    sentence = sent.lower()                 # Case folding
    tokens = tokenizer.tokenize(sentence)   # Tokenisation
    stems = [stemmer.stem(token) for token in tokens]   # Stemming

    # Binary BoW: each stem maps to 1 (presence), regardless of its frequency
    corpus['sent{}'.format(i)] = {tok: 1 for tok in stems}

print(corpus)

In [None]:
# Loading the data into a pandas dataframe
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

# df[df.columns[:10]]
print(df)
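
To see what the table built above contains, here is a minimal pure-Python sketch of the same idea. It assumes a made-up two-sentence corpus; `toy_corpus`, `vocabulary` and `matrix` are illustrative names, not part of the notebook. Columns are sorted alphabetically here for a stable order, and pandas may order them differently.

```python
# Toy dict-of-dicts in the same shape as `corpus` (tokens are made up for illustration)
toy_corpus = {
    "sent0": {"the": 1, "cat": 1, "sat": 1},
    "sent1": {"the": 1, "dog": 1},
}

# Vocabulary: every token seen in any sentence, sorted for a stable column order
vocabulary = sorted({tok for bag in toy_corpus.values() for tok in bag})

# One row per sentence, one column per token; tokens absent from a sentence become 0
matrix = {
    name: [bag.get(tok, 0) for tok in vocabulary]
    for name, bag in toy_corpus.items()
}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row is the binary BoW vector of one sentence, which is the role `fillna(0)` plays in the cell above: a token missing from a sentence is recorded as 0.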

## 4. Computing the dot product

"The sum of the products of the corresponding entries of two sequences of numbers".

Let us have a look at the [Wikipedia article on the dot product](https://en.wikipedia.org/wiki/Dot_product).
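
Written as a formula, the definition quoted above reads as follows. The symbols $\mathbf{a}$, $\mathbf{b}$ and $n$ are chosen here for illustration and do not come from the notebook; the worked example uses the two vectors from the code cells of this section.

```latex
\[
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i \, b_i
\]
\[
(1, 2, 3) \cdot (2, 4, 6) = 1 \cdot 2 + 2 \cdot 4 + 3 \cdot 6 = 2 + 8 + 18 = 28
\]
```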


In [None]:
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])


# The long way
sum_dot = 0

for i in range(len(v1)):
    sum_dot += v1[i] * v2[i]
    print("result at iteration {}: {}".format(i, sum_dot))
print("Result:", sum_dot)


In [None]:
# The smart way (we are "vectorising")
dot = (v1 * v2).sum()
print(dot)

In [None]:
# The numpy way
v1.dot(v2)

The dot product can be used to measure the overlap between two documents: with binary BoW vectors, it counts the tokens they share.
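
As a minimal sketch of why this works, assume two binary BoW vectors over a small shared vocabulary. The vocabulary and the values of `doc_a` and `doc_b` are hypothetical, chosen only for illustration. A product `a * b` is 1 only when both documents contain the token, so the sum counts the shared tokens.

```python
# Hypothetical binary BoW vectors over a shared vocabulary (illustrative values)
vocabulary = ["build", "jefferson", "monticello", "move", "south"]
doc_a = [1, 1, 1, 0, 0]
doc_b = [0, 1, 1, 1, 0]

# Dot product: sum of element-wise products
overlap = sum(a * b for a, b in zip(doc_a, doc_b))

# The tokens behind that number: those present in both documents
shared = [tok for tok, a, b in zip(vocabulary, doc_a, doc_b) if a and b]

print(overlap)
print(shared)
```

Note that this equivalence holds for 0/1 vectors only; with raw counts, the dot product weights frequent tokens more heavily.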

In [None]:
# We first need to transpose the matrix,
# because we want each sentence as a column vector

# What is the transpose of a matrix?

df = df.T
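
As a small illustration of what transposing does, here is a pure-Python sketch on a hypothetical 2x3 matrix (the names `rows` and `columns` and the values are made up). Rows become columns and columns become rows.

```python
# Hypothetical matrix: one row per sentence, one column per token
rows = [
    [1, 0, 1],   # sentence A
    [0, 1, 1],   # sentence B
]

# zip(*rows) pairs up the i-th entries of every row, i.e. it yields the columns
columns = [list(col) for col in zip(*rows)]

print(columns)
```

Transposing twice gives back the original matrix.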

In [None]:
# How can I print it?


In [None]:
df.sent0.dot(df.sent1)


In [None]:
df.sent0.dot(df.sent2)

In [None]:
df.sent0.dot(df.sent3)


In [None]:
# Where do these numbers come from?
print(sentences)
[(k, v) for (k, v) in (df.sent0 & df.sent3).items() if v]

## This is your first **vector space model**




# Homework 1

(You did this already. I will extend the deadline to Wednesday so that you can add stopwording.)

1. Implement the pre-processing pipeline with spaCy.
2. Apply stopwording as part of the pre-processing.

**Questions**

1. Should stopwording be done before or after case folding / stemming (or lemmatisation)?
2. Can I have a negative dot product?
3. What does a dot product of 0 mean?

**Find the homework on Virtuale**
