# Natural Language Processing 2025.
## LM Specialised Translation (TraTec)


### A quick overview of pre-processing

These materials are mostly borrowed from [Lane et al. (2019)](https://www.manning.com/books/natural-language-processing-in-action)

We first need to import the dependencies. In this case, only Python's built-in regular expression module, `re`.

In [None]:
import re

## Tokenisation

In [None]:
txt = "Thomas Jefferson started building Monticello at the age of 26."
# What is the difference between " " and """ """ ?
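
As a hint for the question in the comment above: both forms create ordinary `str` objects, and the triple-quoted form can additionally span several lines. A minimal sketch (the variable names are only for illustration):

```python
# Both literals produce the same kind of object: a plain str
single = "Thomas Jefferson started building Monticello."
triple = """Thomas Jefferson started building Monticello."""
print(single == triple)

# Only the triple-quoted form may contain literal line breaks
multiline = """Thomas Jefferson
started building Monticello."""
print(multiline == "Thomas Jefferson\nstarted building Monticello.")
```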

A simple "tokeniser", which captures alphabetical characters only.


In [None]:
tokens = re.findall('[A-Za-z]+', txt)
print(tokens)

Python provides a "similar" tool to tokenise: the string method `split()`.


In [None]:
tokens = txt.split()
print(tokens)

In general, this is still not enough: punctuation stays attached to the neighbouring word (e.g., `26.`).
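
A quick side-by-side check of the two tokenisers used so far makes their weaknesses concrete. This is only a sketch on the same example sentence; the names `alpha_tokens` and `space_tokens` are introduced here for illustration.

```python
import re

txt = "Thomas Jefferson started building Monticello at the age of 26."

alpha_tokens = re.findall(r'[A-Za-z]+', txt)  # alphabetical characters only
space_tokens = txt.split()                    # whitespace only

# The regular expression silently drops the number...
print("26" in alpha_tokens)
# ...whereas split() keeps it, but with the full stop glued to it
print(space_tokens[-1])
```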

**Back to the slides**

We can design a better regular expression, which splits on whitespace and punctuation.


In [None]:
tokens = re.split(r'([-\s.,;!?])+', txt)
print(tokens)

**Back to the slides**

In [None]:
text = "Monticello wasn't designated as UNESCO World Heritage Site until 1987"
tokens = re.split(r'([-\s.,;!?])+', text)
print(tokens)
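
Because the pattern contains a capturing group, `re.split` also returns the separators themselves, and an empty string whenever the text ends in a separator. A minimal sketch for filtering them out (the names `pattern`, `raw_tokens` and `clean_tokens` are illustrative and not part of the original notebook):

```python
import re

text = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."
pattern = r'([-\s.,;!?])+'

raw_tokens = re.split(pattern, text)

# Drop empty strings and tokens consisting of a single separator character
clean_tokens = [t for t in raw_tokens
                if t and not re.fullmatch(r'[-\s.,;!?]', t)]

print(raw_tokens)
print(clean_tokens)
```

Note that `wasn't` is still kept as a single token, so this approach does not handle contractions.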

**Back to the slides**

### Libraries

The community has created multiple libraries for pre-processing, which include functions to perform tokenisation and many other operations.

Two of the most popular ones are

* [NLTK](http://www.nltk.org)
* [spaCy](https://spacy.io/)

If this is the first time you use them (which is almost always the case on an ephemeral platform such as Colab), you should install them first.

You can do so with [pip](https://pip.pypa.io/en/stable/):

In [None]:
!pip install spacy

If you are working locally from the terminal, you might have to run it like this (for all dependencies):

```
$ pip install --user -U spacy
```

`--user` tells pip to install the package only for your user; `-U` tells it to upgrade the package (if it was already installed).

**Note:** typically, you install all the dependencies at the top of the notebook, as it is the first thing you should do.



And now we can import it and use one of its tokenisers.

In [None]:
# loading the library
import spacy

# downloading the model
import spacy.cli
spacy.cli.download("en_core_web_sm")

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(txt)
print([token.text for token in doc])

# Here is the equivalent process, using NLTK
# from nltk.tokenize import TreebankWordTokenizer # import one of the many tokenizers available
# tokenizer = TreebankWordTokenizer()             # invoke it
# tokens = tokenizer.tokenize(txt)
# print(tokens)

Now, see the difference between tokenising with `split()` and with spaCy's `en_core_web_sm` tokeniser on a different sentence.

In [None]:
sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."
tokens_split = sentence.split()
doc = nlp(sentence)

print("OUTPUT USING split()\t", tokens_split)
print("OUTPUT USING spacy\t", [token.text for token in  doc])

**Back to the slides**

## Normalisation

### Casefolding

In [None]:
sentence  = sentence.lower()
print(sentence)
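
`lower()` is enough for the English examples here. For other languages, Python also offers `str.casefold()`, a more aggressive variant intended for caseless matching. A small sketch with a German word (chosen here only for illustration):

```python
word = "Straße"

lowered = word.lower()      # keeps the ß
folded = word.casefold()    # expands ß into ss

print(lowered)
print(folded)

# After casefolding, both spellings match
print("STRASSE".casefold() == folded)
```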

**Back to the slides**

## Stemming

Once again, we can use a regular expression to do stemming

In [None]:
def stem(phrase):
    # Drop one final "s" from each word, unless the word ends in "ss"
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                     for word in phrase.lower().split()])

In [None]:
print("'houses' \t\t->", stem('houses'))
print("'Doctor House's admin staff calls' \t->", stem("Doctor House's admin staff calls"))
print("'stress' \t\t->", stem("stress"))

But we would need to include many more expressions to deal with all cases and exceptions.
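
To see why, here is the same toy stemmer applied to a few words it mishandles. The function is repeated so that the snippet runs on its own; the word list is just an illustrative selection.

```python
import re

def stem(phrase):
    # Same toy stemmer as above: drop one final "s", unless the word ends in "ss"
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                     for word in phrase.lower().split()])

# Words ending in "s" that are not plurals get truncated,
# "-es" plurals keep a spurious "e", and irregular plurals are left untouched
for word in ["this", "is", "buses", "children"]:
    print(word, "->", stem(word))
```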

Instead, once again we can rely on a library. Let's consider the **Porter stemmer**. NLTK has an implementation.

In [None]:
# Installing NLTK (and its dependency: numpy). Not necessary in colab!
# ! pip install --user -U nltk
# ! pip install --user -U numpy

In [None]:
from nltk.stem.porter import PorterStemmer # Import the stemmer
stemmer = PorterStemmer()                  # invoke the stemmer

# Notice:
# - this one-liner "tokenises", stems, and concatenates, all in one line!
# - these operations "appear" inverted in the code (let us have a look together)
x = ' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])
print(x.split())
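
The one-liner above can be unrolled into explicit steps, which shows the actual order of the operations. To keep this sketch free of dependencies, a hypothetical `toy_stem` function (it only drops one final "s") stands in for `stemmer.stem`; it is not the Porter stemmer, so its output differs from the cell above.

```python
def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem(): drop one final "s"
    return word[:-1] if word.endswith("s") else word

phrase = "dish washer's washed dishes"

# Step 1: "tokenise" on whitespace (written LAST in the one-liner)
words = phrase.split()

# Step 2: stem each token and strip the apostrophes
stems = []
for w in words:
    stems.append(toy_stem(w).strip("'"))

# Step 3: concatenate (written FIRST in the one-liner)
unrolled = ' '.join(stems)

one_liner = ' '.join([toy_stem(w).strip("'") for w in phrase.split()])
print(unrolled == one_liner)
```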

**Back to the slides**

## Lemmatisation

This is a more complex process, compared to stemming. Let us use a library.
In this particular case we are going to use NLTK's WordNet lemmatiser. If it is the first time you use it (or you are in an ephemeral environment!), you should download it as follows:

### The NLTK alternative

In [None]:
import nltk
## Download the Wordnet resources
# WordNet core resources
nltk.download('wordnet')
# Open Multilingual Wordnet resources
nltk.download('omw-1.4')

In [None]:
from nltk.stem import WordNetLemmatizer # importing the lemmatiser
lemmatizer = WordNetLemmatizer()        # invoking the lemmatiser

print("'better' alone \t->",lemmatizer.lemmatize("better"))
print("'better' incl. its POS (adj) \t->",lemmatizer.lemmatize("better", pos="a"))

### The spaCy alternative

In [None]:
doc = nlp("better")
print([token.lemma_ for token in doc])

**Back to the slides**

## A quick overview of representations

### Bag of Words (BoW)

First, let us see a simple construction, using a dictionary that marks each token as present (1).

In [None]:
sentence = """Thomas Jefferson began building Monticello at the age of 26. Thomas"""

sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())
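
The dictionary above only records whether a token is present, so the second `Thomas` leaves no trace. A minimal sketch of a frequency-based variant, using `collections.Counter` from the standard library (the name `sentence_counts` is introduced here for illustration):

```python
from collections import Counter

sentence = """Thomas Jefferson began building Monticello at the age of 26. Thomas"""

# Counter maps each token to the number of times it occurs
sentence_counts = Counter(sentence.split())

print(sorted(sentence_counts.items()))
print(sentence_counts["Thomas"])
```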


**Back to the slides**

Another option would be using **pandas**

In [None]:
# You might have to install it first
! pip install pandas

In [None]:
import pandas as pd

# Loading the corpus
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

# Loading the tokens into a dictionary (notice that we assume that each line is a document)
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in
         sent.split())

# Loading the dictionary contents into a pandas dataframe.
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
# SEE THE .T, which transposes the matrix for visualisation purposes.


df[df.columns[:10]]
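
One thing such binary vectors allow is a rough comparison between documents: the dot product of two rows equals the number of tokens they share. The sketch below computes the same quantity with plain Python sets, to stay free of dependencies; `bows` and `overlap` are hypothetical names introduced for this example.

```python
sentences = [
    "Thomas Jefferson began building Monticello at the age of 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]

# One binary bag of words per sentence; a set of tokens carries the same information
bows = {"sent{}".format(i): set(sent.split()) for i, sent in enumerate(sentences)}

def overlap(a, b):
    """Number of shared tokens, i.e. the dot product of the two binary BoW rows."""
    return len(bows[a] & bows[b])

print(overlap("sent0", "sent3"), bows["sent0"] & bows["sent3"])
print(overlap("sent0", "sent2"), bows["sent0"] & bows["sent2"])
print(overlap("sent0", "sent1"))
```

Notice that `Jefferson` and `Jefferson's` do not match, which is one more reason to tokenise and normalise before building the representation.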


### One-hot vectors

This is our input sentence (and its vocabulary)

In [None]:
import numpy as np
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))
print(vocab)

And now, we produce the one-hot representation

In [None]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)

# create the |tokens| x |vocabulary| matrix of zeros
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
print(token_sequence)
print(onehot_vectors)

In [None]:
for i, word in enumerate(token_sequence):
   onehot_vectors[i, vocab.index(word)] = 1  # switch on (1) the right element of the vector

print("Vocabulary:\t", vocab)
print("Sentence:\t", token_sequence)
onehot_vectors

Let us bring **pandas** into the game

In [None]:
pd.DataFrame(onehot_vectors, columns=vocab)
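
Unlike the bag of words, the one-hot matrix preserves word order, so the original token sequence can be recovered from it. A dependency-free sketch of this round trip, using nested lists instead of a NumPy array (`onehot_rows` and `decoded` are names introduced here for illustration):

```python
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# One row per token, one column per vocabulary entry
onehot_rows = [[1 if entry == token else 0 for entry in vocab]
               for token in token_sequence]

# Decoding: the position of the single 1 in each row indexes the vocabulary
decoded = [vocab[row.index(1)] for row in onehot_rows]

print(decoded == token_sequence)
```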

## Defining the preprocessing _pipeline_:

1. Tokenisation
2. Stemming
3. Stopwording

In [None]:
import nltk
import numpy as np
import pandas as pd

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer

In [None]:
# invoking the necessary objects
tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()

In [None]:
# a tiny test
print(tokenizer.tokenize("The input text."))
stemmer.stem("documents")

In [None]:
# both tokenisation and stemming
text = """Perseverance (nicknamed Percy) is a car-sized Mars
rover designed to explore the crater Jezero on Mars as part
of NASA's Mars 2020 mission."""

print([stemmer.stem(w) for w in tokenizer.tokenize(text)])

### Stopwording

In [None]:
# The first time you use the stopwords, you have to download them!
nltk.download("stopwords")

In [None]:
stop_words = stopwords.words("english")
print(stop_words[:100])
stop_words = set(stop_words)

### What is a stopword?

According to [Wikipedia](https://en.wikipedia.org/wiki/Stop_word): "the words in a stop list (or stoplist or negative dictionary) which are **filtered out** (i.e. stopped) before or after processing of natural language data (text) because they are insignificant."

For some search engines, these are **some of the most common, short function words,** such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That".

**Q: Can I create a list of stopwords on the fly?**
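
One possible answer, offered only as a sketch: take the most frequent tokens of the corpus as the stoplist. The cut-off `k` and the name `stoplist` are assumptions made for this example, not an established recipe.

```python
from collections import Counter

text = """Perseverance (nicknamed Percy) is a car-sized Mars
rover designed to explore the crater Jezero on Mars as part
of NASA's Mars 2020 mission."""

# Count the lowercased, whitespace-separated tokens
counts = Counter(text.lower().split())

# Hypothetical rule: the k most frequent tokens become the stoplist
k = 1
stoplist = {token for token, _ in counts.most_common(k)}
print(stoplist)
```

On such a tiny text the rule backfires: the most frequent token is a content word, not a function word. Frequency-based stoplists only make sense on a large corpus, and even then they deserve a manual check.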

In [None]:
# tokenisation and stemming, and stopwording
text = """Perseverance (nicknamed Percy) is a car-sized Mars
rover designed to explore the crater Jezero on Mars as part
of NASA's Mars 2020 mission."""

# the stopwords in the list are lowercase, so we compare against the lowercased token
print([stemmer.stem(w) for w in tokenizer.tokenize(text) if w.lower() not in stop_words])

## Homework

Above, when each component is first introduced, only the simplest versions of the others are used. Create a pipeline that performs the whole pre-processing...

1. Using NLTK alone
2. Using spacy alone

**End of the notebook**