# Natural Language Processing 2025.
## LM Specialised Translation

### Topic Modeling

These materials are derived from [Lane et al. (2019)](https://www.manning.com/books/natural-language-processing-in-action).

In [None]:
# This time, I will import the dependencies
# the first time they are used (which is not necessarily a good practice)

## Thought Exercise

Training a topic model with _common sense_

1. Produce a random tf-idf representation of a document

In [None]:
import numpy as np

topic = {}
# Function zip() returns an iterator of tuples, where the i-th tuple
# contains the i-th element from each of the argument sequences

# np.random.rand(i) produces an array of i random numbers

tfidf = dict(zip('cat dog apple lion NYC love'.split(),
                 np.random.rand(6)))
# Random tf-idf vector for our single document
tfidf

2. __Manually__ create common-sense weights

In [None]:
# Now, we multiply the tf-idf vector by the
# "hand-crafted" weights (notice the subtractions)
topic['petness'] = (
                  .3 * tfidf['cat']
                + .3 * tfidf['dog']
                +  0 * tfidf['apple']
                +  0 * tfidf['lion']
                - .2 * tfidf['NYC']
                + .2 * tfidf['love'])
topic['animalness'] = (
                  .1 * tfidf['cat']
                + .1 * tfidf['dog']
                - .1 * tfidf['apple']
                + .5 * tfidf['lion']
                + .1 * tfidf['NYC']
                - .1 * tfidf['love'])
topic['cityness'] = (
                   0 * tfidf['cat']
                - .1 * tfidf['dog']
                + .2 * tfidf['apple']
                - .1 * tfidf['lion']
                + .5 * tfidf['NYC']
                + .1 * tfidf['love'])
topic

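The three weighted sums in step 2 amount to a single matrix-vector product. A minimal sketch, assuming a seeded generator (`default_rng(0)`) only so the result is reproducible; the names `weights`, `tfidf_vec` and `topic_vec` are illustrative and not part of the original cells:

```python
import numpy as np

words = 'cat dog apple lion NYC love'.split()
topics = ['petness', 'animalness', 'cityness']

# Hand-crafted weights from step 2: one row per topic, one column per word (3x6)
weights = np.array([
    [.3,  .3,   0,   0, -.2,  .2],   # petness
    [.1,  .1, -.1,  .5,  .1, -.1],   # animalness
    [ 0, -.1,  .2, -.1,  .5,  .1],   # cityness
])

rng = np.random.default_rng(0)       # seeded only for reproducibility (assumption)
tfidf_vec = rng.random(len(words))   # random tf-idf values, as in step 1

# One matrix-vector product replaces the three weighted sums
topic_vec = weights @ tfidf_vec
print(dict(zip(topics, topic_vec.round(3))))
```
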
**Back to the slides**

3. Transposing the 3x6 weight matrix (topics x words) to produce **topic weights for each word**

In [None]:
word_vector = {}

word_vector['cat'] = [.3*topic['petness'],
                    .1*topic['animalness'],
                    0*topic['cityness']]

word_vector['dog'] = [.3*topic['petness'],
                    .1*topic['animalness'],
                    -.1*topic['cityness']]

word_vector['apple']= [0*topic['petness'],
                    .1*topic['animalness'],
                    .2*topic['cityness']]

word_vector['lion'] = [0*topic['petness'],
                    .5*topic['animalness'],
                    -.1*topic['cityness']]
word_vector['NYC'] = [-.2*topic['petness'],
                    .1*topic['animalness'],
                    .5*topic['cityness']]
word_vector['love'] = [.2*topic['petness'],
                    -.1*topic['animalness'],
                    .1*topic['cityness']]
word_vector

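The cell for step 3 can be reproduced with a transpose and broadcasting. A minimal sketch, assuming made-up topic scores in `topic_vec` (they stand in for the values of the `topic` dict, which are random in the notebook); `weights` and `word_vectors` are illustrative names:

```python
import numpy as np

words = 'cat dog apple lion NYC love'.split()

# Hand-crafted weights from step 2: rows are topics, columns are words (3x6)
weights = np.array([
    [.3,  .3,   0,   0, -.2,  .2],   # petness
    [.1,  .1, -.1,  .5,  .1, -.1],   # animalness
    [ 0, -.1,  .2, -.1,  .5,  .1],   # cityness
])

# Hypothetical topic scores (petness, animalness, cityness)
topic_vec = np.array([0.2, 0.1, 0.4])

# The transpose has one row per word (6x3); broadcasting multiplies each
# column by the score of its topic, as the hand-written lists do
word_vectors = weights.T * topic_vec
for word, vec in zip(words, word_vectors):
    print(word, vec.round(3))
```
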
**Back to the slides**

## Training a Linear Discriminant Analysis classifier

In [None]:
import pandas as pd

# Loading a labeled corpus: SMS spam (this used to require the nlpia package)
# from nlpia.data.loaders import get_data
# sms = get_data('sms-spam')

# We can download it directly from the nlpia repo
url = "https://raw.githubusercontent.com/totalgood/nlpia/master/src/nlpia/data/sms-spam.csv"
sms = pd.read_csv(url)
# Setting up the printing properties before printing
pd.options.display.width = 120
# All rows except the last 10 (pandas truncates the display)
print(sms[:-10])

In [None]:
# For display purposes: spam instances have a "!" added to the label
index = ['sms{}{}'.format(i, '!' * j)
         for (i, j) in zip(range(len(sms)), sms.spam)]
print(index[:20])

In [None]:
#'!'*0
#'!'*1
#'!'*4

In [None]:
# Creating a pandas df, using the data and the new index
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms['spam'] = sms.spam.astype(int)
print(sms)
# len(sms)

In [None]:
# QUESTION: what am I getting with this sum?
sms.spam.sum()

In [None]:
from nltk.tokenize.casual import casual_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorising the corpus
tfidf_model = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf_model.fit_transform(raw_documents=sms.text).toarray()
# QUESTION: what is the number on the right?
print(tfidf_docs.shape)
print(tfidf_docs)




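We have
* 4,837 messages
* 638 positive (spam) instances
* 9,232 types (distinct tokens)

That is almost twice as many features as messages, and far more than the number of spam examples: too many for a Naïve Bayes classifier to handle well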

**Homework**: try it!

## Implementing Linear Discriminant Analysis 

We just need the centroids of the spam and non-spam (ham) messages, so we can implement it ourselves

(keep in mind that sklearn has an [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html))

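For comparison, a minimal sketch of the sklearn class on toy data. The matrix `X` and labels `y` are made-up values standing in for the tf-idf matrix and the spam column. Note that sklearn's class also estimates the covariance of the features, which the centroid-only shortcut in this notebook ignores:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-in for the tf-idf matrix: 6 "documents", 2 features (made-up values)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],
              [0.1, 0.8], [0.2, 0.9], [0.1, 0.7]])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = spam, 0 = ham

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

preds = lda.predict(X)
print(preds)
print(lda.score(X, y))   # accuracy on the (toy) training data
```
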
In [None]:
# A mask (or "filter") to select only spam messages
mask = sms.spam.astype(bool).values
print(mask)

In [None]:
# Computing the spam centroid
spam_centroid = tfidf_docs[mask].mean(axis=0)
# axis=0 tells numpy to compute the mean for each column independently
print(spam_centroid.round(2))
len(spam_centroid)

In [None]:
# Computing the ham centroid
ham_centroid = tfidf_docs[~mask].mean(axis=0)
print(ham_centroid.round(2))
len(ham_centroid)

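The masking and `mean(axis=0)` steps are easier to follow on a tiny matrix. A minimal sketch with made-up values; `docs` and `is_spam` are illustrative stand-ins for `tfidf_docs` and `mask`:

```python
import numpy as np

# Toy stand-in for tfidf_docs: 4 documents x 3 terms (made-up values)
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0]])
is_spam = np.array([True, False, True, False])

# Boolean indexing keeps only the rows where the mask is True;
# mean(axis=0) then averages each column over those rows
spam_c = docs[is_spam].mean(axis=0)
ham_c = docs[~is_spam].mean(axis=0)

print(spam_c)   # one value per term
print(ham_c)
```
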
In [None]:
spam_centroid - ham_centroid

In [None]:
# Projecting every tf-idf vector onto the centroid difference: "the line between spam and ham"
spamminess_score = tfidf_docs.dot(spam_centroid - ham_centroid)
print(spamminess_score.round(2))
len(spamminess_score)

Not just subtracting. We computed the dot product!

**spamminess_score** holds one value per message $i$: $tfidf_{i} \cdot (centroid_{(spam)} - centroid_{(ham)})$

We compute it by projecting each TF-IDF vector onto the line between the centroids using the dot product (those were indeed 4,837 dot products computed at once!)

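To see that a single `.dot()` call really is one dot product per message, here is a minimal sketch on toy data (made-up values; `docs`, `is_spam` and `direction` are illustrative names). Since the centroid difference is not normalised, the scores are projections up to a constant factor:

```python
import numpy as np

# Toy stand-in for tfidf_docs: 4 documents x 3 terms (made-up values)
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0]])
is_spam = np.array([True, False, True, False])

# The "line between spam and ham": difference between the two centroids
direction = docs[is_spam].mean(axis=0) - docs[~is_spam].mean(axis=0)

# Matrix-vector product: one dot product per row, all at once
scores = docs.dot(direction)

# The same computation spelled out row by row
scores_loop = np.array([np.dot(row, direction) for row in docs])

print(scores)
```

In this toy example the two spam rows get the highest scores.
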
In [None]:
# Convert the vector spamminess_score into a one-column matrix
# (reshape returns a new array; spamminess_score itself is unchanged)
spamminess_score.reshape(-1,1)
# Because the input to the next function must have shape (n_samples, n_features)

[MinMaxScaler's fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.fit_transform) scales each feature to the range [0, 1]

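With its default settings, the scaler applies $(x - min) / (max - min)$ to each column. A minimal NumPy sketch of that formula on made-up scores (the values in `scores` are illustrative). The result lies in [0, 1] but is not a calibrated probability:

```python
import numpy as np

scores = np.array([0.75, -0.25, 0.25, -0.5])   # made-up spamminess scores

# Shape (n_samples, 1): a single feature column
col = scores.reshape(-1, 1)

# Min-max scaling per column: the smallest value maps to 0, the largest to 1
scaled = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

print(scaled.ravel())
```
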
In [None]:
from sklearn.preprocessing import MinMaxScaler

# Turning the scores into "probabilities"
sms['lda_score'] = MinMaxScaler().fit_transform(spamminess_score.reshape(-1,1))

# Turning them into predictions
sms['lda_predict'] = (sms.lda_score > .5).astype(int)

# Displaying them
sms['spam lda_predict lda_score'.split()].round(2).head(6)

In [None]:
# What is the accuracy of the model?
(1. - (sms.spam - sms.lda_predict).abs().sum() / len(sms)).round(3)

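For 0/1 labels, the absolute difference is 1 exactly when the prediction is wrong, so the expression above equals the fraction of matching labels. A minimal sketch with made-up labels (`y_true` and `y_pred` are illustrative names):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 0, 1])   # made-up gold labels
y_pred = np.array([1, 0, 0, 0, 1, 1])   # made-up predictions

# Accuracy as in the cell above: 1 minus the error rate
acc_abs = 1. - np.abs(y_true - y_pred).sum() / len(y_true)

# Equivalent formulation: the mean of the element-wise matches
acc_eq = (y_true == y_pred).mean()

print(round(acc_abs, 3), round(acc_eq, 3))
```
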
In [None]:
# Getting a confusion matrix (the nlpia way)
# from pugnlp.stats import Confusion
# Confusion(sms['spam lda_predict'.split()])

# we do it with sklearn
from sklearn.metrics import confusion_matrix

confusion_matrix(sms.spam, sms.lda_predict)

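# sklearn puts the true labels in rows and the predictions in columns,
# with the labels sorted (0 = ham, 1 = spam):
#            pred_ham   pred_spam
# ham          TN          FP
# spam         FN          TP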

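To make the layout concrete, here is a minimal sketch that builds the same kind of matrix by hand: rows are true labels, columns are predictions, and label 0 (ham) comes first. The labels are made up and the names are illustrative:

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 0, 1])   # made-up gold labels
y_pred = np.array([1, 0, 0, 0, 1, 1])   # made-up predictions

# cm[t, p] counts the instances with true label t and predicted label p
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tn, fp, fn, tp = cm.ravel()
print(cm)

# The diagonal holds the correct predictions, so accuracy is trace / total
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))
```
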
**end of the notebook**