# Natural Language Processing 2025.
## LM Specialised Translation


### A quick overview on preprocessing

These materials are derived from [Lane et al. (2019)](https://www.manning.com/books/natural-language-processing-in-action).

# Introducing the Naive Bayes Classifier

Now we will use annotated data to "learn" a sentiment classifier.

In [None]:
# We do not install the nlpia dependency as the book does, because it is
# only needed to download the data, which we do directly
# ! pip3 install nlpia
# ! pip3 install nltk

In [None]:
# Loading the dependencies
import os        # used to produce the path to the data
import pandas as pd

from collections import Counter

# See comment in the next cell about this commented line
# from nlpia.data.loaders import get_data

# The casual tokenizer can handle emoticons, unusual punctuation and slang better than the TreeBank tokenizer
from nltk.tokenize import casual_tokenize

## Setting up the _corpus_

We load the movie review snippets from the Hutto corpus (`hutto_movies` in the book).


In [None]:
# Many of the book's datasets are here:
# https://github.com/totalgood/nlpia/tree/master/src/nlpia/

DATA_PATH = "https://raw.githubusercontent.com/totalgood/nlpia/master/src/nlpia/data/"
PATH_TO_HUTTO_MOVIES = os.path.join(DATA_PATH,
                                    "hutto_ICWSM_2014/movieReviewSnippets_GroundTruth.csv.gz")

# movies = get_data('hutto_movies')
movies = pd.read_csv(PATH_TO_HUTTO_MOVIES, nrows=None)

# Looking at some of the first instances
movies.head().round(2)

### Getting a description of the data (look at the range)

In [None]:
movies.describe().round(2)

In [None]:
# Helps display wide DataFrames in the console (so they look "prettier")
pd.set_option('display.width', 75)
movies.sentiment

### Loading the data into a BoW DataFrame through a list of dictionaries

In [None]:
bags_of_words = []

for text in movies.text:
    bags_of_words.append(Counter(casual_tokenize(text)))

df_bows = pd.DataFrame.from_records(bags_of_words)

# from_records() is a DataFrame constructor.
# INPUT: a sequence (list) of dictionaries
# OUTPUT: a DF with columns for all the keys and associated values.
# (Missing values become NaN!)
print(df_bows)

In [None]:
# So we fill them with 0:
df_bows = df_bows.fillna(0).astype(int)
print(df_bows)
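
To see what `from_records()` followed by `fillna(0)` amounts to, here is a rough sketch in plain Python on two made-up sentences. The sentences, the whitespace tokenisation (standing in for `casual_tokenize`) and the names `toy_texts`, `toy_bags`, `vocabulary` and `toy_rows` are assumptions for illustration only; they are not part of the book's code.

```python
from collections import Counter

# Two made-up "reviews"; str.split() stands in for casual_tokenize()
toy_texts = ["good movie good plot", "bad movie"]
toy_bags = [Counter(text.split()) for text in toy_texts]

# The columns are the union of all tokens (kept here in order of first appearance)
vocabulary = []
for bag in toy_bags:
    for token in bag:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per text; a token that is missing from a text gets 0
# (this is the role that fillna(0) plays for the DataFrame)
toy_rows = [[bag.get(token, 0) for token in vocabulary] for bag in toy_bags]

print(vocabulary)
for row in toy_rows:
    print(row)
```

Every new token adds a column to all rows, which is why the table grows quickly and is mostly zeros.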

### Let us look at the shape

**Spoiler**: A BoW can explode in size, even more so when no normalisation is applied at all.


In [None]:
df_bows.shape

Now, let us see the first instances (the DataFrame is quite sparse).

In [None]:
df_bows.head()


**Homework**: Integrate the normalisation pipeline (lowercasing, stopword removal, and stemming or lemmatisation) and see how the DataFrame is affected.

In [None]:
# write your code here
None

In [None]:
print(df_bows.head()[list(bags_of_words[0].keys())])
print(df_bows.head()[list(bags_of_words[1].keys())])

### Build the Naive Bayes classifier

The dataset is now ready. Let us build a Multinomial NB classifier.

Multinomial NB is suitable for discrete features (e.g., word counts for text classification).
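
To make this concrete, the following is a minimal sketch of what a multinomial Naive Bayes model computes from word counts, written in plain Python with add-one (Laplace) smoothing. The toy documents, the labels and the function names (`train_multinomial_nb`, `predict_proba_positive`) are hypothetical; scikit-learn's `MultinomialNB` is what this notebook actually uses, and its implementation differs in the details.

```python
import math
from collections import Counter

def train_multinomial_nb(bags, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    vocabulary = sorted({token for bag in bags for token in bag})
    model = {"vocabulary": vocabulary, "log_prior": {}, "log_likelihood": {}}
    for cls in sorted(set(labels)):
        class_bags = [bag for bag, label in zip(bags, labels) if label == cls]
        model["log_prior"][cls] = math.log(len(class_bags) / len(bags))
        # Total count of every token over all documents of this class
        totals = Counter()
        for bag in class_bags:
            totals.update(bag)
        denominator = sum(totals.values()) + alpha * len(vocabulary)
        model["log_likelihood"][cls] = {
            token: math.log((totals[token] + alpha) / denominator)
            for token in vocabulary
        }
    return model

def predict_proba_positive(model, bag):
    """Return P(class=True | bag) by normalising the per-class scores."""
    scores = {}
    for cls, log_prior in model["log_prior"].items():
        score = log_prior
        for token, count in bag.items():
            if token in model["log_likelihood"][cls]:  # ignore unseen tokens
                score += count * model["log_likelihood"][cls][token]
        scores[cls] = score
    # Normalise the log scores (subtracting the max for numerical stability)
    top = max(scores.values())
    exp_scores = {cls: math.exp(s - top) for cls, s in scores.items()}
    return exp_scores[True] / sum(exp_scores.values())

# Made-up training data: bags of words with Boolean sentiment labels
toy_bags = [Counter("great fun great".split()), Counter("boring plot".split()),
            Counter("fun plot".split()), Counter("boring boring".split())]
toy_labels = [True, False, True, False]

toy_model = train_multinomial_nb(toy_bags, toy_labels)
p_great = predict_proba_positive(toy_model, Counter(["great", "fun"]))
p_boring = predict_proba_positive(toy_model, Counter(["boring"]))
print(round(p_great, 3), round(p_boring, 3))
```

The counts enter the score as multipliers of the per-word log probabilities, which is why this variant fits count features such as a BoW.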

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
# "Binarising" the classes
movies.sentiment > 0

Now we can train ("fit") our model

In [None]:
# We are converting the class from float to Boolean,
# as this classifier only supports discrete labels
nb = nb.fit(df_bows, movies.sentiment > 0)

### We have a model and we can predict!

In [None]:
# predict_proba() returns continuous-valued predictions (probabilities).
# We multiply by 8 and subtract 4 to map them from [0, 1] to the range [-4, 4]

#print(predictions[:10])
# TODO there seems to be an error in the book code.
# predict_proba() returns the scores for all the classes (2), whereas we aim at
# assigning only the one for the positive class.
# I had to do the following trick instead of the original
# movies['predicted_sentiment'] = nb.predict(df_bows) * 8 - 4
predictions = nb.predict_proba(df_bows) * 8 - 4
# Column 1 holds the score of the positive (True) class
movies['predicted_sentiment'] = predictions[:, 1]

movies
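
The rescaling is a simple linear map: a probability p in [0, 1] becomes 8p - 4 in [-4, 4], so p = 0.5 lands exactly on the neutral score 0. A small sketch with made-up probabilities (the function name `rescale_probability` is hypothetical):

```python
def rescale_probability(p):
    """Map a probability in [0, 1] linearly onto the range [-4, 4]."""
    return p * 8 - 4

# A few illustrative probabilities and their rescaled scores
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, "->", rescale_probability(p))
```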

Now, we compute the [Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE), "a measure of difference between two continuous variables".

In [None]:
movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()
# This is the mean absolute error (MAE)
round(movies.error.mean(), 2)

In [None]:
# abs(n)

# abs(5) -> 5
# abs(-34) -> 34
# abs(0) -> 0
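
As a quick worked example of the MAE, take three made-up gold scores and predictions (all values and names here are invented for illustration). The absolute errors are averaged, so over- and under-estimates do not cancel each other out.

```python
# Made-up gold scores and predictions in the range [-4, 4]
gold = [2.0, -1.5, 0.5]
predicted = [3.0, -3.5, 0.5]

# Absolute difference per instance, then the mean over all instances
absolute_errors = [abs(p - g) for p, g in zip(predicted, gold)]
mae = sum(absolute_errors) / len(absolute_errors)

print(absolute_errors)
print(mae)
```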

Now, let us see some gold and predicted sentiments, together with the binary classification

In [None]:
# Gold standard is positive
movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)

# Prediction is positive
movies['predicted_ispositive'] = (movies.predicted_sentiment > 0).astype(int)

# Let us have an overview of gold standard vs prediction
movies[['sentiment', 'predicted_sentiment', 'sentiment_ispositive', 'predicted_ispositive']].head(8)

In [None]:
# And this is the proportion of instances whose "thumbs up" (positive) label
# is predicted correctly, i.e. the accuracy (multiply by 100 for a percentage)
(movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies)


## Not bad at all!