In [None]:
# This is to have markdown tables left aligned
# It can be ignored
from IPython.display import HTML
table_css = 'table {align:left;display:block} '
HTML('<style>{}</style>'.format(table_css))

# DIT -- Computing Learning

This notebook is part of the materials for three 2023/2024 lessons:


| Code | Lesson |
| :--- | :----------- |
| B2696 | Computational Thinking |
| 99797 | Advanced Professional Skills |
| B3520 | Profession-based Research |

It has also been used for the Python seminars offered to PhD students in 2022, 2023, and 2024.

# Python for Poets 1

This Jupyter Notebook is derived from Kenneth W. Church's [Unix for Poets](https://www.cs.upc.edu/~padro/Unixforpoets.pdf). From that chapter itself:

- "many researchers have more data than they know what to do with"
- "Many researchers believe that they don’t have sufficient computing resources to do these things for themselves."
- "This chapter will describe a set of simple Unix-based \[**Python in our case**\] tools that should
be more than adequate for counting trigrams on a corpus the size of the Brown Corpus"
- "this chapter will focus on examples and avoid definitions whenever possible"

The code was developed using Python 3.6. It was written in PyCharm and tested on Colab. All snippets can be run on any machine with Python 3.6 (or higher) installed, or online as a Jupyter notebook.

Note: many of these exercises have simpler solutions as Unix/GNU Linux command-line one-liners!

## 1. Exercise 1: Count words in a text

From Church: "The problem is to input a text file, say Genesis (a good place to start), and output a list of words in the file along with their frequency counts. The algorithm consists of three steps:"

1. Open the file
2. Tokenize the text into a sequence of words ([re](https://docs.python.org/3.10/library/re.html))
3. Count the words (with a [dictionary](https://docs.python.org/3.10/tutorial/datastructures.html?highlight=dictionary#dictionaries) or with [Counter](https://docs.python.org/3.10/library/collections.html?highlight=counter#collections.Counter))

But before that, let us import the library that we need for tokenising.


In [None]:
import re 

Let us open Genesis.

In [None]:
# Remember to place (or upload) genesis.txt in the working directory first

# The file handle is called f so that the built-in input() is not shadowed
with open("genesis.txt", 'r') as f:
    txt = f.read()

# Apply a regex to the string txt and collect all occurrences of the given pattern
tokens = re.findall('[A-Za-z]+', txt)


In [None]:
print(tokens)
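
To see what the pattern `[A-Za-z]+` keeps and what it drops, here is a minimal sketch on a made-up sentence (the string `sample` is an assumption for illustration and is not taken from `genesis.txt`). Every maximal run of ASCII letters becomes a token, so digits and punctuation act as separators and an apostrophe splits a word in two.

```python
import re

# Hypothetical sample text, not part of genesis.txt
sample = "In the beginning God's 7 days: heaven, earth."

sample_tokens = re.findall('[A-Za-z]+', sample)
print(sample_tokens)
# ['In', 'the', 'beginning', 'God', 's', 'days', 'heaven', 'earth']
```

Note that accented letters are also outside `[A-Za-z]`, so this definition of "word" suits plain English text best.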

Counting option 1: using a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)

In [None]:
mydict = {}
for token in tokens:
    # first time we see this token: start its count at zero
    if token not in mydict:
        mydict[token] = 0
    mydict[token] += 1
print(mydict)

Counting option 2: using a [Counter](https://docs.python.org/3/library/collections.html?highlight=counter#collections.Counter)

In [None]:
# Option 2: using a counter
from collections import Counter
counter = Counter(tokens)
print(counter)

print(counter['the'])

In [None]:
print("Counter", counter["his"])
print("dictionary", mydict["his"])
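
The two structures give the same counts, but they behave differently for words that never occur, and a `Counter` can also rank words by frequency. A small sketch on hand-made data (`toy_tokens`, `toy_counter` and `toy_dict` are hypothetical names used only here):

```python
from collections import Counter

# Hypothetical toy data standing in for the Genesis tokens
toy_tokens = ["the", "earth", "and", "the", "heaven", "and", "the", "light"]

toy_counter = Counter(toy_tokens)
toy_dict = dict(toy_counter)

# most_common(n) returns the n most frequent items as (word, count) pairs
print(toy_counter.most_common(2))   # [('the', 3), ('and', 2)]

# A Counter returns 0 for a missing key; a plain dict would raise KeyError
print(toy_counter["darkness"])      # 0
print(toy_dict.get("darkness", 0))  # 0, thanks to the default passed to get()
```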

- There are many official and contributed Python libraries. They are imported with _import_:
  - `import library`
  - `from library import name`
- Once a library has been imported, we have access to its functions and classes.
- The contents of a (text) file are read after opening it with `open()`.
- Regular expressions are powerful tools for finding patterns.
- Lists are precisely that: ordered lists of elements.
- Dictionaries store key-value pairs.
- Loops repeat a block of code while a condition holds or once per element of an iterable (here we use `for`).
- Conditionals execute a code snippet only if a condition is `True` (here we use a simple `if`).
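
The two import forms listed above can be compared side by side. This is only an illustrative sketch; the word `"abracadabra"` and the names `c1` and `c2` are made up for the example.

```python
# Form 1: import the module, then use the qualified name
import collections
c1 = collections.Counter("abracadabra")

# Form 2: import a single name from the module and use it directly
from collections import Counter
c2 = Counter("abracadabra")

print(c1 == c2)   # True: both forms give access to the same class
print(c2["a"])    # 5
```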

In [None]:
# print the number of tokens and the fourth token (indices start at 0)

print(len(tokens))
print(tokens[3])

In [None]:
for i in range(0, 20):
    print(i, tokens[i])

In [None]:
print(tokens[0:7])

In [None]:
print(tokens[-7:])

In [None]:
# sort the words in the list

sorted_tokens = sorted(tokens)
print(sorted_tokens[:10])
print(sorted_tokens[-10:])

In [None]:
# counting again, but this time over sorted_tokens

mydict = {}
for token in sorted_tokens:
    if token not in mydict:
        mydict[token] = 0
    mydict[token] += 1
print(mydict)
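
Counting over a sorted list mirrors the Unix pipeline `sort | uniq -c`: equal words end up next to each other, so each run can be counted once. A hedged sketch of that idea with `itertools.groupby` from the standard library follows. It is a swapped-in technique, not the approach used in the cell above, and `toy_tokens` is made-up data.

```python
from itertools import groupby

# Hypothetical toy data standing in for the Genesis tokens
toy_tokens = ["the", "earth", "and", "the", "heaven", "and", "the"]

# groupby only merges adjacent equal items, so the list must be sorted first
run_counts = [(word, len(list(run))) for word, run in groupby(sorted(toy_tokens))]
print(run_counts)
# [('and', 2), ('earth', 1), ('heaven', 1), ('the', 3)]
```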

## 2. Different ways of sorting a list of words

Ignore case when counting: lowercasing

In [None]:
print(txt)

In [None]:
txt = txt.lower()
tokens = re.findall('[A-Za-z]+', txt)

counter = Counter(tokens)
print(counter)

Count sequences of vowels

In [None]:
vowels = re.findall('[aeiou]+', txt)
counter = Counter(vowels)
print(counter)

Count sequences of consonants

In [None]:
consonants = re.findall('[bcdfghjklmnpqrstvwxyz]+', txt)
counter = Counter(consonants)
print(counter)


**From Unix for Poets**

"These three examples are intended to show how easy it is to change the definition of what counts as a word. Sometimes you want to distinguish between upper and lower case, and sometimes you don’t [...] The same basic counting program can be used to count a variety of different things, depending on how you implement the definition of _thing_ (=token)."
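
One way to make that point concrete is to turn the definition of _thing_ into a parameter. The helper `count_things` below is hypothetical (it does not appear in the notebook or in Church's chapter), and `sample` is made-up text; treat it as a minimal sketch.

```python
import re
from collections import Counter

def count_things(text, pattern, ignore_case=False):
    """Count every match of `pattern` in `text` (hypothetical helper)."""
    if ignore_case:
        text = text.lower()
    return Counter(re.findall(pattern, text))

sample = "The earth and the heaven"   # made-up text

words = count_things(sample, '[A-Za-z]+')
words_lower = count_things(sample, '[A-Za-z]+', ignore_case=True)
vowel_runs = count_things(sample, '[aeiou]+', ignore_case=True)

print(words["the"], words_lower["the"])   # 1 2
print(vowel_runs["ea"])                   # 2
```

Only the pattern and the case flag change between the three calls; the counting code stays the same.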

### 2.1 Sort in dictionary order

In [None]:
# what am I doing here?
with open("genesis.txt", 'r') as f:
    txt = f.read()
 
tokens = re.findall('[A-Za-z]+', txt)
tokens = sorted(tokens)
print(tokens)

### 2.2 Sort in "rhyming" order

We have seen `[x]`, `[x:y]` and `[:x]` among others.

Let us meet `[::-1]`

In [None]:
word = ["hello how are you", "my name", "today"]
for w in word:
    print(w[::-1])
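
The slice forms mentioned above can be compared on one short list. This is a minimal sketch with a made-up list called `letters`:

```python
letters = ["a", "b", "c", "d", "e"]   # made-up list

print(letters[1])       # a single element: b
print(letters[1:3])     # from index 1 up to (not including) 3: ['b', 'c']
print(letters[:2])      # the first two elements: ['a', 'b']
print(letters[-2:])     # the last two elements: ['d', 'e']
print(letters[::-1])    # a step of -1 walks backwards: ['e', 'd', 'c', 'b', 'a']
print("genesis"[::-1])  # strings slice the same way: siseneg
```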

In [None]:
for i in range(len(tokens) - 1):
    print(tokens[i:i+2])

In [None]:
# Notice this function!
def invert(word):
    return word[::-1]

# Note the additional parameter: words are compared by their reversed spelling
rhyme_tokens = sorted(tokens, key=invert)

print(rhyme_tokens)
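
On the full text the effect is hard to spot, so here is a small sketch with a made-up word list (`toy_words` is hypothetical). Sorting by the reversed spelling puts words with the same ending next to each other:

```python
def invert(word):
    return word[::-1]

toy_words = ["light", "day", "night", "way", "earth"]   # made-up list

# The key function decides how words are compared; the words themselves are unchanged
rhyme_order = sorted(toy_words, key=invert)
print(rhyme_order)
# ['earth', 'light', 'night', 'day', 'way']
```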

## 3. Compute n-gram statistics

Let us first look at the string method `join()`.
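
`join()` is called on the separator string and glues the items of a list together with that separator in between. A minimal sketch with a made-up list called `parts`:

```python
parts = ["one", "two", "three"]   # made-up list

print("".join(parts))    # onetwothree
print(" ".join(parts))   # one two three
print("-".join(parts))   # one-two-three
```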

In [None]:
"".join(["one", "two", "three"])

Producing _2_-grams

In [None]:
with open("genesis.txt", 'r') as f:
    txt = f.read()
txt = txt.lower()
tokens = re.findall('[A-Za-z]+', txt)

# What is going on here?
bigrams = [" ".join(tokens[i:i+2]) for i in range(len(tokens)-1)]

# This is called a list comprehension: https://peps.python.org/pep-0202/

c = Counter(bigrams)
print(c)

Producing _3_-grams

In [None]:
trigrams = [" ".join(tokens[i:i+3]) for i in range(len(tokens)-2)]
c = Counter(trigrams)
print(c)

Producing _n_-grams for **any** _n_

In [None]:
n = 7
grams = [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
c = Counter(grams)
print(c)
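
The same comprehension can be wrapped in a function so that _n_ becomes an argument. The helper name `ngrams` and the short token list are assumptions made for this sketch; they are not part of the original notebook.

```python
def ngrams(tokens, n):
    """Return all n-grams of `tokens` as space-joined strings (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up token list of 10 words
toy_tokens = "in the beginning god created the heaven and the earth".split()

print(ngrams(toy_tokens, 2)[:3])   # ['in the', 'the beginning', 'beginning god']
print(len(ngrams(toy_tokens, 3)))  # 8, that is 10 - 3 + 1
print(ngrams(toy_tokens, 20))      # [] because n is longer than the text
```

A list of _k_ tokens yields _k - n + 1_ n-grams, which is why the range stops at `len(tokens) - n + 1`.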

**End of the notebook**

(you might want to have a look at the exercises)