{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is to have markdown tables left aligned\n", "# It can be ignored\n", "from IPython.core.display import HTML\n", "table_css = 'table {align:left;display:block} '\n", "HTML(''.format(table_css))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DIT -- Computing Learning\n", "\n", "This notebook is part of the materials for three 2023/2024 lessons:# DIT -- Computing Learning\n", "\n", "\n", "| Code | Lesson |\n", "| :--- | :----------- |\n", "| B2696 | Computational Thinking |\n", "| 99797 | Advanced Professional Skills |\n", "| B3520 | Profession-based Research |\n", "\n", "It has also been used for the PhD seminars on python offered to the PhD students in 2022, 2023, and 2024 " ] }, { "cell_type": "markdown", "metadata": { "id": "Ik_5kKpgmgJ2" }, "source": [ "# Python for Poets 1" ] }, { "cell_type": "markdown", "metadata": { "id": "_akqWAD8mgJ8" }, "source": [ "This Jupyter Notebook is derived from Keneth W. Church's [Unix for Poets](https://www.cs.upc.edu/~padro/Unixforpoets.pdf). From that chapter itself:\n", "\n", "- \"many researchers have more data than they know what to do with\"\n", "- \"Many researchers believe that they don’t have sufficient computing resources to do these things for themselves.\"\n", "- \"This chapter will describe a set of simple Unix-based \\[**Python in our case**\\] tools that should\n", "be more than adequate for counting trigrams on a corpus the size of the Brown Corpus\"\n", "- \"this chapter will focus on examples and avoid definitions whenever possible\"\n", "\n", "The code has been developed using Python 3.6. It has been written using [PyCharm](), and tested on [Colab](). All snippets could be run on any machine with Python 3.6 (or higher) installed or online, as a Jupyter notebook.\n", "\n", "Note: the solution to many of these exercises is simpler using Unix/GNU Linux command line one-liners!" ] }, { "cell_type": "markdown", "metadata": { "id": "i1tz4v8smgJ_" }, "source": [ "## 1. Excercise 1: Count words in a text" ] }, { "cell_type": "markdown", "metadata": { "id": "FptYHPNKmgKC" }, "source": [ "From Chuch. \"The problem is to input a text file, say Genesis (a good place to start), and output a list of words in the file along with their frequency counts. The algorithm consists of three steps:\"\n", "\n", "1. Open the file\n", "2. Tokenize the text into a sequence of words ([re](https://docs.python.org/3.10/library/re.html)),\n", "2. Count the words (with a [dictionary](https://docs.python.org/3.10/tutorial/datastructures.html?highlight=dictionary#dictionaries) or with [Counter](https://docs.python.org/3.10/library/collections.html?highlight=counter#collections.Counter))\n", "\n", "But, before that, let us import the library that we need to tokenise\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bl4UCUsnmgKF" }, "outputs": [], "source": [ "import re " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us open Genesis." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eor7NbtKDVs0" }, "outputs": [], "source": [ "# do not forget to put/upload the txt file before \n", "\n", "with open(\"genesis.txt\", 'r') as input:\n", " txt = input.read()\n", "\n", "# Apply a regex to string txt and look for all occurrences of the given pattern\n", "tokens = re.findall('[A-Za-z]+', txt)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1WA6OW6sf1bX" }, "outputs": [], "source": [ "print(tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "9mZfZf2L2twj" }, "source": [ "Counting option 1: using a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "71hXnC0ymgKa" }, "outputs": [], "source": [ "mydict = {}\n", "for token in tokens:\n", " if token not in mydict:\n", " mydict[token] = 0\n", " mydict[token] += 1\n", "print(mydict)" ] }, { "cell_type": "markdown", "metadata": { "id": "n8fdfXOH3SbM" }, "source": [ "Counting option 2: using a [Counter](https://docs.python.org/3/library/collections.html?highlight=counter#collections.Counter)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RuQ0VH4nmgKj" }, "outputs": [], "source": [ "# Option 2: using a counter\n", "from collections import Counter\n", "counter = Counter(tokens)\n", "print(counter)\n", "\n", "print(counter['the'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dl9squPQjdfs" }, "outputs": [], "source": [ "print(\"Counter\", counter[\"his\"])\n", "print(\"dictionary\", mydict[\"his\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "5oVxbjcPmgKq" }, "source": [ "- There are many official Python (and contributed) libraries available. They are imported with _import_:\n", " - `import library`\n", " - `from library import module`\n", "- Once a library has been imported, we have access to all its methods and classes \n", "- The contents of a (text) file are accessed with `open()`\n", "- Regular expressions are powerful tools to find patterns\n", "- Lists are precisely that: lists of elements. 
\n", "- Dictionaries are key-value pairs.\n", "- Loops are repetitions until certain condition is true or until covering an iterator (here we use `for`)\n", "- Conditionals execute a code snippet if a condition is `true` (here we use a simple `if`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KVxvC1VLmgKs" }, "outputs": [], "source": [ "# print the first k words in the text\n", "\n", "print(len(tokens))\n", "print(tokens[3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UgwzbyMgKx0n" }, "outputs": [], "source": [ "for i in range(0, 20):\n", " print(i, tokens[i])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "snpTDm2hK_Cn" }, "outputs": [], "source": [ "print(tokens[0:7])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cm1xAi92LBEF" }, "outputs": [], "source": [ "print(tokens[-7:])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VSY0ajczmgKy" }, "outputs": [], "source": [ "# sort the words in the list\n", "\n", "sorted_tokens = sorted(tokens)\n", "print(sorted_tokens[:10])\n", "print(sorted_tokens[-10:])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bJ8HCJ-rmgK6" }, "outputs": [], "source": [ "# counting again, but this time the sorted_tokens\n", "\n", "mydict = {}\n", "for token in sorted_tokens:\n", " if token not in mydict:\n", " mydict[token] = 0\n", " mydict[token] += 1\n", "print(mydict)" ] }, { "cell_type": "markdown", "metadata": { "id": "flG0D9EYmgLA" }, "source": [ "## 2. Different ways of sorting a list of words" ] }, { "cell_type": "markdown", "metadata": { "id": "LxGTPnvEmgLB" }, "source": [ "Ignore the case when counting: lower casing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_npE7hS7oviw" }, "outputs": [], "source": [ "print(txt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x9D_32Q5mgLE" }, "outputs": [], "source": [ "txt = txt.lower()\n", "tokens = re.findall('[A-Za-z]+', txt)\n", "\n", "counter = Counter(tokens)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": { "id": "F_jz0b2WmgLK" }, "source": [ "Count sequences of vowels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JK_bUBbYmgLL" }, "outputs": [], "source": [ "vowels = re.findall('[aeiou]+', txt)\n", "counter = Counter(vowels)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": { "id": "tdozPGM-mgLS" }, "source": [ "Count sequences of consonants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b5y2358dmgLT" }, "outputs": [], "source": [ "consonants = re.findall('[bcdfghjklmnpqrstvwxyz]+', txt)\n", "counter = Counter(consonants)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": { "id": "MzVdw-mgmgLY" }, "source": [ "\n", "**From Unix for poets**\n", "\n", "\"These three examples are intended to show how easy it is to change the definition of what counts as a word. Sometimes you want to distinguish between upper and lower case, and sometimes you don’t [...] 
{ "cell_type": "markdown", "metadata": { "id": "iKtW0ep_Dl6f" }, "source": [ "### 2.1 Sort in dictionary order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OoSYXc8XFwbW" }, "outputs": [], "source": [ "# what am I doing here?\n", "with open(\"genesis.txt\", 'r') as infile:\n", "    txt = infile.read()\n", "\n", "tokens = re.findall('[A-Za-z]+', txt)\n", "tokens = sorted(tokens)\n", "print(tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "S1iXxuOzmgLj" }, "source": [ "### 2.2 Sort in \"rhyming\" order\n", "\n", "We have seen `[x]`, `[x:y]` and `[:x]`, among others.\n", "\n", "Let us meet `[::-1]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iAgJyIah_N-h" }, "outputs": [], "source": [ "words = [\"hello how are you\", \"my name\", \"today\"]\n", "for w in words:\n", "    print(w[::-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LIbAvPMQAvlg" }, "outputs": [], "source": [ "for i in range(len(tokens) - 1):\n", "    print(tokens[i:i+2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b1mMoydrmgLk" }, "outputs": [], "source": [ "# Notice this function definition!\n", "def invert(word):\n", "    return word[::-1]\n", "\n", "# Note the additional parameter\n", "rhyme_tokens = sorted(tokens, key=invert)\n", "\n", "print(rhyme_tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "RUqC_jB-Mw8w" }, "source": [ "## 3. Compute n-gram statistics\n", "\n", "Let us first look at the string method `join()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "52qxLhAiDkzM" }, "outputs": [], "source": [ "# the string on which join() is called becomes the separator\n", "print(\"\".join([\"one\", \"two\", \"three\"]))\n", "print(\" \".join([\"one\", \"two\", \"three\"]))" ] }, { "cell_type": "markdown", "metadata": { "id": "4PLJXZ81mgLt" }, "source": [ "Producing _2_-grams" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T_i68Dv6mgLv" }, "outputs": [], "source": [ "with open(\"genesis.txt\", 'r') as infile:\n", "    txt = infile.read()\n", "txt = txt.lower()\n", "tokens = re.findall('[A-Za-z]+', txt)\n", "\n", "# What is going on here?\n", "bigrams = [\" \".join(tokens[i:i+2]) for i in range(len(tokens)-1)]\n", "\n", "# This is called a list comprehension: https://peps.python.org/pep-0202/\n", "\n", "c = Counter(bigrams)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "RP2cTZrbmgL0" }, "source": [ "Producing _3_-grams" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5FaDbwhbmgL1" }, "outputs": [], "source": [ "trigrams = [\" \".join(tokens[i:i+3]) for i in range(len(tokens)-2)]\n", "c = Counter(trigrams)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "A8V5vH-TmgL5" }, "source": [ "Producing _n_-grams for **any** _n_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X6Gt5K0MmgL7" }, "outputs": [], "source": [ "n = 7\n", "grams = [\" \".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]\n", "c = Counter(grams)\n", "print(c)" ] },
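{ "cell_type": "markdown", "metadata": {}, "source": [ "To wrap up (an addition to this notebook, not from the chapter): the list comprehension above can be packaged into a small function, so that computing _n_-gram counts for any _n_ becomes a one-liner." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A reusable helper (a sketch added for this notebook): n-gram counts for any n\n", "def ngram_counts(tokens, n):\n", "    grams = [\" \".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]\n", "    return Counter(grams)\n", "\n", "print(ngram_counts(tokens, 2).most_common(5))\n", "print(ngram_counts(tokens, 3).most_common(5))" ] },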
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 1 }