{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is to have markdown tables left aligned\n", "# It can be ignored\n", "from IPython.core.display import HTML\n", "table_css = 'table {align:left;display:block} '\n", "HTML(''.format(table_css))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DIT -- Computing Learning\n", "\n", "This notebook is part of the materials for three 2023/2024 lessons:# DIT -- Computing Learning\n", "\n", "\n", "| Code | Lesson |\n", "| :--- | :----------- |\n", "| B2696 | Computational Thinking |\n", "| 99797 | Advanced Professional Skills |\n", "| B3520 | Profession-based Research |\n", "\n", "It has also been used for the PhD seminars on python offered to the PhD students in 2022, 2023, and 2024 " ] }, { "cell_type": "markdown", "metadata": { "id": "Ik_5kKpgmgJ2" }, "source": [ "# Python for Poets 1" ] }, { "cell_type": "markdown", "metadata": { "id": "_akqWAD8mgJ8" }, "source": [ "This Jupyter Notebook is derived from Keneth W. Church's [Unix for Poets](https://www.cs.upc.edu/~padro/Unixforpoets.pdf). From that chapter itself:\n", "\n", "- \"many researchers have more data than they know what to do with\"\n", "- \"Many researchers believe that they don’t have sufficient computing resources to do these things for themselves.\"\n", "- \"This chapter will describe a set of simple Unix-based \\[**Python in our case**\\] tools that should\n", "be more than adequate for counting trigrams on a corpus the size of the Brown Corpus\"\n", "- \"this chapter will focus on examples and avoid definitions whenever possible\"\n", "\n", "The code has been developed using Python 3.6. It has been written using [PyCharm](), and tested on [Colab](). All snippets could be run on any machine with Python 3.6 (or higher) installed or online, as a Jupyter notebook.\n", "\n", "Note: the solution to many of these exercises is simpler using Unix/GNU Linux command line one-liners!" ] }, { "cell_type": "markdown", "metadata": { "id": "i1tz4v8smgJ_" }, "source": [ "## 1. Excercise 1: Count words in a text" ] }, { "cell_type": "markdown", "metadata": { "id": "FptYHPNKmgKC" }, "source": [ "From Chuch. \"The problem is to input a text file, say Genesis (a good place to start), and output a list of words in the file along with their frequency counts. The algorithm consists of three steps:\"\n", "\n", "1. Open the file\n", "2. Tokenize the text into a sequence of words ([re](https://docs.python.org/3.10/library/re.html)),\n", "2. Count the words (with a [dictionary](https://docs.python.org/3.10/tutorial/datastructures.html?highlight=dictionary#dictionaries) or with [Counter](https://docs.python.org/3.10/library/collections.html?highlight=counter#collections.Counter))\n", "\n", "But, before that, let us import the library that we need to tokenise\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bl4UCUsnmgKF" }, "outputs": [], "source": [ "import re " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us open Genesis." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eor7NbtKDVs0" }, "outputs": [], "source": [ "# do not forget to put/upload the txt file before \n", "\n", "with open(\"genesis.txt\", 'r') as input:\n", " txt = input.read()\n", "\n", "# Apply a regex to string txt and look for all occurrences of the given pattern\n", "tokens = re.findall('[A-Za-z]+', txt)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1WA6OW6sf1bX" }, "outputs": [], "source": [ "print(tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "9mZfZf2L2twj" }, "source": [ "Counting option 1: using a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "71hXnC0ymgKa" }, "outputs": [], "source": [ "mydict = {}\n", "for token in tokens:\n", " if token not in mydict:\n", " mydict[token] = 0\n", " mydict[token] += 1\n", "print(mydict)" ] }, { "cell_type": "markdown", "metadata": { "id": "n8fdfXOH3SbM" }, "source": [ "Counting option 2: using a [Counter](https://docs.python.org/3/library/collections.html?highlight=counter#collections.Counter)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RuQ0VH4nmgKj" }, "outputs": [], "source": [ "# Option 2: using a counter\n", "from collections import Counter\n", "counter = Counter(tokens)\n", "print(counter)\n", "\n", "print(counter['the'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dl9squPQjdfs" }, "outputs": [], "source": [ "print(\"Counter\", counter[\"his\"])\n", "print(\"dictionary\", mydict[\"his\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "5oVxbjcPmgKq" }, "source": [ "- There are many official Python (and contributed) libraries available. They are imported with _import_:\n", " - `import library`\n", " - `from library import module`\n", "- Once a library has been imported, we have access to all its methods and classes \n", "- The contents of a (text) file are accessed with `open()`\n", "- Regular expressions are powerful tools to find patterns\n", "- Lists are precisely that: lists of elements. 
\n", "- Dictionaries are key-value pairs.\n", "- Loops are repetitions until certain condition is true or until covering an iterator (here we use `for`)\n", "- Conditionals execute a code snippet if a condition is `true` (here we use a simple `if`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KVxvC1VLmgKs" }, "outputs": [], "source": [ "# print the first k words in the text\n", "\n", "print(len(tokens))\n", "print(tokens[3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UgwzbyMgKx0n" }, "outputs": [], "source": [ "for i in range(0, 20):\n", " print(i, tokens[i])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "snpTDm2hK_Cn" }, "outputs": [], "source": [ "print(tokens[0:7])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cm1xAi92LBEF" }, "outputs": [], "source": [ "print(tokens[-7:])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VSY0ajczmgKy" }, "outputs": [], "source": [ "# sort the words in the list\n", "\n", "sorted_tokens = sorted(tokens)\n", "print(sorted_tokens[:10])\n", "print(sorted_tokens[-10:])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bJ8HCJ-rmgK6" }, "outputs": [], "source": [ "# counting again, but this time the sorted_tokens\n", "\n", "mydict = {}\n", "for token in sorted_tokens:\n", " if token not in mydict:\n", " mydict[token] = 0\n", " mydict[token] += 1\n", "print(mydict)" ] }, { "cell_type": "markdown", "metadata": { "id": "flG0D9EYmgLA" }, "source": [ "## 2. Different ways of sorting a list of words" ] }, { "cell_type": "markdown", "metadata": { "id": "LxGTPnvEmgLB" }, "source": [ "Ignore the case when counting: lower casing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_npE7hS7oviw" }, "outputs": [], "source": [ "print(txt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x9D_32Q5mgLE" }, "outputs": [], "source": [ "txt = txt.lower()\n", "tokens = re.findall('[A-Za-z]+', txt)\n", "\n", "counter = Counter(tokens)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": { "id": "F_jz0b2WmgLK" }, "source": [ "Count sequences of vowels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JK_bUBbYmgLL" }, "outputs": [], "source": [ "vowels = re.findall('[aeiou]+', txt)\n", "counter = Counter(vowels)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": { "id": "tdozPGM-mgLS" }, "source": [ "Count sequences of consonants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b5y2358dmgLT" }, "outputs": [], "source": [ "consonants = re.findall('[bcdfghjklmnpqrstvwxyz]+', txt)\n", "counter = Counter(consonants)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": { "id": "MzVdw-mgmgLY" }, "source": [ "\n", "**From Unix for poets**\n", "\n", "\"These three examples are intended to show how easy it is to change the definition of what counts as a word. Sometimes you want to distinguish between upper and lower case, and sometimes you don’t [...] 
{ "cell_type": "markdown", "metadata": { "id": "iKtW0ep_Dl6f" }, "source": [ "### 2.1 Sort in dictionary order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OoSYXc8XFwbW" }, "outputs": [], "source": [ "# what am I doing here?\n", "with open(\"genesis.txt\", 'r') as infile:\n", "    txt = infile.read()\n", "\n", "tokens = re.findall('[A-Za-z]+', txt)\n", "tokens = sorted(tokens)\n", "print(tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "S1iXxuOzmgLj" }, "source": [ "### 2.2 Sort in \"rhyming\" order\n", "\n", "We have seen `[x]`, `[x:y]` and `[:x]`, among others.\n", "\n", "Let us meet `[::-1]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iAgJyIah_N-h" }, "outputs": [], "source": [ "words = [\"hello how are you\", \"my name\", \"today\"]\n", "for w in words:\n", "    print(w[::-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LIbAvPMQAvlg" }, "outputs": [], "source": [ "for i in range(len(tokens) - 1):\n", "    print(tokens[i:i+2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b1mMoydrmgLk" }, "outputs": [], "source": [ "# Notice this function definition!\n", "def invert(word):\n", "    return word[::-1]\n", "\n", "# Note the additional parameter\n", "rhyme_tokens = sorted(tokens, key=invert)\n", "\n", "print(rhyme_tokens)" ] }, { "cell_type": "markdown", "metadata": { "id": "RUqC_jB-Mw8w" }, "source": [ "## 3. Compute n-gram statistics\n", "\n", "Let us first look at the string method `join()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "52qxLhAiDkzM" }, "outputs": [], "source": [ "# the string on which join() is called becomes the separator\n", "print(\"\".join([\"one\", \"two\", \"three\"]))\n", "print(\" \".join([\"one\", \"two\", \"three\"]))" ] }, { "cell_type": "markdown", "metadata": { "id": "4PLJXZ81mgLt" }, "source": [ "Producing _2_-grams" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T_i68Dv6mgLv" }, "outputs": [], "source": [ "with open(\"genesis.txt\", 'r') as infile:\n", "    txt = infile.read()\n", "txt = txt.lower()\n", "tokens = re.findall('[A-Za-z]+', txt)\n", "\n", "# What is going on here?\n", "bigrams = [\" \".join(tokens[i:i+2]) for i in range(len(tokens)-1)]\n", "\n", "# This is called a list comprehension: https://peps.python.org/pep-0202/\n", "\n", "c = Counter(bigrams)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "RP2cTZrbmgL0" }, "source": [ "Producing _3_-grams" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5FaDbwhbmgL1" }, "outputs": [], "source": [ "trigrams = [\" \".join(tokens[i:i+3]) for i in range(len(tokens)-2)]\n", "c = Counter(trigrams)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "A8V5vH-TmgL5" }, "source": [ "Producing _n_-grams for **any** _n_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X6Gt5K0MmgL7" }, "outputs": [], "source": [ "n = 7\n", "grams = [\" \".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]\n", "c = Counter(grams)\n", "print(c)" ] },
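{ "cell_type": "markdown", "metadata": {}, "source": [ "To wrap up (an addition to this notebook, not from the chapter): the list comprehension above can be packaged into a small function, so that computing _n_-gram counts for any _n_ becomes a one-liner." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A reusable helper (a sketch added for this notebook): n-gram counts for any n\n", "def ngram_counts(tokens, n):\n", "    grams = [\" \".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]\n", "    return Counter(grams)\n", "\n", "print(ngram_counts(tokens, 2).most_common(5))\n", "print(ngram_counts(tokens, 3).most_common(5))" ] },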
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 1 }