A busy week in data mining (Part 1: nltk)
By Kim Doyle
Last week, we ran our first training course on nltk since ResBaz2016. It was also the first training we’ve run in our fabulous new centre. Yay!
Look how cosy we all are :-)
Full room here at the #CoLab! @kim_doyle1 is running training in Python’s Natural Language Toolkit 😊 pic.twitter.com/MEwMAXi8G5
— Research Platforms (@ResPlat) April 5, 2016
This time, we tested out some new curriculum that focuses on text mining and analysis for beginners.
We began by using some of the sample data that comes with the nltk book.
Once we had a few texts to play with…
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

We got stuck in with the Jupyter Notebook on the cloud.
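Once the texts are loaded, a typical first exploration is counting word frequencies. A minimal sketch using only the standard library — the token list here is a made-up, hand-tokenized fragment of Moby Dick's opening line, standing in for the real nltk texts:

```python
from collections import Counter

# a tiny hand-made token list (in the course we used the nltk book texts instead)
tokens = ["call", "me", "ishmael", "some", "years", "ago", "call", "me"]

# Counter tallies how often each token appears
freq = Counter(tokens)
print(freq.most_common(2))  # [('call', 2), ('me', 2)]
```

The same pattern scales straight up: pass any of the nltk texts' token lists to `Counter` and `most_common` gives you the vocabulary ranked by frequency.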

Unfortunately, we experienced some technical difficulties with the cloud…

The show went on: I briefly experimented with offline coding before we were back to tokenizing, tagging and parsing like pros.
import nltk
from nltk import word_tokenize

sentence = "Coding like a badarse."
words = word_tokenize(sentence)
tagged = nltk.pos_tag(words, tagset="universal")
print(tagged)
[('Coding', 'VERB'), ('like', 'ADP'), ('a', 'DET'), ('badarse', 'NOUN')]
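With the tags in hand, filtering by part of speech is a one-liner — a quick sketch reusing the (word, tag) pairs printed above:

```python
# the tagged output from above
tagged = [('Coding', 'VERB'), ('like', 'ADP'), ('a', 'DET'), ('badarse', 'NOUN')]

# keep only the words tagged as nouns
nouns = [word for word, tag in tagged if tag == 'NOUN']
print(nouns)  # ['badarse']
```

Swapping `'NOUN'` for `'VERB'` or `'ADJ'` pulls out the other word classes the same way.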
In between learning Python syntax and a bit of linguistics, we had time to chat about our research and inspect the drinks table…

Some people had come from Computer Science and wanted to brush up on Python, learn a natural language library and understand the linguistic theory to analyse their data. Others were from the Humanities and Social Sciences and were curious to know what text mining can offer the Arts.
Beginning of the second day… Thanks for sticking it out guys!
@kim_doyle1 running the second day of NLTK training @ResPlat pic.twitter.com/gPRaqpzsOq
— Yuandra Ismiraldi (@iniandra) April 6, 2016
Day two was all about scraping that data and we powered through the second half of the course with Beautiful Soup.
No, not the song sung by the Mock Turtle in Alice in Wonderland…

More like this!
!sudo pip3 install BeautifulSoup4
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Smog"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, 'html.parser')

Or something like that… Beautiful Soup is a Python package for parsing HTML and XML documents, including malformed markup (i.e. non-closed tags), nicknamed 'tag soup'. Hence the name Beautiful Soup ;-)
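If installing Beautiful Soup isn't an option, Python's built-in html.parser module can do a rough version of the same job. A minimal sketch — the HTML snippet here is made up for illustration rather than fetched from Wikipedia, and `TextExtractor` is just a name I've chosen:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # called for each run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)

html = "<html><body><h1>Smog</h1><p>Smog is a kind of <b>air pollution</b>.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))
```

Beautiful Soup is still the friendlier tool for real scraping — it copes gracefully with the malformed 'tag soup' you find in the wild — but the stdlib version is handy when you can't install packages.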
The last challenge of the final day was to choose and scrape our own webpage, with the help of the lovely Yuandra.

A natural teacher, Yuandra doing what he does best.
We were all tired by the end of two days of text mining, all except the #DataMiner platy who wouldn’t go back in her cardboard…
Playing hide and seek with the #DataMiner platy at @ResPlat nltk training @unimelb pic.twitter.com/nnc3pNQ8Ev
— Kim Doyle (@kim_doyle1) April 5, 2016
Next: Part 2: data acquisition
