Research Computing Services Blog


A busy week in data mining (Part 1: nltk)

By Kim Doyle

Last week, we ran our first training course on nltk since ResBaz2016. It was also the first training we’ve run in our fabulous new centre. Yay!

Look how cosy we all are :-)

Full room here at the #CoLab! @kim_doyle1 is running training in Python’s Natural Language Toolkit 😊 pic.twitter.com/MEwMAXi8G5

— Research Platforms (@ResPlat)
April 5, 2016

This time, we tested out some new curriculum that focuses on text mining and analysis for beginners.

We began by using some of the dummy data that comes with the nltk book.

Once we had a few texts to play with…

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

We got stuck in with the Jupyter Notebook on the cloud.


Unfortunately, we experienced some technical difficulties with the cloud…

(The only genuinely funny part of what by all accounts was a terrible movie)

Fortunately, we had the adorable Alan from our cloud computing team (who does understand the cloud) working to keep nltk running; we keep him in the kitchen :S

The show went on and I experimented with offline coding, briefly, before we were back tokenizing, tagging and parsing like pros.

import nltk
from nltk import word_tokenize

sentence = "Coding like a badarse."
words = word_tokenize(sentence)  # splits into ['Coding', 'like', 'a', 'badarse', '.']
tagged = nltk.pos_tag(words, tagset="universal")
print(tagged)

[('Coding', 'VERB'), ('like', 'ADP'), ('a', 'DET'), ('badarse', 'NOUN'), ('.', '.')]
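Once you have those (word, tag) pairs, plain Python is enough to slice them up. A stdlib-only sketch, with the tagged list hard-coded from the output above rather than produced by nltk, pulling out just the nouns:

```python
# Tagged output as above (hard-coded so this runs without nltk installed)
tagged = [('Coding', 'VERB'), ('like', 'ADP'), ('a', 'DET'), ('badarse', 'NOUN')]

# Keep only the words whose universal tag is NOUN
nouns = [word for word, tag in tagged if tag == 'NOUN']
print(nouns)  # ['badarse']
```

The same list-comprehension pattern works for counting verbs, dropping punctuation, and so on.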

In between learning Python syntax and a bit of linguistics, we had time to chat about our research and inspect the drinks table…


Some people had come from Computer Science and wanted to brush up on Python, learn a natural language library and understand the linguistic theory to analyse their data. Others were from the Humanities and Social Sciences and were curious to know what text mining can offer the Arts.

Beginning of the second day… Thanks for sticking it out guys!

@kim_doyle1 running the second day of NLTK training @ResPlat pic.twitter.com/gPRaqpzsOq

— Yuandra Ismiraldi (@iniandra)
April 6, 2016

Day two was all about scraping that data and we powered through the second half of the course with Beautiful Soup.

No, not the song sung by the Mock Turtle in Alice in Wonderland…


More like this!

!sudo pip3 install BeautifulSoup4
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Smog"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, 'html.parser')

Or something like that… Beautiful Soup is a Python package for parsing HTML and XML documents, including malformed markup (i.e. unclosed tags), nicknamed ‘tag soup’. Hence the name Beautiful Soup ;-)
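If you want to see what parsing “soup” means without installing bs4, the standard library’s html.parser is also happy to chew through unclosed tags. A minimal sketch (the messy HTML string here is made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, even in sloppy markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Deliberately 'soupy' HTML: the <p> tags and the first <a> are never closed
messy = "<p>See <a href='/wiki/Smog'>smog<p><a href='/wiki/Haze'>haze</a>"
parser = LinkCollector()
parser.feed(messy)
print(parser.links)  # ['/wiki/Smog', '/wiki/Haze']
```

Beautiful Soup does the same kind of error-tolerant parsing, but wraps the result in a searchable tree (find, find_all, CSS selectors) instead of firing callbacks.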

The last challenge of the final day was to choose and scrape our own webpage, with the help of the lovely Yuandra.


A natural teacher, Yuandra doing what he does best.

We were all tired by the end of two days of text mining, all except the #DataMiner platy who wouldn’t go back in her cardboard…

Playing hide and seek with the #DataMiner platy at @ResPlat nltk training @unimelb pic.twitter.com/nnc3pNQ8Ev

— Kim Doyle (@kim_doyle1)
April 5, 2016

Next: Part 2: data acquisition

    • #data mining
    • #data science
    • #nltk
