Research Computing Services Blog

Introducing text mining (and myself)

*by [Daniel McDonald](https://twitter.com/interro_gator)*

Hullo everyone, I’m Daniel McDonald, a PhD student in Linguistics/Medicine here at the University of Melbourne. My thesis looks at how language use changes over the course of membership in an online support group for bipolar disorder. My background is in stock-standard linguistics: think syntax trees, cardinal vowels, Oxford commas, and wugs.

Increasingly, though, my research has led me down the computational path. Because the dataset I’m using in my thesis is over eight million words long, I realised I’d need to do a bit of automatic text wrangling and number crunching to get the job done right. Like many others before me, I headed straight to Python as the language of choice. It fits a grammarian like a glove (enter a Python session and type `import this` if you want to understand why). After a couple of months, I was thoroughly hooked: I’d written [my first library](https://www.github.com/interrogator/corpkit), and irreparably damaged my relationship with my sub-editor brother by proclaiming that Python ‘could probably automate like half your job’.

Anyway, recently, I’ve been helping out the Research Platforms gang, writing materials and giving free lessons to postgrads with the inimitable Fiona Tweedie (@FCTweedie) about the juicy intersection between Python, words, grammar and discourse.

*Team NLTK at ResBaz*

It’s perhaps a little bit of an unusual stream within the ResPlat group, because a fair chunk of the people interested in our lessons are from a humanities/social science background; that is to say, a fair chunk of people have never, ever, entered anything into the command line. So, our lessons cover quite a lot of ground: we start with basic Python/programming concepts, but we also try where possible to contextualise what we’re doing with some choice snippets from some of the Big Names, like Firth, Chomsky, Halliday and Widdowson (imagine *that* dinner party, linguists).

So, I’m basically writing today to introduce myself (expect more blogs from me in the coming weeks!), and to share an insight I got out of teaching, both at [#ResBaz](https://twitter.com/search?q=resbaz) and at our four-session course that ended last week. (The lesson materials for both courses, by the way, are [freely available on GitHub](https://www.github.com/resbaz/nltk).)

NLTK training starts, @FCTweedie & @interro_gator introduce the course #resbaz @ResPlat @ResBaz pic.twitter.com/2kB0Utq8dz

— Lachlan Musicman (@datakid23)
April 16, 2015

Fiona, my trusty co-pilot, begins our lessons by stressing a particularly fundamental concept in text analysis: that language is data, just like stats—that you can learn things about the world by manipulating and interrogating it. We’re both passionate about this idea, actually: we sometimes repeat it like a bit of a mantra.
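To make the mantra a little more concrete, here is a minimal sketch of treating a sentence as data. It is not taken from our lesson materials: it uses only the Python standard library, and the `word_frequencies` function and the sample sentence are made up purely for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, pull out word-like tokens, and count each one."""
    # A deliberately crude tokeniser: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, invented for this example.
sample = "Language is data. You can learn things about the world by counting language."
frequencies = word_frequencies(sample)

# Show the three most frequent tokens alongside their counts.
print(frequencies.most_common(3))
```

Once the words are counted, they can be sorted, compared and charted like any other set of numbers, which is really all the mantra claims. A real project would likely want a proper tokeniser rather than this one-line regular expression.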

What’s interesting is the radically different way in which the Humanities folk and the STEM folk in the classroom react to this idea. Humanities students often roll their eyes a little … ‘well, of course language is data’. They’ve never thought otherwise, and are often in our classes because they want training in analytical tools. The people from STEM, however, often seem to have their minds blown, and immediately invent whole new areas of interdisciplinary study to cope: ‘My god, are … are you saying we can run Wikipedia articles through a DNA sequencer?!’

These awesome and opposite reactions to a basic fact about language go a long way towards showing how complementary totally different branches of research, and totally different kinds of researchers, can be.

In our classes, half of the students come to us having known since birth the difference between integers and floats, and why you can’t parse HTML with regular expressions. These students are wonderful: they can help the others understand the meaning of a famously vague Python error message, or fix an unexpected encoding error on their instructor’s MacBook if need be (thank you thank you thank you).

The other half of the students bring to the table a totally different kind of knowledge about research: they understand exactly why we shouldn’t conflate a word with the thing it denotes, or mix formal and functional theories of language willy-nilly.

Opposites attract, worlds collide, cliches abound, and many hands make light work: as a group, we’re always far more than the sum of our parts. An interdisciplinary classroom is a superhero, whose only weakness is deadlines.

