Research Computing Services Blog

Natural Language ToolKit meets Mr Fraser

by Fiona Tweedie

Text mining, sometimes called ‘distant reading’, lets researchers analyse large bodies of text and uncover patterns in the language. This sort of work is nothing new, but as digitisation makes the available corpora larger, doing it by hand becomes incredibly labour-intensive: imagine going through millions of books with a highlighter trying to uncover key words. This is where the Natural Language Toolkit (NLTK) comes in. A library for the Python programming language, it allows researchers to quickly uncover key features of a text, such as lexical richness and key words, create topic models and present these findings graphically.
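To make "lexical richness" and "key words" concrete, here is a minimal sketch in plain Python rather than NLTK itself. It treats lexical richness as the type-token ratio (distinct words divided by total words) and key words as the most frequent non-stopwords. The sample sentence, the tiny stopword set and the function names are made up for illustration; they are not taken from the workshop materials.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lexical_richness(tokens):
    """Type-token ratio: number of distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def top_words(tokens, stopwords, n=3):
    """Most frequent tokens that are not in the stopword set."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text and stopword list, for illustration only.
sample = ("The wool growers of Wannon need better roads, "
          "and the wool market needs stable prices.")
STOPWORDS = {"the", "of", "and", "a", "to", "in"}

tokens = tokenize(sample)
richness = lexical_richness(tokens)
keywords = top_words(tokens, STOPWORDS, n=1)

print(f"{len(tokens)} tokens, richness {richness:.2f}, top word {keywords}")
```

In NLTK, nltk.word_tokenize and nltk.FreqDist cover similar ground with better tokenisation; the version above just shows the arithmetic involved.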

Being able to work effectively with large bodies of text, whether they are online discussion fora or digitised archival records, is increasingly useful for researchers in many fields of inquiry. We started teaching NLTK last year, and have since developed a course that has been taught as far afield as the US, UK and Canada. To demonstrate the power of NLTK, we needed a dataset that would be interesting to researchers from multiple backgrounds and were fortunate to be approached by the University of Melbourne Archives, who had digitised the Radio Electorate Talks from the Malcolm Fraser Archive.

The collection consists of a series of radio talks and press releases addressed to Mr Fraser’s electorate of Wannon, in rural Victoria. As the collection spans the thirty years of his parliamentary career, it makes a great test case for longitudinal analysis. For instance, topic modelling reveals his concerns changing from those of a rural backbencher discussing issues of farming, through his time as Minister for the Army (January 1966 - February 1968) and Minister for Defence (November 1969 - March 1971), in which the conflict in Vietnam looms large, to his eventual broader concerns as leader of the Liberal Party and Prime Minister (November 1975 - March 1983).

This graph shows changes in the modal verbs used in the corpus
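A count like the one behind that graph could be sketched as follows. This is a plain-Python illustration, not the code used to produce the figure: it assumes each speech is available as a (year, text) pair, and the three example speeches are invented. The modal list is the usual set of English modal verbs.

```python
import re
from collections import Counter, defaultdict

MODALS = ("can", "could", "may", "might", "must",
          "shall", "should", "will", "would")

def modal_counts_by_year(speeches):
    """Take (year, text) pairs and return {year: Counter of modal verbs}."""
    by_year = defaultdict(Counter)
    for year, text in speeches:
        words = re.findall(r"[a-z]+", text.lower())
        by_year[year].update(w for w in words if w in MODALS)
    return dict(by_year)

# Invented example speeches, for illustration only.
speeches = [
    (1956, "We must help farmers. We will build roads."),
    (1956, "Prices may fall."),
    (1970, "We should act and we will act."),
]

counts = modal_counts_by_year(speeches)
for year in sorted(counts):
    print(year, dict(counts[year]))
```

Note that a simple word match like this also counts "may" when it is the month and "will" when it is a noun, so the numbers are only a rough guide. NLTK's ConditionalFreqDist offers a similar year-by-word tabulation with built-in plotting.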

There were some challenges in working with the collection. The speeches all come with metadata sections, which contain information such as date, title and genre. However, the metadata isn’t completely consistent and contains variant terms such as Radio Talk, radio talk and radio talks. To a human reader, these are plainly the same thing, but to a computer, they’re distinct. This provides participants with a valuable demonstration of the importance of clean data. The text files themselves have been produced by OCR scanning typescripts of the speeches, so aren’t as clean as a researcher could wish, either. We talk in the workshop about some of the steps that can be taken to try to clean up text (such as spell-checking).
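The genre variants above can be collapsed with a few lines of Python. This is a minimal sketch, not the cleaning procedure used in the workshop: the function name is hypothetical, and stripping a trailing "s" is a deliberately crude way to handle plurals that would need checking against the real list of genre labels.

```python
def normalise_genre(label):
    """Collapse case, extra whitespace and a trailing plural 's' in a genre label."""
    cleaned = " ".join(label.lower().split())
    # Crude plural handling: drop one trailing 's' (but leave words like 'press').
    if cleaned.endswith("s") and not cleaned.endswith("ss"):
        cleaned = cleaned[:-1]
    return cleaned

variants = ["Radio Talk", "radio talk", "radio talks", " Radio  Talks "]
normalised = {normalise_genre(v) for v in variants}

print(normalised)
```

After normalisation, all four variants map to a single label, so grouping or filtering by genre treats them as the same thing.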

We got really interested in the speeches and how NLTK can be used to navigate a corpus. Research Bazaar’s Daniel McDonald and Lachlan Musicman built this site to present the speeches and allow searching by bigram and trigram (which are good ways to identify the key topics of a document) and by year and genre. The site also presents the OCR text alongside the scanned typescript, which allows for correction of the text. The site shows how NLTK can easily provide the basis for navigating an archive. All of this exploration is only possible, of course, because the University of Melbourne Archives made the documents available to be downloaded and explored as plain text as well as PDF. This willingness to digitise the collection, then open and share the results, opens up the possibilities for reuse and novel research.
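Searching by bigram or trigram amounts to indexing every run of two or three consecutive words and recording which documents contain it. The sketch below shows one way to do that in plain Python; it is not the site's actual implementation, and the document ids and texts are invented for illustration.

```python
import re
from collections import defaultdict

def ngrams(tokens, n):
    """Return the consecutive n-token tuples in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

def build_index(docs, n):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for gram in ngrams(tokens, n):
            index[gram].add(doc_id)
    return index

# Invented documents, for illustration only.
docs = {
    "talk-a": "The wool industry faces falling prices.",
    "talk-b": "National service and the war in Vietnam.",
    "talk-c": "Support for the wool industry continues.",
}

bigram_index = build_index(docs, 2)
hits = sorted(bigram_index[("wool", "industry")])
print(hits)
```

Passing n=3 to build_index gives a trigram index in the same way. NLTK provides nltk.bigrams and nltk.trigrams for the n-gram step.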

This work was made possible by funding from the Australian National Data Service to develop training materials showcasing Australian research data. We are very grateful for this support.

    • #data carpentry
    • #nltk
    • #digital humanities