A busy week in data mining (Part 3: data cleaning)
By Kim Doyle
We started our fourth day of data mining with an introduction to messy data: why it ruins our lives and haunts our dreams… But it doesn’t have to be this way!

Yuandra introduced us to some basics of clean data. He used this handy diagram that he got from the School of Data to show how important cleaning is to the data workflow:

As you can see, cleaning data is an integral part of working with data. Cleaning before analysis is vital because messy data is not only difficult to work with but can also distort your findings. Accurate analysis requires clean data. This might sound like the most boring part of the data cycle, but there are lots of reasons to get excited about clean data.
There are a bunch of free and exciting tools that can visualise your data in informative, interesting and beautiful ways, such as CartoDB for mapping and Plot.ly for graphs, and many, many more (speak to our awesome DataViz team for more tips to beautify your data). If you have clean data, it can be as simple as drag and drop and you’re done!
Here’s an example from our CartoDB training:
(mapping lobsters off the coast of Tasmania)
Anyway, getting back to the workshop, after Yuandra’s introduction it was my time to shine, teaching OpenRefine :S
Here is Naomi getting her own back on Twitter.
My turn to document @kim_doyle1 dazzling us on the stage. @iniandra pic.twitter.com/IGp0m0imFn
— Naomi Sutanto (@nomisutanto) April 8, 2016
Formerly Google Refine, OpenRefine is a program that helps us clean up our data. It began life as a ‘blue sky’ project of Google employees, but has since blossomed into a separate open-source project. It is free to use and well documented. See our course curriculum here.
We ran our course off the cloud.
Look at the pretty cloud…

But you don’t have to. If you’d like to give OpenRefine a try, download it to your own computer. And don’t worry about your data: it runs on a local server on your own machine, so your super-secret research is safe (sshhhh!).
We got started with some open government data from the US and learnt how to create a project, import our data and do some basic mass edits of mislabelled data. Then I handed over to the capable Naomi to run us through some fundamentals of the General Refine Expression Language (GREL).
She explained that GREL transformations are similar to formulas in Excel, except that, rather than being stored in a single cell, a GREL transformation is applied to the entire dataset at once. In Refine, GREL lets you write powerful yet simple expressions to filter and transform your data.
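If you haven’t met GREL before, a transformation is just a small expression, such as `value.trim().toTitlecase()`, that Refine evaluates against every cell in a column. Here’s a rough Python sketch of that idea (the column name, data and helper function are invented for illustration, not from our workshop dataset):

```python
# A Python analogue of a GREL column transform: one expression,
# applied to every cell in a column at once.

def transform_column(rows, column, expr):
    """Apply expr (a function of the cell value) to every row's cell."""
    for row in rows:
        row[column] = expr(row[column])
    return rows

# A messy "Country" column with stray whitespace and inconsistent case,
# the kind of thing value.trim().toTitlecase() fixes in Refine.
data = [
    {"Country": "  australia "},
    {"Country": "AUSTRALIA"},
    {"Country": "Australia"},
]

cleaned = transform_column(data, "Country", lambda v: v.strip().title())
print([row["Country"] for row in cleaned])  # every cell now reads "Australia"
```

The point of the Excel comparison is exactly this: you write the expression once, and Refine does the looping for you.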
Next, Naomi handed over to Yuandra in our data mining tag-team act and we finished up by learning how to filter, cluster and transpose cells in Refine.
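Clustering is worth a closer look, because it feels like magic the first time you see it: Refine groups cells that are probably the same thing spelled differently. Its default ‘fingerprint’ method boils each value down to a key (trim, lowercase, drop punctuation, sort the unique words) and clusters values that share a key. A simplified Python sketch, with invented example values:

```python
import string

def fingerprint(value):
    """A simplified OpenRefine-style fingerprint key: trim, lowercase,
    strip punctuation, then sort the unique words and rejoin them."""
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint key; keep groups of 2+."""
    groups = {}
    for v in values:
        groups.setdefault(fingerprint(v), []).append(v)
    return [g for g in groups.values() if len(g) > 1]

names = [
    "University of Melbourne",
    "university of Melbourne.",
    "Melbourne, University of",
    "RMIT University",
]
print(cluster(names))  # the first three land in one cluster; RMIT stands alone
```

In Refine you then pick the spelling you want and merge the whole cluster in one click, which is how those mass edits of mislabelled data get done so quickly.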
We finished the week on a high and most of us made it through ‘OK’.
Not sure what’s going on here, but it looks serious 😁 @nomisutanto @iniandra #DataMiner pic.twitter.com/Pqe6nJArPL
— Kim Doyle (@kim_doyle1) April 8, 2016
Previous: Part 2: data acquisition
