How to do an easy data piece using online tables

There’s lots of really great, in-depth, technically impressive data journalism work out there, but often when you’re working in a newsroom as a regular reporter, you might not have time (or the skills) to mess around with OutwitHub or R.

Plus, a lot of this ‘easy data’ that you’ll find online is in the perpetually annoying HTML table. This one, from the Trussell Trust, about the number of people using food banks in different regions, is a good example.

truseell

It looks fine on awebpage, but if you want to get this data and make a story out of it, it can be a pain in the neck. Sometimes just cutting and pasting won’t work, and you’ll end up with everything in a single column. But by doing a simple HTML import, you can cut this out. HTML imports also make it much quicker to get a lot of large tables into a spreadsheet.

For this tutorial, we’ll use this table from planecrashinfo.com, which is one that I used to do my last data story on this blog. I’ll be using Google sheets for this post.

The basic formula you need to import tables into Google sheets is this:

=importhtml(“YOURURL”,”table”,the table you want from the webpage)

In this case, we want the table on the 2014 page, so the formula is:

=importhtml(“http://www.planecrashinfo.com/2014/2014″,”table”,1)

Since there’s only one table on this page, sticking either ‘0’ or ‘1’ at the end of that formula will work. If you want a table from a page with multiple tables, you need to make that number correspond to where it is on the page. If you need the third table down, make it ‘3’. Easy.

Once you’ve put your formula in and hit ‘enter’, the page should look like this:

data1

Note that row 2 looks blank – in fact, the text in those cells has been changed to white, so you can’t see them. It’s a weird thing that sometimes happens when you import tables. It’s easy to change the text colour back to black, but make sure you don’t delete these blank-looking rows without checking whether they’re filled, as this could make your final dataset incorrect.

Now you’ve got the table in your spreadsheet, and you’re finally ready to have your wicked way with it.

For my plane crash story, I wanted to know the number of crashes that occured in each year. But the date format in the ‘Date’ column is in MM/DD/YYYY format. By formatting the cells to make them recognise these values are dates, you could filter the cells to find out the year.

Ultimately it’s much quicker just to find the date by taking the last four digits in the cells – they’re consistently formatted, so this is a viable way to isolate the year.

Make a new column on the right called ‘cleaned year’ or something like that, because this will contain the data in a cleaner format. We’ll use a formula to get the last four digits of the date cell, giving us the year.

data2

The formula is: =Direction you want your figures from(the cell you want them from, how many figures you want)

In this case, it’s =RIGHT(A2,4) – because we want the last 4 figures from the right of cell A2.

This gives us a simple year in our new column. Double-click the small square in the bottom right of this cell to copy down the formula in each row, giving you the cleaned year for each row in your dataset.

data3

Now, if you want more tables – for example, the crashes for all years going back to 1970, you just need to copy the first part of this process. In this example, we’ll just do another one, for 2013.

Planecrashinfo’s data for each year is annoying split into different pages, but the process is the same. Go the bottom of your sheet, and in the row under the last value, put that =importhtml formula in again, but with the correct domain.

data4

Don’t worry about having to do the cleaned year formula again – by just double-clicking on the square in the bottom right of the last cleaned year cell, it’ll copy down for the new rows and give you the correct year, as shown above.

Keep putting these new =importhtml formulas in for each page you want to do. If all the tables are on a single page, you’ll just have to change the last number in the formula. This can be slow, but it’s easy to build a big dataset very quickly – for the plane crash piece, I had 3,206 rows of data, which I got into the sheet in around 20 minutes.

Now, we want to find how many plane crashes were in each year. To do this, we’ll use a pivot table.

Highlight all your data by pressing Control+A, click the ‘Data’ button at the top, and then click ‘Pivot table’.

data5

This will open a new sheet you can find in the tabs at the bottom of the spreadsheet, where you can build your pivot table.

For this table, we want rows containing the years, with the number of crashes in each next in the column next to them. In the ‘Report editor’ bar on the right, click ‘rows’ and select your cleaned year column from the other sheet.

data6

This will bring up the years in rows in the pivot table (there’s only two in this demo).

data7

Don’t worry about that “ate*” row – that’s just left over from the title rows in the original sheet when we imported the new year’s figures below the old one. There’s no numbers in this, and you can just ignore it. It won’t affect your final figures.

Now, to get the number of crashes in each year, go back to the ‘Report editor’ bar, and click values. This will give you the choice of columns again. Basically, by using the COUNTUNIQUE function, the sheet will give us the number of each unique value.

The aircraft registration should be the closest to a unique value we can get – usually, the same plane doesn’t crash twice. So click ‘Aircraft type/Registration’, and then in the ‘Summarize by’ field in the Report editor, click ‘COUNTUNIQUE’.

data8

This gives us a lovely list showing how many plane crashes there were in each year. Again, ignore the spare row at the bottom – it’s counted ‘1’ because there is a value in the cell it refers to, even if it’s not related to plane crashes. Just remember to subtract this number from the grand total if you need to know that.

data9

Great! Now we’ve got a simple table showing us how many plane crashes there were in each year, and we found it much quicker than by manually counting each row.

Now, here’s how to visualise it.

There’s a number of great tools to visualise simple data like this – but for speed and ease, Datawrapper if the best.

The free version doesn’t allow embedding – so if you want to put your final chart on a website, it’ll have to be a static image.

Fortunately, if you’re a student and you send them a nice email, they’ll sometimes give you a free account. And if you work in an office or for a large company, you can ask them to pay.

For this part of the tutorial, we’ll use my full dataset, with figures going back to 1970 – just to make the final graph more interesting.

You can upload the file to Datawrapper, but for simple tables like this, it’s easier to copy and paste. Just go to Datawrapper, hit ‘new chart’, and paste your values into the box.

data10

Hit ‘Upload and continue’ and you can build your chart. Remember to add a credit for your data source with a link, if possible.

A line graph is probably the best way to show change over time, but Datawrapper has a bunch of options and is fairly easy to use – just play around with the settings to get to grips with it.

One important thing to remember before you finish is to check the ‘fill area below line’ box in the ‘refine’ tab when you’re in the visualisation stage. This shades the area below your line, but also sets the y-axis to zero – this is important, because when the y-axis starts with your lowest value, it makes the changes look more dramatic and can be misleading.

It’s a classic data journalism sin and if you don’t set the y-axis to zero, you’ll be laughed and probably spat at in the street by data journalists.

data12

Here, you can see both versions next to each other – on the top, the y-axis is set to the lowest value in the dataset. On the bottom, it’s set to zero. See how much more dramatic the change looks in the top graph?

Staying mindful of factors like this, that may skew how your final piece is interpreted, are very important – understanding them is a big part of being a good data journalist.

That’s it! Obviously this tutorial relates to a very specific story, but the skills you’ve learned here can be transferred to other simple data stories. Give it a go, and try doing a story of your own using similar techniques.

Easy data journalism like this is a great thing to get good at, especially if you’re working at a newspaper – the fact is, it’s easy to do, there’s loads of interesting data all over the internet, and not many people at newspapers know how to do even the simple stuff.

Practice with simple data techniques like this and you’ll be able to do many more stories, and you’ll impress people in the process.

If there’s any way I could improve this tutorial, or if there’s anything I should bear in mind for the future, then be sure to tell me in the comments below.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s