Finding data BBC 15

@PaulBradshaw, Online Journalism Blog Birmingham City University and City University London

BBC, January 2015

Data Mining: Search, scraping, FOI and feeds

Image by Evan Long

1. Search tips and tools
2. Sources and feeds
3. Data requests
4. Scraping

1. Search tips and tools

Don’t ask for what you want: describe what you expect to find

Search operators

Imagine the data: text
What text will it contain?
Where will that text be?
What text will it not contain?

Specific references, not general:

Specify a constituency…
…a school
…an institution code
…an invoice number
…a piece of jargon

"" exact phrase
- exclude a word
* wildcard (any word)
.. number range

"disclosure log" "between * and 2014" "hate crime" -religion -"publication scheme"

Number ranges: 2000..2014

‘life expectancy Birmingham’

"life expectancy" "perry barr"

inurl:

inurl:foi inurl:ccg

inurl:intranet inurl:search.asp inurl:search.php

intitle: allintitle:

intitle:foi allintitle:disclosure log

intitle:"bank fines"

intext: allintext:

intext:"miserable failure" allintext:miserable failure

"life expectancy" "perry barr"

"life expectancy" "perry barr" filetype:xls

"life expectancy" "perry barr" filetype:xls site:ons.gov.uk

"life expectancy" "perry barr" filetype:xls site:ons.gov.uk 2009..2014

"life expectancy" "perry barr" filetype:xls site:ons.gov.uk 2009..2014 -winter

Imagine the data: metadata
Where is it likely to be?
What format?
When was it not published?

site:

site:gov.uk site:nhs.uk

site:police.uk site:ac.uk site:org.uk

site:org site:birmingham.gov.uk site:met.police.uk/foi/ disclosure

filetype:

filetype:xls filetype:xlsx filetype:pdf filetype:csv filetype:ppt filetype:doc

filetype:docx filetype:xml

search tools

Combine operators:
"disclosure log" site:gov.uk
allintitle:hate crime report filetype:pdf site:police.uk
art inurl:search.asp -library
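Combining operators can also be sketched in code. The following is a minimal illustration, assuming a hypothetical helper called `build_query` (its name and parameters are invented here, not part of the talk); the operators and the example values are the ones used on the slides.

```python
from urllib.parse import quote_plus

def build_query(phrases, filetype=None, site=None, year_range=None, exclude=()):
    """Assemble one search string from exact phrases and optional operators."""
    parts = [f'"{p}"' for p in phrases]            # "" exact phrase
    if filetype:
        parts.append(f"filetype:{filetype}")       # restrict to a file format
    if site:
        parts.append(f"site:{site}")               # restrict to a domain
    if year_range:
        start, end = year_range
        parts.append(f"{start}..{end}")            # .. number range
    parts.extend(f"-{word}" for word in exclude)   # - exclude a word
    return " ".join(parts)

query = build_query(
    ["life expectancy", "perry barr"],
    filetype="xls",
    site="ons.gov.uk",
    year_range=(2009, 2014),
    exclude=["winter"],
)
print(query)
print(quote_plus(query))  # URL-encoded form, suitable for a query string
```

Each argument maps to one question from the "Imagine the data" slides: what text, what format, where, and what to leave out.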

research.google.com

zanran.com

§

Do it now:
Search for a disclosure log for a CCG
Search for spreadsheets mentioning Andrew Mitchell MP

2. Sources and feeds

Audits and transparency data
Parliamentary questions
Reports, research, sources
FOI requests, disclosure logs
Press offices
Public data and databases - scraping
Open data initiatives & activism (TWFY)
Hackdays e.g. Rewired State
Crowdsourcing or surveys
Social networks

Key sources
NOMIS, ONS, Data.gov.uk
HES, NHSIC indicator portal
Data.Police.uk
HEFCE, HESA, Ofsted, UCAS
fullfact.org/finder

§

Do it now:
Set up Change Detection for the CCG disclosure log
Set up email alerts for publications on Data.gov.uk
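The idea behind a change-detection alert is to compare a stored fingerprint of a page with a fresh one. The sketch below illustrates that idea only. It does not describe how any particular service works, and the function names and sample text are hypothetical.

```python
import hashlib

def fingerprint(page_text):
    """Return a stable fingerprint (SHA-256 hex digest) of a page's text."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint, page_text):
    """True if the page text no longer matches the stored fingerprint."""
    return fingerprint(page_text) != previous_fingerprint

# Hypothetical snapshots of a disclosure log page on two different days.
monday = "Disclosure log: 12 responses"
tuesday = "Disclosure log: 13 responses"

stored = fingerprint(monday)
print(has_changed(stored, monday))   # False
print(has_changed(stored, tuesday))  # True
```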

3. Data requests

http://www.panopticonblog.com/2014/08/01/section-11-foia-and-the-form-of-a-request/
http://www.bailii.org/ew/cases/EWCA/Civ/2014/1086.html

As per the judgement in Innes v Information Commissioner [2014] EWCA Civ 1086 I would like to request the data in spreadsheet format…

§

Do it now:
Draft an FOI request for a local body’s data dictionary
Use WhatDoTheyKnow (so others googling codes can find you)

4. Scraping

What is scraping?
Automating the repetitive gathering of data, e.g.
Multiple tables in one page
Webpage tables
Multiple spreadsheets
Multiple PDFs
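As a rough illustration of the "multiple spreadsheets" case, the sketch below loops over several CSV files and stacks their rows into one list. The file names, column names and values are hypothetical stand-ins, held in memory so the example runs on its own.

```python
import csv
import io

# Hypothetical stand-ins for several downloaded spreadsheets (CSV exports).
sources = {
    "2013.csv": "area,value\nArea A,10\nArea B,12\n",
    "2014.csv": "area,value\nArea A,11\nArea B,13\n",
}

def combine(named_csvs):
    """Read each CSV in turn and stack the rows, recording their origin."""
    combined = []
    for name, text in named_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = name  # keep track of where each row came from
            combined.append(row)
    return combined

rows = combine(sources)
print(len(rows))
print(rows[0])
```

The repetitive part (open, read, append) is written once and repeated for every file, which is the point of automating it.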

https://www.youtube.com/watch?v=Efr-VEkwWoM

http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?

http://www.mirror.co.uk/news/uk-news/singer-best-vocal-range-uk-4323076

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

Tip: empty search

How to scrape?
Basic tables: WYSIWYG tools
Google Sheets functions
Programming: Scraperwiki

Paul Bradshaw, Leanpub.com/scrapingforjournalists

<plug>

Function (Arguments) (aka parameters)

Query (XPath)

Tip: search for structure around data

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

http://www.w4mpjobs.org/SearchJobs.aspx?

search=alljobs

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
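The split between the page address and the `search=alljobs` part can be made explicit with Python's standard `urllib.parse` module. This is a minimal sketch. The replacement value `internships` is a hypothetical example, and the site may not accept it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

parts = urlsplit(url)

# Everything before the "?": the page that runs the search.
base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Everything after the "?": the name=value pairs sent to that page.
params = parse_qs(parts.query)

print(base)
print(params)

# Rebuild the URL with a different value (hypothetical).
new_url = base + "?" + urlencode({"search": "internships"})
print(new_url)
```

Spotting which part of a URL changes between pages is often the first piece of structure a scraper can use.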

"//div[@class='leftcolumn']"

//div[starts-with(@class, 'jobWrap')]
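To see what these two queries select, here is a small sketch using Python's standard `xml.etree.ElementTree`. The markup is a hypothetical, simplified stand-in, and the real page's structure may differ. ElementTree supports only a subset of XPath and has no `starts-with()`, so that query is imitated with a Python string test. In Google Sheets, a query like this would typically be passed as an argument to a function such as IMPORTXML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real site.
html = """
<html><body>
  <div class="leftcolumn">
    <div class="jobWrap odd"><a href="/job1">Researcher</a></div>
    <div class="jobWrap even"><a href="/job2">Caseworker</a></div>
    <div class="footer">About</div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(html)

# Equivalent of //div[@class='leftcolumn']: an exact match on the attribute.
left = root.findall(".//div[@class='leftcolumn']")

# Imitation of //div[starts-with(@class, 'jobWrap')]: the class value
# only has to begin with "jobWrap", so "jobWrap odd" and "jobWrap even" match.
jobs = [d for d in root.iter("div") if d.get("class", "").startswith("jobWrap")]
titles = [d.find("a").text for d in jobs]

print(len(left), titles)  # 1 ['Researcher', 'Caseworker']
```

The `starts-with` form is useful when the class name varies slightly from one repeated element to the next.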

A crib sheet:

Paul Bradshaw, Leanpub.com/scrapingforjournalists

Scraping tools

Chrome extension:

OutWit Hub

§

Do it now:
Identify a website which has multiple pages or documents containing data you could combine
Where’s the structure? Table? URL? Links?

§

1. Search: describe the data
2. Feeds: get regular updates
3. FOI: request detail, in CSV format
4. Scraping: look for structure and repetition

Thank you.

Image by Evan Long

@PaulBradshaw, Online Journalism Blog, HelpMeInvestigate Birmingham City University and City University London

BBC Future Day, September 2014