@PaulBradshaw, Online Journalism Blog Birmingham City University and City University London
BBC, January 2015
Data Mining: Search, scraping, FOI and feeds
Image by Evan Long
Imagine the data: text
What text will it contain?
Where will that text be?
What text will it not contain?
Specific references, not general:
Specify a constituency… …a school
…an institution code …an invoice number …a piece of jargon
“disclosure log”
“between * and 2014”
“hate crime” -religion -"publication scheme"
Number ranges: 2000..2014
site:gov.uk site:nhs.uk
site:police.uk site:ac.uk site:org.uk
site:org site:birmingham.gov.uk site:met.police.uk/foi/
disclosure
filetype:xls filetype:xlsx filetype:pdf filetype:csv filetype:ppt filetype:doc
filetype:docx filetype:xml
Combine operators:
“disclosure log” site:gov.uk
allintitle:hate crime report filetype:pdf site:police.uk
art inurl:search.asp -library
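Combining operators by hand gets fiddly once there are several of them. The sketch below joins a phrase, a site, a file type and exclusions into one query string and encodes it as a search URL. The helper names `build_query` and `search_url` are made up for illustration, and the Google base URL is an assumption about where the search is run.

```python
from urllib.parse import urlencode


def build_query(phrase=None, words=(), site=None, filetype=None, exclude=()):
    """Join search operators into one query string."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')           # exact phrase
    parts.extend(words)                       # plain keywords
    if site:
        parts.append(f"site:{site}")          # restrict to a domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict to a file type
    for term in exclude:
        # Quote multi-word exclusions so the minus covers the whole phrase.
        parts.append(f'-"{term}"' if " " in term else f"-{term}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Encode the query so it can be pasted into a browser (base is assumed)."""
    return base + "?" + urlencode({"q": query})


query = build_query(phrase="disclosure log", site="gov.uk", filetype="pdf")
print(query)
print(search_url(query))
```

Keeping the query as plain text until the last step makes it easy to paste the same string straight into a search box.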
§
Do it now:
Search for a disclosure log for a CCG
Search for spreadsheets mentioning Andrew Mitchell MP
Audits and transparency data
Parliamentary questions
Reports, research, sources
FOI requests, disclosure logs
Press offices
Public data and databases - scraping
Open data initiatives & activism (TWFY)
Hackdays e.g. Rewired State
Crowdsourcing or surveys
Social networks
Key sources
NOMIS, ONS, Data.gov.uk
HES, NHSIC indicator portal
Data.Police.uk
HEFCE, HESA, Ofsted, UCAS
fullfact.org/finder
§
Do it now:
Set up Change Detection for the CCG disclosure log
Set up email alerts for publications on Data.gov.uk
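The general idea behind a page-change alert can be sketched in a few lines: keep a fingerprint of the page text and compare it with a fresh copy later. This is a minimal illustration of the principle, not a description of how any particular alert service works. The page contents and the names `fingerprint` and `has_changed` are hypothetical.

```python
import hashlib


def fingerprint(page_text):
    """Return a stable fingerprint of a page's text."""
    # Collapse whitespace so trivial layout changes are not reported.
    normalised = " ".join(page_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, page_text):
    """True when the page text no longer matches the stored fingerprint."""
    return fingerprint(page_text) != previous_fingerprint


# Hypothetical page contents, standing in for two downloads of the same log.
monday = "Disclosure log\nFOI 001: Parking fines"
tuesday = "Disclosure log\nFOI 001: Parking fines\nFOI 002: Agency staff spend"

stored = fingerprint(monday)
print(has_changed(stored, monday))   # False
print(has_changed(stored, tuesday))  # True
```

Storing only the fingerprint says that something changed, not what changed. Keeping the previous text as well would allow a line-by-line comparison.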
http://www.panopticonblog.com/2014/08/01/section-11-foia-and-the-form-of-a-request/
http://www.bailii.org/ew/cases/EWCA/Civ/2014/1086.html
As per the judgement in Innes v Information Commissioner [2014] EWCA Civ 1086 I would like to request the data in spreadsheet format…
§
Do it now:
Draft an FOI request for a local body’s data dictionary
Use WhatDoTheyKnow (so others googling codes can find you)
What is scraping?
Automating the repetitive gathering of data, e.g.
Multiple tables in one page
Webpage tables
Multiple spreadsheets
Multiple PDFs
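As one possible illustration of the "multiple tables in one page" case, the sketch below uses only Python's standard library to collect every table on a page as rows of cell text. The HTML is invented sample data, the class name `TableScraper` is hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect every table on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Invented sample page with two tables sharing the same columns.
html = """
<table><tr><th>School</th><th>Pupils</th></tr>
<tr><td>Example Primary</td><td>210</td></tr></table>
<table><tr><th>School</th><th>Pupils</th></tr>
<tr><td>Example Secondary</td><td>980</td></tr></table>
"""

scraper = TableScraper()
scraper.feed(html)
for table in scraper.tables:
    print(table)
```

Because each table comes back as a plain list of rows, tables with the same columns can be stacked into a single dataset.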
http://blogs.ft.com/ftdata/2014/06/11/interactive-explore-the-statistical-identity-of-every-team-at-the-world-cup/?
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
Tip: empty search
Tip: search for structure around data
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
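One way to see the structure in a URL like the one above is to split it into its parts. The sketch below uses the standard library to separate the address from the query string and to rebuild the URL with a changed parameter. The `page` parameter and the helper `with_params` are hypothetical, added only to show the pattern. The site may not accept such a parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

parts = urlsplit(url)
print(parts.netloc)           # www.w4mpjobs.org
print(parts.path)             # /SearchJobs.aspx
print(parse_qs(parts.query))  # {'search': ['alljobs']}


def with_params(url, **params):
    """Return the same URL with the given query parameters added or replaced."""
    parts = urlsplit(url)
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


# "page" is an invented parameter, used here only to illustrate the idea.
print(with_params(url, page=2))
```

When a URL changes predictably from one page of results to the next, a loop over those values is often all a scraper needs.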
§
Do it now:
Identify a website which has multiple pages or documents containing data you could combine
Where’s the structure? Table? URL? Links?
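Once several documents with the same columns have been gathered, combining them is mostly bookkeeping. The sketch below merges CSV text from several sources and records where each row came from. The file names, suppliers and amounts are invented, and `combine_csv` is a hypothetical helper.

```python
import csv
import io


def combine_csv(sources):
    """Merge CSV documents that share a header into one list of rows.

    `sources` maps a label (e.g. a file name or URL) to CSV text.
    """
    combined = []
    for label, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = label  # remember where each row came from
            combined.append(row)
    return combined


# Invented monthly spending files, standing in for downloaded documents.
sources = {
    "jan.csv": "supplier,amount\nAcme Ltd,1200\nExample plc,300\n",
    "feb.csv": "supplier,amount\nAcme Ltd,950\n",
}

rows = combine_csv(sources)
print(len(rows))                            # 3
print(sum(int(r["amount"]) for r in rows))  # 2450
```

Keeping a `source` column makes it possible to trace any figure in the combined data back to the document it came from.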
§
1. Search: describe the data
2. Feeds: get regular updates
3. FOI: request detail, in CSV format
4. Scraping: look for structure and repetition
Thank you.
@PaulBradshaw, Online Journalism Blog, HelpMeInvestigate Birmingham City University and City University London
BBC Future Day, September 2014