Data Visualization in the Newsroom

{

“presented by”: “carl v. lewis”,

“for”: “the florida times-union”,

“slides”: “bit.ly/NIXkOD”,

“email”:“carl@carlvlewis.net”

}

What is data visualization?

•Data itself is the story; standalone narrative.

•Interactive, communicative, visual.

•Ranges from simple (charts) to complex (database-driven applications).

•Both a technique and a format.

•Both entertaining and factual.

• See: “The Many Words for Visualization”

The history of data journalism

•Grew out of the CAR (computer-assisted reporting) tradition

•John Snow’s 1854 cholera map

•Has coincided with the era of “Big Data”

On the emergence of the field of data journalism:

•"When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important." –Philip Meyer, UNC Chapel Hill

On the growing importance of data-driven journalism:

•“Journalists need to be data-savvy . . . Data-driven journalism is the future.” –Sir Tim Berners-Lee

•“The explosion of Web-based tools and ways of sifting through and sharing data has created something approaching a revolution, and the potential benefits for journalism are only just beginning to reveal themselves.” –Mathew Ingram

What data journalism is not:

• Simply incorporating public data into your textual narrative

• Infographics

• Illustration

• Resource-intensive

• Just about numbers and programming

• Just about making data flashy

What data journalism is:

• Visual

• Often evergreen

• Transparent – direct access to primary source

• Credible

• Engaging

• A good business model

Hans Rosling

http://www.youtube.com/watch?v=jbkSRLYSojo

Democratization of data journalism

• Free and open-source tools (Google Drive, JavaScript libraries, etc.).

• Open Data laws.

• “Anyone can do it. Data journalism is the new punk.” -Simon Rogers, The Guardian

The job of the data journalist

• Part statistician, part journalist, part programmer.

• “We're statisticians. We don't program.”

• “We’re programmers. We don’t report.”

• “We’re journalists. We don’t code.”

Notable examples of data visualization

• “Mapping America: Every City, Every Block,” NYTimes.com.

• “Where Does My Money Go?”, Open Knowledge Foundation.

• “Illinois school report cards,” Chicago Tribune

• “We Feel Fine,” Jonathan Harris

• “Top Secret America,” The Washington Post

News organizations to follow for innovative data projects

What are your favorite visualizations?

When to use data visualization:

• Show change over time

• Comparing discrete values

• Showing connections and flows

• Showing hierarchy

• Browsing large databases

When not to use data visualization:

• When text or multimedia tells story better

• When you have very few data points

• When there is no statistical significance

• When a map is not a map

• When a table would do

Process of data journalism

1. Research – Think of topic and research factors.

2. Find the data – Locate and retrieve relevant public data

3. Analysis and evaluation – Crunch numbers, look for trends or inconsistencies

4. Visualize – Display the data in appropriate manner

I. Mining public data

Research and retrieval

Research

1. Think of a topic – what factors influence it?

2. What public data might shed light on those factors?

3. Seek out the data

Locating public data

• Thousands of public “data dumps” by government bodies and nonprofits.

• Most commonly in delimited spreadsheet format (look for .csv, .xls), sometimes in XML and JSON.

• For geographic data, look for .kml or .shp

• Can be found directly at source or by search engine keyword

Search tips for data retrieval

• If you don’t know which source holds your data, an initial Web search might help.

• After your keywords, type “filetype:XLS”, “filetype:CSV”, or whatever the extension is of the data you’re seeking, and you’ll see only files of that type from across the Web.

• If you get no results, try broadening your search term to locate sources that cover the general discipline (e.g., instead of “malaria deaths,” try “public health data”).

• Florida’s “Sunshine” law requires all state agencies to provide open access to public records, including data.

• Chapter 119 of Florida State Statutes mandates that “any records made or received by any public agency in the course of its official business are available for inspection, unless specifically exempted by the Florida Legislature.”

Florida public data sources

• Dozens of useful open data sources maintained by Florida government agencies, including TransparencyFlorida.gov, FloridaHasARightToKnow.com and MyFlorida.gov

• Full list of state-maintained databases by topic here.

• A few state-maintained databases worth mentioning: the Division of Elections’ campaign finance data, the DOE’s test score reports and the Department of Law Enforcement’s arrest and officer reports.

Florida public data sources

• A number of advocacy groups also maintain useful, downloadable statewide databases:

• FloridaOpenGov.org, which focuses on public employee payroll data.

• FloridaRedistricting.org, which provides demographic data (.csv) and geographic polygons (.shp) for new district boundaries.

• Florida Housing Data Clearinghouse, which provides regularly updated property values, housing data (.xls).

(for even more, see my semi-exhaustive list with descriptions here).

http://www.duvalelections.com/content.aspx?id=235

Georgia public data sources

• Although Georgia has no law requiring all government agencies to make public data accessible online, many do anyway.

• In 2008, the Transparency in Government Act expanded the public data site, Open.Georgia.gov, to include all three branches of government, regional education service agencies, local boards of education, and transactions made by the General Assembly.

Georgia public data sources

• A comprehensive list of downloadable databases from state agencies in Georgia can be found here.

• The State Ethics Committee has made all campaign finance reports, lobbyist reports and campaign contributions available in downloadable spreadsheets.

• OASIS provides a set of web-based tools to browse the Georgia Department of Public Health’s Data Warehouse, and download the data yourself if you wish.

Locating geographic data

• Most geographic data available as TIGER/Line Shapefile packages (archives containing .shp, .dbf, .prj, .xml, .shx) from U.S. Census Bureau.

• Google also hosts a directory of .kml files for most geographic boundaries here.

• Alternatively, Florida and Georgia GIS data can be found at FGDL.org, Geoplan and Data.GeorgiaSpatial.org.

What to look for

• Most numeric spreadsheet data comes either as a comma-separated values (.csv) or Microsoft Excel (.xls) file. Example of .csv structure:

"Name","Date","Address","Zip","State","Country"

• XML (eXtensible Markup Language) stores data hierarchically for the Web, and is good for building news applications because of its broad interoperability.

<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>

• JSON (JavaScript Object Notation) – Similar to XML in structure, but has a “lighter” punctuation, based on JavaScript conventions. May eventually replace XML as standard.

{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}

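A minimal Python sketch of reading the three formats described above, using only the standard library. The CSV row is invented for illustration, and the XML and JSON strings are shortened versions of the menu examples; the function names are assumptions made for this sketch.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Illustrative samples only; the CSV record is invented for this sketch.
CSV_TEXT = '"Name","Date","Zip"\n"Jane Doe","2012-07-01","32202"\n'
XML_TEXT = (
    '<menu id="file" value="File"><popup>'
    '<menuitem value="New" onclick="CreateNewDoc()" />'
    '<menuitem value="Open" onclick="OpenDoc()" />'
    '</popup></menu>'
)
JSON_TEXT = (
    '{"menu": {"id": "file", "value": "File", "popup": {"menuitem": ['
    '{"value": "New", "onclick": "CreateNewDoc()"},'
    '{"value": "Open", "onclick": "OpenDoc()"}]}}}'
)


def read_csv_rows(text):
    """Return each CSV record as a dict keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def xml_menu_values(text):
    """Return the value attribute of every <menuitem> element."""
    root = ET.fromstring(text)
    return [item.get("value") for item in root.iter("menuitem")]


def json_menu_values(text):
    """Return the same values from the equivalent JSON document."""
    data = json.loads(text)
    return [item["value"] for item in data["menu"]["popup"]["menuitem"]]


print(read_csv_rows(CSV_TEXT))
print(xml_menu_values(XML_TEXT))
print(json_menu_values(JSON_TEXT))
```

The XML and JSON readers return the same list, which is the sense in which the two formats are similar in structure.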
Scraping other sources

• Scrape data from an HTML table with a simple Google spreadsheet formula: =ImportHtml("http://the-url-goes-here", "table", 1), where the last argument is the table’s position on the page (counting from 1).

• For database of HTML tables, try Haystax.

• For PDFs, try CometDocs.

• Scrape webpages by running or creating Python script at ScraperWiki.
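For a sense of what a table-scraping script does, here is a rough Python sketch that pulls the cells out of an HTML table using only the standard library. It is not the ScraperWiki API; the class name, the helper function and the sample table (including its figures) are all invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def scrape_tables(html):
    """Return all tables found in the HTML as nested lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.tables


# Invented sample page; in practice the HTML would be downloaded first.
SAMPLE_HTML = (
    "<table><tr><th>County</th><th>Turnout</th></tr>"
    "<tr><td>Duval</td><td>61%</td></tr></table>"
)

print(scrape_tables(SAMPLE_HTML)[0])
```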

APIs for data retrieval

• APIs (application programming interfaces) are how many websites and services share content with one another.

• Allows a computer system to fetch, interpret and use data created on another system, even if it used a different programming language or structure.

• Examples: Twitter Search API, Google Maps API, NYTimes Campaign Finance API.

• Usually returns data as XML, JSON or .txt

• Often requires use of an API key.
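The request-and-parse cycle can be sketched in Python as follows. The endpoint, the api_key parameter name, the key itself and the sample response are all hypothetical; real APIs document their own URLs and field names. The network call is left out so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and key, used only for illustration.
BASE_URL = "https://api.example.com/v1/contributions.json"
API_KEY = "YOUR-API-KEY"


def build_request_url(base_url, api_key, **params):
    """Append the query parameters and the API key to the endpoint URL."""
    query = dict(params)
    query["api_key"] = api_key  # parameter name is an assumption
    return base_url + "?" + urlencode(query)


def parse_response(body):
    """Decode a JSON response body and return its list of records."""
    return json.loads(body)["results"]


# Invented response body standing in for what the server would send back.
SAMPLE_BODY = '{"status": "OK", "results": [{"candidate": "A", "total": 1250.0}]}'

url = build_request_url(BASE_URL, API_KEY, cycle=2012, state="FL")
records = parse_response(SAMPLE_BODY)
print(url)
print(records[0]["total"])
```

In a live script, the body would come from fetching the URL, for example with urllib.request.urlopen.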

II. Analyzing and refining public data

Manipulating datasets

• Data rarely ready for analysis and visualization out-of-the-box (hence “raw data”).

• Spreadsheet applications most common and easiest way to work with data (Excel, Google Spreadsheets).

• Allow for complex calculations, formulas, sorting.

• Compatible with a variety of file formats (.xls, .ods, .csv, .txt, .tsv).

• Scripts may also be written to automate bulk manipulation (Python).

• R Project (r-project.org)
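As a small example of scripted bulk manipulation, the Python sketch below tidies a messy payroll-style CSV. The column names and records are invented, and the cleaning rules (collapse whitespace, title-case names, strip currency formatting) are just one plausible set.

```python
import csv
import io


def clean_rows(csv_text):
    """Trim whitespace, title-case names and turn dollar strings into floats."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned.append({
            # Collapse runs of whitespace, then normalize capitalization.
            "name": " ".join(row["name"].split()).title(),
            # Remove the dollar sign and thousands separators before converting.
            "salary": float(row["salary"].replace("$", "").replace(",", "")),
        })
    return cleaned


# Invented raw data with the kinds of inconsistencies common in public records.
RAW = 'name,salary\n"  jane   DOE ","$52,300.50"\njohn smith,"$48,000"\n'

for record in clean_rows(RAW):
    print(record)
```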

Data analysis

• To figure out what your data says, you’ll need to crunch the numbers.

• Statistical significance is the litmus test.

• Skewed or normal distribution? Why?

• Outliers? If so, error or unexplained factor?

Benchmarks for analysis

• Mean (μ) simplest to calculate, but susceptible to errors caused by outliers.

• Median usually a better metric in determining conclusion, especially with skewed distribution.

• If mean, median and mode coincide, the distribution is likely symmetric (little or no skew).

• Standard deviation (σ) measures how widely values are spread around the mean.

• Z-Score = how many standard deviations a value is away from the mean and, thus, its likelihood of being an outlier.

[Formula slides: standard deviation, mean, z-score]

Calculating values in Excel

• Mean: =AVERAGE(A1:A27)

• Median: =MEDIAN(A1:A27)

• Standard deviation: =STDEV(A1:A27)

• Z-score of a given value: Subtract the mean of the dataset from the value, then divide the result by the standard deviation, e.g. =(A1-AVERAGE($A$1:$A$27))/STDEV($A$1:$A$27)
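The same benchmarks can be computed outside a spreadsheet. This is a minimal Python sketch using the standard statistics module; the data values are invented, and statistics.stdev is used because, like Excel's STDEV, it computes the sample standard deviation.

```python
import statistics


def z_score(value, values):
    """Number of sample standard deviations that value lies from the mean."""
    return (value - statistics.mean(values)) / statistics.stdev(values)


# Invented data set for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("standard deviation:", statistics.stdev(data))
print("z-score of 9:", z_score(9, data))
```

A value whose z-score is far from zero is a candidate outlier worth checking against the source.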

Other commonly used Excel formulas

• Concatenate to merge multiple columns.

• MID to split columns.

• Percent change to display relative change over time =(new_value-original_value)/ABS(original_value)

• See this guide of helpful Excel tricks for data journalists, compiled by Mary-Jo Webster of St. Paul Pioneer Press: https://docs.google.com/file/d/0ByLyArAQRhaBNDc3NjJjYTUtY2U0Yi00NmIwLThkNTgtYzNlYThmNGE1ZTEz/edit
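For comparison, the percent change formula and a MID-style extraction look like this in Python. The function names are chosen for this sketch, and the sample values are invented.

```python
def percent_change(original, new):
    """Relative change from original to new, as a fraction (0.25 means +25%)."""
    if original == 0:
        raise ValueError("percent change is undefined when the original value is 0")
    return (new - original) / abs(original)


def mid(text, start, length):
    """Mimic Excel's MID: take length characters beginning at 1-based position start."""
    return text[start - 1:start - 1 + length]


print(percent_change(80, 100))
print(percent_change(-50, -25))
print(mid("32202-1234", 1, 5))
```

Dividing by the absolute value of the original keeps the sign of the result meaningful when the starting value is negative.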

Refining and cleaning data

• Sometimes Excel and Google Spreadsheets aren’t enough, especially when working with large datasets.

• Google Refine – free tool that lets you explore, power sort and process data.

• Useful for finding and fixing errors and inconsistencies, “power tool for working with messy data.”

• Facets to sort data

• Cleaning with clusters

• Shan Carter’s Mr. Data Converter to convert spreadsheets to more web-friendly format.
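To show the idea behind cleaning with clusters, here is a simplified Python sketch of key-collision clustering: each value is reduced to a normalized "fingerprint," and values that share a fingerprint are grouped as probable duplicates. It is loosely modeled on that general approach and is not Google Refine's own code; the agency names are invented.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so trivially different spellings share one key."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation, then sort the unique words.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Invented examples of inconsistent data entry.
NAMES = [
    "Jacksonville Sheriff's Office",
    "jacksonville sheriffs office",
    "Office, Jacksonville Sheriffs",
    "Duval County Schools",
]

for group in cluster(NAMES):
    print(group)
```

Each group still needs a human decision about which spelling to keep.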

Other data analysis tips and tricks

• Put field names in first row.

• Put geographic data in first columns

• When you have two different datasets, a good tool to merge them is Google Fusion Tables (make sure they share a common attribute).

• Never round until the end of calculations. Round to two decimal places for visualization purposes.

• Cut and paste calculations into a new column as values only.

• Know the principal data types (integer, real, string, boolean), and make sure numeric data is classified as either integer (whole numbers only) or real (any numeric value, including decimals).
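Two of these tips, merging datasets on a common attribute and rounding only at the end, can be sketched in Python. This is not how Fusion Tables works internally; the function name, county figures and field names are invented for illustration.

```python
def merge_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dicts on a shared field."""
    right_index = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is not None:  # rows without a partner are dropped
            merged.append({**row, **match})
    return merged


# Invented figures; "county" is the common attribute.
population = [
    {"county": "Duval", "population": 1200},
    {"county": "Clay", "population": 450},
    {"county": "Nassau", "population": 300},
]
arrests = [
    {"county": "Duval", "arrests": 37},
    {"county": "Clay", "arrests": 7},
]

rates = {}
for row in merge_on_key(population, arrests, "county"):
    # Keep full precision during the calculation; round only for display.
    rate = row["arrests"] / row["population"] * 1000
    rates[row["county"]] = round(rate, 2)

print(rates)
```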

III. Visualizing your data

Planning your visualization

• Identify your key message

• Choose the best data series to illustrate your point

• Consider the number of points in the data

• Think about complementary/supporting datasets you can incorporate, e.g. sanitation with poverty.

• Plan for user interaction, i.e. visual feedback.

• Transform the raw data where it makes your point clearer, e.g. absolute values vs. percent change

• Brainstorm potential technologies

• Consult experts on topic to back up your interpretation of data

Choosing the right type of visualization

• Change of single variable over time: line chart.

• Comparison of single variable among multiple classes: bar chart.

• Two variables: scatter plot, bubble chart.

• Hierarchical data: treemap, bubbletree.

• Area charts for area only

• Makeup of whole: pie chart.

• Distribution: histograms, box-and-whisker plots.

• Geographic data (point, polygon, choropleth and symbol maps).

• Records: searchable database.

• Chronological data: timeline, sparklines.

• Other possibilities: matrices, heatmaps, games, slopegraphs, stepper graphics.

Visualization design principles

• Typography: clear, consistent, not distracting.

• Use bold, mix of serif/sans-serif to provide emphasis.

• Don’t set type at an angle

• Color: Let color correspond to variable, design for accessibility, choose from same side of color wheel, consider cultural associations but avoid thematic palettes. Use Adobe Kuler or 0to255.com

• Visual overload, emotional design, skeuomorphism.

No white type on black background

No angled type

• Some guidelines for graphical integrity, according to Edward Tufte in The Visual Display of Quantitative Information:

1. Representation of numbers should be directly proportional to numerical qualities represented.

2. Clear, detailed labeling throughout.

3. Show data variation, not design variation.

4. Avoid excessive and unnecessary use of graphical effects

What Edward Tufte calls “the worst visualization ever published.”

Visualization design principles

• Design for the eye

• User should be able to discern key message visually.

• Design for interaction

• Highlighting and details on demand (example)

• User-driven content selection (example)

Visualization design principles

Awful

Bad, but better

Visualization design principles

Awful, but better

Not bad

Awful

Visualization design principles

What’s wrong with this infographic?

Visualization design principles

Wireframing/prototyping

• Follow a structured grid system (e.g., 12-column, 960px grid – see 960.gs and Subtraction).

• Very selectively, you can break the grid to emphasize a certain visual element.

• Sketch out/prototype your wireframe on paper first (print templates such as this)

Selecting tools/technologies

• A wealth of free, open-source data visualization tools and libraries exist to shorten development times

• Examples: Google Visualization API, Google Fusion Tables, Highcharts.js, CartoDB, d3.js, Tableau Public.

• For everything else, HTML5 + CSS + JavaScript

IV. Building a Web app

Web app anatomy

Three components of a Web app:

1. HTML (structure)

2. CSS (styles)

3. JavaScript (interactivity)

Parts of an HTML file

An HTML file is made up of:

1. Doctype declaration

2. Head <head>

3. CSS/JavaScript references

4. Title <title>

5. Body <body>

6. A Div container

7. Divs (IDs and classes)
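To make the parts listed above concrete, this Python sketch writes out a bare-bones page containing each of them. The file names style.css and app.js, the element IDs and the class name are placeholders chosen for illustration.

```python
from html import escape

# Each part of the page appears once: doctype, head, title,
# CSS/JavaScript references, body, a container div and a nested div.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <title>{title}</title>
  <link rel="stylesheet" href="{css_path}">
  <script src="{js_path}"></script>
</head>
<body>
  <div id="container">
    <div id="chart" class="panel"></div>
  </div>
</body>
</html>
"""


def build_page(title, css_path="style.css", js_path="app.js"):
    """Fill the skeleton with a page title and asset paths."""
    return PAGE_TEMPLATE.format(
        title=escape(title), css_path=css_path, js_path=js_path
    )


page = build_page("Voter turnout")
print(page)
```

An ID such as container should appear once per page, while a class such as panel can be reused on many divs.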

Parts of a CSS file

A CSS file is made up of:

1. Container ID

2. Default paragraph (p) style

3. Default H1,H2, etc. styles

4. Default body style

5. Styles for all divs

V. Maps

Maps 101

• Interactive maps combine geocoded data – points or polygons – with metadata and/or numeric data.

• KML (Keyhole Markup Language) is quickly becoming a popular file format, but the Shapefile (shp.zip) is still the most widely available.

• Geographic data can either be geocoded, downloaded from the Web, or custom-drawn.

• Good purveyor of news maps: The Texas Tribune.
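As a rough illustration of what geocoded point data looks like inside a KML file, the Python sketch below builds one Placemark per point with the standard library. The place names and coordinates are invented, and only the basic name and Point elements are used.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def points_to_kml(points):
    """Build a KML document with one Placemark per (name, lat, lon) tuple."""
    ET.register_namespace("", KML_NS)  # write KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in points:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML lists longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Invented points for illustration.
POINTS = [
    ("Polling place A", 30.33, -81.66),
    ("Polling place B", 30.29, -81.39),
]

kml_text = points_to_kml(POINTS)
print(kml_text)
```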

Mapping services and libraries

• Google Fusion Tables – Quick, versatile and classic maps that integrate seamlessly with the Google Maps JavaScript API.

• CartoDB – A newer open-source tool much like Fusion Tables, but with a better looking out-of-the-box experience.

• Leaflet – An open-source, client-side mapping library with an API that allows you to achieve a number of advanced features. Plays nicely with Fusion Tables and CartoDB-hosted maps. Part of CloudMade suite.

Handy desktop mapping software

• QGIS – Free program that supports almost every conceivable map file type, and allows you to add or manipulate vector data, which can then be exported as a KML or Shapefile package.

• TileMill – Map creation and styling software; ideal for those with little programming experience. UTF-grid enabled tilesets only.

Primary map types

• Choropleth – Colors for each geometry correspond to numeric values of a given variable.

• Point – Locations on a map displayed by geocoded markers.

• Less frequently: proportional maps and geo maps.

Choropleth map of Georgia voter turnout

Point map of Jacksonville polling locations

Tips and tricks

• If you have street address data, you can use BatchGeocode to convert it to lat-long coordinates.

• For choropleth maps:

• Include no more than five fill colors or “buckets”

• Don’t define an equidistant color ramp; use ColorBrewer instead.

• Use MarkerClusterer when there are too many points for certain zoom levels.
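One way to sort values into five buckets is by quantiles, so that each bucket holds roughly the same number of areas. The Python sketch below shows that approach with invented turnout figures; it is one classification method among several, and the fill colors themselves would still come from a tool such as ColorBrewer. It relies on statistics.quantiles, which requires Python 3.8 or later.

```python
import statistics
from bisect import bisect_right


def quantile_breaks(values, buckets=5):
    """Return the cut points that split values into equal-count buckets."""
    return statistics.quantiles(values, n=buckets, method="inclusive")


def bucket_index(value, breaks):
    """Return 0 for the lowest bucket, up to len(breaks) for the highest."""
    return bisect_right(breaks, value)


# Invented values, e.g. turnout percentages for ten counties.
values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
breaks = quantile_breaks(values)
assigned = [bucket_index(v, breaks) for v in values]

print(breaks)
print(assigned)
```

Each bucket index would then be mapped to one of the five fill colors.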

Using ColorBrewer to define an accurate, accessible color ramp.

Using MarkerClusterer to cluster points at further zoom levels.

Tips and tricks

• To convert Shapefiles so they can be imported into Fusion Tables, either use Shape to Fusion, or export it as KML from CartoDB.

• Before using the embed tool in Fusion Tables or CartoDB, make sure the map is centered where you want it.

• Ensure your map is set to “Public.”

Export a Shapefile as KML in CartoDB.

Making your map public in Fusion Tables

VI. Charts

Charts

• Basic building block of visualization

• Simple, but also easy to mess up.

• Should always be interactive.

• Should always include data source.

• Should always include a legend.

• Unless necessary, only show labels on mouseover.

Interactive charting tools

• Out-of-the-box: Google Drive charts, infogr.am.

• More advanced: Google Code Playground.

• Most agile: Highcharts.js.

• Most extendible: Tableau Public

A combo chart made using Highcharts.js

Charting best practices

• Color: Pick palette of no more than 3-4 colors from same side of color wheel.

• Increments: Use natural increments such as (0,2,4,6...) instead of, say, (0,3,6,9...)

• Scale: Don’t plot two unrelated series with one scale on left and one on right.

• Style: Flat and simple. No 3D effects, shadows, narrow bars or distracting shading.

Don’t plot two different variables on same scale.

Bars too narrow Distracting shading

Misleading 3D effects Pointless shadows

Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.

Charting best practices

• Always set the baseline to zero.

• Always order starting with greatest value

• Use broken bars sparingly

• No more than five slices on pie charts; no “donut” pie charts.

• No more than 3-4 lines on line chart
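Two of these rules can be applied to the data before it reaches a charting tool. The Python sketch below orders a series from greatest to smallest and folds small pie slices into a single "Other" slice; the function names, the "Other" label and the sample figures are assumptions made for illustration.

```python
def order_for_bar_chart(series):
    """Sort (label, value) pairs from greatest to smallest value."""
    return sorted(series, key=lambda pair: pair[1], reverse=True)


def limit_pie_slices(series, max_slices=5, other_label="Other"):
    """Keep the largest slices and fold the remainder into one 'Other' slice."""
    ordered = order_for_bar_chart(series)
    if len(ordered) <= max_slices:
        return ordered
    kept = ordered[:max_slices - 1]
    remainder = sum(value for _, value in ordered[max_slices - 1:])
    return kept + [(other_label, remainder)]


# Invented category totals.
SERIES = [("A", 5), ("B", 40), ("C", 10), ("D", 20), ("E", 15), ("F", 7), ("G", 3)]

print(order_for_bar_chart(SERIES))
print(limit_pie_slices(SERIES))
```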

Wrong order Right order

Wrong baseline Right baseline

No donut-pies

Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.

VII. Programming and beyond

Utilizing JavaScript/HTML5 libraries

• Together, JavaScript, HTML5 and jQuery have expanded boundaries of data visualization

• Abundance of open-source libraries and packages means less programming required to produce unique, interactive visualizations.

• Examples: Timeline.js, Bubbletree.js, Raphael.js, ProPublica tools

The HTML5 revolution

• Adobe Edge for HTML5 development; end of Flash’s reign

• Platform-agnostic, mobile-first movement

• Forking resources and packages off GitHub

Pushing the limits

• RaphaelJS for easier manipulation of serialized vector graphics

• Other boundary-pushing data visualization projects: Processing, Gephi, d3.js, IBM’s Many Eyes.

A network map produced using D3.js

Helpful resources and communities

• Blogs/Tutorials: FlowingData.com, Vis4.net, Driven-by-data.net, Chryswu.com, datavisualization.ch

• Books: The Data Journalism Handbook, O’Reilly Media. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Nathan Yau. The Wall Street Journal Guide to Information Graphics, Dona M. Wong.

• Communities: visual.ly, Hacks/Hackers, NICAR.

Free data journalism handbook from O’Reilly Media

For slides and list of links, http://bit.ly/NIXkOD

@carlvlewis
