Introduction to OPEN DATA
and other hypes
J. Minguillón
EIMT / UOC
what is Open Data?
what is Open?
what is Data?
plural of "datum" (thing given)
data is / data are
idea: the measure / amount / ... of something
42
42 what?
https://en.wikipedia.org/wiki/42_(disambiguation)
forty-two
quaranta-dos
amane nambili
representation
integer?
base / radix?
units?
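The question about base / radix can be made concrete with a minimal sketch: the same quantity printed in several bases, and the same digits "42" read under different radices. The variable names are illustrative only.

```python
# The same quantity, forty-two, written in several representations.
n = 42

representations = {
    "binary (base 2)": format(n, "b"),
    "octal (base 8)": format(n, "o"),
    "decimal (base 10)": format(n, "d"),
    "hexadecimal (base 16)": format(n, "x"),
}

for name, digits in representations.items():
    print(f"{name}: {digits}")

# Reading the digits "42" under a different radix gives a different quantity.
as_octal = int("42", 8)
as_hex = int("42", 16)
print(as_octal, as_hex)
```

Without knowing the radix, the digits alone do not determine the value.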
D-I-K-W pyramid
D: 42
I: Patient's body temperature (t) is 42 degrees
K: Fever with t > 42 can cause severe brain damage
W: never let t reach 42 degrees!
t = 42 degrees?
Celsius: fever
Fahrenheit: cold body
Kelvin: cold body floating in outer space
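A small sketch of why units matter: the same reading, 42, converted to Celsius under each of the three scales above. The function name and the unit codes "C", "F", "K" are choices made for this example.

```python
def to_celsius(value, unit):
    """Convert a temperature reading to degrees Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit!r}")


# The same number, interpreted under three different units.
readings = {unit: to_celsius(42, unit) for unit in ("C", "F", "K")}
for unit, celsius in readings.items():
    print(f"42 {unit} = {celsius:.2f} C")
```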
data is not just numbers
tables, documents
wikipedia: pages / articles
flickr: images
twitter: tweets
internal structure
x
possible values
basic types
structured
semi-structured
basic types
integer, real, complex
vectors (RGB, ...)
characters, strings
structured data
flat: 1D, 2D, 3D, ...
hierarchical: tweets
relations: graphs
semi-structured data
documents
HTML pages
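As a rough illustration of the structured kinds listed above, here is one way each could look in Python. All values are made up, and the tweet-like record uses a hypothetical schema, not Twitter's actual format.

```python
import json

# Flat (2D): rows and columns, every row has the same fields.
table = [
    ("Barcelona", 2015, 42),
    ("Girona", 2015, 17),
]

# Hierarchical: a tweet-like record with nested fields (hypothetical schema).
tweet = {
    "text": "hello open data",
    "user": {"name": "example_user", "followers": 10},
    "hashtags": ["opendata"],
}

# Relations: a graph stored as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}

# Hierarchical data serializes naturally to JSON and back.
encoded = json.dumps(tweet)
decoded = json.loads(encoded)
print(decoded["user"]["name"])
```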
what is Open?
openness as freedom
5 Rs model
Reuse
Revise
Remix
Redistribute
Retain
open vs free
https://theodi.org/blog/when-data-is-free-but-not-open
open is a combination of
no technological barriers
no legal barriers
technological barriers
data must be accessible, downloadable and manipulable
the 5 star model
* not manipulable: pdf, tiff
** proprietary: doc, ppt, xls
*** open formats: txt, csv, json
**** accessible (link): xml, rdf
***** provide context: xml, rdf
http://5stardata.info/en/
open data needs at least 3 stars
open formats
open software
linked data
use URIs to name things
use HTTP to provide access
describe data using metadata
link to related data sources
readable by machines
why linked data?
automatic web data extraction
data exchange / enrichment
construction of knowledge
semantic searches
example: wikidata
municipalities surrounding Barcelona?
https://en.wikipedia.org/wiki/Barcelona
https://www.wikidata.org/wiki/Q1492
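One possible way to ask the question above to Wikidata is a SPARQL query. The sketch below only builds the request URL and does not send it. It assumes that "surrounding" can be approximated by Wikidata's property P47 ("shares border with") and uses the public endpoint at query.wikidata.org; Q1492 is the Barcelona item linked above. The actual results depend on what Wikidata contains.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Q1492 is Barcelona; P47 ("shares border with") is assumed here to be
# a reasonable stand-in for "surrounding".
query = """
SELECT ?place ?placeLabel WHERE {
  wd:Q1492 wdt:P47 ?place .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# Encode the query as URL parameters; the request itself is not sent here.
url = f"{ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
print(url)
```

Because the question is expressed over identifiers (URIs) rather than page text, a machine can answer it without parsing the Wikipedia article.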
"static" access
data is downloaded as a file
files are "pictures of the past"
not defined by final users
typical of data repositories
human oriented
http://dadesobertes.gencat.cat/en/cercador/detall-cataleg/?id=5
"dynamic" access
data is downloaded as a stream
streams are "pictures of the present"
parametrized by final users (API)
typical of online services
machine oriented
legal barriers
reachable through the Internet does not mean open
licenses
terms and conditions
EULAs
licenses for open data
for datasets / databases
facts cannot be restricted...
...but collections can!
http://opendatacommons.org/licenses/
terms and conditions
for web data
legal language
http://www.coca-colacompany.com/our-company/the-coca-cola-company-terms-of-use
EULA
End-User License Agreement
for apps and online services
legal language
absurd!
https://www.eff.org/wp/dangerous-terms-users-guide-eulas
ethical issues
privacy
security
transparency
some bad practices
AOL search log release (user No. 4417749)
Ashley Madison leak
AEMET paywall
why open data?
why not?
data belongs to its producers
in most cases, users!
it promotes participation
it discovers additional value
"data is the new oil" (C. Humby)
"data is the new soil" (D. McCandless)
data life-cycle
data is ...
generated
stored / published
gathered / captured
preprocessed
analyzed
visualized
data generation
by humans / sensors / services
anytime / anywhere
persistent / volatile
stored / published
data gathering
from repositories
APIs
social networks
databases / logs
web scraping
humans (captcha)
data preprocessing
filtering / selection
join (enrichment)
feature extraction
conversion
summarize / aggregate
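The preprocessing steps above can be sketched on a toy dataset with plain Python. The records, the region codes, and the helper name `is_number` are all made up for illustration; a real project would more likely use OpenRefine or pandas.

```python
from collections import defaultdict

# Made-up records: (city, date string, temperature as text).
raw = [
    ("Barcelona", "2016-01-01", "12.5"),
    ("Barcelona", "2016-01-02", "n/a"),
    ("Girona", "2016-01-01", "8.0"),
    ("Girona", "2016-01-02", "9.0"),
]
# Made-up lookup table used for enrichment.
region = {"Barcelona": "B", "Girona": "GI"}


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Filtering / selection: drop rows whose measurement is not numeric.
valid = [row for row in raw if is_number(row[2])]

# Join, feature extraction and conversion in one pass.
records = [
    {
        "city": city,
        "region": region[city],  # join (enrichment)
        "month": date[:7],       # feature extraction
        "temp": float(temp),     # conversion
    }
    for city, date, temp in valid
]

# Summarize / aggregate: mean temperature per city.
groups = defaultdict(list)
for rec in records:
    groups[rec["city"]].append(rec["temp"])
means = {city: sum(temps) / len(temps) for city, temps in groups.items()}
print(means)
```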
data analysis
statistical descriptors
inference
unsupervised (clustering)
supervised (classification)
variable relevance...
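As a minimal example of the first item, statistical descriptors, the sketch below summarizes a made-up list of values with Python's standard `statistics` module. It does not cover inference, clustering or classification.

```python
import statistics

# Made-up measurements.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "n": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "pstdev": statistics.pstdev(values),  # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```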
data visualization
visual analysis
summarization
reporting
dashboards
maps / graphs
interactivity
big data
3 Vs
volume
variety
velocity
volume is the number of elements
sample / population size
variety is the number of different forms
dimensionality
velocity is how fast data is produced or changes
longitudinal
other Vs
veracity
value
variability
visibility
...
example: Wal-Mart
(2015) 37 million people
shop at Wal-Mart every day
from a list of 140,000 items
who buys what when?
why?
other big huge data players
amazon
VISA
telcos
facebook, twitter, ...
google
big data also
uses multiple sources
deals with populations, not samples
makes traditional methods obsolete
requires supercomputing / cloud
example
include context data
customer loyalty cards
product interestingness (RFID)
CCTV cameras
social networks
...
tools (examples)
"engineering" approach
solve this problem now with the available tools
no tool solves all problems
problems change, tools too
tools related to data life-cycle
data gathering
tabula
scrapy
twitteR, TAGS, flocker
instagram, flickr
wikipedia dumps
URL manipulation
example: URL manipulation
IDESCAT
names of newborn children
parameters: year, sex, place
other: position, sort
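URL manipulation can be sketched as generating one URL per combination of parameter values. The endpoint, the parameter names and the value codes below are hypothetical placeholders, not IDESCAT's actual URL scheme, which should be checked on the site itself.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://example.org/newborn-names"  # hypothetical endpoint

years = [2014, 2015]
sexes = ["f", "m"]  # hypothetical codes
places = ["cat"]    # hypothetical code


def build_url(year, sex, place):
    """Build one query URL from the three parameters."""
    query = urlencode({"year": year, "sex": sex, "place": place})
    return f"{BASE_URL}?{query}"


# One URL per combination of parameter values.
urls = [build_url(y, s, p) for y, s, p in product(years, sexes, places)]
for url in urls:
    print(url)
```

The resulting list can then be handed to a downloader or scraper.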
use scrapy for data gathering
define desired fields
create list of URLs
identify XPath (inspect)
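The field-plus-XPath idea can be tried without scrapy. The sketch below is a stand-in using the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup; it is not scrapy's API. The page fragment, class names and field names are made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment standing in for a downloaded page.
PAGE = """
<html><body>
  <table id="names">
    <tr><td class="pos">1</td><td class="name">Name A</td></tr>
    <tr><td class="pos">2</td><td class="name">Name B</td></tr>
  </table>
</body></html>
"""

# Desired fields and the XPath expression locating each one inside a row.
FIELDS = {
    "position": "td[@class='pos']",
    "name": "td[@class='name']",
}

root = ET.fromstring(PAGE)
rows = root.findall(".//table[@id='names']/tr")

# One dictionary per table row, with one entry per desired field.
items = [
    {field: row.find(path).text for field, path in FIELDS.items()}
    for row in rows
]
print(items)
```

In a real scrapy project the same expressions, found with the browser's inspect tool, would go into the spider's parsing code.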
data preprocessing
Mr. Data Converter
JSON online editor
OpenRefine
bash+awk, perl, python
data analysis
R, RStudio
python pandas, scikit
anaconda
gephi
...
data visualization
R: ggplot2, ggmap, ...
python: Bokeh, plotly, ...
processing
D3
openstreetmap
other: tagxedo, infogr.am, ...
example
visualizing co-authorship at UOC
data gathered from SCOPUS
unify author names, build graph
no analysis
visualize graph
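The "unify author names, build graph" step might look like the sketch below. The author lists are made up and the `unify` rule (lowercase, drop periods) is a deliberately naive assumption; real name disambiguation in SCOPUS exports is harder. The resulting weighted edges could then be loaded into a tool such as gephi.

```python
from collections import Counter
from itertools import combinations

# Made-up author lists, one per paper, with inconsistent name spellings.
papers = [
    ["Doe, J.", "Roe, R."],
    ["doe, j", "Poe, E."],
    ["Roe, R.", "DOE, J.", "Poe, E."],
]


def unify(name):
    """Normalize a name: lowercase, no periods, single spaces."""
    return " ".join(name.lower().replace(".", "").split())


# Edge weight = number of papers two authors share.
edges = Counter()
for authors in papers:
    unique = sorted({unify(a) for a in authors})
    for pair in combinations(unique, 2):
        edges[pair] += 1

# Degree = number of distinct co-authors per author.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```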
what knowledge can we extract from the visualization?
most prolific authors/departments
interdisciplinarity, connectors
internal publication policies
"lone rangers"
what open data can we use to enrich the visualization?
from authors/departments
from papers/journals
...
open data initiatives
agenda oberta
civio
15mpedia
wheredoesmymoneygo?
...
data sources
social networks
open data repositories
scraped web data
...
examples
league of legends & twitter
smileys, weather & twitter
air ticket price fluctuations
barcelona & flickr
barcelona & bicing
US air traffic patterns
project
requirements
teams of 3-4 people
free topic using open data
proof-of-concept
final report
report
summary and goals
data life-cycle description
tools and data used
results
legal and ethical issues
limitations and future work
bibliography and references
calendar
today: team, topic, abstract
next session: work in class
online mentoring
deadline: 23/01/2017
contact
jminguillona[at]uoc[dot]edu
@jminguillona
webpage
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.