and
the Lean Data Diet
[email protected] @pantojacoder
Free Knowledge Movement
“Imagine a world in which every single human being can freely share in the sum of all knowledge”.
Nobody should have to provide any personal information to participate in the free knowledge movement.
There cannot be access to free knowledge without a strong guarantee of privacy.
Free Knowledge Movement: Core Beliefs around Privacy
Anyone can edit!
How is this guarantee of Privacy Expressed?
https://foundation.wikimedia.org/wiki/Privacy_policy
Build the wiki way
Discussion: total 150,000 words
Read or edit without account.
Register account without name, email or any other info.
Never selling/sharing your info with third parties.
After at most 90 days, data will be deleted, aggregated, or de-identified
In practice, the Privacy Policy has strong implications for how we do engineering.
We compute metrics in privacy-conscious ways, aggregate them, release them publicly, and delete a lot of data.
Deleting Data
Sanitizing Data
Privacy Culture
Deleting Data At Scale!
Usage Data - Web Request
project      es.wikipedia
ip_address   31.214.189.167
user_agent   Mozilla/5.0 (X11; Linux ...
page         COVID-19

Usage Data - Behavioural
username     pepito_editor
ip_address   3x.214.189.167
user_agent   Mozilla/5.0 (X11; Linux ...
session_id   8c878625792be023
edit_count   4257
ui_skin      minerva
200,000 web requests per second (at peak)
2,000 events per second
Deleting Data
Are you sure? [Cancel] [Delete]
Unsafe defaults:
--dry-run           undef -> execute
--tables-to-delete  undef -> all
Safe defaults:
--execute           undef -> dry-run
--tables-to-delete  undef -> none, * -> all
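The flag defaults above can be sketched in Python with argparse. This is only an illustration under stated assumptions: the parser, the helper resolve_tables and the table names are hypothetical and are not taken from the actual deletion script. The point it shows is that leaving a flag out must never delete anything.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for a deletion script whose defaults are safe."""
    parser = argparse.ArgumentParser(description="Delete old data partitions.")
    # Omitting --execute means dry-run, so nothing is deleted by accident.
    parser.add_argument("--execute", default=None, metavar="CHECKSUM",
                        help="checksum printed by a previous dry-run")
    # Omitting --tables-to-delete selects no tables; '*' must be typed explicitly.
    parser.add_argument("--tables-to-delete", default="",
                        help="comma-separated table names, or '*' for all")
    return parser


def resolve_tables(spec: str, known_tables: list) -> list:
    """Map the flag value to a table list: '' -> none, '*' -> all."""
    if spec == "":
        return []
    if spec == "*":
        return list(known_tables)
    requested = [name.strip() for name in spec.split(",") if name.strip()]
    unknown = [name for name in requested if name not in known_tables]
    if unknown:
        raise ValueError(f"unknown tables: {unknown}")
    return requested


if __name__ == "__main__":
    args = build_parser().parse_args()
    tables = resolve_tables(args.tables_to_delete, ["menuClicks", "pageViews"])
    mode = "DRY-RUN" if args.execute is None else "EXECUTE"
    print(mode, tables)
```

Run without arguments, this sketch reports a dry-run over an empty table list, which is the harmless outcome the safe defaults aim for.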
--database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true

Executing tests…
Tests passed.
Starting DRY-RUN.
Checking partitions to delete…
Partitions that would be deleted by execution:
 - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
 - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
 - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
 - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
 - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
 - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
 - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
 - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
 - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
DRY-RUN finished.
Parameter checksum: 57ca7987d987e9e98a6c79

--execute=<checksum>
#1 Dry-run
#2 Execute
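The two-step flow (dry-run first, then execute with the checksum the dry-run printed) could look roughly like the following Python sketch. The function names, the choice of SHA-256 and the parameter dictionary are assumptions made for illustration; the real script may compute its checksum differently.

```python
import hashlib
import json


def parameter_checksum(params: dict) -> str:
    """Hash the parameters in a stable key order so equal parameters give equal checksums."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run(params: dict, partitions: list, execute=None) -> list:
    """Dry-run unless `execute` equals the checksum of `params`.

    Returns the partitions that were deleted (empty for a dry-run).
    """
    checksum = parameter_checksum(params)
    if execute is None:
        print("Starting DRY-RUN. Partitions that would be deleted by execution:")
        for partition in partitions:
            print(f" - {partition}")
        print("DRY-RUN finished.")
        print(f"Parameter checksum: {checksum}")
        return []
    if execute != checksum:
        # The parameters changed since the dry-run, or no dry-run was done.
        raise ValueError("checksum does not match these parameters; run a dry-run first")
    # A real script would drop the partitions here; this sketch only reports them.
    return list(partitions)
```

Because the checksum depends on every parameter, an operator cannot execute a deletion whose exact parameters were never previewed in a dry-run.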
Sanitizing Data in Advance
Behavioural data (pipeline diagram)
Clients -> HTTP Beacon Endpoint (Varnish) -> Varnishkafka -> Kafka -> Event Processor (Spark) -> HDFS
In HDFS: Events (<90 days) -> Allow-list -> Sanitized Events
https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job
Do-not-allow-list

Field        Unsanitized                    Sanitized
date         2019-01-01                     2019-01-01
ip           31.214.189.167                 NULL
user_agent   Mozilla/5.0 (X11; Linux ...    NULL
wiki         en.wikipedia                   en.wikipedia
action       click                          click
target       menu                           menu

When a new field is added to the event, it is not on the do-not-allow-list, so it passes through unsanitized:

Field        Unsanitized                    Sanitized
date         2019-01-01                     2019-01-01
ip           31.214.189.167                 NULL
user_agent   Mozilla/5.0 (X11; Linux ...    NULL
wiki         en.wikipedia                   en.wikipedia
action       click                          click
target       menu                           menu
cookie_id    724310                         724310
Allow-list

Field        Unsanitized                    Sanitized
date         2019-01-01                     2019-01-01
ip           31.214.189.167                 NULL
user_agent   Mozilla/5.0 (X11; Linux ...    NULL
wiki         en.wikipedia                   en.wikipedia
action       click                          click
target       menu                           menu

When a new field is added to the event, it is not on the allow-list, so it is nulled by default:

cookie_id    724310                         NULL

Sensitive fields can then be allowed explicitly in a reduced or hashed form:

Field        Unsanitized                    Sanitized
date         2019-01-01                     2019-01-01
ip           31.214.189.167                 Spain
user_agent   Mozilla/5.0 (X11; Linux ...    Linux
wiki         en.wikipedia                   en.wikipedia
action       click                          click
target       menu                           menu
cookie_id    724310                         8d56ab209e10
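A minimal Python sketch of allow-list sanitization follows, to make the default explicit: any field that is not listed is nulled. The rule names ("keep", "hash"), the salt and the truncated SHA-256 are assumptions chosen for illustration and do not describe the refinery job's actual configuration. Reducing an IP address to a country or a user agent to an operating system would need lookup data and is left out.

```python
import hashlib

# Hypothetical allow-list: field name -> rule. Fields not listed become None.
ALLOW_LIST = {
    "date": "keep",
    "wiki": "keep",
    "action": "keep",
    "target": "keep",
    "cookie_id": "hash",
}


def hash_value(value, salt: str) -> str:
    """Salted SHA-256 of the value, truncated to 12 hex characters."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def sanitize(event: dict, allow_list: dict, salt: str) -> dict:
    """Return a copy of `event` in which only allow-listed fields survive."""
    sanitized = {}
    for field, value in event.items():
        rule = allow_list.get(field)
        if rule == "keep":
            sanitized[field] = value
        elif rule == "hash":
            sanitized[field] = hash_value(value, salt)
        else:
            sanitized[field] = None  # default: anything not allowed is nulled
    return sanitized


event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,
}
print(sanitize(event, ALLOW_LIST, salt="example-salt"))
```

With this design, forgetting to update the list when a schema gains a field loses data rather than leaking it, which is the safer failure mode.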
Privacy Culture
Privacy is not the responsibility of one team.
All processes and metrics take privacy into account from the beginning until the end.
Unique Devices - DAU or MAU
(Diagram: every request (REQ) carries a UUID.)

SELECT COUNT(DISTINCT uuid)
FROM database.table
WHERE date = '2019-01-01';

The same UUID also makes it possible to list the pages read by a single device:

SELECT page_title, uuid
FROM database.table
WHERE date = '2019-01-01' AND uuid = <some>
Unique Devices - LAST ACCESS
https://diff.wikimedia.org/2016/03/30/unique-devices-dataset/
(Diagram: every request (REQ) carries a Last-Access (LA) cookie instead of a UUID. Today: 2020-10-15.)

The device last visited on 2020-09-01, so its first request today carries that date:

Timestamp    IP        Page      Cookies
2020-10-15   776.9.*   Titanic   Last-Access=2020-09-01

The response sets the cookie to 2020-10-15, so a second request on the same day carries today's date:

Timestamp    IP        Page      Cookies
2020-10-15   776.9.*   Titanic   Last-Access=2020-09-01
2020-10-15   776.9.*   Everest   Last-Access=2020-10-15

SELECT COUNT(*)
FROM database.table
WHERE (last_access_date IS NULL OR last_access_date < date)
  AND date = '2020-10-15';

Only the first request of the day matches, so the device is counted once:

   Timestamp    IP        Page      Cookies
-> 2020-10-15   776.9.*   Titanic   Last-Access=2020-09-01
   2020-10-15   776.9.*   Everest   Last-Access=2020-10-15
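The counting rule in the query can be restated as a small Python sketch. The tuple layout and function names are assumptions for illustration; the third request, from a device with no cookie, is a hypothetical addition that exercises the IS NULL branch.

```python
from datetime import date


def is_first_visit_of_day(request_date: date, last_access) -> bool:
    """Mirror of the WHERE clause: no cookie yet, or a cookie older than the request date."""
    return last_access is None or last_access < request_date


def count_unique_devices(requests: list, day: date) -> int:
    """Count requests on `day` that are a device's first visit that day."""
    return sum(
        1
        for request_date, last_access in requests
        if request_date == day and is_first_visit_of_day(request_date, last_access)
    )


requests = [
    (date(2020, 10, 15), date(2020, 9, 1)),    # Titanic: cookie from an earlier day, counted
    (date(2020, 10, 15), date(2020, 10, 15)),  # Everest: cookie already set today, skipped
    (date(2020, 10, 15), None),                # hypothetical device without a cookie, counted
]
print(count_unique_devices(requests, date(2020, 10, 15)))  # 2
```

No identifier is needed to deduplicate: the cookie only stores a date, so requests from the same device cannot be linked to each other afterwards.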
The Lean Data Diet
Pros:
Less work related to data requests
Easier to make data public
Guarantee of privacy
Cons:
Extra work
Privacy culture needs time
Data analysis needs a different mindset
Privacy is a Feature
Questions?
https://xkcd.com/285
[email protected] @pantojacoder
All pictures: https://creativecommons.org/publicdomain/zero/1.0/