kelley-brooks
Big Data and Programming (History 9808A)
27 October 2014
Today’s Agenda
Proposals: How are we with the due date?
A Short Introduction to Big Data
A Big Data Project: People In Motion
Data Deluge
Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
Library of Congress = 200 terabytes
“Transferring ‘Libraries of Congress’ of Data”
IP traffic is around 667 exabytes
It’s a deluge...
“Big Data”: too large for current software to handle
Don’t be intimidated
Not all DH sources (yet)
Instructive video: David McCandless, “The Beauty of Data Visualization”
Big Data for History
Tools for journalists, lit scholars and others
Where does history fit in?
“Digital history does not offer truths, but only a new way of interpreting and understanding traces of the past.” (S. Graham, I. Milligan, & S. Weingart)
Blog Leaders
Taryn: “…we have to have a better understanding of how programming works so we can at least engage with Computer Scientists to help develop the complex systems required…”
Tamar: The Strange Case of Belgium/Ancestry.com
Nick K.: The Case of the Missing API
New approach: Crowdsourcing
An “online, distributed problem-solving and production model.”
Examples:
Wikipedia
reCAPTCHA (Luis von Ahn)
Others... Transcribe Bentham, census transcription
A Database for Your Project?
Think about how you might use a database, but perhaps not too big!
Databases can be very small and still be DH-worthy
Are there public docs out there that you can digest?
Google Refine
Incorporate a search function into your website?
Resources:
MS Excel (spreadsheet)
MS Access (relational database)
Google Refine (cleaning data)
People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph
‘Unbiased’ links connecting individuals/households over several
census years
A comprehensive infrastructure of longitudinal data
What we are working towards
Census: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916
US Census: 1880, 1900
Stage 1: 1871 to 1881
100% of 1871
Census
Automatic Linking
4,277,807 records
3,601,663 records
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
100% of 1881 Census
Teaching a Computer to Be a Genealogist
Training with existing manually created (true) links:
Ontario Industrial Proprietors: 8429 links
Logan Township: 1760 links
St. James Church, Toronto: 232 links
Quebec City Boys: 1403 links
Bias concerns: Think of any?
Figure labels: Logan Twp, Guelph
Attributes for Automatic Linking
Last Name: string
First Name: string
Gender: binary
Birthplace: code
Age: number
Marital status: single, married, divorced, widowed, unknown
Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:
Data Cleaning and Standardization
Cleaning
Names: remove non-alphanumeric characters; remove titles
Age: transform non-numerical representations (e.g. 3 months) to corresponding numbers
All attributes: deal with English/French notations (e.g. days/jours, married/mariee)
Standardization
Birthplace codes and granularity
Marital status
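The cleaning steps above could be sketched in Python roughly as follows. This is only an illustration, not the project's actual code: the function names, the list of titles, the choice to express ages in years, and every lookup entry other than days/jours and married/mariee are assumptions made for the example.

```python
import re

# Hypothetical lookup tables. Only days/jours and married/mariee
# appear in the slides; the other entries are illustrative guesses.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
UNITS_PER_YEAR = {"days": 365, "jours": 365, "months": 12, "mois": 12}
MARITAL = {"married": "married", "mariee": "married"}


def clean_name(raw):
    """Remove punctuation and title words, then lowercase the name."""
    words = re.sub(r"[^\w\s]", "", raw).lower().split()
    return " ".join(w for w in words if w not in TITLES)


def clean_age(raw):
    """Return the age in years as a float, or None if it cannot be read."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)?", raw.strip().lower())
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2)
    if unit is None:
        return float(number)  # a bare number is assumed to mean years
    if unit in UNITS_PER_YEAR:
        return number / UNITS_PER_YEAR[unit]
    return None


def clean_marital(raw):
    """Map English/French notations to one label, else 'unknown'."""
    return MARITAL.get(raw.strip().lower(), "unknown")


print(clean_name("Mr. John O'Brien"))
print(clean_age("3 months"))
print(clean_marital("Mariee"))
```

A real pipeline would need much larger lookup tables and a decision on how to code infant ages; the point here is only the shape of the transformation.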
Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: (3.5M x 4M) record pairs / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = 40.5 days per attribute compared, or about 81 days for 2 attributes. (Big Data)
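The run-time estimate can be checked with a few lines of Python. The figures are the ones on the slide; the variable names are made up for this sketch. Worked through, the slide's numbers give roughly 40.5 days for each attribute compared across all pairs, so about 81 days for two attributes.

```python
# Figures taken from the slide; variable names are illustrative.
RECORDS_1871 = 3_500_000
RECORDS_1881 = 4_000_000
ATTRIBUTES = 2
COMPARISONS_PER_SECOND = 4_000_000
SECONDS_PER_DAY = 60 * 60 * 24

# Every 1871 record is compared with every 1881 record.
pairs = RECORDS_1871 * RECORDS_1881

days_per_attribute = pairs / COMPARISONS_PER_SECOND / SECONDS_PER_DAY
days_total = days_per_attribute * ATTRIBUTES

print(f"{pairs:,} record pairs")
print(f"{days_per_attribute:.1f} days per attribute")
print(f"{days_total:.1f} days for {ATTRIBUTES} attributes")
```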
Managing Computational Expense
Blocking
By first letter of last name
By birthplace
Using HPC
Running the system on multiple processors in parallel
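Blocking can be illustrated with a small Python sketch. It assumes records are plain dictionaries with hypothetical `last_name` and `birthplace` keys, and it combines the two blocking criteria into one key, which may not be how the project applies them. Only records that share a block are ever paired, so most of the possible comparisons are never made.

```python
from collections import defaultdict


def block_key(record):
    """Block on first letter of last name plus birthplace code."""
    return (record["last_name"][:1].lower(), record["birthplace"])


def candidate_pairs(census_a, census_b):
    """Yield only the pairs whose records fall in the same block."""
    blocks_b = defaultdict(list)
    for rec in census_b:
        blocks_b[block_key(rec)].append(rec)
    for rec_a in census_a:
        for rec_b in blocks_b.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Toy data, invented for the example.
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Roy", "birthplace": "QC"},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Smith", "birthplace": "QC"},
    {"last_name": "Roy", "birthplace": "QC"},
]

blocked = list(candidate_pairs(census_1871, census_1881))
all_pairs = len(census_1871) * len(census_1881)
print(f"{len(blocked)} of {all_pairs} possible pairs compared")
```

The trade-off is that a true link is lost whenever the blocking attribute itself was recorded differently in the two censuses, for example a last name whose first letter was mistranscribed.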
Record Comparison
Comparing strings
String measures: first letter, “edit distance”, sound
Age: +/- 2 years
Required exact matches: gender, birthplace
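The comparison rules above might look like this in Python. It is a simplified sketch under several assumptions: the field names are hypothetical, the `max_edits` threshold is invented, the age rule is read as "expected age ten years later, give or take two years", and the sound-based measure is left out. The slides do not say how the measures are combined, so the simple yes/no rule here is only one possibility.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def is_candidate_link(r71, r81, max_edits=2, census_gap=10, age_tolerance=2):
    """Apply exact-match, age, and string rules to one record pair."""
    # Gender and birthplace must match exactly.
    if r71["gender"] != r81["gender"] or r71["birthplace"] != r81["birthplace"]:
        return False
    # The person should have aged about one census interval.
    if abs((r81["age"] - r71["age"]) - census_gap) > age_tolerance:
        return False
    # Names must share a first letter and be within a few edits.
    for field in ("first_name", "last_name"):
        a, b = r71[field].lower(), r81[field].lower()
        if a[:1] != b[:1] or edit_distance(a, b) > max_edits:
            return False
    return True


# Toy records, invented for the example.
john_1871 = {"first_name": "John", "last_name": "Smith",
             "gender": "M", "birthplace": "ON", "age": 30}
john_1881 = {"first_name": "John", "last_name": "Smyth",
             "gender": "M", "birthplace": "ON", "age": 41}
older_1881 = dict(john_1881, age=50)

print(edit_distance("smith", "smyth"))
print(is_candidate_link(john_1871, john_1881))
print(is_candidate_link(john_1871, older_1881))
```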
Linkage Results
1871-81-91-1901
Over 500,000 links… About 20%
Coding Playtime
W3C tutorials
The Programming Historian: http://programminghistorian.org/
Codecademy: http://www.codecademy.com/learn