Rensselaer Polytechnic Institute Data Science, Fall 2010 Professor Peter Fox 6961-2010_A4_GROUP_A Tim Lebo Chitti Shravya Ravi Chad Ruhle Brian Wang Jia Zhang Using Someone Else's Data: Living with the Dead http://logd.tw.rpi.edu/demo/living_dead_-_november_2010


Rensselaer Polytechnic Institute Data Science, Fall 2010

Professor Peter Fox

6961-2010_A4_GROUP_A

Tim Lebo
Chitti Shravya Ravi
Chad Ruhle
Brian Wang
Jia Zhang

Using Someone Else's Data: Living with the Dead

http://logd.tw.rpi.edu/demo/living_dead_-_november_2010


Outline

• 1: Data and Metadata: Discovery, Formats, Use Goals
• 2: Two Questions, Data Analysis, Tools and Methods Used
• 3: Visual Data, Significance of Findings
• 4: Data Management Plan



1: Data Discovery

• "Living with the Dead" Data Discovery:
• http://ads.ahds.ac.uk/catalogue/specColl/lwtd
• Verify and explore an archaeological data set in the study reported by Martin King in 2004.
• Why "Living with the Dead?"
  o Easy to find
  o Interesting and easy to understand
  o Easy to obtain
  o Provided "good enough" documentation

Shravya


1: Data and Metadata Formats

• Data: three CSV and four JPG files.
• Metadata: (mostly) self-explanatory headers.
• CSV files were converted to RDF
  o Better representative structure
  o Allows for more distributed and analytical functions
  o Metadata about conversion captured
• The only difficulty is understanding the context of the data

Shravya
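The CSV-to-RDF step on this slide can be illustrated with a minimal sketch. It is not the converter the group used: the base URI, the row-naming scheme, and the column-to-predicate mapping below are all hypothetical, chosen only to show how each row can become a subject and each non-empty cell a triple in N-Triples syntax.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI; the real conversion parameters are not shown in the slides.
EXAMPLE_BASE = "http://example.org/dataset/living-with-the-dead/"


def escape_literal(value: str) -> str:
    """Escape a string so it is safe inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text: str, table: str) -> list[str]:
    """Turn each CSV row into a subject and each non-empty cell into a triple."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{EXAMPLE_BASE}{quote(table)}_{row_number}>"
        for column, cell in row.items():
            if column is None or cell is None or cell.strip() == "":
                continue  # skip extra or missing cells instead of emitting empty literals
            name = quote(column.strip().lower().replace(" ", "_"))
            predicate = f"<{EXAMPLE_BASE}vocab/{name}>"
            triples.append(f'{subject} {predicate} "{escape_literal(cell.strip())}" .')
    return triples
```

Skipping empty cells is a design choice of this sketch; a converter that needs to record missing values explicitly would handle them differently.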


2. Data Analysis: Background

Data:
• nominal means of measurement
• unordered entries

Dataset:
• difficult to form relationships between data fields
• observational values
• data entries did not differ greatly
• hard to identify significance

Brian


2. Data Analysis: Questions

"Does the displacement between the bodies and tombs display patterns?"
• empty tomb locations
• unburied body locations
• determine if there is a possibility that some vandalism occurred that removed the bodies from the tombs

"How did the treatment and context of the bodies change over time?"
• two data fields, notice change over time

Brian


2. Data Analysis: Data Validation

• Group discussed the original data files
• Worked to understand their meaning
• Identified discrepancies and outliers
• Considered missing values, nulls, and error values
• Uninterpretable values ("class" = 1,4)
• Lack of relevant metadata makes the meaning unclear

Jia
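The validation pass described above could be approximated in code. The following is only a sketch: the set of strings treated as "missing" is an assumption (the slides do not say which markers occur in the files), and the function names are illustrative.

```python
import csv
import io
from collections import Counter

# Assumed markers for a missing value; adjust to whatever the real files contain.
MISSING_MARKERS = {"", "null", "n/a", "na", "?"}


def profile_columns(csv_text: str) -> dict:
    """Count missing cells and tally distinct values for every column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    profile = {
        name: {"missing": 0, "values": Counter()}
        for name in (reader.fieldnames or [])
    }
    for row in reader:
        for name in profile:
            cell = (row.get(name) or "").strip()
            if cell.lower() in MISSING_MARKERS:
                profile[name]["missing"] += 1
            else:
                profile[name]["values"][cell] += 1
    return profile


def rare_values(profile: dict, column: str, max_count: int = 1) -> list:
    """List values that occur at most max_count times (candidate outliers)."""
    counts = profile[column]["values"]
    return sorted(value for value, n in counts.items() if n <= max_count)
```

A tally like this does not explain an uninterpretable code such as a "class" of 1 or 4; it only shows how often each code occurs, which is a starting point for asking the data provider.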


2. Data Analysis: Tools and Methods

• SPARQL queries were used to extract the required data from the existing data sets.
• The queries make it possible to discover "interesting" things that are difficult or impossible to observe directly from the original data set.

                        

Histogram of chronology (time period), context of remains, and how the dead person was processed:

Chronology  Context  Treatment        Count
Neolithic   Cave     Disarticulation  40
Neolithic   Pit      Disarticulation  28
Neolithic   Pit      Articulation     18
Neolithic   Cist     Disarticulation  18

Jia
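A histogram like the one above can be requested from a SPARQL endpoint with a grouped count. The sketch below only builds the query text and the request URL; it does not reproduce the group's actual queries. The endpoint address and property URIs are hypothetical, and the query assumes an endpoint that supports SPARQL 1.1 aggregates.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary, used only for illustration.
EXAMPLE_ENDPOINT = "http://example.org/sparql"
EXAMPLE_VOCAB = "http://example.org/vocab/"


def histogram_query(fields: list[str]) -> str:
    """Build a SPARQL query that counts bodies per combination of fields."""
    variables = " ".join(f"?{field}" for field in fields)
    patterns = "\n".join(
        f"  ?body <{EXAMPLE_VOCAB}{field}> ?{field} ." for field in fields
    )
    return (
        f"SELECT {variables} (COUNT(?body) AS ?count)\n"
        f"WHERE {{\n{patterns}\n}}\n"
        f"GROUP BY {variables}\n"
        f"ORDER BY DESC(?count)"
    )


def request_url(query: str) -> str:
    """Encode the query as an HTTP GET URL, per the SPARQL protocol."""
    return EXAMPLE_ENDPOINT + "?" + urlencode({"query": query})
```

Calling `histogram_query(["chronology", "context", "treatment"])` yields one result row per distinct combination, ordered by descending count, which matches the shape of the table above.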


2. Data Analysis: Analysis Validation

Data analysis process:
• The original data from ADS
• A parameterized conversion of the data
• Query construction and execution
• Query results processing
• Visualization of results

The above steps can be validated by a third party by reviewing three types of artifacts that the group created during this project:
• Inspection of data
• Reviewing conversion parameters
• Reviewing processing code (JavaScript)

Jia
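The "query results processing" step can be sketched as follows. The group's processing code was JavaScript; this Python version is only an illustration of the same idea, assuming the endpoint returns the standard SPARQL Query Results JSON format. The function name is hypothetical.

```python
import json


def rows_from_sparql_json(text: str) -> list[dict]:
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        row = {}
        for var in variables:
            cell = binding.get(var)  # unbound variables are absent from a binding
            row[var] = cell["value"] if cell is not None else None
        rows.append(row)
    return rows
```

All values arrive as strings in this format, so numeric columns such as a count need an explicit conversion (for example `int(row["count"])`) before plotting.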


3: Visual Data, Significance of Findings

• Google Map plotting Tombs and Bodies
• Timelines plotting occurrences of
  o bodies' Treatment types
  o bodies' Context types

Tim


3: Map: Tomb and Person

• Browse region via map
• Select site from list to focus in map
• Color and Letter symbols distinguish
• "Site Data" link leads to Linked Data

Tim


3: Timelines

Four graphs were constructed:

• Probability density function for treatment data
• Cumulative distribution function for treatment data
• Probability density function for context data
• Cumulative distribution function for context data

These allowed us to see trends over time; the cumulative distribution functions proved to be more useful.

Chad
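How the two kinds of curve relate can be shown with a small sketch. It assumes each observation is reduced to a single numeric period (for example a year, negative for BC), which is an assumption about the data rather than something the slides state. For binned data like this, the "density" is strictly an empirical share per period and the cumulative curve is its running sum.

```python
from collections import Counter
from itertools import accumulate


def density_and_cumulative(periods: list[int]):
    """Return sorted periods, the share of observations in each, and the running share."""
    counts = Counter(periods)
    xs = sorted(counts)
    total = sum(counts.values())
    density = [counts[x] / total for x in xs]
    cumulative = list(accumulate(density))
    return xs, density, cumulative
```

Running this once per treatment type (or per context type) gives one pair of series per category. The cumulative series never decreases, which may be why it made the long-term trends easier to compare.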


3: Treatment Timelines

Chad


3: Context Timelines

Chad


3: Visual Data Results

• Visual inspection of map indicates little correlation of Bodies displaced from Tombs
  o Distances too large for "vandalism"
  o Generally uniform distribution over UK
  o (with slightly higher density of both types in south)
• Visual inspection of context timelines indicates little correlation between them
  o The most interesting thing is the sudden rise of "Cists" towards the end of the timeline, overtaking "Caves"
  o "Occupation debris" also spikes around 4000 BC
• Visual inspection of treatment timelines indicates that disarticulation is consistently more common
  o Articulation and cremation grow at roughly the same pace
  o Little correlation between them

Chad
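The distance judgment above was made by visual inspection of the map. If one wanted to quantify it, a great-circle (haversine) distance between a body location and the nearest tomb would be one option. The sketch below assumes coordinates are available as latitude/longitude pairs in decimal degrees; the function names are illustrative and this was not part of the group's analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_distance_km(point: tuple, candidates: list) -> float:
    """Distance from one (lat, lon) point to the closest candidate point."""
    return min(haversine_km(*point, *candidate) for candidate in candidates)
```

Applying `nearest_distance_km` to every unburied body against the list of empty tombs would give a distribution of distances to compare against whatever threshold is considered plausible for displacement.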


3: Visual Data Management

• Final demonstration hosted on logd.tw.rpi.edu
  o Map component developed by Tim on his laptop
  o Timeline components developed by Chad on his laptop
• JavaScript used for all visuals
  o Google Maps and Google Annotated Timeline APIs
• Data dynamically SPARQL-queried from LOGD's triple store
  o Provides connection to relevant processing and source

Tim


4: Data Management Plan

• Logical collection
• Physical data handling
• Persistence
• Interoperability support
• Security support
• Data ownership
• Data dissemination and publication
• Metadata collection, management, and access
• Knowledge and information discovery


4: Data Management Plan Logical collections

• 3 original CSVs
• "column partition"
• We logically organized by source, dataset, and version

<http://logd.tw.rpi.edu/source/ads-ahds-ac-uk/dataset/living-with-the-dead/version/2008-Mar-26/site_1_1>
    dct:isReferencedBy <http://logd.tw.rpi.edu/source/ads-ahds-ac-uk/dataset/living-with-the-dead/version/2008-Mar-26> .

Tim
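The source/dataset/version organization is visible in the URI on this slide. A small sketch of how such URIs could be assembled follows; the path pattern is inferred from that single example URI rather than from LOGD documentation, and the function names are hypothetical.

```python
LOGD_BASE = "http://logd.tw.rpi.edu"
DCT_IS_REFERENCED_BY = "http://purl.org/dc/terms/isReferencedBy"


def dataset_version_uri(source: str, dataset: str, version: str) -> str:
    """Compose a versioned dataset URI following the pattern seen on the slide."""
    return f"{LOGD_BASE}/source/{source}/dataset/{dataset}/version/{version}"


def is_referenced_by_triple(local_name: str, source: str, dataset: str, version: str) -> str:
    """Link one resource back to its dataset version, as an N-Triples line."""
    version_uri = dataset_version_uri(source, dataset, version)
    return f"<{version_uri}/{local_name}> <{DCT_IS_REFERENCED_BY}> <{version_uri}> ."
```

For example, `is_referenced_by_triple("site_1_1", "ads-ahds-ac-uk", "living-with-the-dead", "2008-Mar-26")` produces the expanded form of the triple shown above.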


4: Data Management Plan Physical data handling

• LOGD center of physical data handling
  o Downloaded CSV files
  o Authored interpretation parameters
  o Converted and published RDF versions
  o SPARQL endpoint offers data
  o Visualization JavaScript hosted by LOGD
• Google Docs center of Data Management Plan
  o Single document for meeting notes, assignment writeup, data exploration notes, and technical documentation

Tim


4: Data Management Plan Persistence

• LOGD center of physical data handling
  o Managed by TWC's LOGD group
  o Will persist beyond course project
  o Group is planning backups
  o Popularity of project aids continued maintenance
• Google Docs center of Data Management Plan
  o Relying on Google's massive data centers
  o Can archive results at end of class to LOGD
• Submitting back to Archaeology Data Service (ADS)

Tim


4: Data Management Plan Interoperability support

• Convert CSV format to RDF
  o Application independent
  o Follows W3C's 1999 recommendation
  o Numerous tools have been developed that support a variety of operations
  o Served through a SPARQL endpoint, allowing applications of any implementation to access the data via the common HTTP standard
• JavaScript source code available via web browser for inspection, reuse, and repurposing

Jia


4: Data Management Plan Security Support

• Original Data Security
  o Read-only access to the Archaeology Data Service site
  o Account access required to submit or modify data
• Analyzed Data Security
  o Cached versions and converted forms stored and served by Tetherless World Constellation
  o Read access only; the public cannot make changes

Jia


4: Data Management Plan Data Ownership

• Common Access Agreement
• ADS owns all of the data
  o allows anyone to use it
  o allows analysis and interpretation of it
  o for non-commercial research or teaching purposes
• We reproduced, re-hosted, and transformed the data
• All within the two listed purposes

Brian


4: Data Management Plan Data Dissemination and Publication

• ADS needs to be credited for the data
  o In all forms of dissemination and publication
• ADS does not want to be linked to, or held responsible for, any further analysis or interpretation of their work
• If any article or document is published, ADS must be given a copy.

Brian


4: Data Management Plan Metadata Collection, Management, and Access

• ADS provided one paragraph (in HTML) about how the data was collected, and little else
• Provenance data captured from CSV and JPG files
• RDF associated with CSVs
• Our metadata is contained within the data dump downloads
• Can be queried in the same way the data is (same SPARQL endpoint, same named graph)

Chad
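Because the metadata sits in the same named graph as the data, a single query shape can reach both. The sketch below builds such a query around the `dct:isReferencedBy` link shown earlier in the deck. The named-graph URI must be supplied by the caller; which graph URI LOGD actually uses is not stated in the slides, so any value passed in is an assumption.

```python
DCT = "http://purl.org/dc/terms/"


def metadata_query(graph_uri: str, dataset_uri: str, limit: int = 10) -> str:
    """Build a SPARQL query listing resources that reference a dataset version."""
    return (
        f"PREFIX dct: <{DCT}>\n"
        f"SELECT ?resource\n"
        f"WHERE {{\n"
        f"  GRAPH <{graph_uri}> {{\n"
        f"    ?resource dct:isReferencedBy <{dataset_uri}> .\n"
        f"  }}\n"
        f"}}\n"
        f"LIMIT {int(limit)}"
    )
```

Swapping the triple pattern inside the `GRAPH` block for a data property would turn the same template into a data query, which is the practical meaning of "queried in the same way the data is".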


4: Data Management Plan Knowledge and information discovery

• Will be added to the dataset collection of LOGD soon.
• Information discovery through the presentation in this class.
• Email Martin King with the findings and data, in order to pass them on to the ADS community.
• A web site containing the visualizations, information about the data, and other findings has been created.

Shravya


Questions?

Essential links

Original dataset URL:    http://ads.ahds.ac.uk/catalogue/specColl/lwtd

 RDF dataset URI:     http://logd.tw.rpi.edu/source/ads-ahds-ac-uk/dataset/living-with-the-dead/version/2008-Mar-26

Demo URL:    http://logd.tw.rpi.edu/demo/living_dead_-_november_2010

Data management plan:    https://docs.google.com/document/d/1UqQ45Cz7BJBHGLIv11EmKrr9tobuSHj6grG2bciYgrs

This presentation:    https://docs.google.com/present/edit?id=0AbTeDpS4-nUDZGNkYzQydm5fMzdnYnduZGpnNw