http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Data Collection

Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

How to Collect Data?

Method          Effort
Download        Low
API             Medium
Scrape/Crawl    High

Data you can just download

Yahoo WebScope datasets
NYC Taxi data: Trip (11GB), Fare (7.7GB)
StackOverflow (xml)
Atlanta crime data (csv)
Soccer statistics
…

More on course website: http://poloclub.gatech.edu/cse6242/2016fall/#datasets

Data you can just download

If you have leads, let us know on Piazza!


http://yahoolabs.tumblr.com/post/137281912191/yahoo-releases-the-largest-ever-machine-learning?soc_src=mail&soc_trk=ma

https://webscope.sandbox.yahoo.com

Collect Data via APIs

Twitter (small subset): https://dev.twitter.com/streaming/overview
Last.fm (Pandora has unofficial API)
Flickr
Facebook (your friends only)
CrunchBase (database about companies)
Rotten Tomatoes not free anymore :-(
iTunes

Data that needs scraping

Amazon (reviews, product info)
ESPN
eBay
Google Play
Google Scholar

How to Scrape? Google Play example

Goal: build network of similar apps


https://play.google.com/store/apps/details?id=com.shazam.android&hl=en

https://play.google.com/store/apps/details?id=com.spotify.music&hl=en
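A minimal sketch of how such a crawl could be structured, using only the Python standard library. It assumes that an app page links to related apps with ordinary anchor tags whose `href` follows the `/store/apps/details?id=...` pattern seen in the URLs above; the real page markup may differ or be rendered by JavaScript, so treat the parsing rule as an assumption to verify. The names `AppLinkParser`, `fetch`, and `crawl` are hypothetical helpers, and the sketch omits rate limiting, error handling, and any check of the site's terms of service.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

# URL pattern taken from the slide; everything else here is illustrative.
BASE = "https://play.google.com/store/apps/details?id={}&hl=en"


class AppLinkParser(HTMLParser):
    """Collects app ids from <a href=".../store/apps/details?id=..."> links."""

    def __init__(self):
        super().__init__()
        self.app_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if parsed.path.endswith("/store/apps/details"):
            self.app_ids.update(parse_qs(parsed.query).get("id", []))


def fetch(app_id):
    """Download one app page as text (assumed to be UTF-8 HTML)."""
    with urlopen(BASE.format(app_id)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, max_apps=50, fetch_page=fetch):
    """Breadth-first crawl from seed; returns a set of (app, linked_app) edges."""
    edges, seen, queue = set(), {seed}, deque([seed])
    fetched = 0
    while queue and fetched < max_apps:
        app = queue.popleft()
        fetched += 1
        parser = AppLinkParser()
        parser.feed(fetch_page(app))
        for other in parser.app_ids - {app}:
            edges.add((app, other))
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return edges
```

Calling `crawl("com.shazam.android", max_apps=20)` would fetch at most 20 pages and return the edge set. Passing a different `fetch_page` function makes it possible to try the crawl logic on saved HTML without touching the network.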

How to Store the Data?
What’s the Easiest Way?

Easiest Way to Store Data

As comma-separated files (CSV)
But may not be easy to parse. Why?

https://en.wikipedia.org/wiki/Comma-separated_values
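One possible answer to the "Why?" above, sketched with Python's standard `csv` module: fields may themselves contain commas, quotation marks, or line breaks, so splitting on newlines and commas breaks the records apart. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up sample data: a comma, quotes, and a line break inside fields.
rows = [
    ["id", "name", "comment"],
    ["1", "Smith, John", 'said "hi"'],
    ["2", "Johnson", "line one\nline two"],
]

# Write the rows as CSV text; the writer quotes fields where needed.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# Naive parsing: split on line breaks, then on commas.
naive = [line.split(",") for line in text.splitlines()]

# csv.reader understands quoted fields, doubled quotes, and embedded newlines.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print(len(naive), "naive rows vs", len(parsed), "parsed rows")
```

The naive split yields an extra row (the embedded line break) and an extra column in the "Smith, John" record, while `csv.reader` recovers the original three rows.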

SQLite

Most popular embedded database in the world: used in iPhone (iOS), Android, Chrome (browsers), Mac, etc.

Self-contained: one file contains data + schema
Serverless: database right on your computer
Zero-configuration: no need to set up!

http://www.sqlite.org/different.html
http://www.sqlite.org

SQL Refresher: create table

> sqlite3 database.db

sqlite> create table student(ssn integer, name text);

sqlite> .schema

CREATE TABLE student(ssn integer, name text);

ssn  name

SQL Refresher: insert rows

insert into student values(111, 'Smith');

insert into student values(222, 'Johnson');

insert into student values(333, 'Obama');

select * from student;

ssn  name
111  Smith
222  Johnson
333  Obama
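The same inserts can also be issued from a program. Below is a minimal sketch using Python's built-in `sqlite3` module; it uses an in-memory database so it leaves no file behind (pass a filename such as `database.db` instead to persist the data), and it uses `?` placeholders rather than building SQL strings by hand.

```python
import sqlite3

# ":memory:" is a throwaway database; a filename would create a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("create table student(ssn integer, name text)")

# Parameterized inserts: one (ssn, name) tuple per row.
students = [(111, "Smith"), (222, "Johnson"), (333, "Obama")]
conn.executemany("insert into student values (?, ?)", students)
conn.commit()

result = conn.execute("select * from student order by ssn").fetchall()
for ssn, name in result:
    print(ssn, name)
```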

SQL Refresher: create another table

create table takes (ssn integer, course_id integer, grade integer);

sqlite> .schema

CREATE TABLE student(ssn integer, name text);

CREATE TABLE takes (ssn integer, course_id integer, grade integer);

ssn  course_id  grade

SQL Refresher: joining 2 tables

More than one table: joins
E.g., create roster for this course (6242)

takes:
ssn  course_id  grade
111  6242       100
222  6242       90
222  4000       80

student:
ssn  name
111  Smith
222  Johnson
333  Obama

SQL Refresher: joining 2 tables + filtering

select name from student, takes where student.ssn = takes.ssn and takes.course_id = 6242;

takes:
ssn  course_id  grade
111  6242       100
222  6242       90
222  4000       80

student:
ssn  name
111  Smith
222  Johnson
333  Obama
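To try the roster query end to end, here is a small sketch with Python's `sqlite3` module that loads the two example tables and runs the join. It writes the join with the explicit `join ... on` syntax, which for this inner join is equivalent to the comma form shown above; the `order by` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table student(ssn integer, name text);
    create table takes(ssn integer, course_id integer, grade integer);
    insert into student values (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    insert into takes values (111, 6242, 100), (222, 6242, 90), (222, 4000, 80);
""")

# Match each student to the courses they take, then keep only course 6242.
roster = conn.execute("""
    select student.name
    from student join takes on student.ssn = takes.ssn
    where takes.course_id = ?
    order by student.name
""", (6242,)).fetchall()

print([name for (name,) in roster])
```

Obama does not appear in the roster because there is no matching row in `takes`.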

SQL General Form

select a1, a2, ... an
from t1, t2, ... tm
where predicate
[group by ...]
[having ...]
[order by ...]

Find ssn and GPA for each student

select ssn, avg(grade) from takes group by ssn;

takes:
ssn  course_id  grade
111  6242       100
222  6242       90
222  4000       80

Result:
ssn  avg(grade)
111  100
222  85

Adding "having avg(grade) > 90" keeps only the groups whose average exceeds 90 (here, ssn 111).
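A short sketch, again with Python's `sqlite3` module, that runs the aggregation on the example `takes` rows both without and with a `having` filter, to show how `having` acts on the grouped averages rather than on individual rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table takes(ssn integer, course_id integer, grade integer);
    insert into takes values (111, 6242, 100), (222, 6242, 90), (222, 4000, 80);
""")

# One output row per ssn, with the average of that student's grades.
gpa = conn.execute(
    "select ssn, avg(grade) from takes group by ssn order by ssn"
).fetchall()

# Same grouping, but keep only the groups whose average is above 90.
high = conn.execute(
    "select ssn, avg(grade) from takes group by ssn having avg(grade) > 90"
).fetchall()

print(gpa)   # [(111, 100.0), (222, 85.0)]
print(high)  # [(111, 100.0)]
```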

What if slow?

Important sanity check: Have you (or someone) created appropriate indexes?
SQLite’s indexes use a B-tree data structure.
O(log N) speed for adding/finding/deleting an item

create index student_ssn_index on student(ssn);
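One way to run that sanity check is to ask SQLite how it plans to execute a query, before and after creating the index. The sketch below uses `explain query plan` through Python's `sqlite3` module; the 10,000 generated rows and the helper name `plan` are illustrative, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table student(ssn integer, name text)")
conn.executemany(
    "insert into student values (?, ?)",
    ((i, f"student{i}") for i in range(10000)),  # synthetic rows
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("explain query plan " + sql)]


query = "select name from student where ssn = 4242"

before = plan(query)  # without an index: a scan of the whole table
conn.execute("create index student_ssn_index on student(ssn)")
after = plan(query)   # with the index: a search that names student_ssn_index

print(before)
print(after)
```

If the plan for a slow query still reports a scan after the index exists, the index probably does not cover the columns used in the `where` clause.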