Data Science and Big Data Analytics Chap1: Intro to Big Data Analytics Charles Tappert Seidenberg School of CSIS, Pace University


Data Science and Big Data Analytics

Chap1: Intro to Big Data Analytics

Charles Tappert

Seidenberg School of CSIS, Pace University


1.1 Big Data Overview

- Industries that gather and exploit data
  - Credit card companies monitor purchases
    - Good at identifying fraudulent purchases
  - Mobile phone companies analyze calling patterns – e.g., even on rival networks
    - Look for customers who might switch providers
- For social networks, data is the primary product
  - Intrinsic value increases as data grows


Attributes Defining Big Data Characteristics

- Huge volume of data
  - Not just thousands/millions, but billions of items
- Complexity of data types and structures
  - Variety of sources, formats, structures
- Speed of new data creation and growth
  - High velocity, rapid ingestion, fast analysis


Sources of Big Data Deluge

- Mobile sensors – GPS, accelerometer, etc.
- Social media – 700 Facebook updates/sec in 2012
- Video surveillance – street cameras, stores, etc.
- Video rendering – processing video for display
- Smart grids – gather and act on information
- Geophysical exploration – oil, gas, etc.
- Medical imaging – reveals internal body structures
- Gene sequencing – more prevalent, less expensive; healthcare would like to predict personal illnesses


Sources of Big Data Deluge


Example: Genotyping from 23andme.com


1.1.1 Data Structures: Characteristics of Big Data


Data Structures: Characteristics of Big Data

- Structured – defined data type, format, structure
  - Transactional data, OLAP cubes, RDBMS, CSV files, spreadsheets
- Semi-structured
  - Text data with discernible patterns – e.g., XML data
- Quasi-structured
  - Text data with erratic data formats – e.g., clickstream data
- Unstructured
  - Data with no inherent structure – text docs, PDFs, images, video
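
A minimal sketch of how these categories differ when read by a program, using only the Python standard library. The sample records, column names, and XML tags below are invented for illustration and are not taken from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every row follows the same declared columns (hypothetical CSV sample).
csv_text = "customer_id,amount\n101,25.50\n102,13.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: tags describe the data, but elements may be missing
# from some records (hypothetical XML sample).
xml_text = "<orders><order id='101'><note>gift</note></order><order id='102'/></orders>"
root = ET.fromstring(xml_text)
notes = {o.get("id"): o.findtext("note") for o in root.findall("order")}

# Unstructured: free text carries no schema, so without further processing
# only generic measures such as a word count are available.
doc_text = "Video about an Antarctica expedition"
word_count = len(doc_text.split())

print(total, notes, word_count)
```

The structured rows can be summed directly by column name, the XML records need a check for missing elements (`findtext` returns `None` for the order without a note), and the free text offers nothing to query until some structure is extracted from it.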


Example of Structured Data


Example of Semi-Structured Data


Example of Quasi-Structured Data

Visiting 3 websites adds 3 URLs to the user’s log files
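
As a rough illustration of why clickstream data counts as quasi-structured, the sketch below parses three logged URLs with Python's standard `urllib.parse` module. The URLs and the `parse_click` helper are hypothetical; real log formats vary by server and tool.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream entries: three visits produce three logged URLs.
log_lines = [
    "https://www.example.com/search?q=big+data",
    "https://news.example.org/articles/2014/analytics",
    "http://shop.example.net/item?id=42&ref=home",
]

def parse_click(url):
    """Pull out whichever fields happen to be present in one logged URL."""
    parts = urlparse(url)
    return {
        "host": parts.netloc,
        "path": parts.path,
        "params": parse_qs(parts.query),  # empty dict when there is no query string
    }

clicks = [parse_click(u) for u in log_lines]
for click in clicks:
    print(click)
```

Each URL yields a host and a path, but the query parameters differ from entry to entry or are absent altogether, which is the erratic formatting that sets quasi-structured data apart from semi-structured data.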


Example of Unstructured Data

Video about Antarctica Expedition


1.1.2 Types of Data Repositories from an Analyst Perspective


1.2 State of the Practice in Analytics

- Business Intelligence (BI) versus Data Science
- Current Analytical Architecture
- Drivers of Big Data
- Emerging Big Data Ecosystem and a New Approach to Analytics


Business Drivers for Advanced Analytics


1.2.1 Business Intelligence (BI) versus Data Science


1.2.2 Current Analytical Architecture

Typical Analytic Architecture


Current Analytical Architecture

- Data sources must be well understood
- EDW – Enterprise Data Warehouse
- From the EDW, data is read by applications
- Data scientists get data for downstream analytics processing


1.2.3 Drivers of Big Data

Data Evolution & Rise of Big Data Sources


1.2.4 Emerging Big Data Ecosystem and a New Approach to Analytics

Four main groups of players

- Data devices
  - Games, smartphones, computers, etc.
- Data collectors
  - Phone and TV companies, Internet, Gov’t, etc.
- Data aggregators – make sense of data
  - Websites, credit bureaus, media archives, etc.
- Data users and buyers
  - Banks, law enforcement, marketers, employers, etc.


Emerging Big Data Ecosystem and a New Approach to Analytics


1.3 Key Roles for the New Big Data Ecosystem

1. Deep analytical talent
   - Advanced training in quantitative disciplines – e.g., math, statistics, machine learning

2. Data savvy professionals
   - Savvy but less technical than group 1

3. Technology and data enablers
   - Support people – e.g., DB admins, programmers, etc.


Three Key Roles of the New Big Data Ecosystem


Three Recurring Data Scientist Activities

1. Reframe business challenges as analytics challenges

2. Design, implement, and deploy statistical models and data mining techniques on Big Data

3. Develop insights that lead to actionable recommendations


Profile of Data Scientist: Five Main Sets of Skills


Profile of Data Scientist: Five Main Sets of Skills

Quantitative skill – e.g., math, statistics

Technical aptitude – e.g., software engineering, programming

Skeptical mindset and critical thinking – ability to examine work critically

Curious and creative – passionate about data and finding creative solutions

Communicative and collaborative – can articulate ideas, can work with others


1.4 Examples of Big Data Analytics

- Retailer Target
  - Uses life events: marriage, divorce, pregnancy
- Apache Hadoop
  - Open source Big Data infrastructure innovation
  - MapReduce paradigm, ideal for many projects
- Social Media Company LinkedIn
  - Social network for working professionals
  - Can graph a user’s professional network
  - 250 million users in 2014
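
To give a feel for the MapReduce paradigm mentioned above, here is a single-process word-count sketch in plain Python. It only mimics the map, shuffle, and reduce steps. It does not use Hadoop or any Hadoop API, and the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input: each string stands in for one document.
documents = ["big data analytics", "big data ecosystem"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, a framework such as Hadoop can spread those calls across many machines, which is what makes the paradigm suit large data sets.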


Data Visualization of User’s Social Network Using InMaps


Summary

- Big Data comes from myriad sources
  - Social media, sensors, IoT, video surveillance, and sources only recently considered
- Companies are finding creative and novel ways to use Big Data
- Exploiting Big Data opportunities requires
  - New data architectures
  - New machine learning algorithms, ways of working
  - People with new skill sets
- Always review chapter exercises


Focus of Course

Focus on quantitative disciplines – e.g., math, statistics, machine learning

Provide overview of Big Data analytics

In-depth study of several key algorithms