Sunnie Chung Cleveland State University

Source: eecs.csuohio.edu/~sschung/CIS660/DataAnalyticsCloud_SunnieChungUpdated...


• Data Scientist

• Big Data Processing

• Data Mining


• Intersection of Computer Scientists and Statisticians, with Knowledge of Data Mining AND Big Data Processing Skills:

• to Handle Big Data

• to Collect, Process and Extract value from Big Data (giant and diverse data sets)

• to Understand, Visualize and Present their findings to non-data scientists

• Ability to Create Data-driven Solutions that boost profits, reduce costs and even help save the world


And tackle big data projects on every level

• Big Data and Cloud Projects are on Every CEO’s To-Do List

• The Defense Department

• NASA: Predict Earthquakes (especially after Nepal’s Earthquake)

• NSA, Homeland Security: Predict and Prevent Terrorist Acts

• Internet start-ups

• Financial institutions


• Volume: Unprecedentedly Huge Volume of Data fueled by web-based business, social networking, micro blogs (e.g., click streams captured in web server logs)

e.g., eBay processes 8 petabytes of data per night

• Various Structures of Data (No Structure):

Structured (Database, Data Warehouse)

Semi-structured (Web pages) and

Unstructured (Web Server Log, Sensor Data) – most of the time!!

• Velocity: New data is generated at an unprecedentedly high rate

e.g., Streaming Twitter Messages

Machine-generated data streaming in from smart devices, sensors, monitors and meters needs big data analytics


• Numerous new analytic and business intelligence opportunities like:

• Fraud detection

• Customer profiling

• Customer loyalty analysis

• All of which directly affect revenue of business and critical business decisions.


• Identifying Field Specific Motive/Purposes

• Identify Nature of Big Data Source and Data Specific Processes

• Decisions on Building IT Infrastructure of Big Data Processing Systems

• Public Cloud/Private Cloud

• Which MPP Big Data Systems should be built for our specific Big Data Source and Volume

• Execution of Data Analytics

• Data Source Modeling

• Apply Data Mining Strategies

• Research solutions

• Implement Big Data Processing Steps for Solutions/Strategies

• Analyze Results/Interpretation -- Feedback


Massively Parallel Processing (MPP)

• Parallel Data Warehouse (PDW) System

Oracle, IBM, Teradata, Microsoft

• Hadoop System with Map Reduce

Google, Yahoo, Facebook, Twitter, LinkedIn

• Hybrid of Both

• MPP System on Cloud

Amazon, Google, Microsoft, Oracle
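The MapReduce model named in the list above can be sketched in a few lines. This is a single-process illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are hypothetical, chosen to show the map, shuffle and reduce steps that an MPP system would spread across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for token in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical web server log fragments, invented for illustration.
lines = ["GET /home", "GET /cart", "POST /cart"]
counts = word_count(lines)
print(counts)
```

In a real cluster each map call would run on the node that holds its slice of the input, and the shuffle step would move data over the network; the sketch keeps everything in memory to make the three phases visible.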


• MPP System

• Virtual Machine (VM)

• Cloud Type

Cloud as Service

Cloud as Platform

Cloud as Service

• Amazon Elastic Cloud

• Google Cloud

• Microsoft Cloud: Azure


Anomaly detection: The identification of unusual data records that might be interesting, or data errors that require further investigation.

Association rule learning (dependency modelling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Clustering: The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

Classification: The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression: Attempts to find a function which models the data with the least error.

Summarization: Providing a more compact representation of the data set, including visualization and report generation.

Results validation
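The market basket example under association rule learning can be made concrete with a small sketch. It assumes the usual definitions of support (the share of baskets that contain both items) and confidence (the share of baskets with the first item that also contain the second). The baskets, the thresholds and the function name `pair_rules` are hypothetical, and only item pairs are considered, which is a simplification of full association rule mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for rules between item pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data every basket that contains butter also contains milk, so the rule butter -> milk comes out with confidence 1.00, the kind of pattern a supermarket could use for product placement.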


• Statistics

Naive Bayes, Clustering

• Machine Learning

• Classification Algorithms: Decision Tree, Neural Network, Support Vector Machine

New Algorithm: Convolutional Neural Network - still evolving at a fast rate

• Database

• Association Rule Mining, Data Warehouse OLAP

• Big Data Processing -> Most Recent - still evolving at a fast rate

• Information Retrieval

• Google Search Engine -> Artificial Intelligence - still evolving at a fast rate


• Databases

Advanced Modern Databases and Data Processing Strategies

• Big Data Processing with:

• Parallel Data Warehouse and OLAP (Online Analytic Processing)

• Map Reduce

• Hadoop Based MPP Systems

• Statistics

• Data Mining

- Database: Association Rule Mining, Data Warehouse OLAP

- Statistics: Bayesian, Clustering

- Machine Learning: Decision Tree, Neural Network

And More on recent developments


MPP Systems

• Parallel Data Warehouse Based Systems:

• Oracle, Teradata, Microsoft PDW, IBM

• In-Memory NewSQL Systems

• Hadoop/MapReduce Based Systems: NoSQL systems

• MongoDB

• Pig Latin

• HBase

• Hive

• And So Many Others

• Cloud: Big Data Processing Systems on Cloud

• Google Cloud, Amazon Cloud, Microsoft Azure, Oracle, IBM


http://blogs.the451group.com/opensource/2011/04/15/nosql-newsql-and-beyond-the-answer-to-sprained-relational-databases/


Popular Free Open Source

• R/ Map R: A programming language and software environment for statistical computing, data mining, and graphics. GNU Project.

• Spark: Streaming Data Processing

• Google TensorFlow: Python, C++ based Image Processing Library, Natural Language Processing Library

• Weka: A suite of machine learning software applications written in the Java programming language

• UIMA (Unstructured Information Management Architecture): A component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM

Major Commercial:

• SAS Enterprise Miner

• Microsoft Business Intelligence Data Analytic Tool using Databases


On Databases

CIS 530 : Database Concept and Modern Database Processing

CIS 611 : Advanced Data Processing Techniques in PDW

– Parallel Data Warehouse and OLAP

On Big Data Processing and Management Systems

CIS 612 : Big Data Processing Systems

and Modern Database Programming

– Hadoop and MapReduce

- VM(Virtual Machine), Cloud

CIS 695: Practicum in Data Analytics and Big Data Processing

(Scheduled to be created in Spring 2016)

CIS 696: One more new course will be created on recent research


• Data Mining

CIS 660: Data Mining Techniques from Database, Statistics

and Machine Learning, Text/Web Mining Techniques

EEC 525 Data Mining


• Math and Statistics

Graduate Certificate in Applied Predictive Modeling

MTH 521 : Time Series Analysis

MTH 531 : Categorical Data Analysis

MTH 537 : Operation Research

MTH 567 : Applied Linear Models I

MTH 638 : Operation Research II

MTH 668 : Applied Linear Models II

MTH 675 : Applied Multivariate Statistics


• Business Analytic Certificates

Focus on SAS Certificate with SAS Enterprise Miner Tool

BUS 575 : Introduction to Business Analytics

BUS 600 : Applied Business Analytics

BUS 601 : Managing Databases for Business Analytics

BUS 602 : Strategy for Business Analytics

BUS 603 : SAS for Data and Statistical Analysis

BUS 604: Advanced Business Analytics I

BUS 606: Practicum in Business Analytics


• Explorys by IBM

• Website: https://www.explorys.com/

• Data Analytics/Big Data Processing on Health and Wellness Data

• Data Analytics for Cleveland Clinic (Teradata PDW), Metro Health

• Progressive

• Big Data Processing on Auto Insurance: Hadoop Based MPP Systems

• PNC (Teradata MPP PDW)

• Big Data Processing Systems on Financial Data


• Hadoop Big Data Processing Workshop/Meetup

The EECS Dept. of CSU is planning to host the meeting annually to connect our students to local Big Data companies

• Data Scientist Group

Regular webinar on Advanced Data Analytic Topics


Current Research/Publications at CSU (by Sunnie Chung)

• Research on the Problems in Developing MPP Systems

• Research on Integrating Big Data Management Systems (BDBMS) -- Most recent research trends

• Research on Data Mining for Machine Fault Detection


• 10 out of 23 Programs are Master's Degrees in Business Analytics

• Limited in Basic Statistics and Marketing/Business Oriented

• Data Mining Tools Only (SAS, MS BI Data Analysis Tool)

• For Data Scientist Oriented Programs (Typical East Coast Theory Oriented Programs: Columbia, NYU, DePaul)

• Focus on Predictive Analysis Skill (Math and Stats),

• Computational Theory on Machine Learning Algorithms Oriented

• Lack of Practical Data Processing Courses or Big Data System/Cloud

• Not Many Courses are available

• Good Data Analytics Programs with Good Balance of Core Subjects, Analytic Skills and Practicum

• Northwestern University

• Indiana University Bloomington

• Carnegie Mellon


MSIA 401 Statistical Methods for Data Mining

MSIA 431 Analytics for Big Data

MSIA 489 Industry Practicum

MSIA 490-21 Predictive Models for Credit Risk Management

MSIA 490-23 Healthcare Analytics

MSIA 490-25 Intro to Java Programming

MSIA 490-27 Social Networks Analysis

MSIA 490 Intro to Databases & Information Retrieval

MSIA 411 Data Visualization

MSIA 420 Predictive Analytics

MSIA 421 Data Mining

MSIA 430 Introduction to Data Warehousing and Workflow Management

MSIA 490-20 Text Analytics

MSIA 490-20 Topics in Analytics with Python

MSIA 440 Optimization and Heuristics


• 2 years of Master of Data Science/Data Analytics or

• Hybrid: Master of Data Science and Computer Information Science

• Good balance of Courses on Core Subjects:

Big Data Processing Application

Advanced Database

Advanced Algorithm

Statistics

Data Mining

Security in Network System

Information Visualization

Cloud Computing

• A variety of good related courses is available


• MSIT – Business Intelligence & Data Analytics Curriculum:

• Prerequisite: OOP Programming Courses and 3 years Working Experience

Course #   Core Courses (60 units required)       Units
95-703     Database Management                    12
95-796     Statistics for IT Managers             6
95-710     Economic Analysis                      6
95-797     Data Warehousing                       6
94-806     Privacy in the Digital Age             6
95-868     Exploring and Visualizing Data         6
95-791     Data Mining                            6
95-852     Analytics and Business Intelligence    6
95-866     Advanced Business Analytics            6


• 30 credit hours in 2 years

CIS 530 : Database Concept and Modern Database Processing

CIS 611 : Advanced Data Processing Techniques in Parallel Data Warehouse and OLAP

CIS 612 : Big Data Processing Systems and Information Retrieval

Hadoop and MapReduce

VM(Virtual Machine), Cloud

CIS 695: Practicum in Data Analytics and Big Data Processing (In Spring 2016)

CIS 660: Data Mining Techniques from Database, Statistics and Machine Learning

EEC 525 Data Mining: Web Data Mining Techniques from Database

CIS 660: Advanced Algorithm

CIS 340: System Programming

CIS 260: Java Programming

CIS 675 Information Security

EEC 693 Network Security and Privacy

Applied Predictive Modeling:

MTH 531 : Categorical Data Analysis

MTH 567 : Applied Linear Models I

MTH 668 : Applied Linear Models II

MTH 675 : Applied Multivariate Statistics

BUS 603 : SAS for Data and Statistical Analysis

BUS 604: Advanced Business Analytics I

BUS 606: Practicum in Business Analytics


• Data Visualization
