Geolocation Data Analysis for Safe Residence using HiveQL

By Priyanka Kale

Guide: Dr. Jongwook Woo

High-Performance Information Computing Center (HiPIC), California State University Los Angeles

Data Analysis Using HiveQL & Tableau

Myself

Name: Priyanka Kale

Experience: Member of HiPIC (High-Performance Information Computing Center), Computer Information Systems Department, California State University Los Angeles, since December 2015. http://web.calstatela.edu/centers/hipic/

Yelp Data Analysis:

Yelp provides an open dataset for its yearly challenge. Using this dataset, I built a Microsoft Azure Machine Learning model that predicts the rating (stars) a business will receive, based on its past review counts and ratings. The model is built using the Multiclass Logistic Regression algorithm. https://github.com/priya708/Yelp-Data-Analysis

Contents

• Introduction
• Big Data
• Process Flow
• Specifications
• Implementation
• Visualization
• GitHub
• Business Perspective
• References

Introduction

Goal: To determine whether a location is safe by analyzing a large crime dataset (1.3 GB) for the city of Chicago, IL, collected from 2001 to the present (November 2015).

This is a study of a real dataset provided by the United States government, using Big Data analytics and related tools.

Query output is visualized with graphs and maps for easier interpretation.

Big Data

• Volume
• Complexity
• Variety
• Variability

Process-Flow

1. Download the dataset
2. Upload the data file to Hue
3. Run queries via Beeswax
4. Download the result CSV file
5. Visualize the result file using Google Fusion Tables

Specifications

• Microsoft Azure Hortonworks Sandbox:
1. Linux system
2. No. of nodes: 4
3. 8 cores
4. Size: 14 GB

Implementation

Hue is a web application for browsing HDFS and for working with Hive queries, Cloudera Impala queries, and MapReduce jobs.

Creation of tables in HCatalog

Hive and Beeswax

Hive is an infrastructure built on top of Hadoop for data summarization, query, and analysis. Beeswax is an application for running Hive queries.

Processing in Beeswax

Queries and Visualization

Total number and rank of each crime type:

SELECT primary_type, count(iucr), rank() OVER (ORDER BY count(iucr) DESC)
FROM crime
GROUP BY primary_type
LIMIT 100;
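
To make the ranking semantics concrete, here is a minimal Python sketch of what this query computes: one count per crime type, ranked in descending order of that count. It mirrors SQL RANK(), where tied counts share a rank and the next rank skips ahead. The function name `rank_by_count` and the sample rows are hypothetical and invented for illustration; they are not taken from the project or the Chicago dataset.

```python
from collections import Counter


def rank_by_count(values):
    """Return (value, count, rank) tuples ordered by count, descending.

    Ties share a rank and the following rank skips ahead, like SQL RANK().
    Assumes every input row has a non-null iucr, so count(iucr) equals the
    number of rows per value.
    """
    counts = Counter(values)
    # Sort by count descending; break ties by name so the output is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    prev_count, prev_rank = None, 0
    for position, (value, count) in enumerate(ordered, start=1):
        rank = prev_rank if count == prev_count else position
        ranked.append((value, count, rank))
        prev_count, prev_rank = count, rank
    return ranked


# Toy primary_type values, invented for illustration only.
sample_primary_types = [
    "THEFT", "THEFT", "THEFT",
    "BATTERY", "BATTERY",
    "NARCOTICS", "NARCOTICS",
    "ASSAULT",
]
result = rank_by_count(sample_primary_types)
for row in result:
    print(row)
```

In this toy input, the two types with equal counts share rank 2 and the next type receives rank 4, which is the gap behaviour to expect from rank() in the query.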

Number of crimes by location type for a given area:

SELECT location_description, count(iucr)
FROM crime
WHERE address = '008XX N MICHIGAN AVE'
GROUP BY location_description
LIMIT 100;
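
The same kind of aggregation can be tried locally without a Hadoop cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive; that substitution is an assumption made purely for illustration, since the project ran its queries in Beeswax. The three-column table, the iucr placeholder codes, and the rows are invented toy data, and the ORDER BY clause is added here only to make the output deterministic.

```python
import sqlite3

# In-memory database standing in for the Hive "crime" table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crime (iucr TEXT, address TEXT, location_description TEXT)"
)

# Toy rows, invented for illustration; iucr values are placeholders.
rows = [
    ("0001", "008XX N MICHIGAN AVE", "DEPARTMENT STORE"),
    ("0002", "008XX N MICHIGAN AVE", "DEPARTMENT STORE"),
    ("0003", "008XX N MICHIGAN AVE", "STREET"),
    ("0004", "037XX W OHIO ST", "APARTMENT"),
]
conn.executemany("INSERT INTO crime VALUES (?, ?, ?)", rows)

# Count crimes per location type for one address block.
query = """
    SELECT location_description, COUNT(iucr) AS total
    FROM crime
    WHERE address = ?
    GROUP BY location_description
    ORDER BY total DESC, location_description
"""
counts = conn.execute(query, ("008XX N MICHIGAN AVE",)).fetchall()
conn.close()

for location, total in counts:
    print(location, total)
```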

Query to determine which type of crime is most frequent in a given area:

SELECT Crime_type, count(iucr), rank() OVER (ORDER BY count(iucr) DESC) AS rank
FROM crime
WHERE address = '037XX W OHIO ST'
GROUP BY Crime_type;

Query to determine the number of crimes that occurred in each month:

SELECT Month, count(Case_number)
FROM crime
GROUP BY Month;

Final Outcome of Analysis:

Query to determine the most unsafe and the safest areas in Chicago:

CREATE TABLE UnsafeArea
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS RCFile
AS
SELECT address, count(iucr) AS total_crimes, rank() OVER (ORDER BY count(iucr) DESC) AS rank
FROM crime
GROUP BY address;
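
Once per-address totals exist, the most unsafe area is the row with the highest total (rank 1) and the safest recorded area is the row with the lowest total. The short Python sketch below shows that reading on invented numbers; the helper name `extremes`, the address `000XX EXAMPLE ST`, and all counts are hypothetical. An address with no recorded crime never appears in the crime table, so "safest" here can only mean the fewest recorded incidents.

```python
def extremes(totals):
    """Return (most_unsafe, safest) as (address, total_crimes) pairs.

    totals is a non-empty list of (address, total_crimes) tuples.
    When several addresses tie, the first one in the list is returned.
    """
    if not totals:
        raise ValueError("totals must not be empty")
    most_unsafe = max(totals, key=lambda row: row[1])
    safest = min(totals, key=lambda row: row[1])
    return most_unsafe, safest


# Toy per-address totals, invented for illustration only.
sample_totals = [
    ("000XX EXAMPLE ST", 120),
    ("008XX N MICHIGAN AVE", 45),
    ("037XX W OHIO ST", 7),
]
most_unsafe, safest = extremes(sample_totals)
print("most unsafe:", most_unsafe)
print("safest recorded:", safest)
```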

GitHub

URL: https://github.com/priya708/Project-520

Business Perspective

Get better advertisement

Predictive Policing for Police department: The future of Law enforcement?

• Reducing Random Gunfire
• Connecting Burglaries and Code Violations

References

https://catalog.data.gov

https://cwiki.apache.org/confluence/display/Hive/Tutorial

https://hortonworks.com/tutorials