Big Data Zach Robison, Craig Byrd, Cody Cobb, Thomas Wynn


Page 1: Big Data Zach Robison, Craig Byrd, Cody Cobb, Thomas Wynn

Big Data

Zach Robison, Craig Byrd, Cody Cobb, Thomas Wynn

Page 2:

Overview

What is Big Data?

Characteristics

Architecture

Business Analytics

Applications of Big Data

Challenges and Critiques

The Future of Big Data

Page 3:

What is Big Data?

Big data is similar to small data, except that it is, well, big. It can be characterized as data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process.

The “size” of big data keeps shifting: what counts as big has grown from a few dozen terabytes to many petabytes of data.

Working with data at this scale, however, requires new technologies and techniques to make use of it.

Definition: Big data represents information assets characterized by such high volume, velocity, and variety that specific technology and analytical methods are required to transform them into value.

Page 4:

A Short History

1965 – The US govt. plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape.

1970 – The relational database model is developed.

1996 – Digital storage becomes cheaper than paper.

1999 – The term “big data” is first used in an academic paper.

2001 – The three “Vs” (volume, velocity, variety) are first defined by Doug Laney.

2005 – Hadoop is developed (later an Apache project).

2008 – 9.57 zettabytes are processed globally.

2010 – A McKinsey report states that by 2018 the US will face a shortfall of professional data scientists, and that the full value of big data has yet to be realized.

Page 5:

Characteristics of Big Data

Volume

Variety

Velocity

Veracity

Known as “The Four V’s of Big Data”

Also:

Complexity

Variability

Page 6:

The Four “Vs”

Page 7:

Volume

Scale of Data

Large amounts of data

Increasing exponentially

There are 2.5 quintillion bytes of data generated every day; so much that 90% of the data in the world today was created in the last two years.

Comes from many places:

Business Transactions

Social media Sites

Location-based Data

Page 8:

Variety

Different forms of data

Relational Data (Tables/Transaction/Legacy Data)

Text Data (Web)

Graph Data

Streaming Data

Big Public Data

Structured, unstructured, images, etc.

Page 9:

Velocity

Speed at which data is processed

Data is generated quickly, so it must be processed quickly.

Late decisions = missed opportunities

Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure

New York Stock Exchange captures 1TB of trade information each trading session

Page 10:

Veracity

Uncertainty of Data

“Accuracy, Fidelity, or Truthfulness”

Poor data quality costs the US economy around $3.1 Trillion a year

1 in 3 Business Leaders don’t trust the information they use to make decisions

Page 11:

Variability & Complexity

Some definitions of big data also include these characteristics.

Variability

Refers to the inconsistency the data can show at times, which hampers the ability to handle and manage it effectively; this can be a problem for those who analyze the data.

Complexity

Data management can be a very complex process, especially when large volumes of data come from multiple sources. These data must be linked, connected, and correlated in order to grasp the information they are supposed to convey.

Page 12:

Architecture

5C Architecture

Connection

Conversion

Cyber

Cognition

Configuration

Big data analytics for manufacturing applications is based on this architecture.

Page 13:
Page 14:

MPP Architecture

Big data requires specialized technology to process large amounts of data efficiently.

For example: MPP (massively parallel processing) databases and architectures

A program is processed by multiple processors, each working on a different part of it

Each processor uses its own data
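The divide-and-combine idea behind MPP can be sketched in a few lines of Python. This is a toy sketch under stated assumptions: real MPP systems run on separate machines with their own storage, and `mpp_sum`, `partial_sum`, and the thread pool here are stand-ins for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(shard):
    # Each "processor" works only on its own shard of the data.
    return sum(shard)

def mpp_sum(data, nodes=4):
    # Split the data into one shard per node (shared-nothing style):
    # no node ever touches another node's shard.
    shards = [data[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(partial_sum, shards))
    # A final gather step combines the per-node partial results.
    return sum(partials)

print(mpp_sum(list(range(1000))))  # 499500, same answer as a single-node sum
```

The point of the sketch is the data layout: each worker sees only its own partition, and only the small partial results are combined at the end.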

Page 15:
Page 16:

Big Data Lake

The capture and storage of data has evolved into a sophisticated system.

A big data lake allows an organization to shift focus from centralized control to a shared model, in response to the changing dynamics of information management.

It enables quick segregation of data into the data lake, thereby reducing overhead time.

Page 17:

Big Data Lake Cont.

Uses a flat architecture to store data

Each data element is assigned a unique identifier and tagged with a set of extended metadata tags

If a business needs a question answered, the data lake can be queried for relevant data and that smaller set of data can then be analyzed to help answer the question
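The identifier-plus-tags scheme above can be sketched as a tiny in-memory model. This is illustrative only: the `DataLake` class, its `ingest`/`query` methods, and the sample elements are made up for this example, not a real product API.

```python
import uuid

class DataLake:
    """Toy flat data lake: every element gets a unique identifier and a set
    of metadata tags, and queries filter on those tags."""

    def __init__(self):
        self.store = {}  # flat namespace: unique id -> (tags, raw element)

    def ingest(self, element, tags):
        # Assign a unique identifier and attach the extended metadata tags.
        element_id = str(uuid.uuid4())
        self.store[element_id] = (set(tags), element)
        return element_id

    def query(self, *tags):
        # Pull only the smaller, relevant subset for downstream analysis.
        wanted = set(tags)
        return [elem for t, elem in self.store.values() if wanted <= t]

lake = DataLake()
lake.ingest({"order": 17, "amount": 20.0}, tags=["sales", "2015"])
lake.ingest({"post": "big data!"}, tags=["social"])
print(lake.query("sales"))  # only the sales element comes back
```

Note the flat layout: there is no hierarchy of folders or schemas, only identifiers and tags, which is what lets relevant data be pulled out quickly when a question arrives.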

Page 18:

Storing and Harnessing Big Data

Apache Hadoop Distributed File System

MongoDB

Big Data Analytics

OLTP

OLAP

Page 19:

Apache Hadoop

The core of Hadoop consists of a storage part (HDFS) and a processing part (MapReduce)

Files are split into large blocks and distributed among the nodes in the cluster allowing the nodes to locally analyze data instead of accessing the entire database

This is more efficient than the conventional supercomputer architecture, which uses parallel processing over high-speed networks

Page 20:

Apache Hadoop continued

MapReduce – a JobTracker schedules jobs (FIFO by default) and tracks the nodes where data is located

Provides a programming model that addresses the problem of disk reads and writes

Transforms data into a computation of keys and values.

HBase – based on Google’s BigTable, a fault-tolerant way of storing large quantities of sparse data

Hive – Data warehouse infrastructure built on top of Hadoop.

Developed by Facebook

Used by both Netflix and Amazon (for S3)
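The MapReduce model described above can be sketched with the classic word-count example. This is a single-process Python sketch of the programming model, not Hadoop's actual Java API; the document strings are made up.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's list of values into a final result.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In Hadoop, each map call would run on the node that already holds that block of the file, which is what lets the cluster analyze data locally instead of shipping it across the network.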

Page 21:

PostgreSQL

An open-source object-relational database

Postgres-XL

An “extra large” version of Postgres used to handle big data

Shards data across multiple nodes

Supports business intelligence applications
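Sharding of the kind Postgres-XL performs comes down to deterministic hash routing: the same key always maps to the same node. A minimal sketch follows; `node_for` and the four-node cluster size are assumptions for illustration, not Postgres-XL's actual distribution function.

```python
from zlib import crc32

NUM_NODES = 4  # assumed cluster size for this sketch

def node_for(key):
    # Deterministic hash routing: the same key always lands on the same node.
    return crc32(str(key).encode()) % NUM_NODES

# Route rows into per-node buckets; each node "owns" only its own shard.
nodes = {n: [] for n in range(NUM_NODES)}
for customer_id in range(10):
    nodes[node_for(customer_id)].append(customer_id)

# Every row lives on exactly one node, so a lookup by key touches one node
# while full scans can run on all nodes in parallel.
print(nodes)
```

Deterministic routing is what makes both sides work: point queries go straight to one shard, and analytical scans fan out across all shards at once.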

Page 22:

MongoDB

Cross Platform

Classified as a NoSQL Database

Document-oriented database

Eschews the traditional table-based relational database structure

Uses JSON-like documents with dynamic schemas

Makes the integration of applications easier and faster
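The contrast with fixed table schemas is easy to show with plain Python. Below, two documents in the same "collection" carry different fields, and a tiny stand-in `find` helper mimics the field-matching style of MongoDB queries; this is illustrative only, not the pymongo API, and the sample documents are made up.

```python
import json

# Two documents in one "collection" with different (dynamic) schemas --
# a fixed relational table would need NULL columns or extra tables for this.
users = [
    {"_id": 1, "name": "Ada", "languages": ["Python", "C"]},
    {"_id": 2, "name": "Grace", "employer": "Navy", "languages": ["COBOL"]},
]

def find(collection, **criteria):
    # Match documents field-by-field, MongoDB-query style.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(json.dumps(find(users, name="Grace"), indent=2))
```

Because documents can carry whatever fields an application needs, new attributes can be added without migrating a schema, which is much of what makes application integration faster.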

Page 23:

Big Data Analytics

OLTP – Online Transaction Processing

Facilitates and manages transaction-oriented applications

Typically used for data entry and retrieval of transaction processing

OLAP – Online Analytical Processing

Enables users to interactively analyze multidimensional data from multiple perspectives

Allows complex analytical and ad hoc queries with a rapid execution time
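The two workloads can be contrasted on one tiny table using Python's built-in sqlite3 module; the `sales` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# OLTP-style workload: many small, row-level writes as transactions occur.
rows = [("East", "widget", 10.0), ("East", "gadget", 25.0),
        ("West", "widget", 10.0), ("West", "widget", 30.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()

# OLAP-style workload: an ad hoc aggregate across dimensions (region x product).
for row in conn.execute("""
        SELECT region, product, SUM(amount) AS total
        FROM sales GROUP BY region, product ORDER BY region, product"""):
    print(row)
```

The insert loop is the transaction-processing side (data entry and retrieval); the GROUP BY query is the analytical side, slicing the same data by its dimensions.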

Page 24:

Big Data Analytics Continued

Teradata

A massively parallel processing system running a shared-nothing architecture

Nodes back each other up during downtime and load-balance during normal operation

Differentiates between “hot” and “cold” data, i.e., data that is not often used is stored in slower storage sections

Partnered with Microsoft, SAP, Symantec, IBM, and NetApp
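The hot/cold idea above can be sketched as a promotion rule: a key that is read often enough migrates to the fast tier. The `TieredStore` class and its three-read threshold are assumptions for this sketch, not Teradata's actual placement policy.

```python
from collections import Counter

HOT_THRESHOLD = 3  # reads before a key counts as "hot" (assumed cutoff)

class TieredStore:
    """Sketch of hot/cold data placement across two storage tiers."""

    def __init__(self, data):
        self.cold = dict(data)  # slower, cheaper storage for rarely used data
        self.hot = {}           # faster storage for frequently used data
        self.reads = Counter()

    def get(self, key):
        self.reads[key] += 1
        if key in self.hot:
            return self.hot[key]
        value = self.cold[key]
        if self.reads[key] >= HOT_THRESHOLD:
            self.hot[key] = self.cold.pop(key)  # promote to the fast tier
        return value

store = TieredStore({"q1": 100, "q2": 200})
for _ in range(3):
    store.get("q1")  # repeated reads make "q1" hot
print(sorted(store.hot), sorted(store.cold))  # ['q1'] ['q2']
```

A real system would also demote data that cools off over time; the point here is only that placement follows measured access frequency rather than a fixed assignment.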

Page 25:

Big Data Analytics Continued

Why is it important?

Helps companies make more informed business decisions

Lets analytics professionals analyze large volumes of transactional data

Gives access to forms of data that generally go untapped by conventional business intelligence

Page 26:

Big Data Analytics Continued

The processing power in data centers is growing quickly, and server clusters are getting larger.

Companies like Hortonworks focus on problems such as these.

Their focus is on making data analytics faster and more efficient in a large environment.

Page 27:
Page 28:

Big Data Analytics Continued

Oracle Big Data Appliance

Consists of hardware and software from Oracle designed to integrate both structured and unstructured data for enterprise purposes.

Includes:

Oracle Exadata Database Machine

Oracle Exalytics Business Intelligence Machine

Apache Hadoop

Oracle NoSQL Database

Oracle Data Integrator

Page 29:
Page 30:
Page 31:

Applications of Big Data

Between 1990 and 2005, more than a billion people worldwide entered the middle class. As people gain income they tend to become more literate, and rising literacy in turn drives information growth.

This has increased the need for, and usefulness of, big data in applications such as government, business, and science.

There are some problems and criticisms of this, however.

Page 32:

Government

Crime prediction and prevention

The increasing variety, velocity, and volume of data can help agencies identify emerging threats.

Social program fraud, waste, and errors

Big data can present a much clearer citizen-centric picture and uncover invisible connections, for example to proactively detect fraud and abuse.

Tax fraud and abuse

Big data can help tax agencies more accurately determine who should be investigated for fraud or denied refunds by detecting deception tactics or uncovering multiple identities.


Page 33:

International development

Known as ICT4D (information and communication technologies for development)

Advancements have created cost-effective opportunities to improve development in areas such as health care, poverty, resource management, employment, and natural disaster response.

Challenges include inadequate technological infrastructure and economic and resource scarcity, which exacerbate existing concerns such as privacy, interoperability, and imperfect methodologies.


Page 34:

Manufacturing

Prognostics and Health Management

Predictive manufacturing requires vast amounts of data to be successful.

Sensor data such as acoustics, vibration, pressure, and voltage are often used for this.

5S approach

Big data provides an infrastructure for transparency in the manufacturing industry.

It unravels uncertainties such as inconsistent component performance and availability.


Page 35:

Private Sector

Companies in retail, real estate, marketing, etc. are using big data more and more to increase their bottom line.

Walmart handles over a million customer transactions an hour, which are imported into databases containing over 2.5 petabytes of data.

Companies like Facebook, Google, and eBay have vast amounts of big, unstructured data that they use to increase their profits.

Some consider this to be an invasion of privacy and a dirty marketing technique.


Page 36:

Applications of Big Data

Facebook

A large reason for Facebook’s growth has been its use of big data; it maintains one of the world’s largest repositories of data.

Facebook uses enormous amounts of data to connect people with things they might like (other people, advertisements, etc.)

Their service “Graph Search” is an advanced search method that allows people to refine searches by very specific criteria.

You might sometimes see “Sponsored Posts” on Facebook; these rely heavily on big data to serve advertisements based on what users might like.

These practices raise privacy concerns for some users.

Page 37:

Science

Many scientific advancements have been, and can be, made with the help of big data.

The Large Hadron Collider has over 150 million sensors delivering data 40 million times per second.

Internet of Things

A network of physical objects embedded with electronics and software, enabling further value in many different applications.


Page 38:

Internet of Things

https://www.youtube.com/watch?v=LVlT4sX6uVs


Page 39:

Data Mining

Without big data, data mining would not be the same.

Simply put, big data is the asset and data mining is the handler used to provide results from those assets.

Data Warehousing

While they sound similar, they are not quite the same, though they can be complementary.

Data warehousing can provide a way to organize big data in a meaningful way, though neither are required for the other to exist in a business setting.


Page 40:

Critiques and Challenges of Big Data

Big data needs to be complemented by “big judgment”.

That is, few people have the skills to correctly analyze big data and determine what is and isn’t useful.

Big data analysis is often framed around the past, or at best the present.

Since many algorithms are fed data from the past, big data analysis may not be useful for predicting the future in periods of rapid change.

Meeting the need for speed

In today’s hypercompetitive environment, speed is important but can be difficult to achieve given the amount of data constantly being created.

Page 41:

Dealing with outliers

Outliers typically represent 1 to 5 percent of big data, but it can be hard to determine what counts as an outlier when there is so much data.
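One simple (and imperfect) way to decide what counts as an outlier is a z-score rule: flag points far from the mean, measured in standard deviations. The cutoff of 2 below is an arbitrary choice for illustration, and choosing it is exactly the kind of judgment call described here; the sample readings are made up.

```python
from statistics import mean, stdev

def flag_outliers(values, z_cutoff=2.0):
    # Flag points more than z_cutoff standard deviations from the mean.
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > z_cutoff * sigma]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]
print(flag_outliers(readings))  # [500]
```

One caveat worth noting: the extreme point itself inflates the mean and standard deviation, so a single rule like this becomes less reliable as the data gets messier.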

Privacy concerns

Many citizens are concerned about the amount of data that businesses and the government are collecting.

Some believe that enhanced visualization techniques and software are the key to overcoming the challenges of big data.


Page 42:

The Future of Big Data

The industry is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.

The McKinsey Global Institute estimates that data volume is growing more than 40% per year, and will grow 44x between 2009 and 2020.

The global Hadoop market will reach a value of $50.2 billion by 2020.

Page 43: