Big Data
Zach Robison, Craig Byrd, Cody Cobb, Thomas Wynn
Overview
What is Big Data?
Characteristics
Architecture
Business Analytics
Applications of Big Data
Challenges and Critiques
The Future of Big Data
What is Big Data?
Big data is similar to small data, except it is, well, big. It is commonly described as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process.
The threshold for "big" varies, and has grown from a few dozen terabytes to many petabytes of data.
Working with data at this scale, however, requires new technologies and techniques to make use of it.
Definition: Big Data represents information assets characterized by such high volume, velocity, and variety as to require specific technology and analytical methods for their transformation into value.
A Short History
1965 – The US govt. plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape.
1970 – Relational Database model developed
1996 – The price of digital storage falls to cheaper than paper.
1999 – First use of term big data in academic paper
2001 – Three “Vs” first defined by Doug Laney
2005 – Hadoop developed by Apache
2008 – 9.57 zettabytes processed globally.
2010 – A McKinsey report states that by 2018 the US will face a shortfall of professional data scientists, and that the full value of big data has yet to be realized.
Characteristics of Big Data
Volume
Variety
Velocity
Veracity
Known as “The Four V’s of Big Data”
Also:
Complexity
Variability
The Four “Vs”
Volume
Scale of Data
Large amounts of data
Increasing exponentially
There are 2.5 Quintillion bytes of data generated every day (so much that 90% of the data in the world today has been created in the last two years)
Comes from many places:
Business Transactions
Social media Sites
Location-based Data
Variety
Different forms of data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Graph Data
Streaming Data
Big Public Data
Structured, Unstructured, images, etc.
Velocity
Speed at which data is processed
Data is being generated fast so it needs to be processed fast
Late Decisions = missed opportunities
Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure
New York Stock Exchange captures 1TB of trade information each trading session
Veracity
Uncertainty of Data
“Accuracy, Fidelity, or Truthfulness”
Poor data quality costs the US economy around $3.1 Trillion a year
1 in 3 Business Leaders don’t trust the information they use to make decisions
Variability & Complexity
Some definitions of big data also include these characteristics.
Variability
Refers to the inconsistency the data can show at times, which hampers the process of handling and managing it effectively, and which can be a problem for those who analyze the data.
Complexity
Data management can be a very complex process, especially when large volumes come from multiple sources. These data need to be linked, connected, and correlated in order to grasp the information they are meant to convey.
Architecture
5C Architecture
Connection
Conversion
Cyber
Cognition
Configuration
Big data analytics for manufacturing applications is often based on this architecture.
MPP Architecture
Big data requires certain technology to efficiently process large amounts of data.
For example:
MPP (massively parallel processing) databases and architecture
Processing of a program by multiple processors that work on different parts of the program
Each processor uses its own data
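The shared-nothing idea above can be sketched in a few lines of Python: each "node" owns a disjoint partition of the data and aggregates it independently, and a coordinator combines the partial results. This is a minimal single-machine illustration using threads to stand in for separate MPP processors, not a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor

def node_sum(partition):
    """Each 'node' aggregates only its own partition (shared-nothing)."""
    return sum(partition)

def mpp_sum(data, n_nodes=4):
    # Split the data so every node owns a disjoint slice.
    partitions = [data[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_sum, partitions))
    # A coordinator combines the partial results into the final answer.
    return sum(partials)

print(mpp_sum(list(range(1, 101))))  # 5050
```

In a real MPP database the partitions live on physically separate machines with their own disks and memory, so the aggregation scales out rather than up.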
Big Data Lake
The capturing and storing of data has evolved into a sophisticated system
Big Data Lake allows an organization to shift focus from centralized control to a shared model to respond to the changing dynamics of information management
Enables quick segregation of data into the data lake, thereby reducing overhead time
Big Data Lake Cont.
Uses a flat architecture to store data
Each data element is assigned a unique identifier and tagged with a set of extended metadata tags
If a business needs a question answered, the data lake can be queried for relevant data and that smaller set of data can then be analyzed to help answer the question
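The flat storage, unique identifiers, and metadata tags described above can be illustrated with a toy in-memory data lake. This is a hypothetical sketch (the `DataLake` class and its methods are invented for illustration), not any real data lake product.

```python
import uuid

class DataLake:
    """A toy data lake: flat storage keyed by unique IDs, plus metadata tags."""
    def __init__(self):
        self.store = {}     # id -> raw data element (any shape, no schema)
        self.metadata = {}  # id -> set of extended metadata tags

    def ingest(self, element, tags):
        uid = str(uuid.uuid4())          # unique identifier per element
        self.store[uid] = element
        self.metadata[uid] = set(tags)
        return uid

    def query(self, *tags):
        """Return only the elements relevant to the question being asked."""
        wanted = set(tags)
        return [self.store[uid] for uid, t in self.metadata.items()
                if wanted <= t]

lake = DataLake()
lake.ingest({"sku": 1, "qty": 3}, tags=["sales", "2015"])
lake.ingest("server started", tags=["logs", "2015"])
print(lake.query("sales"))  # [{'sku': 1, 'qty': 3}]
```

Querying by tag carves a small, relevant subset out of the lake, which is the step that makes downstream analysis tractable.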
Storing and Harnessing Big Data
Apache Hadoop Distributed File System
MongoDB
Big Data Analytics
OLTP
OLAP
Apache Hadoop
The core of Hadoop consists of a storage part (HDFS) and a processing part (MapReduce)
Files are split into large blocks and distributed among the nodes in the cluster allowing the nodes to locally analyze data instead of accessing the entire database
More efficient than the conventional supercomputer architecture using parallel processing over high-speed networks
Apache Hadoop continued
MapReduce – a JobTracker schedules jobs (FIFO by default) and tracks the nodes where data is located
Provides a programming model that addresses the problem of disk reads and writes
Transforms data into a computation of keys and values.
Hbase – based on Google’s BigTable, a fault-tolerant way of storing large quantities of sparse data
Hive – Data warehouse infrastructure built on top of Hadoop.
Developed by Facebook
Used by both Netflix and Amazon (for S3)
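The MapReduce model above (transforming data into a computation over keys and values) can be sketched with the classic word-count example in plain Python. This imitates the map, shuffle, and reduce phases in one process; real Hadoop runs each phase across a cluster, with the shuffle performed by the framework.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (key, value) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data is everywhere"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map call touches only its own document and each reduce call only its own key, both phases parallelize naturally across nodes, which is what lets Hadoop analyze data where it is stored.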
PostgreSQL
An open-source object-relational database
Postgres-XL
An “extra large” version of Postgres used to handle big data
Shards data across multiple nodes
Supports business intelligence applications
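The sharding mentioned above boils down to routing each row to a node based on a hash of its distribution key. A minimal sketch of that routing rule (the `shard_for` function is hypothetical, not Postgres-XL's actual implementation):

```python
import hashlib

def shard_for(key, n_nodes):
    """Route a row to a node by hashing its distribution key."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_nodes

# Distribute rows across three "nodes" by primary key.
nodes = [[] for _ in range(3)]
rows = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
for row in rows:
    nodes[shard_for(row[0], len(nodes))].append(row)

# The same key always hashes to the same node, so point lookups
# only ever need to visit one shard.
print(shard_for(2, 3) == shard_for(2, 3))  # True
```

Hash distribution keeps the shards roughly balanced without any central lookup table, at the cost of making queries that span many keys touch many nodes.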
MongoDB
Cross Platform
Classified as a NoSQL Database
Document-Oriented Database
Eschews the traditional table-based relational database structure
Uses JSON-like documents with dynamic schemas
Makes the integration of applications easier and faster
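The "JSON-like documents with dynamic schemas" idea can be shown with plain Python dictionaries: two documents in the same collection may carry entirely different fields, and queries match on whatever fields are present. This is a toy illustration (the `find` helper is invented), not the pymongo API.

```python
import json

# Two documents in the same "collection" with different fields --
# no table schema forces them to look alike.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["admin"], "age": 41},  # no email
]

def find(collection, **criteria):
    """Toy query: return documents whose fields equal the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(json.dumps(find(users, name="Bob")))
# [{"_id": 2, "name": "Bob", "tags": ["admin"], "age": 41}]
```

Because documents map directly onto the objects an application already uses, there is no relational mapping layer to maintain, which is what makes integration "easier and faster."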
Big Data Analytics
OLTP – Online Transaction Processing
Facilitates and manages transaction-oriented applications
Typically used for data entry and retrieval of transaction processing
OLAP – Online Analytical Processing
Enables users to analyze multi-dimensional data interactively from multiple perspectives
Allows complex analytical and ad hoc queries with a rapid execution time
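The multi-dimensional analysis OLAP enables can be sketched as a roll-up: aggregate a measure (units sold) along whichever dimensions the analyst chooses. This is a minimal in-memory illustration (the `roll_up` helper is invented), not a real OLAP engine.

```python
from collections import defaultdict

# Fact rows: (region, product, units_sold) -- two dimensions, one measure.
facts = [
    ("east", "widget", 10), ("east", "gadget", 5),
    ("west", "widget", 7),  ("east", "widget", 3),
]

def roll_up(rows, dims):
    """Aggregate the measure along the chosen dimensions (a simple roll-up)."""
    totals = defaultdict(int)
    for region, product, units in rows:
        key = tuple(v for v, d in ((region, "region"), (product, "product"))
                    if d in dims)
        totals[key] += units
    return dict(totals)

print(roll_up(facts, {"region"}))  # {('east',): 18, ('west',): 7}
```

Changing the dimension set answers a different ad hoc question against the same fact data, which is the interactivity the bullet above refers to.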
Big Data Analytics Continued
Teradata
A massively parallel processing system running a shared-nothing architecture
Nodes offer back-up for one another during downtime, and load balance during normal operation
Differentiates between “hot” and “cold” data, i.e., data that is not often used is stored in slower storage sections
Partnered with Microsoft, SAP, Symantec, IBM, and NetApp
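The hot/cold distinction above amounts to a tiering policy: items accessed often stay on fast storage, the rest move to slower, cheaper storage. A hypothetical sketch of such a policy (the threshold rule is invented for illustration, not Teradata's actual algorithm):

```python
def tier_data(access_counts, hot_threshold=10):
    """Place frequently accessed ('hot') items on fast storage, the rest cold."""
    hot, cold = {}, {}
    for item, count in access_counts.items():
        (hot if count >= hot_threshold else cold)[item] = count
    return hot, cold

hot, cold = tier_data({"q1_sales": 250, "q2_sales": 40, "2009_archive": 2})
print(sorted(hot))   # ['q1_sales', 'q2_sales']
print(sorted(cold))  # ['2009_archive']
```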
Big Data Analytics Continued
Why is it important?
Helps companies make more informed business decisions
Lets analytics professionals analyze large volumes of transactional data
Gives access to forms of data that generally go untapped by conventional business intelligence
Big Data Analytics Continued
The processing power in data centers is growing quickly and server clusters are getting large.
Companies like HortonWorks focus on problems such as these.
Their focus is on making data analytics faster and more efficient in a large environment.
Big Data Analytics Continued
Oracle Big Data Appliance
Consists of hardware and software from Oracle designed to integrate both structured and unstructured data for enterprise purposes.
Includes:
Oracle Exadata Database Machine
Oracle Exalytics Business Intelligence Machine
Apache Hadoop
Oracle NoSQL Database
Oracle Data Integrator
Applications of Big Data
Between 1990 and 2005, more than a billion people worldwide entered the middle class; as more people gain income and become literate, information growth accelerates.
This has resulted in the further need and usefulness for big data in various applications like government, business, science, etc.
There are some problems and criticisms of this, however.
Government
Crime prediction and prevention
The increasing variety, velocity, and volumes of data will require that agencies identify emerging threats.
Social program fraud, waste and errors
Big data can present a much clearer citizen-centric picture and uncover invisible connections to do things like proactively detect fraud and abuse.
Tax fraud and abuse
Big data can help tax agencies more accurately determine who should be investigated for fraud or denied refunds by detecting deception tactics or uncovering multiple identities.
Applications of Big Data
International development
Known as ICT4D (information and communication technologies for development)
Advancements have created cost-effective opportunities to improve development in areas such as health care, poverty, resource management, employment, and natural disaster response.
Challenges include inadequate technological infrastructure and economic and resource scarcity, which exacerbate existing issues such as privacy, interoperability, and imperfect methodologies.
Applications of Big Data
Manufacturing
Prognostics and Health Management
Predictive manufacturing requires vast amounts of data to be successful.
Sensory data such as acoustics, vibration, pressure, voltage, etc. are often used for this.
5S approach
Big data provides an infrastructure for transparency in the manufacturing industry
Unravels uncertainties in things like inconsistent component performance and availability.
Applications of Big Data
Private Sector
Companies in retail, real estate, marketing, etc. are using big data more and more to increase their bottom line.
Walmart handles over a million customer transactions an hour, which is imported into a database containing over 2.5 petabytes of data.
Companies like Facebook, Google, and eBay have vast amounts of big, unstructured data that they use to increase their profits.
Some consider this to be an invasion of privacy and a dirty marketing technique.
Applications of Big Data
Facebook
A large reason for Facebook’s growth has been their use of big data. They are one of the world’s largest repositories of data.
They use vast amounts of big data to connect people with things they might like (other people, advertisements, etc.)
Their service “Graph Search” is an advanced search method that allows people to refine searches by very specific criteria.
You might see “Sponsored Posts” on your Facebook sometimes – these heavily use big data to advertise based on what users might like.
This raises privacy concerns for some users.
Science
Many scientific advancements have been, and can be, made with the help of big data.
The Large Hadron Collider has over 150 million sensors delivering data 40 million times per second.
Internet of Things
A network of physical objects embedded with electronics and software that help enable further value in many different applications.
Applications of Big Data
Internet of Things
https://www.youtube.com/watch?v=LVlT4sX6uVs
Applications of Big Data
Data Mining
Without big data, data mining would not be the same.
Simply put, big data is the asset and data mining is the handler used to provide results from those assets.
Data Warehousing
While they sound similar, they’re not quite the same, but they can be complementary.
Data warehousing can provide a way to organize big data in a meaningful way, though neither are required for the other to exist in a business setting.
Applications of Big Data
Critiques and Challenges of Big Data
Big data needs to be complemented by “big judgment”.
That is, few people have the skills to correctly analyze big data and determine what is and isn’t useful.
Big data analysis is often formed with the past, or at best the present, in mind.
Since many algorithms are fed by data from the past, the analysis of big data may not be useful for predicting the future if there’s a lot of change going on.
Meeting the need for speed
In today’s hypercompetitive environment, speed is important but can be difficult to achieve with the amount of data constantly being created.
Dealing with outliers
Outliers typically represent 1 to 5 percent of big data, but it can be hard to determine what is an outlier when there is so much data.
Privacy concerns
Many citizens are concerned with the amount of data that businesses and the government are collecting.
Some believe that enhanced visualization techniques and software are the key to overcoming the challenges of big data.
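One common (though not the only) way to flag outliers like those mentioned above is the z-score rule: mark any value more than a chosen number of standard deviations from the mean. A minimal sketch, assuming a simple numeric stream of sensor readings:

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
print(z_score_outliers(readings, threshold=2.0))  # [95]
```

The difficulty the bullet describes shows up in choosing the threshold: too low and ordinary variation is flagged, too high and genuine anomalies slip through, and at big-data scale there is rarely a human in the loop to check each case.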
The Future of Big Data
The industry is worth more than $100b and is growing at almost 10% a year, roughly twice as fast as the software industry.
The McKinsey Global Institute estimates that data volume is growing more than 40% per year, and will grow 44x between 2009 and 2020.
The global Hadoop market value will reach $50.2b by 2020.