Sample of some of the slides presented at the pre-conference workshop on Telco Big Data.
Introduction to Big Data and Real Time Analytics Workshop
Telco Big Data & Real Time Analytics Summit 2012
3-5 December 2012, London
www.alanquayle.com/blog
© 2012 Alan Quayle Business and Service Development 1
"There are three kinds of lies:
lies, damned lies, and statistics."
Attributed to British Prime Minister Benjamin Disraeli (1804–1881), or perhaps to
Samuel Langhorne Clemens (1835–1910), better known as Mark Twain
Never Forget This!
People
Process
Technology
Most projects fail here
The Data Tsunami!
Why are we measuring so many things?
• Atoms vibrate at about 10^13 Hz. Assuming we measure only the atom (not its
subatomic constituents) at a resolution of just 1 byte, that's 10TB per second per atom
• Now there are roughly 7*10^27 atoms in the human body
• So just monitoring one human body's atoms would generate 7*10^40 bytes per second
• That's about 2*10^48 bytes in a year, i.e. 2 yotta-yottabytes
• By 2020, the quantity of electronically stored data will reach 35 trillion gigabytes;
that's only 35*10^21 bytes
• It's easy (and fun) to play with numbers! Lies, damned lies and statistics!
• We do not need to measure each revolution of an airplane's turbine; it only matters
when an out-of-tolerance event occurs
o Capture events and collect what matters, NOT everything all the time!
o How do we know what matters? Common sense, knowing your business and experimentation!
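The back-of-the-envelope arithmetic in the bullets above can be reproduced in a few lines (same assumptions: one byte per vibration, rough atom count):

```python
# Back-of-the-envelope check of the "data tsunami" numbers above.
vibration_rate_hz = 10**13      # atomic vibration frequency, ~10^13 Hz
bytes_per_sample = 1            # assume 1 byte per measurement
atoms_in_body = 7 * 10**27      # rough atom count in a human body

bytes_per_second = vibration_rate_hz * bytes_per_sample * atoms_in_body
seconds_per_year = 365 * 24 * 3600
bytes_per_year = bytes_per_second * seconds_per_year

print(f"{bytes_per_second:.0e} bytes/s")   # ~7e40 bytes per second
print(f"{bytes_per_year:.1e} bytes/year")  # ~2.2e48, i.e. ~2 yotta-yottabytes
```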
Beware the “Bait and Switch”
Data: You Need Lots of It!!
But There’s a Shortage of Data Scientists to Do Anything With It
So Give Me All Your Money
Introduction
• The purpose of this one day workshop is to provide both an introduction and pragmatic insight
into Big Data, Data Science and Real-Time Analytics.
• This course provides a frank and objective review of the state of the art and of the market,
examining what is working in practice and what is not through an extensive series of case studies.
• Big data usually includes data sets with sizes beyond the ability of commonly used software tools
to capture, manage, and process the data.
o Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many
petabytes.
o A new platform of "big data" tools has arisen to handle sense-making over large quantities of data, for
example the Apache Hadoop Big Data Platform.
• Analyzing large data sets in near real-time is not new; business intelligence is as old as business
itself (that is, as old as human society).
o IT automated it, and enabled an organization to own it rather than leaving it in the wet-ware of a
few human brains (generally the owners of a business).
o Some real-time analysis results in automated triggers (so-called machine learning); most analysis
still requires human interpretation, which is not straightforward.
o Analysis of such large and mixed data sources has its own problems, as we’ll discuss in the course.
o Privacy and regulation cannot be ignored; for some industries this will limit the application of Big
Data.
Structure Part 1 of 5
• 09:00 Registration
• 09:30 History and Overview: Understanding Big Data and Real-Time Analytics in
Context
• What do we mean by Big Data?
• Why does Big Data matter?
• Big Data Maturity
• The 3Vs: Volume, Variety and Velocity
• What are the Domains of Big Data?
• Big Data Technologies
• What Enterprises Think of Big Data
• How Enterprise Verticals are Impacted by Big Data
• Why Now?
• Key Trends driving towards Big Data
• 10:45 Coffee Break
• History of Big Data
• Taxonomy of Big Data Companies
• Big Data Landscape
• List of Companies in Big Data (and their Big
Data revenues)
• Big Data Market Sizing
• Telecoms and Real-Time
• O2 More: Proof we can do it!
Structure Part 2 of 5
• 11:00 Quick Technology Review: Diving into a little detail on a few of the key technologies
(only as deep as the architecture) to understand their history and capabilities /
limitations
• Hadoop
o What is Hadoop?
o Ecosystem
o History
o Design Axioms
o Hadoop Distributed File System
o MapReduce: Distributed Processing
o Architecture
o Data Schemas
o Query Language Flexibility
o Economics
o Case Studies
• Hadoop and HBase in the Cloud (Amazon)
• NoSQL and Cassandra + some use cases
• HBase versus Cassandra
• Graph Database introduction
Structure Part 3 of 5
• 12:00 & 14:00 Application of Big Data
• Hardware and Software Trends
o Execution and Results Characteristics
o Framework: Ecosystem, Application Services, Data
Management
• Real-Time Analytics
o Use Cases
o Extended RDBMS versus MapReduce / Hadoop
o Requirements, Trends, People and Organization
Issues, Outlook
• Big Data and the Cloud
o Why the Cloud and Big Data?
o Cloud benefits
o Use Cases: Bankinter, Etsy, Razorfish
• 13:00ish Lunch
• The Social Enterprise
o Business Benefits
o ALU example
o Drivers
o Social + Data Analysis = Business
intelligence
o AT&T Case Study
o Lessons Learned
• Telcos and Big Data
o TMF Survey
o Big Data Framework
o Predictive / Adaptive Analytics
o Decision Engineering
o The Problem with Telecom
• Telco Analytics
o Customer Profiling
o Next Product Tools
o Marketing Mix Modeling
o Cost of Acquisition Tools
o Case Study
Structure Part 4 of 5
• 15:00 Ecosystem, Taxonomies and
Suppliers: Understanding the many
suppliers, technology camps, and
approaches
• Taxonomy of Big Data Companies
• Big Data Landscape
• Cloudera
• Autonomy
• Vertica
• InfoChimps
• Guavus
• Matrixx
• Case Studies
• Real Time Analytics for Big Data Lessons from
o Quick technology review
o Facebook Real-time Analytics System
o Goal
o Actual Analytics
o Solution
o Memory, Collocate, Economics
• Real Time Analytics for Big Data Lessons from
o Requirements
o Actual Analytics
o Challenges
o Performance
o One data any API
o Solution
o Memory, Collocate, Economics
• Other Case Studies
• Orbitz, Hertz, Yelp
Structure Part 5 of 5
• 16:00 Global Enterprise and Telecom Survey on Big Data and Real-Time
Analytics
• Background
• The Questions
• The Importance of Analytics
• Impact of Big Data on Analytics
• Size of Data Sets, Number of Data Sources
• Update Frequency
• Integration of Data Sources
• Data Set Responsibility
• Types of Data, Types of Processing and Analytics
• Challenges
• Big Data Analytics Platforms
• Benefits and Plans
• Data Analytics Storage and IT Infrastructure Requirements
• Increasing Interest in Hadoop MapReduce Framework Technology
• Conclusions
• Recommendations and Wrap Up
Alan Quayle
• 22 years of experience in the telecommunication industry, focused on developing
profitable new businesses in service providers, suppliers and start-ups.
• Customers include
o Operators such as AT&T, BT, Charter, Etisalat, M1, O2, Rogers, Swisscom, T-Mobile,
Telstra, Time Warner Cable, Verizon and Vodafone;
o Suppliers such as Adobe, Alcatel-Lucent, Ericsson, Huawei, Nokia Siemens Networks,
and Oracle; and
o Innovative start-ups such as Apigee, AppTrigger (sold to Metaswitch), Camiant (sold to
Tekelec), OpenCloud, and Voxeo.
• Works with the developer community and serves on the boards of developers such as
GotoCamera and hSenid Mobile, as well as suppliers such as Sigma Systems.
• Weblog www.alanquayle.com/blog
• Linkedin http://www.linkedin.com/in/alanquayle
A Thank You to Those helping me Put this Course Together
• In putting this workshop together I’d like to thank the following
suppliers for their time, openness, willingness to review, and provide
material to ensure this workshop is up-to-the-minute.
o And especially for not requiring any editorial control over the content or my
views expressed in this material (in reverse alphabetical order).
• Guavus
• HP (don’t mention the Autonomy deal)
• Versant, NoSQL database vendor
• Ty Wang, social media entrepreneur using FB Social Graph
• Lorien Pratt, Data / Decision Scientist with Telco focus
• Amazon Web Services
• Matrixx
Introductions
• Spend 2 minutes to introduce yourself
o Name, current employer and job
o Let us know your favorite hobby
• For me it's hiking with my family
o What you want to get out of this course
• What topics are most important to you?
History and Overview: Understanding Big Data and Real-Time Analytics in Context
Structure
• What do we mean by Big Data?
• Why does Big Data matter?
• Big Data Maturity
• The 3Vs: Volume, Variety and Velocity
• What are the Domains of Big Data?
• Big Data Technologies
• What Enterprises Think of Big Data
• How Enterprise Verticals are Impacted
by Big Data
• Why Now?
• Key Trends driving towards Big Data
• History of Big Data
• Taxonomy of Big Data Companies
• Big Data Landscape
• List of Companies in Big Data (and
their Big Data revenues)
• Big Data Market Sizing
• Telecoms and Real-Time
• O2 More: Proof we can do it!
What Do We Mean by Big Data?
IDC’s Definition of Big Data
What is Big Data?
Why does Big Data Matter?
Another Version of the 3 Vs
• Volume: Data sets are expanding constantly. A strategic approach to
big data takes into account ways to store and manage the huge
volumes of data that are being generated.
• Variety: Big data comes in many forms. Analyzing multi-structured
data can yield important insights that can help direct a business
strategy.
• Velocity: The speed at which data is analyzed is everything,
especially when working in a time-sensitive business environment.
What are the Domains of Big Data?
Big Data Technology Stack
Big Data Technologies
The Technology has Become Quite Fashionable
Big Data Use Cases
Companies in Big Data
• Storage: HP, EMC, IBM, Dell, NetApp, Hitachi Ltd., Fujitsu, Oracle, NEC
• Servers: IBM, HP, Dell, Oracle, Fujitsu, Acer, Cray, Groupe Bull, Hitachi, NEC, SGI, Stratus
Technologies, Unisys, Cisco, Lenovo
• Networking: Cisco, Brocade, HP, Dell, IBM, Alcatel-Lucent, F5 Networks, Citrix
• Relational database software: Oracle Exadata, IBM Netezza, IBM Smart Analytics System,
Teradata, HP Vertica and Autonomy, SAP Sybase IQ, EMC Greenplum DB and HD, Microsoft SQL
Server Parallel Edition, IBM Netezza High Capacity Appliance, Teradata Extreme Performance
Appliance, SAP-Sybase IQ
• Hadoop-based data management and analysis software: Cloudera, MapR, EMC Greenplum HD,
Oracle Big Data Appliance, IBM BigInsights, Hstreaming, Platfora, Zettaset, DataStax,
Karmasphere, Datameer, Hadapt, and so forth
• XML databases: MarkLogic, Oracle XML DB, IBM pureXML, Software AG webMethods, Tamino
XML Server, TigerLogic, Xyleme, and so forth
Companies in Big Data
• Object-oriented databases: Jade Software, Objectivity, Progress Software, Versant
• Graph databases: Neo Technology, Objectivity, Franz Inc., Sones, Ravel
• Ultra-high-speed streaming data technologies: IBM InfoSphere Streams, Informatica
Ultra Messaging Streaming Edition, TIBCO FTL and BusinessEvents, Progress Software
Apama CEP
• Analytics and discovery software: SAS, IBM, Attivio, HP Autonomy, Skytree, Attivio,
Oracle Advanced Analytics, IBM SPSS, Microsoft, Vivisimo, ZyLAB, Sinequa, Revolution
Analytics, KXEN, BA Insight, Palantir, Perfect Search, Wolfram Alpha
• Decision support and automation software including applications: Webtrends, Adobe-
Omniture, IBM Coremetrics, FICO
• Services: Accenture, Deloitte, TCS, HP, Teradata, Mu Sigma, Think Big Analytics,
• Hortonworks, Hashrocket, KloudData, Trendwise Analytics
Big Data Is a Big Market & Big Business - $50 Billion Market by 2017 (according to Wikibon)
• Open source analyst firm Wikibon pegs the current Big Data market at just over $5
billion (a figure IDC and others broadly agree with)
• Wikibon forecasts the Big Data market will grow at a CAGR of 58% between now and
2017, hitting $50 billion within five years.
• Vendors from whales like IBM and HP to pure-plays like Vertica and Cloudera are
bringing in significant revenue today helping enterprises, governments and
healthcare organizations process and make sense of the torrents of unstructured data
flowing from mobile devices, sensors, social media and other sources.
• Today Big Data technologies like Hadoop are mostly in production at Web and online
gaming companies, large financial services firms and banks, and online retailers.
Big Data Is a Big Market & Big Business - $50 Billion Market by 2017
• Another important point is that, while Hadoop may be the poster child of Big Data,
there are other important technologies at play.
o Hadoop: an open source framework for distributing data processing across multiple nodes;
o Massively parallel data warehouses “that deliver fast data loading and real-time
analytic capabilities”;
o Analytic platforms and applications that allow Data Scientists and Business Analysts to
manipulate Big Data; and
o Data Visualization tools that bring insights from Big Data analysis alive for end users.
• Of the current market, Big Data pure-play vendors account for $300 million in Big
Data-related revenue.
o Despite their relatively small percentage of current overall revenue (approximately 5%), Big
Data pure-play vendors – such as Vertica, Splunk and Cloudera — are responsible for the vast
majority of new innovations and modern approaches to data management and analytics that
have emerged over the last several years and made Big Data the hottest sector in IT.
Wikibon Forecast
IDC’s Forecast
Technology Review: Diving into a little detail on a few of the key technologies (only as deep as the architecture) to understand their history and capabilities / limitations
Structure Part 2 of 5
• Hadoop
o What is Hadoop?
o Ecosystem
o History
o Design Axioms
o Hadoop Distributed File System
o MapReduce: Distributed Processing
o Architecture
o Data Schemas
o Query Language Flexibility
o Economics
o Case Studies
• Hadoop and HBase in the Cloud (Amazon)
• NoSQL and Cassandra + some use cases
• HBase versus Cassandra
• Graph Database introduction
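Before turning to HBase and Cassandra, the MapReduce model listed in the outline above can be pinned down with a minimal single-process sketch. This is illustrative only, not the Hadoop API: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return (key, sum(values))

docs = {1: "big data big analytics", 2: "big data"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1}
```

In Hadoop the map and reduce functions run in parallel across many nodes and the shuffle moves data over the network; the logic per key is the same.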
HBase Versus Cassandra: History
• HBase and its required supporting systems are derived from the original Google
BigTable and Google File System designs (as described in the Google File System
paper Google published in 2003, and the BigTable paper published in 2006).
• Cassandra on the other hand is a recent open source fork of a
standalone database system initially coded by Facebook, which
while implementing the BigTable data model, uses a system inspired
by Amazon’s Dynamo for storing data (in fact much of the initial
development work on Cassandra was performed by two Dynamo
engineers recruited to Facebook from Amazon).
HBase Versus Cassandra
• These differing histories have resulted in HBase being more suitable for data
warehousing and large-scale data processing and analysis (for example, the kind
involved in indexing the Web),
• and Cassandra being more suitable for real-time transaction processing and the
serving of interactive data.
• For lightweight validation you’ll find the current makeup of the key committers
interesting:
o the primary committers to HBase work for Bing (Microsoft bought their search company last
year and, after a couple of months, gave them permission to continue submitting open
source code).
o By contrast the primary committers on Cassandra work for Rackspace, which supports
the idea of an advanced general purpose NOSQL solution being freely available to
counter the threat of companies becoming locked in to the proprietary NOSQL solutions
offered by the likes of Google, Yahoo and Amazon EC2.
• The CAP Theorem was developed by Professor Eric Brewer, co-founder and Chief Scientist of
Inktomi.
• The theorem states that a distributed (or “shared data”) system design can offer at most two of three
desirable properties: Consistency, Availability and tolerance to network Partitions. Consistency means
that if someone writes a value to a database, other users will immediately be able to read the
same value back. Availability means that the distributed system remains operational even if some
number of nodes in the cluster fail. Tolerance to Partitions means that the system remains
operational even if a network failure divides the nodes of the cluster into groups that can no
longer communicate.
• If you search online posts comparing HBase and Cassandra, you will regularly find the HBase
community explaining that they have chosen CP, while Cassandra has chosen AP.
• BUT the CAP theorem only applies to a single distributed algorithm. There is no reason why you
cannot design a single system where, for any given operation, the underlying algorithm, and thus
the trade-off achieved, is selectable.
• Thus while it is true that a system may only offer two of these properties per operation, what has been
widely missed is that a system can be designed that allows a caller to choose which properties they want
when any given operation is performed.
• Not only that, reality is not nearly so black and white: it is possible to offer differing degrees of
balance between consistency, availability and tolerance to partitions. This is Cassandra.
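That per-operation trade-off is what Cassandra exposes as tunable consistency: each read and write names a consistency level (ONE, QUORUM, ALL). The toy single-process model below (plain Python, not Cassandra's API) shows the arithmetic behind it: with N replicas, choosing R and W such that R + W > N forces read and write quorums to overlap, so a read always sees the latest write; a smaller R or W trades that consistency for availability.

```python
# Toy replicated register with per-operation consistency levels.
N = 3
replicas = [(0, None)] * N  # each replica holds a (timestamp, value) pair

def write(value, ts, w, reachable):
    # A write at level W fails unless at least W replicas are reachable:
    # the operation chooses consistency over availability.
    if len(reachable) < w:
        raise RuntimeError("cannot satisfy W replicas")
    for i in reachable:
        replicas[i] = (ts, value)

def read(r, reachable):
    # A read at level R returns the newest value among R+ reachable replicas.
    if len(reachable) < r:
        raise RuntimeError("cannot satisfy R replicas")
    newest = max((replicas[i] for i in reachable), key=lambda tv: tv[0])
    return newest[1]

write("v1", ts=1, w=2, reachable=[0, 1])  # replica 2 missed the write
print(read(r=2, reachable=[1, 2]))  # prints v1: R + W > N, quorums overlap
print(read(r=1, reachable=[2]))     # prints None: stale, but still available
```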
Application of Big Data
Structure
• Hardware and Software Trends
o Execution and Results Characteristics
o Framework: Ecosystem, Application
Services, Data Management
• Real-Time Analytics
o Use Cases
o Extended RDBMS versus MapReduce /
Hadoop
o Requirements, Trends, People and
Organization Issues, Outlook
• Big Data and the Cloud
o Why the Cloud and Big Data?
o Cloud benefits
o Use Cases: Bankinter, Etsy, Razorfish
• The Social Enterprise
o Business Benefits
o ALU example
o Drivers
o Social + Data Analysis = Business
intelligence
o AT&T Case Study
o Lessons Learned
• Telcos and Big Data
o TMF Survey
o Big Data Framework
o Predictive / Adaptive Analytics
o Decision Engineering
o The Problem with Telecom
• Telco Analytics
o Customer Profiling
o Next Product Tools
o Marketing Mix Modeling
o Cost of Acquisition Tools
o Case Study
Use Cases for Big Data Analytics
• Search ranking.
o All search engines attempt to rank the relevance of a webpage to a search request against all
other possible webpages
o Google’s PageRank algorithm is, of course, the poster child for this use case
• Ad tracking.
o E-commerce sites typically record an enormous river of data including every page event in
every user session
o This allows for very short turnaround of experiments in ad placement, color, size, wording,
and other features
o When an experiment shows that such a feature change in an ad results in improved click
through behavior, the change can be implemented virtually in real time
• Location and proximity tracking.
o Many use cases add precise GPS location tracking, together with frequent updates, in
operational applications, security analysis, navigation, and social media
o Precise location tracking opens the door for an enormous ocean of data about other locations
nearby the GPS measurement
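The search-ranking use case above can be made concrete with a few lines of power iteration over a toy link graph. This is a simplified PageRank, not Google's production algorithm, and it assumes every page has at least one outgoing link:

```python
def pagerank(links, damping=0.85, iters=50):
    # links: page -> list of pages it links to (each page must link somewhere).
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Each page starts with the "random jump" share, then receives a
        # damped fraction of the rank of every page that links to it.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c collects the most link weight
```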
Use Cases for Big Data Analytics
• Causal factor discovery.
o Point-of-sale data has long been able to show us when the sales of a product go sharply up
or down. But searching for the causal factors that explain these deviations has been, at best, a
guessing game or an art form.
o The answers may be found in competitive pricing data, competitive promotional data
including print and television media, weather, holidays, national events including disasters,
and virally spread opinions found in social media.
• Social CRM.
o This use case is one of the hottest new areas for marketing analysis. The Altimeter Group has
described a very useful set of key performance indicators for social CRM that include share of
voice, audience engagement, conversation reach, active advocates, advocate influence,
advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment
ratio, and idea impact.
o The calculation of these KPIs involves in-depth trawling of a huge array of data sources,
especially unstructured social media.
Use Cases for Big Data Analytics
• Document similarity testing.
o Two documents can be compared to derive a metric of similarity. There is a large body of academic
research and tested algorithms, for example latent semantic analysis, that is just now finding its way to
driving monetized insights of interest to big data practitioners.
o For example, a single source document can be used as a kind of multifaceted template to compare against a
large set of target documents. This could be used for threat discovery, sentiment analysis, and opinion
polls. For example: "find all the documents that agree with my source document on global warming."
• Genomics analysis: e.g., commercial seed gene sequencing.
o A few months ago the cotton research community was thrilled by a genome sequencing announcement that
stated in part: “The sequence will serve a critical role as the reference for future assembly of the larger
cotton crop genome. Cotton is the most important fiber crop worldwide and this sequence information will
open the way for more rapid breeding for higher yield, better fiber quality and adaptation to
environmental stresses and for insect and disease resistance.”
o Scientist Ryan Rapp stressed the importance of involving the cotton research community in analyzing
the sequence, identifying genes and gene families and determining the future directions of research.
o This use case is just one example of a whole industry that is being formed to address genomics analysis
broadly, beyond this example of seed gene sequencing.
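The document-similarity testing described above rests on turning documents into vectors and comparing them. A minimal sketch using term-frequency vectors and cosine similarity, the building block beneath richer techniques such as latent semantic analysis:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    # Represent each document as a bag-of-words term-frequency vector,
    # then measure the cosine of the angle between the two vectors.
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

src = "global warming is accelerating"
print(cosine_similarity(src, "global warming is accelerating"))  # 1.0
print(cosine_similarity(src, "stock prices fell sharply"))       # 0.0
```

Latent semantic analysis goes further by factoring the full term-document matrix so that documents can match on related terms, not just shared ones.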
Use Cases for Big Data Analytics
• Discovery of customer cohort groups.
o Customer cohort groups are used by many enterprises to identify common demographic trends and
behavior histories. We are all familiar with Amazon's cohort groups when they say other customers who
bought the same book as you have also bought the following books. Of course, if you can sell your product
or service to one member of a cohort group, then all the rest may be reasonable prospects. Cohort groups
are represented logically and graphically as links, and much of the analysis of cohort groups involves
specialized link analysis algorithms.
• In-flight aircraft status.
o This use case as well as the following two use cases are made possible by the introduction of sensor
technology everywhere. In the case of aircraft systems, in-flight status of hundreds of variables on engines,
fuel systems, hydraulics, and electrical systems are measured and transmitted every few milliseconds. The
value of this use case is not just the engineering telemetry data that could be analyzed at some future point
in time, but drives real-time adaptive control, fuel usage, part failure prediction, and pilot notification.
• Smart utility meters.
o It didn't take long for utility companies to figure out that a smart meter can be used for more than just the
monthly readout that produces the customer’s utility bill. By drastically cranking up the frequency of the
readouts to as much as one readout per second per meter across the entire customer landscape, many
useful analyses can be performed including dynamic load-balancing, failure response, adaptive pricing,
and longer-term strategies for incenting customers to utilize the utility more effectively (either from the
customers’ point of view or the utility's point of view!)
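As a flavour of what high-frequency meter readouts enable, here is a deliberately crude failure-response sketch: flag any reading that deviates sharply from its trailing window. The readings and thresholds are invented for illustration:

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    # Flag a reading that deviates from the trailing window's mean by more
    # than `threshold` standard deviations: a crude failure-response rule.
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

meter = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 45.0, 10.2]
print(flag_anomalies(meter))  # [6] -- the 45.0 spike
```

Real deployments would run this per meter as a streaming computation, but the per-reading rule is the same shape.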
Use Cases for Big Data Analytics
• Building sensors.
o Modern industrial buildings and high-rises are being fitted with thousands of small
sensors to detect temperature, humidity, vibration, and noise.
o Like the smart utility meters, collecting this data every few seconds 24 hours per day
allows many forms of analysis including energy usage, unusual problems including
security violations, component failure in air-conditioning and heating systems and
plumbing systems, and the development of construction practices and pricing strategies.
• Satellite image comparison.
o Images of the regions of the earth from satellites are captured by every pass of certain
satellites on intervals typically separated by a small number of days.
o Overlaying these images and computing the differences allows the creation of hot spot
maps showing what has changed. This analysis can identify construction, destruction,
changes due to disasters like hurricanes and earthquakes and fires, and the spread of
human encroachment.
Use Cases for Big Data Analytics
• CAT scan comparisons.
o CAT scans are stacks of images taken as "slices" of the human body. Large
libraries of CAT scans can be analyzed to facilitate the automatic diagnosis of
medical issues and their prevalence.
• Financial account fraud detection and intervention.
o Account fraud, of course, has immediate and obvious financial impact. In
many cases fraud can be detected by patterns of account behavior, in some
cases crossing multiple financial systems. For example, "check kiting" requires
the rapid transfer of money back and forth between two separate accounts.
o Certain forms of broker fraud involve two conspiring brokers selling a security
back-and-forth at ever increasing prices, until an unsuspecting third party
enters the action by buying the security, allowing the fraudulent brokers to
quickly exit. Again, this behavior may take place across two separate
exchanges in a short period of time.
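The back-and-forth pattern described above for check kiting lends itself to a simple illustration: count direction reversals between account pairs and flag pairs that reverse repeatedly. A toy sketch (account names and the threshold are invented):

```python
from collections import Counter

def kiting_suspects(transfers, min_round_trips=3):
    # transfers: list of (from_account, to_account) tuples in time order.
    # Flag account pairs whose transfer direction keeps reversing, the
    # rapid back-and-forth pattern characteristic of check kiting.
    round_trips = Counter()
    last_direction = {}
    for src, dst in transfers:
        pair = frozenset((src, dst))
        if last_direction.get(pair) == (dst, src):  # direction reversed
            round_trips[pair] += 1
        last_direction[pair] = (src, dst)
    return [tuple(sorted(pair)) for pair, n in round_trips.items()
            if n >= min_round_trips]

log = [("A", "B"), ("B", "A"), ("A", "B"), ("B", "A"),
       ("A", "B"), ("B", "A"), ("C", "D")]
print(kiting_suspects(log))  # [('A', 'B')]
```

A production system would also weigh amounts, timing and cross-institution data, which is where the Big Data machinery comes in.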
Use Cases for Big Data Analytics
• Computer system hacking detection and intervention.
o System hacking in many cases involves an unusual entry mode or some other kind of behavior
that in retrospect is a smoking gun but may be hard to detect in real-time.
• Online game gesture tracking.
o Online game companies typically record every click and maneuver by every player at the most
fine grained level. This avalanche of "telemetry data" allows fraud detection, intervention for a
player who is getting consistently defeated (and therefore discouraged), offers of additional
features or game goals for players who are about to finish a game and depart, ideas for new
game features, and experiments for new features in the games.
o This can be generalized to television viewing. Your DVR box can capture remote control
keystrokes, recording events, playback events, picture-in-picture viewing, and the context of
the guide. All of this can be sent back to your provider.
• Big science including atom smashers, weather analysis, space probe telemetry feeds.
o Major scientific projects have always collected a lot of data, but now the techniques of big data
analytics are allowing broader access and much more timely access to the data. Big science
data, of course, is a mixture of all forms of data, scalar, vector, complex structures, analog wave
forms, and images.
Use Cases for Big Data Analytics
• "Data bag" exploration.
o There are many situations in commercial environments and in the research
communities where large volumes of raw data are collected. One example might be data
collected about structure fires. Beyond the predictable dimensions of time, place,
primary cause of fire, and responding firefighters, there may be a wealth of
unpredictable anecdotal data that at best can be modeled as a disorderly collection of
name value pairs, such as "contributing weather = lightning". Another example would be
the listing of all relevant financial assets for a defendant in a lawsuit.
o Again such a list is likely to be a disorderly collection of name value pairs, such as
"shared real estate ownership = condominium". The list of examples like this is endless.
What they have in common is the need to encapsulate the disorderly collection of name
value pairs, which is generally known as a "data bag". Complex data bags may contain
both name value pairs as well as embedded sub data bags. The challenge in this use case
is to find a common way to approach the analysis of data bags when the content of the
data may need to be discovered after the data is loaded.
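One common way to approach the analysis of data bags is to flatten the nested name/value pairs into dotted paths, so the content can be queried after it is loaded rather than modeled up front. A small sketch (the fire-report fields echo the example above; the nested sub-bag is invented for illustration):

```python
def flatten(bag, prefix=""):
    # Walk a nested "data bag" (name/value pairs, possibly containing
    # embedded sub-bags) and yield dotted-path/value pairs.
    for name, value in bag.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield (path, value)

fire_report = {
    "contributing weather": "lightning",
    "response": {"engines": 3, "mutual aid": "yes"},
}
print(dict(flatten(fire_report)))
# {'contributing weather': 'lightning', 'response.engines': 3,
#  'response.mutual aid': 'yes'}
```

Once flattened, disorderly bags from many records can be indexed and queried uniformly even though no two records share the same fields.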
Use Cases for Big Data Analytics
• The final two use cases are old and even predate data warehousing itself. But
new life has been breathed into these use cases because of the exciting potential
of ultra-atomic customer behavior data.
o Loan risk analysis and insurance policy underwriting. In order to evaluate the risk of a
prospective loan or a prospective insurance policy, many data sources can be brought
into play ranging from payment histories, detailed credit behavior, employment data,
and financial asset disclosures. In some cases the collateral for a loan or the insured
item may be accompanied by image data.
o Customer churn analysis. Enterprises concerned with churn want to understand the
predictive factors leading up to the loss of a customer, including that customer’s detailed
behavior as well as many external factors including the economy, life stage and other
demographics of the customer, and finally real time competitive issues.
Big Data on the Cloud in the Real World: How the Cloud Is Big Data's Best Friend
Characteristics of Big Data
Features driven by MapReduce
Big Data is Getting Bigger
• 2.7 zettabytes in 2012
• Over 90% will be unstructured
• Data spread across a wide array of silos
Why is Big Data Hard (and Getting Harder)?
Changing Data Requirements
• Faster response time on fresher data
• Sampling is not good enough & history is important
• Increasing complexity of analytics
• Users demand inexpensive experimentation
Where is it Coming From?
Computer Generated
• Application server logs (web sites, games)
• Sensor data (weather, water, smart grids)
• Images/videos (traffic, security cameras)
Human Generated
• Twitter "Fire Hose": 50m tweets/day, 1,400% growth per year
• Blogs/Reviews/Emails/Pictures
• Social Graphs: Facebook, LinkedIn, Contacts
Big Data Verticals
• Media/Advertising: targeted advertising, image and video processing
• Oil & Gas: seismic analysis
• Retail: recommendations, transactions analysis
• Life Sciences: genome analysis
• Financial Services: Monte Carlo simulations, risk analysis
• Security: anti-virus, fraud detection, image recognition
• Social Network/Gaming: user demographics, usage analysis, in-game metrics
Bank – Monte Carlo Simulations
“The AWS platform was a good fit for its unlimited and flexible computational power to our risk-simulation process requirements. With AWS, we now have the power to decide how fast we want to obtain simulation results, and, more importantly, we have the ability to run simulations not possible before due to the large amount of infrastructure required.” – Castillo, Director, Bankinter
23 Hours to 20 Minutes
etsy.com/gifts: Recommendations
• Gift ideas for Facebook friends
Targeted Ads (1.7 million per day)
• Example: a user recently purchased a sports movie and is searching for video games
Click Stream Analysis
The Social Enterprise
• Implementations are getting bigger and growing faster than ever
• Virtually all data continue to show sustained real-world benefits (McKinsey,
IBM, Frost and Sullivan, AIIM)
• Everything is becoming social: Social features are appearing in virtually all types
of applications
• There continues to be considerable confusion about who “owns” social in the
organization
• The predicted social data explosion: It happened
• Mining insight from social data has now become a major industry (#bigdata,
#analytics)
• The blur between internal and external social business has not progressed as far
as many thought
• The first serious talk about open social business standards has begun
Decision Engineering
Adaptive Analytics
Predictive Analytics
Reporting
Data Management (including data migration, data quality, data
modeling)
Will this customer churn?
• Yes/No data: If customer has an open trouble ticket: Yes; otherwise: No
• Real-valued: If customer age < 30: Yes; otherwise: No
• Combination: If customer age < 30 AND has an open trouble ticket: Yes; otherwise: No
• Linear combination: If 2.3 x Age + 4.4 x Income > 40: Yes; otherwise: No
Predictive Analytics: obtain these numbers by analyzing historical data.
Adaptive Analytics: update your historical data, and re-derive the numbers periodically to take changing situations into account.
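The linear-combination rule above can be sketched in a few lines. The coefficients (2.3, 4.4) and the threshold (40) are the slide's illustrative numbers, not fitted values; predictive analytics would derive them from historical data, and the adaptive step below is a deliberately toy re-derivation.

```python
# Illustrative linear-combination churn rule from the slide:
# churn if 2.3 * age + 4.4 * income > 40 (coefficients are examples,
# not fitted values).
def will_churn(age, income, w_age=2.3, w_income=4.4, threshold=40.0):
    return w_age * age + w_income * income > threshold

def refit(history):
    """Adaptive step (toy version): re-derive the threshold so the rule
    separates past churners from non-churners in fresh history.
    `history` is a list of (age, income, churned) tuples."""
    scores = [(2.3 * a + 4.4 * i, churned) for a, i, churned in history]
    churn_scores = [s for s, c in scores if c]
    stay_scores = [s for s, c in scores if not c]
    # Put the cut midway between the groups (assumes they are separable).
    return (min(churn_scores) + max(stay_scores)) / 2

print(will_churn(age=25, income=2.0))   # 2.3*25 + 4.4*2 = 66.3 > 40
```

Re-running `refit` on a schedule, as the slide describes, is what turns a static predictive rule into an adaptive one.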
Nonlinear Analytics: [chart: Income vs. age, showing a nonlinear pattern]
Predictive/Adaptive Analytics on one slide
Decision Model (part of Decision Engineering)
From: Agile Decision Making: Improving business results with analytics TM Forum Quick Insight report, 2011. Source: Lorien Pratt
…Decision engineering places analytics in the larger business context. Each “f” here is an analytic, or based on human expertise
[Diagram: (1) data used to construct the analytic feeds (2) the analytic, e.g. "If 2.3 x Age + 4.4 x Income > 40: Yes, otherwise: No", which is applied to (3) operational data, yielding (4) a prediction such as "Sally is likely enough to churn that we should call her" and (5) an action toward Sally]
Key Distinctions
• Automated versus human-in-the-loop while building
analytics
• Automated versus human-in-the-loop while using
analytics
• Strategic versus tactical goals
• One-size fits all versus demographic versus personalized
• Within-silo versus between-silo
• Cleansing for operational versus analytic purposes
Moving Analytics to the Center: retailers face new competition that is driving an advanced view of customers and interactions to the center of the business.
• Operations: How do I create a responsive analytics capability, and governance relative to the right-time application of analytic decision making?
• Marketing & Sales: How do I leverage and operationalize customer insights and experience data to drive personal, timely, and relevant interactions across all channels?
• Merchandising: How do I dynamically manage margin and brand perception with the right mix of regular, promotional and markdown products across categories, channels, and formats?
• Supply Chain: Are inventory and demand data leveraged to optimize the customer experience and effectively respond to changing market conditions?
Spanning functions: multi-channel operations, supplier/partner collaboration.
Advanced Customer Intelligence
Semantic Framework: Applied Customer Analytics Capability
The New Analytical Competency
Focus of Efforts in the Past → New Competency Requirements
• Large-scale integration of all data sources → Connected information & analytics governance for the enterprise
• Central control of metadata and information usage → Provisioning information & insights to the point of leverage
• Developing the most technically correct analytical point solution possible → Agile analytical modeling processes & rapid evaluation of business lift
Example FROM: "How can we use all possible customer dimensions to predict customer churn?" TO: "What is the optimum behavior modeling framework to rapidly build and deploy models applicable to multiple business objectives that change over time?"
Predictive Analytics
Historical approaches rely on static data:
• Propensity to churn • Propensity to buy • Propensity to pay • Customer lifetime value
Future needs require a more dynamic approach:
• Ability to intervene in customer interactions to create desired outcomes
Problem Statements
• Telcos are not traditionally nimble.
• Telcos look at customers in groups, not individually.
• Telcos have very little idea what drives customer behavior.
• Telcos have no idea how to influence customer behavior.
• Even if they knew how to influence customer behavior, Telcos do not have the nimble decisioning tools required to impact customer behavior in real time.
Ecosystem, Taxonomies and Supplier Review: understanding the many suppliers, technology camps, and approaches
Structure Part 4 of 5
• 15:00 Ecosystem, Taxonomies and
Suppliers: Understanding the many
suppliers, technology camps, and
approaches
• Taxonomy of Big Data Companies
• Big Data Landscape
• Cloudera
• Autonomy
• Vertica
• InfoChimps
• Guavus
• Matrixx
• Case Studies
• Real Time Analytics for Big Data Lessons from
o Quick technology review
o Facebook Real-time Analytics System
o Goal
o Actual Analytics
o Solution
o Memory, Collocate, Economics
• Real Time Analytics for Big Data Lessons from
o Requirements
o Actual Analytics
o Challenges
o Performance
o One data any API
o Solution
o Memory, Collocate, Economics
• Other Case Studies
• Orbitz, Hertz, Yelp
Guavus provides integrated solutions to enable rapid decisions on big data for CSPs
• Guavus delivers big data solutions, not just technology components
• Unique ability to rapidly fuse huge quantities of data from diverse sources
• Patent-pending streaming analytics technology proven over 10+ years
• Current customers include leading wireless, IP, and video service providers
Guavus at a Glance
• Silicon Valley venture-backed company: US HQ in San Mateo, CA; R&D offices in India; raised $48 million; 350 employees worldwide
• Tier-1 CSP customers & partnerships: 3 of the top 5 NA mobile operators, 3 of the top 5 IP/MPLS backbone carriers, & CDN networks; 4 of the top 6 largest global communications infrastructure equipment vendors
• Industry proven & recognized: mature (10+ years) patent-pending technology
Guavus Empowers LOB to Make Decisions
Data collection, fusion and mining across disparate data sources: information systems (enterprise apps, data warehouses, databases, networks) and devices & networks; data at rest (views) and data in motion (flows).
• Finance & Regulatory: profitability analysis, tiered pricing optimization, contract/SLA enforcement
• Network & Operations: traffic engineering, capacity planning, peering optimization
• Marketing: customer segmentation, campaign management
• Executives: continuous business optimization, predictive planning
• Customer Care & Sales: churn prediction, focused prospecting, targeted up-sell & cross-sell
Operator Challenges in a Big Data World
• Data sitting in silos
• Exponential [streaming] data growth
• Timely insights
• Distributed network generation
Key Data Sources & Insights
Streaming analytics insights:
• Content trending & consumption
• Fused network events
• Subscriber dynamic usage profiles
• Network usage patterns
• Policy control functions
Data sources span the delivery chain: content providers, Internet CDN, edge network, access network, CPE or end device.
Transforming the Big Data Analytics Economic Model
Traditional centralized, store-first architecture:
• Consolidate data in a repository
• Transport and store data: transport and storage costs alone may put it over budget
• Project may not even get started
Streaming-centric distributed, compute-first architecture:
• Move processing to the data edge
• Focus spend on analytics first
• Continuous processing yields timely and actionable insights
• Reduce overall spend per new analytics question
• Leverage off-the-shelf low-cost processing and storage
[Chart: how resources & time divide between transport, storage, and compute (insights) under each architecture]
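The compute-first idea, aggregate at the edge and ship only summaries, can be sketched as follows; the site names and record fields are invented for illustration.

```python
# Compute-first sketch: each edge site reduces its raw stream to small
# per-key summaries; only those summaries travel to the central tier,
# so transport and storage spend shrinks relative to store-first.
from collections import defaultdict

def edge_aggregate(records):
    """Reduce a site's raw usage records to bytes-per-subscriber totals."""
    totals = defaultdict(int)
    for rec in records:               # single streaming pass
        totals[rec["subscriber"]] += rec["bytes"]
    return dict(totals)

def central_merge(site_summaries):
    """Fuse the per-site summaries instead of the raw streams."""
    merged = defaultdict(int)
    for summary in site_summaries:
        for subscriber, nbytes in summary.items():
            merged[subscriber] += nbytes
    return dict(merged)

site1 = edge_aggregate([{"subscriber": "a", "bytes": 100},
                        {"subscriber": "b", "bytes": 50}])
site2 = edge_aggregate([{"subscriber": "a", "bytes": 25}])
print(central_merge([site1, site2]))   # {'a': 125, 'b': 50}
```

The design choice is that only additive summaries cross the network; this works whenever the analytic decomposes into a per-site reduction plus a central merge.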
Centralized compute & analyze: master fusion, machine learning, clustering & classifying, master aggregation, business logic.
Analytics application examples: mobility, digital media, broadband; 3rd-party feeds & customer tools; market research, ad targeting, capacity planning, data warehouses.
Big Data Streaming Analytics Architecture
Each distributed site (1, 2, 3) runs the same local stack: streaming/batch ingest, data fusion, aggregation, and a local data store.
Data sources: DPI data, PDN flows, AAA data, web activity, web taxonomy, advertising traffic, media-type metadata, flow & routing, service consumption traffic.
Guavus Analytics Platform Details
[Architecture diagram, bottom-up:]
• Streaming data feeds (DPI, PCMD, IPDR, NetFlow, RADIUS, DNS, …) enter via distributed data collectors into the Guavus stream processing pipeline.
• Central compute (fusion, aggregation & compute) feeds a data store, caching compute nodes (business cubes, machine-learning caching), and an analysis store, exposed through the HBASE API, SQL/Hive, the Cube API, and SQL.
• Guavus applications (Mobility Reflex, IP Reflex, CDN Reflex, Ad Reflex) drive enterprise reporting, consumer reporting, customer UI portals, and insight discovery.
• Ingest/export and the Guavus external API & POC sandbox integrate with 3rd-party systems: network management, field inventory, PM/FM, CRM, inventory, data stores (IT, DWH, cloud), XDR, and traditional ETL layers.
Matrixx. Parallel-MATRIXX™
• Parallel-MATRIXX™ technology has completely re-invented transactional real-time processing and eliminated the limitations of the contemporary technologies described earlier.
• The next slide shows the Parallel-MATRIXX™ functional architecture, based on multiple patented technologies and offering a performance improvement of at least two orders of magnitude relative to legacy approaches.
Matrixx. Algebraic-Decision Engine
• OCS raters can be broadly classified as rule-driven or data-driven.
o The former offer great flexibility to configure rating scenarios of arbitrary sophistication, but can become challenging to maintain beyond a certain complexity.
o Data-driven systems typically offer a rich catalog of off-the-shelf templates that are easily configured to create real offers.
• These templates are "baked" into code so performance can be highly optimized. The challenge with this approach arises when no suitable template is available, often requiring complex and costly customization.
• With respect to real-time performance, both approaches share a common weakness: every transaction results in execution of conditional logic reflecting the rating discriminators (if weekend, and if URL is on-net, and if…).
• As rating, or indeed policy, rules become more sophisticated, execution code paths extend and performance degrades, often unpredictably.
Matrixx. Algebraic-Decision Engine
• The Parallel-MATRIXX™ Algebraic-Decision engine eliminates this degradation by building on the simple principle that any pricing concept can be represented as a set of mathematical equations.
• Modern CPUs, capable of 200 million multiplications per second, are exceptionally efficient at solving such equations.
• Pricing plans, offers, and policies are configured via a GUI and transparently compiled into an n-dimensional matrix where each dimension corresponds to a rating normalizer (such as time, location, service, etc.).
• Stored at each matrix "intersection" is a linear equation representing the rating formula to be applied. As each transaction is mapped to the relevant intersection, solving the associated linear equation is extremely fast.
• As offers are extended with additional normalizers (for example, adding a device dependency to offer lower rates for a promoted device), the matrix dimensionality is extended accordingly. This adds only a few CPU cycles to solve the rate equation, with no significant impact on latency.
Contention-Free In-Memory Database and Parallel-MATRIXX™ Processing
• Maintaining data and transaction integrity is a mission-critical requirement for any
database containing CSP customer or financial data. For example, an attempt to
transfer funds between two customers must complete successfully or be cleanly
aborted.
• A situation where the donor’s account is debited but some technical failure results in
the recipient not receiving the funds would leave the database in an invalid state.
• As described earlier, current real-time systems rely heavily on OLTP and locking techniques to assure data integrity, which can lead to rapidly degrading and unpredictable performance.
• Parallel-MATRIXX™ technology is based on an in-memory database that does not utilize locking while still supporting full ACID-compliant transactions.
• No transaction is ever blocked from accessing or updating data; newly developed algorithms detect and resolve transaction conflicts.
Case Studies: understanding where big data is used in practice
Global Enterprise and Telecom Survey on Big Data and Real-Time Analytics
Structure
• Background
• The Questions
• The Importance of Analytics
• Impact of Big Data on Analytics
• Size of Data Sets, Number of Data Sources
• Update Frequency
• Integration of Data Sources
• Data Set Responsibility
• Types of Data, Types of Processing and Analytics
• Challenges
• Big Data Analytics Platforms
• Benefits and Plans
• Data Analytics Storage and IT Infrastructure Requirements
• Increasing Interest in Hadoop MapReduce Framework Technology
• Conclusions
Background
• Global Survey
• Across 200 business and IT executives, questioned in August and September
2012
• 105 enterprise (non Telco), 55 Telco – all large enterprises (no mid-market
analysis)
• Non-Telco included web service providers, financial services, healthcare,
manufacturing, retail, education, government, military, entertainment verticals
• Generally VP level with a few CxO level, all decision makers with budget
responsibilities
• Generally known to me, or through my contacts as I was trying to gather frank
reviews
• Surprisingly similar across Telco and non-Telco
Importance of Enhancing Data Processing and Analytics versus All Business Priorities
• Most important: 31% • Top 5: 39% • Top 10: 20% • Top 20: 9% • Not important: 1%
Impact of Big Data on Analytics
• There is much market hype surrounding the term big data. When asked what the
term means to them, a majority of respondents indicated that it simply refers to very
large data sets, see next slide.
• The big data movement born from the Hadoop open source initiative has not reached
most IT departments or even analytics professionals, as evidenced by the fact that
only 11% of survey respondents associate Hadoop MapReduce with the concept of big
data.
• Most organizations’ analytics efforts to date have dealt with structured data, sourced
through relational databases and data warehouses, and for the vast majority of
analytical undertakings this makes sense.
• But even organizations that have not been captured by the Hadoop movement are
still increasingly under the gun to deal with larger data volumes, and the incursion of
unstructured data. This, plus the many public examples of big data that have caught
the imagination of business executives, have reinvigorated interest in data analytics.
What does the term Big Data mean to you? (share of respondents, 0-80% scale; per-item values not preserved)
• Very large data sets
• Very large databases
• Data warehouses
• Data analytics
• Problems in storing / processing data
• Web and search engine data
• Hadoop / MapReduce
Size of Data Sets
• The majority (66%) of respondents revealed that the size of the
largest data set on which their organization conducts analytics is no
more than 5 terabytes (TB).
• Overall, the largest data analytics set is approximately 10 TB.
• While these numbers might not reflect the expectations that often
accompany the concept of big data, the reality is that processing
even gigabytes of data at a time during traditional analytics
exercises is significant.
What is the Largest Data Set?
• <250GB: 5% • <500GB: 9% • <1TB: 20% • <5TB: 32% • <10TB: 19% • <25TB: 11% • <50TB: 3% • >50TB: 1%
Number of Data Sources
• A significant part of data analytics exercises is the amalgamation of
data from multiple disparate sources.
• The next slide shows that 57% of these organizations are pulling from at least three unique data sources, and one-quarter (25%) are integrating data from five or more sources.
Number of Data Sources
• Single source: 12% • 2: 21% • 3: 25% • 4: 17% • 5: 16% • >5: 9%
Update Frequency
• Many organizations identified improving business intelligence and/or delivery of
real-time business information as a key business initiative that will have an
impact on IT spending decisions.
• Considering the volumes of data organizations intend to analyze in shorter
timeframes, organizations will need to evaluate whether their current
approaches are adaptable to these demanding and constantly changing
requirements. As part of the same spending survey, organizations also identified
major application deployments or upgrades as a top IT priority, which is
significant since every newly deployed or upgraded application will have a
corresponding impact on existing data integration processes.
• When asked about the rate with which their largest data set data is updated,
nearly two thirds (65%) of organizations revealed that the changes take place at
an either real-time or near real-time pace.
Frequency of Update
• Real-time (streams): 28% • Near real-time: 37% • Batch: 35%
Integration of Data Sources
• When asked about the primary method used to integrate the data sources comprising their organization's largest data sets, nearly two fifths (39%) of respondents identified purpose-built applications such as Informatica, Oracle, and Teradata.
• An additional 30% use custom extract, transform, load (ETL) scripts or custom extract, load, transform (ELT) scripts for data source integration purposes.
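A "custom ETL script" in this sense is often just a short extract-transform-load pass. A toy sketch, with invented field names and an in-memory SQLite target:

```python
import csv
import io
import sqlite3

# Toy custom-ETL pass: extract rows from a CSV feed, transform units,
# load into a relational table. Field names are invented for illustration.
raw_feed = "msisdn,usage_kb\n447700900001,2048\n447700900002,512\n"

def etl(feed, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS usage (msisdn TEXT, usage_mb REAL)")
    for row in csv.DictReader(io.StringIO(feed)):        # extract
        mb = int(row["usage_kb"]) / 1024                 # transform
        conn.execute("INSERT INTO usage VALUES (?, ?)", (row["msisdn"], mb))
    conn.commit()

conn = sqlite3.connect(":memory:")
etl(conn=conn, feed=raw_feed)
print(conn.execute("SELECT COUNT(*), SUM(usage_mb) FROM usage").fetchone())
```

The purpose-built tools the survey mentions handle the same extract-transform-load shape, but add connectors, scheduling, lineage, and error handling that hand-rolled scripts must reimplement.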
Main Method of Integrating Data Sources
• Purpose-built: 39% • Custom ETL: 30% • EAI: 12% • Open source: 10% • Other: 9%
Data Set Responsibility
• In terms of the sources responsible for populating organizations' largest data sets, just over half (51%) of respondents identified back office applications, such as resource planning, human capital management, and accounting systems.
o For example, many years of order or payment information can yield useful insight into
customer patterns.
• Another common source involves the information gleaned from corporate data
centers and computer networks in the form of network traffic and system log
files. This information is important to not only those organizations looking to
maximize network and system performance and utilization metrics, but also to
those that rely on security analytics to help shape information privacy and
information protection strategies.
• Enterprise organizations were significantly more likely to identify internal back
and front office applications, internal data center or computer networks, e-
commerce applications (i.e., point-of-sale, supply chain, etc.), and scientific
research as data sources that comprise their largest data sets.
Responsible for Populating Data Set
• Internal back-office: 51% • Internal data center: 45% • Front office: 35% • Web applications: 34% • Social media: 12% • Telemetry: 10% • External public data: 11% • Third party: 10% • Scientific research: 7%
Types of Data
• What data types end up in organizations’ largest data sets from the
aforementioned sources? More than half (52%) of respondents indicated that
their largest data set is comprised of database data.
• Nearly half (48%) of organizations have some measure of transactional data—
such as point-of-sale (POS) or inventory—residing in their largest data set.
• What is interesting is the number of organizations that report that unstructured
data—especially machine-generated content such as log files and sensor data—
populates their largest data sets. These data types precipitated the concept of big
data and there are emerging signs that these will consume a vast amount of
bandwidth, compute, and storage resources. Probably the most significant
takeaway is that big data becomes really big when an organization starts to see
unstructured / machine-generated data grow to the size of—or even surpass—
relational information, which will serve to further exacerbate the integration
challenges mentioned above.
Source of Data
• Relational database: 52% • Transaction database: 48% • Office documents: 30% • Log files: 22% • Text / messages: 19% • Location data: 18% • Web log files: 16% • Audio / video: 11% • Sensor data: 9%
Challenges
• When asked to identify the data processing and/or analytics challenges
associated with their organization’s largest data set, nearly half cited security /
regulation / compliance.
• Personally identifiable information (PII) and other sensitive information is what
drives this.
• About one third of respondents identified data quality (35%) and data cleansing
tasks (33%) since data cleansing and preparation was categorized as the most
time-consuming data processing and analytics activity.
• Lack of skills is a middle-of-the-pack challenge according to respondents.
• Clearly, responses involving process-related considerations (i.e., data security, integration, cleansing, etc.) gravitated to the top of the challenges list.
Data Processing Challenges
• Security / regulation / compliance: 48% • Data quality: 35% • Cleansing: 32% • Data integration: 29% • Business expectations: 25% • Data synchronization: 19% • Costs: 18% • Lack of skills: 17%
Benefits
• Cost containment is still an important business initiative to many
organizations, especially when it comes to IT investments.
• More than half (55%) of respondents identified reduced costs as a
key benefit associated with their data analytics platform.
• Other top benefits centered on simplicity and efficiency, including
easier management and process improvements, as well as improved
business agility, which is particularly significant since business
requirements are constantly changing when it comes to data
analytics.
Benefits from Data Analytics Platform
• Cost reduction: 55% • Process improvement: 37% • Business agility: 33% • Better accuracy: 32% • Event monitoring: 25% • Fraud detection: 21%
Conclusions and Recommendations
Recommendations to the Big Data Buyer
• Recognize the value of unified information access and analysis in supporting fact-based decisions by individuals, groups, and systems.
• Recognize the shortcomings of operating without the right information at the right time. Use this awareness to help build the business case for addressing those shortcomings: find an anchor tenant for the project. NO ENTERPRISE-WIDE PLATFORM PROJECTS YET; LOOK TO THE CLOUD.
• Formulate a Big Data strategy that includes evaluation of decision makers' requirements, decision processes, existing and new technology, and availability and quality of data. NOT TECHNOLOGY LED.
• The application of Big Data technology will fall into two primary categories:
o doing tasks that have been done for years more efficiently (including at lower cost), and doing completely new things that were never before possible;
o driving up long-term strategic organizational value.
o Identify opportunities to apply Big Data to both.
Recommendations to the Big Data Buyer
• Beware of the confusion and hyperbolic marketing in the Big Data
market today. WE ARE AT PEAK BS.
• IT organizations will need to consider a coordinated approach to planning implementations when more than one project exists.
• It is important to develop an IT infrastructure strategy that optimizes server, storage, and network resources. Well-developed plans for networking support of Big Data projects should address optimizing the network both within a Big Data domain and in the connection to traditional enterprise infrastructure. LEGACY MATTERS.
• Consider the breadth of Big Data technologies and the functionality each technology brings to the overall portfolio of tools for collecting, accessing, analyzing, monitoring, and managing data.
Recommendations to the Big Data Vendor
• Revenue opportunities exist at all levels of the Big Data technology stack as well as in services. Services are where the bulk of the growth exists.
• Articulate your value proposition by connecting technology capabilities to
business problems or opportunities. NOT TECHNOLOGY LED.
• Big Data technology is not an end in itself. NOT TECHNOLOGY LED.
• Recognize the value of Big Data to drive employee and customer decisions and
actions.
• Decide if you want to be a niche player or enter the mainstream.
o If the former, then build a network of consultants and partners to support your
technology.
o If the latter, then build a business case that assumes eventual acquisition.
• The growth in appliances, cloud, and outsourcing deals for Big Data technology will likely mean that end users will choose new applications and services based less on the technology itself and more on the business value they deliver.
Recommendations to the Big Data Vendor
• Whether the application is based on a database or is search based, and whether
the database is row based or column based, is in-memory or disk based, or uses
SQL or NoSQL technologies will become less relevant over time. Thus
technology will provide only a short-lived competitive advantage to any vendor.
• System performance, availability, security, and manageability will all matter
greatly. However, how they are achieved will be less of a point for
differentiation.
• HPC vendors have an edge in Big Data because leading-edge data-intensive
computing has been an integral part of HPC for decades.
• Most HPC Big Data work involves established methods of analyzing increasingly
large data volume related to numerical modeling and simulation.
Recommendations to the Big Data Vendor
• Vendors should tout, not hide, their HPC histories. A number of vendors
with HPC origins and strong HPC reputations have not capitalized on these
assets when attempting to address Big Data markets outside of HPC.
• It is better to position your high-end HPC experience as a strength for
meeting the presumably less-difficult, data-intensive challenges in the
mainstream market.
• Useful tools are largely lacking for very large data sets. Tools such as
Hadoop and MapReduce can effectively expedite searches through the large,
irregular data sets that characterize some of the newer Big Data problems.
• These tools can be great for retrieving and moving through complex data,
but they do not allow researchers to take the next step and pose intelligent
questions. In addition, the going gets tough when data sets cross the 100TB
threshold.
Recommendations to the Big Data Vendor
• Sophisticated tools for data integration and analysis on this scale are largely
lacking today. There are opportunities to create tools and applications for Big
Data. Vendors that create tools and applications for use at this scale can use
them as a lever to seize market leadership positions in the Big Data market.
• Not all Big Data use cases involve analytics. Analytics may be at the heart of
most Big Data opportunities in the enterprise market, but there are also
opportunities to support operational workloads and information access
applications.
• Some of the emerging technologies and the vendors behind them will likely end
up as components or features of broader information management, access, and
analysis platforms of larger vendors. Specialized application and service
providers with localized and industry expertise will be critical to expanding the
market.
Walk or Run to Big Data? It depends on your situation. For most telcos the move to Big Data will be incremental and complementary to existing platforms and investments. Focus on the solution: the application of analytics to the business. People and process, not technology.