
High-Velocity Data – The Data Fire Hose

What is High-Velocity Data?

Computer systems are creating ever more data at increasing speeds, and there are a growing number of consumers of that data—both operations and analytics. Hadoop-style batch processing has awakened engineers to the value of big data, but they increasingly demand access to the data earlier. In essence, people not only want all of the data, they want it as soon as possible; this is driving the trend toward high-velocity data. High-velocity—or fast—data can mean millions of rows of data per second; we are talking about massive volume. One of the use cases for high-velocity data is real-time analytics.

What is Driving the Explosion in High-Velocity Data?

Data generated by humans has been growing exponentially for quite some time, fueling the growth of companies like EMC and NetApp. In fact, 90% of the world’s data was created in the last two years [http://www.sciencedaily.com/releases/2013/05/130522085217.htm]. This really demonstrates how the world has embraced big data. However, the data generated by devices and by the actions of humans—such as log files, website click-stream data, and Twitter feeds—wasn’t tracked or collected until recently, because the state-of-the-art technology couldn’t handle that data velocity.

Big Data, driven largely by Hadoop, provided a mechanism for running analytics across massive volumes of data using a batch process. This gave people a reason to store these huge amounts of data. As people began deriving value from big data, they started wanting more. They began to ask why they couldn’t process these large volumes of data in real time. This extreme level of data velocity requires new high-velocity data technologies.


What are the Sources of High-Velocity Data?

This is a list of some of the popular sources of high-velocity data today:

1. Log Files: Devices, websites, databases—any number of technologies log events. Log-mining applications like Splunk and Loggly opened people’s eyes to the value in these log files. This resulted in an increase in logging and in the richness of the data collected in these log files.

2. IT Devices: Networking devices (routers, switches, etc.), firewalls, printers—every device these days generates valuable data, assuming you can collect it and process it at scale.


3. User Devices: One of the largest sources of high-velocity data is the use of smartphones. Everything you do on your smartphone is logged, providing valuable data.

4. Social Media: Whether it is Twitter tweets, Facebook posts, Foursquare check-ins or any number of other social data streams, these create massive amounts of real-time data that degrades in value quickly.

5. Online Gaming: Another source of real-time data based on user interactions, not just with the game but also with other users. This group includes Massively Multiplayer Online Games (MMOGs) like World of Warcraft as well as 1:1 games, many played on mobile phones, like Words with Friends.

6. SaaS Applications: SaaS applications typically start with a limited set of functionality. As they mature, the functionality grows, and user relationships and interactions also grow, creating a massive flow of real-time data. LinkedIn is a perfect example of this trend. This high-velocity stream of events led LinkedIn to create Kafka, a high-throughput publish/subscribe messaging system that handles the routing and delivery of high-velocity event data.

There are many more sources of high-velocity data, including vertical sources, like the flood of GIS data found in oil and gas companies. As technologies come online to extract value from this high-velocity data, they are transforming many industries.

Managing the Flow of High-Velocity Data

The flood of high-velocity data can quickly overwhelm systems, especially during peak loads. Furthermore, most applications need certain quality guarantees (guaranteed delivery, deliver-only-once semantics, etc.). To coordinate the flow of high-velocity data, some companies use publish-and-subscribe messaging systems, sometimes paired with Complex Event Processing (CEP). Examples include the Java Message Service (JMS) and Apache Kafka, which came out of LinkedIn. If you only need to manage the flow of data, this messaging layer can help coordinate the flood of data.
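As a rough sketch of the publish-and-subscribe approach, the snippet below uses the standard Apache Kafka producer API to push an event onto a topic. The broker address, topic name ("sensor-events") and payload are illustrative assumptions, not anything specified in this article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative broker address; point this at your own Kafka cluster.
        props.put("bootstrap.servers", "localhost:9092");
        // Wait for acknowledgment from all replicas: a stronger delivery guarantee.
        props.put("acks", "all");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; any subscriber on "sensor-events" receives it.
            producer.send(new ProducerRecord<>("sensor-events",
                                               "device-42", "{\"temp\": 21.5}"));
        }
    }
}
```

Because producers and consumers only agree on a topic name, the publishing side never blocks on slow consumers, which is what lets this layer absorb peak loads.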

Processing High-Velocity Data

The desire to extract real-time insight from high-velocity data led to the creation of stream processing engines. These engines include Twitter’s Storm, Yahoo’s S4 and LinkedIn’s Samza (built on top of Kafka, described above). These engines can route, transform and analyze a stream of data at high velocity. However, they do not persist the data; instead, they provide a brief sliding window on the data. For example, they might maintain a 2-minute or 10-minute view of the data, but the amount, or time window, is limited by the velocity of the data and the size of their memory. These engines can persist the data to a database, giving you a comprehensive view of the historical data. This assumes that your chosen database can handle the data velocity.
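To make the sliding-window idea concrete, here is a minimal, framework-free sketch in Java that keeps only the events from the most recent window in memory and counts them, discarding anything older. It illustrates the windowing concept only; it is not code from Storm, S4 or Samza.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Counts events seen within the last windowMillis milliseconds. */
public class SlidingWindowCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Record an event arriving now. */
    public synchronized void record() {
        long now = System.currentTimeMillis();
        timestamps.addLast(now);
        evict(now);
    }

    /** Number of events currently inside the window. */
    public synchronized int count() {
        evict(System.currentTimeMillis());
        return timestamps.size();
    }

    // Drop events that have fallen outside the window. This eviction is why a
    // stream engine's view of the data is brief and memory-bounded.
    private void evict(long now) {
        while (!timestamps.isEmpty()
               && now - timestamps.peekFirst() > windowMillis) {
            timestamps.removeFirst();
        }
    }
}
```

A 2-minute view is `new SlidingWindowCounter(2 * 60 * 1000)`. Note that memory use grows with event rate multiplied by window length, which is exactly the velocity-versus-memory trade-off described above.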

Persisting High-Velocity Data…the Database


Traditional Database Management Systems (DBMS) simply cannot handle the high-velocity data coming from modern applications. This is a data ingestion problem; think of a human sipping from a firehose and you’ll get the idea. Hadoop provides batch processing of high-volume data, but when dealing with high-velocity data you need real-time processing. This has led to a few innovations.

Add a SQL Interface to Hadoop

The demand for persisting and querying high-velocity data in real time has led a number of companies to add limited SQL interfaces to Hadoop. Examples of this approach include Apache Tez (Hortonworks), Impala (Cloudera), Hadapt and Apache HBase. Hadoop and HDFS weren’t designed for database requirements—in fact, their storage is based on large files, not small blocks—but corporate demand for a solution to the high-velocity data ingestion problem is certainly strong. Hadoop is really optimized for data volume, not data velocity.
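As one sketch of real-time ingest on the Hadoop side, the snippet below writes a single event into Apache HBase (listed above) using its standard Java client. The table name, column family and row-key scheme are assumptions invented for this example, not anything prescribed here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseIngest {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // Row key combines a device id with a timestamp so rows sort in
            // time order per device (an illustrative scheme, not required).
            Put put = new Put(
                Bytes.toBytes("device-42#" + System.currentTimeMillis()));
            put.addColumn(Bytes.toBytes("d"),
                          Bytes.toBytes("color"),
                          Bytes.toBytes("red"));
            table.put(put);
        }
    }
}
```

Writes like this land in memory first and flush to HDFS in large files, which is how HBase works around HDFS's large-file orientation noted above.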

NoSQL

NoSQL is one solution to the high-velocity data ingest problem. The challenge NoSQL faces is the same challenge faced by Hadoop: corporations have standardized upon, and built expertise and tools around, SQL, which doesn’t work with NoSQL databases.

In-Memory DBMS

In-memory databases eliminate the slowest piece of the traditional database—the disk—enabling databases to ingest data at a much higher rate than traditional databases. The two big contenders in the in-memory database world are HANA (SAP) and TimesTen (Oracle). However, in-memory databases are ill-suited to high-velocity data because their data size is limited to memory; they simply cannot handle the volume of data created by a high-velocity data source.

Extending MySQL to Handle High-Velocity Data: ScaleDB

Traditional databases, like MySQL, do not deliver sufficiently high data ingest rates to persist high-velocity data. ScaleDB changes all of that. ScaleDB extends MySQL without changing a single line of MySQL code, so the entire ecosystem (tools, applications, etc.) works with ScaleDB. ScaleDB’s new Streaming Table™ technology enables a small cluster of MySQL databases to ingest millions of rows of data per second. This data is then available for real-time manipulation using the rich tools that are already part of the MySQL ecosystem, such as Tableau Software [http://www.tableausoftware.com/], QlikView [http://www.qlikview.com/] and Logi Analytics [http://www.logianalytics.com/].

In addition to running leading analytics tools, persisting the data in a database gives you the ability to query the data in an ad hoc fashion. If we use the example of a flow of colored balls, a stream processor can count green balls, or it can transform all data about red balls into orange balls. However, if you want to ask questions of the data across a time series, you need database functionality. For example, using a database you can ask how many red balls were preceded by green balls, or how many orange balls were processed in the last hour, or any number of questions at any level of detail you need, all in an interactive fashion.
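A minimal sketch of such an ad hoc query through standard MySQL tooling follows (which, per the article, would also work against ScaleDB, since it extends MySQL unchanged). It asks how many orange balls were processed in the last hour; the connection string, table and column names are invented for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AdHocQuery {
    public static void main(String[] args) throws Exception {
        // Illustrative connection string; any MySQL-compatible endpoint works.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/streams", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT COUNT(*) FROM balls " +
                 "WHERE color = ? AND observed_at >= NOW() - INTERVAL 1 HOUR")) {
            stmt.setString(1, "orange");
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    System.out.println(
                        "Orange balls in the last hour: " + rs.getLong(1));
                }
            }
        }
    }
}
```

Unlike the sliding-window counter shown earlier, this question can be asked at any time, over any time range the database retains, without having decided on it before the data arrived.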


Selecting the Right High-Velocity Data Tool for Your Needs

Conclusion

High-velocity data, over time, accumulates to create big data. Think of high-velocity data as the firehose, pumping out water that forms a pond that represents big data. Hadoop has gained popularity for providing batch-oriented processing of big data. But batch processing is deficient in that it does not provide real-time processing or ad hoc queries.

Several classes of applications are generating high-velocity data for which Hadoop-style batch processing is insufficient. For example, a Massively Multiplayer Online Game (MMOG) might require a high-velocity data solution that serves multiple use cases: (1) maintaining player state during and between sessions; (2) generating real-time analytics as a mechanism for modifying game play or informing operations; (3) supporting ad hoc queries from customer support; (4) providing real-time action-based billing; and more. In this case a brief moving window of time, as provided by stream processing engines, is insufficient; it requires high-velocity streaming persistence with an ad hoc—ideally SQL-based—interface.

Hadoop opened up whole new possibilities for extracting value from big data, or high-volume data. This led more and more companies to start collecting massive data, because they could extract value from it. The new wave of high-velocity data tools enables companies to extract real-time value from high-velocity data, instead of waiting for it to pile up and then running a batch process on it. Look for more companies to recognize this opportunity to drink upstream from their competition, using high-velocity data to make them more agile, responsive and ultimately more competitive.

© Copyright 2014 ScaleDB