
Delivering Operational Analytics Using Spark and NoSQL Data Stores

Prepared for:

By Mike Ferguson Intelligent Business Strategies

November 2015

WHITE PAPER


Table of Contents

The Changing Landscape Of Scalable Operational And Analytical Systems
    Highly Scalable Analytical Processing
    Highly Scalable On-Line Operational Applications
        The Impact Of Web And Mobile Commerce
        The Need To Process And Analyse Transactional And Non-Transactional Data At Scale
    The Emergence Of NoSQL Data Stores For Scalable Processing
        Types of NoSQL Databases
    Closed Loop Integration Is Still Needed
Big Data Analytics
    Hadoop And The Emergence Of Apache Spark
    Spark As An In-Memory Analytical Platform For Operational And Exploratory Analytics
The Value of Operational Analytics
    Business Drivers For Operational Effectiveness
    Types of Operational Analysis
    Key Requirements For Operational Analytics
    The Importance Of In-Memory Processing
Operational Analytics using The Basho Data Platform and Apache Spark
    Storage Instances
    Core Services
    Service Instances
        Redis
        Apache Spark
        Apache Solr
    Operational Analytics Using Spark In The Basho Data Platform
Conclusions


THE CHANGING LANDSCAPE OF SCALABLE OPERATIONAL AND ANALYTICAL SYSTEMS

Up until perhaps 2005, the vast majority of both operational transaction processing applications and analytical systems, at least in medium and large enterprises, were built on top of general purpose relational databases, as these were seemingly able to cater for a wide spectrum of workloads. However, two forces were at work that started to change this dynamic. These forces did not occur at the same time, but both had a similar impact: each disrupted the norm of using a general purpose database for all workloads.

HIGHLY SCALABLE ANALYTICAL PROCESSING

The first was in analytical systems. As data warehousing pushed forward into the twenty-first century, analytical workloads started to move beyond basic query and reporting to become more complex, leveraging the power of new and improved SQL constructs. What became clear was that general purpose, row-oriented relational databases were not particularly well suited to complex analytical workloads, especially as historical transaction data volumes grew. Analytical queries also often required only a small number of columns, whereas most general purpose databases could only store and retrieve complete rows of data (i.e. all columns), a gap that columnar databases were designed to close. In addition, dedicated hardware and massively parallel relational database software were starting to do a better job of analytical processing than the general purpose database offerings on the market. Eventually, workload-optimised systems emerged specifically aimed at analytical processing, and massively parallel analytical databases became the accepted way forward.

But we were not finished. Data volumes continued to rise, and when demand came to analyse new sources of high volume, complex multi-structured data, a new NoSQL platform emerged in the form of Hadoop that opened up the Big Data Analytics market. At first, we were limited to MapReduce batch analytical processing in this environment, but when Hadoop 2.0 and Apache Spark appeared, the floodgates opened in the analytics market. This was again an example of a workload-optimised system, this time aimed at low cost, extreme analytical processing. We will discuss Hadoop and Spark in more detail later.

HIGHLY SCALABLE ON-LINE OPERATIONAL APPLICATIONS

The second force at work was in operational transaction processing systems. Originally, bank tellers and similar staff in other industries dictated the rate at which transactions were entered into on-line transaction processing (OLTP) systems. The emergence of the web changed all that. The web gave companies a way to cut front-office costs and reach the customer directly by 're-facing' OLTP systems to make them accessible via a browser. In this way, customers could enter transactions themselves.

The Impact Of Web And Mobile Commerce

The result was that customer facing e-commerce applications and the electronic shopping cart were born, forcing companies to re-architect operational applications to cater for a major leap in concurrent usage.

General purpose relational DBMSs have been widely used to support operational and analytical workloads

Analytical DBMSs emerged as the accepted way forward as analytical workloads became more complex and data volumes grew

Demand to analyse high volume, complex multi-structured data has resulted in Hadoop and Apache Spark being at the forefront of Big Data Analytics

Web-enablement of OLTP systems caused a major increase in concurrent usage as customers were able to directly transact business on-line for the first time

Page 4: Operational Analytics Using Spark And NoSQL White Paper V5 · rise, and when demand came to analyse new sources of high volume, complex multi-structured data, a new NoSQL platform

Delivering Operational Analytics Using Spark and NoSQL Data Stores

Copyright © Intelligent Business Strategies Limited, 2015, All Rights Reserved 4

Load balancing across multiple application server instances emerged to help scale out these applications but, much like analytical systems, we were not done yet. In the 2007/2008 timeframe we saw the arrival of iOS and Android based smart phones and mobile applications (apps). Tablets followed, and it wasn't long before most organisations realised the implications for customer facing OLTP systems. They too needed to become accessible from mobile devices to allow people to transact business on the move and not just from a desktop computer. The emergence of mobile commerce took transaction volumes up to levels not seen before. It also meant that most on-line operational systems were now expected to cater for masses of concurrent users, often well beyond any limits that these applications had been designed to cope with or had ever been tested against.

The Need To Process And Analyse Transactional And Non-Transactional Data At Scale

Since the emergence of the web and then mobile devices, there has been explosive growth in the rates at which session data and transactions need to be captured. OLTP systems also now store more than just transaction data. Today, customer facing applications provide a much richer user experience, storing non-transaction data as well as transaction data: for example, session data, user profiles, shopping cart contents and product reviews.

In an e-commerce application, every user's shopping cart has to be stored. During a shopping visit people put products into their shopping cart, take them out again, put some back in again and so on, all before going to checkout. If every user does that, we need to scale reads and writes of shopping cart data for hundreds of thousands (or sometimes millions) of concurrent users on a 24 x 365 basis. Then of course, we want to capture and analyse on-line behaviour, often while a user is still browsing, to enable us to serve up personalised recommendations. This requires user clickstream data. On sites like Alibaba, eBay and Amazon, the rate at which clickstream data is being generated is huge. Nevertheless, it needs to be captured and analysed to understand on-line behaviour prior to checkout, to make real-time personalised offers and to avoid abandoned shopping carts.

In addition to web and mobile commerce, organisations also want to analyse other kinds of non-transactional data. This includes sensor data. Sensors are being deployed in utility grids, oil and gas pipelines, production lines and in distribution and supply chains to monitor and optimise processes. They are also being embedded in products, e.g. smart phones (GPS sensor), industrial equipment, hospital equipment, cars, aircraft engines, fitness wristbands, watches, fridges etc. The Internet of Things (IoT) is upon us, with billions of sensors already emitting data and billions more yet to come.

This is having a dramatic impact on the rates at which non-transactional big data is being generated. To give an example, 10,000 sensors emitting readings 10 times per second create 100,000 pieces of data per second. That is 6,000,000 pieces of data a minute. Now think about all the smart phones in the world emitting data. It is a tsunami of data, some subset of which organisations want to capture for further analysis.

The emergence of mobile devices and mobile applications took the level of concurrent users of OLTP systems to a new high

Web and mobile access to OLTP systems has seen a huge growth in the need to capture both transaction and non-transaction data

Changes to shopping cart data need to be recorded during a session for every concurrent user

Non-transactional clickstream data is needed to track on-line behaviour, prevent abandoned shopping carts and to facilitate recommendations

Non-transactional data such as sensor data is also in demand to help optimise operations

Clickstream and sensor data is very high in velocity and volume


Even if only a subset is captured, the velocity of this type of data is staggering, and the idea that general purpose databases can cater for it is just not feasible in most cases.

Taking all this into account, it is not surprising that, in a similar way to analytics, the need for workload optimised systems for highly scalable operational applications also emerged.

THE EMERGENCE OF NOSQL DATA STORES FOR SCALABLE PROCESSING

With examples such as these, companies started to look around for alternative technologies to scale operational systems to capture, stream and store the required transactional and non-transactional data. These systems need to:

• Support peak transaction rates
• Support peak capture of non-transactional data, e.g. shopping cart data
• Support peak data arrival and ingestion rates, e.g. sensor data

It is this demand that gave rise to the emergence of NoSQL databases.

Types of NoSQL Databases

Generally speaking, there are several types of NoSQL databases:

• Key Value data stores
• Document databases
• Column Family databases
• Graph databases
• Multi-modal databases

Key value stores are very useful when there is a requirement to accelerate a specific application to support high-speed read and write processing of non-transactional data. A key strength of key value stores is their simplicity. There is a key and a value. Access to the data is via the key, and the application controls what is stored in the value: it can be a single data field (e.g. a name), an image, a JSON document, etc. The developer is in complete control over navigating the data in the application. Data is partitioned and replicated across a cluster to get scalability and availability. For this reason, key value stores often do not support transactions1. They are, however, good at scaling applications dealing with high velocity non-transactional data, especially if hash partitioning and in-memory processing are used. Popular use cases are simple session storage (e.g. shopping carts), ad serving, log data, sensor data, storing user accounts, settings and preferences, storing events and timelines, and storing blog posts.
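To make the access model concrete, the following is a minimal Scala sketch of key value semantics, using an in-memory map as a stand-in for a distributed KV store; the bucket-style key naming and the JSON cart value are purely illustrative.

```scala
// A minimal sketch of key value semantics. The store only understands
// get/put by key; the application decides what the opaque value contains.
object KvSketch {
  private val store = scala.collection.concurrent.TrieMap.empty[String, Array[Byte]]

  def put(key: String, value: Array[Byte]): Unit = store.put(key, value)
  def get(key: String): Option[Array[Byte]] = store.get(key)

  def main(args: Array[String]): Unit = {
    // The value here is a JSON shopping cart, but the store neither knows
    // nor cares: it could equally be an image or a single field.
    put("cart:user-42", """{"items":["sku-1","sku-9"]}""".getBytes("UTF-8"))
    get("cart:user-42").foreach(v => println(new String(v, "UTF-8")))
  }
}
```

In a real distributed store the map would be hash partitioned and replicated across nodes, which is exactly what makes navigation the application's job rather than the database's.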

Document databases are typically associated with storing self-describing JSON, XML and BSON documents. They are similar to key value stores, but there is a difference: here there is a key and the value is a document. Typically the entire entity is in a document. Each document can have the same or a different structure, and popular fields in the document can be indexed to speed retrieval without knowing the key.

1 Transactions need to support Atomicity, Consistency, Isolation and Durability (ACID). However, replication means consistency cannot always be guaranteed. Therefore KV stores often support BASE (Basically Available, Soft state, Eventual consistency) instead. Note that some key value stores let you control the degree of replication and the number of nodes responding to read and write requests. This means you can control the trade-off between consistency, availability and partition tolerance (CAP).

Workload optimised systems are also needed for scalable high performance operational applications

The demand for alternative technologies to help scale operational systems has given rise to the emergence of NoSQL databases

There are several types of NoSQL databases

Key Value stores are used to accelerate specific applications requiring high speed read/write processing of non-transactional data

Good for session storage, ad serving, log data and sensor data

Document databases store self-describing data such as JSON, XML and BSON. In-place updates of documents are also supported


Also, many document DBMSs support in-place updates, with some also supporting ACID1 transaction properties.

Column Family databases group columns of related data together into families, with multiple column families in one table. This is like heavily de-normalised data where a single row has multiple column families. Typically they store all columns in a column family together on disk, so you can fetch multiple rows of one column family in a single read operation. Also in these systems, the number of columns in a family can be static or dynamic. The latter allows the number of columns in a column family to change dynamically at run time for every row in the database. A good example would be a database of tweeters and their followers. The number of followers will vary for every tweeter and can constantly change; for some it could run into thousands of columns. However, all of them can be retrieved in a single operation, because only the column families holding the columns required by a query are read.
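As a rough illustration of the dynamic column family idea, the sketch below models rows as maps of named families, each holding a per-row variable set of columns. This is a conceptual stand-in, not the storage format of any particular column family database.

```scala
// A minimal sketch of the column family model with a dynamic family:
// each row key maps to named families, and each family holds a variable
// set of column -> value pairs that can differ from row to row.
object ColumnFamilySketch {
  type Family = Map[String, String] // column name -> value
  type Row    = Map[String, Family] // family name -> columns

  def main(args: Array[String]): Unit = {
    val rows: Map[String, Row] = Map(
      "@alice" -> Map(
        "profile"   -> Map("name" -> "Alice", "joined" -> "2012"),
        "followers" -> Map("@bob" -> "2014-01-10", "@carol" -> "2015-03-02")
      ),
      "@bob" -> Map(
        "profile"   -> Map("name" -> "Bob"),
        "followers" -> Map("@alice" -> "2013-07-21") // a different column set
      )
    )
    // Fetching one family for a row retrieves all of its columns in one read.
    println(rows("@alice")("followers").keys.mkString(", "))
  }
}
```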

Graph databases are typically used for graph analysis, which is often applied in fraud detection and in social network influencer analysis. They often store data in what is known as a 'triple store', where relationships between objects are represented.

Multi-modal databases provide some combination of the others and therefore can support a wider range of applications.

CLOSED LOOP INTEGRATION IS STILL NEEDED

Despite the emergence of workload optimised data stores in both analytical and operational data processing, the requirement for most companies remains unchanged. 'Closed loop' integration between operational and analytical systems (shown in Figure 1) is still needed, because businesses need to use analytics and insight in the context of everyday operations to improve the effectiveness of decision making. The difference now is that the closed loop can involve both NoSQL and relational systems, or even just NoSQL systems. In addition, with customer interactions now increasingly on-line, there is a growing need to analyse the data on a continuous basis as soon as possible after it is created, i.e. in either true real-time or near real-time.

Figure 1: Closed loop integration between operational and analytical systems now includes NoSQL technologies. Scalable operational applications (relational and NoSQL systems) feed new data to scalable analytical systems (relational and NoSQL systems), which feed new insights back.

Related columns can be grouped together in families

Column family databases can store very wide tables with millions of columns and billions of rows

The number of columns per family can vary for every row in a table

Graph databases are used for very specific analytical workloads

Closed loop processing now involves both relational and NoSQL systems

The requirement now is to implement closed loop processing at scale


BIG DATA ANALYTICS

HADOOP AND THE EMERGENCE OF APACHE SPARK

In the world of big data analytics, Apache Hadoop has taken centre stage as a low cost analytical platform for processing large volumes of structured and multi-structured data. A classic Hadoop system with its Hadoop Distributed File System (HDFS) is shown in Figure 2. Three popular use cases that have emerged for Hadoop are:

• As a data refinery to prepare and integrate data for analysis

• As an exploratory analytical platform

• As a data archive e.g. for archiving data warehouse data

Figure 2: A Hadoop system. Files in HDFS are partitioned across the cluster and accessed via Java MapReduce APIs (HDFS, HBase, Cascading), webHDFS (an HTTP interface to HDFS with REST APIs), Pig Latin scripts, and SQL from BI tools and applications via vendor SQL-on-Hadoop engines, with MapReduce, Tez, HBase and Storm analytic applications running on YARN.

Historically this was a platform for batch analysis of structured and multi-structured data, as the only option was to write Java analytical applications to run in parallel on the MapReduce execution framework. Declarative scripting languages like Pig were also compiled down to execute on MapReduce. Hive provided a SQL interface for BI tools to access HDFS data, but again it ran on MapReduce. However, Hadoop 2.0 with YARN2, which arrived in 2013, broke the dependency on MapReduce, giving rise to new execution engines such as Tez. Most of these were aimed at specific analytical workloads, e.g. Storm for streaming analytics.

SPARK AS AN IN-MEMORY ANALYTICAL PLATFORM FOR OPERATIONAL AND EXPLORATORY ANALYTICS

However, Apache Spark is different. This is a new open source, scalable cluster-computing framework for massively parallel, in-memory processing that caters for multiple analytical workloads.

2 Yet Another Resource Negotiator: often referred to as Hadoop’s “operating system”

A"Hadoop"System"

Java"MapReduce"APIs"to"HDFS,"HBase,"Cascading"

file" file" file" file" file"

file" file" file" file" file"

file file"

file" file"

webHDFS"(An"HTTP"interface"to"

HDFS"has"REST"APIs)"

HDFS"

file"

file"

file

file"

YARN

PIG"laJn"scripts"

SQL"

Vendor SQL on Hadoop engine

MapReduce"ApplicaJon"

index"index"Index"parJJon"

SQL"

BI Tools"&"Applications

Storm

Analytic Application

YARN

MapReduce HBase Tez"

Copyright © Intelligent Business Strategies 1992-2015!

Apache Hadoop has taken centre stage in big data analytics

Three popular use cases have emerged for Hadoop

The Hadoop Distributed File System partitions data across servers in a Hadoop cluster

Hadoop was originally used for massively parallel batch analysis on multi-structured data using MapReduce processing

Alternatives to MapReduce emerged in 2013 each aimed at specific analytical workloads


Apache Spark consists of a number of components:

• Spark Core – the foundation of Spark, providing distributed task dispatching, scheduling and basic I/O
• Spark Streaming – for analysis of real-time streaming data
• Spark Machine Learning Library (MLlib) – a library of pre-built analytic algorithms that can run in parallel across a Spark cluster
• Spark SQL + DataFrames – lets you query structured data inside Spark programs using either SQL or the familiar DataFrames API, usable in Java, Python, Scala and R
• GraphX – a graph analysis engine running on Spark
• SparkR – for executing custom analytics written in R

These components are shown in Figure 3 below.

Figure 3: Apache Spark. Spark Core underpins Spark SQL + DataFrames, Spark Streaming, MLlib (machine learning), GraphX (graph computation) and R support, accessed from applications and BI tools via SQL, Python, Scala, Java and R.

Data from a range of external sources like streaming data, Hadoop HDFS files, Amazon S3 and NoSQL data stores can be partitioned, distributed across multiple machines and held in-memory on each node in a Spark cluster. The distributed in-memory data is referred to as a Resilient Distributed Dataset (RDD). RDDs can also be created by applying coarse-grained transformations (e.g. map, filter, reduce, join) to existing RDDs; Spark records the sequence of transformations that produced each RDD (its lineage) so that lost partitions can be recomputed. One of the key strengths of Spark is the ability to build in-memory analytic applications that combine different kinds of analytics to analyse data. For example, you could read unstructured text data into memory, perform text analytics to extract entities from the text, apply a schema to the extracted data, access it using Spark SQL, analyse it with a machine learning algorithm and write it back out to disk in a columnar file format for use by interactive query tools.
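As a short illustration of this chain, the Scala sketch below, using the Spark 1.x API current when this paper was written, builds an RDD from a file, derives a new RDD through coarse-grained transformations, then applies a schema and queries the same data with Spark SQL. The HDFS path and the comma-separated record format are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RddLineageSketch {
  case class PageView(userId: String, url: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Base RDD from external storage (hypothetical HDFS path).
    val raw = sc.textFile("hdfs:///data/clickstream.log")

    // Coarse-grained transformations create new RDDs; Spark records the
    // lineage that produced them so lost partitions can be recomputed.
    val views = raw.map(_.split(","))
      .filter(_.length == 2)
      .map(f => PageView(f(0), f(1)))

    // Apply a schema, then query the same in-memory data with Spark SQL.
    val df = views.toDF()
    df.registerTempTable("views")
    sqlContext.sql("SELECT userId, COUNT(*) AS views FROM views GROUP BY userId")
      .show()

    sc.stop()
  }
}
```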


Apache Spark was the first in-memory execution engine to support multiple analytical workloads across a cluster

Apache Spark supports massively parallel, in-memory machine learning, graph analysis, interactive SQL query processing and streaming analytics

Scalable analytic applications can be built on Apache Spark that combine multiple types of analysis on big data in Hadoop HDFS and other NoSQL databases


THE VALUE OF OPERATIONAL ANALYTICS

Now that we have a good understanding of why the operational and analytical landscapes have changed, this section looks at why companies need to combine scalable operational and analytical systems for competitive advantage, and how they can do that using operational analytics. It also looks at what requirements need to be met to make this possible.

BUSINESS DRIVERS FOR OPERATIONAL EFFECTIVENESS

The need to cater for huge numbers of concurrent users transacting business on-line has resulted in many companies combining relational DBMSs and workload optimised NoSQL data stores to capture transactional and non-transactional data respectively at scale. In addition, companies are using NoSQL data stores to capture high velocity non-transactional data such as clickstream and sensor data. The result is that a much richer set of data now exists in operations, and new insights can be produced by analysing it. If done in real-time or near real-time, these insights can help increase revenue, reduce cost and optimise operations.

As an example, in customer facing OLTP systems it should be possible to analyse customer on-line behaviour and social activity together with previous transaction activity to better engage with customers. Analytics could be applied to real-time and historical interaction data to determine customer-specific behaviour patterns, which can be used to help drive up customer value. Also, by analysing what is put into and taken out of an on-line shopping cart in real-time, it is possible to do real-time basket analysis. This means more accurate personalised offers can be served up, which may result in customers buying more products. It may also be possible to reduce the number of abandoned shopping carts in this way. Increasing the value of the basket and reducing abandonments both result in increased transaction value.

There is no doubt that operational analytics add value. The challenge, in this modern world, is to be able to do it at scale. In order to do that, we first of all need to define types of operational analytics and then define key requirements for using them in this new hybrid relational and NoSQL world.

TYPES OF OPERATIONAL ANALYSIS

There are several types of operational analysis. The main ones are:

• Simple operational reporting of current position/state e.g. session state

• Situational awareness via visualisation of live operational data

• On-demand analytics of live operational and/or historical data

• On-demand recommendations

• Event stream processing to monitor, automatically analyse and act on events in real-time to prevent problems arising and to optimise business operations

Operational analytics allows organisations to integrate analytics into operational systems

Both analytical and operational systems can now scale to handle big data workloads

Companies want to combine operational and analytical processing at scale to improve customer engagement, reduce risk and optimise operational effectiveness

There are several types of operational analysis

Page 10: Operational Analytics Using Spark And NoSQL White Paper V5 · rise, and when demand came to analyse new sources of high volume, complex multi-structured data, a new NoSQL platform

Delivering Operational Analytics Using Spark and NoSQL Data Stores

Copyright © Intelligent Business Strategies Limited, 2015, All Rights Reserved 10

KEY REQUIREMENTS FOR OPERATIONAL ANALYTICS

With these in mind, the following important requirements need to be met. It should be possible to:

• Integrate operational analytics into scalable operational applications

• Monitor operational non-transactional and transactional events as they happen via automatic analysis

• Support automatic analysis via on-demand or event-driven use of predictive and statistical models

• Automatically interpret the outcomes of predictive / statistical models and support rule-driven automatic actions to automate decision making

• Scale to support large numbers of concurrent users invoking analytical services on SQL and/or NoSQL data on-demand from operational applications, mobile devices and BI tools

• Run analytics on a 24x365 basis if operational applications requesting insights from those analytics are available around the clock

• Support human led analysis and operational reporting via BI tools

• Run predictive and statistical models as close to the data as possible for better performance

• Support event stream processing to monitor operational activity and automatically analyse events as soon as they occur

• Combine automated analysis with rule-driven automated actions to create prescriptive analytics that can be invoked on-demand, or on an event driven basis

• Automatically invoke alerting services (e.g., email, SMS), transaction services and/or whole business processes as part of an action

• Integrate real-time event data and alerts into operational applications and BI tools to provide early warning alerts, etc.

• Schedule automated analysis and action taking at user defined intervals to automatically identify opportunities on a timer-driven basis

• Exploit big data technologies for massively parallel analysis on large volumes of non-transactional data to produce new insights

• Store all relevant data together in NoSQL data stores to speed up execution of operational analytics

THE IMPORTANCE OF IN-MEMORY PROCESSING

Given the nature of operational analytics, it goes without saying that performance is a critical success factor, especially when dealing with large numbers of concurrent users and integrating on-demand analytics into high read/write applications. Similarly, event driven operational analytics on high velocity data like clickstream and sensor data also needs high speed processing. One of the key mechanisms for making this possible is in-memory processing, especially if data can be processed in parallel. Memory is mission critical in stream processing, significantly speeding up access to streaming operational data. It also works well when operational data is stored in compressed columnar format.

Automated analysis is needed for on-demand recommendations and event processing

An operational analytics system must be capable of managing large numbers of concurrent users and offer high availability

Predictive and statistical models need to run close to the data for better performance

Automated analysis and rule-driven automated decisions are both needed to create decision services

Keeping relevant related data together in NoSQL data stores speeds up analytical processing

In-memory processing is becoming mission critical for scalable operational systems and operational analytics


OPERATIONAL ANALYTICS USING THE BASHO DATA PLATFORM AND APACHE SPARK

Having set out the requirements, this section looks at how one vendor steps up to meet them in order to analyse operational data. That vendor is Basho.

Basho Technologies is a distributed systems company. It was founded in 2008 and has many large enterprise customers. Basho is the creator of the Basho Data Platform shown in Figure 4.

Figure 4: The Basho Data Platform (source: Basho). Core Services (replication & synchronisation, message routing, cluster management & monitoring, logging & analytics, internal data store) underpin Storage Instances (Riak KV key/value, Riak TS time series, Riak S2 object storage) and Service Instances (Apache Spark, Redis caching, Apache Solr, plus third-party web services and integrations).

The Basho Data Platform includes:

• Basho Data Platform Core Services
• Basho Data Platform Storage Instances
• Basho Data Platform Service Instances

STORAGE INSTANCES

There are multiple storage technologies offered as part of the Basho Data Platform. These include Riak KV, Riak TS and Riak S2.

Riak KV is Basho's flagship distributed NoSQL Key Value data store. It is a configurable system that automatically combines hash partitioning and triple replication of data across a cluster to provide scalability, performance, high availability and fault tolerance. Multi-datacentre replication is also supported. A Riak KV cluster consists of physical nodes, each with multiple virtual nodes (vnodes). Data is hashed and written to virtual nodes across the cluster, which is often known as a Riak Ring. Availability can also be configured to favour specific application workloads: you can specify how many nodes an object is replicated to, how many nodes must respond to a read request before results are returned to an application, and how many nodes must acknowledge each write.
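As an illustration of this tunable behaviour, below is a Scala sketch based on the Riak Java client 2.x API. The node address, bucket name, key and value are illustrative assumptions, and class names should be checked against the client version in use.

```scala
import com.basho.riak.client.api.RiakClient
import com.basho.riak.client.api.cap.Quorum
import com.basho.riak.client.api.commands.kv.{FetchValue, StoreValue}
import com.basho.riak.client.core.query.{Location, Namespace, RiakObject}
import com.basho.riak.client.core.util.BinaryValue

object QuorumSketch {
  def main(args: Array[String]): Unit = {
    val client = RiakClient.newClient("127.0.0.1") // assumed local Riak node
    try {
      val loc = new Location(new Namespace("carts"), "user-42")

      // Write: require acknowledgement from 2 of the (default 3) replicas.
      val obj = new RiakObject().setValue(BinaryValue.create("""{"items":["sku-1"]}"""))
      val store = new StoreValue.Builder(obj)
        .withLocation(loc)
        .withOption(StoreValue.Option.W, new Quorum(2))
        .build()
      client.execute(store)

      // Read: 2 replicas must respond before the value is returned.
      val fetch = new FetchValue.Builder(loc)
        .withOption(FetchValue.Option.R, new Quorum(2))
        .build()
      val resp = client.execute(fetch)
      println(resp.getValue(classOf[String]))
    } finally client.shutdown()
  }
}
```

Raising R and W tightens consistency at the cost of latency and availability; lowering them does the reverse, which is the CAP trade-off described in the footnote earlier.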

Riak TS is Basho's new distributed Key Value data store optimised for time series data. By co-locating time series data, Riak TS provides faster reads and writes, making it easier to store, query and analyse time and location data.


The Basho Data Platform uses in-memory processing for both operational and analytical processing

The Basho Data Platform can store data in its Riak KV key value data store or in Riak TS optimised for time series data

Data in Riak KV is partitioned and replicated for scalability and high availability


Like Riak KV, Riak TS is highly available and scalable.

Riak S2 is a multi-tenant, distributed large object storage for public, private or hybrid clouds, and is compatible with Amazon S3 and OpenStack Swift for easy integration into existing production workloads. In the future Riak S2 will also be integrated with the Basho Data Platform Service instances.

CORE SERVICES

Basho Data Platform Core Services are used to deploy, manage and synchronise data in and between Storage Instances (Riak KV, Riak TS) and Service Instances (Apache Spark, Redis, Apache Solr). They include the following functionality:

• Data replication & synchronisation
• Cluster management & monitoring
• Internal data store
• Message routing
• Logging and analytics

Data replication and synchronisation replicates and synchronises data within and across Riak KV, Redis and Spark clusters. For Redis queries, when the data is not found in the Redis cluster it is read from Riak KV, returned to the client application and synchronised into Redis. Spark data is persisted in Riak KV, so Spark applications can execute queries against data imported from Riak KV as well as against existing Spark RDDs.

Cluster management and monitoring automates the download, build and deployment of Riak KV, Apache Spark and Redis clusters. It automatically detects incidents within and across clusters, automatically restarts clusters, and automatically scales clusters as data grows. For Spark cluster management, Riak KV can be used for leader election, eliminating the need for ZooKeeper.

Message routing is provided by a distributed messaging system that persists and routes messages across platform clusters.

Logging and analytics allows Basho-generated event log data to be captured and analysed to help understand cluster data flows and to tune the cluster.

SERVICE INSTANCES

In addition to core services, the Basho Data Platform also includes add-on service instances. These have allowed Basho to extend the initial data store capabilities of Riak KV to support a broader set of capabilities that together have helped create the Basho Data Platform. Three add-on service instances exist at present, with the potential to add more. They are:

• Redis – integrated in-memory caching for faster application performance
• Apache Spark – integrated in-memory analytics for Riak KV and Riak TS data
• Apache Solr – search based query processing on Riak data using Solr indexes

Core services are used to deploy, manage and synchronise data across the Basho Data Platform

Data can be replicated and synchronised across Riak KV, Redis and Spark clusters

Automated cluster management to simplify administration

Logging and analysis to help tune the cluster

The Basho Data platform includes three service add-ons

Both Redis and Apache Spark use in-memory processing


Redis

Redis provides in-memory data caching to accelerate high performance read/write operational applications accessing data in Basho Data Platform Storage Instances. If data is not in the Redis cache, applications transparently read through to Riak KV using a Basho Data Platform Cache Proxy (see Figure 4). There is no need to switch to a different API, which means developers building Redis applications don't need to do anything special to get at Riak KV data. In addition, the newest release of the Basho Data Platform includes support for write-through caching. This takes consistency of data caching to a new level by automatically updating Riak KV upon writes to Redis memory.
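The Cache Proxy performs this read-through transparently; purely to make the behaviour concrete, the sketch below shows the equivalent pattern coded by hand in Scala with the Jedis Redis client. The fetchFromRiak stand-in and the 300-second TTL are illustrative assumptions.

```scala
import redis.clients.jedis.Jedis

// Read-through caching done by hand: check the cache first, and on a
// miss fetch from the backing store and populate the cache with a TTL.
object ReadThroughSketch {
  private val redis = new Jedis("localhost") // assumed local Redis instance

  // Hypothetical stand-in for a Riak KV fetch.
  def fetchFromRiak(key: String): String = """{"items":["sku-1"]}"""

  def get(key: String): String =
    Option(redis.get(key)).getOrElse {
      val value = fetchFromRiak(key) // cache miss: go to the store
      redis.setex(key, 300, value)   // populate the cache, expire after 300s
      value
    }

  def main(args: Array[String]): Unit =
    println(get("cart:user-42"))
}
```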

Apache Spark

In addition to Redis in-memory caching, which is aimed at accelerating high performance read/write operational applications, the Basho Data Platform also offers an in-memory analytics capability via its Apache Spark add-on service instance. Riak KV, as part of the Basho Data Platform, performs high availability management of a Spark cluster without the need for ZooKeeper, and an Apache Spark connector provides access to Riak KV and Riak TS data from Apache Spark applications.

Apache Solr

In addition to the Apache Spark add-on service instance, the Basho Data Platform also includes an Apache Solr service instance. This is a distributed search capability that lets you perform searches on data values in a Riak KV data store. It works by first indexing the data by value across a Riak KV cluster. Known operational queries can then be run against the Solr indexes rather than against the data in Riak KV itself. Indexes are automatically kept up-to-date in the background as changes occur to Riak data. Also, given that Riak KV does not support joins, de-normalisation in the form of pre-joins, aggregations and nested data is needed to help deal with queries that have these requirements.

OPERATIONAL ANALYTICS USING SPARK IN THE BASHO DATA PLATFORM

Looking at the above add-on service instances, Apache Spark is the one that is relevant to operational analytics on the Basho Data Platform. The Apache Spark connector enables Spark-based Java, Python or Scala in-memory operational analytic applications to be developed to analyse very low latency non-transactional data stored in Riak KV. Riak KV data can be loaded into memory as a Spark Resilient Distributed Dataset (RDD) or DataFrame and analysed in parallel in a Spark cluster using Spark MLlib and GraphX analytics. Also, as long as data in Riak KV value objects can be mapped to a structured schema in the Spark operational analytic application, Spark SQL can be used to query and report on the data directly.

In addition, insights produced by analysing Riak KV data in Spark operational analytic applications can be written back into Riak KV for further processing, either by Spark or by other big data application components. With careful application and aggregated data model design, this 'write back' capability makes 'closed loop' analytics processing possible. For example, a Spark application could analyse Riak KV operational data and write insights back into Riak KV, from where e-commerce applications could use the insights produced to provide personalised recommendations to website visitors. A sketch of this read-analyse-write-back loop follows below.
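The sketch is based on the Basho Spark-Riak connector API as documented around this time (riakBucket to read, saveToRiak to write back); the connection host, bucket names and the toy aggregation are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.basho.riak.spark._ // Spark-Riak connector implicits

// A sketch of closed loop analytics: read operational data from Riak KV,
// analyse it in parallel in Spark, write the insights back into Riak KV.
object RiakSparkSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("riak-operational-analytics")
      .set("spark.riak.connection.host", "127.0.0.1:8087") // assumed Riak node

    val sc = new SparkContext(conf)

    // Read a Riak KV bucket into an RDD via the connector.
    val events = sc.riakBucket[String]("clickstream").queryAll()

    // Analyse in parallel: here, a toy count of events per key prefix.
    val insights = events
      .map(e => (e.take(8), 1))
      .reduceByKey(_ + _)
      .map { case (segment, n) => s"""{"segment":"$segment","events":$n}""" }

    // Write the insights back into Riak KV for operational applications.
    insights.saveToRiak("insights")

    sc.stop()
  }
}
```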

Redis caches data in-memory to accelerate high performance operational applications

Apache Spark allows analytic applications to connect to and analyse data in Riak KV data stores

Apache Solr enables search query processing on all Riak KV data across the cluster

Apache Spark can be used to develop operational analytic applications on low latency data stored in Basho Riak KV

Insights produced from analysing Riak KV data in Spark can be written back to Riak KV for use by other applications

‘Write back’ is a form of closed loop processing


Alternatively, Spark-based analytical web services could be invoked on-demand from a Riak KV application to analyse and score Riak KV data and pass back the results of the analysis. The challenge in both cases is to get the data model right so that aggregated data is stored together. In addition, integration with Spark Streaming is in development, which would allow event driven Spark Streaming analytics applications to analyse data in real-time, filter out data of interest and store both the filtered data and specific scores directly in Riak KV. All three of these are shown in Figure 5.

Figure 5: Operational analytics using the Basho Data Platform. A scalable operational application works against hash-partitioned Riak data; Spark (Core, Streaming, SQL + DataFrames, GraphX and MLlib, with R, SQL, Python, Scala and Java interfaces) analyses that data and writes insights back, invoked via an operational analytics web service, an operational analytic application or a BI tool. (The SQL interface to Riak was still in development at the time of writing.)

Given these capabilities, the types of operational analytics (as discussed earlier) that the Basho Data Platform can potentially support in a scalable operational NoSQL environment are as follows:

• Simple operational reporting of current position/state (e.g. session state) using application logic in a Riak KV application, Apache Solr query or potentially even Spark SQL (assuming structured data in Riak).

• Situational awareness via visualisation of live operational data using Spark analytic applications or BI tools and Spark SQL on Riak KV data

• On-demand analysis of live operational and/or historical data via invocation of a Spark analytic web service from a Riak or Redis application. In this case the Spark analytical service would either analyse Riak KV data passed to it in the web service request or access Riak KV data via a key passed to it in the web service request. The same approach could also be done for on-demand recommendations via invocation of a Spark decision web service.

• Analysis of IoT streaming sensor data using Spark Streaming to calculate rollups and detect abnormalities, using on-demand Spark jobs for historical analysis and predictions, while keeping recent data in Redis for real-time dashboard visualisation (a sketch of the streaming rollup piece follows below).
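Since Spark Streaming integration with Riak was still in development at the time of writing, the sketch below stops at computing the rollups and flags where the write to Riak would go. The socket source, the (sensorId,value) record format and the abnormality threshold are all assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch rollups over streaming sensor readings with Spark Streaming.
object SensorRollupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iot-rollups")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches

    // Assumed gateway emitting "sensorId,value" lines over a socket.
    val readings = ssc.socketTextStream("sensor-gw", 9999)
      .map(_.split(","))
      .filter(_.length == 2)
      .map(f => (f(0), f(1).toDouble)) // (sensorId, reading)

    // Roll up each batch to the maximum reading per sensor, then flag
    // abnormal values against an assumed threshold.
    val rollups = readings.reduceByKey((a, b) => math.max(a, b))
    rollups.filter { case (_, v) => v > 100.0 }
      .print() // in the integrated platform this would be a write to Riak

    ssc.start()
    ssc.awaitTermination()
  }
}
```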


Spark-based analytical web services could be invoked on-demand to analyse data in Riak KV or to analyse data passed to the service directly

Care is needed with application and data model design to get the most out of operational analytics

Further support for operational analytics is planned

The Basho Data Platform can support different types of operational analytics

Supporting different kinds of operational analytics at scale helps improve operational effectiveness


CONCLUSIONS

There is no doubt that as demand grows to scale out operational applications, so too does the need to scale operational analytics to deal with increasing numbers of on-demand requests for analytics, recommendations, situational awareness and current position/state operational reporting. Operational analytics also needs to scale to monitor, automatically analyse and act on real-time streaming event data to prevent problems arising and to optimise business operations.

Architecturally speaking, Figures 4 and 5 show that Basho is using in-memory processing with the Riak KV and Riak TS storage instances in the Basho Data Platform to support high performance, scalable operational applications and also to enable operational analytics. The in-memory Redis service instance is aimed at accelerating high performance operational applications, whereas the in-memory Apache Spark service instance is aimed at facilitating scalable operational analytics. In addition, keeping the in-memory processing add-ons separate keeps potentially conflicting workloads apart. It is clear that more is yet to come in the form of Redis asynchronous write-through caching and Spark Streaming support. Apache Spark itself also continues to be strengthened, with better support for concurrent usage expected soon. Nevertheless, as long as organisations are careful with application and data model design, they can begin to introduce scalable operational analytics into high performance, scalable NoSQL applications.

As operational application processing scales, so too does the need to scale operational analytics

Basho is using in-memory processing to accelerate operational applications and to introduce scalable operational analytics into these applications

New scalable ‘smart’ operational applications are therefore now becoming possible with careful design


About Intelligent Business Strategies

Intelligent Business Strategies is a research and consulting company whose goal is to help companies understand and exploit new developments in business intelligence, analytical processing, data management and enterprise business integration. Together, these technologies help an organisation become an intelligent business.

Author

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an independent IT industry analyst and consultant he specialises in Big Data, BI/Analytics, Data Management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies on BI/Analytics strategy, big data, data governance, master data management and enterprise architecture. He has spoken at events all over the world and written numerous articles and blogs providing insights on the industry. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model – a Chief Architect at Teradata on the Teradata DBMS, and European Managing Director of Database Associates, an independent IT industry analyst organisation. He teaches popular master classes in Big Data Analytics, New Technologies for Business Intelligence and Data Warehousing, Data Virtualisation, Enterprise Data Governance, Master Data Management, and Enterprise Business Integration.

INTELLIGENT  BUSINESS  STRATEGIES  

Water Lane, Wilmslow Cheshire, SK9 5BG

England Telephone: (+44)1625 520700

Internet URL: www.intelligentbusiness.biz E-mail: [email protected]

Delivering Operational Analytics Using Spark and NoSQL Data Stores

Copyright © 2015 by Intelligent Business Strategies. All rights reserved.