
STREAMANALYTIX 2.1.6

PROCESSORS

Learn about Processors


Introduction

Welcome to StreamAnalytix! The StreamAnalytix platform enables enterprises to analyze and respond to events in real time at Big Data scale. With its unique multi-engine architecture, StreamAnalytix provides an abstraction that lets you execute data pipelines on the stream processing engine best suited to the application use case, weighing the advantages of Storm or Spark Streaming in terms of processing methodology (CEP, ESP) and latency.

About This Guide

This guide describes Processors and their configuration.

More Information

Please visit www.streamanalytix.com. To give us your feedback on your experience with the application, or to report bugs or problems, mail us at [email protected]. To receive updated documentation in the future, please register at www.streamanalytix.com. We welcome your feedback.


Terms & Conditions

This manual, the accompanying software, and other documentation are protected by U.S. and international copyright laws and may be used only in accordance with the accompanying license agreement. Features of the software, and of other products and services of Impetus Technologies, may be covered by one or more patents. All rights reserved.

All other company, brand, and product names are registered trademarks or trademarks of their respective holders. Impetus Technologies disclaims any responsibility for specifying which companies or organizations own which marks.

USA – Los Gatos
Impetus Technologies, Inc.
720 University Avenue, Suite 130
Los Gatos, CA 95032, USA
Ph.: 408.252.7111, 408.213.3310
Fax: 408.252.7114

© 2017 Impetus Technologies, Inc. All rights reserved.

If you have any comments or suggestions regarding this document, please send them via e-mail to [email protected]


Table of Contents

Introduction
  About This Guide
Terms & Conditions
PROCESSORS
  Aggregation Function
  Associative Aggregation
  Cumulative Aggregation
  Enricher
  Window
  Distinct
  Filter
  SQL
  Scala Processor
  Sort
  Join
  Group
  Union
  Intersection
  Alert
  Persist
  FlatMap
  MapToPair
  TransformByKey
  Repartition
  Take
  Timer
  Custom Processor
CEP
  Custom CEP
  Dynamic CEP


PROCESSORS

Processors are built-in operators that process streaming data by performing various transformations and analytical operations. For Spark pipelines, you can use the following processors:

Window: Collects input streams over a time window or range.
Sort: Sorts input stream values in ascending or descending order.
Alert: Generates alerts based on specified rules.
Enrich: Looks up internal and external stores to enrich incoming data.
Join: Joins two or more input streams.
Distinct: Removes duplicates from an input stream.
Group: Groups input streams by a key.
Union: Combines the values of two or more input streams.
Filter: Filters input stream values based on specified rules.
SQL: Runs SQL queries on streaming data.
Intersection: Detects values common to two or more input streams.
Aggregation: Aggregates input stream values to compute min, max, count, sum, or avg.
Associative Aggregation: Performs aggregation functions over batch data or over values grouped by a field.
Cumulative Aggregation: Performs aggregation functions on the input streams cumulatively.
FlatMap: Produces multiple outputs for one input record.
MapToPair: Returns a paired DStream containing a dataset of (key, value) pairs.
Scala Processor: Implements your custom logic written in Scala in a pipeline.
Persist: Stores Spark RDD data in memory.
Repartition: Reshuffles the data in the RDD to balance it across partitions.
TransformByKey: Performs a TransformByKey operation on the dataset.
Take: Performs a take(n) operation on the dataset.
Custom: Implements your custom logic in a pipeline.

For Storm pipelines, you can use the following processors:

Timer: Collects input streams over a time or range.
Alert: Generates alerts based on specified rules.
Enrich: Looks up internal and external stores to enrich incoming data.
Aggregation: Performs min, max, count, sum, or avg operations on incoming message fields.
Filter: Filters input stream values based on specified rules.
Custom: Implements your custom logic in a pipeline.

Aggregation Function

The Aggregation Function processor allows you to perform aggregation operations on the fields of the incoming data.

The following table lists the functions supported by the Aggregation processor.

Average: Calculates the average of all the defined fields.
Sum: Totals all the defined field values.
Count: Counts the number of fields.
Minimum: Displays the minimum value of the defined fields.
Maximum: Displays the maximum value of the defined fields.

Configure Aggregation Processor for Spark

To add an Aggregation processor into your pipeline, drag the processor to the canvas and right-click on it to configure.


Field Description

Message: Select the message on which you want to perform aggregation operations.

Fields: Select the fields on which you want to perform aggregation operations.

Time Window: Select a time window type for the aggregation operations.

Window Duration: The duration in milliseconds for which incoming data is collected; the aggregation operation is applied to the collected data. If you give a window duration of zero milliseconds, no windowing is performed and the operation is applied to the data that arrived within the batch duration of the application. The window duration should be a multiple of the batch duration.

Slide Duration: If you select Sliding Window as the time window, specify the slide duration (in milliseconds) along with the window duration. The slide duration acts as a trigger: at each slide interval, aggregation is performed over the data that arrived in the last X milliseconds, where X is the window duration. The slide duration should be a multiple of the batch duration.

Group By: If set to Yes, the data is grouped on the basis of the grouping field before aggregation.

Grouping Field: Select the field by which the data within the batch duration is grouped for aggregation.

Action: Select where the aggregation output should be sent. The available action publishes the output to Kafka.

Topic Name: If publish to Kafka is selected as the action, specify the Kafka topic name.

Configure Aggregation Function Processor for Storm

To add an Aggregation processor into your pipeline, drag the processor to the canvas and right-click on it to configure.


Field Description

Query Tab Attributes

Message Name: Message on which the statistical CEP query has to be applied.

Group By: Select Yes to group data, then select the fields on the basis of which grouping will be done.

Enable Context: If the context is enabled, the execution of all subsequent queries is bounded by the context. The system displays three input boxes: Context Name, Start Time, and End Time. The context name is the identifier of the Esper context. Specify the start and end time for executing the queries in standard cron format.

Fields: Fields on which aggregation has to be applied.


Apply Filter Criteria: Select the checkbox if you wish to apply filter criteria to the incoming messages.

Time Window: Fixed Length: if this option is selected, enter the window duration in seconds; the configured aggregation functions are applied over the fixed window data. Sliding Window: if this option is selected, enter the window duration in seconds; the configured aggregation functions are applied over the moving window.

Window Duration: Time window in seconds.

Apply Group By: Select the checkbox if you wish to group the query result by message fields. In this particular query, if you wish to apply the Group By condition specified on top, select Yes.

Output: Controls or stabilizes the rate at which events are output. The drop-down list has three options: All Records (all records are displayed in the output), First Records (first records are displayed in the output), and Last Records (last records are displayed in the output). Choose the desired option from the drop-down list.

Flush result when window ends: Records are also output when the window ends.

Add Action: Click on the Add Action link. You will see five additional fields: Action, URL, Method, Header Params, and Request Params.

Action: The action to be performed on the result of query execution. Invoke WebService Call: query output is pushed to a web service. Custom Delegate: query output is handled by custom delegate code. Choose the desired action from the list.

URL: End point where the data needs to be pushed.

Method: POST or PUT. Choose the desired HTTP method.

Header Params: Add header parameters. Click the + (plus) sign to add a header parameter, providing the following information. Name: header parameter name. Value: header parameter value.

Request Params: Add request parameters. Click the + (plus) sign to add a request parameter, providing the following information. Name: request parameter name. Value: request parameter value.

Add Query: Enables adding multiple queries using this link. The Add Query link is enabled only if the Enable Context checkbox is selected. If the Enable Context checkbox is not selected, you can add only a single query.

Field Description

Parallelism: Number of executors (threads) of the processor.

Task Count: Number of instances of the processor.

Emit CEP Query Output: If the CEP query output is to be passed to the next processor in the pipeline, select True; otherwise select False.

Note: Both the Parallelism and Task Count fields are disabled if the Group By option selected on the Query tab is No. If the option selected is Yes, both fields are enabled.

If EmitCEPQueryOutput is set to True:

1. You can select only Custom Delegate as the action. All other actions are disabled.
2. Processed data from this processor, along with the original input, will be available on the next processor.
3. The Add Action link is disabled and only one custom action is allowed.

If EmitCEPQueryOutput is set to False:

1. You can select all available delegates as actions.
2. Processed data is only available to the configured delegates. Only the original input will be available on the next processor.
3. Multiple actions can be added using the Add Action link.


Note: If you apply an aggregation function to a particular field, the output of that function is stored in a corresponding intermediate (output) field, as shown below. For example, if you apply the average aggregation to f2, the output is available as avgF2.

• average of f2 as avgF2
• min of f2 as minF2
• max of f2 as maxF2
• sum of f2 as sumF2
• count of f2 as countF2


Associative Aggregation

The Associative Aggregation processor in StreamAnalytix enables you to perform aggregation functions over real-time streaming data. Using this processor, you can apply the min, max, avg, sum, and count aggregation functions. An aggregation function can be performed over the whole batch of data, or you can configure a group-by field. Along with these configurations, you can also batch the data for aggregation by configuring window operations on the incoming stream. The output can be sent to any other processor or emitter.

Configure Aggregation Processor

To add an Associative Aggregation processor into your pipeline, drag the processor on the canvas and right-click on it to configure.

Field Description

Message Name: Message on which the aggregation query has to be applied.

Output Message: Output message containing all the aggregation results.

Field attributes:

Functions: The processor supports five aggregation functions: min, max, sum, count, and avg.

Input Fields: Input field over which the selected aggregation function is applied.


Output Fields: Output field in which the aggregation result is stored in the output message.

Time Window: None: does not apply a window operation on the data. Fixed Window: the window length and sliding interval remain the same; the operations are performed after the provided window interval. Sliding Window: enables you to configure the window length and sliding interval.

Group By: Enables you to perform the aggregation functions over the whole batch of data or over groups of data. If Yes is selected, select the grouping field from the drop-down list for grouping the streaming data.


Cumulative Aggregation

The Cumulative Aggregation processor is similar to Associative Aggregation, with the only difference that the aggregation functions are performed cumulatively: each new batch uses the previous batch's aggregation results when performing the current batch's aggregations. To use cumulative aggregation, it is mandatory to configure checkpointing in the pipeline definition.

Note: The checkpoint directory must be specific to a data pipeline. Even if you switch between deployment modes (i.e., Local/Client/Cluster), you need to change the checkpoint directory.


Enricher

The Enricher processor enables you to enrich or modify an incoming message on the fly with data that is not provided by the original data source. For example, if the cost price of a product is $150 and VAT is 2.5%, the new price of the product is cost price + VAT. The Enricher processor lets you compute the New Price field value on the fly so that you can enrich your message with this field. The Enricher processor supports MVEL functions, out-of-the-box lookup functions (lookupCassandra, lookupHbase, lookupsES), and Date, String, and user-defined functions.

Configure Enricher Processor

To add an Enricher processor into your pipeline, drag the Enricher processor to the canvas and right-click on it to configure as explained below.

Field Description

Config Fields: Config fields are used to create local variables.

Output Fields: Select a message and select the fields present in the message schema.

Message Name: Select the message on which the Enricher has to be applied.

Select Fields: Select the fields required for the output data. Enter the value to be enriched using an expression, function, variable, or constant value. To look up the available Enricher functions, type a dollar sign ($) in the input box.
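To make the VAT example above concrete, a minimal MVEL-style expression for a New Price output field might look like the following (the field name costPrice is illustrative, not from the product):

    costPrice + (costPrice * 0.025)

Here costPrice is assumed to be a field of the incoming message; the expression's result is written to the configured output field.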


Window

The Window processor allows you to collect the incoming data for a specified window duration. In Spark pipelines, this processor makes it easy to compute windowed streams using the window function.

Configure Window Processor

To add a Window processor into your pipeline, drag the processor to the canvas and right-click on it to configure.

Field Description

Time Window: Fixed Window: in a fixed window, you select a fixed window duration. For example, if the window duration is 60 seconds and the batch duration is 10 seconds, data is collected for 60 seconds and the next output is generated after another 60 seconds (the window duration should be a multiple of the batch duration). Sliding Window: in a sliding window, you configure both the window duration and the slide duration. For example, if the window duration is 60 seconds and the slide duration is 10 seconds, then every 10 seconds the data collected in the previous 60 seconds is sent ahead. The window duration and slide duration should be multiples of the batch duration.

Window Duration: The duration in milliseconds for which incoming data is collected; the data that arrived within that window duration is passed further.
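A minimal sketch of the underlying Spark Streaming call, assuming a DStream named stream and a 10-second batch duration:

    import org.apache.spark.streaming.Seconds

    // fixed window: the slide duration equals the window duration
    val fixed = stream.window(Seconds(60), Seconds(60))
    // sliding window: emits every 10 seconds over the last 60 seconds of data
    val sliding = stream.window(Seconds(60), Seconds(10))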


Distinct

Distinct is a core operation of Apache Spark over streaming data. When the distinct operation is performed on an input set of data, the outcome is the unique values of that input set as one output set. To add a Distinct processor into your pipeline, drag the processor on to the canvas. The Distinct processor is not configurable, so there are no configuration properties to set.

Example to demonstrate how distinct works.

Input Set:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:klm,rating:5}
{name:klm,rating:6}
{name:xyz,rating:7}
{name:abc,rating:9}

Output Set:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:klm,rating:5}
{name:klm,rating:6}
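The equivalent Spark operation, sketched (a DStream of records named stream is assumed):

    // distinct() is an RDD operation, so it is applied per batch via transform
    val distinctStream = stream.transform(rdd => rdd.distinct())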


Filter

The Filter processor filters the data based on defined condition(s). A filter condition can be updated dynamically, i.e., you need not restart the subsystem to update a defined filter condition; it can be done while the subsystem is running.

Configure Filter Processor

To add a Filter processor into your pipeline, drag the processor to the canvas and right-click on it to configure.

Fields Description

Config Fields (only for Spark): Config fields are used to create local variables or fields.

Message Name: Select a message.

Filter Rule: Applies a filter on the data or fields based on the criteria provided.

Negate: If Negate is true, the filter criteria are evaluated as NOT of the given criteria. If Negate is false, the filter criteria are evaluated as given.

Add New Criteria: To add another criterion.


SQL

The SQL processor provides a facility to run queries over streaming data and registered tables. To use batch data as tables, register data sources as tables under the Register Table section. You can then write queries over the registered tables and the streaming message, and store their results in temporary tables within the SQL processor.

Configure SQL Processor

To add a SQL processor into your pipeline, drag the processor to the canvas and right-click on it to configure as explained below.

Field Description

Table Names: By default, the message stream is registered as a table with the message name. All the registered tables are displayed here; select the tables that you want to use in the pipeline.

Query section: The query section is used to add multiple queries; these are executed over different tables.

Table Name: Table name under which the result of the query defined below will be registered.

Query: Query to execute on the data. In this query, you can use any table selected in the Table Names multi-selection box above, or any table registered by the queries defined above this query.

Add New Query: Add a new query.
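A sketch of what such a registered-table query chain looks like in Spark 1.x SQL (the table and column names are hypothetical):

    // "orders" is assumed to be the registered table for the incoming message stream
    val totals = sqlContext.sql(
      "SELECT customerId, SUM(amount) AS total FROM orders GROUP BY customerId")
    // register the result so that later queries in the list can reference it
    totals.registerTempTable("orderTotals")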

Go to the Emitters tab to define the sink where the query output will be stored, as explained below.


Field Description

Query: You can select multiple queries whose output should be saved.

Data Sink: Select S3 or HDFS as the data sink to save the query output.

Connection: Choose an available connection for the data sink.

Data Format: File format in which to save the output.

Bucket Name: Specify the bucket name if the data sink is S3.

Path: Enter the file path where the data is stored.

You can also view the schema of a registered table under the Schema tab by selecting the table name.


Scala Processor

The Scala processor allows you to write your own code in Scala for processing in the pipeline at runtime.

Configure Scala Processor

To add a Scala processor into your pipeline, drag the processor on the canvas and right-click on it to configure.

Field Description

Package Name: Package for the Scala implementation class.

Class Name: Class name for the Scala implementation.

Imports: Import statements for the Scala implementation.

Input Source: There are four types of input sources:
• JSON Object: the input is provided as JSONObject to the Scala code; you can perform operations over it.
• RDD[JSONObject]: the input is provided as JSONRDD to the Scala code; use this input source to perform transformation/action functions over an RDD of JSONObject.
• JavaDStream[JSONObject]: the input is provided as JSONStream to the Scala code; use this source to perform transformation/action functions over a JavaDStream of JSONObject.
• JavaPairDStream[Object, JSONObject]: the input is provided as JSONStream to the Scala code; use this input source to perform transformation/action functions over a JavaPairDStream of JSONObject.

Scala Code: Scala code implementation that performs the operation on the JSON RDD.
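A minimal sketch of code that could go in the Scala Code field, assuming the RDD[JSONObject] input source (exposed as JSONRDD, per the description above); the field name rating is illustrative:

    // keep only the records that carry a rating field
    val rated = JSONRDD.filter(json => json.get("rating") != null)
    rated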

Next, click on the Jar Upload tab and upload the jar file that is generated when you build the Scala code. You can also upload third-party jars here so that their APIs can be used in the Scala code by adding the corresponding import statements.


Sort

The Sort processor allows you to sort the data of a batch in ascending or descending order.

Configure Sort Processor

To add a Sort processor into your pipeline, drag the processor to the canvas and right-click on it to configure.

Field Description

Message Name: Select the message on which sorting has to be applied.

Sort Key: Select the field on which sorting is to be applied.

Order: Select the sort order – ASCENDING or DESCENDING.

Add Configuration: To add additional custom properties.


Join

The Join processor in StreamAnalytix can perform various types of join over input data: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, and Cross Join (Cartesian product).

Configure Join Processor

To add a Join processor into your pipeline, drag the processor to the canvas and right-click on it to configure.

Field Description

Output Message: The output message for the join operation is selected on the Join processor; it should contain all the fields that are present in the original input messages.

Join Type: Select the type of join you want to perform on the input sets: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, or Cross Join (Cartesian product).

Join Field: In front of every input message, a drop-down list shows the message fields available for selection. For example, suppose two messages, AlertMessage and OfficeMessage, share a common field customerName, and you apply a cross join with AlertMessage as the left message and OfficeMessage as the right message. In the cross-join output, the value of customerName from AlertMessage appears in a field named customerName, and the value of customerName from OfficeMessage appears in a field named OfficeMessage_customerName, so you should define both of these fields in the output message.


Group

The Group processor groups the data by key. For example, consider customer order records with customerId, orderId, and orderDate. A customer can have multiple orders, so to list or group all the orders by customer, the Group processor is useful: the group-by key is customerId and the grouped fields can be orderId and orderDate.

Configure Group Processor

To add a Group processor into your pipeline, drag the processor to the canvas and right-click on it to configure.

Note: The Kafka topic that you are going to use in the Group processor must already exist.

Field Description

Message: Select a message.

Fields: Message fields to be grouped.

Grouping Field: Select the grouping field.

Action: Publishes the grouped results to Kafka.

Topic Name: Name of the Kafka topic on which the results are published.


Union

When a union operation is performed over two or more input sets, the outcome is all of the values of those input sets combined as one output set. To add a Union processor into your pipeline, drag the processor on to the canvas. The Union processor is not configurable, so there are no configuration properties to set.

Example to demonstrate how union works.

Input Set 1:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:klm,rating:5}

Input Set 2:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:abc,rating:9}
{name:klm,rating:6}

Output Set after applying the union operation:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:klm,rating:5}
{name:xyz,rating:7}
{name:abc,rating:9}
{name:abc,rating:9}
{name:klm,rating:6}
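The equivalent Spark operation, sketched (two DStreams named stream1 and stream2 are assumed):

    // union keeps duplicates, as in the example above
    val combined = stream1.union(stream2)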


Intersection

When two or more input streams are intersected, they generate a single output stream containing the values common to all the input streams, with duplicates removed. To add an Intersection processor into your pipeline, drag the processor on to the canvas. The Intersection processor is not configurable, so there are no configuration properties to set. Example to demonstrate how intersection works:

Input Set 1:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:klm,rating:5}

Input Set 2:
{name:xyz,rating:7}
{name:abc,rating:9}
{name:abc,rating:9}

Output Set after applying the intersection operation:
{name:xyz,rating:7}
{name:abc,rating:9}
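The equivalent Spark operation, sketched (two DStreams named stream1 and stream2 are assumed; Record stands in for their element type):

    import org.apache.spark.rdd.RDD

    // intersection() is an RDD operation that also deduplicates, applied per batch
    val common = stream1.transformWith(stream2,
      (left: RDD[Record], right: RDD[Record]) => left.intersection(right))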


Alert

Alerts are used to signal the occurrence of an event during pipeline execution. To enable application alerts, add an Alert processor to your pipeline.

Configure Alert Processor

To add an Alert processor into your pipeline, drag the processor to the canvas and right-click on it to configure as explained below.

Field Description

Parallelism (for Storm): Number of executors (threads) of the processor.

Task Count (for Storm): Number of instances of the processor.

Select Alerts: All the configured alerts are listed in this drop-down list. Choose the alert that you want to be notified for.

Enable Alerts Generation

To generate alerts in Storm, verify that the alert pipeline (go to SuperUser UI > Datapipeline > AlertPipeline) is up and running. To generate alerts in Spark, there is no need to start any superuser AlertPipeline.

View Generated Alerts

To view generated alerts, go to your Workspace > Alerts > Information.


Persist

The Persist processor can be used to store Spark RDD data in memory. When you persist an RDD, each node stores its partitions of the RDD in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Persisting is a key tool for iterative algorithms and fast interactive use.

Configure Persist Processor

To add a Persist processor into your pipeline, drag the processor to the canvas and right-click on it to configure.

Storage levels provide different trade-offs between memory usage and CPU efficiency.

• MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.

• MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk, and read them from there when they are needed.

• MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

• MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that do not fit in memory to disk instead of recomputing them on the fly each time they are needed.

• MEMORY_ONLY_2: Same as MEMORY_ONLY, but replicate each partition on two cluster nodes.

• MEMORY_AND_DISK_2: Same as MEMORY_AND_DISK, but replicate each partition on two cluster nodes.

• DISK_ONLY: Store the RDD partitions only on disk.

• MEMORY_ONLY_SER_2: Same as MEMORY_ONLY_2, but stores the RDD as serialized Java objects.

• DISK_ONLY_2: Same as DISK_ONLY, but the RDD is replicated on two nodes.

• MEMORY_AND_DISK_SER_2: Same as MEMORY_AND_DISK_2, but stores the RDD as serialized Java objects.
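At the RDD level, choosing a storage level looks like this (sketch):

    import org.apache.spark.storage.StorageLevel

    // cache in memory, spilling partitions that do not fit to disk
    rdd.persist(StorageLevel.MEMORY_AND_DISK)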


FlatMap

The FlatMap processor produces multiple outputs for one input record: each input item can be mapped to zero or more output items. The FlatMap function returns a sequence rather than a single item.

Configure FlatMap Processor

To add a FlatMap processor into your pipeline, drag the processor on the canvas and right-click on it to configure.

Field Description

Operator:
• FlatMap: operates on a single DStream and returns a plain DStream output list.
• FlatMapToPair: operates on a paired DStream and returns a paired DStream output list.

Processing:
• Identity Mapper: select Identity Mapper if one-to-one output is to be produced.
• Custom: select Custom if a custom implementation is to be provided. The system displays an additional field, Executor Plugin, when Custom is selected.

Executor Plugin: Class to which the control will be passed in order to process the incoming data.
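A minimal flatMap sketch (a DStream of text lines named lines is assumed):

    // one input line produces many output words
    val words = lines.flatMap(line => line.split(" "))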


MapToPair

The MapToPair processor turns an input DStream into a paired DStream holding a dataset of (key, value) pairs. A DStream represents a continuous stream of data: either the input data stream received from the source, or the processed data stream generated by transforming the input stream.

Configure MapToPair Processor

To add a MapToPair processor into your pipeline, drag the processor on the canvas and right-click on it to configure.

Field Description

Key Type:
• Fields: message fields can be selected. If multiple fields are selected, their values (separated by #) are used as the key, with the trace message as the value in the output.
• Custom: enables you to provide a value for the key; the value provided is used as the key, with the trace message as the value in the output.
If the key type selected is Fields, the system displays a Key Fields input box. If the key type selected is Custom, the system displays an input box named Key.

Key Fields / Key:
• Key Fields: select the message fields that are to be used as the key.
• Key: user-defined key.
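The equivalent pairing step, sketched in Scala (in the Java API this is mapToPair, returning a JavaPairDStream; the key field customerId is illustrative):

    // pair each message with the value of one of its fields as the key
    val paired = stream.map(msg => (msg.get("customerId"), msg))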


TransformByKey

The TransformByKey processor allows you to perform operations on messages that are in the form of key-value pairs. The stream must first be processed through the MapToPair processor to convert it into a stream of key-value pairs.

Configure TransformByKey Processor

To add a TransformByKey processor into your pipeline, drag the processor on the canvas and right-click on it to configure.

Field Description

Operator: The drop-down list contains the following four operations:
• ReduceByKey: aggregates the incoming paired stream for each batch using the custom logic written by you for data reduction.
• AggregateByKey: aggregates the incoming paired stream for each batch using the custom logic written by you for data aggregation.
• SortByKey: sorts the incoming paired stream for each batch.
• UpdateStateByKey: returns a new state stream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. For example, if the function performed is SUM, the previous batch SUM was 300, and the current batch SUM is 700, then the current state of the key would be 300 + 700 = 1000. (See the sketch after this table.)

Is Window: It has two options:
• True: enables the time window configuration; the system displays the Time Window and Window Duration fields.
• False: disables the time window configuration; the system disables the Time Window and Window Duration fields.

Time Window: The drop-down list contains the following two options:
• Fixed Window: the window length and sliding interval remain the same; the operations are performed after the provided window interval.
• Sliding Window: enables you to configure the window length and sliding interval.
Choose the desired time window from the list.

Window Duration: The duration in milliseconds for which incoming data is collected; the data that arrived within that window duration is passed further. The window duration should be a multiple of the batch duration that is provided in the Batch Duration field when saving/updating the pipeline.

Slide Duration: In a sliding window, you configure both the window duration and the slide duration. For example, if the window duration is 60 seconds and the slide duration is 10 seconds, then every 10 seconds the data collected in the previous 60 seconds is sent ahead. The window duration and slide duration should be multiples of the batch duration. The slide duration is the time in milliseconds after which the reduction is performed over the data collected in the previous window.

Reduce Logic Plugin: Class to which the control will be passed in order to process the incoming data.
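A running-sum sketch of the underlying updateStateByKey operation, matching the SUM example above (a paired DStream of (String, Long) named paired is assumed; checkpointing must be enabled, as noted in the Cumulative Aggregation section):

    val runningSum = paired.updateStateByKey[Long] {
      (newValues: Seq[Long], state: Option[Long]) =>
        // previous state (e.g., 300) plus the sum of the current batch (e.g., 700)
        Some(state.getOrElse(0L) + newValues.sum)
    }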

Scenario II: Operator selected as AggregateByKey.

If you select the AggregateByKey operator, the system displays two additional fields, Combine Logic Plugin and Merge Logic Plugin. These are the classes to which the control will be passed in order to process the incoming data.

In Combine Logic Plugin, specify the qualified name of the custom combiner class in which you have written the code for combining data based on key.

In Merge Logic Plugin, specify the qualified name of the custom merger class in which you have written the code for merging the outputs produced by the combiners.

Scenario III: Operator selected as SortByKey.

If you select the SortByKey operator, the system displays two fields, Sort Key and Order. For Sort Key, you can select the field that acts as the key in the incoming paired stream. Two other options are available, Custom and Combined: select Custom if the key is custom, and Combined if the key is a combination of more than one field.

Scenario IV: Operator selected as UpdateStateByKey.

Select the UpdateStateByKey operator from the drop-down list and provide a value for the Executor Plugin field. This is the class to which the control will be passed in order to process the incoming data.


Repartition

The Repartition processor reshuffles the data in the RDD randomly to create either more or fewer partitions and balance the data across them. This always shuffles all data over the network.

Configure Repartition Processor

To add a Repartition processor into your pipeline, drag the processor on the pipeline canvas and right-click on it to configure.

Enter the number of executors of the processor in the Parallelism field.
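At the RDD level, the operation looks like this (sketch; the partition count is illustrative):

    // full shuffle of the data into 8 partitions
    val rebalanced = rdd.repartition(8)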


Take

This processor performs a take(n) operation on the dataset: it returns an array with the first n elements of the dataset. For example, if the number of elements entered is 10, the system returns the first 10 elements of the dataset for each batch.

Configure Take Processor

To add a Take processor into your pipeline, drag the processor on the pipeline canvas and right-click on it to configure.

Enter the number of elements in the No of Elements field.
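The RDD-level equivalent, sketched:

    // returns an Array with the first 10 elements of the batch
    val firstTen = rdd.take(10)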


Timer

The Timer processor allows you to write customized business logic that is executed at a fixed interval. A filter can be applied before the customized business logic is executed.

Configure Timer Processor

To add a Timer processor into your pipeline, drag the processor on the pipeline canvas and right-click on it to configure.

Field Description

Parallelism: Number of executors (threads) of the processor.

Task Count: Number of instances of the processor.

Timer Plugin: Class to which the control will be passed once results are accumulated for a specific time duration.

Tick Frequency in seconds: Time window for which data will be held in memory in order to be operated on as a group.

Add Configuration: To add additional custom parameters.


Custom Processor

The Custom processor allows you to implement your custom logic in a pipeline. You can write your own custom code, use it in your pipelines, and even share it with other workspace users. To provide your custom business logic, implement the com.streamanalytix.framework.api.processor.JSONProcessor interface and write the logic in the implemented methods.

Pre-requisites – Create Custom Code Jar

To use a Custom processor, you first need to create a jar file containing your custom code and then upload the jar file in a pipeline or as a registered component. To write the custom logic, download the Sample Project.

Import the downloaded sample project as a Maven project in Eclipse. Ensure that Apache Maven is installed on your machine and that its PATH is set. Implement your custom code, then build the project to create a jar file containing your code, using the following command: mvn clean install -DskipTests

If the Maven build is successful, upload the jar on the pipeline canvas as shown below.


Custom Processor for Spark

The Custom processor for Spark supports writing code in the Java and Scala programming languages. It provides two types of operations: Custom and Inline as Scala. The Custom operation allows you to write the code in Java, whereas the Inline as Scala operation allows you to write the code in Scala. Inline as Scala is an enhancement over the Custom operation, as it enables you to write the entire Scala code in the processor itself; if the build is successful, the code can be used in the pipeline.

Custom Code for your Sample Project

While using a Custom processor in a pipeline, you have to implement the com.streamanalytix.framework.api.processor.JSONProcessor interface and provide the custom business logic in the implemented methods. Shown below is a sample class structure:


There are three methods for you to implement:

1. Init: Allows you to enter any initialization calls.
2. Process: Contains the actual business logic; this method is called for each tuple.
3. Cleanup: All resource cleanup occurs in this method.
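Since the original class-structure figure is not reproduced here, the following is a hypothetical Scala sketch only; the exact method signatures come from the JSONProcessor interface in the downloaded sample project and may differ:

    import java.util.{Map => JMap}
    import com.streamanalytix.framework.api.processor.JSONProcessor
    import net.minidev.json.JSONObject // assumed JSON type; check the sample project

    class MyProcessor extends JSONProcessor {
      // Init: initialization calls (the config-map signature is assumed)
      def init(config: JMap[String, AnyRef]): Unit = {}

      // Process: actual business logic, called for each tuple
      def process(json: JSONObject): JSONObject = {
        json.put("processed", true)
        json
      }

      // Cleanup: release any resources acquired in init
      def cleanup(): Unit = {}
    }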

Configure Custom Processor

To add a Custom processor into your Spark pipeline, drag the custom processor added in the pipeline, or available as a registered component, on the canvas, and right-click on it to configure as explained below. The operation can be one of two types, Custom or Inline as Scala:

• Custom: select Custom if you wish to write custom business logic in Java.
• Inline as Scala: select Inline as Scala if you wish to write the code in the Scala language.

Option 1: Operation selected as Custom

Field Description

Interface: The drop-down list has the following four interfaces:
• JSONProcessor: provides the flexibility to implement the business logic over the streaming JSONObject.
• RDDProcessor: provides the flexibility to implement the business logic over JSON RDDs. You can write transformation and action functions over it.
• DStreamProcessor: provides the flexibility to implement the business logic over the JSON DStream. You can write transformation and action functions over it.
• PairDStreamProcessor: provides the flexibility to implement the business logic over the JSON PairDStream. You can write transformation and action functions over it.
Choose the desired interface from the list.


Executor Plugin: Class to which the control will be passed in order to process the incoming data.

Option 2: Operation selected as Inline as Scala

Field Description

Package Name: Package for the Scala implementation class.

Class Name: Class name for the Scala implementation.

Imports: Import statements for the Scala implementation.

Input Source: Input source for the Scala code. It can be JSONObject, RDD[JSONObject], JavaDStream[JSONObject], or JavaPairDStream[Object, JSONObject].

Scala Code: Code for performing operations over the JSONRDD object.

Build: Builds the Scala code.

Configure Custom Processor for Storm

While using a Custom processor in a Storm pipeline, you have to implement the com.streamanalytix.framework.api.processor.JSONProcessor interface and provide the custom business logic in the implemented methods.


Shown below is a sample class structure:

There are three methods for you to implement:

1. Init: Allows you to enter any initialization calls.
2. Process: Contains the actual business logic; this method is called for each tuple.
3. Cleanup: All resource cleanup occurs in this method.

To add a Custom processor into your Storm pipeline, drag the custom processor added in the pipeline, or available as a registered component, on the canvas, and right-click on it to configure as explained below.

Field Description

Parallelism: Number of executors (threads) of the processor.

Task Count: Number of instances of the processor.

Executor Plugin: Class to which the control will be passed in order to process the incoming data.

Add Config Fields: To add additional custom properties.


CEP

Complex Event Processing (CEP) analyzes the data and events that flow between information systems to extract valuable information in real time. A CEP engine lets applications register queries and run data through them; the engine responds in real time whenever conditions occur that match the queries. The execution model is thus continuous, rather than triggered only when a query is submitted. CEP is used in Operational Intelligence (OI) solutions to provide insight into business operations by running query analysis against live feeds and event data. It helps organizations analyze patterns in real time and communicate better with other service departments. It delivers high-speed processing of many events across all layers of an organization: classifying the most meaningful events, analyzing their impact, and taking subsequent action in real time. CEP relies on a number of techniques, including:

• Event-pattern detection
• Event abstraction
• Event filtering
• Event aggregation and transformation

CEP in StreamAnalytix is divided into three categories:

• Custom CEP
• Dynamic CEP
• Aggregation Function – please refer to the Aggregation Function section for detailed information.

NOTE: CEP is available only for Storm pipelines.


Custom CEP

Custom CEP allows registration of a user-defined CEP query. Only one query per Custom CEP component is allowed. NOTE: Custom CEP is only available for Storm pipelines.

Configure Custom CEP Processor

To add a Custom CEP processor into your pipeline, drag a Custom CEP processor to the canvas and connect it to a channel or processor. Right-click on the processor to configure it as explained below.

Field Description

Parallelism: Number of executors (threads) of the processor.

Task Count: Number of instances of the processor.

Message Name: Name of the message on which the query has to be fired.

Emit CEP Query Output: If EmitCEPQueryOutput is set to True, processed data from this processor, along with the original input, will be available on the next processor. If EmitCEPQueryOutput is set to False, processed data is only available to the configured custom delegate; only the original input will be available on the next processor.

Enable Context: If the context is enabled, the execution of all subsequent queries will be bounded by the context.

Context Query: Enter the context query to be fired in the input box.


CEP Query: Enter the CEP query in the text box.

Custom Delegate attributes:

Class Name: The class to which control is passed once the result of the CEP query execution is fetched.
Init Parameters: Initialization parameters to be passed as part of the custom delegate.

Note: If you are writing a SampleCEPDelegate implementation, you need to pass 'saxMessageType' in the JSON, containing the message name, so that the message emitted from CEP can be processed by the next processor.
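As a rough illustration of this note, a hypothetical delegate might attach the message name as shown below; the class and method signature are assumptions for illustration only, not the exact StreamAnalytix delegate interface.

import org.json.JSONObject;

// Hypothetical sketch of a CEP delegate; the real interface may differ.
public class SampleCEPDelegate {

    // Invoked with the result of the CEP query execution.
    public void execute(JSONObject cepResult) {
        // Pass 'saxMessageType' with the message name so that the event
        // emitted from CEP can be processed by the next processor.
        cepResult.put("saxMessageType", "GenericEvent"); // message name is illustrative
        // ... forward or further process cepResult ...
    }
}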

Example

The statement shown below selects all the events from the GenericEvent type:

select * from GenericEvent.win:time(2 sec)

GenericEvent is the name of an event type. The win:time syntax in the above query declares a sliding time window of 2 seconds. The query fetches all the records that arrive within that window.

You can also apply aggregation functions over the data in the window, as shown here:

select avg(price) as avgPrice from GenericEvent.win:time(40 sec)

Here, price is a field that is configured in the message. The above statement thus continuously outputs the average price over the events in the 40-second window.


Dynamic CEP

Dynamic CEP allows registration of CEP queries with pre-defined actions (INVOKE_WEBSERVICE_CALL and CUSTOM_ACTION) applied on the running pipeline. It also enables dynamic configuration of queries and actions, such as update and delete. Detailed information about the pre-defined actions is given below:

• INVOKE_WEBSERVICE_CALL: This action invokes a web service and publishes data to the defined URL. It supports two methods: POST and PUT.

• CUSTOM_ACTION: If you want to edit the data or publish it to a custom location, apply this pre-defined action.

The procedure to register a query involves working on the StreamAnalytix UI and a REST client simultaneously. Query registration is only possible through the REST client, while query update and deletion are possible through the StreamAnalytix UI as well.

Configure Dynamic CEP Processor

To add a Dynamic CEP Processor to your pipeline, drag the Dynamic CEP Processor to the canvas and connect it to a channel or processor. Right-click on it to configure.

The Query Config tab displays the query that is registered through the REST client URL. The Component ID, displayed on the top right of the screen, determines which query is used for a particular CEP component if more than one Dynamic CEP is used in the same pipeline. The configuration settings of the Dynamic CEP Processor are as follows:

Parallelism: Number of executors (threads) of the processor.
Task Count: Number of instances of the processor.


Enable CEP Query Output:

If EmitCEPQueryOutput is set to True:
• Only the Custom Delegate can be selected as the action; all other actions are disabled.
• The processed data from this processor, along with the original input, is available to the next processor.
• The Add Action functionality is no longer available; only one custom action per query is allowed.

If EmitCEPQueryOutput is set to False:
• All available delegates can be selected as actions.
• The processed data is available only to the configured delegates; only the original input is available to the next processor.
• Multiple actions per query can be added, using the Add Action link or directly from the REST client.

Add Configuration: To add additional custom parameters.

Configure Dynamic CEP Query

To configure the Dynamic CEP query, provide the Component ID in the REST client. To obtain the Component ID, edit the pipeline. The following section describes how to specify, through the REST interface, the metadata of the data used by the StreamAnalytix platform. For example, a sample JSON representing employee data can be defined as follows:

{
  "name": "employee",
  "age": 24,
  "gender": "Female",
  "email": "[email protected]",
  "address": "A-12"
}

StreamAnalytix requires configuration of an entity by defining the schema of the entity.

Prerequisites

For making any REST client call, the request header must be set to TokenName: TokenValue; otherwise, an "Unauthorized authentication token" error is returned. To get the token value, go to the Manage Users tab and edit a user.


Once the token value is copied, open the REST client UI, create a header with the name token, and paste the token value.

Next, go to the Query Config tab of the Dynamic CEP processor and copy the sample REST client URL:

http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/register/<componentId>

Then, in the REST client UI, paste the URL and provide the IP and port of the machine where StreamAnalytix is deployed.

Register a CEPConfiguration


Method: POST

URL: http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/register/<componentId>

BODY: A sample entity schema JSON. Make sure that the syntax and the keys remain the same; you can change the values as per your requirements. Use the keys below for creating the entity schema JSON:

{
  "cepQuery": "select a, b, max(a) as maxA, min(b) as minB, sum(a) as sumA, avg(a) as avgA from GenericEvent.win:time(10 sec) where a > 30",
  "cepAction": [
    {
      "actionName": "INVOKE_WEBSERVICE_CALL",
      "params": {
        "url": "http://<HOST>:<PORT>/StreamAnalytix",
        "method": "POST",
        "headerParams": { "headerKey": "headerValue" },
        "requestParams": { "reqParamKey": "reqParamValue" }
      }
    },
    {
      "actionName": "CUSTOM_ACTION",
      "params": {
        "className": "com.streamanalytix.sample.delegate.SampleDynamicCEPDelegate",
        "initParams": { "paramKey": "paramValue" }
      }
    }
  ]
}

As shown above, the cepAction key can contain multiple actions; at least one action is required. The headerParams, requestParams, and initParams keys can each contain multiple values, or none. Before clicking Send, make sure that you are logged in to the StreamAnalytix platform; otherwise, a "context is not initialized" error message is shown.

A SUCCESS status indicates that the CEPConfiguration was registered successfully; keep the returned cepQueryId handy for further operations. This completes the CEPConfiguration registration. Note that if Emit CEP Query Output is set to True, the above query will not register successfully, because multiple actions are not allowed in that mode.
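If you prefer to script the registration call instead of using a REST client UI, a minimal sketch using Java 11's java.net.http is shown below; the host, port, component ID, token header name, and token value are placeholders you must substitute with your own values (the token header setup is described in the Prerequisites above). The update, delete, and get endpoints can be called the same way by changing the URL, method, and body.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterCepQuery {
    public static void main(String[] args) throws Exception {
        // Placeholders: substitute your own host/port and the Component ID
        // copied from the Query Config tab.
        String componentId = "myComponentId"; // hypothetical value
        String url = "http://localhost:8080/StreamAnalytix/datafabric/dynacep/query/register/" + componentId;

        // A minimal body with a single CUSTOM_ACTION, mirroring the sample schema above.
        String body = "{ \"cepQuery\": \"select a from GenericEvent.win:time(10 sec)\", "
                    + "\"cepAction\": [ { \"actionName\": \"CUSTOM_ACTION\", \"params\": { "
                    + "\"className\": \"com.streamanalytix.sample.delegate.SampleDynamicCEPDelegate\", "
                    + "\"initParams\": { \"paramKey\": \"paramValue\" } } } ] }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("token", "myTokenValue") // header name/value from Manage Users, as described above
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // On success, the response contains the status and the cepQueryId.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}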

Update the CEPConfiguration

Method: POST


URL: http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/update/<componentId>/<cepQueryId>

BODY: Use the keys below for updating the entity schema JSON:

{
  "cepQuery": "select a, max(a) as maxA, min(b) as minB, sum(a) as sumA, avg(a) as avgA from GenericEvent.win:time_batch(10 sec) where a > 30",
  "cepAction": [
    {
      "actionName": "CUSTOM_ACTION",
      "params": {
        "className": "com.streamanalytix.sample.delegate.SampleDynamicCEPDelegate",
        "initParams": { "paramKey2": "paramValue2" }
      }
    }
  ]
}

A SUCCESS status indicates that the query was updated successfully.

Delete CEP configuration based on cepQueryId

Method: POST

URL:

http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/delete/<componentId>/<cepQueryId>

BODY: Leave this section blank.

Delete CEP configuration based on componentId

Method: POST

URL: http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/delete/<componentId>

BODY: Leave this section blank.

Get CEP configuration based on componentId and cepQueryId

Method: GET

URL: http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/get/<componentId>/<cepQueryId>


BODY: Leave this section blank.

Once you click Send, the response contains the entire CEP configuration for the given cepQueryId.

Get CEP configuration based on componentId

Method: GET

URL: http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/get/<componentId>

BODY: Leave this section blank.

Once you click Send, the response contains all the CEP configurations for the given componentId.

Get all CEP configurations

Method: GET

URL: http://<<IP:PORT>>/StreamAnalytix/datafabric/dynacep/query/get

BODY: Leave this section blank.

Once you click Send, the response contains all the registered CEP configurations.

Note: There can be multiple actions in a query; if you remove the last action, the query is also deleted. You can remove an action from the UI.