Page 1: DataLines: a framework for building streaming data applications

DataLines: a framework for building streaming data applications

Mike Haberman
Senior Software/Network Engineer

[email protected]

Page 2

The Problem

• Data deluge: routers, switches, IDS, servers (web, mail, logs, etc), software (tcpdump, web100, SNMP, tarpit, etc), sensors, taps, … (help me)

Page 3

The Problem (continued)

• Disparate data formats

• Software (sometimes) to manage each

• Tweaking to get what you want (custom software)

• Correlating data (more custom software)

Page 4

DataLines

• Can we build a framework that can remove all (most) of the tedium of working with all these disparate data formats?

Page 5

DataLines Framework

• designed to manage and build streaming data processing applications

Page 7

DataLines Framework

• Manage: would like one tool to handle all these different data sources.

designed to manage and build streaming data processing applications

Page 8

DataLines Framework

• Build: uniform way of creating a data processing application.

designed to manage and build streaming data processing applications

Page 9

DataLines Framework

• Streaming data:
– Never-ending stream of ‘manageable’ chunks of data
– No random access, no blocking operators
– One look, linear or sub-linear algorithms/data ops
– Each data item (a tuple in DataLines) is an independent entity
– Many tools were not designed for streaming data

designed to manage and build streaming data processing applications
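The “one look” constraint can be illustrated with a summary that sees each item once and keeps constant state. This is a minimal Python sketch, not DataLines code: `RunningStats`, `observe`, and the `bytes` field are hypothetical names, tuples are modelled as plain dictionaries, and Welford's one-pass update is swapped in as an example of a streaming-friendly statistic.

```python
import math
from typing import Dict, Iterable, Iterator


class RunningStats:
    """Constant-memory mean/variance using Welford's one-pass update."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self) -> float:
        # Population standard deviation; 0.0 until at least one item is seen.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def observe(stream: Iterable[Dict[str, float]], field: str,
            stats: RunningStats) -> Iterator[Dict[str, float]]:
    """Pass every tuple through unchanged while updating the summary."""
    for item in stream:           # each item is seen exactly once
        stats.update(item[field])
        yield item                # nothing is buffered, so nothing blocks


stats = RunningStats()
packets = ({"bytes": n} for n in (40, 60, 1500, 40))  # stand-in for a live feed
seen = sum(1 for _ in observe(packets, "bytes", stats))
```

Because `observe` is a generator, it never needs random access to earlier items and can sit in front of any other consumer of the stream.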

Page 10

DataLines Framework

• Processing:
– Something you want to do to the data (e.g. reading, writing, parsing, event generation, filtering, statistics, reports, data synopsis, …)

designed to manage and build streaming data processing applications

Page 11

DataLines

• Creating a DataLines application:

XML → “compile” → DataLines Application

Page 12

DataLines

• XML file defines 3 major components:
– Data Processors
  • What one does with the data
– Processing Order
  • The order in which the processors will operate on the data
– Event Management
  • What to do when a processor generates an event

Page 13

DataLines Processors

• Data Processors are the heart of DataLines
– I/O: socket, file
– Filters: inline, dispatch
– Collectors: binning, windowing (w/operators)
– GUI: charts, picture taking
– Converters: binary to tuple
– Misc: printers, counters, iterators, timers, data generators, gates, delays

• Processors can generate events
• Processors can drop, mutate, or mutilate the tuple being processed, or generate new tuples
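The three things a processor can do to a tuple (drop it, change it, or emit new ones) fit a single contract: one tuple in, zero or more tuples out. The sketch below is one hypothetical way to model that in Python; `Processor`, `MinLengthFilter`, `Tagger`, and `Splitter` are invented names, not the DataLines library, and tuples are modelled as dictionaries.

```python
from typing import Dict, List

Record = Dict[str, object]   # hypothetical stand-in for a DataLines tuple


class Processor:
    """Hypothetical base class: one tuple in, zero or more tuples out."""

    def process(self, item: Record) -> List[Record]:
        return [item]                        # default: pass through unchanged


class MinLengthFilter(Processor):
    """Drops tuples whose 'length' field is below a threshold."""

    def __init__(self, minimum: int) -> None:
        self.minimum = minimum

    def process(self, item: Record) -> List[Record]:
        return [item] if item["length"] >= self.minimum else []


class Tagger(Processor):
    """Mutates the tuple in place by adding a field."""

    def __init__(self, name: str, value: object) -> None:
        self.name, self.value = name, value

    def process(self, item: Record) -> List[Record]:
        item[self.name] = self.value
        return [item]


class Splitter(Processor):
    """Generates new tuples: one per entry in the 'ports' field."""

    def process(self, item: Record) -> List[Record]:
        return [{"src": item["src"], "port": p} for p in item["ports"]]


dropped = MinLengthFilter(64).process({"length": 40})
tagged = Tagger("sensor", "tap-1").process({"length": 1500})
split = Splitter().process({"src": "10.0.0.1", "ports": [22, 80]})
```

Returning a list keeps all three behaviours behind the same method, so a caller does not need to know which kind of processor it is driving.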

Page 14

DataLines Pipelines

• Control tuple movement among processors

• Can connect either processors or other pipelines

• Two paths within a pipeline: binary and tuple
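A pipeline that can connect “either processors or other pipelines” suggests that a pipeline exposes the same calling shape as a processor. The following Python sketch assumes exactly that; `Pipeline`, `parse`, `errors_only`, and `label` are hypothetical names, and only the tuple path is modelled (the binary path is left out).

```python
from typing import Callable, Dict, Iterable, Iterator, List

Item = Dict[str, object]
Stage = Callable[[Item], List[Item]]   # a processor or a nested pipeline


class Pipeline:
    """Moves each tuple through its stages in order."""

    def __init__(self, *stages: Stage) -> None:
        self.stages = stages

    def __call__(self, item: Item) -> List[Item]:
        batch = [item]
        for stage in self.stages:
            # Feed every surviving tuple to the next stage.
            batch = [out for current in batch for out in stage(current)]
            if not batch:              # everything was dropped; stop early
                break
        return batch

    def run(self, source: Iterable[Item]) -> Iterator[Item]:
        for item in source:
            yield from self(item)


def parse(item: Item) -> List[Item]:
    host, status = str(item["raw"]).split()
    return [{"host": host, "status": int(status)}]


def errors_only(item: Item) -> List[Item]:
    return [item] if item["status"] >= 500 else []


def label(item: Item) -> List[Item]:
    return [dict(item, severity="high")]


inner = Pipeline(errors_only, label)   # a reusable sub-pipeline
outer = Pipeline(parse, inner)         # pipelines nest like processors
result = list(outer.run([{"raw": "web1 200"}, {"raw": "web2 503"}]))
```

Making the pipeline callable is the design choice that lets `inner` be dropped into `outer` without any adapter.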

Page 15

Event Management

• Allow processors to signal an event
– timers, open/close, client connects, etc.

• Allow the user to tie in domain logic

• Allow the user to call a processor specific API
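These three roles (signalling, user logic, and calling a processor's API) can be sketched as a small dispatcher. This is an illustrative Python model under stated assumptions, not the DataLines event system: `EventManager`, `bind`, `hook`, `fire`, and the toy `Reader` and `Parser` classes are all hypothetical.

```python
from collections import defaultdict


class EventManager:
    """Hypothetical dispatcher: maps (event, source) to processor method calls."""

    def __init__(self):
        self._processors = {}
        self._calls = defaultdict(list)   # (event, source) -> [(target, method)]
        self._hooks = defaultdict(list)   # event -> [user callbacks]

    def register(self, name, processor):
        self._processors[name] = processor

    def bind(self, event, target, method, source=None):
        """When `event` fires (optionally only from `source`), call target.method()."""
        self._calls[(event, source)].append((target, method))

    def hook(self, event, callback):
        """Tie user domain logic to an event."""
        self._hooks[event].append(callback)

    def fire(self, event, source=None):
        # Bindings made without a source apply no matter who raised the event.
        for key in {(event, source), (event, None)}:
            for target, method in self._calls.get(key, []):
                getattr(self._processors[target], method)()
        for callback in self._hooks.get(event, []):
            callback(event, source)


class Reader:
    running = False

    def start(self):
        self.running = True


class Parser:
    running = True

    def stop(self):
        self.running = False


reader, parser, log = Reader(), Parser(), []
events = EventManager()
events.register("reader", reader)
events.register("parser", parser)
events.bind("start", target="reader", method="start")
events.bind("alert", target="parser", method="stop", source="reader")
events.hook("alert", lambda event, source: log.append((event, source)))

events.fire("start")
events.fire("alert", source="reader")
```

Looking the method up by name with `getattr` is what allows a binding to reach a processor-specific API without the dispatcher knowing about it in advance.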

Page 16

DataLines Data

• The generalization of data is a DlTuple

• Tuple is just a set of values

• DlTuple is the interface processors use
– String[] <-- getFieldNames()
– DlValue <-- getValue(fieldname)
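To show how a processor can depend on just these two calls, here is a small Python rendering of the interface. The method names `getFieldNames` and `getValue` come from the slide; everything else is an assumption: `DictTuple` and `describe` are hypothetical, values are plain Python objects rather than `DlValue` instances, and the slides do not say what language DataLines itself is written in.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Mapping


class DlTuple(ABC):
    """Python rendering of the two-method interface named on the slide."""

    @abstractmethod
    def getFieldNames(self) -> List[str]:
        ...

    @abstractmethod
    def getValue(self, fieldname: str) -> Any:
        ...


class DictTuple(DlTuple):
    """Hypothetical implementation backed by a dictionary."""

    def __init__(self, fields: Mapping[str, Any]) -> None:
        self._fields = dict(fields)

    def getFieldNames(self) -> List[str]:
        return list(self._fields)

    def getValue(self, fieldname: str) -> Any:
        return self._fields[fieldname]


def describe(t: DlTuple) -> str:
    """A processor-style consumer that relies only on the interface."""
    return ", ".join(f"{name}={t.getValue(name)}" for name in t.getFieldNames())


record = DictTuple({"Src Ip": "10.0.0.1", "length": 4})
summary = describe(record)
```

`describe` never touches the dictionary directly, so it would work unchanged with any other implementation of the interface.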

Page 17

DataLines Data

• Tuples can have virtual fields
– calculated values, static values

• Tuples can have composite fields

• The creation of the tuple is left to the processor in charge of conversion
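A virtual field can be modelled as a value produced on request rather than stored. The Python sketch below is illustrative only: `VirtualTuple` and its methods are hypothetical, and treating a composite field as “a field whose value is itself a tuple” is an assumption, since the slides do not define the term.

```python
from typing import Any, Callable, Dict, List


class VirtualTuple:
    """Stored fields plus virtual fields computed on demand (hypothetical)."""

    def __init__(self, stored: Dict[str, Any]) -> None:
        self._stored = dict(stored)
        self._virtual: Dict[str, Callable[["VirtualTuple"], Any]] = {}

    def add_calculated(self, name: str,
                       func: Callable[["VirtualTuple"], Any]) -> None:
        self._virtual[name] = func

    def add_static(self, name: str, value: Any) -> None:
        self._virtual[name] = lambda _tuple: value

    def field_names(self) -> List[str]:
        return list(self._stored) + list(self._virtual)

    def value(self, name: str) -> Any:
        if name in self._virtual:
            return self._virtual[name](self)   # evaluated at read time
        return self._stored[name]


flow = VirtualTuple({"bytes": 3000, "packets": 4})
flow.add_calculated("avg_size", lambda t: t.value("bytes") / t.value("packets"))
flow.add_static("sensor", "tap-1")
# Assumed meaning of "composite": the field's value is itself a tuple.
flow.add_static("src", VirtualTuple({"ip": "10.0.0.1", "port": 443}))
```

Readers of the tuple call `value` the same way for stored and virtual fields, so downstream processors do not need to know which is which.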

Page 18

XML Syntax … run away!

<application>
  <dataline name="dl">
    <processor name="reader" type="FileReader">
      <configInfo></configInfo>
    </processor>

    <pipeline name="p1">
      <pipe from="reader" to="parser" />
      <pipe from="parser" to="printer" />
    </pipeline>

    <eventManagement>
      <event name="start">
        <call method="start" target="reader" />
      </event>
      <event name="alert" from="reader">
        <call method="stop" target="parser" />
      </event>
    </eventManagement>
  </dataline>
</application>
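As a rough idea of what a tool reading this file has to extract, the sketch below loads a well-formed transcription of the slide's XML with Python's standard `xml.etree.ElementTree` and pulls out the three components. It is not the DataLines compiler; the variable names and the "undefined processor" check are illustrative additions.

```python
import xml.etree.ElementTree as ET

APP_XML = """
<application>
  <dataline name="dl">
    <processor name="reader" type="FileReader">
      <configInfo></configInfo>
    </processor>
    <pipeline name="p1">
      <pipe from="reader" to="parser" />
      <pipe from="parser" to="printer" />
    </pipeline>
    <eventManagement>
      <event name="start">
        <call method="start" target="reader" />
      </event>
      <event name="alert" from="reader">
        <call method="stop" target="parser" />
      </event>
    </eventManagement>
  </dataline>
</application>
"""

root = ET.fromstring(APP_XML)
dataline = root.find("dataline")

# 1. Data processors: name -> type
processors = {p.get("name"): p.get("type") for p in dataline.iter("processor")}

# 2. Processing order: (from, to) edges in document order
pipes = [(p.get("from"), p.get("to")) for p in dataline.iter("pipe")]

# 3. Event management: (event, source) -> [(target, method)]
events = {
    (e.get("name"), e.get("from")): [
        (c.get("target"), c.get("method")) for c in e.findall("call")
    ]
    for e in dataline.iter("event")
}

# Names used in pipes but never declared as processors.
undefined = sorted({name for pipe in pipes for name in pipe} - set(processors))
```

The slide declares only the `reader` processor, so `undefined` lists `parser` and `printer`; presumably their declarations were left off the slide for space.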

Page 19

Data Example

<arg name="tupleField">
  <map name="name" value="Src Ip" />
  <map name="peer" value="IpV4AddressPeer" />
  <map name="length" value="4" />
</arg>
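Read as a decoding instruction, this field says: take 4 bytes and present them as an IPv4 address named “Src Ip”. A minimal Python sketch of that step follows, using the standard `ipaddress` module. `decode_ipv4` and the sample bytes are made up for illustration, and the assumption that the bytes arrive most significant octet first is not stated on the slide.

```python
import ipaddress

# Field description copied from the slide's XML: name, peer, length in bytes.
FIELD = {"name": "Src Ip", "peer": "IpV4AddressPeer", "length": 4}


def decode_ipv4(buffer: bytes, offset: int = 0) -> str:
    """Read FIELD['length'] bytes at `offset` and render them as a dotted quad."""
    raw = buffer[offset:offset + FIELD["length"]]
    if len(raw) != FIELD["length"]:
        raise ValueError("not enough bytes for an IPv4 address")
    # Assumption: bytes are in network order (most significant octet first).
    return str(ipaddress.IPv4Address(raw))


packet = bytes([192, 168, 1, 10, 0, 80])      # made-up sample bytes
record = {FIELD["name"]: decode_ipv4(packet)}
```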

Page 20

Data Example

<arg name="tupleField">
  <map name="name" value="A" />
  <map name="peer" value="IntegerPeer" />
  <map name="length" value="4" />
</arg>
<arg name="tupleField">
  <map name="name" value="B" />
  <map name="peer" value="IntegerPeer" />
  <map name="length" value="4" />
</arg>
<arg name="tupleField">
  <map name="name" value="C" />
  <map name="peer" value="JepPeer" />
  <data name="expression">
    ${A} + ${B}
  </data>
</arg>
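This configuration describes two 4-byte integers read from the input and a third field computed from them. The Python sketch below walks through that conversion under explicit assumptions: the integers are taken to be signed and big-endian (the slide does not say), `convert` and `_evaluate` are hypothetical helpers, and a tiny hand-written arithmetic evaluator stands in for whatever expression engine `JepPeer` wraps.

```python
import ast
import operator
import struct
from string import Template

FIELDS = [("A", 4), ("B", 4)]          # names and byte lengths from the slide
EXPRESSIONS = {"C": "${A} + ${B}"}     # expression field from the slide

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node):
    """Evaluate a tiny arithmetic AST: numbers, unary minus, and + - * / only."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def convert(raw: bytes) -> dict:
    """Turn 8 raw bytes into a record with stored fields A, B and computed C."""
    # Assumption: 4-byte signed integers in network (big-endian) byte order.
    values = struct.unpack("!" + "i" * len(FIELDS), raw)
    record = {name: value for (name, _length), value in zip(FIELDS, values)}
    for name, expression in EXPRESSIONS.items():
        # Fill in the ${...} placeholders with the decoded field values.
        filled = Template(expression).substitute(record)
        record[name] = _evaluate(ast.parse(filled, mode="eval"))
    return record


record = convert(struct.pack("!ii", 3, 4))
```

Python's `string.Template` happens to use the same `${name}` placeholder syntax as the slide, which keeps the substitution step to one line; the evaluator walks the parsed expression instead of calling `eval` so that only arithmetic is accepted.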

Page 21

DataLines Tutorial

• Fast forward past a painful 3-hour tutorial covering each of those sections in detail (tuples, processors, pipelines, event management, configurations)

• You have seen all the XML though!

Page 22

DataLines Distilled

• A library of data processors that operate on “Tuples”

– one of the processors takes the raw data and creates the tuple

• An XML compiler that takes the XML file and the library and creates an application

Page 23

DataLines Example

Page 24

DataLines in use

• DataLines does make it easier to hit the ground running. Much of the tedious work you need to do is taken care of.

• For highly specific needs, you still need to write code. But that code then becomes part of the DataLines library, which others can build on.

Page 25

Balance Sheet

• Positive
– Flexible (vendor neutral, data, debugging)
– Reusable (pipelines, processors)
– Fast development time
– “Easy” to change the client (CLI, desktop, web page)

• Negative
– May need to write domain-specific code
– Learning curve: processor config, data expectations, events

Page 26

DataLines in Action

• Network Engineering group
– Monitor router, tar pit, IDS, packet sampling, L2/L3 mappings

• Security Group
– Network forensics

• Intergroup Wiring
– Use DataLines to share data between groups/projects

Page 27

DataLines in Action

• Network Research group
– Monitor cluster network activity from MPI layer
– Data Mining
– Misc. NSF data-oriented projects

Page 28

Future

• Open Source

• More Info: [email protected]

• http://datalines.ncsa.uiuc.edu (a work in progress)