DATA INTEGRATION AND SQL APPLICATION MIGRATION WITH CASCADING LINGUAL
Chris K Wensel | Hadoop Summit EU 2014
• Not a “data scientist”
• No idea what “big data” means
• Used MR in anger once, and did it wrong
• Author of Cascading
• Co-Author of Lingual (w/ Julian Hyde)
CHRIS K WENSEL
2
3
Why is Hadoop & “big data” a thing?
More is better
HADOOP & BIG DATA
4
More Data More Machines More Algorithms
More Tools
HADOOP & BIG DATA
5
Worse is better
HADOOP & BIG DATA
6
Less red tape More degrees of freedom
No upfront design
HADOOP & BIG DATA
7
8
Why Cascading?
Makes hard things possible.
CASCADING
9
While helping to retain Conceptual Integrity.
CASCADING
10
"the speed of innovation is proportional to the arrival rate of
answers to questions"
HADOOP & BIG DATA
11
True when you are questioning Data, Algorithms, and
Architecture
CASCADING
12
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
13
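The "testable at every lifecycle stage" point follows from the planner being pluggable: the same assembly can be planned for local, in-memory execution inside a unit test, then re-planned for Hadoop unchanged. A sketch under that assumption, using the cascading-local taps; the paths and field names are hypothetical:

```java
// Same business logic, planned for local execution in a test.
// FileTap / TextDelimited come from cascading-local; the identical
// Pipe assembly would be handed to HadoopFlowConnector in production.
Tap inTap  = new FileTap( new TextDelimited( new Fields( "line" ) ), "data/in.txt" );
Tap outTap = new FileTap( new TextDelimited( new Fields( "line" ) ), "data/out.txt" );

Pipe copy = new Pipe( "copy" ); // business logic attaches here, unchanged

Flow flow = new LocalFlowConnector().connect( inTap, outTap, copy );
flow.complete(); // a test would then assert on the contents of data/out.txt
```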
[Diagram: the Cascading stack. Enterprise Java apps and scripting languages (Scala, Clojure, JRuby, Jython, Groovy) use the Processing, Integration, and Scheduler APIs; the Process Planner and Scheduler map the work onto the compute layer and the data stores.]
ECOSYSTEM
14
[Diagram: the ecosystem stack. Lingual, Pattern, Scalding, and Cascalog are built on Cascading, which runs on Hadoop MR, Tez, or whatever comes next.]
• Started in 2007
• 2.0 released June 2012
• 2.5 stable out now
• 3.0 work in progress now available
• Tez support coming soon
• Apache Licensed Open-Source
• Supports all Hadoop 1 & 2 distros
CASCADING
15
ANSI SQL
on Cascading
on Whatever
LINGUAL
16
How is this different from all the other “SQL for Hadoop” projects?
LINGUAL
17
Not intended as an ad-hoc query interface.
[Lingual is only as fast as Hadoop]
WHY LINGUAL?
18
Intended to be as standards-compliant as possible.
WHY LINGUAL?
19
Migrate workloads from expensive systems to less expensive Hadoop
WHY LINGUAL?
20
Liberate the data trapped on Hadoop without involving an engineer
WHY LINGUAL?
21
• ANSI-compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
22
[Diagram: the Lingual stack. Clients (the CLI/shell and enterprise Java apps) come in through the JDBC, Lingual, and Provider APIs; the Query Planner, backed by the Catalog, plans queries onto Cascading, which runs on the compute layer over the data stores.]
• SQL-92
• Character, Numeric, and Temporal types
• IN and CASE
• FROM sub-queries
• CAST and CONVERT
• CURRENT_*
ANSI SQL
23
http://docs.cascading.org/lingual/1.1/#sql-support
24
query:
  { select
  | query UNION [ ALL ] query
  | query EXCEPT query
  | query INTERSECT query }
  [ ORDER BY orderItem [, orderItem ]* ]
  [ LIMIT { count | ALL } ]
  [ OFFSET start { ROW | ROWS } ]
  [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

orderItem:
  expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
  SELECT [ ALL | DISTINCT ]
  { * | projectItem [, projectItem ]* }
  FROM tableExpression
  [ WHERE booleanExpression ]
  [ GROUP BY { () | expression [, expression ]* } ]
  [ HAVING booleanExpression ]
  [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

projectItem:
  expression [ [ AS ] columnAlias ]
  | tableAlias . *

tableExpression:
  tableReference [, tableReference ]*
  | tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ] JOIN tableExpression [ joinCondition ]

joinCondition:
  ON booleanExpression
  | USING ( column [, column ]* )

tableReference:
  tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

tablePrimary:
  [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
  | ( query )
  | VALUES expression [, expression ]*
  | ( TABLE expression )

windowRef:
  windowName
  | windowSpec

windowSpec:
  [ windowName ]
  ( [ ORDER BY orderItem [, orderItem ]* ]
    [ PARTITION BY expression [, expression ]* ]
    { RANGE numericOrInterval { PRECEDING | FOLLOWING }
    | ROWS numeric { PRECEDING | FOLLOWING } } )
Lingual 1.1 -> Optiq 0.4.12.3
https://github.com/julianhyde/optiq/blob/master/REFERENCE.md
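To make the supported surface concrete, here is a sketch of a query exercising several of the constructs above (a FROM sub-query, CASE, CAST, and IN), written as a string the way it would be passed to the JDBC driver shown later in the deck. The EMPLOYEE and SALES_FACT_1997 tables come from the deck's own example; the NAME, AMOUNT, and DEPT columns are assumptions for illustration:

```java
// Hypothetical query exercising a FROM sub-query, CASE, CAST, and IN.
// Columns NAME, AMOUNT, and DEPT are illustrative, not from the example data.
String sql =
    "select e.\"NAME\",\n"
  + "  case when t.\"TOTAL\" > 100000 then 'large' else 'small' end as \"BAND\",\n"
  + "  cast( t.\"TOTAL\" as integer ) as \"TOTAL\"\n"
  + "from \"EXAMPLE\".\"EMPLOYEE\" as e\n"
  + "join ( select \"CUST_ID\", sum( \"AMOUNT\" ) as \"TOTAL\"\n"
  + "       from \"EXAMPLE\".\"SALES_FACT_1997\"\n"
  + "       group by \"CUST_ID\" ) as t\n"
  + "on e.\"EMPID\" = t.\"CUST_ID\"\n"
  + "where e.\"DEPT\" in ( 'Sales', 'Marketing' )";

ResultSet resultSet = statement.executeQuery( sql ); // statement as in the JDBC example
```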
Lingual provides two interfaces.
APIS
25
Allows SQL and non-SQL Flows to work together as a single application via
conceptually similar interfaces
CASCADING API
26
27
Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

Flow flow = new HadoopFlowConnector().connect( flowDef );

flow.complete();
So systems and people can talk directly to data visible on Hadoop
JDBC API
28
29
JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
    + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
    + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
    + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
JDBC
30
[Diagram: from a server or desktop, SQL such as “select * from employees ...” is submitted through the JDBC driver (lingual-hadoop-1.1.0-jdbc.jar), planned into a Flow assembly using the meta-data catalog, and run as jobs on the cluster.]
DEFAULT SHELL
31
select dept_no, avg( max_salary )
from
  employees.dept_emp,
  ( select emp_no as sal_emp_no, max( salary ) as max_salary
    from employees.salaries
    group by emp_no )
where dept_emp.emp_no = sal_emp_no
group by dept_no;
SUB-QUERY
32
ACCESS HADOOP FROM R
33
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
RESULTS
34
> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
INTEGRATION
35
But I use a custom data format!
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is dynamically loaded into the cluster, on the fly
DATA PROVIDER API
36
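As a sketch of what a provider jar ultimately contributes: it bundles an ordinary Cascading Scheme and/or Tap that Lingual binds to a table at query time. The class and path names below are hypothetical illustrations, not the actual Provider API:

```java
// Hypothetical: the kind of Tap a custom provider exposes to Lingual.
// YourAvroScheme stands in for whatever Scheme the provider jar bundles.
Scheme scheme = new YourAvroScheme( new Fields( "emp_no", "name" ) );
Tap employees = new Hfs( scheme, "hdfs:/data/employees.avro" );
```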
DATA PROVIDER
37
[Diagram: the JDBC driver (lingual-hadoop-1.1.0-jdbc.jar) resolves provider jars such as cascading-jdbc-oracle-provider.jar or your-avro-provider.jar from a Maven repo, binds them into the Flow assembly, and ships them with the jobs to the cluster.]
AMAZON EMR & REDSHIFT
38
[Diagram: SQL such as “SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...” runs as a chain of jobs on Amazon Elastic MapReduce, reading file1 and file2 from Amazon S3 and writing the results to Amazon Redshift.]
http://docs.cascading.org/tutorials/lingual-redshift/
All Cascading applications can be visualized and monitored …
MANAGED
39
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated in a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
40
http://cascading.io/driven/
MANAGED WITH DRIVEN
41
42
A BOOK!
43
Enterprise Data Workflows with Cascading
O’Reilly, 2013 amazon.com/dp/1449358721