DATA INTEGRATION AND SQL APPLICATION MIGRATION WITH CASCADING LINGUAL
Chris K Wensel | Hadoop Summit EU 2014
• Not a “data scientist”
• No idea what “big data” means
• Used MR in anger once, and did it wrong
• Author of Cascading
• Co-Author of Lingual (w/ Julian Hyde)
CHRIS K WENSEL
2
3
Why is Hadoop & “big data” a thing?
More is better
HADOOP & BIG DATA
4
More Data More Machines More Algorithms
More Tools
HADOOP & BIG DATA
5
Worse is better
HADOOP & BIG DATA
6
Less red tape More degrees of freedom
No upfront design
HADOOP & BIG DATA
7
8
Why Cascading?
Makes hard things possible.
CASCADING
9
While helping to retain Conceptual Integrity.
CASCADING
10
"the speed of innovation is proportional to the arrival rate of
answers to questions"
HADOOP & BIG DATA
11
True when you are questioning Data, Algorithms, and
Architecture
CASCADING
12
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
13
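The "testable at every lifecycle stage" point follows from the planner being pluggable: the same assembly can be planned for local, in-memory execution inside a unit test, then re-planned for Hadoop unchanged. A sketch under that assumption, using the cascading-local taps; the paths and field names are hypothetical:

```java
// Same business logic, planned for local execution in a test.
// FileTap / TextDelimited come from cascading-local; the identical
// Pipe assembly would be handed to HadoopFlowConnector in production.
Tap inTap  = new FileTap( new TextDelimited( new Fields( "line" ) ), "data/in.txt" );
Tap outTap = new FileTap( new TextDelimited( new Fields( "line" ) ), "data/out.txt" );

Pipe copy = new Pipe( "copy" ); // business logic attaches here, unchanged

Flow flow = new LocalFlowConnector().connect( inTap, outTap, copy );
flow.complete(); // a test would then assert on the contents of data/out.txt
```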
[Diagram: the Cascading stack. Enterprise Java apps and scripting languages (Scala, Clojure, JRuby, Jython, Groovy) use the Processing, Integration, and Scheduler APIs; the Process Planner and Scheduler map the work onto the compute layer and the data stores.]
ECOSYSTEM
14
[Diagram: the ecosystem stack. Lingual, Pattern, Scalding, and Cascalog are built on Cascading, which runs on Hadoop MR, Tez, or whatever comes next.]
• Started in 2007
• 2.0 released June 2012
• 2.5 stable out now
• 3.0 work in progress now available
• Tez support coming soon
• Apache Licensed Open-Source
• Supports all Hadoop 1 & 2 distros
CASCADING
15
ANSI SQL
on Cascading
on Whatever
LINGUAL
16
How is this different from all the other “SQL for Hadoop” projects?
LINGUAL
17
Not intended as an ad-hoc query interface.
[Lingual is only as fast as Hadoop]
WHY LINGUAL?
18
Intended to be as standards-compliant as possible.
WHY LINGUAL?
19
Migrate workloads from expensive systems to less expensive Hadoop
WHY LINGUAL?
20
Liberate the data trapped on Hadoop without involving an engineer
WHY LINGUAL?
21
• ANSI-compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
22
[Diagram: the Lingual stack. Clients (the CLI/shell and enterprise Java apps) come in through the JDBC, Lingual, and Provider APIs; the Query Planner, backed by the Catalog, plans queries onto Cascading, which runs on the compute layer over the data stores.]
• SQL-92
• Character, Numeric, and Temporal types
• IN and CASE
• FROM sub-queries
• CAST and CONVERT
• CURRENT_*
ANSI SQL
23
http://docs.cascading.org/lingual/1.1/#sql-support
24
query:
  { select
  | query UNION [ ALL ] query
  | query EXCEPT query
  | query INTERSECT query }
  [ ORDER BY orderItem [, orderItem ]* ]
  [ LIMIT { count | ALL } ]
  [ OFFSET start { ROW | ROWS } ]
  [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

orderItem:
  expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
  SELECT [ ALL | DISTINCT ]
  { * | projectItem [, projectItem ]* }
  FROM tableExpression
  [ WHERE booleanExpression ]
  [ GROUP BY { () | expression [, expression ]* } ]
  [ HAVING booleanExpression ]
  [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

projectItem:
  expression [ [ AS ] columnAlias ]
  | tableAlias . *

tableExpression:
  tableReference [, tableReference ]*
  | tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ] JOIN tableExpression [ joinCondition ]

joinCondition:
  ON booleanExpression
  | USING ( column [, column ]* )

tableReference:
  tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

tablePrimary:
  [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
  | ( query )
  | VALUES expression [, expression ]*
  | ( TABLE expression )

windowRef:
  windowName
  | windowSpec

windowSpec:
  [ windowName ]
  ( [ ORDER BY orderItem [, orderItem ]* ]
    [ PARTITION BY expression [, expression ]* ]
    { RANGE numericOrInterval { PRECEDING | FOLLOWING }
    | ROWS numeric { PRECEDING | FOLLOWING } } )
Lingual 1.1 -> Optiq 0.4.12.3
https://github.com/julianhyde/optiq/blob/master/REFERENCE.md
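To make the supported surface concrete, here is a sketch of a query exercising several of the constructs above (a FROM sub-query, CASE, CAST, and IN), written as a string the way it would be passed to the JDBC driver shown later in the deck. The EMPLOYEE and SALES_FACT_1997 tables come from the deck's own example; the NAME, AMOUNT, and DEPT columns are assumptions for illustration:

```java
// Hypothetical query exercising a FROM sub-query, CASE, CAST, and IN.
// Columns NAME, AMOUNT, and DEPT are illustrative, not from the example data.
String sql =
    "select e.\"NAME\",\n"
  + "  case when t.\"TOTAL\" > 100000 then 'large' else 'small' end as \"BAND\",\n"
  + "  cast( t.\"TOTAL\" as integer ) as \"TOTAL\"\n"
  + "from \"EXAMPLE\".\"EMPLOYEE\" as e\n"
  + "join ( select \"CUST_ID\", sum( \"AMOUNT\" ) as \"TOTAL\"\n"
  + "       from \"EXAMPLE\".\"SALES_FACT_1997\"\n"
  + "       group by \"CUST_ID\" ) as t\n"
  + "on e.\"EMPID\" = t.\"CUST_ID\"\n"
  + "where e.\"DEPT\" in ( 'Sales', 'Marketing' )";

ResultSet resultSet = statement.executeQuery( sql ); // statement as in the JDBC example
```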
Lingual provides two interfaces.
APIS
25
Allows SQL and non-SQL Flows to work together as a single application via
conceptually similar interfaces
CASCADING API
26
27
Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

Flow flow = new HadoopFlowConnector().connect( flowDef );

flow.complete();
So systems and people can talk directly to data visible on Hadoop
JDBC API
28
29
JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
    + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
    + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
    + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
JDBC
30
[Diagram: from a server or desktop, SQL such as “select * from employees ...” is submitted through the JDBC driver (lingual-hadoop-1.1.0-jdbc.jar), planned into a Flow assembly using the meta-data catalog, and run as jobs on the cluster.]
DEFAULT SHELL
31
select dept_no, avg( max_salary )
from
  employees.dept_emp,
  ( select emp_no as sal_emp_no, max( salary ) as max_salary
    from employees.salaries
    group by emp_no )
where dept_emp.emp_no = sal_emp_no
group by dept_no;
SUB-QUERY
32
ACCESS HADOOP FROM R
33
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
RESULTS
34
> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
INTEGRATION
35
But I use a custom data format!
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is dynamically loaded into the cluster, on the fly
DATA PROVIDER API
36
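As a sketch of what a provider jar ultimately contributes: it bundles an ordinary Cascading Scheme and/or Tap that Lingual binds to a table at query time. The class and path names below are hypothetical illustrations, not the actual Provider API:

```java
// Hypothetical: the kind of Tap a custom provider exposes to Lingual.
// YourAvroScheme stands in for whatever Scheme the provider jar bundles.
Scheme scheme = new YourAvroScheme( new Fields( "emp_no", "name" ) );
Tap employees = new Hfs( scheme, "hdfs:/data/employees.avro" );
```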
DATA PROVIDER
37
[Diagram: the JDBC driver (lingual-hadoop-1.1.0-jdbc.jar) resolves provider jars such as cascading-jdbc-oracle-provider.jar or your-avro-provider.jar from a Maven repo, binds them into the Flow assembly, and ships them with the jobs to the cluster.]
AMAZON EMR & REDSHIFT
38
[Diagram: SQL such as “SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...” runs as a chain of jobs on Amazon Elastic MapReduce, reading file1 and file2 from Amazon S3 and writing the results to Amazon Redshift.]
http://docs.cascading.org/tutorials/lingual-redshift/
All Cascading applications can be visualized and monitored …
MANAGED
39
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated in a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
40
http://cascading.io/driven/
MANAGED WITH DRIVEN
41
42
A BOOK!
43
Enterprise Data Workflows with Cascading
O’Reilly, 2013 amazon.com/dp/1449358721