ApacheKylin2.0从传统OLAP向实时数据仓库演进
马洪宾 | [email protected]
PMC member of Apache KylinKyligence Inc.技术合伙人 &高级架构师
All rights reserved ©Kyligence Inc.http://kyligence.io
Apache Kylin历史
Sep 2013
项目开始
Oct 2014
加入Apache孵化器项目
Nov 2014
InfoWorld: Bossie Award
最佳开源大数据工具奖
毕业成为Apache顶级项目
Kyligence公司创建
Sep 2015 Nov 2015 Mar 2016
正式开源
InfoWorld: Bossie Award
最佳开源大数据工具奖
Sep 2016
商业版KAP发布
Aug 2016
All rights reserved ©Kyligence Inc.http://kyligence.io
关于Kyligence• Kyligence’svisionistounleashbigdataproductivityforeveryone's
analyticsneeds.
• ThecompanywasfoundedbytheteamwhocreatedApacheKylin™,atopopensourceOLAPenginebuiltforinteractiveanalyticsatpetabyte-scaledataonHadoop.KyligenceistheprimarycontributortotheopensourceKylinprojectglobally.
• Kyligenceprovidesaleadingintelligentdataplatformtosimplifybigdataanalyticsfromon-premisestocloud.
All rights reserved ©Kyligence Inc.http://kyligence.io
Apache Kylin全球案例
All rights reserved ©Kyligence Inc.http://kyligence.io
All rights reserved ©Kyligence Inc.http://kyligence.io
BI可视化
HDFS
ApacheKylin
Hive HBase
Interactive Reporting Dashboard
OLAP引擎
Hadoop
- 3万亿条数据,<1秒查询延迟@头条,国内第一新闻资讯app
- 60+维度的cube@太平洋保险,中国三大保险公司之一
- JDBC/ODBC/RestAPI
- BI集成
ApacheKylin是什么
Apache Kylin in the Zoo
All rights reserved ©Kyligence Inc.http://kyligence.io
Offline Cubing
Kylin
BITools,WebApp…
ANSISQL
Kylin为什么快
All rights reserved ©Kyligence Inc.http://kyligence.io
selectl_returnflag,o_orderstatus,sum(l_quantity)assum_qty,sum(l_extendedprice)assum_base_price…
fromv_lineiteminnerjoin v_orders onl_orderkey =o_orderkey
wherel_shipdate <='1998-09-16'
groupbyl_returnflag,o_orderstatus
orderbyl_returnflag,o_orderstatus;
Asamplequery:Reportrevenueby“returnflag”and“orderstatus”Sort
Aggr.
Filter
TablesO(N)
Join
Kylin为什么快
All rights reserved ©Kyligence Inc.http://kyligence.io
Sort
Cuboid
Filter
Sort
Aggr.
Filter
TablesO(N)
Join
O(flagxstatusxdays)=O(1)
Pre-calculatetheKylinCube
Kylin关键在于预计算
All rights reserved ©Kyligence Inc.http://kyligence.io
time, item
time, item, location
time, item, location, supplier
time item location supplier
time, location
Time, supplier
item, location
item, supplier
location, supplier
time, item, supplier
time, location, supplier
item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
- 基于cube理论- Model和 Cube定义了预计算的范围- BuildEngine执行构建任务- QueryEngine在预计算的结果之上完成查询
O(1)复杂度
All rights reserved ©Kyligence Inc.http://kyligence.io
OnlineCalculation
O(N)
O(1)ApacheKylin
DataSize
ResponseTime
All rights reserved ©Kyligence Inc.http://kyligence.io
近实时流数据处理
构建分钟级别延迟的cube
New in Kylin 1.6
All rights reserved ©Kyligence Inc.http://kyligence.io
Offline Cubing
Kylin
BITools,WebApp…
ANSISQL
New
ThanksSee you on our next meeting
DemoofTwitterAnalysis
All rights reserved ©Kyligence Inc.http://kyligence.io
http://hub.kyligence.io
Incrementalbuildtriggersevery2minutes,buildfinishesin3minutes.
- 8-nodeclusteronAWS,3Kafkabrokers
- Twittersamplefeed,10+Kmessagespersecond
- Cubehas9dimensionsand3measures
- 2jobsrunningatthesametime
All rights reserved ©Kyligence Inc.http://kyligence.io
SparkCubing减少一半的构建时间
Layered Cubing
All rights reserved ©Kyligence Inc.http://kyligence.io
标准的构建算法:Layered Cubing
- 启动多轮MR任务
- 将大型shuffle切分到多个stage
- 稳定,但是在构建时间上并不是最优的
In-memory Cubing
All rights reserved ©Kyligence Inc.http://kyligence.io
In-memory Cubing是对Layered Cubing的强力补充
- 在某些条件下触发
- 并不适用于所有场景
- 一旦被触发,往往拥有更好性能
CubingwithMR总结
All rights reserved ©Kyligence Inc.http://kyligence.io
比较稳定
Layered Cubing在某些场景下性能都有待提高
In-mem Cubing适用场景有限
社区迫不及待地想要尝试使用其他技术来加速cubing
ApacheSpark介绍
All rights reserved ©Kyligence Inc.http://kyligence.io
ApacheSparkisanopen-sourcecluster-computingframework,whichprovidesprogrammerswithanapplicationprogramminginterfacecenteredonadatastructurecalledRDD.
SparkwasdevelopedinresponsetolimitationsintheMapReduceclustercomputingparadigm.
SparkrunsonHadoop,Mesos,standalone,orinthecloud.ItcanaccessdiversedatasourcesincludingHDFS,Cassandra,HBase,andS3.
Spark Cubing:一次失败的尝试
All rights reserved ©Kyligence Inc.http://kyligence.io
Kylin1.5曾经尝试使用过Spark Cubing,但是从未正式发布
- 它只是简单地将In-memory Cubing移植到Spark上
- 使用一轮RDD转换计算整个cube
- 并未观察到明显改进
- Spark计算方式与MR并无明显区别
SparkCubingin2.0
All rights reserved ©Kyligence Inc.http://kyligence.io
RDD-1
RDD-2
RDD-3
RDD-4
RDD-5
Kylin2.0基于LayeredCubing算法重新打造了Spark Cubing
- 每一层的cuboid视作一个RDD
- 父亲RDD被尽可能cache到内存
- RDD被导出到sequencefile,
- 通过将 “map”替换为“flatMap”,以及把“reduce”替换为 “reduceByKey”,可以复用大部分代码
计算第三层cuboid的DAG
All rights reserved ©Kyligence Inc.http://kyligence.io
Performance Test
All rights reserved ©Kyligence Inc.http://kyligence.io
• Environment4nodesHadoopcluster;eachnodehas28GBRAMand12coresYRANhas48GBRAMand30coresintotalCDH5.8,ApacheKylin2.0beta
• SparkSpark1.6.3onYARN6executors,eachhas4cores,4GB+1GB(overhead)memory
• TestDataAirlinedata,total160millionrowsCube:10dimensions,5measures(SUM)
• TestScenariosBuildthecubeatdifferentsourcedatascale:3million,50millionand160millionsourcerowsComparethebuildtimewithMapReduce(both bylayer and in-mem)andSpark.NocompressionThetimeonlycoverthebuildingcubestep,notincludingpreparationsandsubsequentsteps
SparkCubingvs.MRLayeredCubing
All rights reserved ©Kyligence Inc.http://kyligence.io
Build Cube Time Comparison
Build
Tim
e (m
inut
e)
0
8
15
23
30
Source data rows
3-million 50-million 160-millionMapReduce Spark MapReduce Spark MapReduce Spark
17
8
3
29
19
6
70%to130%improvementonSpark
SparkCubingvs.MRIn-memCubing
All rights reserved ©Kyligence Inc.http://kyligence.io
结论
All rights reserved ©Kyligence Inc.http://kyligence.io
• Layered cubingalgorithmis stableonbothMRandSparkSpark
• ComparedwithMRlayered Cubing,70%to130%performanceimprovement is observed on
Spark Cubing
• Whensourcedatais sharded,Sparkstillkeeps closeperformancewithMRin-memcubing
• There’s still room for improvement
All rights reserved ©Kyligence Inc.http://kyligence.io
雪花模型的支持
运行TPC-Hbenchmark
Kylin1.0StarSchemaLimitation
All rights reserved ©Kyligence Inc.http://kyligence.io
Pre-calculatethejoinof1leveloflookups
- Supportstarschemaonly- Not allowsamenamecolumnsfromdifferenttables- Not allowtablejoiningitself- Difficulttosupportrealworldbusinesscases
LINEORDER
DATES
PART
CUSTOMER
SUPPLIER
Join
LINEORDER
DATES PART
CUSTOMERSUPPLIER
Kylin2.0SnowflakeSchema
All rights reserved ©Kyligence Inc.http://kyligence.io
Pre-calculateunlimitedlevelsoflookups
- Snowflakeschemasupport(KYLIN-1875)
- Allowtablebejoinedmultipletimes
- BigmetadatachangeatModellevel
- Manybugfixesregardingjoinsandsub-queries
- Supportcomplexmodelsofanykind,supportflexiblequeriesonthemodels
ORDERS
CUSTOMER
SUPPLIER
PART
LINEITEM
PARTSUPP
NATION
REGION
Join
Join
Join
Join
Join
TPC-HonKylin2.0
All rights reserved ©Kyligence Inc.http://kyligence.io
TPC-H isabenchmarkfordecisionsupportsystem.
- PopularamongcommercialRDBMS&DWsolutions- Queriesanddatahavebroadindustry-widerelevance- Examinelargevolumesofdata- Executequerieswithahighdegreeofcomplexity- Giveanswerstocriticalbusinessquestions
Kylin2.0runsallthe22TPC-Hqueries.(KYLIN-2467)
- Pre-calculationcananswerverycomplexqueries- Goalisfunctionalityatthisstage- Tryit:https://github.com/Kyligence/kylin-tpch
ORDERS
CUSTOMER SUPPLIER PART
LINEITEM
PARTSUPPNATION
REGION
ComplexQuery1
All rights reserved ©Kyligence Inc.http://kyligence.io
TPC-Hquery07- 0.17sec(Hive+Tez 35.23sec)- 2sub-queriesselectsupp_nation,cust_nation,l_year,sum(volume) asrevenue
from(selectn1.n_nameassupp_nation,n2.n_nameascust_nation,l_shipyear asl_year,l_saleprice asvolume
fromv_lineiteminnerjoinsupplier ons_suppkey =l_suppkeyinnerjoinv_orders onl_orderkey =o_orderkeyinnerjoincustomerono_custkey =c_custkeyinnerjoinnationn1ons_nationkey =n1.n_nationkeyinnerjoinnationn2onc_nationkey =n2.n_nationkey
where((n1.n_name='KENYA'andn2.n_name='PERU')or(n1.n_name='PERU'andn2.n_name='KENYA')
)andl_shipdate between'1995-01-01'and'1996-12-31'
)asshippinggroupbysupp_nation,cust_nation,l_year
orderbysupp_nation,cust_nation,l_year
Sort
Aggr.
Filter
LINEITEM
Join
Proj.
Join
Join
Join
Join SUPPLIER
ORDER
CUSTOMER
NATION
NATION
All rights reserved ©Kyligence Inc.http://kyligence.io
TPC-Hquery11- 3.42sec(Hive+Tez 15.87sec)- 4sub-queries,1onlinejoinwithq11_part_tmp_cachedas(selectps_partkey,sum(ps_partvalue) aspart_value
fromv_partsuppinnerjoin supplier onps_suppkey =s_suppkeyinnerjoin nationons_nationkey =n_nationkey
wheren_name ='GERMANY'
groupbyps_partkey),q11_sum_tmp_cached as(selectsum(part_value) astotal_value
fromq11_part_tmp_cached
)
selectps_partkey,part_value
from(selectps_partkey,part_value,total_value
fromq11_part_tmp_cached, q11_sum_tmp_cached
)awherepart_value >total_value *0.0001
orderbypart_value desc;
Sort
Filter
Join
Proj.
Aggr.
Filter
Join
Proj.
Join SUPPLIER
NATION
Aggr.
Filter
Join
Proj.
Join SUPPLIER
NATION
PARTSUPP
PARTSUPP
Proj.
Aggr.
ComplexQuery2
All rights reserved ©Kyligence Inc.http://kyligence.io
TPC-Hquery12- 7.66sec(Hive+Tez 12.64sec)- 5sub-queries,2onlinejoinswithin_scope_data as(selectl_shipmode,o_orderpriority
fromv_lineitem innerjoinv_orders on l_orderkey =o_orderkey
wherel_shipmode in('REGAIR','MAIL')andl_receiptdelayed =1andl_shipdelayed =0andl_receiptdate >='1995-01-01'andl_receiptdate <'1996-01-01'
),all_l_shipmode as(selectdistinctl_shipmode
fromin_scope_data
),high_line as(selectl_shipmode,count(*)ashigh_line_count
fromin_scope_data
whereo_orderpriority ='1-URGENT'oro_orderpriority ='2-HIGH'
groupby l_shipmode),low_line as(selectl_shipmode,count(*)aslow_line_count
fromin_scope_data
whereo_orderpriority <>'1-URGENT'ando_orderpriority <>'2-HIGH'
groupby l_shipmode)selectal.l_shipmode,hl.high_line_count, ll.low_line_count
fromall_l_shipmode alleftjoinhigh_line hlonal.l_shipmode =hl.l_shipmodeleftjoinlow_line ll onal.l_shipmode =ll.l_shipmode
orderbyal.l_shipmode
ComplexQuery3
Sort
Filter
Join
Join
Aggr.
Filter
Join
Proj.
ORDERS
LINEITEMAggr.
Filter
Join
Proj.
ORDERS
LINEITEM
Aggr.
Filter
Join
Proj.
ORDERS
LINEITEM
MorethanMOLAP
All rights reserved ©Kyligence Inc.http://kyligence.io
- Supportscomplexdatamodelsandsub-queries;RunsTPC-H- Percentile/Window/Timefunctions
SQLMaturity
Speed
Kylin2.0Kylin1.0
DWonHadoop
MOLAP AnalyticsDW
总结
All rights reserved ©Kyligence Inc.http://kyligence.io
ApacheKylin2.0
- Kylin2.0Beta可供下载.
- Sparkcubing
- 雪花模型的支持- 可尝试的TPC-Hbenchmark- 时间函数/窗口函数/百分比函数
- 近实时流式处理
Whatisnext
- Hadoop3.0支持(ErasureCoding)
- 完善Spark Cubing
- 连接更多数据源(JDBC,SparkSQL)
- 替换存储层(Kudu?)
- 支持真正实时 lambdaarchitecture
感谢聆听!
All rights reserved ©Kyligence Inc.http://kyligence.io
ApacheKylin
Twitter:@ApacheKylinhttp://kylin.apache.org
Kyligence Inc.
Twitter:@Kyligencehttp://kyligence.io