2017.04. 조창원
Introducing Spark
Why Spark?
Spark offers you:
1) Lazy computation: the job is optimized before it executes
2) In-memory data caching (not an in-memory DBMS): scan the HDD only once, then scan your RAM
3) Efficient pipelining: avoid the data hitting the HDD by all means (sketched below)
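A minimal sketch of these three points, assuming the spark-shell / Zeppelin context used later in this deck where sc is the predefined SparkContext (the path and log format are hypothetical):

// Nothing is read yet: textFile and filter are lazy transformations,
// so Spark can optimize the whole job before executing it.
val logs = sc.textFile("s3n://example-bucket/logs.csv") // hypothetical path
val errors = logs.filter(_.contains("ERROR"))

// Cache the filtered RDD in memory so the HDD is scanned only once.
errors.persist()

// Actions trigger the computation; the second action reads from RAM, not disk.
val total = errors.count()
val firstError = errors.first()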
The Spark Platform
Two main abstractions of Spark

RDD – Resilient Distributed Dataset
• In-memory processing: computation keeps intermediate results in distributed memory instead of stable storage
• Transformation: a lazy operation on an RDD that creates one or more new RDDs
(map, filter, reduceByKey, join, cogroup, randomSplit)
Transformations do not change the input RDD; RDDs are immutable
Narrow and wide transformations
• Action: an RDD operation that returns a value of any type;
an action evaluates the RDD lineage graph (see the example below)
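For example (a sketch with made-up data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow transformation: each input partition feeds one output partition.
val doubled = pairs.mapValues(_ * 2)

// Wide transformation: reduceByKey shuffles data across partitions.
val sums = doubled.reduceByKey(_ + _)

// Action: collect evaluates the whole lineage graph and returns a value.
val result = sums.collect() // Array((a,8), (b,4)), order may vary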
DAG – Directed Acyclic Graph
• A sequence of computations performed on data
Node – an RDD partition
Edge – a transformation on top of the data
Acyclic – the graph cannot return to an older partition
Directed – a transformation is an action that transitions the data partition's state
• DAGScheduler: the stage-oriented scheduler that transforms a logical execution plan into a physical execution plan (see the sketch below)
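The lineage that the DAGScheduler splits into stages can be inspected with toDebugString (a sketch; the input path is hypothetical):

val counts = sc.textFile("s3n://example-bucket/words.txt") // hypothetical path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Prints the RDD lineage graph; each extra indentation level marks a new
// stage introduced by a shuffle boundary (here, the reduceByKey).
println(counts.toDebugString)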
Spark Architecture
Spark Program Sample

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import java.util.Calendar
import java.text.SimpleDateFormat

object chaos_playinfo {
  def main(args: Array[String]) {
    // Read the schema definition file (one "name:type,length" entry per line).
    val schemafile = sc.textFile("s3n://xxxxxxxxxxxx.csv")
    schemafile.persist()

    // Map the type names from the schema file to Spark SQL DataTypes.
    def typefind(x: Any): DataType = x match {
      case "int"    => IntegerType
      case "bigint" => LongType
      case "float"  => DoubleType
      case _        => StringType
    }

    // Build the StructType schema from the definition file.
    var schema = StructType(schemafile.map(_.split(":"))
      .map(fieldName => StructField(fieldName(0), typefind(fieldName(1).split(",")(0)), true))
      .collect())
    var columnlength = schemafile.count()
- Schema Definition
GameCode:varchar,100
RegDate:datetime,0
code:varchar,20
idfa:varchar,255
os:varchar,10
country:varchar,3
category:nvarchar,100
name:nvarchar,100
p1:nvarchar,100
str1:nvarchar,100
str2:nvarchar,100
str3:nvarchar,100
Spark Program Sample

    // Calendar instance for today's date (needed by format below).
    val cal = Calendar.getInstance()
    val format = new SimpleDateFormat("yyyyMMdd")
    val tddate = format.format(cal.getTime())

    // Read the raw log data.
    val logs = spark.sparkContext.textFile("s3n://xxxxxxxxxxxx_data.csv")

    // Cast each CSV string value to the type declared in the schema.
    def castValue(value: String, toType: DataType) = toType match {
      case _: StringType  => value
      case _: LongType    => if (value == "") null else value.toLong
      case _: DoubleType  => if (value == "") null else value.toDouble
      case _: IntegerType => if (value == "") null else value.toInt
    }

    // Deduplicate, drop malformed rows, and build typed Rows against the schema.
    val rowRDD = logs.distinct()
      .map(_.split(",", -1))
      .filter(a => a.size == columnlength)
      .map(p => Row.fromSeq(p.zipWithIndex.map { case (value, index) => castValue(value, schema(index).dataType) }))

    val gameDataFrame = spark.createDataFrame(rowRDD, schema)
    gameDataFrame.printSchema()
    gameDataFrame.createOrReplaceTempView("mobile_chaos")
    // Aggregate play counts and unique devices per day/content/place.
    var strsql = " select cast(to_date(RegDate) as varchar) AS TDDate, name, str1, PlaceDiv, COUNT(*) as cnt, COUNT(distinct idfa) as ucnt " +
      " from (select RegDate, name, str1, IF(name == '던전' or name == '마도사의 탑', SPLIT(p1,'-')[0], '') as PlaceDiv, idfa from mobile_chaos) t1 " +
      " WHERE name IN ('마도사의 탑','결투장','레이드','던전','영웅 훈련소') " +
      " GROUP BY to_date(RegDate), name, str1, PlaceDiv "
    val results = spark.sql(strsql)

    // Write the aggregated results to S3, partitioned by today's date.
    results.write.format("csv").save("s3n://result_data/" + tddate)
  }
}
Spark Program Sample
import java.sql._
import java.util.Properties

// Load the Redshift JDBC driver and open a connection.
Class.forName("com.amazon.redshift.jdbc41.Driver")
val props = new Properties()
props.setProperty("user", "account")
props.setProperty("password", "asdfsdfasdfasdfasdf")
val conn = DriverManager.getConnection("jdbc:redshift://dw_redshift", props)

// COPY the CSV files written to S3 into the Redshift table, then commit.
val stmt = conn.createStatement()
val sql = "copy mobile_chaos_playinfo from 's3n://result_data/part' credentials 'aws_access_key_id=xxxxxxxxxxxxxx;aws_secret_access_key=xxxxxxxxxxxxxxxxxxxxxxxx' csv FILLRECORD;commit"
stmt.executeUpdate(sql)
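This two-step flow, writing CSV results to S3 and then issuing a COPY, follows Redshift's recommended bulk-load path: COPY from S3 loads files in parallel and is far faster than inserting the rows one by one over JDBC.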
Spark Monitoring
Zeppelin