2017.04. 조창원
Introducing Spark
Why Spark?
Spark offers you:
1) Lazy computation: the job is optimized before it executes
2) In-memory data caching (not an in-memory DBMS): scan the HDD only once, then scan your RAM
3) Efficient pipelining: avoid the data hitting the HDD by all means (sketched below)
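A minimal sketch of these three points, assuming the spark-shell / Zeppelin context used later in this deck where sc is the predefined SparkContext (the path and log format are hypothetical):

// Nothing is read yet: textFile and filter are lazy transformations,
// so Spark can optimize the whole job before executing it.
val logs = sc.textFile("s3n://example-bucket/logs.csv") // hypothetical path
val errors = logs.filter(_.contains("ERROR"))

// Cache the filtered RDD in memory so the HDD is scanned only once.
errors.persist()

// Actions trigger the computation; the second action reads from RAM, not disk.
val total = errors.count()
val firstError = errors.first()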
The Spark Platform
Two main abstractions of Spark

RDD – Resilient Distributed Dataset
• In-memory processing: computation keeps intermediate results in distributed memory instead of stable storage
• Transformation: a lazy operation on an RDD that creates one or more new RDDs
(map, filter, reduceByKey, join, cogroup, randomSplit)
Transformations do not change the input RDD; RDDs are immutable
Narrow and wide transformations
• Action: an RDD operation that returns a value of any type;
an action evaluates the RDD lineage graph (see the example below)
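For example (a sketch with made-up data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow transformation: each input partition feeds one output partition.
val doubled = pairs.mapValues(_ * 2)

// Wide transformation: reduceByKey shuffles data across partitions.
val sums = doubled.reduceByKey(_ + _)

// Action: collect evaluates the whole lineage graph and returns a value.
val result = sums.collect() // Array((a,8), (b,4)), order may vary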
DAG – Directed Acyclic Graph
• A sequence of computations performed on data
Node – an RDD partition
Edge – a transformation on top of the data
Acyclic – the graph cannot return to an older partition
Directed – a transformation is an action that transitions the data partition's state
• DAGScheduler: the stage-oriented scheduler that transforms a logical execution plan into a physical execution plan (see the sketch below)
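The lineage that the DAGScheduler splits into stages can be inspected with toDebugString (a sketch; the input path is hypothetical):

val counts = sc.textFile("s3n://example-bucket/words.txt") // hypothetical path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Prints the RDD lineage graph; each extra indentation level marks a new
// stage introduced by a shuffle boundary (here, the reduceByKey).
println(counts.toDebugString)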
Spark Architecture
Spark Program Sample

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import java.util.Calendar
import java.text.SimpleDateFormat

object chaos_playinfo {
  def main(args: Array[String]) {
    // Read the schema definition file (one "name:type,length" entry per line).
    val schemafile = sc.textFile("s3n://xxxxxxxxxxxx.csv")
    schemafile.persist()

    // Map the type names from the schema file to Spark SQL DataTypes.
    def typefind(x: Any): DataType = x match {
      case "int"    => IntegerType
      case "bigint" => LongType
      case "float"  => DoubleType
      case _        => StringType
    }

    // Build the StructType schema from the definition file.
    var schema = StructType(schemafile.map(_.split(":"))
      .map(fieldName => StructField(fieldName(0), typefind(fieldName(1).split(",")(0)), true))
      .collect())
    var columnlength = schemafile.count()
- Schema Definition
GameCode:varchar,100
RegDate:datetime,0
code:varchar,20
idfa:varchar,255
os:varchar,10
country:varchar,3
category:nvarchar,100
name:nvarchar,100
p1:nvarchar,100
str1:nvarchar,100
str2:nvarchar,100
str3:nvarchar,100
Spark Program Sample

    // Calendar instance for today's date (needed by format below).
    val cal = Calendar.getInstance()
    val format = new SimpleDateFormat("yyyyMMdd")
    val tddate = format.format(cal.getTime())

    // Read the raw log data.
    val logs = spark.sparkContext.textFile("s3n://xxxxxxxxxxxx_data.csv")

    // Cast each CSV string value to the type declared in the schema.
    def castValue(value: String, toType: DataType) = toType match {
      case _: StringType  => value
      case _: LongType    => if (value == "") null else value.toLong
      case _: DoubleType  => if (value == "") null else value.toDouble
      case _: IntegerType => if (value == "") null else value.toInt
    }

    // Deduplicate, drop malformed rows, and build typed Rows against the schema.
    val rowRDD = logs.distinct()
      .map(_.split(",", -1))
      .filter(a => a.size == columnlength)
      .map(p => Row.fromSeq(p.zipWithIndex.map { case (value, index) => castValue(value, schema(index).dataType) }))

    val gameDataFrame = spark.createDataFrame(rowRDD, schema)
    gameDataFrame.printSchema()
    gameDataFrame.createOrReplaceTempView("mobile_chaos")
    // Aggregate play counts and unique devices per day/content/place.
    var strsql = " select cast(to_date(RegDate) as varchar) AS TDDate, name, str1, PlaceDiv, COUNT(*) as cnt, COUNT(distinct idfa) as ucnt " +
      " from (select RegDate, name, str1, IF(name == '던전' or name == '마도사의 탑', SPLIT(p1,'-')[0], '') as PlaceDiv, idfa from mobile_chaos) t1 " +
      " WHERE name IN ('마도사의 탑','결투장','레이드','던전','영웅 훈련소') " +
      " GROUP BY to_date(RegDate), name, str1, PlaceDiv "
    val results = spark.sql(strsql)

    // Write the aggregated results to S3, partitioned by today's date.
    results.write.format("csv").save("s3n://result_data/" + tddate)
  }
}
Spark Program Sample
import java.sql._
import java.util.Properties

// Load the Redshift JDBC driver and open a connection.
Class.forName("com.amazon.redshift.jdbc41.Driver")
val props = new Properties()
props.setProperty("user", "account")
props.setProperty("password", "asdfsdfasdfasdfasdf")
val conn = DriverManager.getConnection("jdbc:redshift://dw_redshift", props)

// COPY the CSV files written to S3 into the Redshift table, then commit.
val stmt = conn.createStatement()
val sql = "copy mobile_chaos_playinfo from 's3n://result_data/part' credentials 'aws_access_key_id=xxxxxxxxxxxxxx;aws_secret_access_key=xxxxxxxxxxxxxxxxxxxxxxxx' csv FILLRECORD;commit"
stmt.executeUpdate(sql)
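This two-step flow, writing CSV results to S3 and then issuing a COPY, follows Redshift's recommended bulk-load path: COPY from S3 loads files in parallel and is far faster than inserting the rows one by one over JDBC.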
Spark Monitoring
Zeppelin