Introduction to Spark, Part 1

Lightning-fast cluster computing

Problem

Suppose there is a place like this.

Problem

On top of that, suppose places like this are spread all over the world.

● Why would anyone build an environment like that?

Problem

● How do we run services like these in an environment like that?

o We have to run SQL,
o keep track of data flows in real time,
o and, sooner or later, automate all sorts of data analysis as well

...

Solution

Grind the engineers until it works?

Solution

MapReduce?

● Reference

http://www.slideshare.net/brotherjinho/map-reduce-48412377

Turn every job into MapReduce!

But how do we turn SQL like this into MapReduce?

SELECT LAT_N, CITY, TEMP_F
FROM STATS, STATION
WHERE MONTH = 7 AND STATS.ID = STATION.ID
ORDER BY TEMP_F;
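To give a feel for the question, here is a rough sketch of how the query above could be hand-translated into map, shuffle, and reduce steps (a reduce-side join). It is plain Python running on one machine, not Hadoop code. The sample rows, the column layout (MONTH and TEMP_F in STATS; CITY and LAT_N in STATION), and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; the real tables are not shown in the slides.
STATION = [
    {"ID": 13, "CITY": "Phoenix", "LAT_N": 33},
    {"ID": 44, "CITY": "Denver", "LAT_N": 40},
    {"ID": 66, "CITY": "Caribou", "LAT_N": 47},
]
STATS = [
    {"ID": 13, "MONTH": 1, "TEMP_F": 57.4},
    {"ID": 13, "MONTH": 7, "TEMP_F": 91.7},
    {"ID": 44, "MONTH": 7, "TEMP_F": 74.8},
    {"ID": 66, "MONTH": 7, "TEMP_F": 65.8},
]


def map_phase(stations, stats):
    """Emit (join key, tagged row) pairs; WHERE MONTH = 7 is applied here."""
    for row in stations:
        yield row["ID"], ("STATION", row)
    for row in stats:
        if row["MONTH"] == 7:
            yield row["ID"], ("STATS", row)


def shuffle(pairs):
    """Group all values that share a key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Join the STATION and STATS rows that share an ID."""
    for tagged_rows in groups.values():
        stations = [row for tag, row in tagged_rows if tag == "STATION"]
        stats = [row for tag, row in tagged_rows if tag == "STATS"]
        for station in stations:
            for stat in stats:
                yield (station["LAT_N"], station["CITY"], stat["TEMP_F"])


def run_query(stations, stats):
    joined = reduce_phase(shuffle(map_phase(stations, stats)))
    # ORDER BY TEMP_F: sorted locally here; on a real cluster a global
    # sort would typically need an extra step of its own.
    return sorted(joined, key=lambda row: row[2])


if __name__ == "__main__":
    for row in run_query(STATION, STATS):
        print(row)
```

Even in this toy form, a four-line query becomes several hand-written functions, which is the pain point this slide is getting at.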

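Turn every job into MapReduce!

What about machine learning and data analysis work like this?

"From the nationwide real-estate transaction price data published every month since 2007, pick just the 5 meaningful variables out of the 140 that might have an effect."
"Oh, and the deadline is tomorrow."

And the code has to be this long? (... for code that merely counts words)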

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in an input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "wordcount");
        // Tells Hadoop which jar to ship to the cluster nodes.
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

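And by the way, you have to deploy this code to every machine in that server room.

... God wept, and so did I.

Tools always get better as time goes by.

Enter Spark

"Just hand me the work as it is. I'll take care of the processing."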

SQL, too



Real-time data processing, too



Machine learning, too



Generality

Under its high-level tools, it lets you run all of these jobs as they are.

It is easy to use.

It supports Java, Scala, and Python.

text_file = spark.textFile("hdfs://...")

counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

Word count in Spark's Python API
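For anyone who wants to see what the three chained steps do without a cluster, here is a minimal plain-Python stand-in. It does not use Spark at all. The helper names flat_map, reduce_by_key, and word_count are hypothetical and only mimic the behavior of flatMap, map, and reduceByKey on a small in-memory list.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Combine the values of pairs that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    words = flat_map(lambda line: line.split(), lines)  # lines -> words
    pairs = [(word, 1) for word in words]               # word -> (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)     # sum counts per word


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```

The shape is the same as the Spark version: split, pair, and sum. The difference is that Spark spreads each step across the cluster for you.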

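It runs in all sorts of distributed environments.

● It runs on Hadoop, on Mesos, standalone, and in the cloud.
● It can read data from HDFS, Cassandra, HBase, S3, and more.

It is fast, too.

It runs programs up to 100 times faster than Hadoop MapReduce when working in memory, and up to 10 times faster on disk.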

Logistic regression in Hadoop and Spark

It even comes with its own Web UI.

Zeppelin

There is even a full reporting tool. [Video]

https://www.youtube.com/watch?v=_PQbVH_aO5E&feature=youtu.be

So, about Spark

● It is a tool, not a library.
o You define the jobs you want to do on top of this tool
o and then run them.

Installing Spark

● Walking through every step is tedious, so see the links below.
o Installation guide: http://bcho.tistory.com/1024
o I also packaged it as a Vagrant VM, which takes care of the build as well: https://bitbucket.org/JinhoYoo_NomadConnection/spark_test_vm

So what are you going to do with this?

Build things like these.

http://engineering.vcnc.co.kr/2015/05/data-analysis-with-spark/




Line+ game security system

https://www.imaso.co.kr/news/article_view.php?article_idx=20150519094003





Small demo

That's all for today. Next time I'll show you the real thing. Questions!
