Data Analysis @ Daum
Paul Kim ([email protected])
DevOn 2012

Page 1: Data Analysis @ Daum | Devon 2012


Page 2: Data Analysis @ Daum | Devon 2012
Page 3: Data Analysis @ Daum | Devon 2012
Page 4: Data Analysis @ Daum | Devon 2012

Search Quality

Page 5: Data Analysis @ Daum | Devon 2012

Search Quality = Satisfaction / Cost

Page 6: Data Analysis @ Daum | Devon 2012

How?

Page 8: Data Analysis @ Daum | Devon 2012

Understanding Users with Logs

BIG DATA!

Page 9: Data Analysis @ Daum | Devon 2012

Data Analysis Process with Hadoop

HADOOP -> FEATURES -> TOOLS (SAS, R, WEKA, etc.)

Cluster:
2 quad-cores, 8GB RAM, 4TB HDD x 60 nodes
4 quad-cores, 16GB RAM, 4TB HDD x 30 nodes

Page 10: Data Analysis @ Daum | Devon 2012

For example,

Page 11: Data Analysis @ Daum | Devon 2012

The secret to cooking delicious ramyeon (라면 맛있게 끓이는 비법)

Page 12: Data Analysis @ Daum | Devon 2012

Most Viewed Posts (많이 본 글)

Mission: Reflect satisfying search experiences in the ranking

Target Data: Half-year search logs (about 40TB)

Features:
Query - Collection Relationship
Query - Document - Session Relationship
Session - Query Relationship
Session - Document Relationship

GROUP-BY JOB (x 4)
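
Each of the relationship features on this slide amounts to grouping log records by a key and aggregating them. The following is a minimal sketch of one such group-by in map/reduce style, written in plain Python rather than as an actual Hadoop job. The pipe-free, tab-delimited log layout (session, query, document, clicked flag), the sample records, and the click-through feature are all assumptions made for illustration, not the schema or features used at Daum.

```python
from collections import defaultdict

# Hypothetical log lines: session_id \t query \t doc_id \t clicked (0 or 1).
SAMPLE_LOG = [
    "s1\tramyeon recipe\td1\t1",
    "s1\tramyeon recipe\td2\t0",
    "s2\tramyeon recipe\td1\t1",
    "s3\tramyeon recipe\td2\t1",
    "s3\tsea story\td9\t0",
]


def map_record(line):
    """Map step: emit ((query, doc_id), (impressions, clicks)) for one log line."""
    _session_id, query, doc_id, clicked = line.split("\t")
    return (query, doc_id), (1, int(clicked))


def group_by(lines):
    """Shuffle and reduce step: sum impressions and clicks per (query, doc_id)."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        key, (shown, clicks) = map_record(line)
        totals[key][0] += shown
        totals[key][1] += clicks
    return {key: tuple(value) for key, value in totals.items()}


def click_through_rates(totals):
    """Derive an illustrative query-document feature from the grouped counts."""
    return {key: clicks / shown for key, (shown, clicks) in totals.items()}


features = group_by(SAMPLE_LOG)
ctr = click_through_rates(features)
```

Changing the key returned by `map_record` (for example to the session and query) would give the other relationship features the same shape of job.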

Page 13: Data Analysis @ Daum | Devon 2012

Most Viewed Posts (많이 본 글)

Modeling: Linear Regression with SAS

Batch Process: HADOOP -> FEATURES -> MODEL -> ENGINE

LESS THAN 2 HOURS
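
The modeling step on this slide is a linear regression fitted in SAS. As a language-neutral illustration of what such a fit computes, here is a sketch of ordinary least squares with a single feature in plain Python. The feature values and target scores are made up, and a real ranking model would most likely use several features; this is only the one-variable case.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: one log-derived feature and a satisfaction score.
feature_values = [0.0, 1.0, 2.0, 3.0]
target_scores = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_simple_linear_regression(feature_values, target_scores)


def predict(x):
    """Score a new item with the fitted line."""
    return slope * x + intercept
```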

Page 14: Data Analysis @ Daum | Devon 2012

Bada Iyagi (바다 이야기, "Sea Story")

Page 15: Data Analysis @ Daum | Devon 2012

SEARCH SPAM INDEX

Mission: Understand the impact of spam on search users

Data:
Search Log: Text with Delimiter
Post-Filtered Documents: JSON Format
Operation-Deleted Documents: XML Format

Task: Query - Session - Doc. 1 - Doc. 2 - Doc. 3 - Doc. 4
Click? Type? (Ham, Spam, OP Del.)

OUTER JOIN
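
The task on this slide joins three sources in three formats so that every document shown for a query carries a click flag and a type. Below is a minimal sketch of that join in plain Python, done as a left outer join from the log side: documents with no match in either list are kept and labeled ham. The delimiter, the JSON keys, the XML tag and attribute names, and the rule that spam takes precedence over operator deletion are all assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical samples; the real field, key, and tag names are not given in the slides.
SEARCH_LOG = [
    "sea story|s1|d1,d2,d3,d4|d2",  # query|session|shown docs|clicked docs
]
FILTERED_JSON = '[{"doc_id": "d2", "reason": "spam"}]'
DELETED_XML = "<deleted><doc id='d3'/></deleted>"


def load_spam_ids(text):
    """Read the post-filtered (spam) document ids from JSON."""
    return {item["doc_id"] for item in json.loads(text)}


def load_deleted_ids(text):
    """Read the operator-deleted document ids from XML."""
    return {node.get("id") for node in ET.fromstring(text).iter("doc")}


def label_results(log_lines, spam_ids, deleted_ids):
    """Emit one row per shown document with its click flag and type."""
    rows = []
    for line in log_lines:
        query, session, shown, clicked = line.split("|")
        clicked_ids = set(filter(None, clicked.split(",")))
        for rank, doc_id in enumerate(shown.split(","), start=1):
            if doc_id in spam_ids:
                doc_type = "spam"
            elif doc_id in deleted_ids:
                doc_type = "op_del"
            else:
                doc_type = "ham"  # unmatched rows survive the outer join
            rows.append((query, session, rank, doc_id, doc_id in clicked_ids, doc_type))
    return rows


rows = label_results(
    SEARCH_LOG, load_spam_ids(FILTERED_JSON), load_deleted_ids(DELETED_XML)
)
```

Aggregating these rows by type and click flag would give counts of how often spam was shown and clicked.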

Page 16: Data Analysis @ Daum | Devon 2012

SEARCH SPAM INDEX: Result Sample

Page 17: Data Analysis @ Daum | Devon 2012

BLOG CLASSIFICATION

Page 18: Data Analysis @ Daum | Devon 2012

BLOG CLASSIFICATION

Mission: Clustering bad blogs through unsupervised learning

Data: 30 days of blog documents

Task: Blog - Document's Feature Analysis with Fixed Interval

Page 19: Data Analysis @ Daum | Devon 2012

BLOG CLASSIFICATION: Modeling

Kohonen's SOM (Self-Organizing Map) with R
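
The slides name Kohonen's SOM in R as the clustering model. To show the idea in a self-contained form, here is a sketch of a tiny one-dimensional self-organizing map in plain Python. The two-feature blog vectors, the map size, and the linear decay schedules for learning rate and neighborhood radius are assumptions chosen for illustration, not the configuration used in the talk.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Return the index of the unit whose weight vector is closest to the sample."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - s) ** 2 for w, s in zip(weights[i], sample)),
    )


def train_som(samples, n_units, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D self-organizing map and return its unit weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    if radius0 is None:
        radius0 = max(n_units / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighborhood shrinks, floor 0.5
        for sample in samples:
            bmu = best_matching_unit(weights, sample)
            for idx, unit in enumerate(weights):
                # Gaussian neighborhood: units near the winner move more.
                influence = math.exp(-((idx - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    unit[d] += lr * influence * (sample[d] - unit[d])
    return weights


# Hypothetical blog feature vectors, scaled to [0, 1].
SAMPLES = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]]

weights = train_som(SAMPLES, n_units=2)
clusters = [best_matching_unit(weights, s) for s in SAMPLES]
```

After training, blogs that map to the same unit form one cluster; deciding which clusters are "bad" still needs inspection, since the method is unsupervised.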

Page 20: Data Analysis @ Daum | Devon 2012

WHAT ELSE?

Topic Analysis with PLSA

Query Chain Filtering

Reprocessing with Hadoop

Page 21: Data Analysis @ Daum | Devon 2012

In Conclusion,

Page 22: Data Analysis @ Daum | Devon 2012

ADVANTAGE OF HADOOP

ADVANTAGE
Low analysis cost!
No more sampling!
Low operation cost!
Programming language independent
Various support tools

DISADVANTAGE
Conceptual change is needed.
Project under active development.
Version upgrade is not supported.

Page 23: Data Analysis @ Daum | Devon 2012

THANK YOU!