Similarity Neighbors Clusters · kseworkshop.kaist.ac.kr/2014/material/2014KSE-3.pdf · 2014. 2. 28.

Similarity Neighbors Clusters

KAIST

Department of Knowledge Service Engineering

이의진

[email protected]

February 27, 2014

Different sorts of business tasks involve reasoning from similar examples:

• Retrieving similar things directly
  • IBM wants to find companies that are similar to their best business customers, in order to have the sales staff look at them as prospects.
  • Hewlett-Packard maintains many high performance servers for clients; this maintenance is aided by a tool that, given a server configuration, retrieves information on other similarly configured servers.
  • Advertisers often want to serve online ads to consumers who are similar to their current good customers.

• Doing classification and regression

• Grouping similar items together into clusters
  • Does our customer base contain groups of similar customers, and what do these groups have in common? (unsupervised segmentation)

• Providing recommendations of similar products or from similar people
  • Whenever you see statements like “People who like X also like Y” or “Customers with your browsing history have also looked at …”, similarity is being applied.

• Other fields: medicine and law
  • A doctor may reason about a new difficult case by recalling a similar case (either treated personally or documented in a journal) and its diagnosis.
  • A lawyer often argues cases by citing legal precedents, which are similar historical cases whose dispositions were previously judged and entered into the legal casebook.

Contents (Chapter 6)

1. Similarity & distance

2. Nearest-neighbor reasoning

3. Clustering
  • Hierarchical clustering
  • K-means clustering
  • Density-based clustering

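Similarity and distance

• Credit card company: customer information
  • Age, years at current address, residential status, etc.

• How should we measure the similarity between two people?

Euclidean distance

• Two people: A, B
  • Euclidean distance when there are n attributes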
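
For two people A = (a1, ..., an) and B = (b1, ..., bn), the Euclidean distance is d(A, B) = sqrt((a1 - b1)^2 + ... + (an - bn)^2). The following is a minimal Python sketch of that computation. The function name and the attribute values for the two customers are made up for illustration and do not come from the slides.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric attribute vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: (age, years at current address, residential status code)
person_a = (23, 2, 2)
person_b = (40, 10, 1)

print(euclidean_distance(person_a, person_b))  # sqrt(17^2 + 8^2 + 1^2) = sqrt(354)
```

Note that attributes measured on larger scales (age, in this made-up example) contribute most to the result, so attributes are often rescaled before distances are compared.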

Whiskey Analytics

Jackson’s five features

• Color
• Nose
• Body
• Palate
• Finish

How do we find similar whiskies?

Nearest neighbors for predictive modeling

Will David respond or not?


[Figure: customers labeled by respond (yes / no), with David as the new case]


• Judging based on nearest neighbors…
  • Majority vote => Yes, or
  • P(Respond=Yes) = 2/3

[Figure: nearest neighbors labeled by respond (yes / no)]
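
The two ways of judging from nearest neighbors, majority vote and a probability estimate, can be sketched in Python as below. This is only an illustration: the coordinates, the choice of Euclidean distance, and the function names are assumptions, chosen so that two of David's three nearest neighbors responded "yes" as in the slide.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training examples closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda example: math.dist(example[0], query))[:k]

def knn_predict(train, query, k):
    """Majority-vote label among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_probability(train, query, k, label="yes"):
    """Fraction of the k nearest neighbors that carry the given label."""
    neighbors = k_nearest(train, query, k)
    return sum(1 for _, lab in neighbors if lab == label) / k

# Hypothetical customers: ((attribute 1, attribute 2), responded?)
train = [
    ((4.0, 5.0), "yes"), ((6.0, 6.0), "no"), ((5.0, 7.0), "yes"),
    ((1.0, 1.0), "no"), ((9.0, 9.0), "no"), ((0.0, 8.0), "yes"),
]
david = (5.0, 5.0)

print(knn_predict(train, david, k=3))      # yes
print(knn_probability(train, david, k=3))  # 2 of the 3 nearest neighbors said yes
```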

Nearest neighbors: How many neighbors?

What do the classification regions look like when only one neighbor is used for classification?

[Figure: classification regions for respond (yes / no)]

Looks like noise => overfitting!


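• Finding a good k for the k-NN method: cross-validation
  • Increase k from 1 up to a reasonable limit and choose the k that gives the best recall, precision, F-measure, or similar metric.

k-nearest neighbors (k-NN)

• Core idea: predict the outcome of a new case based on past cases

• k-NN: predict the outcome using the k most similar past cases

• Choosing k?
  • With majority voting between two classes, choose an odd k to avoid ties
  • Too small a k causes overfitting (e.g., k=1)
  • Use cross-validation to find a suitable k from the given data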
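
A minimal sketch of choosing k by cross-validation follows. It makes several assumptions that are not in the slides: it uses leave-one-out cross-validation, it scores each k by plain accuracy rather than recall, precision, or F-measure, and the data set (two groups plus one noisy "yes" point inside the "no" group) is invented.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority-vote label among the k nearest neighbors (Euclidean distance)."""
    neighbors = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the other points."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

def best_k(data, candidates):
    """Candidate k with the highest accuracy; the smaller k wins a tie."""
    return max(candidates, key=lambda k: (loo_accuracy(data, k), -k))

# Hypothetical data: a "no" group, a "yes" group, and one noisy "yes" point.
data = [
    ((1.0, 1.0), "no"), ((1.0, 2.0), "no"), ((2.0, 1.0), "no"), ((2.0, 2.0), "no"),
    ((6.0, 6.0), "yes"), ((6.0, 7.0), "yes"), ((7.0, 6.0), "yes"), ((7.0, 7.0), "yes"),
    ((1.5, 1.5), "yes"),  # noise inside the "no" group
]

for k in (1, 3, 5):
    print(k, loo_accuracy(data, k))
print("best k:", best_k(data, [1, 3, 5]))
```

With k=1 every "no" point is misled by the single noisy neighbor, which is the overfitting effect described above; with k=3 the noise is outvoted.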

Whiskey Analytics Revisited

Evaluation of 109 Scotch Whiskies

Jackson’s five features

• Color
• Nose
• Body
• Palate
• Finish

Similarity

Similarity – Jaccard distance

• X := Bunnahabhain’s Body = {firm, medium, light}

• Y := Jack Daniel’s Body = {firm, medium}

• X∩Y = {firm, medium}, so |X∩Y| = 2

• X∪Y = {firm, medium, light}, so |X∪Y| = 3

dJaccard(X, Y) = 1 – |X∩Y| / |X∪Y| = 1 – 2/3 = 1/3
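
The same calculation as a short Python sketch, using exact fractions so the result matches the slide. The function name is hypothetical, and returning 0 for two empty sets is a convention assumed here, not something the slide states.

```python
from fractions import Fraction

def jaccard_distance(x, y):
    """1 - |X ∩ Y| / |X ∪ Y| for two sets (0 if both sets are empty)."""
    union = x | y
    if not union:
        return Fraction(0)
    return 1 - Fraction(len(x & y), len(union))

bunnahabhain_body = {"firm", "medium", "light"}
jack_daniels_body = {"firm", "medium"}

print(jaccard_distance(bunnahabhain_body, jack_daniels_body))  # 1/3
```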

Similarity – Jaccard distance

Per-feature Jaccard distances (five features): 0, 0, 1/3, 2/3, 0

Average distance(Bunnahabhain, Jack Daniel’s) = (0 + 0 + 1/3 + 2/3 + 0)/5 = 1/5

(Assumption: each feature is given equal weight)
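
The averaging step can be sketched as follows. The optional `weights` parameter is a hypothetical extension that shows where unequal feature weights would enter; with the default it reproduces the equal-weight average from the slide.

```python
from fractions import Fraction

def average_distance(distances, weights=None):
    """Weighted mean of per-feature distances; equal weights by default."""
    if weights is None:
        weights = [1] * len(distances)
    total = sum(w * d for w, d in zip(weights, distances))
    return total / sum(weights)

# Per-feature Jaccard distances from the slide.
per_feature = [Fraction(0), Fraction(0), Fraction(1, 3), Fraction(2, 3), Fraction(0)]

print(average_distance(per_feature))  # 1/5
```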

Hierarchical Clustering

• Method:
  • Start: each object is its own cluster (atomic cluster)
  • Repeat: merge the two closest clusters
  • End: a single cluster remains

The process is displayed as a dendrogram
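
The start / repeat / end procedure above can be sketched in Python as below. This is an illustrative sketch, not the deck's implementation: the points are invented, the cluster distance defaults to the smallest pairwise distance (an assumption made here to keep the example short), and the returned merge history is what a dendrogram would draw, one merge per level.

```python
import math
from itertools import combinations

def closest_pair_distance(c1, c2):
    """Smallest distance between a point in c1 and a point in c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, cluster_distance=closest_pair_distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    history = []
    while len(clusters) > 1:
        # repeat: find the pair of clusters with the smallest distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        distance = cluster_distance(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history  # end: a single cluster remains

# Hypothetical one-dimensional objects.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for left, right, distance in agglomerate(points):
    print(left, "+", right, "at distance", distance)
```

Cutting the merge history at a chosen distance yields the clusters that exist at that level of the dendrogram.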

[Figure: Dendrogram tree representation; axes: Object, Distance]

Cluster Distance Measures

[Figure: single link (min), complete link (max), average]

• Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}

• Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}

• Average: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
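
The three measures differ only in how they summarize the pairwise distances d(xip, xjq). A small Python sketch, with hypothetical function names, made-up clusters, and Euclidean distance assumed as d:

```python
import math
from itertools import product

def pairwise_distances(ci, cj):
    """All distances d(x_ip, x_jq) between elements of Ci and elements of Cj."""
    return [math.dist(p, q) for p, q in product(ci, cj)]

def single_link(ci, cj):
    return min(pairwise_distances(ci, cj))

def complete_link(ci, cj):
    return max(pairwise_distances(ci, cj))

def average_link(ci, cj):
    distances = pairwise_distances(ci, cj)
    return sum(distances) / len(distances)

# Hypothetical clusters on a line: pairwise distances are 4, 6, 3, 5.
ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(4.0, 0.0), (6.0, 0.0)]

print(single_link(ci, cj))    # 3.0
print(complete_link(ci, cj))  # 6.0
print(average_link(ci, cj))   # 4.5
```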

Dendrogram: hierarchical classification of single-malt Scotch whiskies

109 Scotch Whiskies

K-means clustering

• Initial group centroids:
  • Place K points into the space represented by the objects that are being clustered.

1. Assign each object to the group that has the closest centroid.

2. When all objects have been assigned, recalculate the positions of the K centroids.

• Repeat Steps 1 and 2 until the centroids no longer move.

Here we have a dataset!

We randomly choose 2 group centroids!

We assign each point to the group that has the closest centroid.

We recalculate the positions of the centroids.

We assign each point to the group that has the closest centroid.

We recalculate the positions of the centroids.

No matter how many more times the algorithm runs, the centroids will no longer move.

So the clustering is finished!
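
The walkthrough above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data points are invented, the `initial` and `max_iter` parameters are hypothetical conveniences (explicit starting centroids keep the example reproducible, and the iteration cap is a safeguard), and a group that ends up empty simply keeps its previous centroid.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, initial=None, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    # Initial centroids: given explicitly, or k randomly chosen data points.
    centroids = list(initial) if initial is not None else random.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 1: assign each object to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Step 2: recalculate centroids (an empty group keeps its old centroid).
        new_centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
        if new_centroids == centroids:
            break  # the centroids no longer move
        centroids = new_centroids
    return centroids, groups

# Hypothetical data set with two visible groups.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0),
]
centroids, groups = k_means(points, k=2, initial=[(1.0, 1.0), (9.0, 9.0)])
print(centroids)  # [(1.5, 1.5), (8.5, 8.5)]
```

The result depends on the starting centroids; with random starts, different runs can settle on different groupings, so running the algorithm several times is common.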

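Reference: Density-based clustering

• A method that groups high-density regions into clusters:
  • Density expansion := at least the minimum number of points (minPts) lie within the radius (eps)

[Figure panels: Original Points; Point types: core, border and outliers; Clustered Points]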
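
The core / border / outlier distinction can be sketched as below. The slide does not name a specific algorithm; this sketch assumes DBSCAN-style definitions (a core point has at least minPts points, itself included, within eps; a border point is not core but lies within eps of a core point; everything else is an outlier). The function name and sample points are hypothetical, and the sketch only labels point types, it does not form the clusters.

```python
import math

def point_types(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier' (DBSCAN-style definitions)."""
    def neighbors(p):
        # Points within radius eps of p; p itself is included in the count.
        return [q for q in points if math.dist(p, q) <= eps]

    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"
        else:
            labels[p] = "outlier"
    return labels

# Hypothetical points: a dense square, one point on its edge, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
labels = point_types(points, eps=1.1, min_pts=3)
for p in points:
    print(p, labels[p])
```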

Quiz

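1. The kNN method uses k similar cases.

2. In the kNN method, any integer is fine as the value of k.

3. kNN can determine a value only by majority vote.

4. With kNN, there is no overfitting problem when k is small.

5. With kNN, a good value of k can be found from the given samples.

6. Hierarchical clustering is a top-down approach.

7. Hierarchical clustering does not allow the number of clusters to be adjusted.

8. In hierarchical clustering, only the minimum distance can be used when merging two clusters.

9. The hierarchical clustering process can be displayed as a dendrogram.

10. The y-axis of the dendrogram produced by hierarchical clustering is the number of clusters.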

Quiz

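11. Hierarchical clustering and k-means clustering are unsupervised learning methods.

12. Hierarchical clustering uses centroids.

13. In k-means clustering, the value of k does not have to be specified before clustering starts; it is found automatically.

14. In k-means clustering, the initial centroid positions must not be assigned arbitrarily.

15. In k-means clustering, each point is mapped to the nearest of the k centroids.

16. In k-means clustering, when each point is mapped to the nearest of the k centroids and the distances are equal, the point may be assigned to any of them.

17. In k-means clustering, two steps are repeated: 1) assign each point to a centroid and 2) recompute the centroids.

18. In k-means clustering, there are cases where this repetition never ends and the centroid positions cannot be found.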

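Summary

1. Similarity & Distance
  • Euclidean distance; Jaccard distance (similarity computed over sets)

2. Nearest-Neighbor Reasoning
  • k-NN: predict the outcome from the k most similar past cases (e.g., majority vote among the k)

3. Clustering

• Hierarchical clustering
  • Start: every object is its own cluster
  • Repeat: merge clusters repeatedly
  • End: a single cluster is produced

• K-means clustering
  • Start: k centroids => each object is assigned to the nearest of the k centroids
  • Repeat: recompute the k centroids from the assigned objects => reassign
  • End: the k centroids no longer move

• Density-based clustering
  • Clustering by density expansion (minPts/eps neighborhoods)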
