Web mining Tutorial: Entity search

2010

http://www.ipr-ctr.t.u-tokyo.ac.jp/utsearch

GET

Form input: POST

HTML, CSS, XPath

Web: LWP, Curl

IP

sleep

User agent: Mozilla/5.0 (Windows; U; Windows NT 5.1....
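The crawling points above (GET requests, a pause between requests, an explicit User-Agent header) can be sketched roughly as follows. This is a minimal illustration using Python's standard library rather than LWP or Curl; the `USER_AGENT` value, the function names, and the one-second default delay are assumptions made for the example, not values from the slides.

```python
import time
import urllib.request

# Hypothetical User-Agent string; the value on the slide is truncated.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"

def build_request(url, user_agent=USER_AGENT):
    """Return a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()

def crawl(urls, delay_seconds=1.0, fetcher=fetch, sleeper=time.sleep):
    """Fetch URLs one by one, sleeping between requests to limit server load."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleeper(delay_seconds)  # pause before every request except the first
        pages[url] = fetcher(url)
    return pages
```

Passing `fetcher` and `sleeper` as parameters keeps the loop testable without network access; a real crawl would simply call `crawl(list_of_urls)`.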

7,500

200,000

700

DB

Key-value: memcached, Cassandra, Tokyo Cabinet

n-gram
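As a rough illustration of n-gram indexing, the sketch below splits text into overlapping character n-grams and builds an inverted index from each n-gram to the documents that contain it. The choice of character bigrams, the function names, and the in-memory dictionary are assumptions for this example; the slides do not specify them.

```python
from collections import defaultdict

def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs, n=2):
    """Map each n-gram to the sorted list of document IDs containing it.

    docs is a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return {gram: sorted(ids) for gram, ids in index.items()}
```

For example, `build_index({1: "abc", 2: "bcd"})` maps the bigram `"bc"` to both documents, since it occurs in each.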

1 4 5 9 123 1 4 3 VB code
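Assuming "VB code" here means variable-byte coding of posting lists, the idea can be sketched as below: sorted document IDs are turned into gaps, and each gap is stored in as few bytes as possible. The first five numbers on the slide are reused as example document IDs, which is an interpretation rather than something the slide states; the function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Turn a sorted ID list into gaps between consecutive IDs."""
    return [cur - prev for prev, cur in zip([0] + doc_ids, doc_ids)]

def from_gaps(gaps):
    """Invert to_gaps by accumulating the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

def vb_encode_number(n):
    """Encode one non-negative integer using 7 payload bits per byte."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128  # high bit marks the last byte of this number
    return bytes(out)

def vb_encode(numbers):
    """Concatenate the variable-byte codes of all numbers."""
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    """Decode a variable-byte stream back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```

With the IDs `[1, 4, 5, 9, 123]`, the gaps are `[1, 3, 1, 4, 114]`, and each gap fits in a single byte.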


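Entity $e = \{w_1, w_2, \ldots, w_n\}$; for a query $q$, rank each entity $e$ by $p(e \mid q)$.

$p(e \mid q) = p(q \mid e)\, p(e) / p(q) \propto p(q \mid e)\, p(e)$

$p(q \mid e) = \prod_{w \in q} p(w \mid e)$

* $p(e)$: prior probability of the entity

Via documents $d$: $p(e \mid q) \propto \sum_{d} p(e \mid d)\, p(q \mid d)\, p(d)$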
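The document-based variant, which sums p(e|d) p(q|d) p(d) over documents d, can be sketched as follows. The data layout (one dictionary per document with keys `p_d`, `p_w`, and `p_e`), the function names, and the small `floor` probability for unseen terms are assumptions made for illustration only.

```python
from collections import defaultdict
from math import prod

def query_likelihood(query_terms, term_probs, floor=1e-9):
    """p(q|d) as the product of p(w|d) over query terms.

    Unseen terms get a small floor probability (an assumption of this sketch).
    """
    return prod(term_probs.get(w, floor) for w in query_terms)

def rank_entities(query_terms, docs):
    """Score each entity by the sum over d of p(e|d) * p(q|d) * p(d)."""
    scores = defaultdict(float)
    for doc in docs:
        p_q_d = query_likelihood(query_terms, doc["p_w"])
        for entity, p_e_d in doc["p_e"].items():
            scores[entity] += p_e_d * p_q_d * doc["p_d"]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

An entity associated with many documents that match the query accumulates score from each of them, which is the main difference from scoring a single merged entity profile.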


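Estimating $p(w \mid e)$

Maximum likelihood: $p(w \mid e) = \mathrm{tf}(w, e) / |e|$

Smoothing (idf-like effect): $p(w \mid e) = \lambda\, \mathrm{tf}(w, e)/|e| + (1 - \lambda)\, \mathrm{tf}(w, E)/|E|$, where $\lambda = |e| / (|e| + \mu)$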
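A minimal sketch of the smoothed estimate and the resulting query score: the entity's own term frequencies are interpolated with collection-wide frequencies, with a weight that grows with the entity's text length. The parameter name `mu`, its default of 100, and the function names are assumptions for this example; term frequencies are passed as plain dictionaries.

```python
import math

def smoothed_prob(word, entity_tf, collection_tf, mu=100.0):
    """Interpolate tf(w,e)/|e| with tf(w,E)/|E|, weight |e| / (|e| + mu).

    Assumes the collection is non-empty.
    """
    e_len = sum(entity_tf.values())
    c_len = sum(collection_tf.values())
    lam = e_len / (e_len + mu)
    p_entity = entity_tf.get(word, 0) / e_len if e_len else 0.0
    p_collection = collection_tf.get(word, 0) / c_len
    return lam * p_entity + (1 - lam) * p_collection

def log_score(query_terms, entity_tf, collection_tf, prior=1.0, mu=100.0):
    """Return log p(e) plus the sum of log p(w|e) over the query terms."""
    score = math.log(prior)
    for w in query_terms:
        p = smoothed_prob(w, entity_tf, collection_tf, mu)
        if p == 0.0:
            return float("-inf")  # term unseen even in the whole collection
        score += math.log(p)
    return score
```

Working in log space avoids numerical underflow when a query has many terms; the ranking is unchanged because the logarithm is monotonic.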


Topics $Z = \{z_1, z_2, \ldots, z_t\}$; each word $w$ is generated via a topic in $Z$.

PLSI (Probabilistic Latent Semantic Indexing): $p(w \mid e) = \sum_{z \in Z} p(w \mid z)\, p(z \mid e)$

LDA (Latent Dirichlet Allocation): $p(w \mid e, \theta, \phi) = \sum_{z \in Z} p(w \mid z, \phi)\, p(z \mid e, \theta)$
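Both topic models express the word probability for an entity as a mixture over topics. The sketch below only shows that final mixing step, assuming the two distributions have already been estimated by some PLSI or LDA implementation; the nested-dictionary layout, the function name, and the toy numbers in the usage note are illustrative assumptions.

```python
def topic_word_prob(word, entity, p_w_given_z, p_z_given_e):
    """Return the sum over topics z of p(w|z) * p(z|e).

    p_w_given_z: {topic: {word: probability}}
    p_z_given_e: {entity: {topic: probability}}
    """
    return sum(
        p_z * p_w_given_z[z].get(word, 0.0)
        for z, p_z in p_z_given_e[entity].items()
    )
```

For instance, if an entity has topic weights 0.8 and 0.2 and a word has probability 0.3 under the first topic and 0 under the second, the mixture gives 0.8 × 0.3 = 0.24.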

LDA

7,500

200,000

50

90.42, 89.74, 77.46, 58.38, 55.72, 46.76, 43.5, cDNA 42.48, 41.54

115.2, 98.14, 65.8, 57.9, 50.38, 48.44, 41.84, 40.44, 39.38, 38.64

69.24, 65.92, 59.66, 42.32, 34.26, 32.7, 32.6, 31.96, 29.86, 29.12

69.02, 68.68, 66.54, 62.8, 50.94, 50.78, 43.86, 39.42, 38.36, 34.8

50.82, 48.1, 47.34, 43.56, 38.98, STM 38.06, 34.02, 32.16, 31.38, 30.84, 29.7

34.0, 29.9, 29.82, 24.8, 21.98, 21.0, 21.0, X 20.24, 17.92, 17.6

ES

62.26, 60.08, 53.86, ES 46.08, 44.94, 39.48, 39.12, 37.94, 35.32, 34.48

MEMS 28.08, 25.56, 25.44, 23.8, CVD 22.86, 17.16, 16.78, 15.2, 14.18

47.84, 39.4, 20.12, CAD 16.98, 16.94, 14.8, 14.16, 13.86, 13.14, 13.08, 11.76, 9.98

38.42, 38.26, 29.36, 27.68, 26.06, 23.78, 21.96, 21.38, 20.98, 19.92

39.88, 37.2, 31.76, 31.02, 30.6, 27.24, 26.88, 26.86, 25.94, 23.5

36.96, 30.5, 26.58, 25.48, 20.0, 19.92, 19.7, 19.36, 17.48, 16.86

36.74, QOL 19.68, 11.52, 9.44, 8.98, 8.5, 8.0, 7.98, 7.12, 6.64, 6.28

17.62, 16.82, 16.7, 15.22, 14.48, 14.16, 13.86, 13.84, 13.72, 13.72, 12.96, 12.94, 11.96, 11.44

Arnetminer

Academic Search

UMASS Rexa

(METI) (NEDO)


HITS

PageRank

Prior $p(e)$: $PR_t = \alpha A\, PR_{t-1} + (1 - \alpha)/|E|$

Query-dependent $p(e \mid q)$: $PR_t = \alpha A\, PR_{t-1} + (1 - \alpha)\, r(q)$
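The two update rules differ only in the jump term: a uniform vector gives a query-independent prior, while a query-specific vector r(q) biases the ranking toward entities relevant to the query. A minimal power-iteration sketch is shown below; the adjacency-list input, the parameter name `alpha`, its default of 0.85, the fixed iteration count, and the handling of nodes without outgoing links are all assumptions of this example.

```python
def pagerank(links, teleport=None, alpha=0.85, iterations=50):
    """Iterate PR_t = alpha * A * PR_{t-1} + (1 - alpha) * teleport.

    links:    {node: [nodes it links to]}
    teleport: {node: probability}; None means uniform 1/|E|,
              a query-specific vector plays the role of r(q).
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    if teleport is None:
        teleport = {v: 1.0 / len(nodes) for v in nodes}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1 - alpha) * teleport.get(v, 0.0) for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = alpha * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # No outgoing links: spread this node's mass by the teleport vector.
                for v in nodes:
                    new_rank[v] += alpha * rank[u] * teleport.get(v, 0.0)
        rank = new_rank
    return rank
```

Calling `pagerank(links)` yields the query-independent scores, while `pagerank(links, teleport=r_q)` with a normalized relevance vector `r_q` yields the query-biased ones.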

PageRank

KAKEN

tf/tfidf/,