by yi00da Something About Search

Page 1: Something about search

by yi00da

Something About Search

Page 2: Something about search

Search Analysis

Exploratory search analysis
Search evaluation metrics
BM25
Click Model
Learning To Rank

Page 3: Something about search

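Search keywords longer than 40 characters were filtered out.

Tables 1-1 and 1-2 show that the composition of search keywords is about the same on PC and on mobile.

Table 1-1: Mean search keyword length

Platform   Mean keyword length
android    7.44
ios        7.62
pc         8.47

Table 1-2: Distinct terms per query

Platform   Distinct terms per query   1-term share   2-term share   3-term share
android    4.25                       7%             13%            22%
ios        4.07                       8%             14%            23%
pc         4.12                       11%            12%            22%

Table 1-3: Search volume and unique keywords

Platform   Keyword searches   Unique keywords   Share
android    60925778           3612131           5.9%
ios        19063979           1499656           7.9%
pc         24689234           2356621           9.5%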

Page 4: Something about search

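The tables below show that a user session lasts about 2 hours on pc and android and about 1 hour on ios; each session contains roughly 2-3 queries.

The industry usually treats half an hour as one search session. Sessions are typically used for query transformation, document ranking, and user satisfaction prediction.

Table 2-1: Mean session duration

Platform   Mean session duration (minutes)
android    108
ios        63
pc         129

Table 2-2: Queries per session

Platform   Mean queries per session   Mean queries per session (sessions capped at 100 queries)
android    3.015                      3.005
ios        2.334                      2.325
pc         3.425                      3.362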
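
The half-hour session convention mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming a session ends after 30 minutes of inactivity; the function name `split_sessions` and the sample timestamps are hypothetical and not taken from the deck.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's query timestamps into sessions by inactivity gap."""
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session if the gap to the previous query is small.
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical query times for a single user.
queries = [
    datetime(2016, 9, 1, 20, 0),
    datetime(2016, 9, 1, 20, 10),
    datetime(2016, 9, 1, 21, 0),
    datetime(2016, 9, 1, 21, 5),
]
sessions = split_sessions(queries)
queries_per_session = [len(s) for s in sessions]
```

The mean of `queries_per_session` across users would correspond to the kind of statistic reported in Table 2-2, under the stated timeout assumption.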

Page 5: Something about search

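Table 3-1 shows that the mean click position in the search results is 7 and the mean position of the first click is 4, which suggests that the ranking still has room for improvement (the earlier the click position, the better).

Table 3-2 shows that only about 55% of queries are followed by a play, which suggests that the presentation of search results needs improvement (no third-party competitor data is available for comparison).

Table 3-1: Click positions

Platform   Mean click position   Mean first-click position (per user, per sid, per distinct search keyword)
pc         7.08                  4.08

Table 3-2: Plays and downloads after search

Platform   Unique queries with a play   Total unique queries   Play share   Unique queries with a download   Download share
android    1969604                      3612131                55%          727141                           20%
ios        756288                       1499656                50%          -                                -
pc         1280618                      2356621                54%          317625                           13% (inaccurate)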

Page 6: Something about search

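Figure: android top-30 search keywords (bar chart of search counts, axis 0 to 1,200,000). Keywords legible in the chart: 中国新歌声, 微微一笑很倾城, TFBOYS, 逆流成河, 张杰, 鹿晗, 儿童歌曲 (children's songs), 旋风少女 2, tfboys, 儿歌 (nursery rhymes), 蒙面唱将猜猜猜, 郑源, 冷漠, 刘德华, 大王叫我来巡山.

Figure: PC top-30 search keywords (bar chart of search counts, axis 0 to 180,000). Keywords legible in the chart: 薛之谦, 周杰伦, 逆流成河, 小幸运, 歌在飞, 微微一笑很倾城, 没有你陪伴真的好孤单, 陈奕迅, 张学友, 张杰, 丑八怪, tfboys, 张信哲, 告白气球, 汪峰.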

The charts above show that mobile users and PC users search differently.

Page 7: Something about search

Figure 4-1 shows that searches are concentrated around 8-9 pm. Table 4-2 shows that the top 100,000 keywords account for about 85% of search volume, so the distribution has a pronounced long tail.

Table 4-2: Share of search volume covered by the top-N keywords

Platform   top10   top100   top1000   top10000   top100000
pc         3.9%    14.0%    36.5%     66.2%      84.2%
android    6.1%    17.7%    41.5%     70.4%      86.2%
ios        6.5%    18.9%    43.3%     71.8%      87.5%

Figure 4-1: Share of searches by hour of day (x axis: hour 0 to 23; y axis: share of searches, 0.0% to 9.0%). Peak: hour 20, 8.4%.
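
A top-N coverage figure like those in Table 4-2 can be computed from a raw query log roughly as follows. This is a sketch only; the helper name `top_k_share` and the toy log are illustrative assumptions, not the author's pipeline.

```python
from collections import Counter


def top_k_share(query_log, k):
    """Fraction of all searches accounted for by the k most frequent keywords."""
    counts = Counter(query_log)
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(k))
    return top / total


# Toy log: keyword "a" searched 6 times, "b" 3 times, "c" once.
toy_log = ["a"] * 6 + ["b"] * 3 + ["c"]
share_top1 = top_k_share(toy_log, 1)
share_top2 = top_k_share(toy_log, 2)
```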

Page 8: Something about search

Do users scan documents from top to bottom?

1. The click-through rate (CTR) of the first document is about 0.45, while the CTR of the tenth document is well below 0.05.
2. The document below a click is viewed roughly 50% of the time.

Page 9: Something about search

Search habit

Appendix


Page 11: Something about search

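MAP: Mean Average Precision

Example: suppose there are two topics. Topic 1 has 4 relevant pages and topic 2 has 5 relevant pages. For topic 1, a system retrieves 4 relevant pages at ranks 1, 2, 4, 7; for topic 2, it retrieves 3 relevant pages at ranks 1, 3, 5. The average precision for topic 1 is (1/1 + 2/2 + 3/4 + 4/7)/4 = 0.83. The average precision for topic 2 is (1/1 + 2/3 + 3/5 + 0 + 0)/5 = 0.45. Therefore MAP = (0.83 + 0.45)/2 = 0.64.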
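
The worked example can be checked with a few lines of Python. This is a minimal sketch; the function name `average_precision` is an illustrative choice, and relevant pages that are never retrieved contribute 0 to the sum, as in the example.

```python
def average_precision(relevant_ranks, n_relevant):
    """Average precision given the ranks of retrieved relevant pages.

    n_relevant is the total number of relevant pages for the topic,
    including those the system failed to retrieve.
    """
    ranks = sorted(relevant_ranks)
    # The i-th relevant hit (1-based) at rank r contributes precision i / r.
    return sum((i + 1) / r for i, r in enumerate(ranks)) / n_relevant


ap_topic1 = average_precision([1, 2, 4, 7], n_relevant=4)
ap_topic2 = average_precision([1, 3, 5], n_relevant=5)
map_score = (ap_topic1 + ap_topic2) / 2
```

Rounded to two decimals, these reproduce the 0.83, 0.45 and 0.64 of the example.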

Page 12: Something about search

NDCG: Normalized Discounted Cumulative Gain
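
The slide's formula is not preserved in this transcript, so the following is only a sketch of one common NDCG definition, assuming a linear gain with a log2 rank discount. Another widely used variant replaces the gain with 2^rel - 1; which one the author used is not known. The function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank i (1-based) is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    gains = list(gains)
    k = len(gains) if k is None else k
    ideal_dcg = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg([3, 2, 1])   # already in ideal order
reversed_ = ndcg([1, 2, 3])  # worst order for these grades
```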


Page 14: Something about search

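BM25

The BM25 algorithm is commonly used to score search relevance. Its main idea in one sentence: parse the Query into terms q_i; then, for each search result D, compute the relevance score of each term q_i with respect to D; finally, take a weighted sum of these per-term scores to obtain the relevance score between the Query and D.

In general no relevance information is available, i.e. r and R are both 0, and a term rarely occurs more than once in a query, so qf_i = 1. Under these conditions the standard BM25 score reduces to:

Score(Q, D) = \sum_i \log\frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1) f_i}{K + f_i}, \qquad K = k_1 \left(1 - b + b \cdot \frac{dl}{avgdl}\right)

where N is the number of documents, n_i is the number of documents containing q_i, f_i is the frequency of q_i in D, dl is the length of D, and avgdl is the average document length.

The parameter b adjusts how strongly document length affects relevance. The larger b is, the larger the effect of document length on the relevance score, and vice versa.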
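
As a rough illustration of the scoring described above, the sketch below implements the standard BM25 form for the case r = R = 0 and qf_i = 1. The defaults k1 = 1.2 and b = 0.75 are commonly quoted values and are an assumption here, as are the function name and the toy documents.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in docs against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization; b controls how much document length matters.
        K = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for q in set(query_terms):
            if tf[q] == 0:
                continue
            # This IDF form can go negative for terms in more than half the documents.
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (tf[q] + K)
        scores.append(score)
    return scores


# Toy corpus: "song" appears once in a short and once in a long document.
docs = [
    ["balloon", "song"],
    ["lucky", "song", "remix", "live"],
    ["rock", "live"],
    ["piano", "solo"],
    ["jazz", "night"],
]
with_length_norm = bm25_scores(["song"], docs)
without_length_norm = bm25_scores(["song"], docs, b=0.0)
```

With b = 0.75 the shorter document scores higher for the same term frequency; with b = 0 document length has no effect and the two scores coincide.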

Page 15: Something about search

BM25 with title

As can be seen, BM25 does not work well on song titles.

Appendix


Page 17: Something about search

Click Model

Random Click Model (RCM)
Click-through Rate Models (CTR): Rank-based CTR Model (RCTR), Document-based CTR Model (DCTR)
Position-based Model (PBM)
User Browsing Model (UBM)
Cascade Model (CM)
Dependent Click Model (DCM)
Click Chain Model (CCM)
Dynamic Bayesian Network Model (DBN)
Simplified DBN Model (SDBN)

Page 18: Something about search

Baseline models

1. Random Click Model (RCM): any document can be clicked with the same (fixed) probability.

2. Rank-based CTR Model (RCTR): the click probability depends on the rank of the document.

3. Document-based CTR Model (DCTR): estimates a click-through rate for each query-document pair. It is subject to overfitting because some documents and/or queries were not previously encountered in the click log.
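
The three baselines can all be estimated by simple counting. The sketch below assumes a hypothetical session format (a dict with `query`, `docs`, and a parallel 0/1 `clicks` list); this format is an assumption for illustration and not the author's log schema.

```python
from collections import defaultdict


def fit_baselines(sessions):
    """Return (RCM probability, RCTR per rank, DCTR per query-document pair)."""
    total_clicks = total_shown = 0
    rank_stats = defaultdict(lambda: [0, 0])  # rank -> [clicks, impressions]
    pair_stats = defaultdict(lambda: [0, 0])  # (query, doc) -> [clicks, impressions]
    for s in sessions:
        for rank, (doc, click) in enumerate(zip(s["docs"], s["clicks"]), start=1):
            total_clicks += click
            total_shown += 1
            rank_stats[rank][0] += click
            rank_stats[rank][1] += 1
            pair_stats[(s["query"], doc)][0] += click
            pair_stats[(s["query"], doc)][1] += 1
    rcm = total_clicks / total_shown
    rctr = {rank: c / n for rank, (c, n) in rank_stats.items()}
    dctr = {pair: c / n for pair, (c, n) in pair_stats.items()}
    return rcm, rctr, dctr


toy_sessions = [
    {"query": "a", "docs": ["d1", "d2"], "clicks": [1, 0]},
    {"query": "a", "docs": ["d1", "d2"], "clicks": [1, 1]},
    {"query": "a", "docs": ["d2", "d1"], "clicks": [0, 0]},
]
rcm, rctr, dctr = fit_baselines(toy_sessions)
```

Any query-document pair missing from the log has no DCTR entry at all, which is the overfitting and coverage problem noted for DCTR.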

Page 19: Something about search

Position-Based Model (PBM)

A document is clicked if and only if the user examines it and is attracted by it.

Examination hypothesis: the probability of a user examining a document depends heavily on its rank or position. PBM introduces a set of examination parameters γ, one for each rank. PBM does not depend on the events at previous ranks.
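
Under the examination hypothesis, the click probability factorizes as P(click) = γ_rank × α_(query, doc). Since examination and attractiveness are not observed separately, PBM is usually fitted with EM. The following is a compact sketch of that procedure under an assumed session format (`query`, `docs`, 0/1 `clicks`); the initial value 0.5, the clamping constant, and the names are illustrative choices.

```python
from collections import defaultdict

EPS = 1e-6  # keeps parameters away from 0 and 1 to avoid division by zero


def clamp(p):
    return min(max(p, EPS), 1 - EPS)


def pbm_em(sessions, n_ranks, iterations=20):
    """Estimate attractiveness alpha[(query, doc)] and examination gamma[rank] by EM."""
    gamma = [0.5] * n_ranks
    alpha = defaultdict(lambda: 0.5)
    for _ in range(iterations):
        a_num, a_den = defaultdict(float), defaultdict(int)
        g_num, g_den = [0.0] * n_ranks, [0] * n_ranks
        for s in sessions:
            for r, (doc, c) in enumerate(zip(s["docs"], s["clicks"])):
                key = (s["query"], doc)
                a, g = alpha[key], gamma[r]
                if c:
                    # A click implies the document was both examined and attractive.
                    p_attr = p_exam = 1.0
                else:
                    # Posterior of each hidden variable given that no click occurred.
                    p_attr = a * (1 - g) / (1 - a * g)
                    p_exam = g * (1 - a) / (1 - a * g)
                a_num[key] += p_attr
                a_den[key] += 1
                g_num[r] += p_exam
                g_den[r] += 1
        for key in a_num:
            alpha[key] = clamp(a_num[key] / a_den[key])
        gamma = [clamp(g_num[r] / g_den[r]) if g_den[r] else gamma[r]
                 for r in range(n_ranks)]
    return dict(alpha), gamma


toy_sessions = [
    {"query": "q", "docs": ["x", "y"], "clicks": [1, 0]},
    {"query": "q", "docs": ["y", "x"], "clicks": [1, 0]},
    {"query": "q", "docs": ["x", "y"], "clicks": [1, 1]},
    {"query": "q", "docs": ["y", "x"], "clicks": [0, 0]},
]
alpha, gamma = pbm_em(toy_sessions, n_ranks=2)
```

In this toy log both documents are clicked equally often overall, but the first position collects most clicks, so the fitted examination parameter for the first rank ends up larger than for the second.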

Page 20: Something about search

Cascade Model (CM)

Steps:
1. Start from the first document
2. Examine documents one by one
3. If click, then stop
4. Otherwise, continue

Page 21: Something about search

Cascade Model (CM)

In particular:
1. CM does not allow sessions with more than one click
2. CM cannot explain non-linear examination patterns
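
Because the cascade steps make examination fully determined by the first click, attractiveness can be estimated by counting. The sketch below assumes a hypothetical session format (`query`, `docs`, 0/1 `clicks`); for sessions with several clicks, which CM cannot represent, it simply stops at the first click, which is a simplification chosen here.

```python
from collections import defaultdict


def fit_cascade(sessions):
    """Attractiveness per (query, doc) = clicks / examinations under the cascade model."""
    clicks = defaultdict(int)
    exams = defaultdict(int)
    for s in sessions:
        for doc, c in zip(s["docs"], s["clicks"]):
            key = (s["query"], doc)
            exams[key] += 1  # every document down to the first click is examined
            if c:
                clicks[key] += 1
                break  # the user stops after the first click
    return {key: clicks[key] / exams[key] for key in exams}


toy_sessions = [
    {"query": "a", "docs": ["d1", "d2", "d3"], "clicks": [0, 1, 0]},
    {"query": "a", "docs": ["d1", "d2", "d3"], "clicks": [1, 0, 0]},
    {"query": "a", "docs": ["d1", "d2", "d3"], "clicks": [0, 0, 0]},
]
attractiveness = fit_cascade(toy_sessions)
```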

Page 22: Something about search

So far:

1. CTR models
   + count clicks (simple and fast)
   - do not distinguish examination and attractiveness

2. Position-based model (PBM)
   + examination and attractiveness
   - examination of a document at rank r does not depend on examinations and clicks above r (addressed by the User browsing model)

3. Cascade model (CM)
   + cascade dependency of examination at r on examinations and clicks above r
   - only one click is allowed (addressed by the Dynamic Bayesian network)

Page 23: Something about search

User Browsing Model (UBM)

The examination probability depends not only on the rank of a document r, but also on the rank of the previously clicked document r'.

r' is the rank of the previously clicked document, or 0 if none of them was clicked, where c0 is set to 1 for convenience.
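
Given the previous click, the UBM click probability at rank r is the attractiveness of the document times the examination parameter for the pair (r, r'). The sketch below evaluates the log-likelihood of one session under that assumption; the parameter layout (a dict keyed by `(r, r_prev)`) and the numbers are illustrative only.

```python
import math


def ubm_log_likelihood(clicks, alphas, gamma):
    """Log-likelihood of one session under UBM.

    clicks: 0/1 per rank, top to bottom.
    alphas: attractiveness of the document shown at each rank.
    gamma:  examination probability keyed by (rank, rank of previous click),
            with previous-click rank 0 meaning no click so far.
    """
    log_lik = 0.0
    r_prev = 0
    for r, (c, a) in enumerate(zip(clicks, alphas), start=1):
        p_click = a * gamma[(r, r_prev)]
        log_lik += math.log(p_click if c else 1 - p_click)
        if c:
            r_prev = r
    return log_lik


# Hypothetical parameters for a two-result page.
gamma = {(1, 0): 1.0, (2, 0): 0.8, (2, 1): 0.4}
ll = ubm_log_likelihood(clicks=[1, 0], alphas=[0.5, 0.5], gamma=gamma)
```

Here the first click has probability 0.5 and the subsequent skip has probability 1 - 0.5 × 0.4 = 0.8, so the session likelihood is 0.4.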

Page 24: Something about search

Dynamic Bayesian Network Model (DBN)

Steps:
1. Start from the first document
2. Examine documents one by one
3. If click, read the actual document, and can be satisfied
4. If satisfied, stop
5. Otherwise, continue with a fixed probability

Page 25: Something about search

Dynamic Bayesian Network Model (DBN)

In particular:
1. Gamma is the continuation probability for a user who either did not click on a document or clicked but was not satisfied by it
2. DBN with gamma set to 1 is the Simplified DBN Model (SDBN), which can be fitted by MLE and performs well
3. If SDBN additionally sets the satisfaction probability to 1, the model becomes the Cascade Model (CM)
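
The browsing steps and the special cases above can be illustrated with a small simulator. This is a sketch under assumed per-rank attractiveness and satisfaction probabilities; the function name and parameter values are hypothetical.

```python
import random


def simulate_dbn(attract, satisfy, gamma, rng):
    """Simulate clicks for one result page under the DBN browsing process."""
    clicks = []
    examining = True
    for a, s in zip(attract, satisfy):
        click = examining and rng.random() < a
        clicks.append(int(click))
        if not examining:
            continue
        if click and rng.random() < s:
            examining = False  # satisfied: stop browsing
        elif rng.random() >= gamma:
            examining = False  # not satisfied (or no click): abandon with prob. 1 - gamma
    return clicks


attract = [0.6, 0.5, 0.4, 0.3]
# gamma = 1 and satisfaction = 1 reduce the process to the cascade model.
cascade_like = [
    simulate_dbn(attract, [1.0] * 4, gamma=1.0, rng=random.Random(seed))
    for seed in range(200)
]
# gamma = 0 with no satisfaction: the user always abandons after the first result.
first_only = [
    simulate_dbn(attract, [0.0] * 4, gamma=0.0, rng=random.Random(seed))
    for seed in range(200)
]
```

With gamma = 1 and satisfaction 1, every simulated session contains at most one click, matching the cascade model restriction.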

Page 26: Something about search

Parameter Estimation

1. Maximum likelihood estimation (MLE): RCM, RCTR, DCTR, DCM, SDBN, CM
2. Expectation maximization (EM): UBM, PBM, CCM, DBN

Page 27: Something about search

Simplified DBN Model (SDBN) -- MLE

In particular:
1. SDBN assumes that a user examines all documents until the last-clicked one and then abandons the search. In this case, both the attractiveness A and the satisfaction S of SDBN are observed.
2. Attractiveness A: for a given query, the ratio of a document's clicks to its impressions (counting only impressions at or before the last click).
3. Satisfaction S: for a given query, among the clicks on a document, the share that were the last click of the session.
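
The two counting rules above translate directly into code. The sketch assumes a hypothetical session format (`query`, `docs`, 0/1 `clicks`) and treats every result of a click-free session as examined, which is one possible convention and an assumption here; no smoothing priors are applied.

```python
from collections import defaultdict


def fit_sdbn(sessions):
    """MLE of SDBN attractiveness and satisfaction per (query, doc)."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    last_clicked = defaultdict(int)
    for s in sessions:
        clicks = s["clicks"]
        # Rank index of the last click; with no click, assume all results were examined.
        last = max((i for i, c in enumerate(clicks) if c), default=len(clicks) - 1)
        for i in range(last + 1):
            key = (s["query"], s["docs"][i])
            shown[key] += 1
            if clicks[i]:
                clicked[key] += 1
                if i == last:
                    last_clicked[key] += 1
    attract = {k: clicked.get(k, 0) / shown[k] for k in shown}
    satisfy = {k: last_clicked.get(k, 0) / clicked[k] for k in clicked}
    return attract, satisfy


toy_sessions = [
    {"query": "a", "docs": ["d1", "d2", "d3"], "clicks": [1, 0, 1]},
    {"query": "a", "docs": ["d1", "d2", "d3"], "clicks": [1, 0, 0]},
    {"query": "a", "docs": ["d1", "d2", "d3"], "clicks": [0, 0, 0]},
]
attract, satisfy = fit_sdbn(toy_sessions)
```

Documents that were never clicked get an attractiveness of 0 and no satisfaction estimate, since satisfaction is only defined over clicks.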

Page 28: Something about search

Simplified DBN Model (SDBN) -- MLE

Page 29: Something about search

Dynamic Bayesian Network Model (DBN) -- EM

In particular:
1. E-step. Given the three parameters, compute the posterior probabilities of A, E, S. This involves the forward-backward algorithm.
2. M-step. Given the posterior probabilities, update the three parameters.

Page 30: Something about search

Result

1. The DBN outperforms the others.
2. X-axis = 100 means those URLs with at least 100 sessions in the training set; more sessions mean priors are less important. Cascade and DBN improve.
3. Navigational queries have quality of context bias, and lots of sessions. Position models suffer.

Page 31: Something about search

Limitations and future research

Limitations:
1. Click models cannot model out-of-order clicks
2. Completely blind to query reformulations
3. Assume a homogeneous user population

Future research:
1. Why not learn the structure of a click model from data instead of defining it manually?
2. Interactions beyond clicks


Page 33: Something about search

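SDBN compute

Appendix

Only the top 60 search results are kept.

Definitions and parameters:
1. An impression is defined as a result shown before the last click
2. Click satisfaction is defined as described in the notes
3. att_alpha=0.1, att_beta=250, sat_alpha=0.1, sat_beta=100

Data filtering:
1. A click is defined as a play, an add, or a download
2. Discard sessions in which more than 10% of the z sequence is missing
3. Filter out sessions in which the same mid, sid has more than 10000 records
4. Keep only queries with more than 10 sessions
5. Users are assumed to play from top to bottom; sessions with out-of-order plays are discarded

Pipeline: pc behavior logs + crawled search API data → Click model → relevance score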

Page 34: Something about search

RCM VS RCTR

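RCM    RCTR

Both global popularity and per-keyword popularity show position bias. Per-keyword popularity is somewhat better because it accounts for the effect of different queries.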

Appendix

Page 35: Something about search

SDBN VS CM

For some keywords, SDBN performs somewhat better. No human-edited labels were available, so NDCG was not computed.

Appendix

Page 36: Something about search

NDCG compute

Appendix