8/13/2019 kddcup9911
ACM KDD Cup: A Survey, 1997-2011
Qiang Yang (partly based on Xinyue Liu's slides @SFU and Nathan Liu's slides @HKUST)
Hong Kong University of Science and Technology
About KDD Cup (1997-2011)
Competition is a strong mover for science and engineering:
  ACM Programming Contest: world college-level programming skills
  RoboCup: world robotics competition
About ACM KDDCUP
ACM KDD: premier conference in knowledge discovery and data mining
ACM KDDCUP: worldwide competition held in conjunction with the ACM KDD conferences
It aims at:
  Showcasing the best methods for discovering higher-level knowledge from data
  Helping to close the gap between research and industry
  Stimulating further KDD research and development
Statistics
Participation in KDD Cup grew steadily
Average person-hours per submission: 204
Max person-hours per submission: 910
Year         1997  1998  1999  2000  2005  2011
Submissions    16    21    24    30    32  1000+
Algorithms (up to 2000)
KDD Cup 97
A classification task to predict direct mail response in the financial services industry
Winners:
  Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) classifier (http://www-cse.ucsd.edu/users/elkan/)
  Silicon Graphics, Inc. with their software MineSet (http://www.sgi.com/)
  Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System (http://www.urbanscience.com/)
MineSet (Silicon Graphics Inc.)
A KDD tool that combines data access, transformation,
classification, and visualization.
KDD Cup 98: CRM Benchmark
URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
A classification task to analyze fund-raising mail responses to a non-profit organization
Winners:
  Urban Science Applications, Inc. with their software GainSmarts (http://www.urbanscience.com/)
  SAS Institute, Inc. with their software SAS Enterprise Miner (http://www.sas.com/)
  Quadstone Limited with their software Decisionhouse (http://www.quadstone.com/)
KDDCUP 1998 Results
[Figure: chart with a dollar axis ($0 to $70,000) and a percentage axis (0% to 100%), with curves for GainSmarts, SAS/Enterprise Miner, and Quadstone/Decisionhouse. Reference lines: Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) and Mail to Everyone Solution ($10,560 in profits with 96,367 mailed).]
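The two reference points quoted for the 1998 results can be turned into per-piece figures, and the same bookkeeping gives a cumulative profit curve for any ranked mailing list. The sketch below is illustrative only: the cost per mailed piece and the toy scores and donations are assumptions, not values from these slides, and `profit_per_piece` and `cumulative_profit` are hypothetical helper names.

```python
# Reference points quoted on the 1998 results slide.
MAX_PROFIT, MAX_MAILED = 72_776, 4_873    # maximum possible profit line
ALL_PROFIT, ALL_MAILED = 10_560, 96_367   # mail-to-everyone solution


def profit_per_piece(profit, mailed):
    """Average profit per mailed piece, in dollars."""
    return profit / mailed


def cumulative_profit(scores, donations, cost_per_piece):
    """Mail in descending score order; return the running profit after each piece."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, total = [], 0.0
    for i in order:
        total += donations[i] - cost_per_piece
        curve.append(total)
    return curve


# Toy example: four prospects with model scores and actual donations (hypothetical).
scores = [0.9, 0.1, 0.6, 0.3]
donations = [20.0, 0.0, 0.0, 5.0]
COST = 0.68  # assumed cost per mailed piece, not taken from the slides

curve = cumulative_profit(scores, donations, COST)
# Mailing depth (number of pieces) at which the running profit peaks.
best_depth = max(range(len(curve)), key=curve.__getitem__) + 1

print(round(profit_per_piece(ALL_PROFIT, ALL_MAILED), 3))
print(round(profit_per_piece(MAX_PROFIT, MAX_MAILED), 2))
print(best_depth)
```

Comparing the two per-piece figures shows why ranking matters: mailing everyone earns only cents per piece, while the ideal selection earns dollars per piece on a far smaller list.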
ACM KDD Cup 1999
URL: www.cse.ucsd.edu/users/elkan/kdresults.html
Problem: to detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders
Data: from DoD
Winners:
  SAS Institute Inc. with their software Enterprise Miner (http://www.sas.com/)
  Amdocs with their Information Analysis Environment (http://www.amdocs.com/main.asp)
KDDCUP 2000: Data Set and Goal
Data collected from Gazelle.com, a legwear and legcare Web retailer
  Pre-processed
  Training set: 2 months
  Test sets: one month
  Data collected includes click streams and order information
The goal: to design models to support web-site personalization and to improve the profitability of the site by increasing customer response
Questions: when given a set of page views,
  characterize heavy spenders
  characterize killer pages
  characterize which product brand a visitor will view in the remainder of the session
KDD Cup 2001
3 bioinformatics tasks
Dataset 1: Prediction of Molecular Bioactivity for Drug Design
  Half a gigabyte when uncompressed
Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3)
  Dataset 2 is smaller and easier to understand
  7 megabytes uncompressed
A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.
2001 Winners
Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce)
  Bayesian network learner and classifier
Task 2, Function: Mark-A. Krogel (University of Magdeburg)
  Inductive logic programming
Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo)
  K nearest neighbor
Task 2: the genes of one particular type of organism
  A gene/protein can have more than one function, but only one localization
KDD Cup 2002
Molecular biology: two tasks
  Task 1: Document extraction from biological articles
  Task 2: Classification of proteins based on gene deletion experiments
Winners:
  Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein)
  Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)
2003 KDDCUP
Information retrieval/citation mining of scientific research papers, based on a very large archive of research papers
First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference
Second Task: build a citation graph of a large subset of the archive from only the LaTeX sources
Third Task: estimate each paper's popularity based on partial download logs
Last Task: participants devise their own questions
2003 KDDCUP: Results
Task 1: Claudia Perlich, Foster Provost, Sofus Macskassy (New York University)
Task 2: 1st place: David Vogel (AI Insight Inc.)
Task 3: Janez Brank and Jure Leskovec (Jozef Stefan Institute, Slovenia)
Task 4: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen (University of Massachusetts Amherst, USA)
2004 Tasks and Results
Tasks: particle physics, plus protein homology prediction
Winners:
  David S. Vogel, Eric Gottschalk, and Morgan C. Wang
  Bernhard Pfahringer
  Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao
Past KDDCUP Overview: 2005-2010
Year | Host | Task | Technique | Winner
2005 | Microsoft | Web query categorization | Feature engineering, ensemble | HKUST
2006 | Siemens | Pulmonary emboli detection | Multi-instance, non-IID sample, cost sensitive, class imbalance, noisy data | AT&T, Budapest University of Technology & Economics
2007 | Netflix | Consumer recommendation | Collaborative filtering, time series, ensemble | IBM Research, Hungarian Academy of Sciences
2008 | Siemens | Breast cancer detection from medical images | Ensemble, class imbalance, score calibration | IBM Research, National Taiwan University
2009 | Orange | Customer relationship prediction in telecom | Feature selection, ensemble | IBM Research, University of Melbourne
2010 | PSLC DataShop | Student performance prediction in e-learning | Feature engineering, ensemble, collaborative filtering | National Taiwan University (CJ Lin, S. Lin, etc.)
KDDCUP11 Dataset
11 years of data
Rated items are:
  Tracks
  Albums
  Artists
  Genres
Items are arranged in a taxonomy
Two tasks:
           Track 1  Track 2
#ratings   263M     63M
#items     625K     296K
#users     1M       249K
Items in a Taxonomy
Track 1 Details
Track 1 Highlights
Largest publicly available dataset
Large number of items (50 times more than Netflix)
Extreme rating sparsity (20 times more sparse than Netflix)
Taxonomy can help in combating sparsely rated items
Fine time stamps with both date and time allow sophisticated temporal modeling
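One common way a taxonomy can help with sparsely rated items is to shrink each item's mean rating toward the mean of its parent node, so that a track with few ratings borrows strength from its album. The sketch below illustrates that idea only; it is not taken from any winning solution, and the function name `shrunk_means`, the `strength` parameter, and the toy data are all assumptions.

```python
def shrunk_means(ratings, parent, strength=5.0):
    """Estimate per-item mean ratings, shrunk toward the parent node's mean.

    ratings: list of (item, score) pairs
    parent:  dict mapping each item to its parent in the taxonomy
    strength: pseudo-count given to the parent mean (assumed value)
    """
    item_sum, item_cnt = {}, {}
    parent_sum, parent_cnt = {}, {}
    for item, score in ratings:
        item_sum[item] = item_sum.get(item, 0.0) + score
        item_cnt[item] = item_cnt.get(item, 0) + 1
        p = parent[item]
        parent_sum[p] = parent_sum.get(p, 0.0) + score
        parent_cnt[p] = parent_cnt.get(p, 0) + 1

    # Fall back to the global mean when a parent has no rated children.
    global_mean = sum(s for _, s in ratings) / len(ratings) if ratings else 0.0

    result = {}
    for item, p in parent.items():
        if parent_cnt.get(p):
            parent_mean = parent_sum[p] / parent_cnt[p]
        else:
            parent_mean = global_mean
        n = item_cnt.get(item, 0)
        result[item] = (item_sum.get(item, 0.0) + strength * parent_mean) / (n + strength)
    return result


# Toy taxonomy: three tracks on one album (hypothetical data).
parent = {"t1": "album", "t2": "album", "t3": "album"}
ratings = [("t1", 80), ("t1", 90), ("t1", 70), ("t1", 80), ("t2", 20)]
means = shrunk_means(ratings, parent)
print(means)
```

In this toy case the unrated track `t3` simply inherits the album mean, while the once-rated track `t2` is pulled most of the way toward it.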
Track 2 Details
Track 2 Highlights
Performance metric focuses on ranking/classification, which differs from traditional collaborative filtering
No validation data provided; participants need to construct binary labeled data from the rating data themselves
Unlike track 1, track 2 removed time stamps to focus more on long-term preference rather than short-term behaviors
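Constructing binary labeled data from ratings, as the Track 2 highlights mention, might look like the sketch below. It is a hedged illustration rather than the organizers' procedure: the threshold of 80 for a "liked" item, the popularity-weighted choice of unrated items as negatives, and the name `build_binary_validation` are all assumptions.

```python
import random
from collections import Counter


def build_binary_validation(ratings, threshold=80, per_user=2, seed=0):
    """Build (user, item, label) triples from ratings.

    ratings: dict user -> dict item -> score
    Label 1: an item the user rated at or above `threshold`.
    Label 0: an item the user never rated, drawn with probability
             proportional to how many users rated it highly.
    """
    rng = random.Random(seed)
    popularity = Counter(
        item
        for user_ratings in ratings.values()
        for item, score in user_ratings.items()
        if score >= threshold
    )
    items = sorted(popularity)
    validation = []
    for user in sorted(ratings):
        rated = ratings[user]
        liked = sorted(i for i, s in rated.items() if s >= threshold)
        pool = [i for i in items if i not in rated]
        k = min(per_user, len(liked), len(pool))
        if k == 0:
            continue
        positives = rng.sample(liked, k)
        negatives = []
        for _ in range(k):
            # Weighted draw without replacement from the unrated pool.
            weights = [popularity[i] for i in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            negatives.append(pick)
            pool.remove(pick)
        validation += [(user, i, 1) for i in positives]
        validation += [(user, i, 0) for i in negatives]
    return validation


# Toy ratings on a 0-100 scale (hypothetical data).
ratings = {
    "u1": {"a": 90, "b": 30, "c": 85},
    "u2": {"a": 95, "d": 80},
    "u3": {"b": 100, "d": 90, "e": 10},
}
validation = build_binary_validation(ratings)
print(validation)
```

In practice the held-out positives would also need to be removed from the training ratings, so that the validation set measures generalization rather than memorization.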
Submission Stats
Winners
Place | Track 1 | Track 2
1st place | National Taiwan University | National Taiwan University
2nd place | Commendo (Netflix Prize winner) | Chinese Academy of Science, Hulu Labs
3rd place | Hong Kong University of Science and Technology, Shanghai Jiaotong University | Commendo (Netflix Prize winner)
Chinese Teams at KDDCUP (NTU, CAS, HKUST)
Key Techniques
Track 1: Blending of multiple techniques
  Matrix factorization models
  Nearest neighbor models
  Restricted Boltzmann machines
  Temporal modeling
Track 2: Importance sampling of negative instances
  Taxonomical modeling
  Use of pairwise ranking objective functions
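To make the techniques listed above concrete, here is a minimal sketch that combines a matrix factorization model with a pairwise ranking objective and sampled negatives. It is a generic illustration and not any team's actual solution: the objective is the log-sigmoid of the score difference between a liked and a sampled item, negatives are drawn with popularity-based weights (one possible reading of "importance sampling"), and all names, hyperparameters, and data are assumptions.

```python
import math
import random


def train_pairwise_mf(positives, n_items, dim=4, steps=3000, lr=0.05, reg=0.01, seed=0):
    """Train user/item factors by SGD so that liked items outscore sampled negatives.

    positives: dict user -> set of liked item ids (0 .. n_items - 1)
    """
    rng = random.Random(seed)
    users = sorted(positives)
    P = {u: [rng.gauss(0, 0.1) for _ in range(dim)] for u in users}
    Q = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]

    popularity = [0] * n_items
    for liked in positives.values():
        for i in liked:
            popularity[i] += 1

    for _ in range(steps):
        u = rng.choice(users)
        i = rng.choice(sorted(positives[u]))
        candidates = [j for j in range(n_items) if j not in positives[u]]
        # Popular items are drawn more often; +1 keeps never-liked items drawable.
        weights = [popularity[j] + 1 for j in candidates]
        j = rng.choices(candidates, weights=weights, k=1)[0]

        # x = score(u, i) - score(u, j); ascend the gradient of ln(sigmoid(x)).
        x = sum(pu * (qi - qj) for pu, qi, qj in zip(P[u], Q[i], Q[j]))
        g = 1.0 / (1.0 + math.exp(x))
        for f in range(dim):
            pu, qi, qj = P[u][f], Q[i][f], Q[j][f]
            P[u][f] += lr * (g * (qi - qj) - reg * pu)
            Q[i][f] += lr * (g * pu - reg * qi)
            Q[j][f] += lr * (-g * pu - reg * qj)
    return P, Q


def score(P, Q, u, i):
    """Predicted preference of user u for item i (dot product of factors)."""
    return sum(a * b for a, b in zip(P[u], Q[i]))


# Toy data: users 0 and 2 like items 0 and 1; user 1 likes items 2 and 3.
positives = {0: {0, 1}, 1: {2, 3}, 2: {0, 1}}
P, Q = train_pairwise_mf(positives, n_items=4)
print(score(P, Q, 0, 0) > score(P, Q, 0, 2))
```

Because only score differences enter the objective, the model learns an ordering of items per user rather than calibrated rating values, which matches a ranking-style evaluation.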
Summary
To place on top of KDDCUP requires:
  Team work
  Expertise in domain knowledge as well as mathematical tools
  Often done by world-famous institutes and companies
Recent trends:
  Datasets are increasingly realistic
  Participants are increasingly professional
  Tasks are increasingly difficult
Summary
KDD Cup is an excellent source for learning state-of-the-art KDD techniques
KDDCUP datasets often become the standard benchmark for future research, development and teaching
Top winners are highly regarded and respected