kddcup9911


  • 8/13/2019 kddcup9911

    ACM KDD Cup: A Survey, 1997-2011

    Qiang Yang (partly based on Xinyue Liu's slides @SFU and Nathan Liu's slides @HKUST)

    Hong Kong University of Science and Technology

    About KDD Cup (1997-2011)

    Competition is a strong driver for science and engineering:

    ACM Programming Contest: world college-level programming skills

    RoboCup: world robotics competition

    About ACM KDDCUP

    ACM KDD: premier conference in knowledge discovery and data mining

    ACM KDDCUP:

    Worldwide competition held in conjunction with the ACM KDD conferences.

    It aims at:

    Showcasing the best methods for discovering higher-level knowledge from data

    Helping to close the gap between research and industry

    Stimulating further KDD research and development

    Statistics

    Participation in KDD Cup grew steadily

    Average person-hours per submission: 204

    Max person-hours per submission: 910

    Year         1997  1998  1999  2000  2005  2011
    Submissions    16    21    24    30    32  1000+


    Algorithms (up to 2000)

    KDD Cup 97

    A classification task: predict direct mail response in the financial services industry

    Winners

    Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) classifier (http://www-cse.ucsd.edu/users/elkan/)

    Silicon Graphics, Inc. with their software MineSet (http://www.sgi.com/)

    Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System (http://www.urbanscience.com/)

    MineSet (Silicon Graphics Inc.)

    A KDD tool that combines data access, transformation, classification, and visualization.

    KDD Cup 98: CRM Benchmark

    URL: http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html

    A classification task: analyze fund-raising mail responses to a non-profit organization

    Winners

    Urban Science Applications, Inc. with their software GainSmarts (http://www.urbanscience.com/)

    SAS Institute, Inc. with their software SAS Enterprise Miner (http://www.sas.com/)

    Quadstone Limited with their software Decisionhouse (http://www.quadstone.com/)

    KDDCUP 1998 Results

    [Chart: profit (axis from $0 to $70,000) against a percentage axis (0% to 100%). Curves shown: Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed), GainSmarts, SAS/Enterprise Miner, Quadstone/Decisionhouse, and the Mail to Everyone Solution ($10,560 in profits with 96,367 mailed).]
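
    The two reference points quoted for the 1998 results can be compared on a per-piece basis. The short Python sketch below only divides the profit figures by the mailing counts given on the slide; the function name `profit_per_piece` is a hypothetical helper introduced for illustration, not part of any contest tooling.

```python
def profit_per_piece(total_profit: float, pieces_mailed: int) -> float:
    """Average profit earned per piece of mail sent."""
    if pieces_mailed <= 0:
        raise ValueError("pieces_mailed must be positive")
    return total_profit / pieces_mailed


# Figures quoted on the results slide.
max_possible = profit_per_piece(72_776, 4_873)    # Maximum Possible Profit Line
mail_everyone = profit_per_piece(10_560, 96_367)  # Mail to Everyone Solution

print(f"Maximum possible: ${max_possible:.2f} per piece")
print(f"Mail to everyone: ${mail_everyone:.2f} per piece")
```

    The gap between the two values illustrates why selecting whom to mail matters far more than mailing volume in this benchmark.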

    ACM KDD Cup 1999

    URL: http://www.cse.ucsd.edu/users/elkan/kdresults.html

    Problem: to detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders

    Data: from DoD

    Winners

    SAS Institute Inc. with their software Enterprise Miner (http://www.sas.com/)

    Amdocs with their Information Analysis Environment (http://www.amdocs.com/main.asp)

    KDDCUP 2000: Data Set and Goal

    Data collected from Gazelle.com, a legwear and legcare Web retailer

    Pre-processed

    Training set: 2 months

    Test sets: one month

    Data collected includes:

    Click streams

    Order information

    The goal: to design models that support web-site personalization and improve the profitability of the site by increasing customer response.

    Questions, when given a set of page views:

    Characterize heavy spenders

    Characterize killer pages

    Characterize which product brand a visitor will view in the remainder of the session

    KDD Cup 2001

    3 Bioinformatics Tasks

    Dataset 1: Prediction of Molecular Bioactivity for Drug Design

    Half a gigabyte when uncompressed

    Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3)

    Dataset 2 is smaller and easier to understand

    7 megabytes uncompressed

    A total of 136 groups participated, producing a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

    2001 Winners

    Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce). Bayesian network learner and classifier

    Task 2, Function: Mark-A. Krogel (University of Magdeburg). Inductive logic programming

    Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo). K nearest neighbor

    Task 2: the genes of one particular type of organism

    A gene/protein can have more than one function, but only one localization.

    KDD Cup 2002

    Molecular biology: two tasks

    Task 1: Document extraction from biological articles

    Task 2: Classification of proteins based on gene deletion experiments

    Winners:

    Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein)

    Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)

    2003 KDDCUP

    Information retrieval/citation mining of scientific research papers, based on a very large archive of research papers

    First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference

    Second Task: build a citation graph of a large subset of the archive from only the LaTeX sources

    Third Task: estimate each paper's popularity based on partial download logs

    Last Task: participants devise their own questions

    2003 KDDCUP: Results

    Task 1: Claudia Perlich, Foster Provost, Sofus Macskassy (New York University)

    Task 2: 1st place: David Vogel (AI Insight Inc.)

    Task 3: Janez Brank and Jure Leskovec (Jozef Stefan Institute, Slovenia)

    Task 4: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen (University of Massachusetts Amherst, USA)

    2004 Tasks and Results

    Particle physics, plus protein homology prediction

    David S. Vogel, Eric Gottschalk, and Morgan C. Wang

    Bernhard Pfahringer

    Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao

    Past KDDCUP Overview: 2005-2010

    Year | Host | Task | Technique | Winner
    2005 | Microsoft | Web query categorization | Feature engineering, Ensemble | HKUST
    2006 | Siemens | Pulmonary emboli detection | Multi-instance, Non-IID sample, Cost sensitive, Class imbalance, Noisy data | AT&T, Budapest University of Technology & Economics
    2007 | Netflix | Consumer recommendation | Collaborative filtering, Time series, Ensemble | IBM Research, Hungarian Academy of Sciences
    2008 | Siemens | Breast cancer detection from medical images | Ensemble, Class imbalance, Score calibration | IBM Research, National Taiwan University
    2009 | Orange | Customer relationship prediction in telecom | Feature selection, Ensemble | IBM Research, University of Melbourne
    2010 | PSLC DataShop | Student performance prediction in E-Learning | Feature engineering, Ensemble, Collaborative filtering | National Taiwan University (CJ Lin, S. Lin, etc.)

    KDDCUP11 Dataset

    11 years of data

    Rated items are: Tracks, Albums, Artists, Genres

    Items are arranged in a taxonomy

    Two tasks

                Track 1   Track 2
    #ratings    263M      63M
    #items      625K      296K
    #users      1M        249K
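
    To give a feel for how sparse these rating matrices are, the following Python sketch computes the fraction of user-item pairs that carry a rating, using only the rounded counts from the table above. The helper name `rating_density` is hypothetical, and because the slide's counts are rounded ("263M", "1M"), the results are approximations.

```python
def rating_density(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of all possible (user, item) pairs that have a rating."""
    return n_ratings / (n_users * n_items)


# Rounded counts taken from the dataset table.
track1 = rating_density(263_000_000, 1_000_000, 625_000)
track2 = rating_density(63_000_000, 249_000, 296_000)

print(f"Track 1 density: {track1:.4%}")
print(f"Track 2 density: {track2:.4%}")
```

    In both tracks well under one percent of the possible user-item pairs are observed.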


    Items in a Taxonomy


    Track 1 Details

    Track 1 Highlights

    Largest publicly available dataset

    Large number of items (50 times more than Netflix)

    Extreme rating sparsity (20 times more sparse than Netflix)

    Taxonomy can help in combating sparsely rated items.

    Fine time stamps with both date and time allow sophisticated temporal modeling.


    Track 2 Details

    Track 2 Highlights

    Performance metric focuses on ranking/classification, which differs from traditional collaborative filtering.

    No validation data provided; participants need to construct binary labeled data themselves from the rating data.

    Unlike track 1, track 2 removed time stamps to focus more on long-term preference than on short-term behavior.
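
    One way to construct binary labeled data from ratings is sketched below in Python. It is an illustration under stated assumptions, not the procedure any team used: a rating at or above a threshold (80 here, an assumed value) counts as a positive, and for each positive one unrated item is drawn as a negative with probability proportional to how often that item received a high rating. The function name `build_binary_examples` and the toy data are hypothetical.

```python
import random
from collections import Counter


def build_binary_examples(ratings, threshold=80, negatives_per_positive=1, seed=0):
    """Turn (user, item, score) ratings into (user, item, label) examples.

    Assumptions for illustration: score >= threshold is a positive; negatives
    are items the user never rated, sampled in proportion to item popularity.
    """
    rng = random.Random(seed)
    rated_by_user = {}
    for user, item, _ in ratings:
        rated_by_user.setdefault(user, set()).add(item)

    # Popularity = number of high ratings an item received.
    # Items with no high rating are never sampled as negatives.
    popularity = Counter(item for _, item, score in ratings if score >= threshold)

    examples = []
    for user, item, score in ratings:
        if score < threshold:
            continue
        examples.append((user, item, 1))
        candidates = [i for i in sorted(popularity) if i not in rated_by_user[user]]
        if not candidates:
            continue  # this user has rated every popular item
        weights = [popularity[i] for i in candidates]
        for neg in rng.choices(candidates, weights=weights, k=negatives_per_positive):
            examples.append((user, neg, 0))
    return examples


toy_ratings = [("u1", "a", 90), ("u1", "b", 30), ("u2", "a", 85),
               ("u2", "c", 95), ("u3", "c", 100)]
examples = build_binary_examples(toy_ratings)
for example in examples:
    print(example)
```

    Sampling negatives by popularity keeps a model from scoring well merely by learning that popular items are liked; the threshold and the sampling ratio are knobs to tune against a held-out split.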


    Submission Stats

    Winners

    Place | Track 1 | Track 2
    1st | National Taiwan University | National Taiwan University
    2nd | Commendo (Netflix Prize Winner) | Chinese Academy of Science, Hulu Labs
    3rd | Hong Kong University of Science and Technology, Shanghai Jiaotong University | Commendo (Netflix Prize Winner)

    Chinese Teams at KDDCUP (NTU, CAS, HKUST)

    Key Techniques

    Track 1: Blending of multiple techniques

    Matrix factorization models

    Nearest neighbor models

    Restricted Boltzmann machines

    Temporal modeling

    Track 2: Importance sampling of negative instances

    Taxonomical modeling

    Use of pairwise ranking objective functions
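
    As a concrete picture of the first technique in the list, here is a minimal Python sketch of a matrix factorization model trained by stochastic gradient descent. It is a toy under stated assumptions, not any winning team's system: the prediction is the global mean plus a dot product of user and item factors, and the hyperparameters, the names `train_mf` and `rmse`, and the six toy ratings are all invented for illustration.

```python
import math
import random


def train_mf(ratings, n_users, n_items, n_factors=2,
             lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit rating ~ mean + P[u] . Q[i] by SGD with L2 regularization."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mean + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(n_factors):
                p, q = P[u][f], Q[i][f]
                # Gradient step on squared error plus L2 penalty.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return mean, P, Q


def rmse(model, ratings):
    """Root mean squared error of the model on the given ratings."""
    mean, P, Q = model
    total = 0.0
    for u, i, r in ratings:
        pred = mean + sum(p * q for p, q in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(ratings))


# Toy (user, item, rating) triples on a 1-5 scale.
toy = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
untrained = train_mf(toy, 3, 3, epochs=0)
trained = train_mf(toy, 3, 3)
print(f"RMSE before training: {rmse(untrained, toy):.3f}")
print(f"RMSE after training:  {rmse(trained, toy):.3f}")
```

    Blending, as named on the slide, would combine predictions from several such models (with different factor counts, neighborhood models, and so on) rather than rely on a single one.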

    Summary

    Placing at the top of KDDCUP requires:

    Team work

    Expertise in domain knowledge as well as mathematical tools

    Often done by world-famous institutes and companies

    Recent trends:

    Datasets are increasingly realistic

    Participants are increasingly professional

    Tasks are increasingly difficult

    Summary

    KDD Cup is an excellent source for learning state-of-the-art KDD techniques

    KDDCUP datasets often become the standard benchmark for future research, development and teaching

    Top winners are highly regarded and respected
