
Discriminative, Generative and Imitative Learning

by Tony Jebara

B.Eng., Electrical Engineering, McGill University, 1996
M.Sc., Media Arts and Sciences, MIT, 1998

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY IN MEDIA ARTS AND SCIENCES
at the Massachusetts Institute of Technology

February 2002

© Massachusetts Institute of Technology, 2002. All Rights Reserved.

Signature of Author
Program in Media Arts and Sciences
December 18, 2001

Certified by
Alex P. Pentland
Toshiba Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Thesis Supervisor

Accepted by
Andrew B. Lippman
Chair, Departmental Committee on Graduate Students
Program in Media Arts and Sciences

Discriminative, Generative and Imitative Learning

by Tony Jebara

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Media Arts and Sciences

Abstract

I propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. A generative probabilistic distribution is a principled way to model many machine learning and machine perception problems. Therein, one provides domain-specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian networks and Bayesian statistics provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a distribution that is a good generator of novel exemplars.
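
As a minimal sketch of the generative estimation the abstract refers to (this is standard Bayesian inference over a parameterized model, corresponding to Section 2.2.1 in the contents below; the notation is generic rather than the thesis's own):

$$ p(\Theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \Theta)\, p(\Theta)}{p(\mathcal{X})}, \qquad p(x_{\mathrm{new}} \mid \mathcal{X}) = \int p(x_{\mathrm{new}} \mid \Theta)\, p(\Theta \mid \mathcal{X})\, d\Theta, $$

where the prior $p(\Theta)$ and the structure of $p(\mathcal{X} \mid \Theta)$ (for instance a Bayesian network) encode the domain-specific knowledge, and the predictive distribution is the "generator of novel exemplars."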

Conversely, discriminative algorithms adjust a possibly non-distributional model to data, optimizing for a specific task, such as classification or prediction. This typically leads to superior performance yet compromises the flexibility of generative modeling. I present Maximum Entropy Discrimination (MED) as a framework to combine both discriminative estimation and generative probability densities. Calculations involve distributions over parameters, margins, and priors and are provably and uniquely solvable for the exponential family. Extensions include regression, feature selection, and transduction. SVMs are also naturally subsumed and can be augmented with, for example, feature selection, to obtain substantial improvements.
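
For orientation, one common way to state the MED estimation problem (a sketch consistent with published MED formulations; the exact constraints and priors used in the thesis may differ): given training pairs $(X_t, y_t)$ with $y_t \in \{\pm 1\}$ and a discriminant function $\mathcal{L}(X;\Theta)$, for example a log-likelihood ratio of two generative models, find a distribution over parameters and margins that stays close to a prior while classifying every example with margin $\gamma_t$:

$$ \min_{P(\Theta,\gamma)} \; \mathrm{KL}\!\left( P(\Theta,\gamma) \,\|\, P_0(\Theta,\gamma) \right) \quad \text{s.t.} \quad \int P(\Theta,\gamma)\left[\, y_t\, \mathcal{L}(X_t;\Theta) - \gamma_t \,\right] d\Theta\, d\gamma \;\geq\; 0 \quad \forall t. $$

Predictions are then made with the averaged discriminant, $\hat{y} = \mathrm{sign} \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta$; with a linear discriminant and Gaussian priors this recovers the SVM, which is the sense in which SVMs are subsumed.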

To extend to mixtures of exponential families, I derive a discriminative variant of the Expectation-Maximization (EM) algorithm for latent discriminative learning (or latent MED). While EM and Jensen's inequality provide a lower bound on the log-likelihood, a dual upper bound is made possible via a novel reverse-Jensen inequality. The variational upper bound on latent log-likelihood has the same form as EM bounds, is efficiently computable and is globally guaranteed. It permits powerful discriminative learning with the wide range of contemporary probabilistic mixture models (mixtures of Gaussians, mixtures of multinomials and hidden Markov models). We provide empirical results on standardized data sets that demonstrate the viability of the hybrid discriminative-generative approaches of MED and reverse-Jensen bounds over state-of-the-art discriminative techniques or generative approaches.
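
To place the two bounds side by side (the Jensen/EM lower bound is standard; the reverse-Jensen bound is only characterized qualitatively here, since its exact parameters are what the thesis derives): for a latent-variable model with hidden assignment $m$,

$$ \log p(X \mid \Theta) \;=\; \log \sum_{m} p(m, X \mid \Theta) \;\geq\; \sum_{m} q(m) \log \frac{p(m, X \mid \Theta)}{q(m)} \quad \text{for any distribution } q(m), $$

and EM alternates between setting $q(m) = p(m \mid X, \tilde{\Theta})$ at the current parameters $\tilde{\Theta}$ and maximizing the bound over $\Theta$. In a conditional or discriminative objective, latent log-likelihoods also appear with a negative sign, so a matching upper bound is needed; the reverse-Jensen inequality supplies one with the same EM-like form, which is what keeps latent discriminative learning tractable.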

Subsequently, imitative learning is presented as another variation on generative modeling which also learns from exemplars from an observed data source. However, the distinction is that the generative model is an agent that is interacting in a much more complex surrounding external world. It is not efficient to model the aggregate space in a generative setting. I demonstrate that imitative learning (under appropriate conditions) can be adequately addressed as a discriminative prediction task which outperforms the usual generative approach. This discriminative-imitative learning approach is applied with a generative perceptual system to synthesize a real-time agent that learns to engage in social interactive behavior.
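
As a toy illustration of that last point (not the thesis's actual perceptual or behavioral system; the data, features and model below are placeholders): given logged (percept, action) pairs from a demonstrator, imitation can be cast as directly fitting a conditional predictor of the action given the percept, rather than a joint generative model of the agent together with its surrounding world.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical demonstration log: each row is a perceptual feature vector,
    # each label is the discrete action the demonstrator took in that situation.
    rng = np.random.default_rng(0)
    percepts = rng.normal(size=(500, 10))                         # placeholder percept features
    actions = (percepts[:, 0] + percepts[:, 1] > 0).astype(int)   # placeholder action labels

    # Discriminative imitation: model p(action | percept) directly,
    # without modeling the distribution of percepts themselves.
    policy = LogisticRegression().fit(percepts, actions)

    # At run time the agent maps its current percept straight to an action.
    new_percept = rng.normal(size=(1, 10))
    print(policy.predict(new_percept)[0])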

Thesis Supervisor: Alex Pentland
Title: Toshiba Professor of Media Arts and Sciences, MIT Media Lab

Discriminative, Generative and Imitative Learning

by Tony Jebara

Thesis committee:

Advisor:
Alex P. Pentland
Toshiba Professor of Media Arts and Sciences
MIT Media Laboratory

Co-Advisor:
Tommi S. Jaakkola
Assistant Professor of Electrical Engineering and Computer Science
MIT Artificial Intelligence Laboratory

Reader:
David C. Hogg
Professor of Computing and Pro-Vice-Chancellor
School of Computer Studies, University of Leeds

Reader:
Tomaso A. Poggio
Uncas and Helen Whitaker Professor
MIT Brain Sciences Department and MIT Artificial Intelligence Lab


Acknowledgments

I extend warm thanks to Alex Pentland for sharing with me his wealth of brilliant creative ideas, his ground-breaking visions for computer-human collaboration, and his ability to combine the strengths of so many different fields and applications. I extend warm thanks to Tommi Jaakkola for sharing with me his masterful knowledge of machine learning, his pioneering ideas on discriminative-generative estimation and his excellent statistical and mathematical abilities. I extend warm thanks to David Hogg for sharing with me his will to tackle great challenging problems, his visionary ideas on behavior learning, and his ability to span the panorama of perception, learning and behavior. I extend warm thanks to Tomaso Poggio for sharing with me his extensive understanding of so many aspects of intelligence: biological, psychological, statistical, mathematical and computational, and his enthusiasm towards science in general. As members of my committee, they have all profoundly shaped the ideas in this thesis as well as helped me formalize them into this document.

I would like to thank the Pentlandians group, who have been a great team to work with: Karen Navarro, Elizabeth Farley, Tanzeem Choudhury, Brian Clarkson, Sumit Basu, Yuri Ivanov, Nitin Sawhney, Vikram Kumar, Ali Rahimi, Steve Schwartz, Rich DeVaul, Dave Berger, Josh Weaver and Nathan Eagle. I would also like to thank the TRG, who have been a great source of readings and brainstorming: Jayney Yu, Jason Rennie, Adrian Corduneanu, Neel Master, Martin Szummer, Romer Rosales, Chen-Hsiang, Nati Srebro, and so many others. My thanks to all the other Media Lab folks who are still around like Deb Roy, Joe Paradiso, Bruce Blumberg, Roz Picard, Irene Pepperberg, Claudia Urrea, Yuan Qi, Raul Fernandez, Push Singh, Bill Butera, Mike Johnson and Bill Tomlinson for sharing bold ideas and deep thoughts. Thanks also to great Media Lab friends who have moved to other places but had a profound influence on me: Baback Moghaddam, Bernt Schiele, Nuria Oliver, Ali Azarbayejani, Thad Starner, Kris Popat, Chris Wren, Jim Davis, Tom Minka, Francois Berard, Andy Wilson, Nuno Vasconcelos, Janet Cahn, Lee Campbell, Marina Bers, and my UROPs Martin Wagner, Cyrus Eyster and Ken Russell. Thanks to so many folks outside the lab like Sayan Mukherjee, Marina Meila, Yann LeCun, Michael Jordan, Andrew McCallum, Andrew Ng, Thomas Hoffman, John Weng, and many others for great conversations and valuable insight.

And thanks to my family, my father, my mother and my sister, Carine. They made every possible sacrifice and effort so that I could do this PhD and supported me cheerfully throughout the whole endeavor.

Contents

1 Introduction
  1.1 Learning and Generative Modeling
    1.1.1 Learning and Generative Models in AI
    1.1.2 Learning and Generative Models in Perception
    1.1.3 Learning and Generative Models in Temporal Behavior
  1.2 Why a Probability of Everything?
  1.3 Generative versus Discriminative Learning
  1.4 Imitative Learning
  1.5 Objective
  1.6 Scope
  1.7 Organization

2 Generative vs. Discriminative Learning
  2.1 Two Schools of Thought
    2.1.1 Generative Probabilistic Models
    2.1.2 Discriminative Classifiers and Regressors
  2.2 Generative Learning
    2.2.1 Bayesian Inference
    2.2.2 Maximum Likelihood
  2.3 Conditional Learning
    2.3.1 Conditional Bayesian Inference
    2.3.2 Maximum Conditional Likelihood
    2.3.3 Logistic Regression
  2.4 Discriminative Learning
    2.4.1 Empirical Risk Minimization
    2.4.2 Structural Risk Minimization and Large Margin ...
