C. Lee Giles David Reese Professor, College of Information Sciences and Technology Graduate Professor of Computer Science and Engineering Courtesy Professor of Supply Chain and Information Systems The Pennsylvania State University, University Park, PA, USA [email protected] http://clgiles.ist.psu.edu/IST441 Search Engines IST 441 Introduction to course and search engines


Page 1

C. Lee Giles

David Reese Professor, College of Information Sciences and Technology

Graduate Professor of Computer Science and Engineering

Courtesy Professor of Supply Chain and Information Systems

The Pennsylvania State University, University Park, PA, USA

[email protected]

http://clgiles.ist.psu.edu/IST441

Information Retrieval and Search Engines

IST 441

Introduction to course and search engines

Page 2

Course homepage

• Everything you need to know about the course

– http://clgiles.ist.psu.edu/IST441

– Or put IST441 into Google

• Project

• Exercises

• Readings

• Schedule

• Participation

• Exam

Page 3

Professor C. Lee Giles

• Intelligent and specialty search engines; cyberinfrastructure for science, academia and government; big data

– Modular, scalable, robust, automatic, science- and technology-focused cyberinfrastructure and search engine creation and maintenance

– Large heterogeneous data and information systems

– Specialty science and technology search engines for knowledge discovery & integration

• CiteSeerX (all scholarly documents – focus on computer science)

• ChemXSeer (e-chemistry portal)

• CollabSeer (collaboration search)

• CSSeer (expert finding)

• Scalable intelligent tools/agents/methods/algorithms

– Information, knowledge and data integration

– Information and metadata extraction; entity recognition

– Chemical formulae & names, tables, and figures

– Unique search, knowledge discovery, information integration, data mining algorithms

– Expert and collaboration recommendation

– Research evaluation

http://clgiles.ist.psu.edu

Page 4

What will be covered

• What is information

– How much is there?

• Properties of text

– Document models

• Information retrieval (IR) systems and methods

– Query structures

– Evaluation and relevance

– Role of the user

– Vector models

– Inverted index

Page 5

What will be covered

• Search engines as IR systems and how they work

– Indexers

– Crawlers

– Ranking

– Evaluation

– SEO

• Internet and Web

– Web structure

• Semantic search

• Google and link analysis

• Social networks

Page 6

Approach

• Readings and Lectures

– Exercises

– One exam

– Participation

• Projects

– Build 2 specialty search engines for a customer

• Customer defines the project

– Built with Nutch, YouSeer, Lucid Works (based on Solr/Lucene)

– Who uses Lucene

– Build a Google Custom Search Engine

» Comparison of these two

– Customer receives (reviews) search engine at the end of the semester

– Presentation on search engines built

– Report on search engine due at end of semester

• Undergrads vs grads

• Guest seminars

Page 9

Web search engine use has new activities

Pew Internet & American Life Project Survey: 2009

PewInternet

Page 10

Search Engine Market Share

Page 12

Dec 2012: 2 billion internet users

Page 14

2012

Page 15

ComScore global share

Number of search engine queries - US

About billion per day

Page 16

Students who took this course

• Google

• Yahoo

• Microsoft

• Facebook

• RIT

• IBM

• Tencent

• Klout

• eBay

• Raytheon

• Lockheed Martin

• …
