35
6.006 Introduction to Algorithms Lecture 1: Document Distance Prof. Erik Demaine

Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

6.006IntroductiontoAlgorithms

Lecture1:DocumentDistanceProf.ErikDemaine

Page 2: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

YourProfessors

Prof.ErikDemaine Prof.Piotr Indyk Prof.Manolis Kellis

Page 3: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

YourTAs

KevinKelley JosephLaurendi Tianren Qi

NicholasZehenderDavidWen

Page 4: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

YourTextbook

Page 5: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Administrivia• Handout: Course information• Webpage:http://courses.csail.mit.edu/6.006/spring11/• Signupforrecitationifyoudidn’tfilloutformalready• Sign up for problemsetserver: https://alg.csail.mit.edu/• SignupforPiazzza accounttoask/answerquestions:http://piazzza.com/

• Prereqs: 6.01(Python), 6.042(discretemath)• Grades: Problem sets (30%)

Quiz1 (20%;[email protected]–9.30pm)Quiz2 (20%;[email protected]–9.30pm)Final (30%)

• Lectures&Recitations;Homeworklabs;Quizreviews• Read collaboration policy!

Page 6: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Today• Classoverview

– What’sa(good)algorithm?– Topics

• DocumentDistance– Vectorspacemodel– Algorithms– Pythonprofiling&gotchas

Page 7: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

What’sanAlgorithm?• Mathematicalabstractionofcomputerprogram

• Well‐specifiedmethodforsolvingacomputationalproblem– Typically,afinitesequenceofoperations

• DescriptionmightbestructuredEnglish,pseudocode,orrealcode

• Key: no ambiguityhttp://en.wikipedia.org/wiki/File:Euclid_flowchart_1.png

Page 8: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

al‐Khwārizmī(c.780–850)• “al‐kha‐raz‐mi”

http://en.wikipedia.org/wiki/File:Abu_Abdullah_Muhammad_bin_Musa_al‐Khwarizmi_edit.png

http://en.wikipedia.org/wiki/Al‐Khwarizmi

Page 9: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

al‐Khwārizmī(c.780–850)• “al‐kha‐raz‐mi”• Fatherofalgebra

– TheCompendiousBookonCalculationbyCompletionandBalancing(c.830)

– Linear&quadraticequations:someofthefirstalgorithms

http://en.wikipedia.org/wiki/File:Image‐Al‐Kit%C4%81b_al‐mu%E1%B8%ABta%E1%B9%A3ar_f%C4%AB_%E1%B8%A5is%C4%81b_al‐%C4%9Fabr_wa‐l‐muq%C4%81bala.jpg

http://en.wikipedia.org/wiki/Al‐Khwarizmi

Page 10: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

EfficientAlgorithms• Wantanalgorithmthat’s

– Correct– Fast– Smallspace– General– Simple– Clever

Page 11: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

EfficientAlgorithms• Mainlyinterestedinscalabilityasproblemsizegrows

Page 12: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

WhyEfficient Algorithms?• Savewaittime,storageneeds,energyconsumption/cost,…

• Scalability=win– Solvebiggerproblemsgivenfixedresources(CPU,memory,disk,etc.)

• Optimizetraveltime,scheduleconflicts,…

Page 13: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

HowtoDesignanEfficient Algorithm?

1. Definecomputational problem2. Abstract irrelevant detail3. Reducetoaproblemyoulearnhere

(or6.046oralgorithmicliterature)4. Elsedesignusing“algorithmictoolbox”5. Analyzealgorithm’sscalability6. Implement & evaluate performance7. Repeat(optimize,generalize)

Page 14: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Modules&Applications1. Introduction Document similarity2. BinarySearchTrees Scheduling3. Hashing Filesynchronization4. Sorting Spreadsheets5. GraphSearch Rubik’s Cube6. Shortest Paths Google Maps7. Dynamic Programming Justifyingtext,packing,…8. NumbersPictures(NP) Computingπ,collision

detection,hardproblem9. Beyond Folding,streaming,bio

Page 15: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

DocumentDistance

• Giventwodocuments,howsimilararethey?

• Applications:– Findsimilardocuments– Detectplagiarism/duplicates

– Websearch(one“document”isquery)

http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks/

http://www.google.com/

Page 16: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

DocumentDistance

• Howtodefine“document”?

• Word =sequenceofalphanumericcharacters

• Document=sequenceofwords– Ignorepunctuation&formatting

Page 17: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

DocumentDistance

• Howtodefine“distance”?

• Idea: focusonsharedwords

• Wordfrequencies:– =#occurrencesofwordindocument

Page 18: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

VectorSpaceModel[Salton, Wong, Yang 1975]

• Treat each document as a vector of its words– Onecoordinate foreverypossibleword

• Example:– =“thecat”– =“thedog”

• Similaritybetweenvectors?– Dotproduct:

http://portal.acm.org/citation.cfm?id=361220

‘the’

‘cat’

‘dog’

11

1

Page 19: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

VectorSpaceModel[Salton, Wong, Yang 1975]

• Problem: Dotproductnotscaleinvariant• Example1:

– =“thecat”– =“thedog”–

• Example2:– =“thecatthecat”– =“thedogthedog”–

‘the’

‘cat’

‘dog’

2

2

2

1

1 10

http://portal.acm.org/citation.cfm?id=361220

Page 20: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

VectorSpaceModel[Salton, Wong, Yang 1975]

• Idea: Normalizeby#words:

• Geometricsolution:anglebetweenvectors

– 0=“identical”, ∘ =orthogonal(nosharedwords)

‘the’

‘cat’

‘dog’

11

1

http://portal.acm.org/citation.cfm?id=361220

Page 21: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)4. Compute dot product

Page 22: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words

– re.findall(‘\w+’, doc)

– Buthowdoesthisactuallywork?3. Count wordfrequencies(documentvectors)4. Compute dot product

Page 23: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words

– Foreachlineindocument:Foreachcharacterinline:

Ifnotalphanumeric:Addpreviousword

(ifany)tolistStartnewword

3. Count wordfrequencies(documentvectors)4. Compute dot product

Page 24: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)a. Sortthewordlistb. Foreachwordinwordlist:

– Ifsameaslastword:Incrementcounter

– Else:AddlastwordanditscountertolistResetcounterto0

4. Compute dot product

Page 25: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)4. Compute dot product:

Foreverypossibleword:LookupfrequencyineachdocumentMultiplyAddtototal

Page 26: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)4. Compute dot product:

Foreverywordinfirstdocument:Ifitappearsinseconddocument:MultiplywordfrequenciesAddtototal

Page 27: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)4. Compute dot product:a. Startatfirstwordofeachdocument(insortedorder)b. Ifwordsareequal:

MultiplywordfrequenciesAddtototal

c. Inwhicheverdocumenthaslexicallylesserword,advancetonextword

d. Repeatuntileitherdocumentoutofwords

Page 28: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)a. Initializeadictionarymappingwordstocountsb. Foreachwordinwordlist:

– Ifindictionary:Incrementcounter

– Else:Put0indictionary

4. Compute dot product

Page 29: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Algorithm1. Read documents2. Split eachdocument into words3. Count wordfrequencies(documentvectors)4. Compute dot product:

Foreverywordinfirstdocument:Ifitappearsinseconddocument:MultiplywordfrequenciesAddtototal

Page 30: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

PythonImplementations

Page 31: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

PythonProfiling

Page 32: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Culprit

Page 33: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Fix

Page 34: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

PythonImplementationsdocdist1 initialversiondocdist2 addprofiling 192.5 secdocdist3 replace+ withextend 126.5secdocdist4 countfrequenciesusingdictionary 73.4 secdocdist5 splitwordswithstring.translate 18.1secdocdist6 changeinsertion sorttomergesort 11.5secdocdist7 nosorting, dotproductwithdictionary 1.8secdocdist8 split wordsonwholedocument,

notlinebyline0.2sec

ExperimentsonIntelPentium4,2.8GHz,Python2.6.2,Linux2.6.18.Document1(t2.bobsey.txt)has268,778lines,49,785words,3,354distincts.Document2(t3.lewis.txt)has1,031,470lines,182,355words,8,530distincts.

Page 35: Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1 initial version docdist2 add profiling 192.5sec docdist3 replace +with extend 126.5 sec

Don’tForget!• Webpage:http://courses.csail.mit.edu/6.006/spring11/

• Signupforrecitationifyoudidn’talreadyreceivearecitationassignmentfromus

• Sign up for problemsetserver:https://alg.csail.mit.edu/

• SignupforPiazzza accounttoask/answerquestions:http://piazzza.com/