Page 1: Introduction to Computing Using Python Data Storage and Processing  Databases and SQL  Python Database Programming  List comprehension and MapReduce

Introduction to Computing Using Python

Data Storage and Processing

Databases and SQL
Python Database Programming
List comprehension and MapReduce
Parallel Computing

Data storage

[Figure: five web pages, one.html through five.html, each annotated with its word counts]

Beijing × 3, Paris × 5, Chicago × 5
Chicago × 3, Beijing × 6
Bogota × 3, Beijing × 2, Paris × 1
Chicago × 3, Paris × 2, Nairobi × 1
Nairobi × 7, Bogota × 2

The data collected by a web crawler can be stored in a text file

Data storage

URL word count
http://reed.cs.depaul.edu/lperkovic/one.html Paris 5
http://reed.cs.depaul.edu/lperkovic/one.html Beijing 3
http://reed.cs.depaul.edu/lperkovic/one.html Chicago 5

URL link
http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/two.html
http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/three.html

URL word count
http://reed.cs.depaul.edu/lperkovic/two.html Bogota 3
http://reed.cs.depaul.edu/lperkovic/two.html Paris 1
http://reed.cs.depaul.edu/lperkovic/two.html Beijing 2

URL link
http://reed.cs.depaul.edu/lperkovic/two.html http://reed.cs.depaul.edu/lperkovic/four.html

URL word count
http://reed.cs.depaul.edu/lperkovic/four.html Paris 2
...

Data storage

A search engine app may then need to access this file to make queries such as

1. In which web pages does word X appear?
2. What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?
3. How many pages contain word X?
4. What pages have a hyperlink to page Y?
5. What is the total number of occurrences of word ‘Paris’ across all web pages?
6. How many outgoing links does each visited page have?
7. How many incoming links does each visited page have?
8. What pages have a link to a page containing word X?
9. What page containing word X has the most incoming links?

A text file is not ideal for this ...

Database files

The data collected by a web crawler can be stored in a text file ...

... or in a database file

Hyperlinks

Url         Link
one.html    two.html
one.html    three.html
two.html    four.html
three.html  four.html
four.html   five.html
five.html   one.html
five.html   two.html
five.html   four.html

Keywords

Url         Word     Freq
one.html    Beijing  3
one.html    Paris    5
one.html    Chicago  5
two.html    Bogota   3
two.html    Beijing  2
two.html    Paris    1
three.html  Chicago  3
three.html  Beijing  6
four.html   Chicago  3
four.html   Paris    2
four.html   Nairobi  5
five.html   Nairobi  7
five.html   Bogota   2

Database files

A database file consists of one or more tables

Each table has a name and consists of rows and columns

Each column has a name and contains data of a specific type

Each row is a database record

Database files

Database files are not read from or written to directly

Instead, “read/write” commands are sent to a special type of server program called a database engine that manages the database

The database engine accesses the database file on the user’s behalf

The commands accepted by database engines are statements written in the Structured Query Language (SQL)

SQL SELECT FROM statement

SQL statement SELECT is used to make queries into a database

SELECT Link FROM Hyperlinks

Result table (query on table Hyperlinks):

Link
two.html
three.html
four.html
four.html
five.html
one.html
two.html
four.html

SQL SELECT FROM statement

SQL statement SELECT is used to make queries into a database.

SELECT Url, Word FROM Keywords

Result table (query on table Keywords):

Url         Word
one.html    Beijing
one.html    Paris
one.html    Chicago
two.html    Bogota
two.html    Beijing
two.html    Paris
three.html  Chicago
three.html  Beijing
four.html   Chicago
four.html   Paris
four.html   Nairobi
five.html   Nairobi
five.html   Bogota

SQL SELECT FROM statement

SQL statement SELECT is used to make queries into a database

SELECT * FROM Hyperlinks

Result table (query on table Hyperlinks):

Url         Link
one.html    two.html
one.html    three.html
two.html    four.html
three.html  four.html
four.html   five.html
five.html   one.html
five.html   two.html
five.html   four.html
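The SELECT statements on these slides can be tried out from Python. What follows is a minimal sketch, assuming Python's standard sqlite3 module and an in-memory database (the special name ':memory:'); the table is rebuilt by hand from the Hyperlinks rows shown earlier, so the setup code is an illustration rather than part of the original slides.

```python
import sqlite3

# Assumption: an in-memory database, so nothing is written to disk.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
hyperlinks = [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
]
cur.executemany('INSERT INTO Hyperlinks VALUES (?, ?)', hyperlinks)

# Selecting one column: every record of the result table is a 1-tuple.
links = cur.execute('SELECT Link FROM Hyperlinks').fetchall()

# Selecting all columns with *: every record is a (Url, Link) tuple.
everything = cur.execute('SELECT * FROM Hyperlinks').fetchall()

print(len(links), len(everything))
con.close()
```

Both result tables have one record per record of Hyperlinks; only the number of columns differs.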

SQL DISTINCT keyword

SQL keyword DISTINCT removes duplicate records in the result table

SELECT DISTINCT Link FROM Hyperlinks

Result table (query on table Hyperlinks):

Link
two.html
three.html
four.html
five.html
one.html
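To see the effect of DISTINCT, the same query can be run with and without the keyword. This is a small sketch under the assumption of an in-memory SQLite database filled with the Hyperlinks rows from the slides; the order of the distinct records is not guaranteed by SQL, so the sketch compares them as a sorted list.

```python
import sqlite3

# Assumption: in-memory copy of the Hyperlinks table from the slides.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
cur.executemany('INSERT INTO Hyperlinks VALUES (?, ?)', [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
])

# Without DISTINCT: one record per hyperlink, so targets repeat.
all_links = [link for (link,) in cur.execute('SELECT Link FROM Hyperlinks')]

# With DISTINCT: each link target appears once.
distinct_links = [link for (link,) in
                  cur.execute('SELECT DISTINCT Link FROM Hyperlinks')]

print(len(all_links), sorted(distinct_links))
con.close()
```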

SQL WHERE clause

SQL clause WHERE is used to select only those records that satisfy a condition

SELECT Url FROM Keywords
WHERE Word = 'Paris'

Result table (query on table Keywords):

Url
one.html
two.html
four.html

“In which pages does word X appear?”

SQL WHERE clause

SQL clause WHERE is used to select only those records that satisfy a condition

SELECT Column(s) FROM Table
WHERE Column operator value

SELECT Column(s) FROM Table
WHERE Column BETWEEN value1 AND value2

Operator  Explanation
=         Equal
<>        Not equal
>         Greater than
<         Less than
>=        Greater than or equal
<=        Less than or equal
BETWEEN   Within an inclusive range
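The operators in the table can be exercised against the Keywords data. The sketch below assumes an in-memory SQLite database accessed through Python's sqlite3 module; the variable names are illustrative only.

```python
import sqlite3

# Assumption: in-memory copy of the Keywords table from the slides.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

# Equality test on a text column.
paris_pages = [url for (url,) in
               cur.execute("SELECT Url FROM Keywords WHERE Word = 'Paris'")]

# Comparison operator on an integer column.
frequent = cur.execute('SELECT * FROM Keywords WHERE Freq >= 6').fetchall()

# BETWEEN is inclusive at both ends: Freq 3 and Freq 5 both qualify.
mid_range = cur.execute(
    'SELECT * FROM Keywords WHERE Freq BETWEEN 3 AND 5').fetchall()

print(sorted(paris_pages), len(frequent), len(mid_range))
con.close()
```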

SQL keyword DESC

SQL keyword DESC is used to order the records in the result table in descending order

SELECT Url, Freq FROM Keywords
WHERE Word = 'Paris'
ORDER BY Freq DESC

Result table (query on table Keywords):

Url        Freq
one.html   5
four.html  2
two.html   1

“What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?”
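The ranking query can be checked directly. This is a minimal sketch, assuming an in-memory SQLite database that holds the Keywords rows from the slides; because the three frequencies for 'Paris' are all different, the descending order is unambiguous.

```python
import sqlite3

# Assumption: in-memory copy of the Keywords table from the slides.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

# Pages containing 'Paris', most occurrences first.
ranking = cur.execute(
    "SELECT Url, Freq FROM Keywords "
    "WHERE Word = 'Paris' ORDER BY Freq DESC").fetchall()

for url, freq in ranking:
    print(url, freq)
con.close()
```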

Exercise

Write an SQL query that returns:

1. The URL of every page that has a link to web page four.html

SELECT DISTINCT Url FROM Hyperlinks
WHERE Link = 'four.html'

Exercise

Write an SQL query that returns:

2. The URL of every page that has an incoming link from page four.html

SELECT DISTINCT Link FROM Hyperlinks
WHERE Url = 'four.html'

Exercise

Write an SQL query that returns:

3. The URL and word for every word that appears exactly three times in the web page associated with the URL

SELECT Url, Word FROM Keywords
WHERE Freq = 3

Exercise

Write an SQL query that returns:

4. The URL, word, and frequency for every word that appears between 3 and 5 times, inclusive, in the web page associated with the URL

SELECT * FROM Keywords
WHERE Freq BETWEEN 3 AND 5

SQL built-in functions

SQL includes built-in math functions such as COUNT() and SUM()

SELECT COUNT(*) FROM Keywords WHERE Word = 'Paris'

Result (query on table Keywords):

3

“How many pages contain the word Paris?”

SQL built-in functions

SQL includes built-in math functions such as COUNT() and SUM()

SELECT SUM(Freq) FROM Keywords WHERE Word = 'Paris'

Result (query on table Keywords):

8
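Both aggregate queries can be reproduced from Python. The sketch below assumes an in-memory SQLite database with the Keywords rows from the slides; an aggregate query without GROUP BY returns a single record, so fetchone() is used and its first field is read.

```python
import sqlite3

# Assumption: in-memory copy of the Keywords table from the slides.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

# COUNT(*) counts the records that satisfy the condition.
pages_with_paris = cur.execute(
    "SELECT COUNT(*) FROM Keywords WHERE Word = 'Paris'").fetchone()[0]

# SUM(Freq) adds up the Freq column over the same records.
paris_occurrences = cur.execute(
    "SELECT SUM(Freq) FROM Keywords WHERE Word = 'Paris'").fetchone()[0]

print(pages_with_paris, paris_occurrences)
con.close()
```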

SQL GROUP BY clause

SQL clause GROUP BY groups the records of a table that have the same value in a column

SELECT Url, COUNT(*) FROM Hyperlinks
GROUP BY Url

Result table (query on table Hyperlinks):

Url         COUNT(*)
one.html    2
two.html    1
three.html  1
four.html   1
five.html   3

“How many outgoing links does each web page have?”
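Grouping can be checked the same way. This is a sketch assuming an in-memory SQLite database with the Hyperlinks rows from the slides; each result record is a (value, count) pair, so the records are collected into a dictionary, which also avoids relying on any particular row order.

```python
import sqlite3

# Assumption: in-memory copy of the Hyperlinks table from the slides.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
cur.executemany('INSERT INTO Hyperlinks VALUES (?, ?)', [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
])

# Grouping by Url counts the outgoing links of each page.
outgoing = dict(cur.execute(
    'SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url').fetchall())

# Grouping by Link counts the incoming links of each page.
incoming = dict(cur.execute(
    'SELECT Link, COUNT(*) FROM Hyperlinks GROUP BY Link').fetchall())

print(outgoing)
print(incoming)
con.close()
```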

Exercise

Write an SQL query that returns:

1. The number of words, including duplicates, that page two.html contains

SELECT SUM(Freq) FROM Keywords
WHERE Url = 'two.html'

Exercise

Write an SQL query that returns:

2. The number of distinct words page two.html contains

SELECT COUNT(*) FROM Keywords
WHERE Url = 'two.html'

Exercise

Write an SQL query that returns:

3. The number of words, including duplicates, that each web page has

SELECT Url, SUM(Freq) FROM Keywords
GROUP BY Url

Exercise

Write an SQL query that returns:

4. The number of incoming links each web page has

SELECT Link, COUNT(*) FROM Hyperlinks
GROUP BY Link

SQL queries involving multiple tables

“What web pages have a link to a page containing word ‘Bogota’?”

This question requires a lookup of both tables:
• Look up Keywords to find the set S of URLs of pages containing word ‘Bogota’
• Then look up Hyperlinks to find the URLs of pages with links to pages in S

SQL queries involving multiple tables

The SELECT statement can be used on multiple tables.

SELECT * FROM Hyperlinks, Keywords

SQL queries involving multiple tables

The SELECT statement can be used on multiple tables.

SELECT * FROM Hyperlinks, Keywords

Result table:

(Hyperlinks)          (Keywords)
Url        Link       Url        Word     Freq
one.html   two.html   one.html   Beijing  3
one.html   two.html   one.html   Paris    5
one.html   two.html   one.html   Chicago  5
one.html   two.html   two.html   Bogota   3
...        ...        ...        ...      ...
five.html  four.html  four.html  Nairobi  5
five.html  four.html  five.html  Nairobi  7
five.html  four.html  five.html  Bogota   2

104 records, each a combination of a record in Hyperlinks and a record in Keywords

The result table is the cross join of tables Hyperlinks and Keywords

• It has five named columns corresponding to the two columns of table Hyperlinks and three columns of table Keywords.

SQL queries involving multiple tables

The SELECT statement can be used on multiple tables.

SELECT * FROM Hyperlinks, Keywords
WHERE Hyperlinks.Link = Keywords.Url

SQL queries involving multiple tables

The SELECT statement can be used on multiple tables.

SELECT * FROM Hyperlinks, Keywords
WHERE Hyperlinks.Link = Keywords.Url

Result table:

(Hyperlinks)           (Keywords)
Url        Link        Url         Word     Freq
one.html   two.html    two.html    Bogota   3
one.html   two.html    two.html    Beijing  2
one.html   two.html    two.html    Paris    1
one.html   three.html  three.html  Chicago  3
...        ...         ...         ...      ...
five.html  four.html   four.html   Paris    2
five.html  four.html   four.html   Nairobi  5

SQL queries involving multiple tables

SELECT * FROM Hyperlinks, Keywords WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url

“What web pages have a link to a page containing word ‘Bogota’?”

SQL queries involving multiple tables

SELECT * FROM Hyperlinks, Keywords
WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url

Result table:

(Hyperlinks)          (Keywords)
Url        Link       Url        Word    Freq
one.html   two.html   two.html   Bogota  3
four.html  five.html  five.html  Bogota  2
five.html  two.html   two.html   Bogota  3

“What web pages have a link to a page containing word ‘Bogota’?”

SQL queries involving multiple tables

SELECT Hyperlinks.Url FROM Hyperlinks, Keywords
WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url

Result table:

Url
one.html
four.html
five.html

“What web pages have a link to a page containing word ‘Bogota’?”
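The two-table query can be verified from Python. The sketch below assumes an in-memory SQLite database holding both tables from the slides. It also runs the same query written with the explicit JOIN ... ON syntax, which is not used in the slides and is shown here only as an equivalent alternative.

```python
import sqlite3

# Assumption: in-memory copies of both tables from the slides.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.executemany('INSERT INTO Hyperlinks VALUES (?, ?)', [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
])
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

# Cross join: every Hyperlinks record paired with every Keywords record.
cross = cur.execute('SELECT * FROM Hyperlinks, Keywords').fetchall()

# Keep only pairs where the link target is a page containing 'Bogota'.
pages = cur.execute(
    "SELECT Hyperlinks.Url FROM Hyperlinks, Keywords "
    "WHERE Keywords.Word = 'Bogota' "
    "AND Hyperlinks.Link = Keywords.Url").fetchall()

# The same query with explicit JOIN ... ON syntax.
pages_join = cur.execute(
    "SELECT Hyperlinks.Url FROM Hyperlinks "
    "JOIN Keywords ON Hyperlinks.Link = Keywords.Url "
    "WHERE Keywords.Word = 'Bogota'").fetchall()

print(len(cross), sorted(pages))
con.close()
```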

SQL CREATE TABLE statement

SQL statement CREATE TABLE is used to create a table in a database file

CREATE TABLE Keywords
(
    Url text,
    Word text,
    Freq int
)

Keywords

Url  Word  Freq

SQL CREATE TABLE statement

SQL statement CREATE TABLE is used to create a table in a database file

CREATE TABLE TableName
(
    Column1 dataType1,
    Column2 dataType2,
    ...
)

TableName

Column1  Column2  ...

SQL Type  Python Type  Explanation
INTEGER   int          Holds integer values
REAL      float        Holds floating-point values
TEXT      str          Holds string values, delimited with quotes
BLOB      bytes        Holds sequence of bytes
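The type correspondence in the table can be observed by storing one value of each kind and reading the record back. This is a small sketch using Python's sqlite3 module with an in-memory database; the table name Sample and its column names are hypothetical and chosen only for this illustration.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()

# Hypothetical table with one column per SQL type from the table above.
cur.execute('CREATE TABLE Sample (I INTEGER, R REAL, T TEXT, B BLOB)')
cur.execute('INSERT INTO Sample VALUES (?, ?, ?, ?)',
            (3, 2.5, 'Paris', b'\x00\x01'))

# Read the record back and inspect the Python type of each field.
row = cur.execute('SELECT * FROM Sample').fetchone()
types = [type(value) for value in row]

print(types)
con.close()
```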

SQL INSERT statement

SQL statement INSERT is used to add a record to a table

INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)

Keywords (before)

Url       Word     Freq

Keywords (after)

Url       Word     Freq
one.html  Beijing  3

SQL UPDATE statement

SQL statement UPDATE is used to modify a record in a table

UPDATE Keywords SET Freq = 4
WHERE Url = 'two.html' AND Word = 'Bogota'

Keywords (before)

Url         Word     Freq
one.html    Beijing  3
one.html    Paris    5
one.html    Chicago  5
two.html    Bogota   3
two.html    Beijing  2
two.html    Paris    1
three.html  Chicago  3
three.html  Beijing  6
four.html   Chicago  3
four.html   Paris    2
four.html   Nairobi  5
five.html   Nairobi  7
five.html   Bogota   2

Keywords (after)

Url         Word     Freq
one.html    Beijing  3
one.html    Paris    5
one.html    Chicago  5
two.html    Bogota   4
two.html    Beijing  2
two.html    Paris    1
three.html  Chicago  3
three.html  Beijing  6
four.html   Chicago  3
four.html   Paris    2
four.html   Nairobi  5
five.html   Nairobi  7
five.html   Bogota   2
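The effect of INSERT followed by UPDATE can be confirmed with a short script. The sketch assumes an in-memory SQLite database and, to stay short, inserts only the three two.html records from the slides; the Cursor attribute rowcount reports how many records the UPDATE modified.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')

# Assumption: only the two.html records are inserted, to keep this short.
cur.execute("INSERT INTO Keywords VALUES ('two.html', 'Bogota', 3)")
cur.execute("INSERT INTO Keywords VALUES ('two.html', 'Beijing', 2)")
cur.execute("INSERT INTO Keywords VALUES ('two.html', 'Paris', 1)")

# Modify exactly the record selected by the WHERE clause.
cur.execute("UPDATE Keywords SET Freq = 4 "
            "WHERE Url = 'two.html' AND Word = 'Bogota'")
modified = cur.rowcount

after = dict(cur.execute(
    "SELECT Word, Freq FROM Keywords WHERE Url = 'two.html'").fetchall())

print(modified, after)
con.close()
```

Without the WHERE clause, the UPDATE would set Freq to 4 in every record of the table.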

Standard Library module sqlite3

The Python Standard Library includes module sqlite3 that provides an API for accessing database files

• It is an interface to a library of functions that accesses the database files directly

>>> import sqlite3
>>> con = sqlite3.connect('web.db')

sqlite3 function connect() takes as input the name of a database and returns an object of type Connection, a type defined in module sqlite3

• The Connection object con is associated with database file web.db
• If database file web.db does not exist in the current working directory, a new database file web.db is created

Standard Library module sqlite3

The Python Standard Library includes module sqlite3 that provides an API for accessing database files

• It is an interface to a library of functions that accesses the database files directly

>>> import sqlite3
>>> con = sqlite3.connect('web.db')
>>> cur = con.cursor()

Connection method cursor() returns an object of type Cursor, another type defined in the module sqlite3

• Cursor objects are responsible for executing SQL statements

Standard Library module sqlite3

The Python Standard Library includes module sqlite3 that provides an API for accessing database files

• It is an interface to a library of functions that accesses the database files directly

The Cursor class supports method execute(), which takes an SQL statement as a string and executes it

>>> import sqlite3
>>> con = sqlite3.connect('web.db')
>>> cur = con.cursor()
>>> cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
<sqlite3.Cursor object at 0x100575730>
>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>

The values 'one.html', 'Beijing', and 3 in the INSERT statement are hardcoded values

Parameter substitution

In general, the values used in an SQL statement will not be hardcoded in the program but come from Python variables

>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>
>>> url, word, freq = 'one.html', 'Paris', 5
>>>

Parameter substitution

Parameter substitution is the technique used to construct SQL statements that make use of Python variable values

• similar to string formatting

>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>
>>> url, word, freq = 'one.html', 'Paris', 5
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
<sqlite3.Cursor object at 0x100575730>

The second argument, (url, word, freq), is a tuple

Parameter substitution

>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>
>>> url, word, freq = 'one.html', 'Paris', 5
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
<sqlite3.Cursor object at 0x100575730>
>>> record = ('one.html','Chicago', 5)
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)
<sqlite3.Cursor object at 0x100575730>

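Parameter substitution resembles string formatting, but the two are not interchangeable. The sketch below is an illustration only: it uses a throwaway in-memory database and a made-up word, O'Hare, to show a value that the ? placeholders handle and that a statement assembled with format() does not.

```python
import sqlite3

con = sqlite3.connect(":memory:")        # throwaway database, not web.db
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")

record = ("one.html", "O'Hare", 2)       # made-up record with an apostrophe

# Parameter substitution: the values are passed separately from the SQL text
cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)

# String formatting: the apostrophe ends the SQL string literal too early
bad_sql = "INSERT INTO Keywords VALUES ('{}', '{}', {})".format(*record)
try:
    cur.execute(bad_sql)
    formatting_failed = False
except sqlite3.OperationalError:
    formatting_failed = True

stored = cur.execute("SELECT Word FROM Keywords").fetchall()
con.close()
```

With parameter substitution the value is stored unchanged, which is one reason to prefer it whenever values come from variables.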


Changes made through a connection are not written to the database file immediately; until they are committed, they are only recorded temporarily

In order to ensure that the changes are written to the database file, the commit() method must be called on the Connection object

>>> con.commit()

A database file should be closed just like any other file

>>> con.close()
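The effect of commit() can be observed from a second connection to the same file. The following is a minimal sketch, assuming a temporary file named demo.db and a helper committed_rows() that are both invented for the illustration.

```python
import os
import sqlite3
import tempfile

def committed_rows(path):
    'opens a separate connection and returns the rows it can see'
    other = sqlite3.connect(path)
    try:
        return other.execute("SELECT * FROM Keywords").fetchall()
    finally:
        other.close()

path = os.path.join(tempfile.mkdtemp(), "demo.db")   # throwaway file
con = sqlite3.connect(path)
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
con.commit()

cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", ("one.html", "Paris", 5))
before = committed_rows(path)   # the insert is still pending
con.commit()
after = committed_rows(path)    # the insert has been written to the file
con.close()
```

The second connection sees the new record only after the commit.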


Querying a database

>>> import sqlite3
>>> con = sqlite3.connect('links.db')
>>> cur = con.cursor()
>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('one.html', 'Paris', 5), ('one.html', 'Chicago', 5),
 ('two.html', 'Bogota', 5), ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
 ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
 ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2), ('four.html', 'Nairobi', 5),
 ('five.html', 'Nairobi', 7), ('five.html', 'Bogota', 2)]

The result of a query is stored in the Cursor object

To obtain the result as a list of tuple objects, Cursor method fetchall() is used


>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x102686960>
>>> for record in cur:
        print(record)

('one.html', 'Beijing', 3)
('one.html', 'Paris', 5)
('one.html', 'Chicago', 5)
('two.html', 'Bogota', 5)
('two.html', 'Beijing', 2)
('two.html', 'Paris', 1)
('three.html', 'Chicago', 3)
('three.html', 'Beijing', 6)
('four.html', 'Chicago', 3)
('four.html', 'Paris', 2)
('four.html', 'Nairobi', 5)
('five.html', 'Nairobi', 7)
('five.html', 'Bogota', 2)

An alternative is to iterate over the Cursor object


>>> word = 'Paris'
>>> cur.execute('SELECT Url FROM Keywords WHERE Word = ?', (word,))
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html',), ('two.html',), ('four.html',)]
>>> word, n = 'Beijing', 2
>>> cur.execute("SELECT * FROM Keywords WHERE Word = ? AND Freq > ?", (word, n))
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('three.html', 'Beijing', 6)]

Parameter substitution is again used whenever Python variable values are needed in the SQL statement
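The querying techniques above (fetchall(), iterating over the cursor, and parameter substitution) can be combined in a small function. This is a sketch under assumptions: it uses an in-memory database holding only four of the records, and the function name urls_containing is hypothetical. ORDER BY is added so that the order of the results does not depend on how the rows are stored.

```python
import sqlite3

con = sqlite3.connect(":memory:")      # stand-in for the database file
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
rows = [('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
        ('two.html', 'Paris', 1), ('three.html', 'Beijing', 6)]
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", rows)
con.commit()

def urls_containing(cur, word):
    'returns the URLs of the pages whose keyword list contains word'
    cur.execute("SELECT Url FROM Keywords WHERE Word = ? ORDER BY Url", (word,))
    return [url for (url,) in cur]     # iterate over the cursor, unpack 1-tuples

paris_urls = urls_containing(cur, 'Paris')

cur.execute("SELECT Url, Freq FROM Keywords WHERE Word = ? AND Freq > ? "
            "ORDER BY Freq", ('Beijing', 2))
frequent = cur.fetchall()              # list of (Url, Freq) tuples
con.close()
```

Each record is a tuple even when a single column is selected, which is why the loop unpacks (url,).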


List comprehension

Suppose we want to construct a list from an “old” list by modifying each “old” list item in the same way

lines:    ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
newlines: ['First Line', 'Second', '', 'and Fourth.']

Method 1: accumulator pattern

>>> lines
['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
>>> newlines = []
>>> for i in range(len(lines)):
        newlines.append(lines[i][:-1])

>>> newlines
['First Line', 'Second', '', 'and Fourth.']

Method 2: list comprehension

>>> newlines = [line[:-1] for line in lines]
>>> newlines
['First Line', 'Second', '', 'and Fourth.']


The syntax of the list comprehension statement:

[<expression> for <item> in <sequence/iterator>]

More generally:

[<expression> for <item> in <sequence/iterator> if <condition>]

Examples:

>>> [line[:-1] for line in lines if line != '\n']
['First Line', 'Second', 'and Fourth.']
>>> [i for i in range(0, 20, 2)]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> [len(word) for word in ['hawk', 'hen', 'hog', 'hyena']]
[4, 3, 3, 5]
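A list comprehension with a condition is shorthand for an accumulator loop that contains an if statement. The sketch below writes the first example both ways; the variable names filtered_loop and filtered_comp are chosen only for this illustration.

```python
lines = ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']

# accumulator pattern with a condition
filtered_loop = []
for line in lines:
    if line != '\n':                     # <condition>
        filtered_loop.append(line[:-1])  # <expression>

# equivalent list comprehension
filtered_comp = [line[:-1] for line in lines if line != '\n']
```

Both lists contain the three non-empty lines with the trailing newline character removed.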


MapReduce

Suppose we would like to compute the frequency of every word in a list

So, for list

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']

the result would be

[('one', 2), ('five', 2), ('two', 1), ('three', 3)]

We have done this before using a dictionary and the accumulator loop pattern

We will now solve this problem using MapReduce


input list --(Map step)--> intermediate1 --(Partition step)--> intermediate2 --(Reduce step)--> output list

input list:    'two', 'three', 'one', 'three', 'three', 'one', 'five', 'five'

intermediate1: [('two', 1)], [('three', 1)], [('one', 1)], [('three', 1)], [('three', 1)], [('one', 1)], [('five', 1)], [('five', 1)]

intermediate2: ('two', [1]), ('three', [1,1,1]), ('one', [1,1]), ('five', [1,1])

output list:   ('two', 1), ('three', 3), ('one', 2), ('five', 2)


ch11.py

def occurrence(word):
    'returns list containing tuple (word, 1)'
    return [(word, 1)]

def occurrenceCount(keyVal):
    '''takes tuple keyVal = (key, lst) as input
       and returns (key, sum(lst))'''
    return (keyVal[0], sum(keyVal[1]))

def partition(intermediate1):
    # to do

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> intermediate1 = [occurrence(word) for word in words]
>>> intermediate2 = partition(intermediate1)
>>> [occurrenceCount(x) for x in intermediate2]
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]


ch11.py

def partition(intermediate1):
    dct = {}
    # for every list lst of intermediate1
    for lst in intermediate1:
        # for every (key, value) pair in list lst
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    # return container of (key, values) tuples
    return dct.items()     # return intermediate2
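With partition() in place, the three steps can be run end to end. The sketch below repeats the slide functions so that it is self-contained and keeps the intermediate values for inspection. The order of the pairs in the result follows the order of the dictionary keys, so it may differ from the order printed on the slides.

```python
def occurrence(word):
    'Map: returns list containing tuple (word, 1)'
    return [(word, 1)]

def partition(intermediate1):
    'Partition: groups the values of all (key, value) pairs by key'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    return dct.items()

def occurrenceCount(keyVal):
    'Reduce: takes (key, lst) and returns (key, sum(lst))'
    return (keyVal[0], sum(keyVal[1]))

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']

intermediate1 = [occurrence(word) for word in words]      # Map step
intermediate2 = list(partition(intermediate1))            # Partition step
result = [occurrenceCount(x) for x in intermediate2]      # Reduce step
```

Sorting result gives a list that does not depend on the dictionary ordering.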


MapReduce abstracted

The MapReduce framework applies to a range of problems and therefore should be abstracted:

ch11.py

def partition(intermediate1):
    # implementation here

class SeqMapReduce(object):
    'a sequential MapReduce implementation'

    def __init__(self, mapper, reducer):
        'functions mapper and reducer are problem specific'
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        'runs MapReduce on data with mapper and reducer functions'
        intermediate1 = [self.mapper(x) for x in data]       # Map
        intermediate2 = partition(intermediate1)
        return [self.reducer(x) for x in intermediate2]      # Reduce

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> smr = SeqMapReduce(occurrence, occurrenceCount)
>>> smr.process(words)
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
>>> numbers = [2,3,4,3,2,3,5,4,3,5,1]
>>> smr.process(numbers)
[(1, 1), (2, 2), (3, 4), (4, 2), (5, 2)]
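Because the mapper and reducer are parameters, the same class can solve a different problem without being changed. The sketch below groups words by length. Functions by_length and distinct_sorted are hypothetical and do not appear in ch11.py; the class and partition() are repeated (partition() in a slightly shortened form) so that the example runs on its own.

```python
def partition(intermediate1):
    'groups the values of all (key, value) pairs by key'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return dct.items()

class SeqMapReduce(object):
    'a sequential MapReduce implementation'
    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        intermediate1 = [self.mapper(x) for x in data]      # Map
        intermediate2 = partition(intermediate1)
        return [self.reducer(x) for x in intermediate2]     # Reduce

def by_length(word):
    'hypothetical mapper: the key is the length of the word'
    return [(len(word), word)]

def distinct_sorted(keyVal):
    'hypothetical reducer: removes duplicates and sorts the words'
    return (keyVal[0], sorted(set(keyVal[1])))

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
groups = sorted(SeqMapReduce(by_length, distinct_sorted).process(words))
```

Only the two problem-specific functions change; the Map, Partition and Reduce steps stay the same.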

Inverted index problem

Given several text files, we want to know which words appear in which file.

a.txt:
Paris: Miami, Miami
Tokyo, Miami

b.txt:
Tokyo Quito ... Tokyo.
Quito

c.txt:
Paris, Quito.
Cairo, Paris, Quito.

A solution to the problem could be represented as a mapping that maps each word to the list of files containing it

[('Paris', ['a.txt', 'c.txt']), ('Miami', ['a.txt']), ('Cairo', ['c.txt']), ('Quito', ['b.txt', 'c.txt']), ('Tokyo', ['a.txt', 'b.txt'])]

This mapping is called an inverted index

To apply MapReduce, we need to define the mapper and reducer functions

input list:    a.txt, b.txt, c.txt

intermediate1: [(Tokyo, a.txt), (Paris, a.txt), (Miami, a.txt)], (Tokyo, b.txt), (Quito, b.txt), (Paris, c.txt), (Cairo, c.txt)

intermediate2: (Tokyo, [a.txt, b.txt]), (Paris, [a.txt, c.txt]), (Miami, [a.txt]), (Quito, [b.txt]), (Cairo, [c.txt])

output list:   (...), (...), (...), (...), (...)


Mapper

from string import punctuation

def getWordsFromFile(file):
    'returns set of items (word, file) for every word in file'
    infile = open(file)
    content = infile.read()
    infile.close()

    # remove punctuation
    transTable = str.maketrans(punctuation, ' '*len(punctuation))
    content = content.translate(transTable)

    # construct set of items (word, file) with no duplicates
    res = set()
    for word in content.split():
        res.add((word, file))
    return res     # return intermediate1

Reducer

def getWordIndex(keyVal):
    'returns input value'
    return keyVal

intermediate2 is actually the desired list, so the reducer just copies its items to the output list
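The inverted index can be tried out without creating files. The sketch below is a variant, not the code from the slides: the hypothetical mapper words_in_text takes a (name, text) pair instead of a file name, the sample texts are made up, and the result is sorted because the mapper returns sets, whose iteration order is not fixed.

```python
from string import punctuation

def words_in_text(item):
    'hypothetical mapper: takes (name, text), returns set of (word, name)'
    name, text = item
    transTable = str.maketrans(punctuation, ' ' * len(punctuation))
    return {(word, name) for word in text.translate(transTable).split()}

def partition(intermediate1):
    'groups the values of all (key, value) pairs by key'
    dct = {}
    for pairs in intermediate1:
        for key, value in pairs:
            dct.setdefault(key, []).append(value)
    return dct.items()

# made-up sample documents standing in for a.txt, b.txt and c.txt
docs = [('a.txt', 'Paris, Miami\nTokyo, Miami'),
        ('b.txt', 'Tokyo, Quito. Tokyo'),
        ('c.txt', 'Cairo, Paris, Quito.')]

intermediate1 = [words_in_text(doc) for doc in docs]        # Map
intermediate2 = partition(intermediate1)                    # Partition
# the reducer is the identity, so only sorting remains
index = sorted((word, sorted(files)) for word, files in intermediate2)
```

Each word appears once per file in intermediate1 because the mapper builds a set, so no file name is repeated in the index.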


Module multiprocessing

Standard Library module multiprocessing includes tools that make it possible to execute Python programs in parallel on multi-core machines

How many processor cores does a given computer have? Let’s check:

>>> from multiprocessing import cpu_count
>>> cpu_count()
8

So 8 cores (your computer may have more or fewer)

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel (i.e. at the same time) on separate cores

A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on a processor core

Note: process != core


Class Pool in module multiprocessing

parallel.py

from multiprocessing import Pool

animals = ['hawk', 'hen', 'hog', 'hyena']

pool = Pool(2)                  # create pool of 2 processes
res = pool.map(len, animals)    # apply len() to every animals item

print(res)                      # print the list of string lengths

Execute this program from an OS shell (not the Python interpreter shell):

> python parallel.py
[4, 3, 3, 5]


The statement

pool.map(len, animals)

and the statement

[len(x) for x in animals]

do the same thing (they construct a list by applying len() to every item of list animals)

It is how they do it that is different: pool.map(len, animals) is executed by 2 processes, whereas [len(x) for x in animals] is executed by 1 process
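The equivalence of the two statements can be checked in one program. This is a sketch, not the slide code: the function names are invented, the pool is used as a context manager so that its processes are cleaned up, and the pool is created under an `if __name__ == '__main__':` guard, which platforms that start worker processes by re-importing the main module require.

```python
from multiprocessing import Pool

def sequential_lengths(words):
    'list comprehension: every len() call runs in the current process'
    return [len(x) for x in words]

def parallel_lengths(words, procs=2):
    'pool.map: the len() calls are distributed over procs worker processes'
    with Pool(procs) as pool:
        return pool.map(len, words)

if __name__ == '__main__':
    animals = ['hawk', 'hen', 'hog', 'hyena']
    # pool.map() returns the results in the order of the input items
    print(parallel_lengths(animals) == sequential_lengths(animals))
```

For a function as cheap as len(), starting processes costs more than the work itself; the pool pays off only for computationally intensive functions.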


Let’s verify that different processes are handling different list items

parallel2.py

from multiprocessing import Pool
from os import getpid

def length(word):
    'returns length of string word'

    # print the id of the process executing the function
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

# main program
pool = Pool(2)
res = pool.map(length, ['hawk', 'hen', 'hog', 'hyena'])
print(res)

> python parallel2.py
Process 5129 handling hawk
Process 5130 handling hen
Process 5129 handling hog
Process 5130 handling hyena
[4, 3, 3, 5]

Every process has a unique id


Parallel speedup

The benefit of using a pool of independent processes is they can be scheduled by the CPU scheduler to execute in parallel on separate cores

• This should result in faster program running time and parallel speedup

To showcase this, let’s consider a computationally intensive problem from number theory: compare the distribution of prime numbers in several ranges of integers

• Count the number of prime numbers in several equal-size ranges of 100,000 large integers

primeDensity.py

from os import getpid

def countPrimes(start):
    'returns the number of primes in range [start, start+rng)'

    rng = 100000
    formatStr = 'process {} processing range [{}, {})'
    print(formatStr.format(getpid(), start, start+rng))

    # count the numbers i in range [start, start+rng) that are prime
    # (function isprime() is not shown on the slide)
    return sum([1 for i in range(start,start+rng) if isprime(i)])
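Function isprime() is used by countPrimes() but is not shown. The sketch below is one possible stand-in based on trial division; it is an assumption, not the implementation used to produce the timings on the slides. The helper count_primes() is a quiet variant of countPrimes() with the range size as a parameter.

```python
def isprime(n):
    'returns True if n is prime, using trial division by odd numbers'
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:          # a composite n has a divisor <= sqrt(n)
        if n % i == 0:
            return False
        i += 2
    return True

def count_primes(start, rng=100000):
    'returns the number of primes in range [start, start+rng)'
    return sum(1 for i in range(start, start + rng) if isprime(i))

small = count_primes(0, 100)   # number of primes below 100
```

Trial division is slow for large integers, which is what makes the problem computationally intensive enough to benefit from several processes.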


If the Pool contains only 1 process

primeDensity.py

from multiprocessing import Pool
from time import time

def countPrimes(start):
    # not shown

if __name__ == '__main__':
    p = Pool(1)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                          # start time
    print(p.map(countPrimes,starts))
    t2 = time()                          # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

> python primeDensity.py
process 4176 processing range [12345678, 12445678)
process 4176 processing range [23456789, 23556789)
process 4176 processing range [34567890, 34667890)
process 4176 processing range [45678901, 45778901)
process 4176 processing range [56789012, 56889012)
process 4176 processing range [67890123, 67990123)
process 4176 processing range [78901234, 79001234)
process 4176 processing range [89012345, 89112345)
[6185, 5900, 5700, 5697, 5551, 5572, 5462, 5469]
Time taken: 47.84 seconds.

If the Pool contains 2 processes (p = Pool(2) in primeDensity.py)

Time taken: 24.60 seconds.

Speedup = sequential time/parallel time = 47.84/24.60 ≈ 1.94

Using 2 processes on 2 cores instead of 1 process on 1 core decreased the running time from 47.84 to 24.60 seconds

If the Pool contains 4 processes (p = Pool(4) in primeDensity.py)

Time taken: 16.78 seconds.

Speedup = 47.84/16.78 ≈ 2.85

If the Pool contains 8 processes (p = Pool(8) in primeDensity.py)

Time taken: 14.29 seconds.

Speedup = 47.84/14.29 ≈ 3.35
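The speedup figures can be recomputed from the reported running times. The sketch below also divides each speedup by the number of processes, a ratio often called efficiency; the slides do not use it, and it is included here only as an aid to reading the numbers.

```python
sequential = 47.84                         # seconds with Pool(1)
timings = {2: 24.60, 4: 16.78, 8: 14.29}   # seconds with Pool(2), (4), (8)

def speedup(t_seq, t_par):
    'speedup = sequential time / parallel time'
    return t_seq / t_par

# one row per pool size: (processes, speedup, speedup per process)
table = [(procs,
          round(speedup(sequential, t), 2),
          round(speedup(sequential, t) / procs, 2))
         for procs, t in sorted(timings.items())]

for row in table:
    print('{} processes: speedup {}, per process {}'.format(*row))
```

The speedup keeps growing with the pool size, but each added process contributes less than the one before it.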

MapReduce in parallel

MapReduce reimplemented using a pool of processes and method map()

ch12.py

from multiprocessing import Pool

class MapReduce(object):
    'a parallel implementation of MapReduce'

    def __init__(self, mapper, reducer, numProcs = None):
        'initializes map and reduce functions and process pool'
        self.mapper = mapper
        self.reducer = reducer
        self.pool = Pool(numProcs)

    def process(self, data):
        'runs MapReduce on sequence data'
        intermediate1 = self.pool.map(self.mapper, data)       # Map
        intermediate2 = partition(intermediate1)
        return self.pool.map(self.reducer, intermediate2)      # Reduce


The name cross-checking problem

Tens of thousands of previously classified documents have just been posted on the web. You want to find out which documents mention a particular person, and you want to do that for every person named in one or more documents.

• Assume that people’s names are capitalized, which helps you narrow down the words that can be proper names.

The precise problem is then: given a list of URLs (of the documents), obtain a list of pairs (proper, urlList) in which proper is a capitalized word in any document and urlList is a list of URLs of documents containing proper

In order to use MapReduce, we need to define the map and reduce functions


The map function takes a URL as input and returns a list of tuples (word, URL) for every word that is capitalized in the document identified by the URL

crosscheck.py

from urllib.request import urlopen
from re import findall

def getProperFromURL(url):
    '''returns list of items (word, url) for every capitalized word
       in the document identified by url'''

    content = urlopen(url).read().decode()
    pattern = r"[A-Z][A-Za-z'\-]*"     # RE for capitalized words
    # collect all capitalized words and remove duplicates
    propers = set(findall(pattern, content))

    res = []
    for word in propers:           # for every capitalized word
        # create pair (word, url) and append to res
        res.append((word, url))
    return res
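The regular expression can be tried on a string before it is applied to downloaded documents. The sketch below uses a made-up sentence and a hypothetical helper capitalized_words(); it also shows that words capitalized only because they start a sentence are matched, since the pattern looks at capitalization alone.

```python
from re import findall

PATTERN = r"[A-Z][A-Za-z'\-]*"     # same pattern as in getProperFromURL()

def capitalized_words(text):
    'returns the sorted distinct capitalized words in string text'
    return sorted(set(findall(PATTERN, text)))

sample = "Then Oliver met Mr. Brownlow. Oliver smiled."   # made-up text
found = capitalized_words(sample)
```

Converting the matches to a set removes the second occurrence of Oliver, just as the mapper removes duplicates within a document.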


The partition function will, for every capitalized word, collect all tuples (word, url) in every list in intermediate1 to construct list intermediate2 containing pairs (word, [url1, url2, ...])

Since intermediate2 contains the desired result (mapping of capitalized words to urls), the reducer function just returns its input

crosscheck.py

def getWordIndex(keyVal):
    'returns input value'
    return keyVal


Let’s compare the sequential and parallel implementations of MapReduce by cross-checking the proper names in 8 Charles Dickens novels:

crosscheck.py

from time import time

if __name__ == '__main__':

    urls = [    # URLs of eight Charles Dickens novels
        'http://www.gutenberg.org/cache/epub/2701/pg2701.txt',
        'http://www.gutenberg.org/cache/epub/1400/pg1400.txt',
        'http://www.gutenberg.org/cache/epub/46/pg46.txt',
        'http://www.gutenberg.org/cache/epub/730/pg730.txt',
        'http://www.gutenberg.org/cache/epub/766/pg766.txt',
        'http://www.gutenberg.org/cache/epub/1023/pg1023.txt',
        'http://www.gutenberg.org/cache/epub/580/pg580.txt',
        'http://www.gutenberg.org/cache/epub/786/pg786.txt']

    t1 = time()     # sequential start time
    SeqMapReduce(getProperFromURL, getWordIndex).process(urls)
    t2 = time()     # sequential stop time, parallel start time
    MapReduce(getProperFromURL, getWordIndex, 4).process(urls)
    t3 = time()     # parallel stop time

    print('Sequential: {:5.2f} seconds.'.format(t2-t1))
    print('Parallel:   {:5.2f} seconds.'.format(t3-t2))

> python properNames.py
Sequential: 19.89 seconds.
Parallel:   14.81 seconds.