Python Programming in Context Chapter 7

Python Programming in Context Chapter 7. Objectives To use Python lists as a means of storing data To implement a nontrivial data mining application To


Python Programming in Context

Chapter 7


Objectives

• To use Python lists as a means of storing data
• To implement a nontrivial data mining application
• To understand and implement cluster analysis
• To use visualization as a means of displaying patterns


Cluster

• Data points that have something in common
• Clusters are dissimilar to each other
• Use simple Euclidean distance to measure how close one point is to another
• Centroid is a point that represents a cluster (not necessarily a real data point)
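To illustrate the last bullet, here is a minimal sketch that computes a centroid as the per-dimension mean of the points in a cluster. The function name `centroid` and the sample points are assumptions made for illustration; they do not come from the slides.

```python
def centroid(points):
    """Return the per-dimension mean of a non-empty list of equal-length points."""
    dimensions = len(points[0])
    sums = [0] * dimensions
    for point in points:
        for i in range(dimensions):
            sums[i] = sums[i] + point[i]
    return [total / len(points) for total in sums]

cluster = [[1, 2], [3, 4], [5, 9]]   # made-up sample data
center = centroid(cluster)
print(center)              # [3.0, 5.0]
print(center in cluster)   # False
```

The second print shows that the centroid need not coincide with any real data point.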


Figure 7.1


Figure 7.2


Figure 7.3


Figure 7.4


Listing 7.1

import math

def euclidD(point1, point2):
    # Add up the squared differences in every dimension
    sum = 0
    for index in range(len(point1)):
        diff = (point1[index]-point2[index]) ** 2
        sum = sum + diff
    euclidDistance = math.sqrt(sum)
    return euclidDistance
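A quick way to sanity-check a distance function is a 3-4-5 right triangle. The sketch below uses a hypothetical `euclid_distance` helper (a stand-in written for this example, not the listing itself) and, assuming Python 3.8 or later, compares it with the standard library's `math.dist`.

```python
import math

def euclid_distance(point1, point2):
    """Euclidean distance between two points of equal dimension."""
    total = 0
    for a, b in zip(point1, point2):
        total += (a - b) ** 2
    return math.sqrt(total)

print(euclid_distance([0, 0], [3, 4]))   # 5.0
print(euclid_distance([90], [75]))       # 15.0 (one-dimensional data, e.g. exam scores)
print(math.dist([0, 0], [3, 4]))         # 5.0 (built in since Python 3.8)
```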


Figure 7.5


Listing 7.2

def readFile(filename):
    datafile = open(filename, "r")
    datadict = {}

    key = 0
    for aline in datafile:
        key = key + 1
        score = int(aline)

        datadict[key] = [score]
    return datadict


Indefinite Iteration

• Repeating a process an unknown number of times
• Control is based on a boolean expression
• Infinite loop is possible
• Any for loop can be written as a while loop
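As a small illustration of indefinite iteration, the sketch below repeats a step until a boolean condition becomes false, so the number of passes is not known in advance. The function `count_halvings` is a hypothetical example, not part of the chapter's code.

```python
def count_halvings(n):
    """Count how many integer halvings it takes to bring n down to 1 or less."""
    count = 0
    while n > 1:        # boolean condition controls the loop
        n = n // 2      # change of state: without it the loop would never end
        count = count + 1
    return count

print(count_halvings(100))   # 6
print(count_halvings(8))     # 3
```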


Listing 7.3

while <condition>:
    statement1
    statement2
    ...
    statementn


Figure 7.6


Listing 7.4

sum = 0
for anum in range(1,11):
    sum = sum + anum
print(sum)


Listing 7.5

sum = 0
anum = 1                  #initialization
while anum <= 10:         #condition
    sum = sum + anum
    anum = anum + 1       #change of state
print(sum)


Listing 7.6

sum = 0
anum = 1
while anum <= 10:
    sum = sum + anum      #anum never changes, so the condition stays true (infinite loop)
print(sum)


Listing 7.7

def readFile(filename):
    datafile = open(filename, "r")

    datadict = {}

    key = 0
    aline = datafile.readline()
    while aline != "":
        key = key + 1
        score = int(aline)
        datadict[key] = [score]

        aline = datafile.readline()
    return datadict
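For comparison, here is a hedged sketch of the same reading task using a `with` block, which closes the file automatically. The name `read_scores`, the blank-line check, and the temporary demo file are assumptions added for illustration; the listings in the slides do not skip blank lines.

```python
import os
import tempfile

def read_scores(filename):
    """Read one integer score per line into {1: [score], 2: [score], ...}."""
    datadict = {}
    key = 0
    with open(filename, "r") as datafile:   # file is closed when the block ends
        for aline in datafile:
            if aline.strip() == "":         # assumption: ignore blank lines
                continue
            key = key + 1
            datadict[key] = [int(aline)]
    return datadict

# Demo with a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("90\n75\n\n60\n")
    demo_path = tmp.name
demo_scores = read_scores(demo_path)
os.remove(demo_path)
print(demo_scores)   # {1: [90], 2: [75], 3: [60]}
```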


Creating Clusters

• Decide on number of clusters
• Choose data points to be initial centroids
• Assign each data point to the cluster of its nearest centroid
• Recompute centroids
• Repeat
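The steps above can be sketched compactly for one-dimensional data. This is a simplified illustration, not the chapter's implementation: the helper name `kmeans_1d`, the sample scores, and the fixed (rather than randomly chosen) starting centroids are all assumptions, chosen so the result is reproducible.

```python
def kmeans_1d(values, centroids, passes):
    """Cluster 1-D values around the given starting centroids for a fixed number of passes."""
    centroids = list(centroids)
    for _ in range(passes):
        # Assign every value to the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            distances = [abs(v - c) for c in centroids]
            clusters[distances.index(min(distances))].append(v)
        # Recompute each centroid as the mean of its members (skip empty clusters)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = sum(members) / len(members)
    return clusters, centroids

scores = [55, 58, 60, 88, 90, 95]            # made-up sample data
clusters, centroids = kmeans_1d(scores, [55, 95], 3)
print(clusters)    # [[55, 58, 60], [88, 90, 95]]
print(centroids)
```

With these starting points the assignment settles after the first pass; the later passes leave the clusters unchanged.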


Listing 7.8

import random

def createCentroids(k, datadict):
    centroids = []
    centroidCount = 0
    centroidKeys = []

    while centroidCount < k:
        rkey = random.randint(1,len(datadict))
        if rkey not in centroidKeys:      #skip keys that were already chosen
            centroids.append(datadict[rkey])
            centroidKeys.append(rkey)
            centroidCount = centroidCount + 1
    return centroids
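An alternative way to choose k distinct starting points is `random.sample`, which draws without replacement and so needs no retry loop. This is a sketch of a different technique than the listing uses; the name `pick_centroids` and the sample dictionary are hypothetical.

```python
import random

def pick_centroids(k, datadict):
    """Return the data points stored under k distinct, randomly chosen keys."""
    chosen_keys = random.sample(list(datadict), k)   # ValueError if k > len(datadict)
    return [datadict[key] for key in chosen_keys]

scores = {1: [90], 2: [75], 3: [60], 4: [88], 5: [42]}   # made-up sample data
starting = pick_centroids(3, scores)
print(len(starting))   # 3
```

One design difference: asking for more centroids than there are data points raises an error here, whereas a retry loop would never finish.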


Listing 7.9

def createClusters(k, centroids, datadict, repeats):
    for apass in range(repeats):
        print("****PASS",apass,"****")
        clusters = []
        for i in range(k):
            clusters.append([])

        for akey in datadict:
            distances = []
            for clusterIndex in range(k):
                dist = euclidD(datadict[akey],centroids[clusterIndex])
                distances.append(dist)

            mindist = min(distances)
            index = distances.index(mindist)

            clusters[index].append(akey)

        dimensions = len(datadict[1])


Listing 7.9 continued

        for clusterIndex in range(k):
            sums = [0]*dimensions
            for akey in clusters[clusterIndex]:
                datapoints = datadict[akey]
                for ind in range(len(datapoints)):
                    sums[ind] = sums[ind] + datapoints[ind]
            for ind in range(len(sums)):
                clusterLen = len(clusters[clusterIndex])
                if clusterLen != 0:
                    sums[ind] = sums[ind]/clusterLen
            centroids[clusterIndex] = sums

        for c in clusters:
            print("CLUSTER")
            for key in c:
                print(datadict[key], end=" ")
            print()

    return clusters


Figure 7.7


Listing 7.10

def clusterAnalysis(dataFile):
    examDict = readFile(dataFile)
    examCentroids = createCentroids(5, examDict)
    examClusters = createClusters(5, examCentroids, examDict, 3)

clusterAnalysis("cs150exams.txt")


Visualizing Clusters

• Earthquake data
• Show clusters on a map
• Use turtle module to plot data


Figure 7.8


Listing 7.11

import turtle

def visualizeQuakes(dataFile):
    datadict = readFile(dataFile)
    quakeCentroids = createCentroids(6, datadict)
    clusters = createClusters(6, quakeCentroids, datadict, 7)

    quakeT = turtle.Turtle()
    quakeWin = turtle.Screen()
    quakeWin.bgpic("worldmap.gif")
    quakeWin.screensize(448,266)
    quakeWin.setup(width=500, height=300)

    #scale factors: map longitude (-180..180) and latitude (-90..90) onto the canvas
    wFactor = (quakeWin.screensize()[0]/2)/180
    hFactor = (quakeWin.screensize()[1]/2)/90

    quakeT.hideturtle()
    quakeT.up()

    colorlist = ["red","green","blue","orange","cyan","yellow"]

    #draw each cluster's points in its own color
    for clusterIndex in range(6):
        quakeT.color(colorlist[clusterIndex])
        for akey in clusters[clusterIndex]:
            lon = datadict[akey][0]
            lat = datadict[akey][1]
            quakeT.goto(lon*wFactor,lat*hFactor)
            quakeT.dot()
    quakeWin.exitonclick()
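The scaling used for wFactor and hFactor can be examined without opening a turtle window. The sketch below isolates that arithmetic in a hypothetical helper, `to_screen`, assuming the 448 by 266 canvas from the listing with the origin at the center of the map.

```python
def to_screen(lon, lat, canvas_width=448, canvas_height=266):
    """Convert longitude/latitude in degrees to canvas coordinates centered on (0, 0)."""
    w_factor = (canvas_width / 2) / 180    # pixels per degree of longitude
    h_factor = (canvas_height / 2) / 90    # pixels per degree of latitude
    return lon * w_factor, lat * h_factor

print(to_screen(0, 0))        # (0.0, 0.0), the center of the map
print(to_screen(180, 90))     # approximately the top-right corner (224, 133)
print(to_screen(-180, -90))   # approximately the bottom-left corner (-224, -133)
```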


Figure 7.9