CEng 574 Statistical Data Analysis Volkan Atalay Fall 2014

CEng 574 Statistical Data Analysis ... · multivariate and exploratory data analysis. ... Mechanics of the course 2. Oct 2: Data, ... • Assignments and term paper and project



CEng 574 Statistical Data Analysis

Volkan Atalay

Fall 2014


Wisconsin Breast Cancer Database
Number of Instances: 699

 #  Attribute                      Domain
--  -----------------------------  -------------------
 1. Sample code number             id number
 2. Clump Thickness                1 - 10
 3. Uniformity of Cell Size        1 - 10
 4. Uniformity of Cell Shape       1 - 10
 5. Marginal Adhesion              1 - 10
 6. Single Epithelial Cell Size    1 - 10
 7. Bare Nuclei                    1 - 10
 8. Bland Chromatin                1 - 10
 9. Normal Nucleoli                1 - 10
10. Mitoses                        1 - 10
11. Class                          (benign, malignant)


5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
4,2,1,1,2,1,2,1,1,2
1,1,1,1,1,1,3,1,1,2
2,1,1,1,2,1,2,1,1,2
5,3,3,3,2,3,4,4,1,4
1,1,1,1,2,3,3,1,1,2
8,7,5,10,7,9,5,5,4,4
7,4,6,4,6,1,4,3,1,4
4,1,1,1,2,1,2,1,1,2
4,1,1,1,2,1,3,1,1,2
10,7,7,6,4,10,4,1,2,4
6,1,1,1,2,1,3,1,1,2
7,3,2,10,5,10,5,4,4,4
10,5,5,3,6,7,7,10,1,4
3,1,1,1,2,1,2,1,1,2
1,1,1,1,2,1,3,1,1,2
5,2,3,4,2,7,3,6,1,4
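A minimal Python sketch of how records like these might be read and summarized. It assumes each record holds the nine 1-10 attributes followed by the class, with the sample code number already dropped (the slide shows ten values per record, not eleven), and that class code 2 means benign and 4 means malignant. That coding is an assumption taken from the usual UCI description of this data set; the slide itself only says "(benign, malignant)". The names `RAW`, `CLASS_NAMES`, and `mean_thickness` are illustrative.

```python
from collections import Counter

# The 25 records shown on the slide, one per line.
RAW = """\
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
2,1,1,1,2,1,1,1,5,2
4,2,1,1,2,1,2,1,1,2
1,1,1,1,1,1,3,1,1,2
2,1,1,1,2,1,2,1,1,2
5,3,3,3,2,3,4,4,1,4
1,1,1,1,2,3,3,1,1,2
8,7,5,10,7,9,5,5,4,4
7,4,6,4,6,1,4,3,1,4
4,1,1,1,2,1,2,1,1,2
4,1,1,1,2,1,3,1,1,2
10,7,7,6,4,10,4,1,2,4
6,1,1,1,2,1,3,1,1,2
7,3,2,10,5,10,5,4,4,4
10,5,5,3,6,7,7,10,1,4
3,1,1,1,2,1,2,1,1,2
1,1,1,1,2,1,3,1,1,2
5,2,3,4,2,7,3,6,1,4
"""

# Assumption: 2 = benign, 4 = malignant (not stated on the slide).
CLASS_NAMES = {2: "benign", 4: "malignant"}

# Parse each line into a list of ten integers.
records = [[int(x) for x in line.split(",")] for line in RAW.splitlines()]

# Count records per class code (last column).
counts = Counter(r[-1] for r in records)

# Mean of the first attribute (Clump Thickness) within each class.
mean_thickness = {
    c: sum(r[0] for r in records if r[-1] == c) / counts[c] for c in counts
}

for code in sorted(counts):
    print(CLASS_NAMES[code], counts[code], round(mean_thickness[code], 2))
```

Even this small tally hints at why the attributes are useful: the two classes differ clearly in their average clump thickness.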


Projection by PCA


Projection by PCA
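The figures for these slides did not survive, so the following is only a sketch of what a two-dimensional PCA projection involves, not the method used to produce them. It uses the first eight records of the breast cancer sample with the class column dropped, centers the data, and finds the two leading eigenvectors of the covariance matrix by power iteration with deflation. The pure-Python approach and all function names are assumptions chosen to keep the example self-contained; in practice a linear algebra library (or R's `prcomp`) would be the usual choice.

```python
import math
import random

# First eight records from the sample data, class column dropped.
records = [
    [5, 1, 1, 1, 2, 1, 3, 1, 1],
    [5, 4, 4, 5, 7, 10, 3, 2, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1],
    [6, 8, 8, 1, 3, 4, 3, 7, 1],
    [4, 1, 1, 3, 2, 1, 3, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1],
    [1, 1, 1, 1, 2, 10, 3, 1, 1],
    [2, 1, 2, 1, 2, 1, 3, 1, 1],
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]


def center(rows):
    """Subtract the column means so every attribute has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=1000, seed=0):
    """Power iteration: dominant eigenvalue and its unit eigenvector."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return dot(v, mat_vec(matrix, v)), v


def deflate(matrix, eigenvalue, v):
    """Remove the component just found so the next one becomes dominant."""
    d = len(matrix)
    return [[matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)]


centered = center(records)
cov = covariance(centered)
var1, pc1 = top_eigenpair(cov)                      # first principal component
var2, pc2 = top_eigenpair(deflate(cov, var1, pc1))  # second principal component

# Coordinates of each record in the plane spanned by pc1 and pc2.
projected = [(dot(row, pc1), dot(row, pc2)) for row in centered]
for point in projected:
    print("%8.3f %8.3f" % point)
```

Plotting the two columns of `projected` against each other gives a scatter plot of the kind a "Projection by PCA" slide typically shows.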


‘Poverty map’ based on 39 indicators from World Bank statistics (1992)


Poverty Map


Notion of a Cluster can be Ambiguous

How many clusters?

Possible groupings of the same points: Two Clusters, Four Clusters, Six Clusters
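The ambiguity can be illustrated with a small sketch. The twelve points below are made up for this example (they are not the points from the slide's figure): six tight pairs, arranged so that pairs B/C and E/F sit close together and the left and right halves are far apart. A plain k-means with a deterministic farthest-first initialization (an assumption chosen here for reproducibility, not a course requirement) gives a sensible partition for k = 2, 4, and 6 alike.

```python
from collections import Counter

# Hypothetical toy data: six tight pairs of points (A..F).
points = [
    (-1, 0), (1, 0),      # A
    (-1, 20), (1, 20),    # B
    (-1, 26), (1, 26),    # C (close to B)
    (99, 0), (101, 0),    # D
    (99, 20), (101, 20),  # E
    (99, 26), (101, 26),  # F (close to E)
]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(pts, k):
    """Deterministic init: start at the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [pts[0]]
    while len(centers) < k:
        centers.append(
            max(pts, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def k_means(pts, k, max_iter=100):
    """Lloyd's algorithm; returns labels, centers, within-cluster SSE."""
    centers = farthest_first_centers(pts, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in pts]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(pts, labels))
    return labels, centers, sse


results = {}
for k in (2, 4, 6):
    labels, centers, sse = k_means(points, k)
    results[k] = (labels, sse)
    sizes = sorted(Counter(labels).values())
    print("k =", k, "cluster sizes:", sizes, "SSE:", round(sse, 2))
```

The within-cluster sum of squares always drops as k grows, so it cannot by itself say which k is "right"; that question is what cluster validity methods try to answer.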


Instructor: Volkan Atalay
Phone: 210 2108
E-mail: [email protected]

Class: Tuesday 9:40-12:30 (A-101)

Office hour: by appointment
Course web page: http://www.ceng.metu.edu.tr/courses/ceng574/


Course Objectives

•  The objective of this course is to introduce the concepts and techniques of clustering and of multivariate and exploratory data analysis.

•  This course also offers an opportunity to perform data analysis by using data visualization and projection.

•  In addition, it allows students to apply these techniques in a specific field, such as bioinformatics.

•  Prerequisites: knowledge of programming, probability, and linear algebra.


Main Reference Book

E. Alpaydın (2010) Introduction to Machine Learning. 2nd Edition, The MIT Press.

Yapay Öğrenme, Turkish language edition, translated by the author, Boğaziçi Üniversitesi Yayınevi, April 2011

http://www.cmpe.boun.edu.tr/~ethem/i2ml2e/


Other Reference Books

•  W. Härdle and L. Simar (2007) Applied Multivariate Statistical Analysis. Springer.

•  A. K. Jain and R. C. Dubes (1988) Algorithms for Clustering Data. Prentice Hall. (freely available online) http://www.cse.msu.edu/%7Ejain/Clustering_Jain_Dubes.pdf

•  S. Theodoridis and K. Koutroumbas (2003) Pattern Recognition. 2nd Edition. Academic Press.

•  B. Everitt, S. Landau, and M. Leese (2001) Cluster Analysis. 4th Edition. Edward Arnold Pubs. Ltd.

•  A. Webb (2002) Statistical Pattern Recognition. Wiley, New York.

•  R. O. Duda, P. E. Hart, and D. G. Stork (2001) Pattern Classification. 2nd Edition. John Wiley.


Grading

•  Assignments: 40
•  Term Paper/Report: 20
•  Presentations: 30
•  Attendance and class participation: 10


Course Outline

1.  Sept 25: Syllabus distribution; getting to know each other; overview of ML and PR; mechanics of the course
2.  Oct 2: Data, measurements, features, similarities
3.  Oct 9: Review of probability; R; sample assignments; projects from previous years
4.  Oct 16: Data set presentations by students
5.  Oct 23: Linear projections and principal component analysis
6.  Oct 30: Non-linear projections and multi-dimensional scaling
7.  Nov 6: Clustering: hierarchical clustering, k-means clustering, and their variations
8.  Nov 13: Clustering by mixture of Gaussians and the EM algorithm; evaluation and validity of clusters
9.  Nov 20: Choosing a paper; information about advanced topics
10. Nov 27: Bioinformatics research; discussion of advanced topic papers
11. Dec 4: Presentations by students
12. Dec 11: Presentations by students
13. Dec 18: Presentations by students
14. Dec 25: Presentations by students


Assignments

0.  Reading and Report: due Oct 2
1.  Data Set selection: due Oct 16
2.  Projections 1: due Oct 30
3.  Projections 2: due Nov 6
4.  Clustering: due Nov 13
5.  Validation: due Nov 20

Also: decide on an advanced topic and paper by Nov 29.


Presentations

•  Data Set (all)
•  Weka, Python (2 persons)
•  Projections: PCA (1 person)
•  Projections: MDS, GTM, LLE, Isomap (1 person)
•  Clustering: k-means, hierarchical (1 person)
•  Validation (1 person)
•  Advanced Topics (paper; all except the above 6 persons)
   –  Semi-Supervised Clustering
   –  Kernel-based Clustering
   –  Manifold Learning and Clustering
   –  Spectral Clustering


•  Assignments, the term paper, and the project should be done on an individual basis.

•  Note that R seems to be the most convenient environment for the computational work in this course. http://www.r-project.org/


Learning Management System

METUCLASS-Moodle http://metuclass.metu.edu.tr


Resources: web pages and tutorials

•  Webpages
   •  http://www.sciencemag.org/site/feature/data/compsci/machine_learning.xhtml
   •  http://dataclustering.cse.msu.edu/
   •  http://en.wikipedia.org/wiki/Machine_learning

•  Tutorials
   •  http://homepages.inf.ed.ac.uk/rbf/IAPR/researchers/MLPAGES/mltut.htm
   •  https://www.coursera.org/course/ml


Resources: Datasets

UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html

UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html

Statlib: http://lib.stat.cmu.edu/

Delve: http://www.cs.utoronto.ca/~delve/


Resources: Journals

•  Journal of Machine Learning Research (www.jmlr.org)
•  Machine Learning
•  Neural Computation
•  Neural Networks
•  IEEE Transactions on Neural Networks
•  IEEE Transactions on Pattern Analysis and Machine Intelligence
•  Annals of Statistics
•  Journal of the American Statistical Association
•  ...


Resources: Conferences

•  International Conference on Machine Learning (ICML)
•  European Conference on Machine Learning (ECML)
•  Neural Information Processing Systems (NIPS)
•  Uncertainty in Artificial Intelligence (UAI)
•  Computational Learning Theory (COLT)
•  International Conference on Artificial Neural Networks (ICANN)
•  International Conference on AI & Statistics (AISTATS)
•  International Conference on Pattern Recognition (ICPR)
•  ...


Resources: Computational

•  MATLAB, R, Weka, Python
•  Machine Learning Open Source Software

•  http://jmlr.org/mloss/ and http://mloss.org/software/

•  http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Software/ and http://www.dmoz.org/Science/Math/Statistics/Software/

•  http://www.cs.ubc.ca/~murphyk/Teaching/CS540_Fall05/software.html


•  Sergey Brin's mother was diagnosed with Parkinson’s in 1999.

•  In 2006, his wife-to-be, Anne Wojcicki, started the personal genetics company 23andMe (Google is an investor).

•  As an alpha tester, Brin had the chance to get an early look at his genome.

•  He looked up a spot known as G2019S, the notch on the LRRK2 gene where an adenine nucleotide, the A in the ACTG code of DNA, sometimes substitutes for a guanine nucleotide, the G. And there it was: he had the mutation. His mother’s 23andMe readout showed that she had it, too.

Article http://www.wired.com/magazine/2010/06/ff_sergeys_search/all/1


Article http://www.wired.com/magazine/2010/06/ff_sergeys_search/all/1

Sergey Brin’s Search for a Parkinson’s Cure
WIRED Magazine, July (August) 2010, p. 124-133.

Most Parkinson’s research, like much of medical research, relies on the classic scientific method: hypothesis, analysis, peer review, publication.

Brin proposes a different approach, one driven by computational muscle and staggeringly large data sets. It’s a method that draws on his algorithmic sensibility—and Google’s storied faith in computing power—with the aim of accelerating the pace and increasing the potential of scientific research.

“Generally the pace of medical research is glacial compared to what I’m used to in the Internet,” Brin says. “We could be looking lots of places and collecting lots of information. And if we see a pattern, that could lead somewhere.”



Increasingly, though, scientists—especially those with a background in computing and information theory—are starting to wonder if that model could be inverted.

Why not start with tons of data, a deluge of information, and then wade in, searching for patterns and correlations?


Assignment #0: Alternatives

Read the article and the comments, and do one of the following in at most one page:

•  Write your comments, OR
•  Write a letter to the author about the article, OR
•  Write a review of the article as a referee.

Due: October 2, 2014