Knowledge Discovery and Data Mining 1 (VO)(706.701)
Denis Helic
ISDS, TU Graz
March 2, 2020
Denis Helic (ISDS, TU Graz) KDDM1 March 2, 2020 1 / 63
Lecturer
Name: Denis Helic
Office: ISDS, Petersgasse 116, Room 026
Office hours: Tuesday 12:00-13:00
Phone: +43-316/873-30610
Email: [email protected]
Lecturer
Name: Roman Kern
Office: Know-Center, Inffeldgasse 13, 6th Floor, Room 072
Office hours: By appointment
Phone: +43-316/873-30860
Email: [email protected]
Lecturer
Name: Tiago Santos
Office: ISDS, Inffeldgasse 16c, 1st Floor
Office hours: By appointment
Phone: +43-316/873-5607
Email: [email protected]
Language
Lectures in English
Communication in German/English
Examination: German/English
Outline
1 Welcome and Introduction
2 Course Organization
3 Motivation
4 Course Overview
5 Course Highlights
6 Practical Part: KDDM1 KU
Welcome and Introduction
Teaching Data Science @ ISDS
Introduction to AI & Data Science
Computational Methods for Statistics
Data Analysis Courses:
  Knowledge Discovery and Data Mining 1 (Basics and theory)
  Knowledge Discovery and Data Mining 2 (Applications)
  Visual Analytics
Analysis of Web Systems & Data:
  Computational Social Systems I (Basics)
  Computational Social Systems II
  Network Science (Theory and applications)
Infrastructure:
  Data Management
  Architecture of Database Systems
  Architecture of Machine Learning Systems
Course context
Knowledge Discovery and Data Mining 1 (VO) (706.701)
Obligatory course Master Software Development and Business (1st Semester)
Obligatory elective course in subject catalog “Knowledge Technologies” (Computer Science)
New major/minor system: Obligatory for Data Science and Intelligent Systems
Course context
Knowledge Discovery and Data Mining 1 (KU) (706.702)
New major/minor system: Obligatory for Data Science and Intelligent Systems
An add-on for the theoretical part
Goals of the course
The overall goal of KDDM and related courses is to learn how to discover patterns and models in data. We aim to discover patterns that are:
i Valid: hold for new data with high probability
ii Useful: we can base further actions on them
iii Unexpected: non-obvious
iv Understandable: humans can interpret them
Goals of the course: patterns example
1854 Broad Street cholera outbreak
Extracting clusters of cholera cases in the city of London in 1854. The cases clustered around some road intersections in London. These had contaminated water wells.
http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
Goals of the course
Specific goals of this course are to learn about two out of three basic elements of that discovery:
i Tools: advanced mathematical tools from probability theory, linear algebra, information theory, and statistical inference
ii Infrastructure: models of computation for large data (handled in other courses)
iii Process: steps that are needed to discover patterns
I assume here that you already know
i How to program and develop software
ii Mathematical basics from probability theory and linear algebra
Goals of the course
Student goals: to pass the examination
Bonus goal for all: to have fun!
Course Organization
Course Calendar
02.03.2020: Course organization, Introduction and Motivation (Denis)
09.03.2020: Statistical Data Science (Roman)
16.03.2020: Feature Extraction (Roman)
23.03.2020: Feature Engineering (Roman)
Course Calendar
30.03.2020: Data Matrices (Denis)
20.04.2020: Review of Linear Algebra (Denis) / Project presentations (KU)
27.04.2020: Partial Exam 1
04.05.2020: Principal Component Analysis (Denis)
11.05.2020: Singular Value Decomposition (Denis)
Course Calendar
18.05.2020: Recommender Systems: Matrix Factorization (Denis)
25.05.2020: Topic Modeling and Non-negative Matrix Factorization (Denis)
08.06.2020: Clustering (Roman)
15.06.2020: Classification (Denis)
22.06.2020: Evaluation (Denis) / Project presentations (KU)
29.06.2020: Partial Exam 2 / Final Examination
Course Logistics
Course website: https://courses.isds.tugraz.at/dhelic/kddm1/index.html
Slides are (will be) available on the course website
Additional readings, references, links, etc. are also on the website
We expect that you have basic knowledge of probability theory and linear algebra
To refresh this knowledge you should solve these problems!
These problems are not graded!
As a side note: we also expect that you know how to program (relevant for the practical part)
Grading
Two partial examinations
Final examination at end of June
Two additional examination dates in summer semester
Three examination dates in winter semester
Examination material: lectures/slides/further readings
In class we will discuss sample examination questions
Partial Examinations
2 written examinations
At the beginning of a lecture: max. 45 minutes
Each partial examination has 2 questions
Difficulty adjusted to solve both problems in approx. 30 minutes
Max 20 points for each question
Total points: 80 (both partial examinations combined)
Examination
Written examination 90 minutes
4 questions
Difficulty adjusted to solve all four in approx. 60 minutes
Max 20 points for each question
Total points: 80
Important!
If you take partial examinations it counts as one examination attempt
In other words: if you fail the partial examinations, you will get a negative grade in TUGOnline
Grading
0-40 points: 5
41-50 points: 4
51-60 points: 3
61-70 points: 2
71-80 points: 1
Important!
If you take partial examinations it counts as one examination attempt
In other words: if you fail the partial examinations, you will get a negative grade in TUGOnline
Worst case scenario I: if you get 0 points at the first partial examination you have already failed (the second partial examination can add at most 40 points, which is still grade 5)!
In that case you can take the final examination in June
All together this will count as two attempts
Important!
Worst case scenario II: if you are negative after the second partial examination you need to take the exam in the winter term
All together this will count as two attempts
Therefore: take the partial examinations only if you follow the lecture and learn in parallel
That is also the advantage: you will learn as you go and not everything at once
KU Organization
KU organization today after this lecture
Two presentations
Questions?
Raise them now (+1 +1)
Ask after the lecture (+1)
Visit me in the office hours (+1)
Send me an e-mail (±1)
As a side note: you should(!) interrupt me immediately (+1 +1 +1) and ask any question you might have during the lecture
Motivation
How much information is being produced?
Figure: Source: https://www.domo.com/learn/data-never-sleeps-7
We are producing more data than we are able to store
We need to extract and describe useful data
Useful data ≪ all data
We can store useful data
We can also try to predict future data: store only prediction model
It is a challenge but also an opportunity
For example, learn about human behavior, spread of diseases, political behavior, etc.
Course Overview
Examples
Amazon product recommendations
i Valid: holds for new data with high probability
ii Useful: users can find and explore new products
iii Unexpected: non-obvious and non-trivial
iv Understandable: related articles, etc.
Examples
Twitter earthquake and typhoon prediction
i Valid: holds for new data with high probability
ii Useful: can save lives
iii Unexpected: non-obvious and non-trivial
iv Understandable: trajectories of typhoons, positions of earthquakes
Knowledge discovery vs. data mining
Knowledge discovery refers to the entire process, of which knowledge is the end-product
It is iterative and interactive
Data mining refers to a specific step in this process
It is the step consisting of applying data analysis and discovery algorithms that produce a particular enumeration of patterns over data
Additional steps are necessary to ensure that the process produces useful knowledge
Steps in the knowledge discovery process
1 Developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer's viewpoint
2 Creating a target data set: selecting a data set or focusing on a subset of variables or data samples on which discovery is to be performed
3 Data cleaning and preprocessing: basic operations such as the removal of noise. If appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, accounting for time sequence information and known changes
Steps in the knowledge discovery process
4 Data reduction and projection: finding useful features to represent the data depending on the goal of the task. Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data
5 Matching the goals of the KDD process (step 1) to a particular data mining method, e.g. summarization, classification, regression, clustering, etc.
6 Choosing the data mining algorithms: selecting methods to be used for searching for patterns in the data. This includes deciding which models and parameters may be appropriate, e.g. models for categorical data are different from models on vectors over the reals. Matching a particular data mining method with the overall criteria of the KDD process, e.g. the end user may be more interested in understanding the model than in its predictive capabilities
Steps in the knowledge discovery process
7 Data mining: searching for patterns of interest in a particular representational form or a set of such representations: classification rules or trees, regression, clustering and so forth. The user can significantly aid the data mining method by correctly performing the preceding steps
8 Interpreting mined patterns: possibly return to any of the steps for further iteration. This step can also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models
9 Consolidating discovered knowledge: incorporating this knowledge into another system for further action or simply documenting it and reporting it to interested parties. This also includes checking for and resolving potential conflicts with previously believed or extracted knowledge
Steps in the knowledge discovery process
Reading!
Knowledge Discovery and Data Mining: Towards a Unifying Framework (1996) Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth
Quick example of what it all means
1 Collect training data (e.g. crawl, clean, preprocess)
2 Represent examples (e.g. decide which features, how to weight them, etc.)
3 Distance measure (e.g. what is close vs. what is not close)
4 Measure the goodness (e.g. objective function)
5 Select an approach (e.g. optimization method)
Quick example of what it all means
Day    Sunny  Normal Humidity  Strong Wind  Temp. (°C)  Play Tennis
Day1   Yes    No               Yes          12          No
Day2   No     No               No           18          No
Day3   Yes    Yes              No           21          Yes
Day4   Yes    Yes              No           28          Yes
Day5   Yes    Yes              No           19          ?
Table: Should I play tennis on Day5?
Quick example of what it all means
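Definition
Minkowski distance of order $p$ between two vectors $\mathbf{x}$ and $\mathbf{y}$ from $\mathbb{R}^n$ is given by:

$$d_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \tag{1}$$

What is $d_2(\mathbf{x}, \mathbf{y})$? Euclidean distance
What is $d_1(\mathbf{x}, \mathbf{y})$? Manhattan distance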
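As a minimal sketch of definition (1), the distance can be written in a few lines of plain Python. The function name `minkowski` and the two sample vectors are chosen here for illustration only and do not come from the slides.

```python
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the p-th powers of the coordinate-wise absolute differences,
    # then take the p-th root, exactly as in equation (1).
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative vectors (hypothetical, not from the lecture).
euclidean = minkowski([0, 0], [3, 4], 2)  # p = 2: Euclidean distance
manhattan = minkowski([0, 0], [3, 4], 1)  # p = 1: Manhattan distance
```

For these two points the Euclidean distance is 5 and the Manhattan distance is 7, so the choice of $p$ already changes what counts as "close".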
Quick example of what it all means
       Day1         Day2         Day3         Day4         Day5
Day1   0.           6.164414     9.11043358   16.0623784   7.14142843
Day2   6.164414     0.           3.31662479   10.09950494  1.73205081
Day3   9.11043358   3.31662479   0.           7.           2.
Day4   16.0623784   10.09950494  7.           0.           9.
Day5   7.14142843   1.73205081   2.           9.           0.
Table: Euclidean distances between days
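The table can be reproduced with a short sketch, assuming the Yes/No features are encoded as 1/0 and the temperature is used in °C as given. The slides do not state this encoding, but it yields the values shown. The names `DAYS`, `euclidean`, and `dist` are illustrative.

```python
import math

# Features per day: Sunny, Normal Humidity, Strong Wind (Yes=1, No=0), Temp. (°C).
# The 1/0 encoding is an assumption; it reproduces the table above.
DAYS = {
    "Day1": [1, 0, 1, 12],
    "Day2": [0, 0, 0, 18],
    "Day3": [1, 1, 0, 21],
    "Day4": [1, 1, 0, 28],
    "Day5": [1, 1, 0, 19],
}


def euclidean(x, y):
    """Euclidean distance (Minkowski distance of order 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


names = list(DAYS)
# Full pairwise distance matrix, keyed by (row day, column day).
dist = {(a, b): euclidean(DAYS[a], DAYS[b]) for a in names for b in names}

for a in names:
    print(a, " ".join(f"{dist[a, b]:9.4f}" for b in names))
```

Note how the temperature column dominates: its differences are up to 16, while each binary feature contributes at most 1.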
Quick example of what it all means
       Day1   Day2   Day3   Day4   Day5
Day1   0.     8.     11.    18.    9.
Day2   8.     0.     5.     12.    3.
Day3   11.    5.     0.     7.     2.
Day4   18.    12.    7.     0.     9.
Day5   9.     3.     2.     9.     0.
Table: Manhattan distances between days
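One way to see why the choice of distance matters is to ask which labelled day is closest to Day5 under each measure. The sketch below uses a 1-nearest-neighbour rule, which is an assumption made here for illustration; the slides do not name a classifier. The 1/0 feature encoding is likewise assumed, and `TRAIN`, `DAY5`, and `nearest` are hypothetical names.

```python
import math

# (features, Play Tennis) for the labelled days; Yes=1, No=0 encoding assumed.
TRAIN = {
    "Day1": ([1, 0, 1, 12], "No"),
    "Day2": ([0, 0, 0, 18], "No"),
    "Day3": ([1, 1, 0, 21], "Yes"),
    "Day4": ([1, 1, 0, 28], "Yes"),
}
DAY5 = [1, 1, 0, 19]


def manhattan(x, y):
    """Manhattan distance (Minkowski distance of order 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance (Minkowski distance of order 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(query, train, distance):
    """Return (name, label) of the training example closest to query."""
    name = min(train, key=lambda n: distance(query, train[n][0]))
    return name, train[name][1]


by_manhattan = nearest(DAY5, TRAIN, manhattan)
by_euclidean = nearest(DAY5, TRAIN, euclidean)
print("Manhattan:", by_manhattan)
print("Euclidean:", by_euclidean)
```

Under the Manhattan distance the closest day is Day3 (distance 2, label Yes), while under the Euclidean distance it is Day2 (distance about 1.73, label No), consistent with the two tables.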
Quick example of what it all means
       Day1         Day2         Day3         Day4         Day5
Day1   0.           3.72174037   3.66840663   4.47507277   3.5008579
Day2   3.72174037   0.           3.27940612   3.76436189   3.23329618
Day3   3.66840663   3.27940612   0.           1.35622245   0.38749213
Day4   4.47507277   3.76436189   1.35622245   0.           1.74371458
Day5   3.5008579    3.23329618   0.38749213   1.74371458   0.
Table: Euclidean distances between days (standardized features)
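A sketch of the standardization step, under two assumptions that are not stated on the slide but reproduce its numbers: each feature is z-scored over all five days, and the population standard deviation (divide by n, not n-1) is used. The helper name `standardize` is illustrative.

```python
import math

# Sunny, Normal Humidity, Strong Wind (Yes=1, No=0 assumed), Temp. (°C).
DAYS = {
    "Day1": [1, 0, 1, 12],
    "Day2": [0, 0, 0, 18],
    "Day3": [1, 1, 0, 21],
    "Day4": [1, 1, 0, 28],
    "Day5": [1, 1, 0, 19],
}


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population std.

    Assumes no column is constant (a zero std would divide by zero).
    """
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


names = list(DAYS)
z = dict(zip(names, standardize([DAYS[n] for n in names])))

d35 = euclidean(z["Day3"], z["Day5"])
d25 = euclidean(z["Day2"], z["Day5"])
print(f"Day3-Day5: {d35:.8f}")
print(f"Day2-Day5: {d25:.8f}")
```

After standardization every feature has unit variance, so temperature no longer dominates and Day3 becomes by far the closest day to Day5.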
Quick example of what it all means
Day    Sunny  Normal Humidity  Strong Wind  Temp. (°C)  Temp. (°F)  Play Tennis
Day1   Yes    No               Yes          12          53.6        No
Day2   No     No               No           18          64.4        No
Day3   Yes    Yes              No           21          69.8        Yes
Day4   Yes    Yes              No           28          82.4        Yes
Day5   Yes    Yes              No           19          66.2        ?
Table: Should I play tennis on Day5?
Quick example of what it all means
       Day1   Day2   Day3   Day4   Day5
Day1   0.     18.8   27.2   46.8   21.6
Day2   18.8   0.     10.4   30.    4.8
Day3   27.2   10.4   0.     19.6   5.6
Day4   46.8   30.    19.6   0.     25.2
Day5   21.6   4.8    5.6    25.2   0.
Table: Manhattan distances between days using Fahrenheit and Celsius
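The effect of the redundant Fahrenheit column can be checked with a small sketch. It assumes the same 1/0 encoding as before and derives °F from °C with F = 1.8 * C + 32, which matches the values in the table; `with_fahrenheit` is a hypothetical helper name.

```python
# Sunny, Normal Humidity, Strong Wind (Yes=1, No=0 assumed), Temp. (°C).
DAYS = {
    "Day1": [1, 0, 1, 12],
    "Day2": [0, 0, 0, 18],
    "Day3": [1, 1, 0, 21],
    "Day4": [1, 1, 0, 28],
    "Day5": [1, 1, 0, 19],
}


def manhattan(x, y):
    """Manhattan distance (Minkowski distance of order 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def with_fahrenheit(row):
    """Append the temperature in °F as an extra (redundant) feature."""
    return row + [1.8 * row[-1] + 32]


others = ["Day1", "Day2", "Day3", "Day4"]

# Distances from Day5 using Celsius only.
celsius_only = {n: manhattan(DAYS[n], DAYS["Day5"]) for n in others}

# Distances from Day5 when temperature is counted twice (°C and °F).
both_units = {
    n: manhattan(with_fahrenheit(DAYS[n]), with_fahrenheit(DAYS["Day5"]))
    for n in others
}

print("closest (°C only):", min(celsius_only, key=celsius_only.get))
print("closest (°C + °F):", min(both_units, key=both_units.get))
```

With Celsius only, Day3 is closest to Day5 (2 versus 3 for Day2). Counting temperature twice raises Day3 to 5.6 and Day2 to 4.8, so the nearest day flips to Day2 even though no new information was added.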
Big picture: KDDM
Figure: the Knowledge Discovery Process rests on Mathematical Tools (Probability Theory, Linear Algebra, Information Theory, Statistical Inference) and Infrastructure (Map-Reduce)
Teaching Data Science @ ISDS and KDDM1
KDDM1: Basics, theory and KDD process until step 8 (no visualization, no interpretation)
KDDM2: Implementation and practice of the theory from KDDM1
Visual Analytics: KDD process steps 8 and 9 (interpretation and visualization)
Network Science: graph and network mining
Computational Social Systems I & II: Web systems and social computation
Data mining and other fields
Data mining overlaps with
i Databases: Large-scale data, simple queries
ii Machine learning: Small data, complex models, model parameters
iii Statistics: Theory, predictive models, no algorithms
Data mining: Algorithms, simple and predictive models, large-scale data
Data mining and other fields
Figure: diagram of Data Mining overlapping with Statistics, Machine Learning, and Databases
Course Highlights
KDDM1: some thoughts
Big data, data science, …
Many buzzwords
But in the end:
i You need to know how to program and how to develop software systems
ii You need to understand the math behind data analysis: linear algebra, probability, statistics
If you have solid knowledge in (i) and (ii) you are in the top 5% of developers in the field ;)
For the years to come you will be earning a lot of money
KDDM1: some highlights
Recommender systems: decompose the matrix and find out what your users like :)
PCA, SVD: reduce the dimensionality of the data
NMF: analyze relationships in a set of documents with linear algebra
Topic models: analyze relationships in a set of documents with probability theory
Bayesian inference
Practical Part: KDDM1 KU
Lecturer
Name: Tiago Santos
Office: ISDS, Inffeldgasse 16c/I, Room ID01104
Office hours: Tuesday 15:00-16:00
Phone: +43 316 873 5607
Email: [email protected]
KDDM1 Course Practicals
Why should I be interested?
To consolidate and reinforce your (theoretical) knowledge with practical hands-on experience
Helps a lot with the partial and final examinations
Good preparation for KDDM2
If interested: possibility to develop a topic for a project or MSc thesis
KDDM1 Course Practicals
Your task:
Form small groups of 3 or 4 students
Decide on an interesting practical or research question
Decide which data you need for that question
Crawl/download the data and work on your project
Give two presentations (in English) on the progress and your final results
Engage with the class and discuss the results of other groups
KDDM1 Course Practicals
Topic ideas:
  Movie/Game/Music Recommender system construction and analysis
  Sentiment analysis of user posts on social media (Twitter / Reddit / StackOverflow)
  Controversial topics in social media (statistical analysis, clustering, prediction, etc.)
  Hate speech topics in social media (statistical analysis, clustering, prediction, etc.)
  Your own idea! Discuss with Tiago
Very specific examples:
  Reproduce and extend “Quantifying the Advantage of Looking Forward” [1], an analysis which found positive correlation between GDP per country and quantity of searches for the future. Does this extend to any year and country? What about other search topics?
  Similar task but with “Parents mention sons more often than daughters on social media” [2]
[1] https://doi.org/10.1038/srep00350
[2] https://doi.org/10.1073/pnas.1804996116
KDDM1 Course Practicals
Organizational details:
Presentations take place on 20.04.2020 and 22.06.2020, 14:00-16:00, in the room “HS i11”
Presentations need to be sent to Tiago ([email protected]), in PDF format, until 23:59 of the day before
Name the files with the last and first names of the students in the group like this: santos_tiago_mustermann_max_wurst_hans.pdf
KDDM1 Course Practicals
For the first presentation prepare three slides (5 min strict):
First slide: Research/Practical Question
Second slide: Dataset
Third slide: Experimental Design
KDDM1 Course Practicals
For the second presentation prepare five slides (10 min strict):
First slide: Motivation
Second slide: Methodology
Third slide: Experimental Setup
Fourth slide: Results
Fifth slide: Discussion
KDDM1 Course Practicals
Grading:
Presentation (time restrictions will be taken into account)
Results
Students who hand in the first presentation will be graded