AI Tools Lab SS2020
Introduction
Todor Ivanov todor@dbis.cs.uni-frankfurt.de
http://www.bigdata.uni-frankfurt.de/
About Me
● Dr. Todor Ivanov – Senior Researcher
– Big Data Benchmarking
– Complex distributed software systems (Hadoop & Spark)
– Storage and processing of data-intensive applications
What about You? ☺
Frankfurt Big Data Lab - http://www.bigdata.uni-frankfurt.de
Our lab is currently active in the following research areas:
• Big Data Management Technologies
• Data Analytics / Data Science
• Graph Databases / Linked Open Data (LOD)
This Hands-on Course will look at
● Python & Python Libraries
● Machine Learning & AI Tools:
● Google What-If Tool
● IBM AI Fairness 360
● IBM AI Explainability 360
● Microsoft InterpretML
Course Organization (1)
● Course start/end: Tuesday, 28.04.2020 to Tuesday, 16.06.2020
● Time: Tuesday 10:00 – 12:00
● Location: remote via Zoom
● Languages: English and German
● Credit Points: Students can receive 8 CPs
● Important: Attendance is mandatory!
● Work in teams of 2 students! The final project will be graded (more info follows).
● Communication: aitoolslabss2020@gmail.com
Course Organization (2)
● Python ML courses
● Project Topic
● Course page: http://www.bigdata.uni-frankfurt.de/big-data-technologies-ss-2020/
● Communication: aitoolslabss2020@gmail.com
Date Topic
28.04.2020 Course Organization & Introduction → Python for Data Science Course
05.05.2020 → Data Analysis with Python Course
12.05.2020 → Data Visualization with Python Course
19.05.2020 → Submit certificates and start Project
26.05.2020
02.06.2020
09.06.2020
16.06.2020 Project Submission
Python for Data Science (1) - https://cognitiveclass.ai/courses/python-for-data-science
COURSE SYLLABUS
Module 1 - Python Basics
○ Your first program
○ Types
○ Expressions and Variables
○ String Operations
Module 2 - Python Data Structures
○ Lists and Tuples
○ Sets
○ Dictionaries
Module 3 - Python Programming Fundamentals
○ Conditions and Branching
○ Loops
○ Functions
○ Objects and Classes
Module 4 - Working with Data in Python
○ Reading files with open
○ Writing files with open
○ Loading data with Pandas
○ Working with and Saving data with Pandas
Data Analysis with Python (2) - https://cognitiveclass.ai/courses/data-analysis-python
COURSE SYLLABUS
Module 1 - Importing Datasets
○ Learning Objectives
○ Understanding the Domain
○ Understanding the Dataset
○ Python package for data science
○ Importing and Exporting Data in Python
○ Basic Insights from Datasets
Module 2 - Cleaning and Preparing the Data
○ Identify and Handle Missing Values
○ Data Formatting
○ Data Normalization Sets
○ Binning
○ Indicator variables
Module 3 - Summarizing the Data Frame
○ Descriptive Statistics
○ Basic of Grouping
○ ANOVA
○ Correlation
○ More on Correlation
Module 4 - Model Development
○ Simple and Multiple Linear Regression
○ Model Evaluation Using Visualization
○ Polynomial Regression and Pipelines
○ R-squared and MSE for In-Sample Evaluation
○ Prediction and Decision Making
Module 5 - Model Evaluation
○ Model Evaluation
○ Over-fitting, Under-fitting and Model Selection
○ Ridge Regression
○ Grid Search
○ Model Refinement
Data Visualization with Python (3) - https://cognitiveclass.ai/courses/data-visualization-with-python
COURSE SYLLABUS
Module 1 - Introduction to Visualization Tools
○ Introduction to Data Visualization
○ Introduction to Matplotlib
○ Basic Plotting with Matplotlib
○ Dataset on Immigration to Canada
○ Line Plots
Module 2 - Basic Visualization Tools
○ Area Plots
○ Histograms
○ Bar Charts
Module 3 - Specialized Visualization Tools
○ Pie Charts
○ Box Plots
○ Scatter Plots
○ Bubble Plots
Module 4 - Advanced Visualization Tools
○ Waffle Charts
○ Word Clouds
○ Seaborn and Regression Plots
Module 5 - Creating Maps and Visualizing Geospatial Data
○ Introduction to Folium
○ Maps with Markers
○ Choropleth Maps
Grading Info
● (10%) - Regular participation in the virtual meetings
• every Tuesday 10:00-12:00
● (30%) - Successfully complete the 3 Python courses
• Send me your certificates + link to badges by May 19th, 2020
● (60%) - Successfully complete the Project Topic
• Implement the project task
• Submit 15 slides presenting your project
• Submit the project solution as a Jupyter Notebook on GitHub by June 16th, 2020
What is Big Data?
Big Data
Growing amount of data
• petabytes/exabytes of user data (text, audio, video, images)
Variety of data sources:
• Mobile devices
• Social platforms
• Sensors (RFID)
• Web platforms
Processing speed
• How fast is the result available?
Big Data Definition (sort of)
• Big Data refers to datasets and data flows so large that they have outpaced our capability to store, process, analyze, and understand them.
• Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. (McKinsey Global Institute)
– This definition is not tied to a specific data size (data sets will keep growing)
– The threshold varies by sector (ranging from a few dozen terabytes to multiple petabytes)
Source: https://www.domo.com/learn/data-never-sleeps-6
Big Data Characteristics
[1] D. Laney, “3D data management: Controlling data volume, velocity and variety,” Appl. Deliv. Strateg. File, vol. 949, 2001.
Big Data Characteristics
● Volume represents the ever-growing amount of data in petabytes, exabytes, zettabytes and yottabytes, which is generated by applications like Facebook, Twitter, IoT, etc. and challenges the current state of storage systems.
● Velocity describes how quickly the data is retrieved, stored and processed.
● Variety describes the multitude of data sources like sensors, smart devices and social media, producing data in different formats: structured, semi-structured or unstructured, with unstructured data being the most common in Big Data use cases.
● Value defines the business value derived from the extracted data insights. It varies depending on the application domain.
● Veracity defines the data accuracy or how truthful it is. If the data is corrupted, imprecise or uncertain, this has a direct impact on the quality of the final results.
● Variability defines the different interpretations that the same data can have when put in different contexts. It focuses on the meaning of the data instead of its variety in terms of structure or representation.
Structured Data
Employee
EmpNo  Ename  DeptNo  DeptName
100    Bob    10      Marketing
200    Bob    20      Purchasing
150    Peter  10      Marketing
170    Doug   20      Purchasing
105    John   10      Marketing
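Because structured data follows a fixed schema, it can be loaded into a relational table and queried directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database; the choice of SQLite and the headcount query are illustrative assumptions and not part of the slide.

```python
import sqlite3

# Rows copied from the Employee table above: (EmpNo, Ename, DeptNo, DeptName).
rows = [
    (100, "Bob", 10, "Marketing"),
    (200, "Bob", 20, "Purchasing"),
    (150, "Peter", 10, "Marketing"),
    (170, "Doug", 20, "Purchasing"),
    (105, "John", 10, "Marketing"),
]

# An in-memory database keeps the sketch self-contained (assumption: no file needed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "EmpNo INTEGER PRIMARY KEY, Ename TEXT, DeptNo INTEGER, DeptName TEXT)"
)
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?)", rows)

# The fixed schema makes aggregate questions a one-line query.
headcount = dict(
    conn.execute(
        "SELECT DeptName, COUNT(*) FROM Employee GROUP BY DeptName"
    ).fetchall()
)
conn.close()

print(headcount)
```

Every row has the same columns with the same types, which is what distinguishes this table from the semi-structured and unstructured examples that follow.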
Clickstream Data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website logs (sources: http://www.jafsoft.com/searchengines/log_sample.html and http://hortonworks.com)
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
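The log lines above are semi-structured: there is no declared schema, but every line follows the same field order (host, identity, user, timestamp, request, status, size, referrer, user agent). A minimal sketch of turning one line into a record, assuming this field layout holds for every line; the regular expression and the names `LOG_PATTERN` and `parse_line` are illustrative and not from the slides.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields that have a natural type.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# One of the sample lines shown above.
sample = (
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] '
    '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
    '"http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"'
)
record = parse_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

Once parsed, the records can be aggregated like structured data, for example by counting requests per host or per path.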
Unstructured Data
Sensor data logged to a text file. Data imported into Excel (source: Memos From the Cube)
Big Data: Challenges
1. Data (Volume, Variety, Velocity … )
2. Processing (Batch, Near-real time, Real-time …)
3. Management (Meta-data(Schema), Security … )
4. Data Science (Machine Learning, Deep Learning, AI …)
Big Data & Hadoop Ecosystem
Big Data Ecosystem
Source: https://mattturck.com/data2019/
What is Data Science/Data Scientist?
Big Data vs. Data Science
Big Data = Big Data Systems + Data Science
Image source: http://www.kevinschmidt.biz/2015/03/22/data-engineer-vs-data-scientist-vs-business-analyst/
Data Science - Definitions
Data Science aims to derive knowledge from data, efficiently and intelligently. Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government.
What is Artificial Intelligence?
What is Artificial Intelligence?
Arthur Samuel described machine learning as: "the field of study that gives computers the ability to learn without being explicitly programmed."
Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Regression
Classification
Clustering
Decision Trees
Image Processing
Speech Processing
Natural Language Processing
Recommender Systems
Adversarial Networks
Reinforcement Learning
Source: https://blog.quantinsti.com/machine-learning-basics/
Unsupervised Learning
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
K-Means
Source: https://en.wikipedia.org/wiki/Cluster_analysis
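To make the idea of deriving structure without labels concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. The points, the starting centroids and the function name `kmeans` are made up for illustration; in practice a library implementation such as the one in scikit-learn would normally be used.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels


# Hypothetical 2D points forming two visually obvious groups; no labels are given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The algorithm is never told which group a point belongs to; the cluster labels emerge only from the distances between the points.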
Supervised Learning
In supervised learning, we are given a data set and already know what the correct output should look like, on the assumption that there is a relationship between the input and the output. (We have data that includes the correct answers.)
Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.
Turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.
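The two variants of the house example can be sketched in a few lines of plain Python. All numbers below are invented toy data, and the helper names `fit_line` and `classify` are illustrative. A least-squares line stands in for the regression model and a 3-nearest-neighbour vote for the classifier, which is one possible choice among many; a library such as scikit-learn would be the natural tool for real data.

```python
# Hypothetical training data: house size (square metres) and sale price (thousand EUR).
sizes = [50, 70, 90, 110, 130]
prices = [150, 210, 270, 330, 390]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Regression: the output is a continuous value (a price).
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 100 + intercept

# Classification: the output is one of two discrete categories.
# Hypothetical labels: did the house sell above or below the asking price?
labelled = [
    (50, "below"), (60, "below"), (75, "below"),
    (95, "above"), (120, "above"), (140, "above"),
]


def classify(size, examples, k=3):
    """Majority vote among the k training houses closest in size."""
    nearest = sorted(examples, key=lambda e: abs(e[0] - size))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


print(predicted_price, classify(65, labelled), classify(125, labelled))
```

In both cases the training data contains the correct answers (prices or labels), which is what makes the task supervised.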
Supervised Learning
Source: Introduction to artificial intelligence using Intel® hardware platform
Problems?
PredPol is an algorithm designed to predict when and where crimes will take place, with the aim of helping to reduce human bias in policing. But in 2016, the Human Rights Data Analysis Group found that the software could lead police to unfairly target certain neighbourhoods. When researchers applied a simulation of PredPol’s algorithm to drug offences in Oakland, California, it repeatedly sent officers to neighbourhoods with a high proportion of people from racial minorities, regardless of the true crime rate in those areas.
Source: Lum, Kristian, and William Isaac. "To predict and serve?." Significance 13.5 (2016): 14-19.
“Volvo admits its self-driving cars are confused by kangaroos […] The company’s “Large Animal Detection system” can identify and avoid deer, elk and caribou, but early testing in Australia shows it cannot adjust to the kangaroo’s unique method of movement.”
Source: https://www.theguardian.com/technology/2017/jul/01/volvo-admits-its-self-driving-cars-are-confused-by-kangaroos
Bias / Fairness
“Algorithmic bias describes systematic and repeatable errors in a computer system that create unfair outcomes, such as privileging one arbitrary group of users over others. Bias can emerge due to many factors, including but not limited to the design of the algorithm or the unintended or unanticipated use or decisions relating to the way data is coded, collected, selected or used to train the algorithm.”
Source: https://en.wikipedia.org/wiki/Algorithmic_bias
Finding suitable definitions of fairness in an algorithmic context is a subject of much debate:
“Equality of opportunity defines an important welfare criterion in political philosophy and policy analysis. Philosophers define equality of opportunity as the requirement that an individual’s well being be independent of his or her irrelevant characteristics. The difference among philosophers is mainly about which characteristics should be considered irrelevant. Policymakers, however, are often called upon to address more specific questions: How should admissions policies be designed so as to provide equal opportunities for college? Or how should tax schemes be designed so as to equalize opportunities for income? These are called local distributive justice problems, because each policymaker is in charge of achieving equality of opportunity to a specific issue.”
Source: Catarina Calsamiglia. Decentralizing equality of opportunity and issues concerning the equality of educational opportunity, 2005. Doctoral Dissertation, Yale University.
Course Ethical Implications of AI
Topics:
● The Ethics of Artificial Intelligence (AI)
● Ethics, Moral Values, Humankind, Technology, AI Examples
● On the ethics of algorithmic decision-making in healthcare
● Fairness, Bias and Discrimination in AI
● AI and Trust: Explainability, Transparency
● AI Privacy, Responsibility, Accountability, Safety, Human-in-the-loop
● AI and Trust
● Introduction to Z-inspection. A framework to assess Ethical AI
● Legal relevance of AI Ethics
● AI Fairness and AI Explainability software tools
● Design of Ethics Tools for AI Developers
● Assessing AI use cases. Ethical tensions, Trade-offs
→ http://www.bigdata.uni-frankfurt.de/data-challenge-ss-2020/
→ All course slides and videos available on the website!
AI/Machine Learning Glossary
Library: Hardware-optimized mathematical and other primitive functions that are commonly used in machine & deep learning algorithms, topologies & frameworks
Framework: Open-source software environments that facilitate deep learning model development & deployment through built-in components and the ability to customize code
Topology: Wide variety of algorithms modeled loosely after the human brain that use neural networks to recognize complex patterns in data that are otherwise difficult to reverse engineer
Yolo, Inception-ResNetV2, SSD-MobileNet, Resnet-50, Faster-RCNN, Wavenet, …
Popular Python Libraries
● Computing:
• Pandas (data structures & tools) - https://pandas.pydata.org/
• Numpy (arrays & matrices) - https://numpy.org/
• SciPy (integrals & optimizations) - https://www.scipy.org/
● Algorithmic libs:
• scikit-learn (ML, regressions, ...) - https://scikit-learn.org/stable/
• statsmodels (statistics, estimations, etc..) - https://www.statsmodels.org/stable/index.html
● Data Visualization:
• matplotlib (plots & graphs) - https://matplotlib.org/
• seaborn (heat maps, time series, ..) - https://seaborn.pydata.org/
ML & AI Tools in this Lab
● Google What-If
● IBM AI Fairness 360
● IBM AI Explainability 360
● MS InterpretML
Python Environments (Cloud-based)
● Google Colaboratory Tools - https://colab.research.google.com/notebooks/intro.ipynb
● IBM Notebooks - https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/notebooks-parent.html
What-If Tool
What-If Tool is an interactive visual interface designed to probe your models better.
https://pair-code.github.io/what-if-tool/
● Compare multiple models within the same workflow
● Visualize inference results
● Visualize feature attributions
● Arrange datapoints by similarity
● Edit a datapoint and see how your model performs
● Compare counterfactuals to datapoints
● Use feature values as lenses into model performance
● Experiment using confusion matrices and ROC curves
● Test algorithmic fairness constraints
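The confusion matrices and ROC curves mentioned in the list rest on a few simple counts. The sketch below computes them by hand in plain Python; it does not use the What-If Tool's API, and the labels, scores and function names are invented for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def rates(cm):
    """Return (false positive rate, true positive rate), one point on the ROC curve."""
    tpr = cm["tp"] / (cm["tp"] + cm["fn"])
    fpr = cm["fp"] / (cm["fp"] + cm["tn"])
    return fpr, tpr


def predict(scores, threshold):
    """Turn model scores into hard 0/1 predictions at a given threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical ground truth and model scores for eight datapoints.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

# One confusion matrix at a single threshold ...
cm = confusion_matrix(y_true, predict(scores, 0.5))

# ... and an ROC curve as the (FPR, TPR) pairs obtained by sweeping the threshold.
roc_points = [
    rates(confusion_matrix(y_true, predict(scores, t)))
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
]
print(cm, roc_points)
```

Moving the threshold trades false positives against false negatives, which is the kind of trade-off an interactive tool lets you explore visually.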
AI Fairness & Explainability 360
AIF 360 is an open source toolkit that can help you examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. Containing over 70 fairness metrics and 10 state-of-the-art bias mitigation algorithms developed by the research community, it is designed to translate algorithmic research from the lab into the actual practice of domains as wide-ranging as finance, human capital management, healthcare, and education. https://aif360.mybluemix.net/
AIX 360 is an open source toolkit that can help you comprehend how machine learning models predict labels by various means throughout the AI application lifecycle. Containing eight state-of-the-art algorithms for interpretable machine learning as well as metrics for explainability, it is designed to translate algorithmic research from the lab into the actual practice of domains as wide-ranging as finance, human capital management, healthcare, and education. http://aix360.mybluemix.net/
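To give a feel for what a fairness metric measures, the sketch below computes two common group-fairness measures by hand: statistical parity difference and disparate impact. It deliberately avoids any toolkit API; the decision data, the group names "A" and "B", and the choice of "A" as the privileged group are assumptions made purely for illustration.

```python
# Hypothetical loan decisions as (group, outcome) pairs, where 1 = favourable outcome.
decisions = [
    ("A", 1), ("A", 1), ("A", 1), ("A", 0),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]


def favourable_rate(records, group):
    """Share of favourable outcomes within one group."""
    outcomes = [outcome for g, outcome in records if g == group]
    return sum(outcomes) / len(outcomes)


# Assumption for this sketch: "A" is the privileged group, "B" the unprivileged one.
rate_privileged = favourable_rate(decisions, "A")
rate_unprivileged = favourable_rate(decisions, "B")

# Statistical parity difference: 0 means both groups receive favourable
# outcomes at the same rate; negative values disadvantage the unprivileged group.
statistical_parity_difference = rate_unprivileged - rate_privileged

# Disparate impact: ratio of the two rates; 1 means parity.
disparate_impact = rate_unprivileged / rate_privileged

print(statistical_parity_difference, disparate_impact)
```

Both numbers summarise the same gap in different ways: one as a difference of rates, the other as a ratio. Which of them, if either, is the right criterion depends on the definition of fairness chosen for the application.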
InterpretML - https://github.com/interpretml/interpret
● InterpretML is an open-source python package for training interpretable machine learning models and explaining black-box systems.
● Interpretability is essential for:
• Model debugging - Why did my model make this mistake?
• Detecting bias - Does my model discriminate?
• Human-AI cooperation - How can I understand and trust the model's decisions?
• Regulatory compliance - Does my model satisfy legal requirements?
• High-risk applications - Healthcare, finance, judicial, ...
Next Meeting
● Project Topics and working teams (2 students)
● Zoom invitation via Email (on Monday)
External Resources
● Multiple Learning Paths in Big Data and Data Science: https://cognitiveclass.ai/learn/all/
● Free courses
● Obtain a certificate (badge) after completing each course
● Learning Paths:
• Applied Data Science with Python
• Big Data Analytics
• Big Data Fundamentals
• Data Science for Business
• Data Science Fundamentals
• Deep Learning
• Many Hadoop and Spark courses: https://cognitiveclass.ai/learn/all/page/2/
Deep Learning & AI courses
● Deep Learning in Coursera
● Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning
● TensorFlow Specialization
● AI For Everyone in Coursera