
AI Tools Lab SS2020

Introduction

Todor Ivanov todor@dbis.cs.uni-frankfurt.de

http://www.bigdata.uni-frankfurt.de/


About Me

● Dr. Todor Ivanov
– Senior Researcher
– Big Data Benchmarking
– Complex distributed software systems (Hadoop & Spark)
– Storage and processing of data-intensive applications

What about You? ☺


Frankfurt Big Data Lab - http://www.bigdata.uni-frankfurt.de

Our lab is currently active in the following research areas:

• Big Data Management Technologies

• Data Analytics / Data Science

• Graph Databases / Linked Open Data (LOD)


This Hands-on Course will look at

● Python & Python Libraries

● Machine Learning & AI Tools:

● Google What-If Tool

● IBM AI Fairness 360

● IBM AI Explainability 360

● Microsoft InterpretML


Course Organization (1)

● Course start/end: Tuesday, 28.04.2020 to Tuesday, 16.06.2020

● Time: Tuesday 10:00 – 12:00

● Location: remote via Zoom

● Languages: English and German

● Credit Points: Students can receive 8 CPs

● Important: Attendance is mandatory!

● Work in teams of 2 students! The final project will be graded (more info follows).

● Communication: aitoolslabss2020@gmail.com


Course Organization (2)

● Python ML courses

● Project Topic

● Course page: http://www.bigdata.uni-frankfurt.de/big-data-technologies-ss-2020/

● Communication: aitoolslabss2020@gmail.com


Date Topic

28.04.2020 Course Organization & Introduction → Python for Data Science Course

05.05.2020 → Data Analysis with Python Course

12.05.2020 → Data Visualization with Python Course

19.05.2020 → Submit certificates and start Project

26.05.2020

02.06.2020

09.06.2020

16.06.2020 Project Submission

Python for Data Science (1)

https://cognitiveclass.ai/courses/python-for-data-science

COURSE SYLLABUS

Module 1 - Python Basics
○ Your first program
○ Types
○ Expressions and Variables
○ String Operations

Module 2 - Python Data Structures
○ Lists and Tuples
○ Sets
○ Dictionaries

Module 3 - Python Programming Fundamentals
○ Conditions and Branching
○ Loops
○ Functions
○ Objects and Classes

Module 4 - Working with Data in Python
○ Reading files with open
○ Writing files with open
○ Loading data with Pandas
○ Working with and Saving data with Pandas

Data Analysis with Python (2)

https://cognitiveclass.ai/courses/data-analysis-python

COURSE SYLLABUS

Module 1 - Importing Datasets
○ Learning Objectives
○ Understanding the Domain
○ Understanding the Dataset
○ Python package for data science
○ Importing and Exporting Data in Python
○ Basic Insights from Datasets

Module 2 - Cleaning and Preparing the Data
○ Identify and Handle Missing Values
○ Data Formatting
○ Data Normalization Sets
○ Binning
○ Indicator variables

Module 3 - Summarizing the Data Frame
○ Descriptive Statistics
○ Basic of Grouping
○ ANOVA
○ Correlation
○ More on Correlation

Module 4 - Model Development
○ Simple and Multiple Linear Regression
○ Model Evaluation Using Visualization
○ Polynomial Regression and Pipelines
○ R-squared and MSE for In-Sample Evaluation
○ Prediction and Decision Making

Module 5 - Model Evaluation
○ Model Evaluation
○ Over-fitting, Under-fitting and Model Selection
○ Ridge Regression
○ Grid Search
○ Model Refinement

Data Visualization with Python (3)

https://cognitiveclass.ai/courses/data-visualization-with-python

COURSE SYLLABUS

Module 1 - Introduction to Visualization Tools
○ Introduction to Data Visualization
○ Introduction to Matplotlib
○ Basic Plotting with Matplotlib
○ Dataset on Immigration to Canada
○ Line Plots

Module 2 - Basic Visualization Tools
○ Area Plots
○ Histograms
○ Bar Charts

Module 3 - Specialized Visualization Tools
○ Pie Charts
○ Box Plots
○ Scatter Plots
○ Bubble Plots

Module 4 - Advanced Visualization Tools
○ Waffle Charts
○ Word Clouds
○ Seaborn and Regression Plots

Module 5 - Creating Maps and Visualizing Geospatial Data
○ Introduction to Folium
○ Maps with Markers
○ Choropleth Maps

Grading Info

● (10%) - Regular participation in the virtual meetings
• every Tuesday 10:00-12:00

● (30%) - Successfully complete the 3 Python courses
• Send me your certificates + link to badges by May 19th, 2020

● (60%) - Successfully complete the Project Topic
• Implement the project task
• Submit 15 slides presenting your project
• Submit the project solution as a Jupyter Notebook on GitHub by June 16th, 2020

What is Big Data?


Big Data

Growing amount of data
• petabytes/exabytes of user data (text, audio, video, images)

Variety of data sources:
• Mobile devices
• Social platforms
• Sensors (RFID)
• Web platforms

Processing speed
• How fast is the result available?

Big Data Definition (sort of)

• Big Data refers to datasets and data flows that have grown large enough to outpace our capability to store, process, analyze, and understand them.

• Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. (McKinsey Global Institute)

– This definition is not stated in terms of data size (data sets will increase)
– It varies by sector (ranging from a few dozen terabytes to multiple petabytes)


Source: https://www.domo.com/learn/data-never-sleeps-6

Big Data Characteristics


[1] D. Laney, “3D data management: Controlling data volume, velocity and variety,” Appl. Deliv. Strateg. File, vol. 949, 2001.


● Volume represents the ever-growing amount of data in petabytes, exabytes, zettabytes and yottabytes, which is generated by applications like Facebook, Twitter, IoT, etc. and challenges the current state of storage systems.

● Velocity describes how quickly the data is retrieved, stored and processed.

● Variety describes the multitude of data sources like sensors, smart devices and social media, which produce data in different formats: structured, semi-structured or unstructured, with unstructured data being the most common in Big Data use cases.

● Value defines the business value derived from the extracted data insights. It varies depending on the application domain.

● Veracity defines the data accuracy, or how truthful the data is. If the data is corrupted, imprecise or uncertain, this has a direct impact on the quality of the final results.

● Variability defines the different interpretations that the same data can have when put in different contexts. It focuses on the meaning of the data instead of its variety in terms of structure or representation.

Structured Data

Employee

EmpNo  Ename  DeptNo  DeptName
100    Bob    10      Marketing
200    Bob    20      Purchasing
150    Peter  10      Marketing
170    Doug   20      Purchasing
105    John   10      Marketing

Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website logs (sources: http://www.jafsoft.com/searchengines/log_sample.html and http://hortonworks.com).

fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"

fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"

ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
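The log lines above follow a regular enough layout that their fields can be extracted with a pattern, which is what makes them semi-structured. The following is a minimal sketch using only the Python standard library; the pattern, the function name `parse_log_line` and the field names are illustrative assumptions, and real logs may need a more tolerant pattern.

```python
import re
from collections import Counter

# Assumed layout (as in the sample above): host, identity, user, [timestamp],
# "request", status code, response size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no body was sent; treat it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Three lines copied from the sample log above.
sample_lines = [
    'fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"',
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"',
    '123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"',
]

records = [r for r in (parse_log_line(line) for line in sample_lines) if r is not None]

# Once parsed, the records can be queried like structured data.
requests_per_host = Counter(r["host"] for r in records)
total_bytes = sum(r["size"] for r in records)

print(requests_per_host)
print(total_bytes)
```

Lines that do not match the pattern are returned as None instead of raising an error, so a single malformed entry does not stop the analysis.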

Unstructured Data

Sensor data logged to a text file and imported into Excel (source: Memos From the Cube)

Big Data: Challenges

1. Data (Volume, Variety, Velocity …)

2. Processing (Batch, Near-real time, Real-time …)

3. Management (Meta-data (Schema), Security …)

4. Data Science (Machine Learning, Deep Learning, AI …)

Big Data & Hadoop Ecosystem


Big Data Ecosystem

Source: https://mattturck.com/data2019/

What is Data Science/Data Scientist?


Big Data vs. Data Science

Big Data = Big Data Systems + Data Science


Image source: http://www.kevinschmidt.biz/2015/03/22/data-engineer-vs-data-scientist-vs-business-analyst/

Data Science - Definitions

Data Science aims to derive knowledge from data, efficiently and intelligently. Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government.


What is Artificial Intelligence?


Arthur Samuel described machine learning as: "the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Topics: Regression, Classification, Clustering, Decision Trees, Image Processing, Speech Processing, Natural Language Processing, Recommender Systems, Adversarial Networks, Reinforcement Learning

Source: https://blog.quantinsti.com/machine-learning-basics/

Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.


K-Means

Source: https://en.wikipedia.org/wiki/Cluster_analysis
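To make the K-Means idea concrete, here is a minimal sketch of the standard alternating procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points), written with the standard library only. The data points are made up, and seeding the centroids with the first k points is a simplifying assumption; library implementations such as scikit-learn use more careful initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments


# Hypothetical 2-D data with two visually separate groups and no labels.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)

print(assignments)
print(centroids)
```

No correct answers are supplied to the algorithm: the grouping emerges from the distances alone, which is what distinguishes this from supervised learning.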

Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. (We have data which includes the correct answer)


Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

Turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.
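The difference between the two problem types can be sketched in a few lines of plain Python. All numbers below are invented for illustration, and the two methods (a least-squares line for regression, a one-nearest-neighbour rule for classification) are simple stand-ins chosen for brevity, not the methods prescribed by the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbour_label(size, train_sizes, train_labels):
    """Return the label of the training house whose size is closest."""
    nearest = min(range(len(train_sizes)), key=lambda i: abs(train_sizes[i] - size))
    return train_labels[nearest]


# Hypothetical training data: size in square metres, prices in thousand euros.
sizes = [50, 80, 100, 120, 150]
sale_prices = [150, 240, 300, 360, 450]
asking_prices = [160, 230, 310, 350, 440]

# Regression: the output is a continuous value (a price).
slope, intercept = fit_line(sizes, sale_prices)
predicted_price = slope * 110 + intercept

# Classification: the output is one of two discrete categories.
labels = [
    "more" if sale > asking else "less"
    for sale, asking in zip(sale_prices, asking_prices)
]
predicted_label = nearest_neighbour_label(115, sizes, labels)

print(predicted_price)
print(predicted_label)
```

In both cases the training data already contains the correct answer (the sale price, or the more/less label), which is what makes the setting supervised.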


Source: Introduction to artificial intelligence using Intel® hardware platform

Problems?

PredPol is an algorithm designed to predict when and where crimes will take place, with the aim of helping to reduce human bias in policing. But in 2016, the Human Rights Data Analysis Group found that the software could lead police to unfairly target certain neighbourhoods. When researchers applied a simulation of PredPol’s algorithm to drug offences in Oakland, California, it repeatedly sent officers to neighbourhoods with a high proportion of people from racial minorities, regardless of the true crime rate in those areas.

Source: Lum, Kristian, and William Isaac. "To predict and serve?" Significance 13.5 (2016): 14-19.

“Volvo admits its self-driving cars are confused by kangaroos[…]The company’s “Large Animal Detection system” can identify and avoid deer, elk and caribou, but early testing in Australia shows it cannot adjust to the kangaroo’s unique method of movement.” Source: https://www.theguardian.com/technology/2017/jul/01/volvo-admits-its-self-driving-cars-are-confused-by-kangaroos


Bias / Fairness

“Algorithmic bias describes systematic and repeatable errors in a computer system that create unfair outcomes, such as privileging one arbitrary group of users over others. Bias can emerge due to many factors, including but not limited to the design of the algorithm or the unintended or unanticipated use or decisions relating to the way data is coded, collected, selected or used to train the algorithm.”

Source: https://en.wikipedia.org/wiki/Algorithmic_bias

Finding suitable definitions of fairness in an algorithmic context is a subject of much debate:

“Equality of opportunity defines an important welfare criterion in political philosophy and policy analysis. Philosophers define equality of opportunity as the requirement that an individual’s well being be independent of his or her irrelevant characteristics. The difference among philosophers is mainly about which characteristics should be considered irrelevant. Policymakers, however, are often called upon to address more specific questions: How should admissions policies be designed so as to provide equal opportunities for college? Or how should tax schemes be designed so as to equalize opportunities for income? These are called local distributive justice problems, because each policymaker is in charge of achieving equality of opportunity to a specific issue.”

Source: Catarina Calsamiglia. Decentralizing equality of opportunity and issues concerning the equality of educational opportunity, 2005. Doctoral Dissertation, Yale University.

Course Ethical Implications of AI

Topics:
● The Ethics of Artificial Intelligence (AI)
● Ethics, Moral Values, Humankind, Technology, AI Examples
● On the ethics of algorithmic decision-making in healthcare
● Fairness, Bias and Discrimination in AI
● AI and Trust: Explainability, Transparency
● AI Privacy, Responsibility, Accountability, Safety, Human-in-the-loop
● AI and Trust
● Introduction to Z-inspection. A framework to assess Ethical AI
● Legal relevance of AI Ethics
● AI Fairness and AI Explainability software tools
● Design of Ethics Tools for AI Developers
● Assessing AI use cases. Ethical tensions, Trade-offs

→ http://www.bigdata.uni-frankfurt.de/data-challenge-ss-2020/

→ All course slides and videos available on the website!


AI/Machine Learning Glossary

Library: Hardware-optimized mathematical and other primitive functions that are commonly used in machine & deep learning algorithms, topologies & frameworks

Framework: Open-source software environments that facilitate deep learning model development & deployment through built-in components and the ability to customize code

Topology: Wide variety of algorithms modeled loosely after the human brain that use neural networks to recognize complex patterns in data that are otherwise difficult to reverse engineer (e.g. Yolo, Inception-ResNetV2, SSD-MobileNet, Resnet-50, Faster-RCNN, Wavenet, …)

Popular Python Libraries

● Computing:

• Pandas (data structures & tools) - https://pandas.pydata.org/

• Numpy (arrays & matrices) - https://numpy.org/

• SciPy (integrals & optimizations) - https://www.scipy.org/

● Algorithmic libs:

• scikit-learn (ML, regressions, ...) - https://scikit-learn.org/stable/

• statsmodels (statistics, estimations, etc.) - https://www.statsmodels.org/stable/index.html

● Data Visualization:

• matplotlib (plots & graphs) - https://matplotlib.org/

• seaborn (heat maps, time series, ...) - https://seaborn.pydata.org/


ML & AI Tools in this Lab

● Google What-If
● IBM AI Fairness 360
● IBM AI Explainability 360
● MS InterpretML

Python Environments (Cloud-based)

● Google Colaboratory Tools - https://colab.research.google.com/notebooks/intro.ipynb

● IBM Notebooks - https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/notebooks-parent.html

What-If Tool

What-If Tool is an interactive visual interface designed to probe your models better.

https://pair-code.github.io/what-if-tool/

● Compare multiple models within the same workflow

● Visualize inference results

● Visualize feature attributions

● Arrange datapoints by similarity

● Edit a datapoint and see how your model performs

● Compare counterfactuals to datapoints

● Use feature values as lenses into model performance

● Experiment using confusion matrices and ROC curves

● Test algorithmic fairness constraints


AI Fairness & Explainability 360

AIF 360 is an open source toolkit that can help you examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. Containing over 70 fairness metrics and 10 state-of-the-art bias mitigation algorithms developed by the research community, it is designed to translate algorithmic research from the lab into the actual practice of domains as wide-ranging as finance, human capital management, healthcare, and education. https://aif360.mybluemix.net/

AIX 360 is an open source toolkit that can help you comprehend how machine learning models predict labels by various means throughout the AI application lifecycle. Containing eight state-of-the-art algorithms for interpretable machine learning as well as metrics for explainability, it is designed to translate algorithmic research from the lab into the actual practice of domains as wide-ranging as finance, human capital management, healthcare, and education. http://aix360.mybluemix.net/
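As an illustration of what a fairness metric measures, the sketch below computes two commonly used group-fairness quantities by hand: the statistical parity difference and the disparate impact ratio. It deliberately does not use the AIF 360 API; the data, the group names "a" and "b", and the helper `favourable_rate` are all hypothetical.

```python
def favourable_rate(outcomes, groups, group):
    """Share of favourable outcomes (1) among members of the given group."""
    selected = [o for o, g in zip(outcomes, groups) if g == group]
    return sum(selected) / len(selected)


# Hypothetical model decisions: 1 = favourable (e.g. loan approved), 0 = not.
outcomes = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
# Hypothetical protected attribute: "a" is treated as the privileged group.
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

rate_privileged = favourable_rate(outcomes, groups, "a")
rate_unprivileged = favourable_rate(outcomes, groups, "b")

# Difference of favourable rates: 0 would mean both groups are treated alike.
statistical_parity_difference = rate_unprivileged - rate_privileged

# Ratio of favourable rates: 1 would mean both groups are treated alike.
disparate_impact = rate_unprivileged / rate_privileged

print(statistical_parity_difference)
print(disparate_impact)
```

In this made-up example the unprivileged group receives the favourable outcome far less often, so the difference is negative and the ratio is well below 1. Toolkits such as AIF 360 bundle many such metrics together with algorithms that try to reduce the gap.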


InterpretML - https://github.com/interpretml/interpret

● InterpretML is an open-source python package for training interpretable machine learning models and explaining black-box systems.

● Interpretability is essential for:
• Model debugging - Why did my model make this mistake?
• Detecting bias - Does my model discriminate?
• Human-AI cooperation - How can I understand and trust the model's decisions?
• Regulatory compliance - Does my model satisfy legal requirements?
• High-risk applications - Healthcare, finance, judicial, ...

Next Meeting

● Project Topics and working teams (2 students)

● Zoom invitation via Email (on Monday)


External Resources

● Multiple Learning Paths in Big Data and Data Science: https://cognitiveclass.ai/learn/all/
• Free courses
• Obtain a certificate (badge) after completing each course

● Learning Paths:
• Applied Data Science with Python
• Big Data Analytics
• Big Data Fundamentals
• Data Science for Business
• Data Science Fundamentals
• Deep Learning
• Many Hadoop and Spark courses: https://cognitiveclass.ai/learn/all/page/2/

Deep Learning & AI courses

● Deep Learning in Coursera

● Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning

● TensorFlow Specialization

● AI For Everyone in Coursera
