96

Data Science with PostgreSQLdatascientist.at/wp-content/uploads/2015/01/Bárány-Data-Science-with... · Data Science with PostgreSQL Contents Introduction What is Data Science? Process

  • Upload
    others

  • View
    41

  • Download
    0

Embed Size (px)

Citation preview

Data Science with PostgreSQL

Data Science with PostgreSQL

Balázs Bárány

Data Scientist

pgconf.eu 2015

Data Science with PostgreSQL

Contents

Introduction � What is Data Science?Process model

Tools and methods of Data Scientists

Data Science with PostgreSQLBusiness & data understandingPreprocessingModeling

Evaluation

Deployment

Summary

Data Science with PostgreSQL

Introduction � What is Data Science?

Sexiest job of the 21st century

I According to Google, LinkedIn, ...

I Who is a Data Scientist?

Data Science with PostgreSQL

Introduction � What is Data Science?

Sexiest job of the 21st century

I According to Google, LinkedIn, ...

I Who is a Data Scientist?

Data Science with PostgreSQL

Introduction � What is Data Science?

Data Science Venn Diagram

(c) Drew Conway, 2010. CC-BY-NC

Data Science with PostgreSQL

Introduction � What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL

Introduction � What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL

Introduction � What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL

Introduction � What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL

Introduction � What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL

Introduction � What is Data Science?

Process model

The Data Mining process

Cross Industry Standard Process for Data Mining (Kenneth Jensen/Wikimedia Commons)

Data Science with PostgreSQL

Tools and methods of Data Scientists

Tools and methods

Tools and methods

Data Science with PostgreSQL

Tools and methods of Data Scientists

Scripting and programming

I R

I Python with extensions

I Octave/Matlab, other mathematic languages

I Hadoop and Big Data programming libraries (mostly Java)

I Cloud services

Data Science with PostgreSQL

Tools and methods of Data Scientists

Integrated GUI tools

I (partly) Open Source: RapidMiner, KNIME, Orange

I Data Warehouse tools extended for analytics: Pentaho, Talend

I Many commercial tools, e. g. SAS, IBM SPSS

I Hadoop-related newcomers: e. g. Datameer

Data Science with PostgreSQL

Tools and methods of Data Scientists

Data Infrastructure

I Databases and data stores

I Relational, NoSQLI Hadoop clustersI In-memory

I Data streams

I Free-form data: text, images, video, audio, ...

I Web APIs

I Open Data

Data Science with PostgreSQL

Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, �ltering, calculating, ...

I Data cleansing

I Missing valuesI Abnormal values

I Result: data set suitable for analytics

Data Science with PostgreSQL

Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, �ltering, calculating, ...

I Data cleansing

I Missing valuesI Abnormal values

I Result: data set suitable for analytics

Data Science with PostgreSQL

Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, �ltering, calculating, ...

I Data cleansing

I Missing valuesI Abnormal values

I Result: data set suitable for analytics

Data Science with PostgreSQL

Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, �ltering, calculating, ...

I Data cleansing

I Missing valuesI Abnormal values

I Result: data set suitable for analytics

Data Science with PostgreSQL

Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classi�cation (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value

I Clustering (unsupervised): Automatic �grouping� of data

I Association analysis, outlier detection, time series prediction,...

Data Science with PostgreSQL

Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classi�cation (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value

I Clustering (unsupervised): Automatic �grouping� of data

I Association analysis, outlier detection, time series prediction,...

Data Science with PostgreSQL

Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classi�cation (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value

I Clustering (unsupervised): Automatic �grouping� of data

I Association analysis, outlier detection, time series prediction,...

Data Science with PostgreSQL

Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + con�dence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)

I Interrupt �nancial transaction => prescription

I Order supplies => prescription

I ...

Data Science with PostgreSQL

Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + con�dence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)

I Interrupt �nancial transaction => prescription

I Order supplies => prescription

I ...

Data Science with PostgreSQL

Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + con�dence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)

I Interrupt �nancial transaction => prescription

I Order supplies => prescription

I ...

Data Science with PostgreSQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Doing Data Science with PostgreSQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Caveats

I This stu� is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourselfI Able to compile stu�

I You should ask ;-)

I your bossI co-workersI customer

Data Science with PostgreSQL

Data Science with PostgreSQL

Caveats

I This stu� is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourselfI Able to compile stu�

I You should ask ;-)

I your bossI co-workersI customer

Data Science with PostgreSQL

Data Science with PostgreSQL

Caveats

I This stu� is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourselfI Able to compile stu�

I You should ask ;-)

I your bossI co-workersI customer

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria

I Not a technical activity, PostgreSQL can't help much

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria

I Not a technical activity, PostgreSQL can't help much

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria

I Not a technical activity, PostgreSQL can't help much

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding

I Existing data

I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too smallI Too big

I Suitability for predictive modeling

I Target variable?I Attribute types

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding

I Existing data

I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too smallI Too big

I Suitability for predictive modeling

I Target variable?I Attribute types

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding

I Existing data

I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too smallI Too big

I Suitability for predictive modeling

I Target variable?I Attribute types

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding

I Existing data

I Entities and covered conceptsI Complete? Correct? In suitable form?I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too smallI Too big

I Suitability for predictive modeling

I Target variable?I Attribute types

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import processI Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLsI Search for missing links (incomplete foreign keys)

I Analyze �surprizes�

I Impossible valuesI Missing values in �required� �elds

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import processI Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLsI Search for missing links (incomplete foreign keys)

I Analyze �surprizes�

I Impossible valuesI Missing values in �required� �elds

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import processI Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLsI Search for missing links (incomplete foreign keys)

I Analyze �surprizes�

I Impossible valuesI Missing values in �required� �elds

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding with PostgreSQL � summary

I Good SQL knowledge required

I Tedious manual process

I repetitiveI not suitable for large number of attributes

I No built-in visualization

I Or maybe...

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding with PostgreSQL � summary

I Good SQL knowledge required

I Tedious manual process

I repetitiveI not suitable for large number of attributes

I No built-in visualization

I Or maybe...

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

SQL barchart output

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Bar chart from GUI tool

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Boxplot output

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding wrap up

I DBMS not built for this

I It can support more specialized tools

I Introduction: R

I �A free software environment for statistical computing andgraphics�

I Available in PostgreSQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Data understanding wrap up

I DBMS not built for this

I It can support more specialized tools

I Introduction: R

I �A free software environment for statistical computing andgraphics�

I Available in PostgreSQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

PL/R: A statistical language for PostgreSQL

I R as a standalone language

I Mathematical and statistical methodsI Powerful visualization functionsI Classical, modern and bleeding edge modelingI Arrays and data frames are central data typesI Operates only in memory

I PL/R: R as a loadable procedural language for PostgreSQL

I First released in 2003 by Joe ConwayI License: GPL

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

PL/R: A statistical language for PostgreSQL

I R as a standalone language

I Mathematical and statistical methodsI Powerful visualization functionsI Classical, modern and bleeding edge modelingI Arrays and data frames are central data typesI Operates only in memory

I PL/R: R as a loadable procedural language for PostgreSQL

I First released in 2003 by Joe ConwayI License: GPL

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

R usage outside of PostgreSQL

I Development environments

I RStudio (AGPL or commercial, local & web)I RKWard, Cantor (KDE projects)I StatET (Eclipse)

I Frontends

I R CommanderI DeducerI Rattle

I Web framework: Shiny (AGPL or commercial)

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Working with R in PostgreSQL

I Install functions in the database

Example

select install_rcmd('

myfunction <-function(x)

{print(x)}

');

I Install without function body

Example

CREATE FUNCTION rnorm

(n integer, mean double precision, sd double precision)

RETURNS double precision[]

AS �

LANGUAGE 'plr';

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access

I Execution outside

I Simple and interactive, but data transfer

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access

I Execution outside

I Simple and interactive, but data transfer

Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding

Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access

I Execution outside

I Simple and interactive, but data transfer

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joiningI Easy to �lter

I Columns: not so much

I Easy to create new columnsI Only explicit access

I Wider interpretation of preprocessing

I Enrichment with external dataI New attributes from existing onesI Recoding, recalculationI Missing value handling

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joiningI Easy to �lter

I Columns: not so much

I Easy to create new columnsI Only explicit access

I Wider interpretation of preprocessing

I Enrichment with external dataI New attributes from existing onesI Recoding, recalculationI Missing value handling

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joiningI Easy to �lter

I Columns: not so much

I Easy to create new columnsI Only explicit access

I Wider interpretation of preprocessing

I Enrichment with external dataI New attributes from existing onesI Recoding, recalculationI Missing value handling

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: organizing work�ow

I Common Table Expressions

I organize processing stepsI partial and intermediate results

Example

WITH source AS (

SELECT *, ROW_NUMBER() OVER () AS rownum

FROM source_table

),

no_missings AS (

SELECT *

FROM source

WHERE field1 IS NOT NULL

AND field2 IS NOT NULL

)

etc.

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)

I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and datamining tools

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)

I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and datamining tools

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)

I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and datamining tools

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)

I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and datamining tools

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratioI att / SUM(att) OVER (PARTITION BY ...)

I In-group numberingI ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and datamining tools

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL

Data Science with PostgreSQL

Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear RegressionI Di�cult for more advanced algorithms

I Better done in specialized language or tool

I PL/RI PL/Python

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear RegressionI Di�cult for more advanced algorithms

I Better done in specialized language or tool

I PL/RI PL/Python

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear RegressionI Di�cult for more advanced algorithms

I Better done in specialized language or tool

I PL/RI PL/Python

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

PL/Python

I Python procedural language available in PostgreSQL

I scikit-learn: Machine learning toolbox for Python

I Classi�cation, regression, clusteringI Model selection, validationI Preprocessing

I matplotlib: Generic and statistical plotting library

I PL/Python is an alternative to PL/R

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

PL/Python

I Python procedural language available in PostgreSQL

I scikit-learn: Machine learning toolbox for Python

I Classi�cation, regression, clusteringI Model selection, validationI Preprocessing

I matplotlib: Generic and statistical plotting library

I PL/Python is an alternative to PL/R

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless

I Split validation

I Cross validation

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless

I Split validation

I Cross validation

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless

I Split validation

I Cross validation

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Modeling

Evaluation results

I �Good result� depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classi�er

I optimize classi�er parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(�saved object�, .GlobalEnv)I PL/Python: Global dictionary �GD�

I Application function in matching language

I Uses existing modelI Returns target variable

I Trigger func or UPDATE uses application function

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(�saved object�, .GlobalEnv)I PL/Python: Global dictionary �GD�

I Application function in matching language

I Uses existing modelI Returns target variable

I Trigger func or UPDATE uses application function

Data Science with PostgreSQL

Data Science with PostgreSQL

Deployment

Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(�saved object�, .GlobalEnv)I PL/Python: Global dictionary �GD�

I Application function in matching language

I Uses existing modelI Returns target variable

I Trigger func or UPDATE uses application function

Data Science with PostgreSQL

Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining

Data Science with PostgreSQL

Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining

Data Science with PostgreSQL

Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining

Data Science with PostgreSQL

Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining

Data Science with PostgreSQL

Summary

Questions?

I Balázs Bárány, <[email protected]>

I https://datascientist.at/