K236: Basis of Data Science
Lecture 2. Data and Databases
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236
1. Introduction to data science 6/9
2. Introduction to data science 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and Examination (the date is not fixed) 7/27
Outline
1. Much more data around us than before
2. Data management
3. Data quality problems
This lecture aims to give you an idea of how data are collected, represented, and organized.
Data collection, representation, organization and inference
Figure: from data (low level of abstraction) to knowledge (high level of abstraction) by generalization (inductive learning).
• How are data collected, represented, and organized?
  - Collection: a sample or all available data
  - Representation: vectors, sequences, lists, graphs, etc.
  - Organization: databases, warehouses, etc.
• Inference
  - Induction vs. deduction
Astronomical data
Astronomy is facing a major data avalanche:
Multi-terabyte sky surveys and archives (soon: multi-petabyte), billions of detected sources, hundreds of measured attributes per source …
Earthquake data
Figure labels: 1932-1996; 04/25/92 Cape Mendocino, CA; Japanese earthquakes 1961-1994
Explosion of biological data
10,267,507,282 bases in 9,092,760 records.
Genomics: 25,000 genes
Proteomics: 2,000,000 proteins
Metabolomics: 3,000 metabolites
What do biological data look like?
A portion of a DNA sequence of 1.6 million characters is given below (about 350 characters, roughly 4,570 times shorter than the full sequence):
…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…
Many other kinds of biological data
Text: huge sources of knowledge
• Approximately 80% of the world's data is held in unstructured formats (source: Oracle Corporation).
• Example: MEDLINE is a source of life sciences and biomedical information, with nearly eleven million records.
  - About 60,000 abstracts on hepatitis
36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63. Pathogenesis of autoimmune hepatitis. Institute of Liver Studies, King's College Hospital, London, United Kingdom.
Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible individuals, …
Web link data
Outline
1. Much more data around us than before
2. Data management
  - Data models
  - Data types
  - Structures of data
  - Various kinds of databases
3. Data quality problems
Data models
• Model: a simplified description or abstraction of reality.
• Data model: a description of data by a set of concepts covering
  - The structure of a database, which typically includes
      * elements (e.g., data types),
      * groups of elements (e.g., entity, record, table), and
      * relationships among such groups.
  - The operations for manipulating these structures, which specify database retrievals and updates:
      * basic model operations (e.g., insert and delete operations)
      * user-defined operations (e.g., compute_student_average_score)
  - Certain constraints (restrictions on valid data) that the database should obey.
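The three ingredients of a data model (structure, operations, constraints) can be sketched with Python's standard sqlite3 module. The Scores table, its columns, and the sample rows are assumptions made only for illustration; they are not part of the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structure: elements (typed columns) grouped into a table.
# Constraint: a score must lie between 0 and 100.
conn.execute(
    "CREATE TABLE Scores ("
    " sid TEXT NOT NULL,"
    " cid TEXT NOT NULL,"
    " score REAL CHECK (score BETWEEN 0 AND 100),"
    " PRIMARY KEY (sid, cid))"
)

# Basic model operations: insert and delete (hypothetical rows).
conn.executemany(
    "INSERT INTO Scores VALUES (?, ?, ?)",
    [("s1", "K236", 80.0), ("s1", "K201", 70.0), ("s2", "K236", 90.0)],
)
conn.execute("DELETE FROM Scores WHERE sid = ? AND cid = ?", ("s2", "K236"))


def compute_student_average_score(connection, sid):
    """User-defined operation built on top of the basic ones."""
    row = connection.execute(
        "SELECT AVG(score) FROM Scores WHERE sid = ?", (sid,)
    ).fetchone()
    return row[0]  # None when the student has no scores


average_s1 = compute_student_average_score(conn, "s1")

# The constraint rejects data outside the valid range.
try:
    conn.execute("INSERT INTO Scores VALUES (?, ?, ?)", ("s3", "K236", 150.0))
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True
```

Here the DBMS, not the application, enforces the constraint, which is one reason constraints are considered part of the data model itself.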
Approaches to data models
• External model (views): describes how users see the data for a particular purpose
  - Course_info(cid: string, enrollment: integer)
• Conceptual model: defines the logical structure*
  - Students(sid: string, name: string, login: string, age: integer, gpa: real)
  - Courses(cid: string, cname: string, credits: integer)
  - Enrolled(sid: string, cid: string, grade: string)
• Internal (physical) model: describes how data is stored in the computer
  - Relations stored as unordered files.
  - Index on the first column of Students.
Figure: View 1 and View 2 (external level), conceptual model (conceptual level), physical model (physical level).
* A conceptual model is an underlying model that is capable of supporting any valid (and perhaps changing) external view that falls within its scope. https://en.wikipedia.org/wiki/Data_model#cite_noteUMW99U3
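A rough sketch of the three levels, using Python's standard sqlite3 module and the Students, Courses, Enrolled, and Course_info schemas from the slide. The index name and the sample enrollment rows are assumptions for illustration, and SQLite's storage is only a stand-in for "unordered files".

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Conceptual level: the logical schema.
CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- Physical level: an index on the first column of Students.
-- It changes how data is accessed, not what the logical schema is.
CREATE INDEX idx_students_sid ON Students (sid);

-- External level: a view that shows the data for one particular purpose.
CREATE VIEW Course_info AS
    SELECT cid, COUNT(sid) AS enrollment
    FROM Enrolled
    GROUP BY cid;
""")

# Hypothetical enrollments.
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [("s1", "K236", "A"), ("s2", "K236", "B"), ("s1", "K201", "A")],
)

course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
```

Users of Course_info never see grades or student details, and the index could be dropped without changing any query result.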
Types of data models
• Flat model: a single, two-dimensional array of data elements.
• Hierarchical model: data are organized into a tree-like structure, implying a single upward link in each record to describe the nesting.
• Network model: two constructs: records contain fields, and sets define one-to-many relationships between records.
• Relational model: a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values.
• Object-relational model: a relational database model in which objects, classes, and inheritance are directly supported in database schemas and in the query language.
• Star schema: the simplest style of data warehouse schema.
Data types
• SYMBOLIC
  - Indexing: e.g., names, tags, case numbers, or serial numbers that identify a respondent or group of respondents.
  - Binary: two values, e.g., YES or NO, SUCCESS or FAILURE, MALE or FEMALE, WHITE or NON-WHITE, FOR or AGAINST, and so on.
  - Boolean: two values, TRUE or FALSE; may also take the value UNKNOWN.
  - Nominal: character-string values (green, blue, red, …)
  - Ordinal: character-string values that are linearly ordered (Small, Middle, Large, …)
• NUMERIC
  - Integer: values are integer numbers.
  - Continuous: values are real numbers.
Why care about data types?
Attribute types (symbols or numbers):
• Symbolic
  - Nominal or categorical (Binary, Boolean): no structure (=, ≠). Examples: Places, Color
  - Ordinal: ordinal structure (=, ≠, ≥). Examples: Rank, Resemblance
• Numerical
  - Measurable: ring structure (=, ≠, ≥, +, ×)
      * Integer: Age, Temperature
      * Continuous: Income, Length
Possible analysis operations (and thus methods and algorithms) depend on data types:
• Symbolic data: combinatorial search in hypothesis spaces (machine learning)
• Numerical data: often matrix-based computation (multivariate data analysis)
Advances: data transformation
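To make the dependence on data types concrete, here is a small Python sketch with made-up values: a nominal attribute supports only equality (so the mode is a valid summary), an ordinal attribute additionally supports ordering (so a median is valid once the order is stated explicitly), and a numeric attribute supports arithmetic (so a mean is valid). The sample lists and the SIZE_ORDER mapping are assumptions for illustration.

```python
from statistics import mean, mode

colors = ["green", "blue", "red", "blue"]                # nominal
sizes = ["Small", "Large", "Middle", "Small", "Large"]   # ordinal
ages = [21, 35, 28, 40]                                  # numeric

# Nominal: only = and != are meaningful, so count equal values.
most_common_color = mode(colors)

# Ordinal: the order must be given explicitly; sorting the strings
# alphabetically would put "Large" before "Small".
SIZE_ORDER = {"Small": 0, "Middle": 1, "Large": 2}
sorted_sizes = sorted(sizes, key=SIZE_ORDER.__getitem__)
median_size = sorted_sizes[len(sorted_sizes) // 2]

# Numeric: addition and multiplication are meaningful, so a mean exists.
mean_age = mean(ages)
```

Computing a mean of colors, or a median of sizes without SIZE_ORDER, would run into exactly the type mismatch this slide warns about.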
Structures of data
• Structured data
  - Can be stored in an SQL database, in tables with rows and columns.
  - Only about 5-10% of all available data.
• Semi-structured data
  - Does not reside in a relational database but has some organizational properties that make it easier to analyze.
  - XML documents and NoSQL database documents are semi-structured.
Figure: articles in a LaTeX database
Structures of data
• Unstructured data
  - Unstructured data represent around 80% of data. They often include text and multimedia content. Examples: e-mail messages, Word documents, videos, photos, audio files, webpages, and many other kinds of business documents.
  - A key issue in data science is representing unstructured data. Example: the DNA sequence “…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC…” can be represented in different ways for computation, such as sliding windows, motifs, kernel functions, etc. Another example is the web link representation.
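As one possible reading of the "sliding windows" representation, the sketch below cuts the DNA fragment from the slide into overlapping substrings of length k and counts them, which turns an unstructured string into a fixed set of countable features. The function name and the choice k = 3 are assumptions for illustration.

```python
from collections import Counter


def sliding_windows(sequence, k):
    """Return all overlapping substrings of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


dna = "TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC"
windows = sliding_windows(dna, 3)

# A bag-of-windows count vector that matrix-based methods can work with.
window_counts = Counter(windows)
```

A sequence of length L yields L - k + 1 windows, so the representation grows linearly with the sequence while the number of distinct windows is bounded by 4^k for DNA.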
Databases
• The most popular format for organizing data in a database is the rectangular table (also called a data array or data matrix).
  - Each row represents the values of all variables on a single multivariate observation.
  - Each column represents the values of a single variable for each observation.
• A typical database table holding n multivariate observations taken on r variables is represented by an (n × r) matrix, with one row per observation and one column per variable.
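A minimal sketch of such a table in plain Python, with one row per observation and one column per variable. The variable names and values are made up for illustration.

```python
variables = ["age", "income"]

# Three hypothetical multivariate observations.
observations = [
    {"age": 21, "income": 27000.0},
    {"age": 35, "income": 41000.0},
    {"age": 52, "income": 36000.0},
]

# Data matrix: each row is one observation, each column one variable.
data_matrix = [[obs[v] for v in variables] for obs in observations]

n = len(data_matrix)      # number of observations (rows)
r = len(data_matrix[0])   # number of variables (columns)

# A column collects the values of a single variable over all observations.
income_column = [row[variables.index("income")] for row in data_matrix]
```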
Elements of database systems
• A database management system (DBMS) is a software system that manages data and provides controlled access to the database.
• A database system (consisting of databases, a DBMS, and application programs) is typically used for managing large quantities of data. It can be regarded as two entities:
  - a server (or backend), which holds the DBMS, and
  - a set of clients (or frontends), each consisting of a hardware and a software component, including application programs.
Big data landscape
Figure: commercial and open-source systems for structured data (RDBMS) and unstructured data (NoSQL DB). Source: Cisco
Structured query language (SQL)
• Users communicate with a DBMS through a declarative query language, typically SQL (Structured Query Language).
• SQL has two main sublanguages:
  - a data definition language (DDL), used by the database administrator to define data structures by creating, altering, or destroying database objects;
  - a data manipulation language (DML), which allows users to retrieve, delete, and update existing data in the database and to add new data to it.
• Examples
  - create table <table name> (<table elements>);
  - select <columns> from <table name> where <condition>;
  - select max(<column>) as max, min(<column>) as min from <table name> where <condition>;
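The statement templates above can be tried out with Python's standard sqlite3 module. This is only a sketch: the customer table is a cut-down, assumed version of a customer relation, and rows C2 and C3 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create table <table name> (<table elements>);
conn.execute(
    "create table customer "
    "(cust_id text primary key, name text, age integer, income real)"
)

# DML: add new data.
conn.executemany(
    "insert into customer values (?, ?, ?, ?)",
    [
        ("C1", "Smith, Sandy", 21, 27000.0),
        ("C2", "Doe, Alex", 35, 41000.0),   # hypothetical row
        ("C3", "Lee, Kim", 52, 36000.0),    # hypothetical row
    ],
)

# DML: select <columns> from <table name> where <condition>;
names_over_30 = [
    row[0]
    for row in conn.execute(
        "select name from customer where age > 30 order by cust_id"
    )
]

# DML: select max(<column>), min(<column>) from <table name> where <condition>;
max_income, min_income = conn.execute(
    "select max(income) as max_income, min(income) as min_income "
    "from customer where age > 30"
).fetchone()

# DML: update and delete existing data.
conn.execute("update customer set income = 28000.0 where cust_id = 'C1'")
conn.execute("delete from customer where cust_id = 'C3'")
remaining = conn.execute("select count(*) from customer").fetchone()[0]
```

The queries say what to retrieve, not how to scan the table; that is the sense in which SQL is declarative.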
Flat model: labeled data
Figure: eight cells, H1-H4 and C1-C4.

ID   color   #nuclei   #tails   status
H1   light   1         1        healthy
H2   dark    1         1        healthy
H3   light   1         2        healthy
H4   light   2         1        healthy
C1   dark    1         2        cancerous
C2   dark    2         1        cancerous
C3   light   2         2        cancerous
C4   dark    2         2        cancerous

Supervised data (labeled)
Descriptive attributes: color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
Class attribute: status: {cancerous, healthy}
Flat model: unlabeled data
Figure: eight cells, H1-H4 and C1-C4.

ID   color   #nuclei   #tails   status
H1   light   1         1        healthy
H2   dark    1         1        healthy
H3   light   1         2        healthy
H4   light   2         1        healthy
C1   dark    1         2        cancerous
C2   dark    2         1        cancerous
C3   light   2         2        cancerous
C4   dark    2         2        cancerous

Unsupervised data (unlabeled)
Descriptive attributes: color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
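The difference between the two flat tables can be sketched in a few lines of Python using the eight cells from the slides. The variable names (X for the descriptive attributes, y for the class attribute) follow a common convention and are an assumption, not notation from the lecture.

```python
COLUMNS = ("id", "color", "nuclei", "tails", "status")
ROWS = [
    ("H1", "light", 1, 1, "healthy"),
    ("H2", "dark", 1, 1, "healthy"),
    ("H3", "light", 1, 2, "healthy"),
    ("H4", "light", 2, 1, "healthy"),
    ("C1", "dark", 1, 2, "cancerous"),
    ("C2", "dark", 2, 1, "cancerous"),
    ("C3", "light", 2, 2, "cancerous"),
    ("C4", "dark", 2, 2, "cancerous"),
]
labeled = [dict(zip(COLUMNS, row)) for row in ROWS]

FEATURES = ("color", "nuclei", "tails")

# Supervised (labeled) data: descriptive attributes plus a class attribute.
X = [tuple(record[f] for f in FEATURES) for record in labeled]
y = [record["status"] for record in labeled]

# Unsupervised (unlabeled) data: the same records without the class attribute.
unlabeled = [
    {key: value for key, value in record.items() if key != "status"}
    for record in labeled
]
```

Classification methods learn a mapping from X to y, whereas clustering methods only ever see the unlabeled records.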
Relational databases
A relational database is a collection of tables, each of which is assigned a unique name and consists of a set of attributes and a set of tuples.

customer
Cust_ID | name | address | age | income | credit_info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item
Item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NIkoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
… | … | … | … | … | … | … | … | …

employee
Emp_ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch
Branch_ID | name | address
B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases
Trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 01/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

items_sold
Trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2
… | … | …

works_at
Empl_ID | branch_ID
E55 | B1
… | …
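Tables in a relational database are linked through shared identifiers such as the transaction and item IDs. The sketch below, using Python's standard sqlite3 module, joins cut-down versions of the item and items_sold tables to list what was sold in transaction T100. The lower-case column names and the readings "high-res-TV" and "multidisc-CDplayer" of the item names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_id TEXT PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE items_sold (trans_id TEXT, item_id TEXT, qty INTEGER);

INSERT INTO item VALUES ('I3', 'high-res-TV', 988.00);
INSERT INTO item VALUES ('I8', 'multidisc-CDplayer', 369.00);
INSERT INTO items_sold VALUES ('T100', 'I3', 1);
INSERT INTO items_sold VALUES ('T100', 'I8', 2);
""")

# Join the two tables on the shared item identifier.
sold_in_t100 = conn.execute(
    """
    SELECT item.name, items_sold.qty
    FROM items_sold
    JOIN item ON items_sold.item_id = item.item_id
    WHERE items_sold.trans_id = ?
    ORDER BY item.item_id
    """,
    ("T100",),
).fetchall()
```

Item names are stored once in item and looked up through item_id, which is why items_sold only needs identifiers and quantities.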
Data warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Figure: data sources in Chicago, New York, Vancouver, and Toronto → clean, transform, integrate, load → data warehouse → query and analysis tools → clients.
Transactional databases
• A transactional database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

Trans_ID | list of item_ID
T100 | beer, cake, onigiri
T200 | beer, cake
T300 | beer, onigiri
T400 | beer, onigiri
T500 | cake
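As a small illustration of what can be computed from such a file, the Python sketch below counts how often items and item pairs occur in the five transactions from the slide. The `support` helper (the fraction of transactions containing an itemset) is an assumption added for illustration; it anticipates the association-rule lectures rather than anything defined here.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": ["beer", "cake", "onigiri"],
    "T200": ["beer", "cake"],
    "T300": ["beer", "onigiri"],
    "T400": ["beer", "onigiri"],
    "T500": ["cake"],
}

# How many transactions contain each item.
item_counts = Counter(
    item for items in transactions.values() for item in set(items)
)

# How many transactions contain each pair of items.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(set(items)), 2)
)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(wanted <= set(items) for items in transactions.values())
    return hits / len(transactions)
```

For example, `support({"beer", "onigiri"})` asks in what fraction of the transactions beer and onigiri were bought together.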
Advanced database systems
• Object-Oriented Databases
• Object-Relational Databases
• Spatial Databases
• Temporal Databases and Time-Series Databases
• Text Databases and Multimedia Databases
• Heterogeneous Databases and Legacy Databases
• The World Wide Web
Spatial databases
• Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
• Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.
Figure: Japanese earthquakes, 1961-1994
Temporal and time-series databases
• These databases store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
• Data analytics finds the characteristics of object evolution and trends of change for objects; e.g., stock exchange data can be mined to uncover trends in investment strategies.
Text and multimedia databases
• Text databases contain documents, usually highly unstructured or semi-structured. Mining them can uncover general descriptions of object classes, keywords, content associations, clustering behavior of text objects, etc.
• Multimedia databases store image, audio, and video data, with applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, speech-based user interfaces, etc.
The World Wide Web
The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints about the regions of the site that were of key interest to its visitors.
Outline
1. Much more data around us than before
2. Data management
3. Data quality problems
Noise, inconsistencies, outliers
Common properties of large real-world databases:
• Incomplete: lacking attribute values or certain attributes of interest
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
No quality data, no quality data mining results!
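The three problems can each be detected with a simple check. The Python sketch below is only illustrative: the records, the plausible age range of 0 to 120, and the set of accepted country names are all assumptions.

```python
records = [
    {"id": "C1", "age": 21, "country": "Canada"},
    {"id": "C2", "age": None, "country": "Canada"},  # incomplete: missing age
    {"id": "C3", "age": 210, "country": "Canada"},   # noisy: implausible age
    {"id": "C4", "age": 35, "country": "CA"},        # inconsistent: other code
]


def find_incomplete(rows):
    """IDs of records with at least one missing attribute value."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def find_noisy(rows, low=0, high=120):
    """IDs of records whose age lies outside an assumed plausible range."""
    return [
        r["id"]
        for r in rows
        if r["age"] is not None and not (low <= r["age"] <= high)
    ]


def find_inconsistent(rows, accepted=frozenset({"Canada"})):
    """IDs of records whose country is not written in the accepted form."""
    return [r["id"] for r in rows if r["country"] not in accepted]


incomplete_ids = find_incomplete(records)
noisy_ids = find_noisy(records)
inconsistent_ids = find_inconsistent(records)
```

Detection is the easy half; deciding whether to fill in, correct, or drop the flagged records is the subject of data preprocessing.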
KDD nuggets
www.kdnuggets.com is a website of the data mining community.
Homework for K236-L2
• Carefully study the slides. You can consult the book chapter “Data and Databases” provided on the website. Raise questions about anything you have not yet clearly understood.
• Choose 4 datasets from www.statsci.org/datasets.html and summarize each of them (the area where the data were collected, data type, number of features and objects, etc.). The datasets you select must relate to different kinds of data (categorical, ordinal, integer, real number, etc.) and different data representations (vector, sequence, list, graph, etc.).
• The report for this homework must be submitted no later than one week after the class (June 23).