luis-goldster
From Data to Wisdom
• Data
  - The raw material of information
• Information
  - Data organized and presented by someone
• Knowledge
  - Information read, heard or seen and understood and integrated
• Wisdom
  - Distilled knowledge and understanding which can lead to decisions
[Figure: The Information Hierarchy, from top to bottom: Wisdom, Knowledge, Information, Data]
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, images, video, documents
    - Internet …
Source: Intel
How much data?
• Google: ~20-30 PB a day
• Wayback Machine: ~4 PB + 100-200 TB/month
• Facebook: ~3 PB of user data + 25 TB/day
• eBay: ~7 PB of user data + 50 TB/day
• CERN’s Large Hadron Collider generates 15 PB a year
• In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB
640K ought to be enough for anybody.
Big Data Growing
The Untapped Data Gap: Most of the useful data will not be tagged or analyzed, partly due to a skill shortage.
IDC predicts: From 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB per person in 2020.
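As a rough check on the IDC figures quoted above, the implied doubling time and the implied world population can be computed directly. This is a back-of-the-envelope sketch: the decimal conversion (1 EB = 10^9 GB) follows the deck's own "7 exabytes = 7,000,000,000 GB", and all variable names are illustrative.

```python
import math

START_EB, END_EB = 130, 40_000   # digital universe in 2005 and 2020 (figures quoted above)
YEARS = 2020 - 2005
GB_PER_EB = 10**9                # decimal units, matching the deck's exabyte-to-GB conversion
GB_PER_PERSON = 5_200            # per-person figure quoted above

growth_factor = END_EB / START_EB
doublings = math.log2(growth_factor)        # how many times the total doubled
doubling_time_years = YEARS / doublings     # average time per doubling

implied_population = END_EB * GB_PER_EB / GB_PER_PERSON

print(f"growth factor: {growth_factor:.0f}x")
print(f"implied doubling time: {doubling_time_years:.2f} years")
print(f"implied population: {implied_population / 1e9:.1f} billion")
```

The implied doubling time comes out a little under two years, which fits "double every 2 years" read as a rounded statement.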
What Is Data Mining?
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Data Mining: A Definition
• The non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data in large data repositories
  - Non-trivial: obvious knowledge is not useful
  - Implicit: hidden, difficult-to-observe knowledge
  - Previously unknown
  - Potentially useful: actionable; easy to understand
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining as a confluence of Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, and Database Technology]
Data Mining’s Virtuous Cycle
1. Identifying the problem
2. Mining data to transform it into actionable information
3. Acting on the information
4. Measuring the results
The Knowledge Discovery Process
• Data Mining vs. Knowledge Discovery in Databases (KDD)
  - DM and KDD are often used interchangeably
  - Actually, DM is only part of the KDD process
[Figure: The KDD Process]
Types of Knowledge Discovery
• Two kinds of knowledge discovery: directed and undirected
• Directed Knowledge Discovery
  - Purpose: explain the value of some field in terms of all the others (goal-oriented)
  - Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
  - Examples:
    - which products show increased sales when cream cheese is discounted
    - which banner ad to use on a web page for a given user coming to the site
• Undirected Knowledge Discovery
  - Purpose: find patterns in the data that may be interesting (no target field)
  - Method: clustering, affinity grouping
  - Examples:
    - which products in the catalog often sell together
    - market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
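The contrast between directed and undirected discovery can be made concrete with a small sketch. The records, field names, and the two techniques chosen here (a 1-nearest-neighbor classifier for the directed case, a two-cluster k-means for the undirected case) are illustrative assumptions, not methods prescribed by the slides.

```python
# Hypothetical toy records: (visits_per_month, avg_basket_value, clicked_ad)
records = [
    (2, 15.0, "no"), (3, 18.0, "no"), (1, 12.0, "no"),
    (12, 80.0, "yes"), (15, 95.0, "yes"), (11, 70.0, "yes"),
]

def sq_dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Directed: a target field (clicked_ad) is chosen up front, and the
# algorithm predicts it for a new instance from the nearest labeled record.
def predict_1nn(new_point, labeled):
    nearest = min(labeled, key=lambda r: sq_dist(r[:2], new_point))
    return nearest[2]

# Undirected: no target field; look for groups of similar records
# (k-means with k=2 and a simple deterministic initialization).
def two_means(points, iterations=10):
    centroids = [min(points), max(points)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            idx = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

prediction = predict_1nn((10, 65.0), records)
segments = two_means([r[:2] for r in records])
print(prediction, [len(s) for s in segments])
```

The directed function needs the `clicked_ad` labels to work at all, while `two_means` never looks at them; that is the practical difference between having a target field and not having one.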
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
  - Object-relational databases, heterogeneous databases and legacy databases
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and information networks
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web
Data Mining: What Kind of Data?
• Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process, e.g., SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category
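The SQL aggregation mentioned above can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch: the `Items` schema and the sample rows are assumptions made for illustration, the string literal `'video'` is quoted as SQL requires, and `category` is added to the SELECT list so each count is labeled.

```python
import sqlite3

# In-memory database with a hypothetical Items table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("item A", "video", "sci-fi"),
        ("item B", "video", "sci-fi"),
        ("item C", "video", "comedy"),
        ("item D", "book", "sci-fi"),
    ],
)

# Count video items per category; ORDER BY makes the row order deterministic
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
conn.close()

print(rows)
```

Pushing this kind of counting into the database keeps the selection and aggregation steps close to the data, leaving only the summarized result for later mining steps.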
Data Mining: What Kind of Data?
• Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
• Transactional databases
  - set of records, each with a transaction id, time stamp, and a set of items
  - may have an associated “description” file for the items
  - typical source of data used in market basket analysis
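To show how transactional records of this shape feed market basket analysis, the sketch below counts item pairs and derives two common measures, support and confidence. The transactions, item names, and helper functions are hypothetical; the slides do not specify a particular algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: (transaction id, time stamp, set of items)
transactions = [
    (1, "2024-01-05T09:12", {"bagels", "cream cheese", "coffee"}),
    (2, "2024-01-05T09:40", {"bagels", "cream cheese"}),
    (3, "2024-01-05T10:02", {"coffee", "milk"}),
    (4, "2024-01-05T10:15", {"bagels", "coffee"}),
]

item_counts = Counter()
pair_counts = Counter()
for _tid, _timestamp, items in transactions:
    item_counts.update(items)
    # Sort so each pair is always stored in the same order
    pair_counts.update(combinations(sorted(items), 2))

def support(item_a, item_b):
    """Fraction of all transactions that contain both items."""
    pair = tuple(sorted((item_a, item_b)))
    return pair_counts[pair] / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

print(support("bagels", "cream cheese"))
print(confidence("cream cheese", "bagels"))
print(confidence("bagels", "cream cheese"))
```

Confidence is directional: in this toy data every basket with cream cheese also has bagels, but not every basket with bagels has cream cheese.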
Data Mining: What Kind of Data?
• Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps, or satellite imaging data, etc.)
  - time series / temporal data (time-dependent information such as stock market data; usually very dynamic)
• World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    - information retrieval, filtering and extraction
    - agents to assist in browsing and filtering
    - Web content, usage, and structure (linkage) mining tools
  - The “social Web”
    - User-generated meta-data, social networks, shared resources, etc.
What Can Data Mining Do?
• Many Data Mining Tasks
  - often inter-related
  - often need to try different techniques/algorithms for each task
  - each task may require different types of knowledge discovery
• What are some data mining tasks?
  - Classification
  - Prediction
  - Clustering
  - Affinity Grouping / Association discovery
  - Sequence Analysis
  - Characterization
  - Discrimination
Some Applications of Data Mining
• Business data analysis and decision support
  - Marketing focalization
    - Recognizing specific market segments that respond to particular characteristics
    - Return on mailing campaign (target marketing)
  - Customer Profiling
    - Segmentation of customers for marketing strategies and/or product offerings
    - Customer behavior understanding
    - Customer retention and loyalty
    - Mass customization / personalization
Some Applications of Data Mining
• Business data analysis and decision support (cont.)
  - Market analysis and management
    - Provide summary information for decision-making
    - Market basket analysis, cross selling, market segmentation
    - Resource planning
  - Risk analysis and management
    - "What if" analysis
    - Forecasting
    - Pricing analysis, competitive analysis
    - Time-series analysis (e.g., stock market)
Some Applications of Data Mining
• Fraud detection
  - Detecting telephone fraud:
    - Telephone call model: destination of the call, duration, time of day or week
    - Analyze patterns that deviate from an expected norm
    - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud scheme
  - Detection of credit-card fraud
  - Detecting suspicious money transactions (money laundering)
• Text mining:
  - Message filtering (e-mail, newsgroups, etc.)
  - Newspaper article analysis
  - Text and document categorization
• Web Mining
  - Mining patterns from the content, usage, and structure of Web resources
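The fraud-detection idea of analyzing "patterns that deviate from an expected norm" can be sketched with a simple z-score test on call duration. This is only a stand-in chosen for illustration, not the method used in the British Telecom case; the call history, the threshold of 3 standard deviations, and the function name are assumptions.

```python
from statistics import mean, stdev

# Hypothetical call durations (minutes) for one subscriber: the "expected norm"
history = [3.1, 2.8, 4.0, 3.5, 2.9, 3.3, 3.7, 3.0]

def is_anomalous(duration, past, threshold=3.0):
    """Flag a call whose duration lies more than `threshold` sample
    standard deviations away from the mean of past calls."""
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        # All past calls identical: any different duration deviates
        return duration != mu
    return abs(duration - mu) / sigma > threshold

print(is_anomalous(3.4, history))   # a typical call
print(is_anomalous(45.0, history))  # far outside the usual range
```

A practical call model would combine several fields from the slide (destination, duration, time of day or week) rather than a single number, but the principle of scoring distance from a learned baseline is the same.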
Types of Web Mining
[Figure: Web Mining divided into Web Content Mining, Web Structure Mining, and Web Usage Mining]
Web Content Mining applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools
Types of Web Mining (cont.)
Web Usage Mining applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems
Types of Web Mining (cont.)
Web Structure Mining applications:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
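The discovery of "hubs" and "authorities" is commonly associated with Kleinberg's HITS algorithm, although the slides do not name it. The sketch below is a minimal version under that assumption: a good hub links to good authorities, and a good authority is linked to by good hubs. The link graph and page names are hypothetical.

```python
# Hypothetical link graph: page -> pages it links to
links = {
    "portal": ["paper1", "paper2", "paper3"],
    "blog": ["paper1", "paper2"],
    "paper1": [],
    "paper2": ["paper1"],
    "paper3": [],
}

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(graph, iterations=50):
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page
        auth = normalize({p: sum(hub[q] for q in pages if p in graph[q]) for p in pages})
        # Hub: sum of authority scores of the pages this page links to
        hub = normalize({p: sum(auth[q] for q in graph[p]) for p in pages})
    return hub, auth

hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print(best_hub, best_authority)
```

In this toy graph the page with the most outgoing links to well-cited pages ends up as the strongest hub, and the most linked-to page as the strongest authority.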