chap2 تنقيب.ppt

© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004

Data Mining: Introduction

Lecture 2

Introduction to Data Mining by

Tan, Steinbach, Kumar


Data Mining Tasks...

Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]


Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification consists of predicting a certain outcome based on a given input. To predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the algorithm is given a data set not seen before, called the prediction set, which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces a prediction. The prediction accuracy indicates how good the algorithm is.
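The train/test procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the records, the helper names `split_records`, `learn_majority_model`, and `accuracy`, the 70/30 split, and the majority-class "model" are all hypothetical and were chosen to keep the example short; the slides do not prescribe any of them.

```python
import random
from collections import Counter

def split_records(records, train_fraction=0.7, seed=0):
    """Shuffle the records and divide them into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def learn_majority_model(training_set):
    """Placeholder model (an assumption): always predict the most common training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical records: (attributes, class) pairs.
records = [({"income": "high"}, "no")] * 6 + [({"income": "low"}, "yes")] * 4
training_set, test_set = split_records(records)
model = learn_majority_model(training_set)
print(f"accuracy on unseen records: {accuracy(model, test_set):.2f}")
```

The model never sees the test records while it is being built, so the reported accuracy estimates how well it handles previously unseen records.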


Classification Example

Figure: the Training Set is used to Learn a Classifier (the Model), which is then applied to the Test Set.

Training Set (Features: age, income, student, credit_rating; Classes: buys_computer)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes

Test Set (buys_computer not given)

age     income   student   credit_rating
31…40   medium   no        excellent
31…40   high     yes       fair
>40     medium   no        excellent
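As a rough sketch of the "learn classifier, then apply the model" flow, the labeled rows of the table can be fed to a small learner and the resulting model applied to the three unlabeled rows. The slide does not name a learning algorithm; naive Bayes with add-one smoothing is used here purely as a stand-in, and the function names `learn_classifier` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

FEATURES = ("age", "income", "student", "credit_rating")

# The 11 labeled rows from the table: four feature values, then the class.
TRAINING_SET = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
]

# The 3 rows whose class is not given.
TEST_SET = [
    ("31…40", "medium", "no", "excellent"),
    ("31…40", "high", "yes", "fair"),
    (">40", "medium", "no", "excellent"),
]

def learn_classifier(rows):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    distinct_values = defaultdict(set)   # feature index -> values seen
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            value_counts[(i, label)][value] += 1
            distinct_values[i].add(value)
    return class_counts, value_counts, distinct_values

def predict(model, features):
    """Return the class with the highest smoothed naive Bayes score."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(features):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score *= (value_counts[(i, label)][value] + 1) / (count + len(distinct_values[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = learn_classifier(TRAINING_SET)
for features in TEST_SET:
    print(dict(zip(FEATURES, features)), "->", predict(model, features))
```

A different learner (a decision tree, for instance) could produce different predictions for the same three rows; the point is only the separation between learning from labeled rows and predicting for unlabeled ones.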


Classification: Application 1

Direct Marketing
– Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
  Use the data for a similar product introduced before.
  We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  Collect various demographic, lifestyle, and company-interaction related information about all such customers.
  – Type of business, where they stay, how much they earn, etc.
  Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 2

Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  Use credit card transactions and the information on the account holder as attributes.
  – When does a customer buy, what does he buy, how often does he pay on time, etc.
  Label past transactions as fraud or fair transactions. This forms the class attribute.
  Learn a model for the class of the transactions.
  Use this model to detect fraud by observing credit card transactions on an account.


Classification: Application 3

Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
  Use a detailed record of transactions with each of the past and present customers to find attributes.
  – How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  Label the customers as loyal or disloyal.
  Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.


Euclidean Distance (the distance between two points)

Figure: two points, (7, 4) and (2, 2), labeled a and b, plotted against the X axis and the Y axis.

\[
\begin{aligned}
\mathrm{Distance}(a, b) &= \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \\
&= \sqrt{(2 - 7)^2 + (2 - 4)^2} \\
&= \sqrt{(-5)^2 + (-2)^2} \\
&= \sqrt{25 + 4} = \sqrt{29} \approx 5.39
\end{aligned}
\]
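The same calculation can be checked with a few lines of Python. The function name `euclidean_distance` is hypothetical, and which point is called a and which b is an assumption that does not affect the result, since the distance is symmetric.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

a = (7, 4)
b = (2, 2)
print(round(euclidean_distance(a, b), 2))  # sqrt(29), about 5.39
```

Because the function sums over all coordinate pairs, it works unchanged for points with three or more attributes.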


Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized

Intercluster distances are maximized
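One way to make the two quantities concrete is to measure them for a clustering that is already given. The sketch below uses made-up 3-D points and hypothetical helper names; averaging over pairs is just one of several reasonable ways to summarize intracluster and intercluster distance, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations, product

# Hypothetical 3-D points, already assigned to two clusters.
cluster_1 = [(1.0, 1.0, 1.0), (1.5, 1.2, 0.8), (0.9, 1.4, 1.1)]
cluster_2 = [(8.0, 8.0, 8.0), (8.3, 7.6, 8.2), (7.7, 8.1, 7.9)]

def mean_intracluster_distance(cluster):
    """Average Euclidean distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_intercluster_distance(first, second):
    """Average Euclidean distance over all pairs with one point from each cluster."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

intra = max(mean_intracluster_distance(cluster_1), mean_intracluster_distance(cluster_2))
inter = mean_intercluster_distance(cluster_1, cluster_2)
print(f"largest mean intracluster distance: {intra:.2f}")
print(f"mean intercluster distance: {inter:.2f}")
```

A good clustering of these points keeps the first number small relative to the second.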


Clustering: Application 2

Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
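A similarity measure built from term frequencies might look like the following sketch. The slide does not say which measure is meant; cosine similarity over raw word counts is an assumption made here, and the three documents and the function names are invented for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two term-frequency vectors (0 means no shared terms)."""
    dot = sum(freq_a[term] * freq_b[term] for term in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count ** 2 for count in freq_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
doc_1 = term_frequencies("stock market prices fall as market fears grow")
doc_2 = term_frequencies("market prices rise and stock traders cheer")
doc_3 = term_frequencies("home team wins the final match")

print(cosine_similarity(doc_1, doc_2))  # the two documents share several terms
print(cosine_similarity(doc_1, doc_3))  # the two documents share no terms
```

A clustering algorithm would then group documents whose pairwise similarity is high; a practical system would normally also filter very common words before counting.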


Illustrating Document Clustering

Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).

Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
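The counts in the table can be turned into per-category and overall fractions of correctly placed articles. The short script below only restates the table's numbers; the variable names are hypothetical.

```python
# (category, total articles, correctly placed), copied from the table.
results = [
    ("Financial", 555, 364),
    ("Foreign", 341, 260),
    ("National", 273, 36),
    ("Metro", 943, 746),
    ("Sports", 738, 573),
    ("Entertainment", 354, 278),
]

for category, total, correct in results:
    print(f"{category:<14}{correct / total:6.1%}")

total_articles = sum(total for _, total, _ in results)
total_correct = sum(correct for _, _, correct in results)
print(f"{'Overall':<14}{total_correct / total_articles:6.1%}")
```

The category totals add up to the 3204 articles stated on the slide, and 2257 of them are correctly placed.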


Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection;
– Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Tea, Bread
3     Tea, Coke, Tissues, Milk
4     Tea, Bread, Tissues, Milk
5     Coke, Tissues, Milk

Rules Discovered:
{Milk} --> {Coke} (Milk determines Coke; that is, Coke depends on Milk)
{Tissues, Milk} --> {Tea} (Tissues and Milk determine Tea; that is, Tea depends on Milk and Tissues)
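How strongly the five transactions back each rule can be checked by counting. Support and confidence are the usual measures for association rules, but this slide does not define them, so their use here is an assumption; the function names are hypothetical.

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Tea", "Bread"},
    3: {"Tea", "Coke", "Tissues", "Milk"},
    4: {"Tea", "Bread", "Tissues", "Milk"},
    5: {"Coke", "Tissues", "Milk"},
}

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

for antecedent, consequent in [({"Milk"}, {"Coke"}), ({"Tissues", "Milk"}, {"Tea"})]:
    print(sorted(antecedent), "-->", sorted(consequent),
          f"support={support(antecedent | consequent):.2f}",
          f"confidence={confidence(antecedent, consequent):.2f}")
```

Milk appears in four transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} holds in three out of four cases; {Tissues, Milk} --> {Tea} holds in two out of three.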


Association Rule Discovery: Application 1

Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!


Association Rule Discovery: Application 2

Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many customers.
– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
– A classic rule
  If a customer buys diapers and milk, then he is very likely to buy beer.
  So, don’t be surprised if you find six-packs stacked next to diapers!


Sequential Pattern Discovery: Definition

Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

Figure: the pattern (A B) (C) (D E), annotated with the timing constraints <= xg, > ng, <= ws, and <= ms.


Sequential Pattern Discovery: Examples

Applications of sequential pattern mining
– Customer shopping sequences:
  First buy computer, then CD-ROM, and then digital camera, within 3 months.
  Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
– Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
– Telephone calling patterns, Weblog click streams
– DNA sequences and gene structures
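To test whether one customer's history follows a sequential pattern such as the athletic apparel example, each pattern element has to appear inside some event, in the same order. The sketch below treats the rule's antecedent and consequent as one ordered pattern and ignores timing constraints such as "within 3 months"; both simplifications, the function name `contains_pattern`, and the purchase histories are assumptions made for illustration.

```python
def contains_pattern(sequence, pattern):
    """True if each pattern element is a subset of some event, in the same order."""
    position = 0
    for element in pattern:
        # Advance to the first remaining event that contains this element.
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next element must match a strictly later event
    return True

pattern = [{"Shoes"}, {"Racket", "Racketball"}, {"Sports_Jacket"}]

# Hypothetical purchase histories: one set of items per store visit, oldest first.
customer_1 = [{"Shoes", "Socks"}, {"Racket", "Racketball"}, {"Water_Bottle"}, {"Sports_Jacket"}]
customer_2 = [{"Sports_Jacket"}, {"Shoes"}, {"Racket", "Racketball"}]

print(contains_pattern(customer_1, pattern))  # True
print(contains_pattern(customer_2, pattern))  # False
```

The second customer bought the same items but in a different order, so the pattern does not match. Counting how many customers match a pattern is the basis for deciding whether it is frequent enough to report.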


Regression

Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency (for example, a value y that depends on the input value x).
Extensively studied in statistics and in the neural network field.
Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
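For the linear case, the sales example can be sketched with ordinary least squares on a single input variable. The data below are invented and lie exactly on a line so the result is easy to verify by hand; the function name `fit_line` is hypothetical, and real sales data would not fit this cleanly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales, in the same units.
advertising = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]  # exactly 2 * advertising + 5

slope, intercept = fit_line(advertising, sales)
predicted = slope * 60.0 + intercept  # predicted sales for an expenditure of 60
print(slope, intercept, predicted)
```

The fitted line recovers the slope and intercept used to generate the data and extends them to an expenditure that was not observed.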


Deviation/Anomaly Detection

Detect significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
  Typical network traffic at University level may reach over 100 million connections per day
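A very simple way to flag a significant deviation is to compare each observation with the mean and standard deviation of the whole series. This z-score rule is an assumption chosen for brevity, not a method named on the slide; the threshold of 2, the function name `find_deviations`, and the daily connection counts are all hypothetical.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily connection counts, in millions; one day is far above the rest.
daily_connections = [98, 101, 99, 102, 100, 97, 103, 100, 180, 99]
print(find_deviations(daily_connections))
```

Real intrusion or fraud detection uses many attributes per event rather than a single daily total, but the idea is the same: model normal behavior, then report what falls far outside it.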
