JAYARAJ ANNAPACKIAM CSI COLLEGE OF ENGINEERING Department of Information Technology IT6702 DATA WAREHOUSING AND DATA MINING Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV IT A & B / VII Regulation: 2013 Academic year: 2017 - 2018 JACSICE


UNIT-I
IT6702 DATA WAREHOUSING AND DATA MINING

TWO MARKS WITH ANSWERS

DATA WAREHOUSING

1. What are the uses of multifeature cubes? (Nov/Dec 2007)

Multifeature cubes compute complex queries involving multiple dependent aggregates at multiple granularities. These cubes are very useful in practice. Many complex data mining queries can be answered by multifeature cubes without any significant increase in computational cost, in comparison to cube computation for simple queries with standard data cubes.
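A "dependent aggregate" is one that is computed over tuples chosen by an earlier aggregate. The sketch below is illustrative only: the sales tuples, the column layout, and the function name `max_price_sales` are assumptions, not part of the syllabus. For each group it first finds the maximum price, then totals sales over the max-price tuples only, and it runs the same query at two granularities.

```python
from collections import defaultdict

# Hypothetical sales tuples: (item, region, price, sales)
rows = [
    ("tv", "east", 500, 10),
    ("tv", "east", 700, 4),
    ("tv", "east", 700, 6),
    ("tv", "west", 650, 3),
    ("pc", "east", 900, 2),
    ("pc", "east", 900, 5),
    ("pc", "east", 800, 7),
]

def max_price_sales(rows, key):
    """For each group, return (max price, total sales among max-price tuples)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    result = {}
    for group, members in groups.items():
        top = max(price for _, _, price, _ in members)        # first aggregate
        total = sum(s for _, _, p, s in members if p == top)  # depends on the first
        result[group] = (top, total)
    return result

# The same dependent-aggregate query at two granularities.
by_item_region = max_price_sales(rows, key=lambda r: (r[0], r[1]))
by_item = max_price_sales(rows, key=lambda r: r[0])
print(by_item_region)
print(by_item)
```

A real multifeature cube would evaluate this for every subset of the grouping dimensions; two granularities are enough to show the idea.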

2. Compare OLTP and OLAP systems. (Apr/May 2008), (May/June 2010)

An on-line operational database system that is used for efficient retrieval, efficient storage, and management of large amounts of data is said to be an on-line transaction processing (OLTP) system. Data warehouse systems serve users (or) knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats. These systems are known as on-line analytical processing (OLAP) systems.

3. What is data warehouse metadata? (Apr/May 2008)

Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time-stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.

4. Explain the differences between star and snowflake schema. (Nov/Dec 2008)

In a star schema, each dimension is represented by a single dimension table. In the snowflake schema model, the dimension tables may be kept in normalized form to reduce redundancies. Such tables are easy to maintain and save storage space.

5. In the context of data warehousing, what is data transformation? (May/June 2009)

In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following: smoothing, aggregation, generalization, normalization, and attribute construction.
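Of the operations listed above, normalization is the easiest to show in a few lines. The answer does not name a particular method, so the sketch below assumes min-max normalization, one common choice, and rescales a hypothetical income attribute into the range [0, 1]. The function name and the sample values are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 73600, 98000]   # hypothetical attribute values
normalized = min_max_normalize(incomes)
print(normalized)
```

After this step, attributes measured on very different scales contribute comparably to distance-based mining methods.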

6. Define slice and dice operations. (May/June 2009)

The slice operation performs a selection on one dimension of the cube, resulting in a subcube. The dice operation defines a subcube by performing a selection on two (or) more dimensions.
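The two operations can be sketched on a tiny cube held as a Python dictionary. The dimension names, cell values, and the helper `select` are assumptions made for illustration; a real OLAP engine stores and indexes cubes very differently.

```python
# Hypothetical cube cells keyed by (time, item, location) with a sales measure.
DIMS = ("time", "item", "location")
cube = {
    ("Q1", "phone", "Chennai"): 120,
    ("Q1", "laptop", "Chennai"): 80,
    ("Q1", "phone", "Madurai"): 60,
    ("Q2", "phone", "Chennai"): 150,
    ("Q2", "laptop", "Madurai"): 40,
}

def select(cube, **criteria):
    """Keep the cells whose coordinate is in the allowed set for each named dimension."""
    def keep(coords):
        return all(coords[DIMS.index(dim)] in allowed
                   for dim, allowed in criteria.items())
    return {coords: value for coords, value in cube.items() if keep(coords)}

# Slice: selection on ONE dimension (time = Q1).
slice_q1 = select(cube, time={"Q1"})

# Dice: selection on TWO OR MORE dimensions.
dice = select(cube, time={"Q1", "Q2"}, item={"phone"}, location={"Chennai"})

print(slice_q1)
print(dice)
```

Both results are themselves subcubes, so further slices, dices, or aggregations can be applied to them.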

7. List the characteristics of a data warehouse. (Nov/Dec 2009)

There are four key characteristics which separate the data warehouse from other major operational systems:
1. Subject orientation: Data organized by subject
2. Integration: Consistency of defining parameters
3. Non-volatility: Stable data storage medium
4. Time-variance: Timeliness of data and access terms

8. What are the various sources for data warehouse? (Nov/Dec 2009)

Handling of relational and complex types of data: Because relational databases and data warehouses are widely used, the development of efficient and effective data mining systems for such data is important.

Mining information from heterogeneous databases and global information systems: Local- and wide-area computer networks (such as the Internet) connect many sources of data, forming huge, distributed, and heterogeneous databases.

9. What is bitmap indexing? (Nov/Dec 2009)

The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes. The bitmap index is an alternative representation of the record ID (RID) list.
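A minimal sketch of the idea, using Python integers as bit vectors: for each distinct value of an attribute, bit i is set when record i holds that value, so the bitmap stands in for the RID list. The column contents and function names below are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if record i has that value."""
    index = {}
    for rid, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << rid)
    return index

def rids(bitmap):
    """Decode a bitmap back into the list of matching record IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical base table with two attributes, one entry per record.
item = ["H", "C", "P", "S", "H", "C", "P", "S"]
city = ["V", "V", "V", "V", "T", "T", "T", "T"]

item_idx = build_bitmap_index(item)
city_idx = build_bitmap_index(city)

# The condition item = 'H' AND city = 'T' becomes a single bitwise AND.
match = item_idx["H"] & city_idx["T"]
print(rids(match))
```

Because comparisons and joins reduce to bit arithmetic, this kind of index works best on attributes with few distinct values.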

10. What is data warehouse? (May/June 2010)

A data warehouse is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making. (or) A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.

11. Differentiate fact table and dimension table. (May/June 2010)

A fact table contains the names of the facts (or) measures as well as keys to each of the related dimension tables.

A dimension table is used for describing a dimension. (e.g.) A dimension table for item may contain the attributes item_name, brand, and type.

12. Briefly discuss the schemas for multidimensional databases. (May/June 2010)

Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.

Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.

Fact constellations: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
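The star schema described above can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows are assumptions chosen for illustration: one central fact table holds the measures and foreign keys, and one table per dimension holds the descriptive attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, holding descriptive attributes.
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: measures plus a key into each dimension table.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_item VALUES (1, 'Phone X', 'Acme', 'phone'), (2, 'Laptop Y', 'Acme', 'computer');
INSERT INTO dim_location VALUES (1, 'Chennai'), (2, 'Madurai');
INSERT INTO fact_sales VALUES (1, 1, 10, 5000.0), (1, 2, 4, 2000.0), (2, 1, 3, 4500.0);
""")

# A typical star join: facts joined to their dimensions, then aggregated.
rows = conn.execute("""
    SELECT i.type, l.city, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_key = f.item_key
    JOIN dim_location AS l ON l.location_key = f.location_key
    GROUP BY i.type, l.city
    ORDER BY i.type, l.city
""").fetchall()
print(rows)
```

Turning this into a snowflake schema would mean splitting a dimension further, for example moving `brand` out of `dim_item` into its own table referenced by key.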

13. How is a data warehouse different from a database? How are they similar? (Nov/Dec 2007, Nov/Dec 2010)

A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. Both are used to store and manipulate data.

14. What is descriptive and predictive data mining? (Nov/Dec 2010)

Descriptive data mining describes data in a concise and summarative manner and presents interesting general properties of the data.

Predictive data mining analyzes data in order to construct one or a set of models, and attempts to predict the behavior of new data sets. Examples of predictive data mining include classification, regression analysis, and trend analysis.

15. List out the functions of OLAP servers in the data warehouse architecture. (Nov/Dec 2010)

The OLAP server performs multidimensional queries of data and stores the results in its multidimensional storage. It speeds the analysis of fact tables into cubes, stores the cubes until needed, and then quickly returns the data to clients.
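As a toy illustration of "compute the cubes once, store them, then answer quickly", the sketch below precomputes the total sales for every combination of dimensions and serves later queries by dictionary lookup. The fact rows, dimension names, and the functions `materialize` and `query` are hypothetical; real OLAP servers use far more selective materialization strategies.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("item", "city", "quarter")
facts = [  # hypothetical fact table rows
    {"item": "phone", "city": "Chennai", "quarter": "Q1", "sales": 120},
    {"item": "phone", "city": "Madurai", "quarter": "Q1", "sales": 60},
    {"item": "laptop", "city": "Chennai", "quarter": "Q1", "sales": 80},
    {"item": "phone", "city": "Chennai", "quarter": "Q2", "sales": 150},
]

def materialize(facts):
    """Precompute SUM(sales) for every subset of the dimensions (every cuboid)."""
    cube = {}
    for k in range(len(DIMS) + 1):
        for dims in combinations(DIMS, k):
            cuboid = defaultdict(int)
            for row in facts:
                cuboid[tuple(row[d] for d in dims)] += row["sales"]
            cube[dims] = dict(cuboid)
    return cube

def query(cube, **fixed):
    """Answer a total by lookup in the stored cuboid, without rescanning the facts."""
    dims = tuple(d for d in DIMS if d in fixed)
    return cube[dims].get(tuple(fixed[d] for d in dims), 0)

cube = materialize(facts)  # done once, then kept until needed

print(query(cube, item="phone"))
print(query(cube, city="Chennai", quarter="Q1"))
print(query(cube))  # grand total
```

With three dimensions there are 2^3 = 8 cuboids; the count doubles with each added dimension, which is why full materialization is rarely practical at scale.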


16. Differentiate data mining and data warehousing. (Nov/Dec 2011)

Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data.”

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount.

17. What do you understand about knowledge discovery? (Nov/Dec 2011)

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
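The seven steps can be traced end to end on a toy example. Everything in the sketch below is an assumption made for illustration: the two transaction sources, the cleaning rule, the use of co-occurring item pairs as the "pattern", and the support threshold. It is meant only to show how the steps feed one another, not how a production KDD system is built.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw transactions from two sources; None and "" stand for noise.
source_a = [["Milk", "bread "], ["milk", None, "butter"]]
source_b = [["bread", "milk", ""], ["butter"], ["Bread", "Milk"]]

def clean(transaction):
    """Step 1: drop empty entries and normalize case and whitespace."""
    return [x.strip().lower() for x in transaction if x and x.strip()]

cleaned_a = [clean(t) for t in source_a]              # 1. data cleaning
cleaned_b = [clean(t) for t in source_b]
integrated = cleaned_a + cleaned_b                    # 2. data integration
selected = [t for t in integrated if len(t) >= 2]     # 3. data selection
transformed = [frozenset(t) for t in selected]        # 4. data transformation

pair_counts = Counter(                                # 5. data mining
    pair for t in transformed for pair in combinations(sorted(t), 2)
)

min_support = 2                                       # 6. pattern evaluation
interesting = {p: c for p, c in pair_counts.items() if c >= min_support}

for pair, count in sorted(interesting.items()):       # 7. knowledge presentation
    print(f"{pair[0]} + {pair[1]}: appears in {count} transactions")
```

In practice the process is iterative: a poor result at step 6 usually sends the analyst back to cleaning, selection, or transformation.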

PART B

1. Write in detail about the architecture and implementation of the data warehouse. (Nov/Dec ‘07) (OR) Diagrammatically illustrate and discuss the three-tier data warehousing architecture. (May/June 2009) (OR) With a detailed diagram, describe the general architecture of a data warehouse. (Nov/Dec 2010) (OR) Describe the data warehouse architecture with a neat diagram. (May/June 2010)

2. List and discuss the major features of a data warehouse. (May/June 2009) (OR)

3. Discuss the various types of warehouse schema with suitable examples. (Nov/Dec ‘09) (OR) What do you understand about database schemas? Explain. (Nov/Dec 2011)

4. Describe OLAP operations in the multidimensional data model. (Nov/Dec 2011)

5. Explain the types of OLAP server in detail. (Nov/Dec 2009)

6. Enumerate the building blocks of a data warehouse. Explain the importance of metadata in a data warehouse environment. What are the challenges in metadata management? (Nov/Dec ‘08)

7. Compare and contrast the data warehouse and operational DB with various features. (Nov/Dec 2011)

8. Explain in detail about the different kinds of data on which data mining can be applied. (Nov/Dec ‘07)


UNIT-II BUSINESS ANALYSIS

1. What is the need for preprocessing the data? (Nov/Dec 2007)

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history or modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

2. What is parallel mining of concept description? (Nov/Dec 2007) (OR) What is concept description? (Apr/May 2008)

Data can be associated with classes or concepts. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via
(1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or
(2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination.

3. What is dimensionality reduction? (Apr/May 2008)

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If only an approximation of the original data can be reconstructed, the data reduction is called lossy.

4. Mention the various tasks to be accomplished as part of data pre-processing. (Nov/Dec 2008)

1. Data cleaning
2. Data integration
3. Data transformation
4. Data reduction

5. What is data cleaning? (May/June 2009)

Data cleaning means removing inconsistent data or noise and collecting the necessary information from a collection of interrelated data.

6. Define data mining. (Nov/Dec 2008)

Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data.”

7. What are the types of concept hierarchies? (Nov/Dec 2009)

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Concept hierarchies allow specialization, or drilling down, whereby concept values are replaced by lower-level concepts.

8. List the three important issues that have to be addressed during data integration. (May/June 2009) (OR) List the issues to be considered during data integration. (May/June 2010)

There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.

Redundancy is another important issue. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.

A third important issue in data integration is the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one system and British imperial units in another.
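The weight example above can be sketched as follows. The record layout, product keys, tolerance, and the choice to average agreeing values are all assumptions for illustration; the sketch also assumes the entity identification problem is already solved, since both systems share a product key.

```python
LB_TO_KG = 0.45359237  # one avoirdupois pound in kilograms (exact by definition)

# Hypothetical records for the same products held in two source systems.
system_a = {"P100": {"weight": 2.0, "unit": "kg"}}
system_b = {"P100": {"weight": 4.41, "unit": "lb"},
            "P200": {"weight": 10.0, "unit": "lb"}}

def to_kg(record):
    """Convert a weight record to kilograms so values become comparable."""
    factor = {"kg": 1.0, "lb": LB_TO_KG}[record["unit"]]
    return record["weight"] * factor

def integrate(a, b, tolerance=0.01):
    """Merge by product key; list keys whose values still disagree after conversion."""
    merged, conflicts = {}, []
    for key in sorted(a.keys() | b.keys()):
        weights = [to_kg(src[key]) for src in (a, b) if key in src]
        if max(weights) - min(weights) > tolerance:
            conflicts.append(key)  # a genuine data value conflict to resolve by hand
        merged[key] = round(sum(weights) / len(weights), 3)
    return merged, conflicts

merged, conflicts = integrate(system_a, system_b)
print(merged)
print(conflicts)
```

Here the apparent disagreement for P100 (2.0 versus 4.41) disappears once both values are expressed in the same unit, which is the point of resolving representation and scaling differences before comparing.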

9. Write the strategies for data reduction. (May/June 2010)

1. Data cube aggregation
2. Attribute subset selection
3. Dimensionality reduction
4. Numerosity reduction
5. Discretization and concept hierarchy generation

10. Why is it important to have data mining query language? (May/June 2010)

The design of an effective data mining query language requires a deep understanding of the power, limitations, and underlying mechanisms of the various kinds of data mining tasks.

A data mining query language can be used to specify data mining tasks. An example is DMQL, an SQL-based data mining query language in which data warehouses and data marts can also be defined.

11. List the five primitives for specifying a data mining task. (Nov/Dec 2010)

1. The set of task-relevant data to be mined
2. The kind of knowledge to be mined
3. The background knowledge to be used in the discovery process
4. The interestingness measures and thresholds for pattern evaluation
5. The expected representation for visualizing the discovered patterns

12. What is data generalization? (Nov/Dec 2010)

It is a process that abstracts a large set of task-relevant data in a database from relatively low conceptual levels to higher conceptual levels. There are two approaches for generalization:
1) Data cube approach
2) Attribute-oriented induction approach

13. How are concept hierarchies useful in data mining? (Nov/Dec 2010)

A concept hierarchy for a given numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior). Although detail is lost by such data generalization, the generalized data may be more meaningful and easier to interpret.
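The age example can be sketched in a few lines. The answer names the higher-level concepts but not where the boundaries fall, so the cut points below are assumptions, as are the sample ages and the function name.

```python
from collections import Counter

# Hypothetical cut points (inclusive ranges); the boundaries are illustrative only.
AGE_HIERARCHY = [(0, 29, "youth"), (30, 59, "middle-aged"), (60, 150, "senior")]

def generalize_age(age):
    """Replace a numeric age with its higher-level concept."""
    for low, high, concept in AGE_HIERARCHY:
        if low <= age <= high:
            return concept
    raise ValueError(f"age out of range: {age}")

ages = [23, 35, 41, 67, 19, 58, 72]
generalized = [generalize_age(a) for a in ages]
summary = Counter(generalized)

print(generalized)
print(summary)
```

Seven distinct numeric values collapse into three concepts, which is the data reduction the answer describes: less detail, but a summary that is easier to interpret.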

14. How do you clean the data? (Nov/Dec 2011)

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

For missing values:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value

For noisy data:
1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
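Two of the techniques listed above are sketched below: filling a missing value with the attribute mean (missing-value option 4) and smoothing by bin means (one form of binning). The function names, the use of `None` for a missing value, and the sample data are assumptions for illustration.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

filled = fill_with_mean([10, None, 20, 30])
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)

print(filled)
print(smoothed)
```

Each value is pulled toward its neighbors in the sorted order, so local noise is reduced; larger bins smooth more aggressively.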

15. What is the need for a GUI? (Nov/Dec 2011)

Commercial tools can assist in the data transformation step. Data migration tools allow simple transformations to be specified, such as replacing the string “gender” by “sex”. ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI). These tools typically support only a restricted set of transforms, so we may often also choose to write custom scripts for this step of the data cleaning process.

PART B

1. Describe the various reporting and query tools and applications.

2. What are the differences between the three main types of data usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining. (Apr/May ‘08)

3. Explain the different types of OLAP tools. (May/June 14)

4. Write the difference between multi-dimensional OLAP and multi-relational OLAP. (May/June 14)

5. Explain Cognos Impromptu in detail. (Apr/May 15)
