Main Part PDF Con

CHAPTER-1

INTRODUCTION

1.1 Database

A database is an organized collection of data. The data is typically

organized to model relevant aspects of reality (for example, the availability of

rooms in hotels), in a way that supports processes requiring this information

(for example, finding a hotel with vacancies). Traditional databases are

organized by fields, records, and files. A field is a single piece of information;

a record is one complete set of fields; and a file is a collection of records. For

example, a telephone book is analogous to a file. It contains a list of records,

each of which consists of three fields: name, address, and telephone number.

1.2 Database Management System

To access information from a database, we need a database management

system (DBMS). This is a collection of programs that enable us to enter,

organize, and select data in a database.

1.3 Data Warehouse

A data warehouse is a relational database that is designed for query and

analysis rather than for transaction processing. It usually contains historical

data derived from transaction data, but it can include data from other sources. It

separates analysis workload from transaction workload and enables an

organization to consolidate data from several sources. Its distinctive data

structure allows complex queries over large amounts of data to be performed

relatively quickly and easily.

1.4 Data Mining

Generally, data mining (sometimes called data or knowledge discovery)

is the process of analyzing data from different perspectives and summarizing it

into useful information - information that can be used to increase speed or cut

costs. Data mining software is one of a number of analytical tools for analyzing

data. It allows users to analyze data from many different dimensions or angles,

categorize it, and summarize the relationships identified. Technically, data

mining is the process of finding correlations or patterns among dozens of fields

in large relational databases.

Fig - 1.1 Data mining of finger print converted into digital data.

Fig - 1.2 Data mining from data warehouse.

Example :

One Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays.

1.5 Dataset

A dataset (or data set) is a collection of data, usually presented in tabular

form. Each column represents a particular variable. Each row corresponds to a

given member of the dataset in question. It lists values for each of the variables,

such as height and weight of an object. Each value is known as a datum. The

dataset may comprise data for one or more members, corresponding to the

number of rows.

1.6 Structured Query Language (SQL)

SQL, which is an abbreviation for Structured Query Language, is a

language to request data from a database, to add, update, or remove data within

a database, or to manipulate the metadata of the database.

SQL is a declarative language in which the expected result or operation

is given without the specific details about how to accomplish the task. The

steps required to execute SQL statements are handled transparently by the SQL

database. Sometimes SQL is characterized as non-procedural because

procedural languages generally require the details of the operations to be

specified, such as opening and closing tables, loading and searching indexes, or

flushing buffers and writing data to file systems. Therefore, SQL is considered

to be designed at a higher conceptual level of operation than procedural

languages because the lower level logical and physical operations aren't

specified and are determined by the SQL engine or server process that executes

it.
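As a small illustration of the declarative style, the sketch below uses Python's built-in sqlite3 module with a hypothetical phone_book table. The table name, its columns, and the two rows are assumptions made for this example, echoing the telephone book of Section 1.1. The query states only which result is wanted; how the table is scanned and filtered is left to the SQL engine.

```python
import sqlite3

# In-memory database with a hypothetical phone_book table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phone_book (name TEXT, address TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO phone_book VALUES (?, ?, ?)",
    [
        ("Alice", "12 Lake Road", "555-0101"),
        ("Bob", "7 Hill Street", "555-0102"),
    ],
)

# Declarative request: no loops, index handling, or file access are spelled out.
rows = conn.execute(
    "SELECT name, phone FROM phone_book WHERE address LIKE ? ORDER BY name",
    ("%Lake%",),
).fetchall()
print(rows)
```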

1.7 Vertical Aggregation

The essential idea is to allow relevant sites to be overlaid on top of each

other by the end user to create a complete view of the information they are

looking for. It arranges the data set from the database vertically with respect to

the query (for example, a GROUP BY clause in SQL). In a relational database

system, aggregations are generally arranged vertically.

1.8 Horizontal Aggregation

Here we introduce a new class of aggregations that have similar behavior to

SQL standard aggregations, but which produce tables with a horizontal layout.

In contrast, we call standard SQL aggregations vertical aggregations since they

produce tables with a vertical layout. Horizontal aggregations just require a

small syntax extension to aggregate functions called in a SELECT statement.

Alternatively, horizontal aggregations can be used to generate SQL code from a

data mining tool to build data sets for data mining analysis. We start by

explaining how to automatically generate SQL code.

CHAPTER-2

Review Literature

2.1 Analysis of the literature

Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns[1]. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets where a horizontal layout is required. Simple, yet powerful, methods have been proposed to generate SQL code that returns aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations[2]. Horizontal aggregations build data sets with a horizontal de-normalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. There are three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. The CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.

2.2 Explanation of the F, FV, and FH Tables

2.2.1 F(Original Table) :

This table contains the data to be aggregated, first vertically and then horizontally. It can contain null values but must not contain blob (binary large object) data.

K    D1   D2   A
1    3    X    9
2    2    Y    6
3    1    Y    10
4    1    Y    0
5    2    X    1
6    1    X    null
7    3    X    8
8    2    X    7

Table 2.1 Original Data Table

2.2.2 FV (Vertical Aggregated Table) :

The essential idea is to allow relevant sites to be overlaid on top of each other by the end user to create a complete view of the information they are looking for. It arranges the data set from the database vertically with respect to the query (for example, a GROUP BY clause in SQL). In a relational database system, aggregations are generally arranged vertically. Table 2.2 is the result of grouping F by D1 and D2 and summing A.

D1   D2   A
1    X    null
1    Y    10
2    X    8
2    Y    6
3    X    17

Table 2.2 Vertical Table
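The step from Table 2.1 to Table 2.2 can be sketched with Python's built-in sqlite3 module. The choice of SQLite and the column types are assumptions made for this illustration; the rows are those of Table 2.1, with None standing for null. The sketch relies on SUM() ignoring null inputs and returning null when a group has no non-null value.

```python
import sqlite3

# Rows of Table 2.1 as (K, D1, D2, A); None stands for null.
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (K INTEGER PRIMARY KEY, D1 INTEGER, D2 TEXT, A INTEGER)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", F_ROWS)

# Vertical aggregation: one output row per (D1, D2) group.
fv = conn.execute(
    "SELECT D1, D2, SUM(A) FROM F GROUP BY D1, D2 ORDER BY D1, D2"
).fetchall()
for row in fv:
    print(row)
```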

2.2.3 FH (Horizontal Aggregated Table) :

Here we introduce a new class of aggregations that have similar behavior to SQL standard aggregations, but which produce tables with a horizontal layout. In contrast, we call standard SQL aggregations vertical aggregations since they produce tables with a vertical layout. Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis. We start by explaining how to automatically generate SQL code.

D1   D2X    D2Y
1    null   10
2    8      6
3    17     null

Table 2.3 Horizontal Table

Fig- 2.1 Main steps of methods based on F (un-optimized): select distinct R1, . . . , Rk; then SPJ (d left joins), CASE (d sum(case) terms), or PIVOT (d pivoting values); compute FH.

Fig- 2.2 Main steps of methods based on FV (optimized): compute FV; select distinct R1, . . . , Rk; then SPJ (d left joins), CASE (d sum(case) terms), or PIVOT (d pivoting values); compute FH.

2.3 SPJ method

The SPJ method is interesting from a theoretical point of view because it is based on relational operators only. The basic idea is to create one table with a vertical aggregation for each result column, and then join all those tables to produce FH. We aggregate from F into d projected tables with d Select-Project-Join-Aggregation queries (selection, projection, join, aggregation). Each table FI corresponds to one subgrouping combination and has {L1, . . . , Lj} as primary key and an aggregation on A as the only nonkey column. It is necessary to introduce an additional table F0, that will be outer joined with projected tables to get a complete result set. We propose two basic substrategies to compute FH. The first one directly aggregates from F. The second one computes the equivalent vertical aggregation in a temporary table FV grouping by {L1, . . . , Lj, R1, . . . , Rk}. Then horizontal aggregations can be instead computed from FV, which is a compressed version of F, since standard aggregations are distributive [9].

We now introduce the indirect aggregation based on the intermediate table FV, that will be used for both the SPJ and the CASE method. Let FV be a table containing the vertical aggregation, based on {L1, . . . , Lj} and {R1, . . . , Rk}. Let V() represent the corresponding vertical aggregation for H(). The statement to compute FV gets a cube:

INSERT INTO FV
SELECT L1, . . . , Lj, R1, . . . , Rk, V(A)
FROM F
GROUP BY L1, . . . , Lj, R1, . . . , Rk;

Then each table FI aggregates only those rows that correspond to the Ith unique combination of R1, . . . , Rk, given by the WHERE clause. A possible optimization is synchronizing table scans to compute the d tables in one pass. Finally, to get FH we need d left outer joins with the d + 1 tables so that all individual aggregations are properly assembled as a set of d dimensions for each group. Outer joins set result columns to null for missing combinations for the given group. In general, nulls should be the default value for groups with missing combinations. We believe it would be incorrect to set the result to zero or some other number by default if there are no qualifying rows. Such an approach should be considered on a per-case basis.

INSERT INTO FH
SELECT
F0.L1, F0.L2, . . . , F0.Lj,
F1.A, F2.A, . . . , Fd.A
FROM F0
LEFT OUTER JOIN F1
ON F0.L1 = F1.L1 and . . . and F0.Lj = F1.Lj
LEFT OUTER JOIN F2
ON F0.L1 = F2.L1 and . . . and F0.Lj = F2.Lj
. . .
LEFT OUTER JOIN Fd
ON F0.L1 = Fd.L1 and . . . and F0.Lj = Fd.Lj;

This statement may look complex, but it is easy to see that each left

outer join is based on the same columns L1, . . . , Lj. To avoid ambiguity in

column references, L1, . . . , Lj are qualified with F0. Result column I is

qualified with table FI . Since F0 has n rows each left outer join produces a

partial table with n rows and one additional column. Then at the end, FH will

have n rows and d aggregation columns. The statement above is equivalent to

an update-based strategy. Table FH can be initialized by inserting n rows with key

L1, . . . , Lj and nulls on the d dimension aggregation columns. Then FH is

iteratively updated from FI joining on L1, . . . ,Lj. This strategy basically incurs

twice the I/O by doing updates instead of insertion. Reordering the d projected tables

to join cannot accelerate processing because each partial table has n rows.

Another claim is that it is not possible to correctly compute horizontal

aggregations without using outer joins. In other words, natural joins would

produce an incomplete result set.
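A minimal sketch of the SPJ strategy on the sample table F of Section 2.2, written with Python's built-in sqlite3 module. It assumes j = 1 and k = 1, with D1 playing the role of L1 and D2 the role of R1; the function name spj_horizontal and the use of SQLite are choices made for this illustration and are not taken from the literature. Table F0 holds the distinct group keys, each FI holds the vertical aggregation for one D2 value, and the left outer joins assemble FH with nulls for missing combinations.

```python
import sqlite3

# Rows of Table 2.1 as (K, D1, D2, A); None stands for null.
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def spj_horizontal(conn):
    """Build FH from F with one projected table per D2 value and left outer joins."""
    # F0 holds every group key, so groups with missing combinations survive the joins.
    conn.execute("CREATE TABLE F0 AS SELECT DISTINCT D1 FROM F")
    values = [row[0] for row in conn.execute("SELECT DISTINCT D2 FROM F ORDER BY D2")]
    select_cols, joins = ["F0.D1"], []
    for i, value in enumerate(values, start=1):
        # FI: vertical aggregation restricted to the Ith value of the subgrouping column.
        conn.execute(f"CREATE TABLE F{i} (D1 INTEGER PRIMARY KEY, A INTEGER)")
        conn.execute(
            f"INSERT INTO F{i} SELECT D1, SUM(A) FROM F WHERE D2 = ? GROUP BY D1",
            (value,),
        )
        select_cols.append(f"F{i}.A")
        joins.append(f"LEFT OUTER JOIN F{i} ON F0.D1 = F{i}.D1")
    sql = f"SELECT {', '.join(select_cols)} FROM F0 {' '.join(joins)} ORDER BY F0.D1"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (K INTEGER PRIMARY KEY, D1 INTEGER, D2 TEXT, A INTEGER)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", F_ROWS)
fh = spj_horizontal(conn)
for row in fh:
    print(row)
```

With an inner join in place of the left outer joins, the row for D1 = 3 would be dropped because no F2 row exists for it, which is the incomplete result set the text warns about.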

2.4 CASE Method

This method uses the “case” programming construct available in SQL. The case statement returns a value selected from a set of values based on boolean expressions. From a relational database theory point of view this is equivalent to doing a simple projection/aggregation query where each nonkey value is given by a function that returns a number based on some conjunction of conditions. Two basic substrategies are proposed to compute FH. In a similar manner to SPJ, the first one directly aggregates from F and the second one computes the vertical aggregation in a temporary table FV and then horizontal aggregations are indirectly computed from FV.

We now present the direct aggregation method. Horizontal aggregation queries can be evaluated by directly aggregating from F and transposing rows at the same time to produce FH. First, we need to get the unique combinations of R1, . . . , Rk that define the matching Boolean expression for result columns. The SQL code to compute horizontal aggregations directly from F is as follows; observe that V() is a standard (vertical) SQL aggregation that has a “case” statement as argument. Horizontal aggregations need to set the result to null when there are no qualifying rows for the specific horizontal group to be consistent with the SPJ method and also with the extended relational model [4].

SELECT DISTINCT R1, . . . , Rk
FROM F;

INSERT INTO FH
SELECT L1, . . . , Lj
,V(CASE WHEN R1 = v11 and . . . and Rk = vk1
THEN A ELSE null END)
. . .
,V(CASE WHEN R1 = v1d and . . . and Rk = vkd
THEN A ELSE null END)
FROM F
GROUP BY L1, L2, . . . , Lj;

This statement computes aggregations in only one scan on F. The main difficulty is that there must be a feedback process to produce the “case” boolean expressions. We now consider an optimized version using FV. Based on FV, we need to transpose rows to get groups based on L1, . . . , Lj. Query evaluation needs to combine the desired aggregation with “CASE” statements for each distinct combination of values of R1, . . . , Rk. As explained above, horizontal aggregations must set the result to null when there are no qualifying rows for the specific horizontal group. The boolean expression for each case statement has a conjunction of k equality comparisons. The following statements compute FH:

SELECT DISTINCT R1,. . .,Rk

FROM FV ;

INSERT INTO FH

SELECT L1,..,Lj

,sum(CASE WHEN R1 = v11 and .. and Rk = vk1

THEN A ELSE null END)

......

,sum(CASE WHEN R1 = v1d and .. and Rk = vkd

THEN A ELSE null END)

FROM FV

GROUP BY L1, L2, . . . , Lj;

As can be seen, the code is similar to the code presented before, the main

difference being that we have a call to sum() in each term, which preserves

whatever values were previously computed by the vertical aggregation. It has

the disadvantage of using two tables instead of one as required by the direct

computation from F. For very large tables F, computing FV first may be more

efficient than computing directly from F.
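The optimized CASE strategy can be sketched on the sample table F of Section 2.2 with Python's built-in sqlite3 module. The sketch assumes j = 1 and k = 1 (D1 as L1, D2 as R1); the function name case_horizontal and the use of SQLite are assumptions made for this illustration. The SELECT DISTINCT step is the feedback process: its result determines how many sum(CASE ...) terms are generated.

```python
import sqlite3

# Rows of Table 2.1 as (K, D1, D2, A); None stands for null.
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def case_horizontal(conn):
    """Compute FV first, then FH with one generated SUM(CASE ...) term per D2 value."""
    # Vertical aggregation first: FV is a compressed version of F.
    conn.execute("CREATE TABLE FV AS SELECT D1, D2, SUM(A) AS A FROM F GROUP BY D1, D2")
    # Feedback step: the distinct values decide how many CASE terms are generated.
    values = [row[0] for row in conn.execute("SELECT DISTINCT D2 FROM FV ORDER BY D2")]
    terms = ", ".join("SUM(CASE WHEN D2 = ? THEN A ELSE NULL END)" for _ in values)
    sql = f"SELECT D1, {terms} FROM FV GROUP BY D1 ORDER BY D1"
    return values, conn.execute(sql, values).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (K INTEGER PRIMARY KEY, D1 INTEGER, D2 TEXT, A INTEGER)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", F_ROWS)
values, fh = case_horizontal(conn)
print(values)
for row in fh:
    print(row)
```

The result agrees with Table 2.3: the ELSE null branch keeps groups without qualifying rows at null instead of zero.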

2.5 PIVOT Method

Here we use the PIVOT operator, which is a built-in operator in a commercial DBMS. Since this operator can perform transposition, it can help to evaluate horizontal aggregations. The PIVOT method internally needs to determine how many columns are needed to store the transposed table and it can be combined with the GROUP BY clause. The basic syntax to exploit the PIVOT operator to compute a horizontal aggregation, assuming one BY column for the right key columns (i.e., k = 1), is as follows:

SELECT DISTINCT R1
FROM F;

SELECT L1, L2, . . . , Lj
,v1, v2, . . . , vd
INTO Ft
FROM F
PIVOT(
V(A) FOR R1 in (v1, v2, . . . , vd)
) AS P;

SELECT L1, L2, . . . , Lj
,V(v1), V(v2), . . . , V(vd)
INTO FH
FROM Ft
GROUP BY L1, L2, . . . , Lj;

This set of queries may be inefficient because Ft can be a large intermediate

table. We introduce the following optimized set of queries, which reduces the size of the

intermediate table:

SELECT DISTINCT R1

FROM F; /* produces v1, . . . , vd */

SELECT

L1, L2, . . . ,Lj

,v1, v2, . . . , vd

INTO FH

FROM (

SELECT L1, L2, . . . ,Lj, R1, A

FROM F) Ft

PIVOT(

V (A) FOR R1 in (v1, v2, . . . , vd)

) AS P;

Notice that in the optimized query the nested query trims F from

columns that are not later needed. That is, the nested query projects only

those columns that will participate in FH. Also, the first and second

queries can be computed from FV .

CHAPTER-3

Problem Structure Analysis

3.1 Problems in the literature

3.1.1 Problem 1 :

The number of columns may exceed the number of columns allowed by the DBMS[1]. This means reaching the maximum number of columns in one table, or reaching the maximum column name length when columns are automatically named.

To elaborate on this, a horizontal aggregation can return a table that

goes beyond the maximum number of columns in the DBMS when the set of

columns {R1,. . .,Rk} has a large number of distinct combinations of values, or

when there are multiple horizontal aggregations in the same query.

3.1.2 Problem 2 :

It is impossible to aggregate when data fields hold images or files (such as blob data). If image data were converted to a column or attribute name, the name would exceed the column name length defined by the DBMS.

The underlying issue is automatically generating unique column names. If there are many subgrouping columns {R1, . . . , Rk} or columns are of string data types, this may lead to generating very long column names, which may exceed DBMS limits. However, these are not important limitations, because many dimensions are likely to correspond to a sparse matrix (having many zeroes or nulls) on which it will be difficult or impossible to compute a data mining model. On the other hand, the large column name length can be solved as explained below.

The problem of d going beyond the maximum number of columns can

be solved by vertically partitioning FH so that each partition table does not

exceed the maximum number of columns allowed by the DBMS. Evidently,

each partition table must have {L1,. . . , Lj } as its primary key. Alternatively,

the column name length issue can be solved by generating column identifiers

with integers and creating a “dimension” description table that maps identifiers

to full descriptions, but the meaning of each dimension is lost. An alternative is

the use of abbreviations, which may require manual input.
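The integer-identifier idea can be sketched as follows. The function name build_dimension_table, the "c" prefix for identifiers, and the dictionary layout are hypothetical choices made for this illustration; the input is one list of (column, value) pairs per distinct combination of the subgrouping columns.

```python
def build_dimension_table(combinations):
    """Map short integer-based column identifiers to full dimension descriptions."""
    dimension_table = {}
    for index, combo in enumerate(combinations, start=1):
        # The identifier stays short no matter how long the values are.
        description = " and ".join(f"{column}={value}" for column, value in combo)
        dimension_table[f"c{index}"] = description
    return dimension_table


# Two combinations of a single subgrouping column D2, as in Table 2.3.
dimension_table = build_dimension_table([[("D2", "X")], [("D2", "Y")]])
print(dimension_table)
```

The horizontal table would then use c1, c2, and so on as column names, and the description table would be consulted to recover what each column means.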

3.2 Introduction to Split-SPJ

When the number of columns exceeds the number allowed by the DBMS, the SPJ method cannot be applied. The Split-SPJ method instead creates another table whenever the DBMS column limit is exceeded. As long as the column limit is not exceeded, Split-SPJ has all the properties of SPJ.

Column limits of different database systems :

Database                Maximum Permitted Columns
Microsoft Access        255
Microsoft SQL Server    1024
MySQL                   4096
Oracle                  1000 by default, but it can be increased by command

Table 3.1 Maximum permitted columns in different databases

As the table shows, the lowest column limit is 255 (Microsoft Access). We therefore split the columns sequentially at every 255th column.

Example :

If the vertical attributes of a table are :

ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, . . . , VA255, VA256, VA257, . . . , VA270, VA271, VA272, VA273

(It is impossible to aggregate these with the SPJ method.)

The output of the Split-SPJ method :

Table-1

ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, . . . , VA255

Table-2

ID, VA256, VA257, . . . , VA270, VA271, VA272, VA273
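The split in this example can be sketched in a few lines of Python. The function name split_columns and its parameters are hypothetical; the sketch follows the example above in repeating the key column ID in every table and placing at most 255 aggregated columns after it.

```python
def split_columns(key, columns, limit=255):
    """Split columns into consecutive groups of at most `limit`, each prefixed by the key."""
    return [
        [key] + columns[start:start + limit]
        for start in range(0, len(columns), limit)
    ]


columns = [f"VA{i}" for i in range(1, 274)]  # VA1 .. VA273
tables = split_columns("ID", columns)
for number, table in enumerate(tables, start=1):
    print(f"Table-{number}: {table[0]}, {table[1]}, ..., {table[-1]}")
```

If the key column has to count toward the DBMS limit as well, passing limit - 1 keeps every table within it.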

CHAPTER-4

Experimental Description

4.1 Experimental data of this system

For the simulation we use data modeled on a currently popular social networking system. Facebook handles a large number of pictures within a few seconds. We consider four users, named user1, user2, user3, and user4, and 25 pictures, named pic1 to pic25. Any user can comment on any picture at random, with a comment of any length in characters. We use the horizontal aggregation concept to find the total number of characters that each user commented on each picture. If a user did not comment on a picture, the field is set to NULL.

Following the process in the previous literature, each user is aggregated against each picture. To simulate the proposed system, we assume that the column limit of the database is 20. The columns are therefore split at 20, and the next 5 columns form a new table. This was not possible in the previous work.

The time complexity of the proposed system is the same as that of the previous SPJ method, but the proposed system is able to show the full horizontal aggregation. If we index the columns by picture number, the character lengths commented by all users for pic1 to pic20 are shown in the first table, and those for the last five pictures are shown in the next table.

Fig- 4.1 Experimental data (Original data table)

Fig- 4.2 Experimental data (Vertical table)

Fig- 4.3 Experimental data (Horizontal table)

4.2 Figure of Split-SPJ horizontal aggregation

Fig- 4.4 Split-SPJ horizontal aggregation

4.3 Comparison of SPJ with Split-SPJ

When aggregated column < 255

Fig 4.3.1 : SPJ curve when number of column is 100. [Plot: Time (ms), 0 to 5, against No. of Column, 0 to 100.]

Fig 4.3.2 : Split-SPJ curve when number of column is 100. [Plot: Time (ms), 0 to 5, against No. of Column, 0 to 100.]

When aggregated column > 255

Fig 4.3.3 : SPJ curve when number of column is 360. [Plot: Time (ms), 0 to 5, against No. of Column, 0 to 360, with column 255 marked.]

Fig 4.3.4 : Split-SPJ curve when number of column is 360. [Plot: Time (ms), 0 to 5, against No. of Column, 0 to 360, with column 255 marked.]

4.4 Code for the different methods

4.4.1 Code for vertical aggregation :

using System;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class VerticalAggregationUI : Form
    {
        public VerticalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;

        // Bind the vertically aggregated table to the grid when the form loads.
        private void VerticalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            dgvVerticalAggregation.DataSource = dataManager.GetVerticalTable();
        }
    }
}

// DataManager method called by the form above.
public DataTable GetVerticalTable()
{
    dataExecuteClass = new DataExecuteClass();
    dataSet = new DataSet();
    DataTable dataTable = null;
    // Vertical aggregation: total comment characters per (facebook_id, image_name) group.
    string queryString =
        "SELECT facebook_id, image_name, sum(comments_char) as [SUM] " +
        "from stdinfo group by facebook_id,image_name order by facebook_id,image_name;";
    try
    {
        dataSet = dataExecuteClass.getDataSet(queryString);
        dataTable = dataSet.Tables[0];
        return dataTable;
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
}

4.4.2 Code for horizontal aggregation :

using System;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class HorizontalAggregationUI : Form
    {
        public HorizontalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;

        // Bind the horizontally aggregated table to the grid when the form loads.
        private void HorizontalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            dgvHA.DataSource = dataManager.GetHorizontalTable();
        }
    }
}

// DataManager method called by the form above.
public DataTable GetHorizontalTable()
{
    dataExecuteClass = new DataExecuteClass();
    dataSet = new DataSet();
    DataTable dataTable = null;
    // Read the already computed horizontal table, one row per facebook_id.
    string queryString = "SELECT * from horizontal order by facebook_id;";
    try
    {
        dataSet = dataExecuteClass.getDataSet(queryString);
        dataTable = dataSet.Tables[0];
        return dataTable;
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
}

4.4.3 Main steps of Split-SPJ method based on FV :

Fig- 4.5 Main steps of Split-SPJ method based on FV: compute FV; select distinct R1, . . . , Rk; Split-SPJ (d left joins); compute FH.

From the experimental data table, the vertical aggregated table is produced first and then the horizontal aggregated table. Data can be null but not blob (such as image, file etc.).

4.4.4 The Split-SPJ Algorithm (Proposed Algorithm):

Algorithm 4.1 : Split-SPJ (D, DV, DH, TRV, TCH, TEMP)

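Let D be the experimental data table; from it the vertical aggregated table DV is produced, and then the horizontal aggregated table DH. Data can be null but not blob (such as image, file etc.). The variables TRV and TCH denote, respectively, the total rows of DV and the total columns of DH. TEMP holds an intermediate result, and COUNTER holds the number of columns not yet placed in a table.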

1. [Create vertical aggregated table from experimental table.]

TEMP =: SELECT(D).

2. [Assigning vertical data.]

DV =: TEMP.

3. [Create horizontal aggregated table from vertical aggregated table.]

TEMP =: SELECT(DV).

4. [Assigning horizontal data.]

DH =: TEMP.

5. [Count column of horizontal data table.]

COUNTER =: COUNT(DH).

6. [Check condition.]

If COUNTER > 255 then :

Create table using 255 column.

COUNTER =: COUNTER – 255.

GoTo step 6.

Else :

Create table using total column.

End If

7. Exit.
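Steps 5 to 7 of Algorithm 4.1 can be transcribed as the short Python sketch below. The function name split_spj_table_sizes and the idea of returning the list of table widths are assumptions made for this illustration; the limit is a parameter so that both the 255-column case and the 20-column simulation of Section 4.1 can be tried.

```python
def split_spj_table_sizes(total_columns, limit=255):
    """Return the number of columns placed in each table, following step 6."""
    sizes = []
    counter = total_columns
    while counter > limit:
        sizes.append(limit)   # create a table using `limit` columns
        counter -= limit      # COUNTER =: COUNTER - limit, then check again
    sizes.append(counter)     # create the last table using the remaining columns
    return sizes


print(split_spj_table_sizes(273))            # the example of Section 3.2
print(split_spj_table_sizes(25, limit=20))   # the simulation of Section 4.1
```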

4.4.5 Code for Split-SPJ horizontal aggregation :

For Oracle :

SELECT
(SELECT column_name FROM user_tab_columns WHERE
table_name like 'table_name' and rownum = 255)
FROM (SELECT * FROM F0
LEFT OUTER JOIN F1
ON F0.L1 = F1.L1 and . . . and F0.Lj = F1.Lj
LEFT OUTER JOIN F2
ON F0.L1 = F2.L1 and . . . and F0.Lj = F2.Lj
. . . . . . . . .
LEFT OUTER JOIN Fd
ON F0.L1 = Fd.L1 and . . . and F0.Lj = Fd.Lj)

For C# (Windows Forms) :

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Data.OleDb;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class ProposedHorizontalAggregationUI : Form
    {
        public ProposedHorizontalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;
        private DataExecuteClass dataExecuteClass = null;
        private DataGridView dataGridView = null;

        // Attribute names of the experimental table "stdinfo".
        private string[] attributeName = (new DataManager()).GetAllAttributeOfAtable("stdinfo");

        // Column names of the horizontal table, filled in CrieateHorizantalAgreateTable().
        private string[] col = new string[20];

        private void ProposedHorizontalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            int maxColLength = int.Parse(dataManager.GetMaxColumnLength());

            if (maxColLength == 0)
            {
                // A limit of 0 is treated as "no split": show the whole table in one grid.
                dataGridView = new DataGridView();
                dataGridView.Dock = DockStyle.Top;
                dataGridView.DataSource = CrieateHorizantalAgreateTable();
                this.Controls.Add(dataGridView);
            }
            else
            {
                // Split the horizontal table into grids of at most maxColLength columns.
                int totalColumnLength = attributeName.Count() - 1;
                int fstSkipPoint = 0, lstSkipPoint = 0;
                int numOfDGV = (int)Math.Ceiling((float)totalColumnLength / maxColLength);

                for (int j = 0; j < numOfDGV; j++)
                {
                    // 1-based positions of the first and last column kept in grid j.
                    fstSkipPoint = j * maxColLength + 1;
                    lstSkipPoint = fstSkipPoint + maxColLength - 1;
                    DataTable dataTable = CrieateHorizantalAgreateTable();

                    for (int i = 1; i <= totalColumnLength; i++)
                    {
                        // Keep the columns of the current window and always keep column 1.
                        if ((i >= fstSkipPoint && i <= lstSkipPoint) || i == 1)
                        {
                            continue;
                        }

                        // Remove every other column (guarded against unset names).
                        string column = col[i - 1];
                        if (column != null && dataTable.Columns.Contains(column))
                        {
                            dataTable.Columns.Remove(column);
                        }
                    }

                    dataGridView = new DataGridView();
                    dataGridView.DataSource = dataTable;
                    dataGridView.Dock = DockStyle.Top;
                    this.Controls.Add(dataGridView);
                }
            }
        }

        private DataTable CrieateHorizantalAgreateTable()
        {
            dataManager = new DataManager();
            dataExecuteClass = new DataExecuteClass();
            int i = 0;
            DataRow dr;
            string[] horizontalColumn = dataManager.SelectDistinctRowInaColumn("D2", "stdinfo");
            DataTable horizontalAggrigationTable = new DataTable();

            // Columns of the horizontal table
            string col1 = attributeName[1];
            col[0] = col1;
            string col2 = attributeName[2] + horizontalColumn[0];
            col[1] = col2;
            string col3 = attributeName[2] + horizontalColumn[1];
            col[2] = col3;
            horizontalAggrigationTable.Columns.Add(col1);
            horizontalAggrigationTable.Columns.Add(col2);
            horizontalAggrigationTable.Columns.Add(col3);

            // Create rows of the horizontal table
            string[] data1 = dataManager.SelectDistinctRowInaColumn("D1", "stdinfo"); // Prepare 1st column
            string[] data2 = new string[data1.Count()];
            string[] data3 = new string[data1.Count()];

            // Prepare 2nd column: sum(A) per D1 for D2 = 'x'
            string query = "SELECT [SUM] FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo "
                         + "group by D1,D2 order by D1,D2) WHERE D2='x'";
            using (OleDbDataReader reader = dataExecuteClass.ExecuteReader(query))
            {
                while (reader.Read() && i < data2.Length)
                {
                    data2[i] = reader["SUM"].ToString();
                    i++;
                }
            }

            // Prepare 3rd column: sum(A) per D1 for D2 = 'y'
            query = "SELECT [SUM] FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo "
                  + "group by D1,D2 order by D1,D2) WHERE D2='y'";
            i = 0;
            using (OleDbDataReader reader = dataExecuteClass.ExecuteReader(query))
            {
                while (reader.Read() && i < data3.Length)
                {
                    data3[i] = reader["SUM"].ToString();
                    i++;
                }
            }

            // Assemble one row per distinct D1 value.
            for (i = 0; i < data1.Count(); i++)
            {
                dr = horizontalAggrigationTable.NewRow();
                dr[col1] = data1[i];
                dr[col2] = data2[i];
                dr[col3] = data3[i];
                horizontalAggrigationTable.Rows.Add(dr);
            }

            return horizontalAggrigationTable;
        }
    }
}
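
The listing above fills the second and third columns by position, which assumes that every D1 value has a row for each D2 value. The following Python sketch shows one possible alternative that looks the sums up by the pair (D1, D2), so that a missing combination simply yields an empty cell. It is an illustration only: the function name pivot_sum, the in-memory list of (D1, D2, A) tuples and the column names A_<value> are assumptions and are not part of the listing.

```python
def pivot_sum(rows):
    """Build a horizontal table from (d1, d2, a) rows.

    First computes the vertical aggregation sum(a) per (d1, d2), then lays the
    sums out with one row per d1 and one column per distinct d2.
    """
    sums = {}
    d1_values, d2_values = set(), set()
    for d1, d2, a in rows:
        d1_values.add(d1)
        d2_values.add(d2)
        if a is not None:  # like SQL SUM, ignore null measures
            sums[(d1, d2)] = sums.get((d1, d2), 0) + a

    d2_sorted = sorted(d2_values)
    header = ["D1"] + [f"A_{d2}" for d2 in d2_sorted]
    # dict.get returns None for a (d1, d2) pair that never occurred.
    table = [[d1] + [sums.get((d1, d2)) for d2 in d2_sorted]
             for d1 in sorted(d1_values)]
    return header, table


if __name__ == "__main__":
    data = [(1, "x", 10), (1, "y", 5), (1, "x", 2), (2, "y", 7), (3, "x", 4)]
    header, table = pivot_sum(data)
    print(header)
    for row in table:
        print(row)
```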


CHAPTER-5

Conclusion and Future Research

5.1 Conclusion

We introduced a new method to extend aggregate functions, called Split-SPJ horizontal aggregation, which helps prepare data sets for data mining. Specifically, the method is useful for creating data sets with a horizontal layout, as commonly required by data mining algorithms. A horizontal aggregation returns a set of numbers instead of a single number for each group, resembling a multidimensional vector. We proposed an abstract but minimal extension to standard SQL aggregate functions to compute horizontal aggregations; the extension splits the resulting data set wherever it reaches the column limit of the underlying database. From a query optimization perspective, we used query evaluation methods.

5.2 Future Research Work

We need to understand whether the Split-SPJ method of horizontal aggregation can be applied to holistic functions (e.g., rank()). Optimizing a workload of horizontal aggregation queries is another challenging problem.

If the length of an aggregate object exceeds the column length allowed by the underlying database, an error occurs, which may be overcome by using an alias method. It also remains very complex to aggregate when data fields contain images or files (blob data).

