CHAPTER 2

Literature Survey

2.1 Association Rules

This section introduces the basic concepts of pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational databases.

Association rule mining can be defined formally as follows. Let X and Y be sets of items. An association rule has the form

X → Y, where X ∩ Y = Ø.

X is called the antecedent and Y the consequent of the rule; X and Y are each a set of items, called an itemset or a pattern.

Let frequency(X) be the number of rows (transactions) containing the itemset X in the given database D. The support of an itemset X is defined as the fraction of all rows containing the itemset, i.e., support(X) = frequency(X)/|D|.

The support of an association rule is the support of the union of X and Y [8], i.e.,

Support(X → Y) = support(X ∪ Y) = frequency(X ∪ Y)/|D|

The confidence of an association rule is defined as the percentage of rows in D containing itemset X that also contain itemset Y, i.e.,

Confidence(X → Y) = P(Y|X) = support(X ∪ Y)/support(X)
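To make these definitions concrete, the following minimal Python sketch (illustrative only; the transactions shown are hypothetical and not taken from any cited source) computes the support and confidence of a rule over a small transactional database:

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X), an estimate of P(Y|X)."""
    return support(X | Y, transactions) / support(X, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]
X, Y = {"A"}, {"C"}
print(support(X | Y, transactions))    # support(A -> C) = 3/4 = 0.75
print(confidence(X, Y, transactions))  # confidence(A -> C) = 0.75/0.75 = 1.0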

Association rule mining is conventionally decomposed into two steps: first, finding all frequent (large) itemsets, and second, generating association rules from them. The overall performance of mining association rules is determined primarily by the first step; the second step is easy, since once the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. The main focus of this thesis is the first step, i.e., the extraction of frequent itemsets [9].

Fig. 2.1 Generating association rules

2.1.1 Association Rule Description

Mining association rules is one of the main topics of current data mining research, with particular emphasis on finding the relationships between different items in a database.

Table 2.1 Basic Concepts of Association Rules

Name                  Explanation                                        Formula
Confidence            Probability that Y appears given that X appears    P(Y|X)
Support               Probability that X and Y appear simultaneously     P(X ∪ Y)
Expected confidence   Probability that Y appears                         P(Y)
Lift                  Ratio of confidence to expected confidence         P(Y|X)/P(Y)
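Continuing the sketch above (same hypothetical data), lift follows directly from these measures; a lift of 1 indicates that X and Y occur independently, while values above 1 suggest a positive correlation:

def lift(X, Y, transactions):
    """confidence(X -> Y) / support(Y), i.e. P(Y|X)/P(Y)."""
    return confidence(X, Y, transactions) / support(Y, transactions)

print(lift({"A"}, {"C"}, transactions))  # 1.0 / 1.0 = 1.0: A and C look independent here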

2.1.2 Classification of Association Rules

There are many kinds of association rules; they can be classified in various ways, based on the following criteria:

· Based on the types of values handled in the rule, rules can be divided into Boolean association rules and quantitative association rules.

· Based on the levels of abstraction involved in the rule set, rules can be divided into single-level association rules and multi-level association rules.

· Based on the dimensions of data involved in the rule set, rules can be divided into single-dimensional association rules and multi-dimensional association rules.

2.2 Apriori Algorithm

2.2.1 Introduction to the Apriori Algorithm

Apriori is a classic and influential algorithm for mining frequent itemsets for Boolean association rules [11], and it remains a standard data mining method for association rule discovery.

Apriori uses an iterative, layer-by-layer search in which frequent k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found and denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which in turn is used to find L3, and so on, until no more frequent k-itemsets can be found [13]. Finally, the association rules are derived from the frequent itemsets.

Finding Lk from Lk-1 consists of a two-step process comprising join and prune actions, as follows [14]:

· The join step: Join Lk-1 with itself, merging pairs of (k-1)-itemsets that share all but their last item, to generate the set of candidate k-itemsets. This set of candidates is denoted Ck, where Lk ⊆ Ck (see the sketch after this list).

· The prune step: If any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either and is deleted from Ck. The database is then scanned to determine the count of each remaining candidate in Ck; any candidate whose count is less than the minimum support count is discarded. The surviving itemsets form the frequent k-itemsets, denoted Lk.
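A compact sketch of this join-and-prune candidate generation is given below (an illustrative Python rendering, not the original authors' code; itemsets are represented as sorted tuples):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev."""
    L_prev = sorted(L_prev)
    candidates = set()
    for a, b in combinations(L_prev, 2):
        # Join step: merge two (k-1)-itemsets sharing the same (k-2)-item prefix.
        if a[:k - 2] == b[:k - 2]:
            candidates.add(tuple(sorted(set(a) | set(b))))
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    frequent = set(L_prev)
    return {c for c in candidates
            if all(s in frequent for s in combinations(c, k - 1))}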

Table 2.2 Notation for the mining algorithm

k-itemset   An itemset having k items.
Lk          Set of large (frequent) k-itemsets.
Ck          Set of candidate k-itemsets.

2.2.2 Classical Apriori Algorithm

Apriori employs an iterative approach known as level-wise search [15], where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.

Finding each Lk requires one full scan of the database. To find all the frequent itemsets, the algorithm proceeds level by level; the main idea is as follows [16]:

L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ Ø; k++) do
{
    Ck = Apriori-gen(Lk-1);           // generate the new candidates
    for each transaction t ∈ D do     // scan D for counts
    {
        Ct = subset(Ck, t);           // get the subsets of t that are candidates
        for each candidate c ∈ Ct do
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ minsup};
}
return ∪k Lk;

All nonempty subsets of a frequent itemset must also be frequent (the Apriori property). To reduce the size of Ck, pruning is used as follows: if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either, and so can be removed from Ck.
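Putting the pseudocode and the pruning rule together, a runnable sketch of the level-wise loop might look as follows (reusing the apriori_gen helper above; here minsup is assumed to be an absolute support count):

def apriori(transactions, minsup):
    """Return all frequent itemsets (as sorted tuples) with count >= minsup."""
    # L1: count the individual items in one scan of D.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c for c, n in counts.items() if n >= minsup}
    frequent = set(L)
    k = 2
    while L:
        Ck = apriori_gen(L, k)          # join + prune
        counts = {c: 0 for c in Ck}
        for t in transactions:          # one full scan of D per level
            for c in Ck:
                if set(c) <= t:
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= minsup}
        frequent |= L
        k += 1
    return frequent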

Consider a set of transactions in a database where each letter corresponds to a certain product and each transaction corresponds to a customer buying the products A, B, C or D. The first step of the Apriori algorithm is to count the support (number of occurrences) of each item separately.

Table 2.3 Sample retailer's transactional database

Transaction   List of Items
T1            A, B, C, D
T2            B, C, D
T3            B, C
T4            A, B, C
T5            A, B, C, D
T6            B, D

Table 2.4 Item support counts

Item   Support
A      3
B      6
C      5
D      4

The items in the transactions of Table 2.3 have their support counts listed in Table 2.4.

In this example, the complete set of large 1-itemsets in the first iteration is L1 = {A, B, C, D}, since all of these items meet the support threshold (here, a minimum support count of 3). Any item below the support threshold would not have been included in the subsequent steps. In the next steps we form all pairs, triples and so on of these items, as in Table 2.5.

Table 2.5 Candidate 2-itemsets and support counts

2-Itemset   Support
A, B        3
A, C        3
A, D        2
B, C        5
B, D        4
C, D        3

Table 2.6 Frequent 2-itemsets

Large Itemsets
A, B
A, C
B, C
B, D
C, D

Table 2.6 lists the 2-itemsets that meet the support threshold. Next, we generate the candidate 3-itemsets by joining the large 2-itemsets over a common item, as shown in Table 2.7.

Table 2.7 Candidate 3-itemsets and support counts

3-Itemset   Support
A, B, C     3
A, C, D     2
A, B, D     2
B, C, D     3

Table 2.8 Frequent 3-itemsets

Large Itemsets
A, B, C
B, C, D

The only 3-itemsets that fulfill the support threshold are {A, B, C} and {B, C, D}, as illustrated in Table 2.8.

Table 2.9 Candidate 4-itemset and support count

4-Itemset      Support
A, B, C, D     2

Table 2.10 Frequent 4-itemsets (none)

Large Itemsets
Ø

As Table 2.10 shows, the only candidate 4-itemset, {A, B, C, D}, occurs only twice and hence does not meet the support threshold. The maximal frequent itemsets are therefore {A, B, C} and {B, C, D}.
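Running the sketch from Section 2.2.2 on this sample database, with the assumed minimum support count of 3, reproduces the itemsets derived above:

transactions = [
    {"A", "B", "C", "D"},  # T1
    {"B", "C", "D"},       # T2
    {"B", "C"},            # T3
    {"A", "B", "C"},       # T4
    {"A", "B", "C", "D"},  # T5
    {"B", "D"},            # T6
]
for itemset in sorted(apriori(transactions, minsup=3), key=len):
    print(itemset)
# The largest itemsets printed are ('A', 'B', 'C') and ('B', 'C', 'D'),
# matching Table 2.8.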

Advantages of Apriori

· Easy implementation.

· The only initial information required is the transaction database D and a user-defined minimum support threshold Min_supp.

· The algorithm uses information from previous steps to produce the frequent itemsets [18].

2.3 Related Work

One of the most well-known and popular data mining techniques is association rule, or frequent itemset, mining. The algorithm was originally proposed by Agrawal et al. [21], [22] for market basket analysis. Because of its wide applicability, many revised algorithms have been introduced since then, and association rule mining is still an actively researched area. The variations on the Apriori frequent-pattern mining algorithm are discussed below.

The AIS algorithm [21] generates candidate itemsets during each pass of the database scan: large itemsets from the preceding pass are checked for presence in the current transaction, and new itemsets are created by extending the existing ones. This algorithm turns out to be inefficient because it generates too many candidate itemsets; it requires more space, needs too many passes over the whole database, and generates only rules with a single consequent item.

Agrawal et al. [22] developed several versions of the Apriori algorithm, such as Apriori, AprioriTid, and AprioriHybrid. Apriori and AprioriTid generate itemsets using the large itemsets found in the preceding pass, without reconsidering all transactions. AprioriTid improves on Apriori by using the database only in the first pass; counting in succeeding passes is done using encodings created in the first pass, which are much smaller than the database. This leads to a dramatic performance improvement, about three times faster than AIS. A further enhancement, called AprioriHybrid, is achieved by using Apriori in the initial passes and switching to AprioriTid in the later passes, once the candidate k-itemsets are expected to fit into main memory.

Park et al. [23] observed that although different versions of Apriori were available, Apriori still generates too many 2-itemsets that are not frequent. A Direct Hashing and Pruning (DHP) algorithm was therefore developed, which reduces the size of the candidate set by filtering out any k-itemset whose hash-table entry does not have minimum support. This effective filtering allows DHP to complete execution while Apriori is still in its second pass, and hence shows improvements in execution time and space utilization.

Scalability is another important concern in data mining because of the huge size of real databases; algorithms should be able to "scale up" to handle large amounts of data. Eui-Hong et al. [24] addressed scalable data distribution and candidate distribution with the Intelligent Data Distribution (IDD) algorithm and the Hybrid Distribution (HD) algorithm, respectively. IDD addresses the issues of communication overhead and redundant computation by using the aggregate memory of the system to partition candidates and move data efficiently. HD improves over IDD by dynamically partitioning the candidate set to maintain good load balance.

A further scalability study introduced a lightweight data structure called the Segment Support Map (SSM) with the purpose of reducing the number of candidate itemsets required for counting [25]. The SSM contains the support count of each 1-itemset; the individual support counts are added together to give an upper bound for the support of k-itemsets. Applied to Apriori, the effort of generating 1-itemsets is saved by simply inspecting those SSM support counts that exceed the support threshold, and the 1-itemsets that do not meet the threshold are used to reduce the number of higher-level itemsets that must be counted.

Evolutionary Algorithms (EA) are commonly adopted in many scientific fields. EAs borrow mechanisms from biological evolution and apply them to problem solving, and are mainly suitable for search and optimization problems; hence, association rule mining is a natural fit. Evolutionary algorithms that can generate association rules have also been reported [26]; the approach allows overlapping intervals in different itemsets.

An improved version of the original AprioriAll algorithm was developed for sequence mining in [28]. It carries the user ID through every step of producing the candidate set and every step of scanning the database, using it to decide whether an item in the candidate set should be used to produce the next candidate set. The algorithm reduces the size of the candidate set in order to reduce the number of database scans.

Several works in the literature adapt the Apriori logic so as to improve the efficiency of rule generation. An improved version of the Apriori algorithm is presented in [29], where efficiency is improved by scanning the database in both forward and backward directions; an improved association rule-mining algorithm that reduces the scanning time of candidate sets using a hash tree has also been described. A further version of Apriori is reported in [30] as the IApriori algorithm, which optimizes the join process over the generated frequent itemsets to reduce the size of the candidate itemsets.

The study by Amornchewin et al. [31] proposed an incremental itemset mining algorithm based on Apriori. The presented algorithm finds frequent itemsets together with infrequent itemsets that are expected to become frequent after new transactions arrive. It uses the maximum support count of 1-itemsets in the database, before the increments arrive, as a threshold for finding these potentially frequent itemsets, called promising itemsets. It scans only the new transactions, but it assumes that the minimum support value does not change.

Liu Jing et al. [32] worked on improving the efficiency of the Apriori algorithm, since with huge databases efficiency is the most important factor; they achieved this by reducing the number of database scans and by reducing the number of candidate itemsets that might become frequent.

Rui Chang et al. [33] proposed the APRIORI-IMPROVE algorithm, which reads the transaction database in a single scan and does not generate candidate sets, reducing the response time since the candidate 2-itemset C2 need not be generated. It uses a hash structure to generate L2, and employs an efficient parallel data representation and an optimized storage approach to save time and space.

Bo Wu [34] worked on the Apriori-Growth algorithm, which combines the Apriori algorithm with FP-Growth to extract association rules.

Sanober and Madhuri [35] proposed association rule mining based on a Trade List: a new mining algorithm defined over frequent itemsets. The Apriori algorithm scans the records every time it searches for a frequent itemset, which is very time consuming, and it generates a candidate itemset at each step, so for large databases a lot of space is needed to store the candidate itemsets. The undirected item set graph is an enhancement over Apriori, but it takes time and space for tree generation. The defined algorithm scans the database only once at the start, and from that scanned database it generates the Trade List. Across all the previous studies, performance remains the biggest problem because of the large volume of data; the suggestion is therefore to focus on maintaining speed as new data arrives, and on removing unnecessary and unimportant data to conserve storage space.

Even though fast algorithms have been reported for association mining, they still inherit the disadvantage of scanning the whole database many times, which results in a large number of disk reads and places a huge load on the I/O subsystem. The analysis reveals that more attention is required to reduce the number of database scans, and also to reduce memory usage and execution time. These restrictions and other related issues motivated us to continue the research work in this area.

Joseph A. Issa, "Performance Evaluation and Estimation Model Using Regression Method for Hadoop WordCount", received November 19, 2015; accepted December 12, 2015; published December 18, 2015.

In this paper, the author presented a detailed performance analysis and evaluation of the Hadoop WordCount workload on different processors, namely Intel's Atom D525, Intel's Xeon X5690, and AMD's Bobcat E350. The analysis suggests that Hadoop WordCount is a compute-bound workload in both the map phase and the reduce phase. The results show that enabling Hyper-Threading (HT) and increasing the number of sockets have a high impact on Hadoop WordCount performance, whereas memory speed and capacity do not affect performance significantly. The author also concludes that the Atom cluster achieves better performance/watt than the Bobcat cluster at peak performance, while the performance/dollar of the Xeon X5690 is better than that of the Atom. The paper additionally provides estimation models that can estimate total execution time with respect to changes in input size using an Amdahl's-law regression method; the estimation model's average error was less than 5% compared with the measured data.

Yaxiong Zhao, Jie Wu, and Cong Liu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework", ISSN 1007-0214, pp. 39-50, Vol. 19, No. 1, February 2014.

In this paper, the authors propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager, and a task queries the cache manager before executing its actual computing work. A novel cache description scheme and a cache request-and-reply protocol are designed, and Dache is implemented by extending Hadoop. Testbed experiment results show that Dache significantly improves the completion time of MapReduce jobs.

The authors present the design and evaluation of this data-aware cache framework, which requires only minimal change to the original MapReduce programming model in order to provision incremental processing for big-data applications using MapReduce; Dache comprises a data-aware cache description scheme, protocol, and architecture.

Zhuoyao Zhang and Ludmila Cherkasova, "Benchmarking Approach for Designing a MapReduce Performance Model", ICPE'13, April 21-24, 2013.

In this work, the authors present a novel performance evaluation framework for estimating MapReduce job completion time. They observe that the execution of each map (reduce) task consists of distinct, well-defined data processing phases. Only the map and reduce functions are custom, and their executions are user-defined for different MapReduce jobs.

The executions of the remaining phases are generic and depend on the amount of data processed by the phase and on the performance of the underlying Hadoop cluster. First, the authors design a suite of parameterizable microbenchmarks to measure the generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using a job's prior executions, they summarize the job's properties and the performance of its custom map/reduce functions in a compact job profile. Finally, by combining the knowledge in the job profile with the derived platform performance model, they offer a MapReduce performance model that estimates the application completion time for processing a new dataset.

The evaluation study validates this approach and the proposed framework: the authors are able to accurately predict the performance of a diverse set of twelve MapReduce applications. The predicted completion times for most experiments are within 10% of the measured ones (with a worst case of 17% error) on their 66-node Hadoop cluster.

Nikzad Babaii Rizvandi, Albert Y. Zomaya, Ali Javadzadeh Boloori, and Javid Taheri, "On Modeling Dependency between MapReduce Configuration Parameters and Total Execution Time", 2012.

In this paper, the authors propose an analytical method to model the dependency between configuration parameters and the total execution time of MapReduce applications. The approach has three key phases: profiling, modeling, and prediction. In profiling, an application is run several times with different sets of MapReduce configuration parameters to profile its execution time on a given platform. Then, in modeling, the relation between these parameters and the total execution time is modeled by multivariate linear regression.

2.4 Summary

This chapter studied a wide variety of existing mechanisms, algorithms, and architectures in order to identify which issues in the association rule mining area have been resolved and which remain open; the experimental results reported in the literature indicate that the improved approaches are more accurate and give better results than the classical process. The chapter also gave a brief categorization of the various approaches to association rule mining that have been suggested over the last few years.