
CHAPTER 2

Literature Survey


2.1 Association Rules 

This section introduces the basic concepts of pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational databases. Association rule mining can be defined formally as follows:

Let X and Y be itemsets. An association rule has the form

X → Y

where

X ∩ Y = ∅.

X is called the antecedent and Y the consequent of the rule; X and Y are each a set of items, called an itemset or a pattern.

Let frequency(X) be the number of rows (transactions) containing the itemset X in the given database D. The support of an itemset X is defined as the fraction of all rows containing the itemset, i.e. frequency(X)/|D|.

The support of an association rule is the support of the union of X and Y [8], i.e.

Support(X → Y) = frequency(X ∪ Y)/|D|

The confidence of an association rule is defined as the percentage of rows in D containing itemset X that also contain itemset Y, i.e.

Confidence(X → Y) = P(Y|X) = Support(X ∪ Y)/Support(X)
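As a concrete illustration of these two measures, consider the following minimal Python sketch. The function names and the toy database D are invented here for illustration only; they are not taken from the cited literature.

def support(itemset, transactions):
    # Fraction of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # Support(X U Y) / Support(X).
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical toy database of four transactions.
D = [{"A", "B", "C"}, {"B", "C"}, {"A", "B"}, {"B", "D"}]
print(support({"B", "C"}, D))       # 2 of 4 rows contain both B and C: 0.5
print(confidence({"B"}, {"C"}, D))  # 0.5 / 1.0 = 0.5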

 

Association rule mining thus consists of two steps: finding all frequent (large) itemsets, and generating association rules from them. The overall performance of mining association rules is determined primarily by the first step; the second step is easy, since after the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. The main concern of this thesis is the first step, i.e. the extraction of frequent itemsets [9].

Fig. 2.1 Generating association rules

2.1.1 Association Rule Description

Mining association rules is one of the main topics of current data mining research, and focuses in particular on finding relations between different items in a database. The basic measures are summarized in Table 2.1.

 

Table 2.1 Basic concepts of association rules

Name                 | Explanation                                      | Formula
Confidence           | Probability that Y appears given that X appears  | P(Y|X)
Support              | Probability that X and Y appear simultaneously   | P(X ∪ Y)
Expected confidence  | Probability that Y appears                       | P(Y)
Lift                 | Ratio of confidence to expected confidence       | P(Y|X)/P(Y)
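Continuing the illustrative Python sketch from Section 2.1, lift (the last row of Table 2.1) could be computed as below; a lift above 1 indicates positive correlation between X and Y.

def lift(x, y, transactions):
    # Confidence(X -> Y) divided by the expected confidence P(Y).
    return confidence(x, y, transactions) / support(y, transactions)

print(lift({"B"}, {"C"}, D))  # 0.5 / 0.5 = 1.0: B and C look independent here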

 

 

2.1.2 Classification of Association Rule

There are many kinds of association rules, and they can be classified in various ways based on the following criteria:

·         Based on the types of values handled in the rule, the rules can be divided into Boolean association rules and quantitative association rules.

·         Based on the levels of abstraction involved in the rule set, the rules can be divided into single-level and multi-level association rules.

·         Based on the dimensions of data involved in the rule set, the rules can be divided into single-dimensional and multi-dimensional association rules.

2.2 Apriori Algorithm

2.2.1 Introduction of the Apriori Algorithm

Apriori is a classic and influential algorithm for mining frequent itemsets for Boolean association rules [11], and remains a standard data mining method for association rule discovery.

Apriori uses an iterative, level-wise search, where frequent k-itemsets are used to explore the (k+1)-itemsets. First, the set of frequent 1-itemsets is found and denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found [13]. Finally, the rules are generated from the frequent itemsets.

How Lk-1 is used to find Lk consists of a two-step process, the join and prune actions, as follows [14] (a Python sketch of both steps is given after this list):

·         The join step: Lk-1 is joined with itself; two (k-1)-itemsets that differ only in their last item are combined to generate a candidate k-itemset. The resulting set of candidates is denoted Ck, with Lk ⊆ Ck.

·         The prune step: If any (k-1)-itemset subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either, and it is deleted from Ck. The database is then scanned to determine the count of each remaining candidate in Ck, and any candidate whose count is less than the minimum support count is discarded. The surviving itemsets form the set of frequent k-itemsets, denoted Lk.
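Here is a minimal Python sketch of the join and prune actions, under the assumption that itemsets are represented as sorted tuples; the function name apriori_gen and this representation are illustrative choices, not fixed by the cited papers.

from itertools import combinations

def apriori_gen(L_prev, k):
    # Join step: combine two frequent (k-1)-itemsets that share
    # their first k-2 items to form a candidate k-itemset.
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            if a[:k-2] == b[:k-2] and a[k-2] < b[k-2]:
                candidates.add(a[:k-2] + (a[k-2], b[k-2]))
    # Prune step: a candidate survives only if every one of its
    # (k-1)-subsets is itself frequent, i.e. appears in L_prev.
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, k - 1))}

# e.g. apriori_gen({("A","B"), ("A","C"), ("B","C")}, 3) -> {("A","B","C")}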

 

Table 2.2 Notation for the mining algorithm

k-itemset | An itemset having k items.
Lk        | The set of large (frequent) k-itemsets.
Ck        | The set of candidate k-itemsets.

 

2.2.2 Classical Apriori Algorithm

Apriori employs an iterative approach known as a level-wise search [15], where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.

The finding of each Lk requires one full scan of the database. To find all the frequent itemsets, the algorithm proceeds iteratively. The main idea is as follows [16]:

 

 

L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do
{
    Ck = apriori-gen(Lk-1);           // the new candidates
    for each transaction t ∈ D do     // scan D for counts
    {
        Ct = subset(Ck, t);           // get the subsets of t that are candidates
        for each candidate c ∈ Ct do
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ minsup};
}
return ∪k Lk;
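The pseudocode above translates almost line for line into Python. The sketch below reuses the apriori_gen function sketched in Section 2.2.1 and takes minsup as an absolute support count; as before, all names are illustrative, not part of the original algorithm description.

from collections import Counter

def apriori(D, minsup):
    # D: list of transactions, each a set of items.
    # L1: frequent 1-itemsets, kept as sorted 1-tuples.
    item_counts = Counter(item for t in D for item in t)
    L = {1: {(i,) for i, n in item_counts.items() if n >= minsup}}
    k = 2
    while L[k - 1]:
        Ck = apriori_gen(L[k - 1], k)      # the new candidates
        counts = Counter()
        for t in D:                        # scan D for counts
            for c in Ck:
                if set(c) <= t:            # transaction t contains candidate c
                    counts[c] += 1
        L[k] = {c for c in Ck if counts[c] >= minsup}
        k += 1
    return set().union(*L.values())        # union of all Lk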

 

All nonempty subsets of a frequent itemset must also be frequent. To reduce the size of Ck, pruning is used as follows: if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either, and so can be removed from Ck.

Consider a set of transactions in a database where each letter corresponds to a certain product and each transaction corresponds to a customer buying the products A, B, C, or D. The first step in the Apriori algorithm is to count the support (number of occurrences) of each item separately.

Table 2.3 Sample retailer transactional database

Transaction | List of Items
T1          | A, B, C, D
T2          | B, C, D
T3          | B, C
T4          | A, B, C
T5          | A, B, C, D
T6          | B, D

 

Table 2.4 Item support counts

Item | Support
A    | 3
B    | 6
C    | 5
D    | 4

 

The items in the transactions of Table 2.3 have their support counts listed in Table 2.4. In this example the complete set of large 1-itemsets in the first iteration is L1 = {A, B, C, D}, since all of these items meet the support threshold (a minimum support count of 3 in this example). Had any of these items fallen below the support threshold, it would not have been included in the subsequent steps. In the next steps we form all pairs, triples, and so on of these items, as shown in Table 2.5.

Table 2.5 Candidate 2-itemsets and their support counts

2-Itemset | Support
A,B       | 3
A,C       | 3
A,D       | 2
B,C       | 5
B,D       | 4
C,D       | 3

 

Table 2.6 Frequent 2-itemsets

Large Itemset
A,B
A,C
B,C
B,D
C,D

 

Table 2.5 lists the candidate 2-itemsets together with their respective support counts, and Table 2.6 shows those that meet the support threshold; note that {A, D}, with a count of 2, is discarded. Next we generate the candidate 3-itemsets of Table 2.7 by joining the large itemsets in Table 2.6 over a common item.

Table 2.7 Candidate 3-itemsets and their support counts

3-Itemset | Support
A,B,C     | 3
A,C,D     | 2
A,B,D     | 2
B,C,D     | 3

 

Table 2.8 Frequent 3-itemsets

Large Itemset
A,B,C
B,C,D
 

The only 3-itemsets that fulfill the support threshold are {A, B, C} and {B, C, D}, as illustrated in Table 2.8.

Table 2.9 Candidate 4-itemset and its support count

4-Itemset | Support
A,B,C,D   | 2

 

Table 2.10 Frequent 4-itemsets (empty)

Large Itemset
Ø

 

As Table 2.10 shows, the last candidate itemset, {A, B, C, D}, occurs only twice and hence does not fulfill our support threshold, so no 4-itemset is frequent. The maximal frequent itemsets are therefore {A, B, C} and {B, C, D}.
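As a sanity check (illustrative only, not part of the thesis experiments), running the apriori sketch from Section 2.2.2 on the database of Table 2.3 with a minimum support count of 3 reproduces Tables 2.4, 2.6 and 2.8. Note that candidate generation never produces {A, B, D} and {A, C, D}, since {A, D} is infrequent, although Table 2.7 counts them; the resulting frequent itemsets are the same.

D = [{"A", "B", "C", "D"}, {"B", "C", "D"}, {"B", "C"},
     {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "D"}]
frequent = apriori(D, minsup=3)
# frequent contains the 1-itemsets A, B, C, D, the 2-itemsets
# {A,B}, {A,C}, {B,C}, {B,D}, {C,D}, and the 3-itemsets
# {A,B,C} and {B,C,D}; no 4-itemset is frequent.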

Advantages of Apriori

·         Easy to implement.

·         Requires little initial information: the transaction database D and a user-defined minimum support threshold Min_supp.

·         The algorithm uses information from previous steps to produce the frequent itemsets [18].

 

2.3 Related Work

One of the most well-known and popular data mining techniques is association rule, or frequent itemset, mining. The algorithm was originally proposed by Agrawal et al. [21][22] for market basket analysis. Because of its wide applicability, many revised algorithms have been introduced since then, and association rule mining is still a widely researched area. This section discusses many of the variations made to the Apriori frequent pattern-mining algorithm.

The AIS algorithm [21] generates candidate itemsets during each pass of the database scan: large itemsets from the preceding pass are checked for presence in the current transaction, and new itemsets are created by extending the existing ones. This algorithm turns out to be ineffective because it generates too many candidate itemsets; it requires more space, makes many passes over the whole database, and generates rules with only one consequent item.

Agrawal et al. [22] developed further versions of the Apriori algorithm, such as AprioriTid and AprioriHybrid. AprioriTid generates itemsets using the large itemsets found in the preceding pass, without considering the transactions in the database; it uses the database only in the first pass, and counting in succeeding passes is done using encodings created in that pass, which are much smaller than the database. This leads to a dramatic performance improvement of about three times over AIS. A further enhancement, called AprioriHybrid, is achieved by using Apriori in the initial passes and switching to AprioriTid in the later passes, once the candidate k-itemsets are expected to fit into main memory.

 

Park et al. [23] observed that, across the available versions of Apriori, the problem remained that too many infrequent 2-itemsets are generated. They developed the Direct Hashing and Pruning (DHP) algorithm, which reduces the size of the candidate set by filtering out any k-itemset whose hash-table entry does not have minimum support. This powerful filtering capability allows DHP to complete execution while Apriori is still at its second pass, and hence shows improvement in execution time and space utilization.

Scalability is another important concern in data mining because of the huge size of real databases; algorithms should be able to "scale up" to handle large amounts of data. Eui-Hong et al. [24] made data distribution and candidate distribution scalable through the Intelligent Data Distribution (IDD) algorithm and the Hybrid Distribution (HD) algorithm, respectively. IDD addresses the issues of communication overhead and redundant computation by using the total system memory to partition candidates and move data efficiently. HD improves over IDD by dynamically partitioning the candidate set to keep a good load balance.

A further scalability study introduced a light-weight data structure called the Segment Support Map (SSM), with the purpose of reducing the number of candidate itemsets required for counting [25]. The SSM contains the support count for each 1-itemset, and the individual support counts are added together as an upper bound for k-itemsets. Applying this to Apriori, the effort to generate the frequent 1-itemsets is saved by simply inspecting those SSM support counts that exceed the support threshold. In addition, the 1-itemsets that do not meet the threshold are used to reduce the number of higher-level itemsets to be counted.

Evolutionary Algorithms (EAs) are commonly adopted in many scientific areas. EAs borrow mechanisms from biological evolution and apply them to problem solving; they are mainly suited to search and optimization problems, so the problem of mining association rules is a natural fit. Evolutionary algorithms that can generate association rules have also been reported [26]; they allow overlapping intervals in different itemsets.

An improved version of the original AprioriAll algorithm was developed for sequence mining in [28]. It carries the user ID through every step of producing the candidate set and every step of scanning the database, using it to decide whether an item in the candidate set should be used to build the next candidate set. The algorithm reduces the size of the candidate set in order to reduce the number of database scans.

Various other works in the literature adapt the Apriori logic to improve the efficiency of generating rules. An improved version of Apriori is presented in [29], where efficiency is gained by scanning the database in both forward and backward directions, together with an improved association rule-mining algorithm that reduces the scanning time of candidate sets using a hash tree. A further version of Apriori is reported in [30] as the IApriori algorithm, which optimizes the join process over the generated frequent itemsets to reduce the size of the candidate itemsets.

The study by Amornchewin et al. [31] presented an incremental itemset mining algorithm based on Apriori. The algorithm finds frequent itemsets, as well as infrequent itemsets that are expected to become frequent after new transactions arrive. It uses the maximum support count of the 1-itemsets in the database before the increments arrive as a threshold for finding possibly frequent itemsets, called promising itemsets. It scans only the new transactions, but it assumes that the minimum support value does not change.

Liu Jing et al. [32] worked on improving the efficiency of the Apriori algorithm, since with huge databases efficiency is the most important and pressing factor. This was done by reducing the number of database scans and by reducing the number of candidate itemsets that might become frequent.

Rui Chang et al. [33] proposed the APRIORI-IMPROVE algorithm, which reads the transaction database in a single scan and does not generate candidate sets, reducing the workload and thus the response time; in particular, it need not generate C2, the candidate 2-itemsets. It uses a hash structure to generate L2, and uses an efficient parallel data representation and an optimized storage approach to save time and space.

Bo Wu [34] worked on the Apriori-Growth algorithm, which combines the Apriori algorithm and FP-Growth to extract association rules.

Sanober and Madhuri [35] proposed association rule mining based on a Trade List, defining a new mining algorithm over frequent itemsets. The Apriori algorithm scans the database every time it looks for a frequent itemset, which is very time consuming, and at each step it generates a candidate itemset, so for large databases it takes a lot of space to store the candidate itemsets. The undirected itemset graph is an enhancement over Apriori, but it costs time and space for tree generation. The defined algorithm scans the database only once at the start, and then generates the Trade List from that scanned database. Across all of the previous studies, performance remains the biggest problem because of the large volume of data; the suggestion is to focus on maintaining speed as new data arrive and on removing unnecessary and unimportant data to conserve storage space.

Even though fast algorithms have been reported for association mining, they still inherit the disadvantage of scanning the whole database many times, which results in a large number of disk reads and places a heavy load on the I/O subsystem. The analysis reveals that more attention is needed to reduce the number of database scans, and also to reduce memory usage and execution time. These restrictions and other related issues motivated us to continue the research work in this area.

 

Joseph A. Issa, "Performance Evaluation and Estimation Model Using Regression Method for Hadoop WordCount", received November 19, 2015, accepted December 12, 2015, published December 18, 2015.

In this paper, the author presents a detailed performance analysis and evaluation of the Hadoop WordCount workload on different processors, namely Intel's Atom D525, Intel's Xeon X5690, and AMD's Bobcat E350. The analysis shows that Hadoop WordCount is a compute-bound workload in both the map phase and the reduce phase. The results show that enabling Hyper-Threading and increasing the number of sockets have a high impact on Hadoop WordCount performance, whereas memory speed and capacity do not significantly affect it. The author also concludes that the Atom cluster can achieve better performance/watt than the Bobcat cluster at peak performance, while in comparing Atom with the Xeon X5690, the performance/dollar for the Xeon is better. The paper additionally provides estimation models that can predict the total execution time as the input size changes, using an Amdahl's-law regression method; the estimation model's average error was less than 5% compared with the measured data.

Yaxiong Zhao, Jie Wu, and Cong Liu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework", ISSN 1007-0214, pp. 39-50, Vol. 19, No. 1, February 2014.

In this paper, the authors propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager, and a task queries the cache manager before executing its actual computing work. A novel cache description scheme and a cache request-and-reply protocol are designed, and Dache is implemented by extending Hadoop. Testbed experiment results show that Dache significantly improves the completion time of MapReduce jobs.

The authors present the design and evaluation of a data-aware cache framework that requires minimal change to the original MapReduce programming model in order to provision incremental processing for big-data applications using MapReduce; Dache comprises a data-aware cache description scheme, protocol, and architecture.

Zhuoyao Zhang and Ludmila Cherkasova, "Benchmarking Approach for Designing a MapReduce Performance Model", ICPE'13, April 21-24, 2013.

In this work, the authors present a novel performance evaluation framework. They observe that the execution of each map (reduce) task consists of distinct, well-defined data processing phases; only the map and reduce functions are customized, and their executions are user-defined for each MapReduce job.

The executions of the remaining phases are generic and depend on the amount of data processed by the phase and on the performance of the underlying Hadoop cluster. First, they design a suite of parameterizable micro-benchmarks to measure the generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using the job's prior executions, they summarize the job's properties and the performance of its customized map/reduce functions in a compact job profile. Finally, by combining the knowledge in the job profile with the derived platform performance model, they offer a MapReduce performance model that estimates the application completion time for processing a new dataset.

The evaluation study justifies the approach and the proposed framework: they are able to reliably predict the performance of diverse sets of twelve MapReduce applications. The predicted completion times for most experiments are within 10% of the measured ones (with a worst case of 17% error) on their 66-node Hadoop cluster.

Nikzad Babaii Rizvandi, Albert Y. Zomaya, Ali Javadzadeh Boloori, and Javid Taheri, "On Modeling Dependency between MapReduce Configuration Parameters and Total Execution Time", 2012.

In this paper, the authors propose an analytical method to model the dependency between configuration parameters and the total execution time of MapReduce applications. The approach has three key phases: profiling, modeling, and prediction. In profiling, an application is run several times with different sets of MapReduce configuration parameters to profile its execution time on a given platform. Then, in modeling, the relation between these parameters and the total execution time is modeled by multivariate linear regression.

2.4 Summary

The analysis and experimental results reported for these systems indicate that they are more accurate than the classical process and therefore give better results than the older approaches. In this chapter, a wide variety of existing mechanisms, algorithms, and architectures has been studied to identify the issues that have been resolved, and those that remain, in the Association Rule Mining area. The chapter also gives a brief categorization of the various approaches that have been suggested over the last few years for Association Rule Mining.
