A New Parallel Partition Prime Multiple Algorithm for Data Mining

Authors

  • Manish Tiwari
  • Partha Pratim Bhattacharya
  • Jitendra Agrawal
  • Rajiv Gandhi
Abstract

One of the important problems in data mining is discovering association rules from databases. Each transaction contains a set of items. Discovering the frequent itemsets requires a lot of computation power, memory, and input/output capacity, which can only be provided by parallel computers. In this paper, we propose a new Parallel Partition Prime Multiple Algorithm for association rule mining. The proposed algorithm addresses the shortcomings of the previously proposed Parallel Buddy Prima Algorithm. It divides the transaction database evenly among the processors using a count variable maintained for each processor. The decision to assign the next transaction to a processor depends on the value of this count variable, which tracks the number of items per transaction. It reduces the time and data complexity.

1. OVERVIEW OF DATA MINING

The explosive growth of data poses a challenge: new techniques are needed to extract useful patterns from such huge amounts of data. Data mining emerged as a new research area to meet this challenge and has recently attracted a lot of research attention. “Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets”[1]. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Data mining therefore involves collecting and managing huge amounts of data, and it also includes analysis and prediction [2].

1.1 KNOWLEDGE DISCOVERY IN DATABASE (KDD):

Real-world data tend to be incomplete and noisy because of manual input mistakes. The integrated data sources can be stored in a database, a data warehouse, or other repositories. The second process is to select task-related data from the integrated resources and transform them into a format that is ready to be mined.
Suppose we want to analyze which items are often purchased together in a supermarket; the database that records the purchase history may contain customer ID, items bought, transaction time, prices, number of each item, and so on.

1.2 ASSOCIATION RULE

Association rule mining [3][4] is one of the most important and well-researched techniques of data mining. It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in a transaction database or other data repositories. Association rule mining finds interesting association or correlation relationships among a large set of data items [3]. These rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold [5].

A more formal definition is given in [6]. Let I = {i1, i2, ..., im} be a set of items and D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called the TRANSACTION ID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅ [3].

Support (s): the support s of the rule A ⇒ B is defined as the fraction of transactions in D that contain both A and B, that is, s(A ⇒ B) = P(A ∪ B).

Confidence (c): the confidence c of the rule A ⇒ B is defined as the fraction of transactions containing A that also contain B, that is, c(A ⇒ B) = P(B | A) = s(A ∪ B) / s(A).

ISSN: 2278-5183 International Journal of Computers and Distributed Systems www.ijcdsonline.com Vol. No.2, Issue 1, December 2012 www.cirworld.com

2. PARALLEL BUDDY PRIMA ALGORITHMS

2.1 Buddy Prima Algorithm:

In the Buddy Prima Algorithm, the support count can be calculated easily. The weakness of its representation is that the product of the prime numbers becomes very large for a transaction with many items. The algorithm also requires a lot of computation power, memory, and input/output capacity for large-scale association mining. To overcome these problems, a parallel version, the Parallel Buddy Prima algorithm [7], has been proposed.
This representation uses prime numbers to represent the items in a transaction. Each item is assigned a unique prime number. Each transaction is represented by the product of the prime numbers of the individual items in the transaction. Because the prime factorization of this product is unique, dividing the transaction's product by the prime product of an itemset checks whether the itemset is present in the transaction. If the remainder is zero, the itemset is present in the transaction. If the remainder is nonzero, the itemset is not present. Buddy Prima checks the presence of an itemset in the transactions with this method and uses the candidate distribution technique. The algorithm provides scalability in terms of data dimension, size, and runtime performance for large databases.

2.2 Parallel Buddy Prima Algorithm:

In this algorithm, the computation time of itemset generation is reduced. The candidate distribution technique assigns the candidate itemsets generated from different parts of the database to different processors, and each processor is assigned disjoint candidates, independent of the other processors. At the same time, the database is shared among all processors, so that each processor can generate global counts independently.

The Master node prunes the transactions by removing infrequent 1-itemsets and stores the prime multiple of each transaction in shared memory. It finds the maximal transaction length Maxlen and puts it in shared memory. It divides the transactions equally among the nodes for candidate generation. Horizontal partitioning, vertical partitioning, or checkerboard partitioning can be used to divide and distribute the transactions. The Master connects to each slave node and initiates the process of finding the frequent itemsets. Finally, the Master node reports the global frequent itemsets after gathering the local frequent itemsets.
After the Master node initiates a slave node, the slave reads its allotted transactions and the maximal transaction length Maxlen. It uses the buddy approach to find the maximal frequent itemset and the Prima representation to quickly find the support count of an itemset. It then returns the frequent itemsets to the Master node.

For partitioning, the candidate distribution technique is adopted to handle large datasets with large itemsets. Itemsets were previously stored as long strings, which consume a lot of memory and make data scanning difficult and time consuming. Here we propose to assign prime numbers to items. The prime representation consumes less memory because each transaction is replaced with the product of the prime numbers of its items; as a result, it reduces the time taken to determine the support count of an itemset [7]. Similarly, it reduces the time and data complexity because of the unique multiplication property. The performance of the proposed algorithm is studied and compared with other existing algorithms.

3. IMPLEMENTATION OF PARALLEL BUDDY PRIMA ALGORITHM

Here, we implemented the Parallel Buddy Prima Algorithm [8] on real-life supermarket transactions. In this example, Table 1 lists the transaction IDs in the database together with the items sold in each transaction. Table 2 gives the prime number allotted to every item sold in the supermarket. Some items are not sold very frequently, so we set a minimum support count of 3 to remove infrequent items; the result is shown in Table 3. Table 4 then gives the prime multiple of each transaction in the supermarket database.

Table 1: Transaction Database for Supermarket

TID   Transactions
T1    1,3,7,13
T2    4,6,10,11
T3    3,9,13
T4    4,5,7,8,14
T5    1,2,3,7,9,13
T6    4,5,7,8,9,10,14
T7    1,2,4,5,6,10,11
T8    1,3,7,9,10
T9    1,3,7,11,13
T10   4,5,6,10,11
T11   1,3,4,5,7,8,14
T12   5,8,12

Table 2: Assign item numbers and equivalent prime number of item

Items   Allotted Prime Number
1       2
2       3
3       5
4       7
5       11
6       13
7       17
8       19
9       23
10      29
11      31
12      37
13      41
14      43

Table 3: Transaction database after removing infrequent items

TID   Transaction
T1    1,3,7,13
T2    4,6,10,11
T3    3,9,13
T4    4,5,7,8,14
T5    1,3,7,9,13
T6    4,5,7,8,9,10,14
T7    1,4,5,6,10,11
T8    1,3,7,9,10
T9    1,3,7,11,13
T10   4,5,6,10,11
T11   1,3,4,5,7,8,14
T12   5,8

Table 4: Prima Representation of Transaction Database and their Prime Multiplications

TID   Transaction             Trans. Multiple
T1    2*5*17*41               6970
T2    7*13*29*31              81809
T3    5*23*41                 4715
T4    7*11*17*19*43           1069453
T5    2*5*17*23*41            160310
T6    7*11*17*19*23*29*43     713325151
T7    2*7*11*13*29*31         1799798
T8    2*5*17*23*29            113390
T9    2*5*17*31*41            216070
T10   7*11*13*29*31           899899
T11   2*5*7*11*17*19*43       10694530
T12   11*19                   209

Now suppose we want to know in which transactions the itemset {3, 7} occurs. We take the prime numbers allotted to items 3 and 7, which are 5 and 17 (see Table 2), multiply them to get 5*17 = 85, and perform modulo division as shown in Table 5. If the remainder of dividing a transaction multiple by 85 is 0, the itemset is present in that transaction. Table 5, which shows the modulo division for finding the support count of {3, 7}, indicates where the itemset is present. We can conclude that {3, 7} is present in transactions T1, T5, T8, T9, and T11.
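The prime-multiple encoding and the modulo test for {3, 7} described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ITEM_PRIME`, `TRANSACTIONS`, `prime_multiple`, and `supporting_transactions` are hypothetical, the data is copied from Tables 2 and 3, and T3 is taken as 3,9,13, which is what its multiple 4715 = 5*23*41 in Table 4 encodes.

```python
from math import prod

# Item -> prime mapping from Table 2 (items 1..14).
ITEM_PRIME = dict(zip(range(1, 15),
                      [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]))

# Pruned transactions from Table 3 (infrequent items 2 and 12 removed).
TRANSACTIONS = {
    "T1": [1, 3, 7, 13],
    "T2": [4, 6, 10, 11],
    "T3": [3, 9, 13],
    "T4": [4, 5, 7, 8, 14],
    "T5": [1, 3, 7, 9, 13],
    "T6": [4, 5, 7, 8, 9, 10, 14],
    "T7": [1, 4, 5, 6, 10, 11],
    "T8": [1, 3, 7, 9, 10],
    "T9": [1, 3, 7, 11, 13],
    "T10": [4, 5, 6, 10, 11],
    "T11": [1, 3, 4, 5, 7, 8, 14],
    "T12": [5, 8],
}


def prime_multiple(items):
    """Product of the primes allotted to the given items."""
    return prod(ITEM_PRIME[i] for i in items)


def supporting_transactions(itemset):
    """IDs of transactions whose multiple is divisible by the itemset's multiple."""
    k = prime_multiple(itemset)
    return [tid for tid, items in TRANSACTIONS.items()
            if prime_multiple(items) % k == 0]


multiples = {tid: prime_multiple(items) for tid, items in TRANSACTIONS.items()}
present = supporting_transactions([3, 7])   # divisor is 5 * 17 = 85
print(present, len(present))
```

The support count is simply the length of the returned list. Python integers are arbitrary precision, so the sketch does not run into the large-product weakness noted for Buddy Prima; a fixed-width integer type would.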
Table 5: Support count determination for {3, 7}

TID   Modulo Division       Remainder   Items Presence
T1    6970 mod 85           0           Yes
T2    81809 mod 85          Non-Zero    No
T3    4715 mod 85           Non-Zero    No
T4    1069453 mod 85        Non-Zero    No
T5    160310 mod 85         0           Yes
T6    713325151 mod 85      Non-Zero    No
T7    1799798 mod 85        Non-Zero    No
T8    113390 mod 85         0           Yes
T9    216070 mod 85         0           Yes
T10   899899 mod 85         Non-Zero    No
T11   10694530 mod 85       0           Yes
T12   209 mod 85            Non-Zero    No

The Parallel Buddy Prima algorithm does not use any intelligent load balancing. It is therefore possible for one processor to have more workload than the others. For example, Table 6 shows a dummy transaction database in which transactions T1 and T2 have 7 and 8 items, while transactions T3 and T4 have only 2 items each.

Table 6: Horizontal Partitioning (Transactions are distributed between the processors without any intelligent load balancing).

TID   Transaction
T1    1,2,3,4,5,6,10
T2    1,4,5,6,7,8,9,10
T3    4,5
T4    5,6

TID   Transaction          Item
T1    1,2,3,4,5,6,10       7
T2    1,4,5,6,7,8,9,10     8

TID   Transaction          Item
T3    4,5                  2
T4    5,6                  2

4. PROPOSED ALGORITHM

Several algorithms [9-12] have been proposed to mine all the frequent itemsets in a transaction database. These algorithms differ from one another in how they handle the candidate sets, in their parallel design space, and in how they reduce the number of database passes. The main idea behind most of them is to divide the transactions equally among the processors and then apply a bottom-up approach for generating frequent itemsets. If the maximal frequent itemset is long, a top-down search is suitable. For transactions with a medium-sized maximal frequent set, a combination of both approaches performs well. We propose here a new parallel algorithm, named Parallel Partition Prime Multiple Algorithm (PPPMA), for association rule mining as well as load balancing.
In this approach, the database is partitioned and distributed across the clients based on a transaction limit for each processor. This reduces the size of the transaction prime multiples efficiently. Items can be purchased in any order or combination. We first divide the transactions equally between the processors. Smaller prime numbers are assigned to the items that occur more frequently in the transactions. The proposed algorithm can be explained using the following steps:

Step 1: Find the infrequent items of length 1 under the constraint of a minimum support count of 3, and store them in memory as IF1.

Step 2: Remove the infrequent 1-items denoted by IF1.

Table 7: Transaction database after putting constraints, minimum support count of 3 and removing infrequent items

TID   Transaction
T1    1,3,7,13
T2    4,6,10,11
T3    3,9,13
T4    4,5,7,8,14
T5    1,3,7,9,13
T6    4,5,7,8,9,10,14
T7    1,4,5,6,10,11
T8    1,3,7,9,10
T9    1,3,7,11,13
T10   4,5,6,10,11
T11   1,3,4,5,7,8,14
T12   5,8

Step 3: Take the item count of each transaction Ti.

Step 4: Find the size Maxlen of the longest transaction in the database and store the value.

Step 5: Divide the transactions equally among the nodes; for this we apply static load balancing. Here N is the set of processors, T is the set of transaction IDs, and Ti is a member of T. The load balancing method, which assumes more transactions than processors (T > N), is given below:

(A) Keep an item count variable Ai for each processor Ni of N.

(B) Assign the first transactions one per processor: transaction Ti goes to processor Ni, and the item count of Ti is stored in Ai. Then find the smallest value of Ai, assign the next transaction to the processor with that smallest value, and add the transaction's item count to the corresponding Ai. This process continues until all transactions are assigned, so that the processors end up about equally loaded.

EXAMPLE

Following the above algorithm, we divided the twelve transactions among three processors as shown in Table 8.
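The static load balancing of Step 5 can be sketched as a greedy "least-loaded processor first" loop. This is a minimal sketch under assumptions, not the authors' code: the function name `balance` and the variable names are hypothetical, transactions are processed in TID order, and ties between equally loaded processors are broken by taking the lowest-numbered processor, which the text does not specify. Starting every count Ai at zero makes the first N transactions go to processors 1..N, as in part (B).

```python
# Pruned transaction database from Table 7.
TABLE7 = {
    "T1": [1, 3, 7, 13],
    "T2": [4, 6, 10, 11],
    "T3": [3, 9, 13],
    "T4": [4, 5, 7, 8, 14],
    "T5": [1, 3, 7, 9, 13],
    "T6": [4, 5, 7, 8, 9, 10, 14],
    "T7": [1, 4, 5, 6, 10, 11],
    "T8": [1, 3, 7, 9, 10],
    "T9": [1, 3, 7, 11, 13],
    "T10": [4, 5, 6, 10, 11],
    "T11": [1, 3, 4, 5, 7, 8, 14],
    "T12": [5, 8],
}


def balance(transactions, n_processors):
    """Assign each transaction to the processor with the smallest item count.

    Returns (assignment, loads): the list of TIDs per processor and the
    final item count A_i of each processor.
    """
    loads = [0] * n_processors                    # the A_i counters
    assignment = [[] for _ in range(n_processors)]
    for tid, items in transactions.items():
        # min() returns the first minimum, i.e. the lowest-numbered
        # processor among ties (an assumption of this sketch).
        target = min(range(n_processors), key=lambda p: loads[p])
        assignment[target].append(tid)
        loads[target] += len(items)
    return assignment, loads


assignment, loads = balance(TABLE7, 3)
for number, (tids, load) in enumerate(zip(assignment, loads), start=1):
    print(f"processor{number}: {tids} ({load} items)")
```

With this tie-breaking rule the sketch yields item counts of 19, 18, and 21 for the three processors, compared with a worst case where the longest transactions all land on the same node.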
Table 8: Horizontal Partitioning with New Load Balancing Method

Load for processor1

TID   Trans               Item
T1    1,3,7,13            4
T5    1,3,7,9,13          5
T8    1,3,7,9,10          5
T10   4,5,6,10,11         5

Load for processor2

TID   Trans               Item
T2    4,6,10,11           4
T6    4,5,7,8,9,10,14     7
T9    1,3,7,11,13         5
T12   5,8                 2

Load for processor3

TID   Trans               Item
T3    3,9,13              3
T4    4,5,7,8,14          5
T7    1,4,5,6,10,11       6
T11   1,3,4,5,7,8,14      7

Prime multiplication and modulo division for finding the support count are given below.

Step 6: Assign a prime number Pi to each unique item, separately for each processor. To do this, the items are arranged in descending order of frequency, and the lowest prime numbers are allotted to the most frequent items.

Step 7: Represent each transaction Ti with m items by the multiple Mi = P1 * P2 * ... * Pm of the prime numbers of its items, and store it in the corresponding processor's memory.

Step 8: To find the support count of an itemset S, calculate Mi mod k, where k is the prime multiple corresponding to S. The presence of the itemset is determined from the remainder, and the result is stored separately.

The prime numbers for processor 1 are allotted as shown in Table 9.

Table 9: Allocation of prime number for processor1

Item   Frequency of occurring   Assign prime number
1      3                        2
3      3                        3
7      3                        5
9      2                        7
10     2                        11
13     2                        13
4      1                        17
5      1                        19
6      1                        23
11     1                        29

Table 9 shows the allocation of prime numbers according to Step 6 for the items of processor 1. Items with higher frequency are represented by lower prime numbers. The prime multiples for processor 1 are calculated as shown in Table 10.

Table 10: Prime Multiplication for Processor 1

TID   Transaction         Transaction multiple
T1    2*3*5*13            390
T5    2*3*5*7*13          2730
T8    2*3*5*7*11          2310
T10   17*19*23*11*29      2369851

According to Step 7, the multiples computed from the Table 7 transactions are stored separately for further calculation.

Table 11: Support count determination for {3,7} for processor 1

TID   Modulo Division   Remainder   Item’s Presence
T1    390 mod 15        0           Yes
T5    2730 mod 15       0           Yes
T8    2310 mod 15       0           Yes
T1
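Steps 6 to 8 for processor 1 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `first_primes`, `assign_primes`, and the other names are hypothetical, and breaking frequency ties by ascending item number is an assumption, chosen because it matches the ordering shown in Table 9.

```python
from collections import Counter
from math import prod

# Transactions allotted to processor 1 in Table 8.
PROCESSOR1 = {
    "T1": [1, 3, 7, 13],
    "T5": [1, 3, 7, 9, 13],
    "T8": [1, 3, 7, 9, 10],
    "T10": [4, 5, 6, 10, 11],
}


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def assign_primes(transactions):
    """Step 6: the most frequent item gets the smallest prime."""
    freq = Counter(item for items in transactions.values() for item in items)
    # Descending frequency; ties broken by item number (assumption).
    ordered = sorted(freq, key=lambda item: (-freq[item], item))
    return dict(zip(ordered, first_primes(len(ordered))))


primes = assign_primes(PROCESSOR1)

# Step 7: one prime multiple per transaction.
multiples = {tid: prod(primes[i] for i in items)
             for tid, items in PROCESSOR1.items()}

# Step 8: local support of {3, 7}; the divisor k is 3 * 5 = 15 here.
k = prod(primes[i] for i in (3, 7))
present = [tid for tid, m in multiples.items() if m % k == 0]
print(primes, multiples, present)
```

Because each processor builds its own item-to-prime table, a multiple is only meaningful on the processor that produced it; local counts, not raw multiples, are what can be combined across processors.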




Publication date: 2012