Engineering Seminar Topics :: Seminar Paper: Optimized Distributed Association Rule Mining Algorithm

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools in find the desired information resources, and to track and analyze their usage patterns.

Association rule mining is an active data mining research area. However, most ARM algorithms cater to a centralized environment. In contrast to previous ARM algorithms, ODAM is a distributed algorithm for geographically distributed data sets that reduces communication costs. Recently, as the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining (D-ARM) algorithms have been developed. These algorithms, however, assume that the databases are either horizontally or vertically distributed. In the special case of databases populated from information extracted from textual data, existing D-ARM algorithms cannot discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two.

Modern organizations are geographically distributed. Typically, each site locally stores its ever increasing amount of day-to-day data. Using centralized data mining to discover useful patterns in such organizations' data isn't always feasible because merging data sets from different sites into a centralized site incurs huge network communication costs. Data from these organizations are not only distributed over various locations but also vertically fragmented, making it difficult if not impossible to combine them in a central location. Distributed data mining has thus emerged as an active subarea of data mining research.

A significant area of data mining research is association rule mining. Unfortunately, most ARM algorithms focus on a sequential or centralized environment where no external communication is required. Distributed ARM algorithms, on the other hand, aim to generate rules from different data sets spread over various geographical sites; hence, they require external communications throughout the entire process. DARM algorithms must reduce communication costs so that generating global association rules costs less than combining the participating sites' data sets into a centralized site. However, most DARM algorithms don't have an efficient message optimization technique, so they exchange numerous messages during the mining process. We have developed a distributed algorithm, called Optimized Distributed Association Mining, for geographically distributed data sets. ODAM generates support counts of candidate itemsets quicker than other DARM algorithms and reduces the size of average transactions, data sets, and message exchanges.

Existing Method

The Data mining Algorithms can be categorized into the following :

Association Algorithm
Classification
Clustering Algorithm

Classification:

The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to specific variable(s) you are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."

Clustering:

Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

DARM discovers rules from various geographically distributed data sets. However, the network connection between those data sets isn't as fast as in a parallel environment, so distributed mining usually aims to minimize communication costs.

Proposed System

Unlike other algorithms, ODAM offers better performance by minimizing candidate itemset generation costs. It achieves this by focusing on two major DARM issues communication and synchronization. Communication is one of the most important DARM objectives. DARM algorithms will perform better if we can reduce communication (for example, message exchange size) costs. Synchronization forces

each participating site to wait a certain period until globally frequent itemset generation completes. Each site will wait longer if computing support counts takes more time. Hence, we reduce the computation time of candidate itemsets' support counts.

To reduce communication costs, we highlight several message optimization techniques. ARM algorithms and on the message exchange method, we can divide the message optimization techniques into two methods direct and indirect support counts exchange. Each method has different aims, expectations, advantages, and disadvantages. For example, the first method exchanges each candidate itemset's support count to generate globally frequent itemsets of that pass (CD and FDM are examples of this approach). All sites share a common globally frequent itemset with identical support counts, so rules that are generated at different participating sites have identical confidence. This approach focuses on a rule's exactness and correctness.

System Requirement

Hardware specifications:

Processor : Intel Processor IV

RAM : 128 MB

Hard disk : 20 GB

CD drive : 40 x Samsung

Floppy drive : 1.44 MB

Monitor : 15’ Samtron color

Keyboard : 108 mercury keyboard

Mouse : Logitech mouse

Software Specification

Operating System – Windows XP/2000

Language used – J2sdk1.4.0, JCreator

Engineering Seminar Topics :: Seminar Paper

Optimized Distributed Association Rule Mining Algorithm - ODAM

No comments:

Post a Comment

Seminar Topics