Reduct Generation in Information Systems

Authors

  • Janusz Starzyk
  • Dale E. Nelson
  • Kirk Sturtz
Abstract

When data sets are analyzed, statistical pattern recognition is often used to find the information hidden in the data. Another approach to information discovery is data mining. Data mining is concerned with finding previously undiscovered relationships in data sets. Rough set theory provides a theoretical basis from which to find these undiscovered relationships. Automatic Target Recognition (ATR) is one area which can benefit from this approach. We present a new theoretical concept, strong equivalence, and an efficient algorithm, the Expansion Algorithm, for generating all reducts of an information system. The process of finding reducts has been proven to be NP-hard. Using the elimination method, problems of size 13 could be solved in reasonable time. Using our Expansion Algorithm, the size of problems that can be solved has grown to 40. Further, by using the strong equivalence property in the Expansion Algorithm, additional savings of up to 50% can be achieved. This paper describes the fundamentals of this algorithm and the simulation results obtained from randomly generated information systems. The full paper provides the mathematical foundations of the algorithm.

1.0 Introduction

In the world today, we are inundated with volumes of data. Businesses have been accumulating vast amounts of data in accounting, inventory and sales records. For decades this has been entered and stored on computers. Business leaders know that there is a wealth of information that could improve business operations if only there were a good way to discover the information contained in the data. The military is interested in building robust automatic target recognition (ATR) systems. To build these systems, there is a set of measured or synthetic data that can be used for training and testing. In the past, statistical pattern recognition has been used to build ATR systems. However, if the training data is viewed as an information system, then the procedures and methods of data mining can be used to find the previously unrecognized relationships in the data that will convert the data to information [3,6,7,10].

An information system can be characterized as a relational database, where the information is stored in a table. Each row in the table represents an individual record. Each column represents an attribute, or field, of the records. The columns could represent a person's weight or height, or some measurement of a target such as its height or length. Several records can be considered together to represent a logical grouping. For instance, there might be several examples of different kinds of airplanes, or several examples of people who successfully paid off their loans. One operation in data mining is the determination of a minimal set of attributes necessary to distinguish between the different groups in the data. The process of determining which records belong to each of the groups is called classification. Each group of attributes that can distinguish between the groups is called a reduct [1,4,7].

In this paper, due to space limitations, we take a naïve approach to explaining the concepts involved. Therefore, mathematical rigor will not be enforced. The full paper, on which this summary is based, is mathematically rigorous.

2.0 Elimination Method

The first step in generating reducts is to make the training set non-ambiguous and to eliminate duplicates. The training set is said to be ambiguous when two signals are identical but belong to two different groups. When this happens, both signals should be removed from the training set. This is analogous to a teacher telling a student that 1+1=2 and 1+1=3: one or both of the examples is wrong, and eliminating them removes the ambiguity. If two or more identical signals represent the same class, all but one of the signals should be eliminated; this reduces computational time. The procedure from this point is to take all possible combinations of the attribute columns. Each combination forms a set which is checked for ambiguity. If there is no ambiguity, then that set of attributes is a reduct. We call this method the elimination method because we are eliminating attributes to check for a reduct.
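
As a rough illustration (ours, not taken from the paper), the elimination method can be sketched in a few lines of Python. Records are assumed to be tuples of attribute values with a parallel list of group labels; the function names and the data representation are our own.

from itertools import combinations

def is_unambiguous(records, classes, attrs):
    # A set of attribute columns is unambiguous if no two records that agree
    # on all of those columns belong to different groups.
    seen = {}
    for rec, cls in zip(records, classes):
        key = tuple(rec[a] for a in attrs)
        if seen.setdefault(key, cls) != cls:
            return False
    return True

def reducts_by_elimination(records, classes):
    # Brute-force search over all combinations of attribute columns, as in
    # section 2.0.  The training set is assumed to have already been made
    # non-ambiguous and duplicate-free.
    n_attrs = len(records[0])
    unambiguous = []
    for size in range(1, n_attrs + 1):
        for attrs in combinations(range(n_attrs), size):
            if is_unambiguous(records, classes, attrs):
                unambiguous.append(set(attrs))
    # Strictly, the reducts are the minimal such sets with respect to inclusion.
    return [s for s in unambiguous if not any(t < s for t in unambiguous)]

With m attribute columns this examines 2^m - 1 subsets, which is consistent with the observation above that the elimination method becomes impractical beyond roughly 13 attributes.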
3.0 Rough Set Theory

Rough set theory was developed by Pawlak [3] for use in reasoning from imprecise data. This theory can also be used to formally develop a method for discovering relationships in data (data mining). Rough set theory is concerned with three basic components: granularity of knowledge, approximation of sets, and data mining [4,5]. One aspect of data mining is the finding of all reducts. Skowron [7] has proven this process to be NP-hard. This means that as the size of the problem increases, the time to compute the reducts increases faster than polynomially. Most real-world problems involve vast amounts of data. Therefore, it behooves us to find as efficient an algorithm as possible to generate these reducts. The algorithm presented is several orders of magnitude better than the elimination method.

The first step in the rough set approach to generating all the reducts is to form a discernibility matrix. The discernibility matrix is an NxN matrix, where N is the number of records in the training set. Each entry in the matrix compares the record represented by its row number with the record represented by its column number. We assign a label to each of the attributes and assume each record represents one group. Into each entry we write the labels of the attributes whose values differ between the two records; these are the attributes that allow us to distinguish (discern) that the records are different. We define an operator ∨ (called disjunction) which allows us to distinguish between these two records by using attribute 1 OR attribute 5 OR ..., etc. It is easily seen that the diagonal is empty (there are no attributes that allow us to distinguish a record from itself). Further, the matrix is symmetric about the diagonal (the attributes that distinguish record 1 from record 5 are the same as those that distinguish record 5 from record 1).

Using the discernibility matrix, it is now possible to form the discernibility function using another operator, ∧ (called conjunction). For simplicity we use the term "or" to represent our ∨ operator and "and" to represent our ∧ operator. We form the discernibility function by or-ing the values within each entry of the discernibility matrix and then and-ing all of these disjunctions together. The discernibility function can often be simplified by the process of absorption. For example, suppose one of the disjuncts in the discernibility function is (a ∨ b), while another disjunct is (a ∨ b ∨ c). Since attribute a or b is required to satisfy the first disjunct, the second disjunct will also be satisfied by either a or b, and attribute c is not required, so we can eliminate the (a ∨ b ∨ c) term. We have determined all reducts when this equation is reduced to a disjunction of conjunctions.
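
To make the construction above concrete, the following sketch (again ours, not the paper's) builds the disjunctive entries of the discernibility matrix and applies absorption. Each record is a tuple of attribute values and, following the assumption in the text, each record represents its own group; every entry of the function is represented as a frozenset of attribute labels (column indices).

def discernibility_function(records):
    # Build the set of disjunctive entries of the discernibility matrix.
    # Each entry is the frozenset of attribute indices on which two distinct
    # records differ; the full function is the conjunction of these entries.
    entries = set()
    n = len(records)
    for i in range(n):
        for j in range(i + 1, n):      # the matrix is symmetric, so j > i suffices
            diff = frozenset(k for k, (x, y) in
                             enumerate(zip(records[i], records[j])) if x != y)
            if diff:                   # the diagonal and identical records give empty entries
                entries.add(diff)
    return absorb(entries)

def absorb(entries):
    # Absorption law: drop every disjunct that is a superset of another
    # disjunct, e.g. (a ∨ b ∨ c) is absorbed by (a ∨ b).
    return {e for e in entries if not any(other < e for other in entries)}

Or-ing the labels inside each frozenset and and-ing the entries together gives exactly the discernibility function described above.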
This final form yields all possible classifiers for the given information system!

We introduce a new concept, strong equivalence, which can be used to achieve significant speed improvements in our algorithm. We say two attributes are locally strongly equivalent if, in every disjunctive entry of the discernibility function, the two attributes are either both present or both absent. When two attributes are locally strongly equivalent, they may be represented by a single attribute.

4.0 Expansion Law

There is a simple way to explain the expansion law. First, find the attribute that occurs most frequently (in at least two disjunctive terms). AND this single attribute with all the disjunctive terms which do NOT contain it. OR the result with the conjunction of all the disjunctive terms of the function, with the selected attribute removed from each disjunctive term in which it appears. The result is a disjunction of two smaller functions; this is the expansion applied in step 3 of the algorithm below.

5.0 Distribution Algorithm

We now introduce our algorithm for efficiently computing all the reducts. The algorithm is as follows. Given: f_A = f_1 ∧ ... ∧ f_k, where f_A is the discernibility function.

Step 1. In each component f_i of the discernibility function, apply the absorption law to eliminate all disjunctive expressions which are supersets of another disjunctive expression; e.g. since (a ∨ b) ⊂ (a ∨ b ∨ c), we have (a ∨ b) ∧ (a ∨ b ∨ c) = (a ∨ b).

Step 2. Replace each strongly equivalent subset of attributes in each component f_i by a single attribute that represents this class. A strongly equivalent subset is identified in a component f_i when the corresponding attributes are simultaneously either present or absent in each of its conjuncts.

Step 3. In each component f_i, select an attribute which occurs in the largest number of the conjoined disjunctive terms, numbering at least two, and apply the expansion law. Write the resulting form as a disjunction f_i = f_i1 ∨ f_i2.

Step 4. Repeat steps 1 through 3 until the expansion law can no longer be applied; f_A is then said to be in the simple form.

Step 5. For each component f_i of the resulting simple form, substitute all locally strongly equivalent classes for their corresponding attributes.

Step 6. Calculate the reducts by expanding the final discernibility function.

Step 7. Determine the minimal elements, with respect to the inclusion relation, of the set of attribute sets obtained in step 6; these minimal sets are the reducts.
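
The following sketch is ours and is deliberately simplified: it omits the strong-equivalence substitution of steps 2 and 5 and treats the whole function as a single component. It expands a discernibility function, given as a set of disjunctive frozensets such as those produced above, into its reducts by branching on the most frequent attribute, which is one direct way of applying the expansion law together with steps 6 and 7.

def minimal_sets(sets_):
    # Keep only the sets that are minimal with respect to inclusion.
    return {s for s in sets_ if not any(t < s for t in sets_)}

def expand(entries):
    # entries: the discernibility function as a set of frozensets (disjuncts).
    # Returns the reducts as a set of frozensets of attribute labels.
    entries = minimal_sets(set(entries))   # step 1: absorption
    if not entries:
        return {frozenset()}               # empty conjunction: nothing left to discern
    if any(not e for e in entries):
        return set()                       # an empty disjunct cannot be satisfied
    counts = {}
    for e in entries:
        for a in e:
            counts[a] = counts.get(a, 0) + 1
    p = max(counts, key=counts.get)        # step 3: the most frequent attribute
    # Branch 1 (p is kept): every disjunct containing p is already satisfied.
    keep_p = {frozenset({p}) | r
              for r in expand({e for e in entries if p not in e})}
    # Branch 2 (p is dropped): remove p from every disjunct.
    drop_p = expand({frozenset(e - {p}) for e in entries})
    # Steps 6 and 7: collect the conjuncts and keep the minimal ones.
    return minimal_sets(keep_p | drop_p)

# Example: f_A = (a ∨ b) ∧ (a ∨ c) ∧ (b ∨ c)
f_A = {frozenset("ab"), frozenset("ac"), frozenset("bc")}
print(expand(f_A))   # the reducts: {a, b}, {a, c}, {b, c}

This direct expansion reproduces the disjunction-of-conjunctions form described in section 3.0; the paper's algorithm gains its additional speed from the absorption and strong-equivalence steps applied between expansions.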

Publication date: 1999