The MiningMart Approach to Knowledge Discovery in Databases

Authors

  • Katharina MORIK
  • Martin SCHOLZ
Abstract

Although preprocessing is one of the key issues in data analysis, it is still common practice to address this task by manually entering SQL statements and using a variety of stand-alone tools. The results are not properly documented and hardly re-usable. The MiningMart system presented in this chapter focuses on setting up and re-using best-practice cases of preprocessing data stored in very large databases. A meta-data model named M4 is used to declaratively define and document both all steps of such a preprocessing chain and all the data involved. For data and applied operators there is an abstract level, understandable by human users, and an executable level, used by the meta-data compiler to run cases for given data sets. An integrated environment allows for rapid development of preprocessing chains. Case adaptation to different environments is supported by merely specifying all involved database entities in the target DBMS. This makes it possible to re-use best-practice cases published on the Internet.

1 Acquiring Knowledge from Existing Databases

The use of very large databases has expanded in recent years from supporting transactions to additionally reporting business trends, and the interest in analyzing the data has increased. One important topic is customer relationship management, with the particular tasks of customer segmentation, customer profitability, customer retention, and customer acquisition (e.g. by direct mailing). Other tasks are the prediction of sales in order to minimize stocks, and the prediction of electricity consumption or telecommunication services at particular times of day in order to minimize the use of external services or to optimize network routing, respectively. The health sector demands several analysis tasks for resource management, quality control, and decision making. Existing databases which were designed for transactions, such as billing and booking, are now considered a mine of information, and digging knowledge from the already gathered data is considered a tool for building up an organizational memory. Managers of an institution want to be informed about states and trends of their business. Hence, they demand concise reports from the database department. On-line Analytical Processing (OLAP) offers interactive data analysis by aggregating data and counting frequencies. This already answers questions like the following: What are the attributes of my most frequent customers? Which are the frequently sold products? How many returns did I receive after my last direct mailing action? What is the average duration of stay in my hospital?

Reports that support managers in decision making need more detailed information. Questions are more specific, for instance: Which customers are most likely to sell their insurance contract back to the insurance company before it ends? How many sales of a certain item do I have to expect in order not to present empty shelves to customers while at the same time minimizing my stock? Which group of customers responds best to direct mailing advertising a particular product? Who are the most cost-intensive patients in my hospital?

Knowledge Discovery in Databases (KDD) can be considered a high-level query language for relational databases that aims at generating sensible reports such that a company may enhance its performance. The high-level question is answered by a data mining step. Several data mining algorithms exist. However, their application is still a cumbersome process.
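To make the contrast concrete: the OLAP-style questions above reduce to a single aggregation query, whereas the manager-level questions require a full KDD process. The following is a minimal sketch of such an aggregation (ours, not from the chapter; the table and column names are hypothetical), using Python's sqlite3 module to stay self-contained:

```python
import sqlite3

# Hypothetical schema SALES(product_id, quantity); names are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product_id TEXT, quantity INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("umbrella", 3), ("boots", 1), ("umbrella", 2), ("hat", 4)])

# "Which are the frequently sold products?" -- answered by aggregating
# and counting, exactly what OLAP offers interactively.
query = ("SELECT product_id, SUM(quantity) AS sold "
         "FROM sales GROUP BY product_id ORDER BY sold DESC")
for product, sold in con.execute(query):
    print(product, sold)   # umbrella 5 / hat 4 / boots 1
```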
Several reasons explain why KDD has not yet become a standard procedure. We list here the three obstacles that – in our view – are the most important ones and then discuss them one after the other.

  • Most tools for data mining need to handle the data internally and cannot access the database directly. Sampling the data and converting them into the desired format increases the effort for data analysis.
  • Preprocessing of the given data is decisive for the success of the data mining step. Aggregation, discretization, data cleaning, the treatment of null values, and the selection of relevant attributes are steps that still have to be programmed (usually in SQL) without any high-level support.
  • The selection of the appropriate algorithm for the data mining step as well as for preprocessing is not yet well understood, but remains the result of a trial-and-error process.

The conversion of given data into the formats of diverse data mining tools is eased by toolboxes which use a common representation language for all the tools. Then the given data need to be transformed only once and can be input into diverse tools. A first approach to such a toolbox was the development of a Common Knowledge Representation Language (CKRL), from which translators to several learning algorithms were implemented in the European project Machine Learning Toolbox [3, 11]. Today, the Weka collection of learning algorithms, implemented in Java with a common input format, offers the opportunity to apply several distinct algorithms to a data set [15]. However, these toolboxes do not scale up to real-world databases naturally¹. In contrast, database management systems offer basic statistical or OLAP procedures on the given data, but do not yet provide users with more sophisticated data mining algorithms. Building upon the database facilities and integrating data mining algorithms into the database environment will be the synergy of both developments. We expect the first obstacle for KDD applications to be overcome very soon.

The second obstacle is the most important one. If we inspect real-world applications of knowledge discovery, we realize that up to 80 percent of the effort is spent on the clever preprocessing of the data. Preprocessing has long been underestimated, both in its relevance and in its complexity. Even if the data conversion problem is solved, the preprocessing is not at all done. Feature generation and selection² (in databases this means constructing additional columns and selecting the relevant attributes for further learning) is a major challenge for KDD [9].

¹ Specialized in multi-relational learning algorithms, the ILP toolbox by Stefan Wrobel (to be published in the network ILPnet2) allows several logic learning programs to be tried on a database.
² Specialized in feature generation and selection, the toolbox YALE offers the opportunity to try and test diverse feature sets for learning with the support vector machine [6]. However, the YALE environment does not access a database.

Machine learning is not restricted to the data mining step, but is also applicable in preprocessing. This view opens up a variety of learning tasks that are not as well investigated as learning classifiers. For instance, an important task is to acquire events and their duration (i.e. a time interval) on the basis of time series (i.e. measurements at time points). Another example is the replacement of null values in the database by the results of a learning algorithm. Given attributes Ai without null values, we may train our algorithm to predict the values of attribute Aj on those records which do have a value for Aj. The learning result can then be applied in order to replace the null values in Aj.
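A minimal sketch of this preprocessing-by-learning idea (ours, not MiningMart code; the column names and the choice of a decision tree are assumptions for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy table with predictor attributes A1, A2 and a target attribute Aj
# that contains null values (names are illustrative).
df = pd.DataFrame({
    "A1": [1, 2, 3, 4, 5, 6],
    "A2": [0, 1, 0, 1, 0, 1],
    "Aj": ["x", "y", None, "y", None, "x"],
})

known = df[df["Aj"].notna()]     # records that do have a value for Aj
missing = df[df["Aj"].isna()]    # records whose null values we replace

# Train on the complete records, then predict the missing values.
model = DecisionTreeClassifier().fit(known[["A1", "A2"]], known["Aj"])
df.loc[df["Aj"].isna(), "Aj"] = model.predict(missing[["A1", "A2"]])
print(df)                        # Aj now contains no null values
```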
Records without null values are a prerequisite for the application of some algorithms; thanks to the learning performed during preprocessing, such algorithms become applicable in the data mining step. With respect to preprocessing, we are just beginning to explore our opportunities. It is a field of great potential.

The third obstacle, the selection of the appropriate algorithm for a data mining task, has long been on the research agenda of machine learning. The main problem is that nobody has yet been able to identify reliable rules predicting when one algorithm should be superior to others. Beginning with the Mlt-Consultant [13] there was the idea of having a knowledge-based system support the selection of a machine learning method for an application. The Mlt-Consultant succeeded in differentiating the nine learning methods of the Machine Learning Toolbox with respect to specific syntactic properties of the input and output languages of the methods. However, there was little success in describing and differentiating the methods on an application level that went beyond the well-known classification of machine learning systems into classification learning, rule learning, and clustering. Also, the European Statlog project [10], which systematically applied classification learning systems to various domains, did not succeed in establishing criteria for the selection of the best classification learning system. It concluded that some systems have generally acceptable performance; in order to select the best system for a certain purpose, they must each be applied to the task and the best one selected through a test method such as cross-validation. Theusinger and Lindner [14] are in the process of re-applying this idea of searching for statistical dataset characteristics necessary for the successful application of tools. An even more demanding approach was started by Engels [4]. This approach attempts not only to support the selection of data mining tools, but to build knowledge-based process planning support for the entire knowledge discovery process. To date this work has not led to a usable system [5]. The European project MetaL now aims at learning how to combine learning algorithms and datasets [2]. Although successful in many respects, it has not yet produced enough knowledge to propose the correct combination of preprocessing operations for a given dataset and task. The IDEA system now tries a bottom-up exploration of the space of preprocessing chains [1]. Ideally, the system would evaluate all possible transformations in parallel and propose the most successful sequence of preprocessing steps to the user. For short sequences and few algorithms, this approach is feasible. Problems like the collection of all data concerning one customer (or patient) from several tables, or the generation of the most suitable features, enlarge the preprocessing sequences considerably. Moreover, considering learning algorithms as preprocessing steps enlarges the set of algorithms per step. For long sequences and many algorithms this principled approach of IDEA becomes computationally infeasible. If the pairing of data and algorithms is all that difficult, can we support an application developer at all?
The difficulty with the principled approaches to algorithm selection is that they all start from scratch. They apply rules that pair data and algorithm characteristics, or plan a sequence of steps, or try and evaluate possible sequences, anew for each application. However, there are similar applications where somebody has already done the cumbersome exploration. Why not use these efforts to ease the development of new applications? Normally, it is much easier to solve a task if we are informed about the solution of a similar task. This is the basic assumption of case-based reasoning, and it is the basis of the MiningMart approach. A successful case of a full KDD process is described at the meta-level. This description at the meta-level can be used as a blueprint for other, similar cases. In this way, the MiningMart project³ eases preprocessing and algorithm selection in order to make KDD an actual high-level query language accessing real-world databases.

³ The MiningMart project is supported by the European Union under the contract IST-11993.

2 The MiningMart Approach

Now that we have stated our goal of easing the KDD process, we may ask: what is MiningMart's path to reaching this goal? A first step is to implement operators that perform data transformations such as discretization, handling null values, aggregation of attributes into a new one, or collecting sequences from time-stamped data. The operators directly access the database and are capable of handling large masses of data. Given database-oriented operators for preprocessing, the second step is to develop and collect successful cases of knowledge discovery. Since most of the time is spent finding chains of operator applications that lead to good answers to complex questions, it is cumbersome to develop such chains over and over again for very similar discovery tasks and data. Currently, in practice, even the same task on data of the same format is implemented anew every time new data are to be analyzed. Therefore, the re-use of successful cases speeds up the process considerably. The particular approach of the MiningMart project is to allow the re-use of cases by means of meta-data, also called ontologies. Meta-data describe the data as well as the operator chains. A compiler generates the SQL code according to the meta-data. Several KDD applications have been considered when developing the operators, the method, and the meta-model. In the remaining part of this chapter, we shall first present the meta-data together with their editors and the compiler. We then describe the case base. We conclude the chapter by summarizing the MiningMart approach and relating it to other approaches.

2.1 The Meta-Model of Meta-Data M4

Ontologies or meta-data have been a key to success in several areas. For our purposes, the advantages of meta-data-driven software generation are:

Abstraction: Meta-data are given at different levels of abstraction, a conceptual (abstract) and a relational (executable) level. This makes an abstract case understandable and re-usable.

Data documentation: All attributes, together with the database tables and views which are input to a preprocessing chain, are explicitly listed in both the conceptual and the relational part of the meta-data. An ontology allows all data to be organized by means of inheritance and relationships between concepts. For all entities involved, there is a text field for documentation. This makes the data much more understandable, e.g. by human domain experts, than just referring to the names of specific database objects. Furthermore, statistics and important features for data mining (e.g., the presence of null values) are accessible as well. This extends the meta-data that are usual in relational databases and gives a good impression of the data sets at hand.
Case documentation: The chain of preprocessing operators is documented as well. First of all, the declarative definition of an executable case in the M4 model can already be considered documentation. Furthermore, apart from the opportunity to use "speaking names" for steps and data objects, there are text fields to document all steps of a case together with their parameter settings. This helps to quickly figure out the relevance of all steps and makes cases reproducible. In contrast, the current state of documentation is most often the memory of the particular scientist who developed the case.

Ease of case adaptation: In order to run a given sequence of operators on a new database, only the relational meta-data and their mapping to the conceptual meta-data have to be written. A sales prediction case can, for instance, be applied to different kinds of shops, and a standard sequence of steps for preparing time series for a specific learner might even be applied as a template in very different mining contexts. The same effect eases the maintenance of cases when the database schema changes over time: the user just needs to update the corresponding links from the conceptual to the relational level. This is especially easy when all abstract M4 entities are documented.

The MiningMart project has developed a model for meta-data together with its compiler, and has implemented human-computer interfaces that allow database managers and case designers to fill in their application-specific meta-data. The system supports preprocessing and can be used stand-alone or in combination with a toolbox for the data mining step.

Figure 1: Overview of the MiningMart system

This section gives an overview of how a case is represented at the meta-level, how it is practically applied to a database, and which steps need to be performed when developing a new case or adapting a given one. The form in which meta-data are to be written is specified in the meta-model of meta-data, M4. It is structured along two dimensions, topic and abstraction. The topic is either the data or the case: the data are the ones to be analyzed, and the case is a sequence of (preprocessing) steps. The abstraction is either conceptual or relational. While the conceptual level is expected to be the same across various applications, the relational level refers to the particular database at hand. The conceptual data model describes concepts like Customer and Product and relationships between them like Buys. The relational data model describes the business data that are analyzed; most often it already exists in the database system in the form of the database schema.

The meta-data written in the form specified by M4 are themselves stored in a relational database. Figure 2 shows a simplified UML diagram of the M4 model. Each case contains steps, each of which embeds an operator and parameters. Apart from values (not shown here), parameters may be concepts, base attributes, or multi-column features, a multi-column feature being a feature containing multiple base attributes. This part is a subset of the conceptual part of M4. The relational part contains columnsets and columns. Columnsets refer either to database tables or to views, which may be database views or virtual (meta-data only) views. Each columnset consists of a set of columns, each of which refers to a database attribute; columns are thus the relational counterpart of base attributes. For columns and base attributes there is a predefined set of data types, which is also omitted in Figure 2.

Figure 2: Simplified UML diagram of the MiningMart Meta Model (M4)
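As a rough illustration of this structure (our sketch, not the actual M4 schema; all class and field names are ours), the entities just named can be mirrored in a few Python dataclasses:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# --- conceptual part: abstract, understandable by human users ---
@dataclass
class BaseAttribute:
    name: str
    data_type: str = "STRING"        # M4 defines a fixed set of data types

@dataclass
class Concept:
    name: str                        # e.g. "Customer" or "Product"
    features: List[BaseAttribute] = field(default_factory=list)

@dataclass
class Step:
    operator: str                    # each step embeds exactly one operator
    parameters: dict                 # operator-specific parameter settings
    input_concept: Concept
    output_concept: Concept

@dataclass
class Case:
    name: str
    steps: List[Step] = field(default_factory=list)

# --- relational part: executable, tied to the database at hand ---
@dataclass
class Column:
    db_attribute: str                # relational counterpart of a BaseAttribute

@dataclass
class ColumnSet:
    table_or_view: str               # database table, database view, ...
    sql_definition: Optional[str] = None  # ... or virtual (meta-data only) view
    columns: List[Column] = field(default_factory=list)
```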
2.2 Editing the Conceptual Data Model

As depicted in Figure 1, different kinds of experts work at different ends of a knowledge discovery process. First of all, a domain expert defines a conceptual data model using a concept editor. The entities involved in data mining are made explicit by this expert. The conceptual model of M4 is about concepts having features, and relationships between these concepts. Examples of concepts are Customer and Product. Although at the current stage of development concepts refer to either database views or tables, they should rather be considered part of a more abstract model of the domain. Concepts consist of features, either base attributes or multi-column features. A base attribute corresponds to a single database attribute, e.g. the name of a customer. A multi-column feature is a feature containing a fixed set of base attributes. This kind of feature should be used when information is split over multiple base attributes. An example is to define a single multi-column feature for the amount and the currency of a bank transfer, which are both represented by base attributes.

Relationships are connections between concepts. There could be a relationship named Buys between the concepts Customer and Product, for example. At the database level, one-to-many relationships are represented by foreign key references, while many-to-many relationships make use of cross tables. These details, however, are hidden from the user at the abstract conceptual level. To organize concepts and relationships, the M4 model offers the opportunity to use inheritance. Modelling the domain in this fashion, the concept Customer could have subconcepts like Private Customer and Business Customer. Subconcepts inherit all features of their superconcept. The relationship Buys could for instance have a subrelationship Purchases on credit. (A toy sketch of this inheritance mechanism follows at the end of this section.)

Figure 3 shows a screenshot of the concept editor while it is used to list and edit base attributes. The right part of the lower window states that the selected concept Sales Data is connected to another concept Holidays by a relationship week has holiday.

Figure 3: The Concept Editor
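The announced sketch of feature inheritance between concepts (a self-contained toy model, ours; MiningMart users would model this in the concept editor rather than in code):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Concept:
    name: str
    features: List[str] = field(default_factory=list)
    superconcept: Optional["Concept"] = None

    def all_features(self) -> List[str]:
        """Subconcepts inherit all features of their superconcept."""
        inherited = self.superconcept.all_features() if self.superconcept else []
        return inherited + self.features

customer = Concept("Customer", ["Customer ID", "Name", "Address"])
private = Concept("Private Customer", ["Date of Birth"], superconcept=customer)

print(private.all_features())
# ['Customer ID', 'Name', 'Address', 'Date of Birth']
```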
2.3 Editing the Relational Model

Given a conceptual data model, a database administrator maps the involved entities to the corresponding database objects. The relational data model of M4 is capable of representing all the relevant properties of a relational database. The simplest mapping from the conceptual to the relational level is given if concepts directly correspond to database tables or views. This can always be achieved manually by inspecting the database and creating a view for each concept. However, more sophisticated ways of graphically selecting features in the database and aggregating them into concepts increase acceptance by end users and ease the adaptation of cases to other environments. In the MiningMart project, the relation editor is intended to support this kind of activity. In general, it should be possible to map all reasonable representations of entities to reasonable conceptual definitions. A simple mapping of the concept Customer, containing the features Customer ID, Name and Address, to the database would be to state that the table CUSTOMER holds all the necessary attributes, e.g. CUSTOM_ID, CUST_NAME and CUST_ADDR. Having the information about name and address distributed over different tables (e.g. sharing the key attribute CUSTOM_ID) is an example of a more complex mapping; in this case the relation editor should be able to use a join operation.

Apart from connecting conceptual entities to database entities, the relation editor offers a data viewer and is capable of displaying statistics of connected views or tables. Figure 4 shows an example of the statistics displayed. For each view or table, the number of tuples and the numbers of nominal, ordinal and time attributes are counted. For numerical attributes, the number of distinct and missing values is displayed, and the minimum, maximum, average, median and modal value are calculated together with the standard deviation and variance. For ordinal and time attributes, the most reasonable subset of this information is given. Finally, there is information on the distribution of the values of all attributes.

Figure 4: Statistics of a database view

2.4 The Case and Its Compiler

All the information about the conceptual descriptions and the corresponding database objects involved is represented within the M4 model and stored within relational tables. M4 cases denote a collection of steps, basically performed sequentially, each of which changes or augments one or more concepts. Each step is related to exactly one M4 operator and holds all of its input arguments. The M4 compiler reads the specification of a step and executes the corresponding operator, passing all the necessary inputs to it. This process requires the compiler to translate the conceptual entities, like the input concepts of a step, into the corresponding relational entities, like the name of a database table, the name of a view, or the SQL definition of a virtual view, which is defined only as relational meta-data in the M4 model. Two kinds of operators are distinguished: manual and machine learning operators. Manual operators just read the M4 meta-data of their input and add an SQL definition to the meta-data for their output, establishing a virtual table. Currently, the MiningMart system offers 20 manual operators for selecting rows, selecting columns, handling time data, and generating new columns for the purposes of, e.g., handling null values, discretization, moving windows over time series, or gathering information concerning an individual (e.g., customer, patient, shop).
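To make the notion of a manual operator concrete, here is a minimal sketch (ours; the function and all names are hypothetical, and the real compiler works on M4 tables rather than strings) of how such an operator can establish a virtual table by emitting an SQL view definition instead of copying any data:

```python
def delete_rows_with_missing_values(input_view: str, output_view: str,
                                    columns: list) -> str:
    """Manual operator: emit a view definition instead of materializing data.

    The operator only adds meta-data; the returned SQL establishes a
    virtual table in the database.
    """
    condition = " AND ".join(f"{col} IS NOT NULL" for col in columns)
    return (f"CREATE VIEW {output_view} AS "
            f"SELECT * FROM {input_view} WHERE {condition}")

# Hypothetical names; in MiningMart they would come from the M4 meta-data.
print(delete_rows_with_missing_values("SALES_DATA", "SALES_DATA_STEP1",
                                      ["AMOUNT", "CURRENCY"]))
# CREATE VIEW SALES_DATA_STEP1 AS SELECT * FROM SALES_DATA
#   WHERE AMOUNT IS NOT NULL AND CURRENCY IS NOT NULL
```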
External machine learning operators, on the other hand, are invoked using a wrapper approach. Currently, the MiningMart system offers learning of decision trees, k-means, and the support vector machine as learning preprocessing operators⁴. The necessary business data are read from the relational database tables, converted to the required format and passed to the algorithm. After execution, the result is read by the wrapper, parsed, and either stored as an SQL function or materialized as additional business data. In any case, the M4 meta-data have to be updated by the compiler. A complex machine learning tool for replacing missing values is an example of an operator altering the database. In contrast, for operators like a join it is sufficient to virtually add the resulting view, together with its corresponding SQL statement, to the meta-data.

⁴ Of course, the algorithms may also be used in the classical way, as data mining step operators.

Figure 5: An illustration of the coupling of the abstract conceptual and executable level

Figure 5 illustrates how the abstract and the executable or relational level interact. At first, just the upper sequence is given: an input concept, a step, and an output concept. The concept definitions contain features; the step contains an operator together with its parameter settings. Apart from operator-specific parameters, the input and output concept are parameters of the step as well. The compiler needs the inputs, e.g. the input concept and its features, to be mapped to relational objects before execution. The mapping may either be defined manually, using the relation editor, or it may be the result of executing the preceding step. If there is a corresponding relational database object for each input, then the compiler executes the embedded operator. In the example this is a simple operator named "DeleteRowsWithMissingValues". The corresponding executable part of this operator generates a view definition in the database and in the relational meta-data of M4. The latter is connected to the conceptual level, so that afterwards there is a mapping from the output concept to a view definition. The generated views may be used as inputs to subsequent steps, or they may be used by other tools for the data mining step.

Following the project's overall idea of declarative knowledge representation, known pre-conditions and assertions of operators are formalized in the M4 schema. Conditions are checked at runtime, before an operator is applied. Assertions help to decrease the number of necessary database accesses, because necessary properties of the data can be derived from formalized knowledge, saving expensive database scans. A step replacing missing values might be skipped, for instance, if the preceding operator is known not to produce any missing values. If a user applies linear scaling to an attribute, then all values are known to lie in a specific interval; if the succeeding operator requires all values to be positive, then this pre-condition can be derived from the formalized knowledge about the linear scaling operator, rather than recalculating this property by another database scan.
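The following toy sketch (ours; the dictionary-based meta-data and both functions are hypothetical) illustrates how such formalized assertions let the compiler verify a pre-condition without a database scan:

```python
def linear_scaling(meta, new_min=1.0, new_max=2.0):
    """Assertion of the operator: after linear scaling, all values lie in
    [new_min, new_max] by construction -- no table scan needed to know this."""
    return {**meta, "min": new_min, "max": new_max}

def check_all_positive(meta):
    """Pre-condition of a succeeding operator that needs positive values."""
    if meta["min"] is not None:
        return meta["min"] > 0        # derivable from formalized knowledge
    return None                       # unknown: only now a scan would be needed

# Column meta-data before scaling: value range unknown.
column_meta = {"column": "TEMPERATURE", "min": None, "max": None}
column_meta = linear_scaling(column_meta, 1.0, 2.0)
print(check_all_positive(column_meta))   # True, without any database access
```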
The task of a case designer, ideally a data mining expert, is to find sequences of steps resulting in a representation well suited to the given data mining task. This work is supported by a special tool, the case editor. Figure 6 shows a screenshot of a rather small example case edited with this tool. Typically, a preprocessing chain consists of many different steps, usually organized as a directed acyclic graph rather than as a linear sequence like the example case shown in Figure 6.

Figure 6: A small example case in the case editor

To support the case designer, a list of available operators and their overall categories (e.g. feature construction, clustering, or sampling) is part of the conceptual case model M4. The idea is to offer a fixed set of powerful preprocessing operators, in order to provide a comfortable way of setting up cases on the one hand and to ensure the re-usability of cases on the other. By modeling real-world cases in the scope of the project, further useful operators will be identified, implemented and added to the repository. For each step, the case designer chooses an applicable operator from the collection, sets all of its parameters, assigns the input concepts, input attributes and/or input relations, and specifies the output. To ease the process of editing cases, applicability constraints on the basis of meta-data are provided as formalized knowledge and are automatically checked by the human-computer interface. This way, only valid sequences of steps can be produced by a case designer. Furthermore, the case editor supports the user by automatically defining the output concepts of steps according to the meta-data constraints, and by offering property windows tailored to the demands of the chosen operators.

A sequence of many steps, namely a case in M4 terminology, transforms the original database into another representation. Each step and the ordering of steps are formalized within M4, so the system automatically keeps track of the performed activities. This enables the user to interactively edit and replay a case or parts of it. As soon as an efficient chain of preprocessing has been found, it can easily be exported and added to an Internet repository of best-practice MiningMart cases. Only the conceptual meta-data are submitted, so even if a case handles sensitive information, as is the case for most medical or business applications, it is still possible to distribute the valuable meta-data for re-use while hiding all the sensitive data and even the local database schema.
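A minimal sketch of this privacy-preserving export (ours; the field names are hypothetical): only the conceptual half of the meta-data leaves the site, while everything that points into the local database is dropped:

```python
def export_case(m4_case: dict) -> dict:
    """Keep the abstract, re-usable part of a case; drop the relational part.

    What is published: the operator chain, parameter settings, and abstract
    concepts. What stays local: columnsets, columns, and SQL definitions --
    i.e. the business data and even the local database schema remain hidden.
    """
    return {
        "name": m4_case["name"],
        "steps": m4_case["steps"],        # operators and their parameters
        "concepts": m4_case["concepts"],  # conceptual data model only
    }
```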
