Carcinogenicity Prediction of Aromatic Compounds Using Self-organising Data Mining
Authors
Abstract
Self-organising data mining is a new approach that supports the workflow of knowledge discovery more comprehensively and aims to increase both the reliability and the predictive and descriptive power of models generated for ill-defined systems such as ecotoxicological systems. This paper reports results from applying self-organising data mining to describe and predict the carcinogenicity of aromatic compounds using molecular descriptors.

1. Self-organising Data Mining

Knowledge Discovery from Data [1] is of increasing importance for modelling, describing, predicting, and understanding the complex behaviour of real-world systems in many domains. For knowledge discovery to work objectively, user involvement in the knowledge extraction process must be limited to contributing well-known a priori or domain knowledge. Soft computing, i.e., Fuzzy Modelling, Neural Networks, Genetic Algorithms, Inductive Modelling, and other methods of more or less automatic model generation, is widely used within knowledge discovery. However, its application has focused almost exclusively on the data mining step. A new approach that reflects the workflow character of knowledge discovery more comprehensively (data pre-processing, dimension reduction, data mining, model evaluation, and combining models) is called self-organising data mining [2] [3]. The final evaluation of an obtained model is of special importance for its further application. From data analysis alone, it is impossible to decide whether the estimated model adequately reflects the causal relationship between input and output or whether it is just a stochastic model of noncausal correlations. An automated procedure used in self-organising data mining that may help to evaluate the usefulness of a created model is described in [2] [4].

2. Carcinogenicity Prediction of Aromatic Compounds

Man is exposed to many chemicals of natural and synthetic origin.
An urgent question concerns their potential negative effects on human health. Rodent bioassays are the principal methods used today to identify chemicals that induce toxicity and to limit the incidence of human cancers and other diseases. However, this approach is problematic on several accounts: (1) the cost of the assay (more than 1 million U.S. dollars per chemical); (2) the time needed for the tests (3-5 years); (3) ethical considerations and public pressure to reduce or eliminate the use of animals in research and testing; (4) difficulties in extrapolating the results to man [5]. Additionally, ecotoxicological systems such as the effect of aromatic compounds on the onset of cancer or other tumours are complex, ill-defined systems. These systems are characterised by (1) inadequate a priori information about the system; (2) a large number of potential, often unmeasurable variables; (3) few and noisy data samples; and (4) fuzzy objects [2]. The economic, ethical, and methodological problems that result from applying theory-driven methods or even dedicated expert systems [5] suggest using a data-driven approach as outlined in section 1. In a first test, we used the initial data set of 104 aromatic compounds and 34 molecular descriptors as listed in [5] and the KnowledgeMiner software [6] for modelling and knowledge extraction. Analog Complexing, GMDH Neural Networks, and Fuzzy Rule Induction [2] were applied to generate a set of single models as well as a combined solution. A summarised report of this test can be found in [7]. Based on the results of this test, a second modelling run used a revised data set of only 92 aromatic compounds. Here, experts removed some compounds whose experimental carcinogenicity turned out not to be available.
Since the models of the first modelling run also consistently reported the largest errors for 7 of the 12 removed compounds (marked as O2, O50, O55, O68, O83, O85, and O94 in [7]), one goal of the second run was to test whether the accuracy and reliability of the models would rise and to predict carcinogenicity values for the 12 removed compounds. This time, we applied a prototype of KnowledgeMiner that implements a new algorithm for creating a nucleus by means of a multilevel self-organised state space dimension reduction and an automated second-level validation procedure for linear GMDH models [2] [4]. The basic idea is that model evaluation has so far been based on the assumption that data mining algorithms are ideal noise filters. However, even an automated leave-one-out cross-validation, for example, applied to verify every model candidate generated during modelling (the first level of validation, which avoids overfitting), as implemented in KnowledgeMiner, does not provide ideal noise filtering. Therefore, the evaluation of a final model needs additional justification using the specific noise filtering characteristic of the algorithm it was created with. So far, we have identified this characteristic empirically, by means of Monte Carlo simulation, only for KnowledgeMiner's linear GMDH models. In this test, we therefore limited modelling to the generation of linear regression models. To allow the models to express nonlinear relations nevertheless, we added several synthesised variables to the initial 34 descriptors x_i: [equation missing] which, combined in different ways, form 5 distinct data sets for modelling. First, to show the impact of the reduced data set on model accuracy, we used a corresponding set of input variables {x, u} and the same algorithm as reported for the linear GMDH model in [7].
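The leave-one-out cross-validation mentioned above as the first level of validation can be illustrated with a minimal sketch. It assumes an ordinary least-squares model as a stand-in for a linear model candidate; it is not KnowledgeMiner's GMDH implementation, and all function names and the synthetic data (92 samples, 5 made-up descriptors) are hypothetical and unrelated to the data set in [5].

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (A^T A) w = A^T y."""
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(ata, aty)


def loo_press(X, y, cols):
    """Sum of squared leave-one-out prediction errors for the chosen columns."""
    rows = [[1.0] + [x[c] for c in cols] for x in X]  # prepend intercept term
    press = 0.0
    for i in range(len(rows)):
        # Fit on all samples except sample i, then predict sample i.
        coef = fit_ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        pred = sum(w * v for w, v in zip(coef, rows[i]))
        press += (y[i] - pred) ** 2
    return press


# Synthetic illustration only: the target depends on descriptors 0 and 2.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(92)]
y = [1.5 * x[0] - 2.0 * x[2] + rng.gauss(0.0, 0.1) for x in X]

candidates = [(0,), (0, 2), (1, 3), (0, 1, 2, 3, 4)]
scores = {c: loo_press(X, y, c) for c in candidates}
best = min(scores, key=scores.get)
```

Each candidate is scored only on samples it was not fitted to, which penalises overfitted candidates. As the text notes, such a criterion is still not an ideal noise filter, which is why a second level of validation is applied to the final model.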
Indeed, a decreased approximation error variance (AEV) is obtained for this model, M0 (table 1), which is described by the equation: [equation missing]. However, the model evaluation shows that this model describes only 28% of the target data (descriptive power) and that it was built on a poor, i.e., high, variables-per-samples ratio (65/92) relative to the data mining algorithm used. According to [4], this implies an increased risk of also modelling noncausal correlations, which is underlined by the fact that only 5 descriptors are included in both linear models. So, model reliability needs improvement. There are two options: either adding samples (actually measured samples or samples obtained by jittering, for example) or using fewer input variables. The latter can be realised by first finding a nucleus. Two aspects are important here: (1) u i = 1
Similar resources
Characteristic Substructures and Properties in Chemical Carcinogens Studied by the Cascade Model
MOTIVATION Chemical carcinogenicity is an important subject in health and environmental sciences, and a reliable method is expected to identify characteristic factors for carcinogenicity. The predictive toxicology challenge (PTC) 2000-2001 has provided the opportunity for various data mining methods to evaluate their performance. The cascade model, a data mining method developed by the author, ...
Warmr: a data mining tool for chemical data
Data mining techniques are becoming increasingly important in chemistry as databases become too large to examine manually. Data mining methods from the field of Inductive Logic Programming (ILP) have potential advantages for structural chemical data. In this paper we present Warmr, the first ILP data mining algorithm to be applied to chemoinformatic data. We illustrate the value of Warmr by app...
Structure-activity relationship studies of chemical mutagens and carcinogens: mechanistic investigations and prediction approaches.
2.3.3. Carcinogenicity of N-Nitrosamines; 2.4. Quinolines; 2.5. Triazenes; 2.6. Polycyclic Aromatic Hydrocarbons; 2.6.1. QSARs for the Carcinogenicity of PAHs; 2.6.2. QSARs for the Genotoxicity of PAHs; 2.7. Halogenated Aliphatics; 2.8. Direct-Acting Compounds; 2.8.1. Mutagenicity of Platinum Amines; 2.8.2. Mutagenicity of Lactones; 2.8.3. Mutagenicity of...
Carcinogenicity of the aromatic amines: from structure-activity relationships to mechanisms of action and risk assessment.
Aromatic amines represent one of the most important classes of industrial and environmental chemicals: many of them have been reported to be powerful carcinogens and mutagens, and/or hemotoxicants. Their toxicity has been studied also with quantitative structure-activity relationship (QSAR) methods: these studies are potentially suitable for investigating mechanisms of action and for estimating...
Running Title: lazar Carcinogenicity Predictions Lazy Structure-Activity Relationships (lazar) for the Prediction of Rodent Carcinogenicity and Salmonella Mutagenicity
lazar is a new tool for the prediction of toxic properties of chemical structures. It derives predictions for query structures from a database with experimentally determined toxicity data. lazar generates predictions by searching the database for compounds that are similar with respect to a given toxic activity and calculating the prediction from their activities. Apart from the prediction, laz...
Journal title:
Volume, issue:
Pages: -
Publication date: 2002