Randomization of real-valued matrices for assessing the significance of data mining results

Authors

  • Markus Ojala
  • Niko Vuokko
  • Aleksi Kallio
  • Niina Haiminen
  • Heikki Mannila
Abstract

Randomization is an important technique for assessing the significance of data mining results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that share row and column means and variances. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column means and variances, or due to some more interesting phenomena in the data. In this paper, we study the problem of generating such randomized datasets. We describe three alternative algorithms based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. The results indicate that the methods work efficiently and solve the defined problem.
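The workflow in the abstract (sample randomized matrices that roughly keep row and column means and variances, then compare the measure of interest on the original data against the samples) can be illustrated with a minimal sketch. This is not one of the three algorithms from the paper, whose details are not given here. It assumes a cell-swap local transformation, a squared-error energy on row and column statistics, and a Metropolis acceptance rule. All function names and the `steps` and `weight` parameters are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def row_col_stats(matrix):
    """Return the means, then the variances, of every row and column."""
    lines = [list(r) for r in matrix] + [list(c) for c in zip(*matrix)]
    return [fmean(v) for v in lines] + [pvariance(v) for v in lines]


def energy(matrix, target_stats):
    """Squared distance between the matrix's row/column statistics and the targets."""
    return sum((a - b) ** 2 for a, b in zip(row_col_stats(matrix), target_stats))


def metropolis_randomize(matrix, steps=2000, weight=50.0, seed=None):
    """Shuffle cells while staying close to the original row/column statistics.

    Assumed scheme (illustrative only): propose swapping two random cells and
    accept with probability min(1, exp(-weight * increase_in_energy)).
    """
    rng = random.Random(seed)
    target = row_col_stats(matrix)
    current = [list(row) for row in matrix]  # work on a copy
    n_rows, n_cols = len(current), len(current[0])
    e_cur = energy(current, target)
    for _ in range(steps):
        i1, j1 = rng.randrange(n_rows), rng.randrange(n_cols)
        i2, j2 = rng.randrange(n_rows), rng.randrange(n_cols)
        # Local transformation: swap the values of the two chosen cells.
        current[i1][j1], current[i2][j2] = current[i2][j2], current[i1][j1]
        e_new = energy(current, target)
        if e_new <= e_cur or rng.random() < math.exp(-weight * (e_new - e_cur)):
            e_cur = e_new  # accept the proposal
        else:
            # Reject: undo the swap.
            current[i1][j1], current[i2][j2] = current[i2][j2], current[i1][j1]
    return current


def empirical_p_value(original_stat, sample_stats):
    """Fraction of randomized samples whose statistic is at least the original's.

    The +1 terms count the original data as one of the samples.
    """
    at_least = sum(1 for s in sample_stats if s >= original_stat)
    return (at_least + 1) / (len(sample_stats) + 1)
```

To use the sketch, one would compute a statistic on the original matrix, call `metropolis_randomize` repeatedly with different seeds, compute the same statistic on each sample, and pass both to `empirical_p_value`. A small p-value suggests the result is not explained by the row and column means and variances alone. The energy is recomputed from scratch at every step for clarity, which is only practical for small matrices.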

Related resources

Randomization methods for assessing data analysis results on real-valued matrices

Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e...

Randomization algorithms for assessing the significance of data mining results

Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi. Author: Markus Ojala. Name of the doctoral dissertation: Randomization Algorithms for Assessing the Significance of Data Mining Results. Publisher: School of Science. Unit: Department of Information and Computer Science. Series: Aalto University publication series DOCTORAL DISSERTATIONS 99/2011. Field of research: Computer and Information Scien...

A Novel Error-Tolerant Frequent Itemset Model for Binary and Real-Valued Data

Frequent pattern mining has been successfully applied to a broad range of applications; however, it has two major drawbacks, which limit its applicability to several domains. First, as the traditional ‘exact’ model of frequent pattern mining uses a strict definition of support, it limits the recovery of frequent itemset patterns in real-life data sets where the patterns may be fragmented due t...

(T) FUZZY INTEGRAL OF MULTI-DIMENSIONAL FUNCTION WITH RESPECT TO MULTI-VALUED MEASURE

Introducing more types of integrals will provide more choices to deal with various types of objectives and components in real problems. Firstly, in this paper, a (T) fuzzy integral, in which the integrand, the measure and the integration result are all multi-valued, is presented with the introduction of T-norm and T-conorm. Then, some classical results of the integral are obtained based on the prope...

Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data

In exploratory data mining it is important to assess the significance of results. Given that analysts have only limited time, it is important that we can measure this with regard to what we already know. That is, we want to be able to measure whether a result is interesting from a subjective point of view. With this as our goal, we formalise how to probabilistically model real-valued data by th...

Publication date: 2008