Exploiting Geometric Structure of High Dimensional Data for Learning: An Empirical Study

Authors

  • Xueyuan Zhou
  • Partha Niyogi
  • Pedro Felzenszwalb
  • Nathan Srebro
Abstract

High-dimensional data is generally expected to have many degrees of freedom. However, recent experiments in machine learning show that real-world high-dimensional data is usually governed by a surprisingly small number of dimensions. We believe that in high dimensions, geometric information, such as the “shape” of the data distribution, can help learning algorithms perform better. A geometric transform of high-dimensional data that is targeted at learning is therefore attractive for high-dimensional machine learning problems. In this paper, we give an empirical evaluation of how the geometric information of high-dimensional data can improve learning. We consider two general geometric transforms, Laplacian Eigenmaps and Diffusion Maps, and discuss distances in the spaces they produce. We compare classification results on data in the original space with results on the new representation obtained after a geometric transform, across several application areas: images, text, acoustic signals, microarray data, and artificial data sets. The results show that learning algorithms can take advantage of geometric information for most real-world high-dimensional data. When labeled examples are extremely few, the geometric transforms yield large improvements in learning. We also find cases on artificial data sets where these geometric transforms fail, and we discuss general conditions under which the transforms lead to better classification.
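The two transforms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the graph construction or any parameters, so the symmetric k-nearest-neighbor graph with Gaussian weights, the function names `knn_affinity` and `spectral_embeddings`, and the parameters `k`, `sigma`, `dim`, and `t` are all assumptions made for illustration. The diffusion map here omits the kernel density normalization that some formulations apply.

```python
import numpy as np

def knn_affinity(X, k=10, sigma=None):
    """Symmetric k-NN graph with Gaussian weights (an assumed construction)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped at zero for round-off.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, 0.0)
    # Column 0 of the sorted indices is the point itself; take the next k.
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]
    if sigma is None:
        # Heuristic bandwidth: root of the median squared neighbor distance.
        sigma = np.sqrt(np.median(D2[np.arange(n)[:, None], idx]))
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-D2[rows, cols] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)  # keep an edge if either endpoint selected it

def spectral_embeddings(W, dim=2, t=1):
    """Return (Laplacian-eigenmap coords, diffusion-map coords, eigenvalues)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # S = D^{-1/2} W D^{-1/2} is symmetric and shares eigenvalues with D^{-1} W.
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    mu, U = np.linalg.eigh(S)
    order = np.argsort(mu)[::-1]           # largest eigenvalue first
    mu, U = mu[order], U[:, order]
    V = U * d_inv_sqrt[:, None]            # eigenvectors of P = D^{-1} W
    # Skip the first eigenvector (constant if the graph is connected).
    lap = V[:, 1:dim + 1]                  # solves (D - W) v = (1 - mu) D v
    diff = lap * mu[1:dim + 1] ** t        # diffusion map after t steps
    return lap, diff, mu[1:dim + 1]

# Toy data: a circle (one intrinsic dimension) mapped linearly into 50 dimensions.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
X = circle @ rng.normal(size=(2, 50)) + 0.01 * rng.normal(size=(200, 50))

W = knn_affinity(X, k=10)
lap, diff, mu = spectral_embeddings(W, dim=3, t=2)
```

Under these assumptions, a classifier can be trained on the rows of `lap` or `diff` instead of the rows of `X`, which mirrors the comparison the abstract describes. Whether that helps depends on the data, as the abstract itself notes.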





Publication date: 2008