Supervised dimensionality reduction for big data

Authors

Abstract

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
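The abstract describes the simplest variant only in words: principal components analysis extended with class-conditional means. The sketch below is one plausible reading of that description, not the authors' implementation. It assumes the projection is built from the differences between class means together with the leading principal directions of the data after each sample is centered by its own class mean. The function name `lol_sketch`, its arguments, and the final orthonormalization step are illustrative assumptions, and the dense SVD used here would not scale to the feature counts mentioned in the abstract.

```python
import numpy as np


def lol_sketch(X, y, d):
    """Hypothetical helper: build a (p, d) supervised linear projection.

    Assumption (not confirmed by the abstract): the projection stacks the
    class-mean differences with the top principal directions of the
    class-centered data, then orthonormalizes the result.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, idx = np.unique(y, return_inverse=True)

    # Class-conditional means, one row per class: shape (K, p).
    means = np.stack([X[idx == k].mean(axis=0) for k in range(len(classes))])

    # Mean-difference directions relative to the first class: shape (K-1, p).
    deltas = means[1:] - means[0]

    n_pcs = d - deltas.shape[0]
    if n_pcs < 0:
        raise ValueError("d must be at least (number of classes - 1)")

    # Center every sample by the mean of its own class.
    Xc = X - means[idx]

    # Rows of Vt are the principal directions of the class-centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Stack mean differences and principal directions as columns: (p, d).
    A = np.vstack([deltas, Vt[:n_pcs]]).T

    # Orthonormalize the columns (reduced QR keeps the same column span).
    Q, _ = np.linalg.qr(A)
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 40, 200                      # far fewer samples than features
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, p))
    X[y == 1, 0] += 3.0                 # classes differ along one feature
    Q = lol_sketch(X, y, d=3)
    Z = X @ Q                           # low-dimensional representation
    print(Z.shape)
```

The projected data `Z` can then be passed to any standard classifier. Because the mean-difference direction is included explicitly, the subspace retains it even when it carries little variance, which plain PCA would not guarantee.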


Similar resources

Dimensionality reduction for supervised learning

Outline: Motivation (supervised learning, high dimensionality); Dimensionality reduction (principal component analysis, random projections); Experimental setup (algorithms and datasets, procedure); Results; Discussion; References. Motivat...


Semi-Supervised Dimensionality Reduction

Dimensionality reduction is among the keys to mining high-dimensional data. This paper studies semi-supervised dimensionality reduction. In this setting, besides abundant unlabeled examples, domain knowledge is available in the form of pairwise constraints, which specify whether a pair of instances belongs to the same class (must-link constraints) or to different classes (cannot-link constraints)...


CS 229r: Algorithms for Big Data. 2 Dimensionality Reduction; 2.2 Limitations of Dimensionality Reduction

In the last lecture we proved several space lower bounds for streaming algorithms using the communication complexity model and some ideas from information theory. In this lecture we will move on to the next topic: dimensionality reduction. Dimensionality reduction is useful when solving high-dimensional computational geometry problems, such as clustering, nearest neighbors search, numeric...


A scalable supervised algorithm for dimensionality reduction on streaming data

Algorithms for streaming data have attracted increasing attention over the past decade. Among them, dimensionality reduction algorithms are of particular interest because of their usefulness in real tasks. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two of the most widely used dimensionality reduction approaches. However, PCA is not optimal for general classification pro...


Generalization Bounds for Supervised Dimensionality Reduction

We introduce and study the learning scenario of supervised dimensionality reduction, which couples dimensionality reduction and a subsequent supervised learning step. We present new generalization bounds for this scenario based on a careful analysis of the empirical Rademacher complexity of the relevant hypothesis set. In particular, we show an upper bound on the Rademacher complexity that is i...



Journal

Journal title: Nature Communications

Year: 2021

ISSN: 2041-1723

DOI: https://doi.org/10.1038/s41467-021-23102-2