Phase Diagram for Variable Selection and Non-optimal Regions for L0 and L1 Penalization Methods
Abstract
Consider a linear model Y = Xβ + z, z ∼ N(0, I_n), where the rows of X are iid samples from N(0, Σ), with Σ being a p by p matrix. It is believed that only a small fraction of the coordinates of β are nonzero, and we are interested in identifying such coordinates. We adopt an asymptotic framework where both p and n are large. In certain ranges, we find that the above problem reduces to a normal means problem: Ỹ = Σ⁻¹β + z̃, z̃ ∼ N(0, Σ⁻¹), which is relatively easier to analyze. We introduce the notion of phase space, the two-dimensional domain calibrated by the number of nonzero coordinates of β and their magnitude. With careful calibrations, we identify three regions in the phase space: the Region of Exact Recovery, the Region of Almost Full Recovery, and the Region of No Recovery. In the first region, exact recovery of all signals (i.e., nonzero coordinates of β) is possible. In the second region, exact recovery is impossible, but it is possible to recover most of the signals. In the last region, it is impossible to identify any significant portion of the signals. The L1-penalization methods are well-known approaches to variable selection. Surprisingly, we find that the regions where such methods achieve the optimal rate of convergence are substantially smaller than those of the optimal procedures. The phenomenon persists even when we replace the L1-penalization by the L0-penalization (the latter may yield more efficient approaches to signal recovery, but is also computationally more difficult). We explain why such approaches yield non-optimal rates. We also introduce a new approach whose partition of the phase space coincides with that of the optimal procedures.
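The setup in the abstract can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' procedure: the tridiagonal Σ, the signal fraction `eps`, the common signal magnitude `tau`, and the least-squares-plus-threshold selector are all illustrative choices that do not come from the paper. It draws data from Y = Xβ + z and counts selection errors as the Hamming distance between the estimated and true supports.

```python
import numpy as np


def simulate_sparse_linear_model(n, p, Sigma, eps, tau, seed=0):
    """Draw (X, beta, Y) from Y = X beta + z with z ~ N(0, I_n).

    Rows of X are iid N(0, Sigma). Each coordinate of beta is nonzero
    with probability eps and, if nonzero, equals tau. Both eps and tau
    are hypothetical parameters chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    support = rng.random(p) < eps
    beta = np.where(support, tau, 0.0)
    z = rng.standard_normal(n)
    Y = X @ beta + z
    return X, beta, Y


def hamming_error(beta_hat, beta):
    """Number of coordinates where the estimated and true supports differ."""
    return int(np.sum((beta_hat != 0) != (beta != 0)))


n, p, tau = 200, 50, 2.0
# Assumed covariance: tridiagonal with unit diagonal (positive definite).
Sigma = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
X, beta, Y = simulate_sparse_linear_model(n, p, Sigma, eps=0.1, tau=tau)

# Naive selector for illustration only: least squares, then hard threshold.
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
beta_hat = np.where(np.abs(beta_ls) > tau / 2, beta_ls, 0.0)

print("true signals:", np.count_nonzero(beta))
print("Hamming selection error:", hamming_error(beta_hat, beta))
```

Varying `eps` and `tau` moves the simulation around the phase space described above; a Hamming error of zero corresponds to exact recovery, and an error small relative to the number of true signals corresponds to almost full recovery.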
Similar resources
Flood Flow Frequency Model Selection Using L-moment Method in Arid and Semi Arid Regions of Iran
Statistical frequency analysis is the most common procedure for the analysis of flood data at a gauged location; as a first step, a model must be selected to represent the population. Among them, the central moment has been the most common and widely used, and with the use of computers, the application of maximum likelihood has increased. This research was carried out in order to reco...
Evaluating of the efficiency of AMMI and BLUP models and their integration for identifying high-yielding durum wheat (Triticum turgidum L. var. durum) genotypes adapted to warm rainfed regions of Iran
The aim of this study was to evaluate the efficiency of yield stability analysis models and to assess the genotype × environment interaction effect on grain yield of 20 durum wheat genotypes, in order to identify high-yielding and adapted genotypes by BLUP and AMMI models using experimental data from four cropping cycles (2009-2013) at five field stations in warm rainfed regions of Iran. The results of Like...
Interrelationships among grain yield and related characters of four oilseed rape (Brassica napus L.) cultivars under drought stress conditions
Four rapeseed cultivars (Hayola 401, Hayola 308, RGS and Option) were evaluated for some physiological traits under stress (50% field capacity (FC)) and non-stress (irrigated) conditions. The factorial set of treatments was arranged within a randomized complete block design with three replications. The collected data were analyzed using path and factor analyses. The results showed that...
Evaluation of selection indices for drought tolerance of corn (Zea mays L.) hybrids
Drought is one of the major problems affecting crop production, including corn, in many parts of Iran. In order to detect drought-tolerant grain corn hybrids, an experiment with twenty corn hybrids was conducted during 2006 in Qom Province, Iran, using a randomized complete block design with four replications, under optimal moisture and drought stress conditions. Results showed diversity among ...
Optimality of graphlet screening in high dimensional variable selection
Consider a linear model Y = Xβ + σz, where X has n rows and p columns and z ∼ N(0, I_n). We assume both p and n are large, including the case of p ≫ n. The unknown signal vector β is assumed to be sparse in the sense that only a small fraction of its components is nonzero. The goal is to identify such nonzero coordinates (i.e., variable selection). We are primarily interested in the regime where s...
Publication date: 2010