Nonlinear ICA through low-complexity autoencoders

Authors

  • Sepp Hochreiter
  • Jürgen Schmidhuber
Abstract

We train autoencoders by Flat Minimum Search (FMS), a regularization algorithm for finding low-complexity networks describable by few bits of information. As a by-product, this encourages nonlinear independent component analysis (ICA) and sparse codes of the input data. Flat minima are regions in weight space where (a) the error is tolerably small and (b) the weights can be perturbed without greatly affecting the network's output. Hence the weights may be given with low precision: few bits of information are required to describe the corresponding "simple" or low-complexity network. Low network complexity is generally associated with high generalization performance.

To simplify the algorithm for finding flat minima, we do not consider maximal connected regions but focus on so-called "boxes" within regions: for each weight vector $w$ leading to tolerably small error, its box $M_w$ in weight space is a $W$-dimensional hypercuboid with center $w$, where $W$ is the number of weights. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in the direction of the axis corresponding to weight $w_{ij}$ is denoted by $\Delta w_{ij}$, which gives the precision of $w_{ij}$. $M_w$'s box volume is defined by $V(\Delta w) := 2^W \prod_{i,j} \Delta w_{ij}$, where $\Delta w$ denotes the vector with components $\Delta w_{ij}$. Our goal is to find large boxes within flat minima. Towards this end we minimize

$$ B := -\log\bigl(2^{-W} V(\Delta w)\bigr) = -\sum_{i,j} \log \Delta w_{ij} . $$

Note the relationship to MDL: $B$ is the number of bits (save a constant) required to describe all weights in the net. FMS [1] minimizes $E = E_q + \lambda B$ by gradient descent, where $E_q$ is the training set mean squared error and $\lambda > 0$ scales the influence of $B = T_1 + T_2$, with

$$ T_1 := \sum_{(i,j) \in O \times H \,\cup\, H \times I} \log \sum_{k \in O} \left( \frac{\partial y^k}{\partial w_{ij}} \right)^2 $$

and

$$ T_2 := W \log \sum_{k \in O} \left( \sum_{(i,j) \in O \times H \,\cup\, H \times I} \frac{\left| \frac{\partial y^k}{\partial w_{ij}} \right|}{\sqrt{\sum_{k \in O} \left( \frac{\partial y^k}{\partial w_{ij}} \right)^2}} \right)^2 , $$

where $O$, $H$, $I$ denote the index sets for output, hidden, and input units, respectively, and $y^k$ denotes the activation of output unit $k$, which depends on the weights $w_{ij}$.

$B$ is derived from two flatness conditions, FC1 and FC2. Perturbing the weights $w$ by $\Delta w$, we obtain $E_D(w, \Delta w) := \sum_{k \in O} \bigl( y^k(w + \Delta w) - y^k(w) \bigr)^2$. To enforce flatness, FC1 keeps $E_D$ low:

$$ E_D(w, \Delta w) \approx \sum_{k \in O} \left( \sum_{i,j} \frac{\partial y^k}{\partial w_{ij}} \, \Delta w_{ij} \right)^2 \le \sum_{k \in O} \left( \sum_{i,j} \left| \frac{\partial y^k}{\partial w_{ij}} \right| \, |\Delta w_{ij}| \right)^2 \le \epsilon , $$

where $\epsilon > 0$ is small enough to allow for linear approximation. Many boxes $M_w$ define a flat region and satisfy FC1. To select a particular, very flat $M_w$, the following FC2 uses up the degrees of freedom left by FC1: it enforces minimal net output variance within a box of given volume,

$$ \forall i,j,u,v : \quad (\Delta w_{ij})^2 \sum_{k \in O} \left( \frac{\partial y^k}{\partial w_{ij}} \right)^2 = (\Delta w_{uv})^2 \sum_{k \in O} \left( \frac{\partial y^k}{\partial w_{uv}} \right)^2 . $$

Inserting FC2 into FC1 (using "$=$" instead of "$\le$", since we search for maximal $\Delta w_{ij}$), we obtain

$$ |\Delta w_{uv}| = \frac{\sqrt{\epsilon}}{\sqrt{\sum_{k \in O} \left( \frac{\partial y^k}{\partial w_{uv}} \right)^2} \; \sqrt{\sum_{k \in O} \left( \sum_{i,j} \frac{\left| \frac{\partial y^k}{\partial w_{ij}} \right|}{\sqrt{\sum_{k \in O} \left( \frac{\partial y^k}{\partial w_{ij}} \right)^2}} \right)^2}} . $$

Inserting this expression into the definition of $B$ yields the formula for $B$ given above, where the constant factor $\frac{1}{2}$ and the term $\log \epsilon$ are dropped, since during gradient descent constant terms vanish and constant factors are absorbed by the learning rate.
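To make the regularizer concrete, the following minimal NumPy sketch (our illustration, not code from the paper) evaluates $B = T_1 + T_2$ for a single input pattern, assuming a fully connected autoencoder with one sigmoid hidden layer, sigmoid output units and no biases; the function name `fms_regularizer` and the small constant `EPS` guarding logarithms and square roots are our own additions.

```python
import numpy as np

EPS = 1e-12  # numerical guard for log/sqrt of tiny values (not part of the paper's formulation)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fms_regularizer(x, W_hi, W_oh):
    """Return B = T1 + T2 for one input pattern x.

    Assumed architecture (a simplification of the paper's setting):
    x -> sigmoid hidden layer (weights W_hi, shape H x I)
      -> sigmoid output layer (weights W_oh, shape O x H), no biases.
    """
    # Forward pass.
    y_h = sigmoid(W_hi @ x)            # hidden activations y_i
    y_o = sigmoid(W_oh @ y_h)          # output activations y^k (reconstruction)
    fp_h = y_h * (1.0 - y_h)           # f'_i(s_i) for hidden units
    fp_o = y_o * (1.0 - y_o)           # f'_k(s_k) for output units

    # dy^k/dw for output-layer weights w_kj: f'_k * y_j (nonzero only for the output unit k the weight feeds).
    d_out = fp_o[:, None] * y_h[None, :]                                   # (O, H)
    # dy^k/dw for hidden-layer weights w_ij: (dy^k/dy_i) * f'_i * x_j, with dy^k/dy_i = f'_k * W_oh[k, i].
    dyk_dyi = fp_o[:, None] * W_oh                                         # (O, H)
    d_hid = dyk_dyi[:, :, None] * (fp_h[:, None] * x[None, :])[None, :, :] # (O, H, I)

    # Per-weight squared output sensitivities: sum_k (dy^k/dw_ij)^2.
    sq_out = d_out ** 2                 # (O, H); only k = i contributes for output-layer weights
    sq_hid = (d_hid ** 2).sum(axis=0)   # (H, I)

    # T1 = sum over all weights of log sum_k (dy^k/dw_ij)^2.
    T1 = np.log(sq_out + EPS).sum() + np.log(sq_hid + EPS).sum()

    # T2 = W * log sum_k ( sum over weights of |dy^k/dw_ij| / sqrt(sum_k (dy^k/dw_ij)^2) )^2.
    n_weights = W_hi.size + W_oh.size
    inner = (np.abs(d_out) / np.sqrt(sq_out + EPS)).sum(axis=1)                        # output-layer part, per k
    inner += (np.abs(d_hid) / np.sqrt(sq_hid + EPS)[None, :, :]).sum(axis=(1, 2))      # hidden-layer part, per k
    T2 = n_weights * np.log((inner ** 2).sum() + EPS)

    return T1 + T2

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.9, size=8)
W_hi = rng.normal(scale=0.5, size=(4, 8))
W_oh = rng.normal(scale=0.5, size=(8, 4))
print(fms_regularizer(x, W_hi, W_oh))
```

In a full run this quantity would be accumulated over the training set and added to the reconstruction error as $E = E_q + \lambda B$ before taking gradients.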
A component function (CF) is the function determining the activation of a code component (hidden unit) in response to a given input. Consider the first term of $B$, rewritten as

$$ T_1 = \sum_{(i,j) \in O \times H \,\cup\, H \times I} \left( 2 \log |f'_i(s_i)| + 2 \log |y_j| + \log \sum_{k \in O} \left( \frac{\partial y^k}{\partial y_i} \right)^2 \right) $$
$$ \quad = 2 \sum_{i \in O \cup H} \text{fan-in}(i) \log |f'_i(s_i)| + 2 \sum_{j \in H \cup I} \text{fan-out}(j) \log |y_j| + \sum_{i \in O \cup H} \text{fan-in}(i) \log \sum_{k \in O} \left( \frac{\partial y^k}{\partial y_i} \right)^2 , $$

where $f'_i(s_i)$ is the derivative of the activation function of unit $i$ with activation $y_i$, and fan-in($i$) (fan-out($i$)) denotes the number of incoming (outgoing) weights of unit $i$. $T_1$ makes (1) unit activations decrease to zero, (2) first-order derivatives of activation functions decrease to zero, and (3) the influence of units on the output decrease to zero. $T_1$ is the reason why low-complexity (or simple) CFs are preferred. Point (1) above favors sparse hidden unit activations (here: few active code components); point (2) favors non-informative hidden unit activations that are hardly affected by small input changes; point (3) favors sparse hidden unit activations in the sense that few hidden units contribute to producing the output.

$T_2$ punishes units with similar influence on the output. We reformulate it:

$$ T_2 = W \log \left( |O|\,|H|^2 + 2\,|H|\,|I| \sum_{k \in O} \sum_{i \in H} \frac{\left| \frac{\partial y^k}{\partial y_i} \right|}{\sqrt{\sum_{k \in O} \left( \frac{\partial y^k}{\partial y_i} \right)^2}} + |I|^2 \sum_{k \in O} \sum_{i \in H} \sum_{u \in H} \frac{\left| \frac{\partial y^k}{\partial y_i} \right| \left| \frac{\partial y^k}{\partial y_u} \right|}{\sqrt{\sum_{k \in O} \left( \frac{\partial y^k}{\partial y_i} \right)^2} \sqrt{\sum_{k \in O} \left( \frac{\partial y^k}{\partial y_u} \right)^2}} \right) , $$

where $|\cdot|$ denotes the number of elements in a set. We observe: (1) an output unit that is very sensitive with respect to two given hidden units will contribute heavily to $T_2$; (2) this large contribution can be reduced by making both hidden units have large impact on other output units as well. So FMS essentially tries to figure out a way of using (1) as few CFs as possible for each output unit (this leads to separation of CFs), while simultaneously (2) using the same CFs for as many output units as possible (common CFs).
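The regrouped form of $T_1$ can be verified numerically. The short script below (again our illustration, under the same single-hidden-layer sigmoid assumptions as the sketch above) builds a tiny random autoencoder and checks that the per-weight form $\sum_{i,j} \log \sum_k (\partial y^k / \partial w_{ij})^2$ equals the fan-in/fan-out form.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Tiny random autoencoder (one sigmoid hidden layer, sigmoid outputs,
# no biases, strictly positive inputs so all logarithms are finite).
rng = np.random.default_rng(1)
n_in, n_hid, n_out = 6, 3, 6
x = rng.uniform(0.1, 0.9, size=n_in)
W_hi = rng.normal(scale=0.5, size=(n_hid, n_in))
W_oh = rng.normal(scale=0.5, size=(n_out, n_hid))

y_h = sigmoid(W_hi @ x)
fp_h = y_h * (1.0 - y_h)
y_o = sigmoid(W_oh @ y_h)
fp_o = y_o * (1.0 - y_o)
dyk_dyi = fp_o[:, None] * W_oh      # dy^k/dy_i for hidden units i, shape (n_out, n_hid)

# Per-weight form: T1 = sum over all weights of log sum_k (dy^k/dw_ij)^2.
t1_per_weight = (
    np.log((fp_o[:, None] * y_h[None, :]) ** 2).sum()                                      # output-layer weights
    + np.log((dyk_dyi ** 2).sum(0)[:, None] * (fp_h[:, None] * x[None, :]) ** 2).sum()     # hidden-layer weights
)

# Regrouped form:
#   2 * sum_i fan_in(i) * log|f'_i|  +  2 * sum_j fan_out(j) * log|y_j|
#     + sum_i fan_in(i) * log sum_k (dy^k/dy_i)^2,
# with fan_in = n_hid for output units and n_in for hidden units,
# fan_out = n_out for hidden units and n_hid for input units,
# and sum_k (dy^k/dy_i)^2 = 1 for output units (so that log vanishes).
t1_regrouped = (
    2 * (n_hid * np.log(fp_o).sum() + n_in * np.log(fp_h).sum())
    + 2 * (n_out * np.log(y_h).sum() + n_hid * np.log(x).sum())
    + n_in * np.log((dyk_dyi ** 2).sum(0)).sum()
)

print(np.allclose(t1_per_weight, t1_regrouped))   # True
```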
The results above give rise to a new method for source separation: simply train autoencoders (e.g., [2, 3, 4, 5]) via FMS. The method's name is LOCOCODE [6, 7, 8, 9, 10], which stands for "low-complexity coding and decoding". LOCOCODE generates lococodes that (1) convey information about the input data, (2) can be computed by a low-complexity mapping (LCM), and (3) can be decoded by an LCM (for alternative approaches using low-complexity nets to achieve ICA see [11, 12]). The analysis above shows that LOCOCODE essentially attempts to describe single inputs with as few and as simple features as possible. This reflects a basic assumption, namely, that the true input "causes" are indeed few and simple. Training sets whose elements are all describable by few features will result in sparse codes. Sparseness [13, 14, 15, 16, 17, 18, 19] is not viewed as an a priori good thing and is not enforced explicitly; it emerges only if the input data are indeed naturally describable by a sparse code.

LOCOCODE (a) is not (like PCA and ICA [20, 21, 22, 23, 24, 25, 26, 27]) inherently limited to the linear case [10], (b) does not need (like ICA) a priori information about the number of independent data sources (even when ICA knows the number of sources, LOCOCODE outperforms ICA) [8], and (c) has a higher coding efficiency (bits per input pixel) than PCA and ICA [9]. Unlike codes obtained with standard autoencoders, lococodes are based on feature detectors, never unstructured, usually sparse, and sometimes factorial or local (depending on the statistical properties of the data). Although LOCOCODE is not explicitly designed to enforce sparse or factorial codes, it extracts optimal codes for nonlinear, difficult versions of the "bars" benchmark problem, whereas ICA and PCA do not [10, 8]. Applied to real-world images, it produces familiar, biologically plausible feature detectors and codes with fewer bits per pixel than ICA and PCA. Unlike ICA it does not need to know the number of independent sources. As a preprocessor for a vowel recognition benchmark problem it sets the stage for excellent classification performance [10].

Although LOCOCODE works well for visual inputs, it may be less useful for discovering input causes that can only be represented by high-complexity input transformations, or for discovering many features (causes) collectively determining single input components (as, e.g., in acoustic signal separation, where ICA does not suffer from the fact that each source influences each input component and none is computable by a low-complexity function). For even more general, algorithmic methods for reducing net complexity see [28]. For the authors' alternative neural approaches to nonlinear ICA see [29, 30]. Our results reveal an interesting, previously ignored connection between two important fields: regularization and ICA. They may represent a first step towards a unification of regularization and unsupervised learning.

This work was supported by DFG grants SCHM 942/3-1 and BR 609/10-2 from the "Deutsche Forschungsgemeinschaft". J.S. would also like to acknowledge support from SNF grant 21-43'417.95 "predictability minimization".


Similar articles

Feature Extraction Through LOCOCODE

Low-complexity coding and decoding (LOCOCODE) is a novel approach to sensory coding and unsupervised learning. Unlike previous methods, it explicitly takes into account the information-theoretic complexity of the code generator. It computes lococodes that convey information about the input data and can be computed and decoded by low-complexity mappings. We implement LOCOCODE by training autoass...


Low-Complexity Coding and Decoding

We present a novel approach to sensory coding and unsupervised learning. It is called "Low-complexity coding and decoding" (Lococode). Unlike previous methods it explicitly takes into account the information-theoretic complexity of the code generator: lococodes (1) convey information about the input data and (2) can be computed and decoded by low-complexity mappings. To implement Lococode we t...


Nonlinear Extensions of Reconstruction ICA

In a recent paper [1] it was observed that unsupervised feature learning with overcomplete features could be achieved using linear autoencoders (named Reconstruction Independent Component Analysis). This algorithm has been shown to outperform other well-known algorithms by penalizing the lack of diversity (or orthogonality) amongst features. In our project, we wish to extend and improve this al...


Deep Unsupervised Clustering Using Mixture of Autoencoders

Unsupervised clustering is one of the most fundamental challenges in machine learning. A popular hypothesis is that data are generated from a union of low-dimensional nonlinear manifolds; thus an approach to clustering is identifying and separating these manifolds. In this paper, we present a novel approach to solve this problem by using a mixture of autoencoders. Our model consists of two part...


Lococode

"Low-complexity coding and decoding" (Lococode) is a novel approach to sensory coding and unsupervised learning. Unlike previous methods it explicitly takes into account the information-theoretic complexity of the code generator: lococodes (1) convey information about the input data and (2) can be computed and decoded by low-complexity mappings. We implement Lococode by training autoassociators...



Publication year: 1999