A vector space model of semantics using Canonical Correlation Analysis
Abstract
We present an efficient method that uses canonical correlation analysis (CCA) between words and their contexts (i.e., the neighboring words) to estimate a real-valued vector for each word that characterizes its "hidden state" or "meaning". The use of CCA allows us to prove theorems characterizing how accurately we can estimate this hidden state. Recently developed algorithms for computing the required singular vectors make it easy to compute models for billions of words of text with vocabularies in the hundreds of thousands. Experiments on the Google n-gram collection show that CCA between words and their contexts provides a mapping from each word to a low-dimensional feature vector that captures information about the part of speech and meanings of the word. Unlike latent semantic analysis, which uses PCA, our method takes advantage of the information implicit in the word sequences.

1 The problem of state estimation

Many people have clustered words based on their distributional similarity (see, among many articles, Pereira et al. (1993)), but such clustering ignores the many different dimensions along which similarity could be computed. We instead characterize words using a vector space model (Turney and Pantel, 2010). We present a method for learning language models that estimates a "state" or latent variable representation for words based on their context. The vector learned for each word captures a wide variety of information about it, allowing us to predict the word's part of speech, linguistic features such as animacy, membership in a wide variety of semantic categories (foods, drinks, colors, males, females, ...), and direction of sentiment (happy vs. sad).

More precisely, we estimate a hidden state associated with words by computing the dominant canonical correlations between target words and the words in their immediate context. The main computation, finding the singular value decomposition of a scaled version of the co-occurrence matrix of counts of words with their contexts, can be done highly efficiently. Use of CCA also allows us to prove theorems about the optimality of our reconstruction of the state.

Our CCA-based multi-view learning method can be thought of as a generalization of the widely used latent semantic analysis (LSA) methods. LSA (sometimes called LSI when used for indexing) applies a principal component analysis (PCA) to the co-occurrence matrix between words and the documents they occur in to learn a latent vector representation of each word type (Landauer et al., 2008). We extend this method in two ways: (1) by looking at the correlation between words and the words in their context, we find a latent vector for each word that depends on the sequence of nearby words, unlike the bag-of-words model standard in LSA; (2) we rescale the covariance matrix between words and contexts to give a correlation matrix (i.e., we use CCA), allowing us to prove theorems about how accurately we can estimate the state. Unlike PCA, in CCA the dominant singular vectors are guaranteed to capture the key state information. Importantly, our method scales well, handling gigaword corpora with vocabularies of tens of thousands of words on a small computer. Using CCA to estimate a latent vector for each word based on the contexts it appears in gives a significantly different (and, as we show below, often more useful) representation than, for example, taking the PCA of the same data used to generate the CCA.
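As a concrete illustration of this computation, the following is a minimal Python sketch under some stated assumptions: contexts are taken to be the words within a small symmetric window, and each view's within-view covariance is approximated by its diagonal (the marginal counts), so the CCA reduces to a sparse SVD of a rescaled co-occurrence matrix. The function name cca_word_vectors, the window size, and the exact scaling are illustrative choices, not necessarily the paper's precise recipe.

```python
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds

def cca_word_vectors(tokens, k=50, window=1, min_count=5):
    """Sketch: k-dimensional word vectors from the CCA between
    words and their immediate contexts (diagonal approximation)."""
    vocab = [w for w, c in Counter(tokens).items() if c >= min_count]
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    # Co-occurrence counts between target words (rows) and the
    # context words within +/- `window` positions (columns).
    rows, cols = [], []
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                rows.append(index[w])
                cols.append(index[tokens[j]])
    counts = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(v, v))

    # Rescale the count (covariance) matrix toward a correlation
    # matrix: divide through by the square roots of the marginal
    # counts of each view (a diagonal approximation of the
    # within-view covariances).
    row_marg = np.asarray(counts.sum(axis=1)).ravel()
    col_marg = np.asarray(counts.sum(axis=0)).ravel()
    d_w = diags(1.0 / np.sqrt(np.maximum(row_marg, 1.0)))
    d_c = diags(1.0 / np.sqrt(np.maximum(col_marg, 1.0)))
    scaled = d_w @ counts @ d_c

    # Dominant singular vectors of the rescaled matrix give the
    # canonical directions; the left vectors embed the target words.
    # (svds requires k < v.)
    u, s, vt = svds(scaled, k=k)
    return vocab, u  # one k-dimensional vector per word type
```

Because the matrix of counts is sparse and only the top k singular vectors are needed, this step is what lets the method scale to very large corpora; the singular vectors recovered here are the basis for the attribute dictionary described next.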
We show below how to efficiently compute a vector that characterizes each word type by using the right singular vectors of the above CCA to map from the word space (size v) to the state space (size k). We call this mapping the attribute dictionary for words, as it associates with every word a vector that captures that word's attributes. As will be made clear below, the attribute dictionary is arbitrary up to a rotation, but it captures the information needed for any linear model to predict properties of the words, such as part of speech or word sense. Characterizing words by a low-dimensional (e.g., 50-dimensional) vector as we do is valuable in that words can be partitioned in many ways. They can be grouped by part of speech, or by many aspects of their "meaning", including features such as animacy, gender, whether the word describes something that is good to eat or drink, whether it refers to a year or to a small number, etc. Such vectors can be used, for example, as features for word sense disambiguation or named entity disambiguation, or to help improve parsing algorithms.
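Below is a small hypothetical example of the downstream use described here: looking words up in the attribute dictionary and fitting a linear model on the resulting vectors. The animacy task, the toy word list, and the reuse of the cca_word_vectors sketch above (with `tokens` as the training corpus) are all assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled words for a binary "animacy" property.
labeled = [("dog", 1), ("teacher", 1), ("horse", 1),
           ("rock", 0), ("idea", 0), ("table", 0)]

# Build the attribute dictionary from the sketch above.
vocab, embeddings = cca_word_vectors(tokens, k=50)
lookup = {w: embeddings[i] for i, w in enumerate(vocab)}

X = [lookup[w] for w, _ in labeled if w in lookup]
y = [lab for w, lab in labeled if w in lookup]

# A linear model over the attribute vectors; its accuracy is
# unaffected by the arbitrary rotation of the dictionary noted above.
clf = LogisticRegression(max_iter=1000).fit(X, y)
```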
Publication date: 2011