A Bit Level Representation for Time Series Data Mining with Shape Based Similarity

Author

  • GARETH JANACEK
Abstract

Clipping is the process of transforming a real valued series into a sequence of bits representing whether each data point is above or below the average. In this paper we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrates how clipped series can be used in increasingly complex ways to achieve better results than with other popular techniques. The usefulness of the representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content. The flexibility of the representation is shown by the fact that we can take advantage of a variable run length encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform Kolmogorov based clustering.
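To make the clipping idea in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It clips a series against its mean, packs the resulting bits into bytes, and computes run lengths as a simple stand-in for the variable run length encoding mentioned above. The function names (`clip`, `pack_bits`, `run_lengths`), the choice to map values equal to the mean to 0, and the zero-padding of the final byte are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Tuple


def clip(series: List[float]) -> List[int]:
    """Map each value to 1 if it is strictly above the series mean, else 0.

    Treating ties (value == mean) as 0 is an assumption of this sketch.
    """
    mu = mean(series)
    return [1 if x > mu else 0 for x in series]


def pack_bits(bits: List[int]) -> bytes:
    """Pack 0/1 values into bytes, 8 per byte, most significant bit first.

    The last byte is zero-padded on the right (an assumption of this sketch).
    """
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        chunk = chunk + [0] * (8 - len(chunk))
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def run_lengths(bits: List[int]) -> Tuple[int, List[int]]:
    """Return the first bit and the lengths of the consecutive runs."""
    if not bits:
        return 0, []
    runs = [1]
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            runs[-1] += 1
        else:
            runs.append(1)
    return bits[0], runs


# Example series with mean 2.375.
series = [2.0, 3.5, 4.0, 1.0, 0.5, 0.0, 3.0, 5.0]
bits = clip(series)
packed = pack_bits(bits)
first, runs = run_lengths(bits)
```

Here eight real values reduce to a single byte, and the run lengths together with the first bit are enough to rebuild the clipped series exactly. How the paper encodes the runs, and how it derives its lower bounds, is not specified in the abstract and is not reproduced here.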


Similar resources

A Novel Bit Level Time Series Representation with Implications for Similarity Search and Clustering

Because time series are a ubiquitous and increasingly prevalent type of data, there has been much research effort devoted to time series data mining in recent years. As with all data mining problems, the key to effective and scalable algorithms is choosing the right representation of the data. Many high level representations of time series have been proposed for data mining, including spectral ...


A Novel Bit Level Time Series Representation with Implication of Similarity Search and Clustering

Because time series are a ubiquitous and increasingly prevalent type of data, there has been much research effort devoted to time series data mining recently. As with all data mining problems, the key to effective and scalable algorithms is choosing the right representation of the data. Many high level representations of time series have been proposed for data mining. In this work, we introduce...


Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation

For more than one decade, time series similarity search has been given a great deal of attention by data mining researchers. As a result, many time series representations and distance measures have been proposed. However, most existing work on time series similarity search focuses on finding shape-based similarity. While some of the existing approaches work well for short time series data, they...


A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...


Algorithms for Segmenting Time Series

As with most computer science problems, representation of the data is the key to efficient and effective solutions. Piecewise linear representation has been used for the representation of the data. This representation has been used by various researchers to support clustering, classification, indexing and association rule mining of time series data. A variety of algorithms have been proposed to obtain...




Publication date: 2006