Search results for: data cleaning

Number of results: 2424654

2012
Mourad Ouzzani

Textbook database examples are often wrong and simplistic. Unfortunately, data is never born clean or pure. Errors, missing values, repeated entries, inconsistent instances and unsatisfied business rules are the norm rather than the exception. Data cleaning (also known as data cleansing, record linkage and many other terminologies) is growing as a major application requirement and an interdiscip...

Journal: PVLDB 2017
Yeounoh Chung, Sanjay Krishnan, Tim Kraska

Data cleaning, whether manual or algorithmic, is rarely perfect, leaving a dataset with an unknown number of false positives and false negatives after cleaning. In many scenarios, quantifying the number of remaining errors is challenging because our data integrity rules themselves may be incomplete, or the available gold-standard datasets may be too small to extrapolate. As the use of inherently...

Journal: PVLDB 2015
Moria Bergman, Tova Milo, Slava Novgorodov, Wang Chiew Tan

As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a data...

Journal: Information Visualization 2011
Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck, Paolo Buono

In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process of ‘data wrangling’ often constitutes the most tedious and time-consuming aspect of analysis. Though data cleaning and integration are longstanding issues in the database community, relatively little r...

Journal: Inf. Syst. 2004
William E. Winkler

This paper provides a survey of two classes of methods that can be used in determining and improving the quality of individual files or groups of files. The first are edit/imputation methods for maintaining business rules and for imputing for missing data. The second are methods of data cleaning for finding duplicates within files or across files. Published by Elsevier Ltd.

2007
Stefan Brüggemann, Thomas Aden

Data validation and cleaning are integral processes of the data quality management cycle. Domain-specific knowledge is needed to detect and correct semantic errors. Ontologies can be used to represent valid and invalid attribute value combinations to detect and correct invalid data. We introduce reorganization operations for maintaining such an ontology in the data quality management cycle.

2006
Edward L. Glaeser

Economists are quick to assume opportunistic behavior in almost every walk of life other than our own. Our empirical methods are based on assumptions of human behavior that would not pass muster in any of our models. The solution to this problem is not to expect a mass renunciation of data mining, selective data cleaning or opportunistic methodology selection, but rather to follow Leamer’s lead...

2008
Hamid Haidarian Shahri

The problems of data quality and data cleaning are inevitable in data integration from distributed operational databases and online transaction processing (OLTP) systems (Rahm & Do, 2000). This is due to the lack of a unified set of standards spanning over all the distributed sources. One of the most challenging and resource-intensive phases of data cleaning is the removal of fuzzy duplicate re...

1999
Robert M. Haralick

A spatial clustering procedure applicable to multi-spectral image data is discussed. The procedure takes into account the spatial distribution of the measurements as well as their distribution in measurement space. The procedure calls for the generation and then thresholding of the gradient image, cleaning the thresholded image, labeling the connected regions in the cleaned image, and clusterin...

2003
Eiman Elnahrawy, Badri Nath

We present our ongoing work on data quality problems in sensor networks. Specifically, we deal with the problems of outliers, missing information, and noise. We propose an approach for modeling and online learning of spatio-temporal correlations in sensor networks. We utilize the learned correlations to discover outliers and recover missing information. We also propose a Bayesian approach for r...
