data cleaning

Search results for: data cleaning

Number of results: 2424654

Discovering Editing Rules For Data Cleaning

2012

Thierno Diallo Jean-Marc Petit Sylvie Servigne

Dirty data continues to be an important issue for companies. The database community pays particular attention to this subject. A variety of integrity constraints, such as Conditional Functional Dependencies (CFD), have been studied for data cleaning. Data repair methods based on these constraints are strong at detecting inconsistencies but limited in how they correct data; worse, they can even intro...

Wisteria: Nurturing Scalable Data Cleaning Infrastructure

Journal: PVLDB 2015

Daniel Haas Sanjay Krishnan Jiannan Wang Michael J. Franklin Eugene Wu

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. Wh...

Discovering Informative Patterns and Data Cleaning

1994

Isabelle Guyon Nada Matic Vladimir Vapnik

We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework also encompasses methods for cleaning databases containing corrupted data. Both on-line and off-line algorithms are proposed and experimentally checked on databases of handwritten images. The generality of the framework mak...

Correlation-Based Methods for Biological Data Cleaning

2007

YONG KOH Wynne Hsu

On-line outlier detection and data cleaning

Journal: Computers & Chemical Engineering 2004

Hancong Liu Sirish Shah Wei Jiang

Outliers are observations that do not follow the statistical distribution of the bulk of the data, and consequently may lead to erroneous results with respect to statistical analysis. Many conventional outlier detection tools are based on the assumption that the data is identically and independently distributed. In this paper, an outlier-resistant data filter-cleaner is proposed. The proposed d...

Cleaning the GenBank Arabidopsis thaliana data set.

Journal: Nucleic Acids Research 1996

P G Korning S M Hebsgaard P Rouze S Brunak

Data driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A.thaliana entries in GenBank. A number of simple 'sanity' checks, ba...

An Interactive Framework for Data Cleaning

2000

Vijayshankar Raman Joseph M. Hellerstein

Cleaning organizational data of discrepancies in structure and content is important for data warehousing and Enterprise Data Integration (EDI). Current commercial solutions for data cleaning involve many iterations of time-consuming “data quality” analysis to find errors, and long-running transformations to fix them. Users need to endure long waits and often write complex transformation program...

Models for Distributed, Large Scale Data Cleaning

2014

Vincent J. Maccio Fei Chiang Douglas G. Down

Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying the inconsistencies and proposing updates to the data. However, much of this work has focused o...

Exploiting Relationships for Domain-Independent Data Cleaning

2005

Dmitri V. Kalashnikov Sharad Mehrotra Zhaoqi Chen

In this paper, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and ...

Estimating Data Integration and Cleaning Effort

2015

Sebastian Kruse Paolo Papotti Felix Naumann

Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some and in most cases significant human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists) it would be therefor...
