Search results for: data cleaning
Number of results: 2424654
Dirty data continues to be an important issue for companies. The database community pays particular attention to this subject. A variety of integrity constraints, such as Conditional Functional Dependencies (CFDs), have been studied for data cleaning. Data repair methods based on these constraints are good at detecting inconsistencies but limited in how they correct data; worse, they can even intro...
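To make the idea of a Conditional Functional Dependency concrete, here is a minimal sketch of violation detection only. It is not the repair method of the abstract above. The function name `cfd_violations`, the sample rows, and the rule "for rows where country is UK, zip determines street" are all illustrative assumptions.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Find row groups that violate the CFD (condition: lhs -> rhs).

    Hypothetical helper for illustration: among rows matching `condition`,
    rows that agree on every `lhs` attribute must also agree on `rhs`.
    Returns a dict mapping each violating lhs key to the row indices involved.
    """
    rhs_values = defaultdict(set)   # lhs key -> distinct rhs values seen
    members = defaultdict(list)     # lhs key -> indices of matching rows
    for index, row in enumerate(rows):
        # The dependency only applies to rows that satisfy the condition.
        if all(row.get(attr) == value for attr, value in condition.items()):
            key = tuple(row[attr] for attr in lhs)
            rhs_values[key].add(row[rhs])
            members[key].append(index)
    return {key: members[key] for key, seen in rhs_values.items() if len(seen) > 1}


# Made-up sample data: the two UK rows share a zip but disagree on street.
rows = [
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4", "street": "Crichton"},
    {"country": "US", "zip": "10001", "street": "Broadway"},
    {"country": "US", "zip": "10001", "street": "5th Ave"},
]
violations = cfd_violations(rows, {"country": "UK"}, ["zip"], "street")
```

The US rows also disagree on street for the same zip, but they are not reported because the condition restricts the dependency to UK rows. Detection of this kind says which rows conflict, not which value is correct, which is the limitation the abstract points to.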
Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. Wh...
We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework also encompasses methods for cleaning databases containing corrupted data. Both on-line and off-line algorithms are proposed and experimentally checked on databases of handwritten images. The generality of the framework mak...
Outliers are observations that do not follow the statistical distribution of the bulk of the data, and consequently may lead to erroneous results in statistical analysis. Many conventional outlier detection tools are based on the assumption that the data are independent and identically distributed. In this paper, an outlier-resistant data filter-cleaner is proposed. The proposed d...
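As a generic illustration of outlier-resistant detection, the sketch below flags values using the median and the median absolute deviation (MAD) instead of the mean and standard deviation. This is a common rule of thumb (the modified z-score with a 3.5 cutoff), swapped in here for illustration; it is not the filter-cleaner proposed in the abstract above. The function name `mad_outliers` and the sample readings are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    Hypothetical helper: the median and MAD are barely affected by a few
    extreme values, so one large outlier cannot mask itself the way it can
    when the mean and standard deviation are used.
    """
    center = statistics.median(values)
    deviations = [abs(v - center) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD to match the standard deviation for normal data.
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


# Made-up sample: one reading is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9, 10.2]
flagged = mad_outliers(readings)
```

Here only the reading at index 4 is flagged. Like most conventional tools, this sketch treats the values as an unordered sample, so it does not account for dependence between successive observations.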
Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, ba...
Cleaning organizational data of discrepancies in structure and content is important for data warehousing and Enterprise Data Integration (EDI). Current commercial solutions for data cleaning involve many iterations of time-consuming “data quality” analysis to find errors, and long-running transformations to fix them. Users need to endure long waits and often write complex transformation program...
Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying the inconsistencies and proposing updates to the data. However, much of this work has focused o...
In this paper, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and ...
Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some and in most cases significant human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists) it would be therefor...