Uncertainty Based Optimal Sample Selection for Big Data


Abstract

In machine learning and pattern recognition, building a better predictive model is one of the key problems in the presence of big or massive data, especially if that data contains noisy and unrepresentative samples. These types of samples adversely affect the learning model and may degrade its performance. To alleviate this problem, it sometimes becomes necessary to sample the data after eliminating unnecessary instances while keeping the underlying distribution intact. This process is called sampling or instance selection (IS). However, this process involves a substantial computational cost. This paper discusses an uncertainty based optimal sample selection (UBOSS) method, which can select a subset of optimal samples efficiently. The proposed work comprises three main steps: initially, it uses an IS method to identify the patterns of representative samples from the original data set; then, an uncertainty-based selector is designed to obtain the fuzziness (i.e., a type of uncertainty) of those samples using a classifier whose output is a membership or fuzzy vector; further, it utilizes a divide-and-conquer strategy to obtain the subset of optimal samples. Experiments are conducted on six datasets to evaluate the performance of the proposed method. Results show that the proposed methodology outperforms the baseline sample selection methods (i.e., CNN, IB3, and DROP3).
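The abstract does not give the fuzziness formula or the selection rule, so the following is only a minimal sketch under stated assumptions, not the authors' UBOSS implementation. It assumes a commonly used entropy-style fuzziness measure for a membership vector, and it uses a hypothetical helper `split_by_fuzziness` with a hypothetical `threshold` to show how samples could be grouped by the fuzziness of a classifier's output.

```python
import math


def fuzziness(membership):
    """Entropy-style fuzziness of a membership vector, scaled to [0, 1].

    Assumption: this measure is a common choice in the fuzzy-set literature;
    the abstract does not state which definition the paper uses.
    """
    total = 0.0
    for mu in membership:
        # Degrees of exactly 0 or 1 contribute nothing (x * log x -> 0).
        if 0.0 < mu < 1.0:
            total -= mu * math.log2(mu) + (1.0 - mu) * math.log2(1.0 - mu)
    return total / len(membership)


def split_by_fuzziness(memberships, threshold=0.5):
    """Hypothetical helper: split sample indices into low/high fuzziness groups."""
    low, high = [], []
    for index, membership in enumerate(memberships):
        if fuzziness(membership) <= threshold:
            low.append(index)
        else:
            high.append(index)
    return low, high


# Illustrative membership vectors, one per sample (made-up values).
outputs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
low_group, high_group = split_by_fuzziness(outputs)
```

A crisp output such as `[1.0, 0.0]` has zero fuzziness, while a maximally ambiguous output such as `[0.5, 0.5]` has fuzziness 1. How the groups are then used to pick the final subset is not described in the abstract and is left out here.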


Similar resources

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....


Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not perform well in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...


Learning ELM-Tree from big data based on uncertainty reduction

A challenge in big data classification is the design of highly parallelized learning algorithms. One solution to this problem is applying parallel computation to different components of a learning model. In this paper, we first propose an extreme learning machine tree (ELM-Tree) model based on the heuristics of uncertainty reduction. In the ELM-Tree model, information entropy and ambiguity are ...


Data Mining Approaches for Geo-Spatial Big Data: Uncertainty Issues

The availability of a vast amount of heterogeneous information from a variety of sources ranging from satellite imagery to the Internet has been termed as the problem of Big Data. Currently there is a great emphasis on the huge amount of geophysical data that has a spatial basis or spatial aspects. To effectively utilize such volumes of data, data mining techniques are needed to manage discover...


A Random Sample Partition Data Model for Big Data Analysis

Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) to represent a big data set as a set of non-overlapping data subsets, i.e., RSP data blocks, where each RSP data block has the same probability distribution as the whole big data s...


Journal

Journal title: IEEE Access

Year: 2023

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2022.3233598