Uncertainty Based Optimal Sample Selection for Big Data
Authors
Abstract
In machine learning and pattern recognition, building a better predictive model is one of the key problems in the presence of big or massive data, especially if that data contains noisy and unrepresentative samples. These types of samples adversely affect the model and may degrade its performance. To alleviate this problem, it sometimes becomes necessary to sample the data after eliminating unnecessary instances while keeping the underlying distribution intact. This process is called sampling or instance selection (IS). However, this process involves a substantial computational cost. This paper discusses an uncertainty based optimal sample selection (UBOSS) method, which can select a subset of optimal samples efficiently. The proposed work comprises three main steps: initially, it uses an IS method to identify the patterns of representative samples from the original data set; then, an uncertainty-based selector is designed to obtain the fuzziness (i.e., a type of uncertainty) of those samples using a classifier whose output is a fuzzy membership vector; further, it utilizes a divide-and-conquer strategy to obtain the optimal samples. Experiments are conducted on six datasets to evaluate the performance of the method. Results show that the proposed methodology outperforms the baseline methods (CNN, IB3, and DROP3) in terms of the optimum samples selected.
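The abstract does not state how fuzziness is computed from the classifier's membership vector. As an illustration only, the sketch below assumes the widely used De Luca and Termini definition, F(mu) = -(1/n) * sum(mu_i log mu_i + (1 - mu_i) log(1 - mu_i)), and ranks samples by that score. The function names `fuzziness` and `rank_by_fuzziness` and the sample outputs are hypothetical and are not taken from the paper; the IS step and the divide-and-conquer step are not reproduced here.

```python
import math


def fuzziness(membership):
    """Fuzziness of one membership vector with values in [0, 1].

    Assumption: De Luca-Termini fuzziness, which may differ from the
    exact measure used by the UBOSS authors.
    """
    def term(mu):
        # 0 * log(0) is taken as 0, so crisp memberships contribute nothing.
        if mu <= 0.0 or mu >= 1.0:
            return 0.0
        return -(mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu))

    return sum(term(mu) for mu in membership) / len(membership)


def rank_by_fuzziness(memberships):
    """Return sample indices sorted from least fuzzy to most fuzzy."""
    scores = [fuzziness(m) for m in memberships]
    return sorted(range(len(memberships)), key=lambda i: scores[i])


# Hypothetical classifier outputs for three samples over two classes.
outputs = [[0.5, 0.5], [1.0, 0.0], [0.9, 0.1]]
order = rank_by_fuzziness(outputs)
```

A vector such as `[0.5, 0.5]` is maximally fuzzy under this definition, while a crisp vector such as `[1.0, 0.0]` has zero fuzziness, so a selector could use the ranking to separate confident samples from ambiguous ones.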
Similar resources
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not perform well in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...
Learning ELM-Tree from big data based on uncertainty reduction
A challenge in big data classification is the design of highly parallelized learning algorithms. One solution to this problem is applying parallel computation to different components of a learning model. In this paper, we first propose an extreme learning machine tree (ELM-Tree) model based on the heuristics of uncertainty reduction. In the ELM-Tree model, information entropy and ambiguity are ...
Data Mining Approaches for Geo-Spatial Big Data: Uncertainty Issues
The availability of a vast amount of heterogeneous information from a variety of sources ranging from satellite imagery to the Internet has been termed as the problem of Big Data. Currently there is a great emphasis on the huge amount of geophysical data that has a spatial basis or spatial aspects. To effectively utilize such volumes of data, data mining techniques are needed to manage discover...
A Random Sample Partition Data Model for Big Data Analysis
Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) to represent a big data set as a set of non-overlapping data subsets, i.e., RSP data blocks, where each RSP data block has the same probability distribution as the whole big data s...
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2022.3233598