Tri-training and MapReduce-based massive data learning
Abstract
Real-world massive data processing applications raise two major challenges for traditional supervised machine learning. First, sufficient labeled training examples to ensure generalization ability are often unavailable, since having experts label examples is time-consuming and expensive; second, massive data cannot be loaded into memory, and the response time for training and classifying it in a traditional serial mode is unacceptable. In this paper, we combine semi-supervised learning with parallel computing techniques to meet both challenges. In detail: (1) the co-training style semi-supervised learning algorithm Tri-training is exploited and revised in order to learn from a limited amount of labeled data together with a large amount of unlabeled data. In particular, the co-training process of Tri-training is revised by introducing a data editing operation that removes newly mislabeled data before it is used to re-train another classifier. Since data editing can effectively remedy the mislabeling problem that is unavoidable in co-training style algorithms, the generalization ability of the learned hypothesis is improved. (2) Furthermore, the learning algorithm of each individual classifier and the data editing operation in Tri-training are re-formed in Google's MapReduce parallel pattern instead of the traditional memory-resident serial mode. The MapReduce pattern ensures that each co-training iteration and the prediction process in Tri-training can be executed in parallel on clusters of commodity PCs. Experimental results on UCI datasets show that our revised Tri-training improves accuracy even when learning with only a limited amount of labeled data.
A real application, detecting small pulmonary nodules in chest CT images, also shows that the revised Tri-training algorithm effectively reduces both the false negative rate and the false positive rate, and it demonstrates scalability to massive data thanks to the adaptation to the MapReduce parallel paradigm.
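The revised co-training loop described in the abstract can be summarized in a minimal sketch. This is an illustrative reconstruction, not the paper's implementation: the nearest-centroid base learner and the nearest-neighbour editing rule below are stand-ins for the paper's actual base classifiers and data editing operation, and the error-rate acceptance conditions of the original Tri-training algorithm are omitted for brevity.

```python
import random

def train_centroid(data):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in data:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def edit(candidates, labeled):
    """Toy data editing: keep a pseudo-labeled point only if its label agrees
    with the label of its nearest genuinely labeled example."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [(x, y) for x, y in candidates
            if min(labeled, key=lambda p: sqdist(x, p[0]))[1] == y]

def tri_train(labeled, unlabeled, rounds=3, seed=0):
    rng = random.Random(seed)
    # Three classifiers, each trained on a bootstrap sample of the labeled set.
    models = [train_centroid([rng.choice(labeled) for _ in labeled])
              for _ in range(3)]
    for _ in range(rounds):
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Pseudo-label the unlabeled points on which the other two agree...
            cand = [(x, predict(models[j], x)) for x in unlabeled
                    if predict(models[j], x) == predict(models[k], x)]
            # ...then edit out suspected mislabels before re-training h_i.
            models[i] = train_centroid(labeled + edit(cand, labeled))
    return models

def vote(models, x):
    """Final prediction: majority vote of the three classifiers."""
    preds = [predict(m, x) for m in models]
    return max(set(preds), key=preds.count)
```

The editing step is the key revision: without it, any point the two agreeing classifiers mislabel would be injected directly into the third classifier's training set. Note also that the per-class sum/count aggregation inside `train_centroid` is exactly the kind of computation the paper recasts as MapReduce: mappers emit partial (sum, count) pairs per class over data shards, and reducers merge them into centroids.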
Related papers
Elastic extreme learning machine for big data classification
Extreme Learning Machine (ELM) and its variants have been widely used for many applications due to its fast convergence and good generalization performance. Though the distributed ELM based on MapReduce framework can handle very large scale training dataset in big data applications, how to cope with its rapidly updating is still a challenging task. Therefore, in this paper, a novel Elastic Extr...
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s...
Dynamic Cost-sensitive Ensemble Classification based on Extreme Learning Machine for Mining Imbalanced Massive Data Streams
In order to lower the classification cost and improve the performance of the classifier, this paper proposes the approach of the dynamic cost-sensitive ensemble classification based on extreme learning machine for imbalanced massive data streams (DCECIMDS). Firstly, this paper gives the method of concept drifts detection by extracting the attributive characters of imbalanced massive data stream...
A MapReduce based distributed SVM algorithm for binary classification
Although Support Vector Machine (SVM) algorithm has a high generalization property to classify for unseen examples after training phase and it has small loss value, the algorithm is not suitable for real-life classification and regression problems. SVMs cannot solve hundreds of thousands examples in training dataset. In previous studies on distributed machine learning algorithms, SVM is trained...
Accelerating Mahout on Heterogeneous Clusters Using Hadoopcl
MapReduce is a programming model capable of processing massive data in parallel across hundreds of computing nodes in a cluster. It hides many of the complicated details of parallel computing and provides a straightforward interface for programmers to adapt their algorithms to improve productivity. Many MapReduce-based applications have utilized the power of this model, including machine learni...
Journal:
- Int. J. General Systems
Volume 40, Issue -
Pages -
Publication date: 2011