Hierarchical Hyperlink Prediction for the WWW
Authors
Abstract
The hyperlink prediction task, that of proposing new links between webpages, can be used to improve search engines, expand the visibility of web pages, and increase the connectivity and navigability of the web. Hyperlink prediction is typically performed on webgraphs composed of thousands or millions of vertices, where on average each webpage contains fewer than fifty links. Algorithms processing graphs this large and sparse must be both scalable and precise, a challenging combination. Similarity-based algorithms are among the most scalable solutions within the link prediction field, due to their parallel nature and computational simplicity. These algorithms independently explore the nearby topological features of every link missing from the graph in order to determine its likelihood. Unfortunately, the precision of similarity-based algorithms is limited, which has prevented their broad application so far. In this work we explore the performance of similarity-based algorithms for the particular problem of hyperlink prediction on large webgraphs, and propose a novel method which assumes the existence of hierarchical properties. We evaluate this new approach on several webgraphs and compare its performance with that of the current best similarity-based algorithms. Its remarkable performance leads us to discuss the applicability of the proposal, identifying several use cases of hyperlink prediction. We also describe the approach we took for the computation of large-scale graphs from the perspective of high-performance computing, providing details on the implementation and parallelization of the code.
Keywords: Web Mining, Link Prediction, Large Scale Graph Mining
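As an illustration of what the abstract means by scoring every missing link independently from its local topology, the following minimal sketch implements three well-known similarity-based scores (common neighbours, resource allocation and Adamic-Adar). It is not the hierarchical method proposed in the paper. The function names, the toy edge list, and the simplification of treating hyperlinks as undirected are all assumptions made for this example.

```python
from itertools import combinations
from math import log


def build_neighbours(edges):
    """Build undirected adjacency sets from (u, v) pairs.

    Assumption: link direction is ignored to keep the sketch short.
    """
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def common_neighbours(nbrs, u, v):
    """Number of vertices adjacent to both u and v."""
    return len(nbrs[u] & nbrs[v])


def resource_allocation(nbrs, u, v):
    """Sum of 1/degree over the common neighbours of u and v."""
    return sum(1.0 / len(nbrs[z]) for z in nbrs[u] & nbrs[v])


def adamic_adar(nbrs, u, v):
    """Sum of 1/log(degree) over the common neighbours of u and v.

    A common neighbour of two distinct vertices has degree >= 2,
    so the logarithm is always positive.
    """
    return sum(1.0 / log(len(nbrs[z])) for z in nbrs[u] & nbrs[v])


def rank_missing_links(nbrs, score):
    """Score every non-adjacent pair independently, highest score first.

    Each pair is scored on its own, which is why this family of
    methods parallelizes easily. The all-pairs loop is only suitable
    for toy graphs.
    """
    candidates = [
        ((u, v), score(nbrs, u, v))
        for u, v in combinations(sorted(nbrs), 2)
        if v not in nbrs[u]
    ]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Hypothetical toy webgraph: vertices are pages, pairs are hyperlinks.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e")]
toy_nbrs = build_neighbours(toy_edges)
toy_ranking = rank_missing_links(toy_nbrs, resource_allocation)
```

In the toy graph the pair ("a", "d") ranks first because it shares two neighbours, while ("a", "e") shares none. On a real webgraph the candidate pairs would be restricted (for example to vertices two steps apart) rather than enumerated exhaustively.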
Similar resources
Mining the web with hierarchical crawlers - a resource sharing based crawling approach
An important component of any web search engine is its crawler, which is also known as a robot or spider. An efficient set of crawlers makes any search engine more powerful, apart from its other measures of performance, such as its ranking algorithm, storage mechanism, indexing techniques, etc. In this paper, we have proposed an extended technique for crawling over the World Wide Web (WWW) on beha...
Integrity Constraints for Hyperlinks in a Hypermedia Database System: AYATORI
Internet users have become well acquainted with the World Wide Web (WWW) system, and WWW has become the most significant service on the Internet. In the near future, the importance of large scale hypermedia database systems based on WWW technologies is expected to continue to increase. The present study focuses on the issue of managing hyperlink integrity constraints on WWW-like hypermedia data...
Hierarchical Alpha-cut Fuzzy C-means, Fuzzy ARTMAP and Cox Regression Model for Customer Churn Prediction
As customers are the main asset of any organization, customer churn management is becoming a major task for organizations to retain their valuable customers. In the previous studies, the applicability and efficiency of hierarchical data mining techniques for churn prediction by combining two or more techniques have been proved to provide better performances than many single techniques over a nu...
A Hyperlink Focused Browse Assistant for the World Wide Web
This paper describes a browse assistant focusing on hyperlinks. It discusses the concept and an accompanying prototype implementation of the assistant. The aim of the assistant is to increase the usability of navigation through the World Wide Web (WWW) by the provision of more detailed hyperlink information for each browsed HTML-document. Extracted from a personal link database it offers helpfu...
Exploiting hyperlinks for automatic information discovery on the WWW
The explosion of the World Wide Web as a global information network brings with it a number of related challenges for information retrieval and automation. The link structure, which is the main feature of the hypermedia environment, can be a rich source of information for exploration. This paper centers on exploiting hyperlinks for automatic information discovery. In this paper...
Journal title:
- CoRR
Volume abs/1611.09084 Issue
Pages -
Publication date 2016