Links tell us about lexical and semantic Web content
Abstract
The latest generation of Web search tools is beginning to exploit hypertext link information to improve ranking [1, 2] and crawling [3–5] algorithms. The hidden assumption behind such approaches, a correlation between the graph structure of the Web and its content, has not been tested explicitly despite increasing research on Web topology [6–9]. Here I formalize and quantitatively validate two conjectures drawing connections from link information to lexical and semantic Web content. The link-content conjecture states that a page is similar to the pages that link to it, i.e., one can infer the lexical content of a page by looking at the pages that link to it. I also show that lexical inferences based on link cues are quite heterogeneous across Web communities. The link-cluster conjecture states that pages about the same topic are clustered together, i.e., one can infer the meaning of a page by looking at its neighbours. These results explain the success of the newest search technologies and open the way for more dynamic and scalable methods to locate information in a topic- or user-driven way.

All search engines perform two basic functions: (i) crawling Web pages to maintain an index, and (ii) matching URLs in the index database against user queries. Effective search engines achieve a high coverage of the Web, keep their index fresh, and rank hits in a way that correlates with the user's notion of relevance. Ranking and crawling algorithms use cues from words and hyperlinks, associated respectively with lexical and link topology. In the former, two pages are close to each other if they have similar textual content; in the latter, if there is a short path between them. Lexical metrics are traditionally used by search engines to rank hits according to their similarity to the query, thus attempting to infer the semantics of pages from their lexical representation.
Similarity metrics are derived from the vector space model [10], which represents each document or query by a vector with one dimension for each term and a weight along that dimension that estimates the term's contribution to the meaning of the document. The cluster hypothesis behind this model is that a document lexically close to a relevant document is also relevant with high probability [11]. Links have traditionally been used by search engine crawlers only in exhaustive, centralized algorithms. However, the latest generation of Web search tools is beginning to integrate …
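As a rough illustration of the two notions of closeness described above, the following Python sketch computes a lexical similarity (cosine between term vectors) and a link distance (shortest path length). It is a minimal sketch under stated assumptions, not the method used in the paper: raw term frequencies stand in for the term weights, whose actual scheme is not given in this excerpt, and the function names `term_vector`, `cosine_similarity`, and `link_distance`, as well as the adjacency-dictionary graph format, are hypothetical.

```python
import math
from collections import Counter, deque


def term_vector(text):
    """Map a text to a sparse vector with one dimension per term.

    Assumption: raw term frequency is used as the weight.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def link_distance(graph, source, target):
    """Length of the shortest directed link path, or None if unreachable.

    Assumption: `graph` maps each page to an iterable of pages it links to.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search visits pages in order of distance
        page, dist = queue.popleft()
        for neighbour in graph.get(page, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None
```

Under these assumptions, two pages are lexically close when `cosine_similarity` is near 1 and close in link topology when `link_distance` is small. The link-content conjecture amounts to asking how the first quantity behaves as the second grows.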
Similar resources
Lexical and semantic clustering by Web links
Recent Web searching and mining tools are combining text and link analysis to improve ranking and crawling algorithms. The central assumption behind such approaches is that there is a correlation between the graph structure of the Web and the text and meaning of pages. Here I formalize and quantitatively validate two conjectures drawing connections from linkage information to lexical and semant...
Correction and Extension of WordNet 1.7 for Knowledge-based Applications
This article presents the transformation of the noun-related part of WordNet (108,000 nouns, 74,500 categories representing their meanings, and 95,000 semantic links between them) into a genuine “lexical ontology”, usable for knowledge representation, sharing and retrieval on the Web. To do so, (i) I generated intuitive identifiers for all the categories, (ii) introduced 353 lexical corrections...
Enhancing Navigability in Websites Built Using Web Content Management Systems
Websites built using Web Content Management Systems (WCMSs) usually provide their users with three alternative access structures to surf their contents: indexes of categories, breadcrumb trails, and sitemaps. In addition, to find contents of his/her interest, a user can perform more or less advanced full-text searches. In this paper we propose an automatic approach to extend the navigation stru...
Standards & best practice for multilingual computational lexicons: ISLE MILE and more
ISLE (International Standards for Language Engineering) is a transatlantic standards oriented initiative under the Human Language Technology (HLT) programme within the EU-US International Research Co-operation. It is a continuation of the European EAGLES (Expert Advisory Group for Language Engineering Standards) initiative, carried out through a number of subsequent projects funded by the Europ...
How the Multilingual Semantic Web can meet the Multilingual Web
The success of the Web is not based on technology. It is rather based on the availability of tooling to create web content, the vast number of content creators providing content, and finally the users who eagerly “digest” the content and are willing to pay for it, being part of various business models. Not only the Web in general, but also the Multilingual Web is growing. More and more content ...
Journal: CoRR
Volume: cs.IR/0108004
Publication date: 2001