HTML Tags as Extraction Cues for Web Page Description Construction
Author
Abstract
Using four previously identified samples of Web pages containing meta-tagged descriptions, the value of meta-tagged keywords, the first 200 characters of the body, and text marked with common HTML tags as extracts helpful for writing summaries was estimated by applying two measures: density of description words and density of two-word description phrases. Generally, titles and keywords showed the highest densities. Parts of the body showed densities not much different from the body as a whole: somewhat higher for the first 200 characters and for text tagged with "center" and "font"; somewhat lower for text tagged with "a"; not significantly different for "table" and "div". Evidence of non-random clumping of description words in the body of some pages nevertheless suggests that further pursuit of automatic passage extraction methods from the body may be worthwhile. Implications of the findings for aids to summarization, and specifically the TexNet32 package, are discussed.
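The two measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so the code below assumes one plausible reading: density is the share of an extract's words (or adjacent two-word phrases) that also occur in the page's meta-tagged description. The function names, the tokenizer, and the sample strings are all hypothetical and are not taken from the paper or from TexNet32.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Return the adjacent two-word phrases of a token list."""
    return list(zip(tokens, tokens[1:]))


def word_density(extract, description):
    """Assumed definition: fraction of extract words found in the description."""
    extract_tokens = tokenize(extract)
    if not extract_tokens:
        return 0.0
    description_words = set(tokenize(description))
    hits = sum(1 for token in extract_tokens if token in description_words)
    return hits / len(extract_tokens)


def phrase_density(extract, description):
    """Assumed definition: fraction of extract bigrams found in the description."""
    extract_bigrams = bigrams(tokenize(extract))
    if not extract_bigrams:
        return 0.0
    description_bigrams = set(bigrams(tokenize(description)))
    hits = sum(1 for pair in extract_bigrams if pair in description_bigrams)
    return hits / len(extract_bigrams)


if __name__ == "__main__":
    # Hypothetical page: a meta description and the text of its title tag.
    description = "Cheap flights and hotel deals"
    title = "Cheap flights today"
    print(word_density(title, description))
    print(phrase_density(title, description))
```

Under this reading, an extract such as a title or keyword list scores high when most of its words reappear in the description, which is the kind of comparison the abstract reports across tags.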
Related resources
Concurrent programming on the web with Webstream
We describe Webstream, a language to simplify the development of client-side web applications, particularly web-aware information agents. Webstream encapsulates web documents as streams of messages passing between concurrent lightweight threads, permitting operations to be carried out lazy-evaluation style while documents are in the process of being retrieved. Streams can be pipelined through f...
Web Content Extraction through Histogram Clustering
We describe a method to extract content text from diverse Web pages by using the HTML document’s Text-To-Tag Ratio (TTR) rather than specific HTML cues that are not constant across various Web pages. We describe how to compute the TTR on a line-by-line basis and then cluster the results into content and non-content areas. The resulting TTR-histogram is not easily clustered because of its one di...
Removing Noise Content from Online News Articles
A typical news web page consists of news articles. Along with the tags for the news article content, it also contains tags for navigation links, privacy and copyright information, and advertisements. These are called noise tags. Given an online news article in HTML form, existing works extract articles by discovering informative tags using various heuristic techniques. In this paper, we follow an a...
White Page Construction from Web Pages for Finding People on the Internet
This paper proposes a method to extract proper names and their associated information from web pages for Internet/Intranet users automatically. The information extracted from World Wide Web documents includes proper nouns, E-mail addresses and home page URLs. Natural language processing techniques are employed to identify and classify proper nouns, which are usually unknown words. The informati...
Structure based Data Extraction from Hidden Web Sources: A Review
In order to extract data from the web pages of Hidden web sources, many semi-automatic and automatic techniques are proposed based on structure and tags of HTML documents. These
Journal title:
- InformingSciJ
Volume: 6
Publication date: 2003