Reconstructing the Boundary of a Web Document

Title: Reconstructing the Boundary of a Web Document
Author: Sweet, James
Abstract: Documents found on the World Wide Web (WWW) may consist of a single web page or of several web pages linked together by a table of contents or some other commonly known document construct. When a document spans multiple web pages, it is often inconvenient to print or download the entire document using available tools. This thesis introduces a concept called the document boundary to facilitate representation and analysis of multi-page web documents, and suggests a two-phase approach to the automated identification of document boundaries. In the first phase, individual pages are examined to determine which links are most likely to be intra-document links. This procedure is applied recursively to identify a group of candidate pages that may be part of the same document. In the second phase, the link topology and other features of the identified pages are examined in aggregate for indications of a multi-page document. A test suite of both single- and multi-page web documents was assembled from a mixture of handpicked documents and documents gathered by an arbitrary third party. The document boundary detection system was applied to the main page of each document and achieved a success rate of 73% when its results were compared to the ground-truth documents.
Record URI: http://hdl.handle.net/1850/15370
Date: 2002-04
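
The two-phase approach described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say which page features or topology measures are used, so the same-directory heuristic for phase one, the table-of-contents ratio for phase two, the 0.5 threshold, the function names, and the example.org link graph below are all assumptions made for illustration.

```python
from posixpath import dirname
from urllib.parse import urlparse


def same_directory(a: str, b: str) -> bool:
    """Assumed phase-1 heuristic: same host and same directory path."""
    pa, pb = urlparse(a), urlparse(b)
    return pa.netloc == pb.netloc and dirname(pa.path) == dirname(pb.path)


def collect_candidates(start, links, seen=None):
    """Phase 1: recursively follow links that look intra-document."""
    if seen is None:
        seen = set()
    seen.add(start)
    for target in links.get(start, []):
        if target not in seen and same_directory(start, target):
            collect_candidates(target, links, seen)
    return seen


def looks_multi_page(start, candidates, links, threshold=0.5):
    """Phase 2 (assumed rule): treat the candidates as one document when
    enough of them are linked directly from the main page, as they would
    be from a table of contents."""
    others = candidates - {start}
    if not others:
        return False
    toc_targets = set(links.get(start, []))
    linked = sum(1 for page in others if page in toc_targets)
    return linked / len(others) >= threshold


def document_boundary(start, links):
    """Return the set of pages judged to belong to the document at start."""
    candidates = collect_candidates(start, links)
    if looks_multi_page(start, candidates, links):
        return candidates
    return {start}


# Hypothetical link graph: page URL -> outgoing links.
SITE = {
    "http://example.org/guide/index.html": [
        "http://example.org/guide/ch1.html",
        "http://example.org/guide/ch2.html",
        "http://example.org/about.html",
    ],
    "http://example.org/guide/ch1.html": [
        "http://example.org/guide/index.html",
        "http://example.org/guide/ch2.html",
    ],
    "http://example.org/guide/ch2.html": [
        "http://example.org/guide/index.html",
    ],
    "http://example.org/about.html": [],
}

boundary = document_boundary("http://example.org/guide/index.html", SITE)
```

In this toy graph the guide's index page and its two chapters form the boundary, while the about page is excluded in phase one because it sits in a different directory. A real system would need to fetch and parse pages and would likely combine several link and page features rather than a single heuristic.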

Files in this item

File: JSweetThesis04-2003.pdf
Size: 2.120Mb
Format: PDF
