Reconstructing the Boundary of a Web Document

dc.contributor.advisor Harrington, Steven
dc.contributor.advisor Jones, Price
dc.contributor.advisor Shaaban, Muhammad
dc.contributor.author Sweet, James
dc.date.accessioned 2012-10-08T15:50:29Z
dc.date.available 2012-10-08T15:50:29Z
dc.date.issued 2002-04
dc.description.abstract Documents found on the World Wide Web (WWW) may be composed of a single web page, or several web pages that are linked together by a table of contents or some other commonly known document construct. When a document spans multiple web pages, it is often inconvenient to print or download the entire document using available tools. This thesis introduces a concept called the document boundary to facilitate representation and analysis of multi-page web documents, and suggests a two-phase approach towards automated identification of document boundaries. In the first phase, individual pages are examined to determine which links are most likely to represent an intra-document link. This procedure is applied recursively to identify a group of candidate pages which may be part of the same document. In the second phase, the link topology and other features of the identified pages are examined in aggregate for indications of a multi-page document. A test suite of both single- and multi-page web documents was assembled using a mixture of handpicked documents and documents which were gathered by an arbitrary third party. The document boundary detection system was applied to the main page of each document. The document boundary detection system was able to achieve a success rate of 73% when its results were compared to the ground truth documents. en_US
dc.language.iso en_US en_US
dc.relation RIT Scholars content from RIT Digital Media Library has moved from to RIT Scholar Works, please update your feeds & links!
dc.subject Computer engineering en_US
dc.subject Document boundary en_US
dc.subject Web page en_US
dc.subject.lcc TK5105.888 .S94 2003
dc.subject.lcsh Web sites en_US
dc.title Reconstructing the Boundary of a Web Document en_US
dc.type Thesis en_US
dc.description.college Kate Gleason College of Engineering en_US
dc.description.department Department of Computer Engineering en_US
dc.contributor.advisorChair Savakis, Andreas

Files in this item

Files Size Format View
JSweetThesis04-2003.pdf 2.120Mb PDF View/Open
