Project ID: 321
Author: Yashwanth Chintala
Project Title: Extracting the Semantic Structure of Web Pages Using the Visual Based Segmentation Algorithm
Semester: 1 2008
Committee Chair: Dr. Long-zhuang Li
Committee Member 1: Dr. Mario Garcia
Committee Member 2: Dr. Thomas David
Project Description: The purpose of this project is to develop an algorithm to extract information from semi-structured Web pages. Many Web applications that use information retrieval, information extraction, and automatic page adaptation can benefit from knowing the content structure of a page. This project presents an automatic, top-down, tag-tree-independent approach to detecting Web content structure. It simulates how a user understands Web layout structure based on visual perception. It also segments the Web page based on its data records, which are the most important information in the whole structure. Compared with other existing techniques, our approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure. The current method works on a large set of Web pages. This project uses the Visual Based Segmentation (VBS) algorithm to extract information from most semi-structured Web pages. It also recommends some rules that improve the performance of the algorithm.
Project URL:   321.pdf
