
Graduate Projects - Details

Computer Science Program

Project ID: 295
Author: Tejashree P. Tendolkar
Project Title: Content-Rich Web Page Segmentation (CWPS)
Semester: Fall 2006
Committee Chair: Dr. Longzhuang Li
Committee Member 1: Dr. Dulal Kar
Committee Member 2: Dr. David Thomas
Project Description: The goal of this project is to calculate visual similarity values based on different cues and to segment the content of a Web page into a subtopic structure. Web pages on the Internet have become an important resource for people to find information of interest. However, Web pages usually contain noise, such as advertisements and navigation bars, which may easily distract users. Furthermore, a Web page may contain content on several topics while a user is interested in only one. In this project, a content-rich Web page segmentation method (CWPS) was implemented based on HTML tag structures and visual cues. Current plain-text topic segmentation methods do not work well on Web pages. All existing Web page structural segmentation algorithms employ HTML tag information to partition a Web page into a set of blocks, each containing related information. Most current structural segmentation algorithms can segment the text of content-rich Web pages only at the paragraph level. The CWPS method consists of three consecutive procedures: structural segmentation, visual similarity calculation, and topic segmentation. In structural segmentation, a page is first partitioned into information blocks using a Web page segmentation algorithm; the noisy blocks are identified and removed, and the text blocks are concatenated. In visual similarity calculation, visual cues are detected, and the visual similarities between sentences are then calculated and integrated with a plain-text segmentation method to detect subtopic boundaries. In topic segmentation, the plain text is segmented into subtopics. The CWPS method takes into account the structural characteristics of Web pages as well as the visual similarities and lexical analysis of the extracted text blocks. Because CWPS calculates visual similarity values based on different visual cues, it gives better subtopic partitioning results than the TextTiling method.
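The idea of integrating visual similarity with lexical similarity to find subtopic boundaries can be illustrated with a minimal sketch. It is not the project's implementation: the cue dictionary (font, size), the weight alpha, the threshold, and all function names are assumptions made for illustration, and a simple cosine comparison of adjacent sentences stands in for a full plain-text segmentation method such as TextTiling.

```python
import math
import re
from collections import Counter


def lexical_similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two sentences."""
    wa = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    wb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(wa[w] * wb[w] for w in wa)
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(
        sum(v * v for v in wb.values())
    )
    return dot / norm if norm else 0.0


def visual_similarity(cues_a, cues_b):
    """Fraction of visual cues (hypothetical keys such as font or size)
    on which two sentences agree."""
    keys = set(cues_a) | set(cues_b)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if cues_a.get(k) == cues_b.get(k))
    return same / len(keys)


def segment(sentences, alpha=0.5, threshold=0.3):
    """Split a list of (text, cues) pairs into subtopic segments.

    A boundary is placed between two adjacent sentences when their
    combined score (alpha * lexical + (1 - alpha) * visual) falls
    below the threshold. Both alpha and threshold are assumed values.
    """
    segments = []
    current = []
    prev = None
    for text, cues in sentences:
        if prev is not None:
            score = alpha * lexical_similarity(prev[0], text) + (
                1 - alpha
            ) * visual_similarity(prev[1], cues)
            if score < threshold:
                segments.append(current)
                current = []
        current.append(text)
        prev = (text, cues)
    if current:
        segments.append(current)
    return segments
```

Called with a list of (sentence, cues) pairs taken from the concatenated text blocks, segment returns a list of sentence groups, one per detected subtopic. Raising alpha makes the split depend more on wording, while lowering it gives more influence to the visual cues.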
Project URL:   295.pdf
© Texas A&M University-Corpus Christi • 6300 Ocean Drive, Corpus Christi, Texas 78412 • 361-825-5700