Graduate Projects


Project ID: 402
Author: Shanxian Mao
Project Title: Top-K Answering Under Uncertain Schema Mappings
Semester: 3 August 2012
Committee Chair: Dr. Longzhuang Li
Committee Member 1: Dr. Ahmen M. Mahdy
Committee Member 2: Dr. Dulal C. Kar
Project Description: The data sources of information systems running on different hardware and software platforms are independent of one another and mutually closed, which makes data exchange difficult. As information technology has evolved, data sharing between internal departments and with external enterprises has become a necessity, and data integration was developed in response. A data integration system bridges isolated sources and offers a platform for information exchange. However, driven by current market needs, big-data sources have become one of the main burdens on the transaction rates of data integration systems. Two semantics, by-table and by-tuple, have been developed to capture top-k answering in a data integration system. Both aim to improve performance when the system encounters uncertain queries or obscure schema mappings between local sources and the centralized system. Although the current algorithms support features for capturing accurate top-k answers and try to avoid accessing all data from the sources, in most cases they cannot effectively minimize the number of traversed items. We therefore propose solutions to improve the efficiency of data integration with uncertainty. In our research, we apply histogram-based approximation to capture an estimated list of top-k results so that large amounts of data can be processed more efficiently. Histogram-based approximation generates approximate values from histograms provided by the sources, and these values are summarized to calculate a confidence for the top-k candidates. In our algorithm, this confidence controls when data processing terminates under both by-table and by-tuple semantics.
Finally, the traditional by-table and by-tuple methods can be applied to produce the true top-k outputs, which can be used to evaluate our new approaches.
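
The histogram-based idea described above can be illustrated with a minimal Python sketch. It is not the project's actual algorithm: the bucket layout, the uniform-spread assumption within a bucket, the function names, and in particular the confidence formula are all assumptions made here for illustration. The sketch estimates how many items lie above a score from per-source histograms, derives an approximate k-th score, and computes a confidence value that a scan could compare against a threshold to decide when to stop.

```python
from typing import List, Tuple

# Hypothetical bucket layout: (low, high, count) means `count` items
# have a score in the range [low, high).
Bucket = Tuple[float, float, int]
Histogram = List[Bucket]


def estimate_count_above(histogram: Histogram, threshold: float) -> float:
    """Estimate how many items score above `threshold`.

    Assumes scores are spread uniformly inside each bucket.
    """
    total = 0.0
    for low, high, count in histogram:
        if threshold <= low:
            total += count  # whole bucket lies above the threshold
        elif threshold < high:
            # only the upper part of the bucket lies above the threshold
            total += count * (high - threshold) / (high - low)
    return total


def estimate_kth_score(histograms: List[Histogram], k: int,
                       tol: float = 1e-6) -> float:
    """Binary-search a score with roughly k items above it across all sources."""
    lo = min(b[0] for h in histograms for b in h)
    hi = max(b[1] for h in histograms for b in h)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        above = sum(estimate_count_above(h, mid) for h in histograms)
        if above > k:
            lo = mid  # too many items above mid: raise the cut-off
        else:
            hi = mid
    return (lo + hi) / 2


def topk_confidence(histograms: List[Histogram], kth_seen_score: float,
                    seen_above: int, k: int) -> float:
    """Illustrative confidence that the current top-k candidates are final.

    Assumed definition: the share of the k slots not threatened by items
    that the histograms predict above the current k-th score but that the
    scan has not yet seen.
    """
    expected = sum(estimate_count_above(h, kth_seen_score) for h in histograms)
    unseen = max(0.0, expected - seen_above)
    return max(0.0, 1.0 - unseen / k)


# Two made-up sources, each summarised by a two-bucket histogram.
source_a = [(0.0, 50.0, 100), (50.0, 100.0, 20)]
source_b = [(0.0, 50.0, 80), (50.0, 100.0, 10)]
kth = estimate_kth_score([source_a, source_b], k=15)
conf = topk_confidence([source_a, source_b], 75.0, seen_above=12, k=15)
```

In a scan loop, processing could stop once the confidence reaches a chosen threshold, which is one way a confidence value can control termination without reading every item from every source.
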
Project URL: 402.pdf