Bimonthly    Since 1986
ISSN 1004-9037
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Distributed by:
China: All Local Post Offices
  • Table of Contents
      15 March 2007, Volume 22 Issue 2   
    Structural Join and Staircase Join Algorithms of Sibling Relationship
    Chang-Xuan Wan and Xi-Ping Liu
    Journal of Data Acquisition and Processing, 2007, 22 (2): 171-181 . 
    Abstract   PDF(446KB) ( 5386 )  
    The processing of XML queries involves evaluating various structural relationships. Efficient algorithms for evaluating ancestor-descendant and parent-child relationships have been proposed, whereas the problems of evaluating preceding-sibling/following-sibling and preceding/following relationships remain open. In this paper, we study the structural join and staircase join for the sibling relationship. First, we introduce the idea of using a parent's structural information to filter out elements and minimize unnecessary reads, which can be used to accelerate structural joins of parent-child and preceding-sibling/following-sibling relationships. Second, we propose two efficient structural join algorithms for the sibling relationship. These algorithms achieve optimal join performance: nodes that do not participate in the join can be identified beforehand and skipped using a B$^{+}$-tree index. Moreover, each element list being joined is scanned sequentially at most once, and the join results are output in document order. We also discuss the staircase join algorithm for sibling axes. Our analysis shows that the staircase join for sibling axes is close to the structural join for sibling axes and shares its high efficiency. Our experimental results demonstrate both the effectiveness of our optimization techniques for sibling axes and the efficiency of our algorithms. To the best of our knowledge, this is the first work addressing this problem specifically.
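The sibling test underlying such joins can be sketched with a region encoding. This is a minimal illustration only, not the paper's B$^{+}$-tree-based algorithms: each element is an assumed tuple (start, end, parent_start), and two elements are siblings exactly when they share parent_start.

```python
# Minimal sketch of a preceding-/following-sibling structural join.
# Elements use a region encoding (start, end, parent_start); two
# elements are siblings iff they share parent_start. The grouping-by-
# parent strategy here is illustrative, not the paper's index-driven
# skipping technique.
from collections import defaultdict

def sibling_join(alist, blist):
    """Return (a, b) pairs where a precedes its sibling b."""
    by_parent = defaultdict(list)
    for a in alist:
        by_parent[a[2]].append(a)          # group candidates by parent
    out = []
    for b in blist:                        # blist is in document order
        for a in by_parent.get(b[2], ()):
            if a[0] < b[0]:                # a starts before b: preceding sibling
                out.append((a, b))
    return out

# elements: (start, end, parent_start)
A = [(2, 3, 1), (8, 9, 7)]
B = [(4, 5, 1), (10, 11, 7)]
print(sibling_join(A, B))  # both pairs share a parent, and A precedes B
```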
    Load Shedding for Window Joins over Streams
    Dong-Hong Han, Guo-Ren Wang, Chuan Xiao, and Rui Zhou
    Journal of Data Acquisition and Processing, 2007, 22 (2): 182-189 . 
    Abstract   PDF(397KB) ( 4688 )  
    We address several load shedding techniques over sliding window joins. We first construct a dual window architectural model comprising aux-windows and join-windows, and build statistics on the aux-windows. With these statistics, we develop an effective load shedding strategy that produces maximum subset join outputs. To accelerate the load shedding process, binary indexed trees are utilized to reduce the cost of shedding evaluation. When streams have high arrival rates, we propose an approach incorporating front-shedding and rear-shedding, and find an optimal trade-off between them. For scenarios with variable speed ratios, we develop a plan that reallocates CPU resources and dynamically resizes the windows. In addition, we prove that load shedding is not affected during the reallocation process. Both synthetic and real data are used in our experiments, and the results show the promise of our strategies.
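The binary indexed (Fenwick) tree mentioned above supports prefix sums with O(log n) update and query, which is what makes repeated shedding-cost evaluation cheap. A standalone sketch of the structure itself, independent of the paper's windows:

```python
# Binary indexed (Fenwick) tree: prefix sums in O(log n) per operation.
# The load-shedding work uses such trees over window statistics; this
# sketch shows only the data structure.
class BIT:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def update(self, i, delta):        # add delta at 1-based index i
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)              # jump to the next covering node

    def prefix_sum(self, i):           # sum of elements 1..i
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)              # strip the lowest set bit
        return s

bit = BIT(8)
for pos, val in [(1, 3), (4, 2), (6, 5)]:
    bit.update(pos, val)
print(bit.prefix_sum(5))   # 3 + 2 = 5
```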
    MRST---An Efficient Monitoring Technology of Summarization on Stream Data
    Xiao-Bo Fan, Ting-Ting Xie, Cui-Ping Li, and Hong Chen
    Journal of Data Acquisition and Processing, 2007, 22 (2): 190-196 . 
    Abstract   PDF(519KB) ( 4323 )  
    Monitoring data streams is an efficient way to acquire the characteristics of a data stream. However, the resources available for each data stream are limited, so how to use limited resources to process an infinite data stream remains an open challenge. In this paper, we adopt wavelet and sliding-window methods to design a multi-resolution summarization data structure, the Multi-Resolution Summarization Tree (MRST), which can be updated incrementally with incoming data, supports point queries, range queries and multi-point queries, and preserves query precision. We use both synthetic and real-world data to evaluate our algorithm. The experimental results indicate that MRST exceeds current algorithms in query efficiency and adaptability, while being simpler to implement.
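One level of the wavelet summarization can be sketched with a single Haar decomposition step: pairwise averages form a coarser summary, pairwise differences keep the detail needed to reconstruct. MRST stacks such levels into a tree; the function below is an illustrative single level only.

```python
# One Haar decomposition step over a window: averages give a half-size
# summary, details allow exact reconstruction. A minimal sketch of the
# wavelet building block, not the full MRST structure.
def haar_step(window):
    averages, details = [], []
    for i in range(0, len(window) - 1, 2):
        a, b = window[i], window[i + 1]
        averages.append((a + b) / 2)
        details.append((a - b) / 2)
    return averages, details

avg, det = haar_step([4, 2, 6, 8])
print(avg, det)
# averages + details reconstruct the window exactly:
rebuilt = [x for a, d in zip(avg, det) for x in (a + d, a - d)]
```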
    CLASCN: Candidate Network Selection for Efficient Top-k Keyword Queries over Databases
    Jun Zhang, Zhao-Hui Peng, Shan Wang, and Hui-Jing Nie
    Journal of Data Acquisition and Processing, 2007, 22 (2): 197-207 . 
    Abstract   PDF(477KB) ( 9881 )  
    Keyword Search Over Relational Databases (KSORD) enables casual or Web users to access databases easily through free-form keyword queries. Improving the performance of KSORD systems is a critical issue in this area. In this paper, a new approach, CLASCN (Classification, Learning And Selection of Candidate Networks), is developed to efficiently perform top-$k$ keyword queries in schema-graph-based online KSORD systems. In this approach, the Candidate Networks (CNs) from trained keyword queries or executed user queries are classified and stored in the database, and the top-$k$ results from the CNs are learned to construct CN Language Models (CNLMs). The CNLMs are used to compute similarity scores between a new user query and the stored CNs. The CNs with relatively large similarity scores, which are the most promising ones to produce top-$k$ results, are selected and executed. Currently, CLASCN is applicable only to past queries and New All-keyword-Used (NAU) queries, which are frequently submitted queries. Extensive experiments also show the efficiency and effectiveness of our CLASCN approach.
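The query-to-CN scoring step can be sketched as follows. The actual CNLMs are learned from top-$k$ results; here each "model" is just an assumed term-frequency dictionary and the similarity is plain cosine, as one plausible instantiation.

```python
# Scoring candidate networks against a keyword query: cosine similarity
# between the query's terms and each CN's unigram model. CN names and
# term counts below are illustrative, not from the paper.
import math

def cosine(p, q):
    terms = set(p) | set(q)
    dot = sum(p.get(t, 0) * q.get(t, 0) for t in terms)
    np_ = math.sqrt(sum(v * v for v in p.values()))
    nq = math.sqrt(sum(v * v for v in q.values()))
    return dot / (np_ * nq) if np_ and nq else 0.0

cn_models = {
    "Author-Paper": {"author": 5, "paper": 7, "title": 2},
    "Conference-Paper": {"conference": 6, "paper": 4, "year": 3},
}
query = {"author": 1, "paper": 1}
best = max(cn_models, key=lambda cn: cosine(query, cn_models[cn]))
print(best)  # the CN whose model best matches the query terms
```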
    Composite Distance Transformation for Indexing and k-Nearest-Neighbor Searching in High-Dimensional Spaces
    Yi Zhuang, Yue-Ting Zhuang, and Fei Wu
    Journal of Data Acquisition and Processing, 2007, 22 (2): 208-217 . 
    Abstract   PDF(545KB) ( 4406 )  
    Due to the well-known curse of dimensionality, search in a high-dimensional space is considered a ``hard'' problem. In this paper, a novel {\it composite distance transformation} method, called CDT, is proposed to support fast $k$-nearest-neighbor ($k$-NN) search in high-dimensional spaces. In CDT, all $n$ data points are first grouped into clusters by a $k$-Means clustering algorithm. Then the composite distance key of each data point is computed. Finally, the index keys of the $n$ data points are inserted into a partition-based B$^{+}$-tree. Thus, given a query point, a $k$-NN search in a high-dimensional space is transformed into a search in a single-dimensional space with the aid of the CDT index. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme. Our results show that this method outperforms state-of-the-art high-dimensional search techniques such as the X-Tree, VA-file, iDistance and NB-Tree.
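The cluster-then-key idea can be sketched as below. The exact composite key in the paper is not reproduced; this uses the simplest plausible form, key = cluster_id * STRETCH + distance-to-center (as in iDistance-style schemes), so that each cluster occupies a disjoint one-dimensional range.

```python
# Sketch of mapping a high-dimensional point to a 1-D key: assign it to
# its nearest cluster center, then offset its distance-to-center into
# that cluster's key range. STRETCH and the key formula are illustrative.
import math

STRETCH = 1000.0   # must exceed any within-cluster distance

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def composite_key(point, centers):
    cid = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return cid * STRETCH + dist(point, centers[cid])

centers = [(0.0, 0.0), (10.0, 10.0)]
pts = [(1.0, 0.0), (9.0, 10.0), (0.0, 2.0)]
keys = sorted(composite_key(p, centers) for p in pts)
print(keys)  # points near center 0 get keys < STRETCH; near center 1, >= STRETCH
```

In the paper the sorted keys live in a partition-based B$^{+}$-tree rather than a Python list, but the transformation itself is the same one-dimensionalization.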
    An Ontology-Based Approach for Semantic Conflict Resolution in Database Integration
    Qiang Liu, Tao Huang, Shao-Hua Liu, and Hua Zhong
    Journal of Data Acquisition and Processing, 2007, 22 (2): 218-227 . 
    Abstract   PDF(442KB) ( 5948 )  
    An important task in database integration is to resolve data conflicts at both the schema level and the semantic level; the latter is especially difficult. Some existing ontology-based approaches have been criticized for their lack of domain generality and semantic richness. Aiming to overcome these limitations, this paper introduces a systematic approach for detecting and resolving various semantic conflicts in heterogeneous databases. The approach includes two important parts: a semantic conflict representation model based on our classification framework of semantic conflicts, and a methodology for detecting and resolving semantic conflicts based on this model. A system implementing the approach has been developed, and experimental evaluations on it indicate that the approach can resolve most semantic conflicts effectively while remaining independent of domains and integration patterns.
    A Novel Approach to Clustering Merchandise Records
    Tao-Yuan Cheng and Shan Wang
    Journal of Data Acquisition and Processing, 2007, 22 (2): 228-231 . 
    Abstract   PDF(267KB) ( 4485 )  
    Object identification is one of the major challenges in integrating data from multiple information sources. In the absence of global identifiers, it is hard to find all records referring to the same object in an integrated database. Traditional object identification techniques tend to use character-based or vector-space-model-based similarity computation, but these do not work well on merchandise databases. This paper puts forward a new approach to object identification. First, we use merchandise images to judge whether two records belong to the same object; then, we use a Na\"\i ve Bayesian model to judge whether two merchandise names have similar meanings. Experiments on data downloaded from shopping websites show good performance.
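The name-matching step can be caricatured as follows. This is a deliberately simplified stand-in for the paper's Na\"\i ve Bayesian model: word overlap is treated as independent evidence for a match, and the priors and likelihood ratios below are made-up illustrations.

```python
# Toy naive-Bayes-style score for whether two merchandise names refer
# to the same object. Each shared token multiplies the odds of a match,
# each unshared token lowers them (naive independence). The factors
# 4.0 and 0.5 are illustrative, not learned.
def tokens(name):
    return set(name.lower().split())

def same_object_score(name1, name2, p_match=0.5):
    t1, t2 = tokens(name1), tokens(name2)
    odds = p_match / (1 - p_match)
    for t in t1 | t2:
        odds *= 4.0 if (t in t1 and t in t2) else 0.5
    return odds / (1 + odds)   # convert odds back to a probability

a = same_object_score("canon eos camera", "canon eos digital camera")
b = same_object_score("canon eos camera", "nikon lens cap")
print(a > b)  # shared brand/model words push the first pair's score higher
```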
    Efficient k-Nearest-Neighbor Search Algorithms for Historical Moving Object Trajectories
    Yun-Jun Gao, Chun Li, Gen-Cai Chen, Ling Chen, Xian-Ta Jiang, and Chun Chen
    Journal of Data Acquisition and Processing, 2007, 22 (2): 232-244 . 
    Abstract   PDF(655KB) ( 5800 )  
    $k$-Nearest-Neighbor ($k$NN) search is one of the most important operations in spatial and spatio-temporal databases. Although it has received considerable attention in the database literature, there is little prior work on $k$NN retrieval for moving object trajectories. Motivated by this observation, this paper studies the problem of efficiently processing $k$NN $(k\ge 1)$ search on R-tree-like structures storing historical information about moving object trajectories. Two algorithms based on the best-first traversal paradigm are developed, called BFP$k$NN and BFT$k$NN, which handle $k$NN retrieval with respect to a static query point and a moving query trajectory, respectively. Both algorithms minimize the number of node accesses; that is, they access only those qualifying nodes that may contain the final result, and each such node only once. Several effective pruning heuristics are also presented to save main-memory consumption and further reduce CPU cost. Extensive experiments with synthetic and real datasets confirm that the proposed algorithms significantly outperform their competitors in both efficiency and scalability.
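The best-first traversal paradigm the two algorithms build on can be sketched generically: a priority queue ordered by minimum possible distance visits only nodes that can still contribute to the answer. The tiny hand-built hierarchy below is an illustrative stand-in for the paper's trajectory R-tree, and the code handles only a static query point.

```python
# Best-first k-NN over an R-tree-like hierarchy. mindist lower-bounds
# the distance from the query point to anything inside a rectangle, so
# a popped leaf entry is a confirmed next-nearest neighbor.
import heapq, math, itertools

def mindist(q, rect):
    (x, y), ((x1, y1), (x2, y2)) = q, rect
    dx = max(x1 - x, 0, x - x2)
    dy = max(y1 - y, 0, y - y2)
    return math.hypot(dx, dy)

def best_first_knn(root, q, k):
    counter = itertools.count()        # tie-breaker so dicts are never compared
    heap = [(mindist(q, root["mbr"]), next(counter), root)]
    result = []
    while heap and len(result) < k:
        d, _, node = heapq.heappop(heap)
        if "point" in node:            # leaf entry: confirmed neighbor
            result.append((d, node["point"]))
        else:
            for child in node["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(counter), child))
    return result

leaf = lambda p: {"point": p, "mbr": (p, p)}
root = {"mbr": ((0, 0), (10, 10)), "children": [
    {"mbr": ((0, 0), (4, 4)), "children": [leaf((1, 1)), leaf((3, 4))]},
    {"mbr": ((6, 6), (10, 10)), "children": [leaf((7, 8)), leaf((9, 9))]},
]}
print(best_first_knn(root, (2, 2), 2))  # the two nearest points, nearest first
```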
    Indexing Future Trajectories of Moving Objects in a Constrained Network
    Ji-Dong Chen and Xiao-Feng Meng
    Journal of Data Acquisition and Processing, 2007, 22 (2): 245-251 . 
    Abstract   PDF(380KB) ( 3938 )  
    Advances in wireless sensor networks and positioning technologies enable new applications that monitor moving objects. Some of these applications, such as traffic management, require the ability to query the future trajectories of the objects. In this paper, we propose an original data access method, the ANR-tree, which supports predictive queries. We focus on real-life environments where objects move within constrained networks, such as vehicles on roads. We introduce a simulation-based prediction model built on graphs of cellular automata, which makes full use of the network constraints and the stochastic behavior of traffic. Our technique differs strongly from the linear prediction model, which has low prediction accuracy and requires frequent updates when applied to real traffic with frequently changing velocities. The data structure extends the R-tree with adaptive units, which group neighboring objects with similar movement patterns. The predicted movement of an adaptive unit is given not by a single trajectory but by two trajectory bounds, based on different assumptions about traffic conditions and obtained from the simulation. Our experiments, carried out on two different datasets, show that the ANR-tree is roughly one order of magnitude more efficient than the TPR-tree, and much more scalable.
    An Adaptive Approach to Schema Classification for Data Warehouse Modeling
    Hong-Ding Wang, Yun-Hai Tong, Shao-Hua Tan, Shi-Wei Tang, Dong-Qing Yang, and Guo-Hui Sun
    Journal of Data Acquisition and Processing, 2007, 22 (2): 252-260 . 
    Abstract   PDF(525KB) ( 4195 )  
    Data warehouse (DW) modeling is a complicated task, involving both knowledge of business processes and familiarity with the structure and behavior of operational information systems. Existing DW modeling techniques suffer from major drawbacks: the data-driven approach requires a high level of expertise and neglects the requirements of end users, while the demand-driven approach lacks an enterprise-wide vision and disregards existing models of the underlying operational systems. To make up for these shortcomings, a method for classifying schema elements for DW modeling is proposed in this paper. We first put forward vector space models for subjects and schema elements, then present an adaptive approach with self-tuning theory to construct the context vectors of subjects, and finally classify the source schema elements into the different subjects of the DW automatically. Benefiting from the results of this schema element classification, designers can model and construct a DW more easily.
    A Novel Approach to Revealing Positive and Negative Co-Regulated Genes
    Yu-Hai Zhao, Guo-Ren Wang, Ying Yin, and Guang-Yu Xu
    Journal of Data Acquisition and Processing, 2007, 22 (2): 261-272 . 
    Abstract   PDF(739KB) ( 3938 )  
    As biologists have found, there is a real and emerging need to identify co-regulated gene clusters, including both positively and negatively regulated ones. However, existing pattern-based and tendency-based clustering approaches are designed only for finding positively regulated gene clusters. In this paper, a new subspace clustering model called {g-Cluster} is proposed for gene expression data. The proposed model has the following advantages: $1)$ it finds both positively and negatively co-regulated genes in one pass, $2)$ it is free of the restriction of a magnitude transformation relationship among co-regulated genes, and $3)$ it guarantees the quality of clusters and the significance of regulations using a novel similarity measure, {gCode}, and a user-specified regulation threshold $\delta$, respectively. No previous work measures up to this task. Moreover, an MDL technique is introduced to avoid generating insignificant g-Clusters. A tree structure, the GS-tree, is also designed, along with two algorithms combining efficient pruning and optimization strategies to identify all qualified g-Clusters. Extensive experiments are conducted on real and synthetic datasets. The experimental results show that $1)$ the algorithms find a number of co-regulated gene clusters missed by previous models, which are potentially of high biological significance, and $2)$ the algorithms are effective and efficient, outperforming existing approaches.
    Efficient Execution of Multiple Queries on Deep Memory Hierarchy
    Yan Zhang, Zhi-Feng Chen, and Yuan-Yuan Zhou
    Journal of Data Acquisition and Processing, 2007, 22 (2): 273-279 . 
    Abstract   PDF(452KB) ( 4100 )  
    This paper proposes a complementary novel idea, called MiniTasking, to further reduce the number of cache misses by improving data {\it temporal locality} for {\it multiple} concurrent queries. Our idea is based on the observation that in many workloads, such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing to improve data temporal locality by scheduling query execution at three levels: query-level batching, operator-level grouping and mini-task-level scheduling. Experimental results with various types of concurrent TPC-H query workloads show that, with the traditional N-ary Storage Model (NSM) layout, MiniTasking significantly reduces L2 cache misses by up to 83\%, thereby achieving a 24\% reduction in execution time. With the Partition Attributes Across (PAX) layout, MiniTasking further reduces cache misses by 65\% and execution time by 9\%. For the TPC-H throughput test workload, MiniTasking improves end performance by up to 20\%.
    AbIx: An Approach to Content-Based Approximate Query Processing in Peer-to-Peer Data Systems
    Chao-Kun Wang, Jian-Min Wang, Jia-Guang Sun, Sheng-Fei Shi, and Hong Gao
    Journal of Data Acquisition and Processing, 2007, 22 (2): 280-286 . 
    Abstract   PDF(452KB) ( 3884 )  
    In recent years there has been significant interest in peer-to-peer (P2P) environments in the data management community. However, almost all work so far focuses on exact query processing in current P2P data systems, and the autonomy of peers is not sufficiently considered. In addition, system cost is very high because shared data are published per document instead of per document set. In this paper, abstract indices (AbIx) are presented to implement content-based approximate queries in centralized, distributed and structured P2P data systems. AbIx can be used to search as few peers as possible while obtaining as many results satisfying users' queries as possible, and it guarantees a high degree of peer autonomy. Abstract indices also have low system cost, improve query processing speed, and support very frequent updates as well as the set-based information publishing method. To verify the effectiveness of abstract indices, a simulator of 10,000 peers with over 3 million documents was built, and several metrics were proposed. The experimental results show that abstract indices work well in various P2P data systems.
    Analyzing Sequential Patterns in Retail Databases
    Unil Yun
    Journal of Data Acquisition and Processing, 2007, 22 (2): 287-296 . 
    Abstract   PDF(453KB) ( 4648 )  
    Finding correlated sequential patterns in large sequence databases is one of the essential tasks in data mining, since a huge number of sequential patterns are usually mined but it is hard to identify the ones that are actually correlated, and the data analysis needed differs with the requirements of each real application. In previous mining approaches, sequential patterns with weak affinity are found even with a high minimum support. In this paper, a new framework is suggested for mining weighted support affinity patterns, in which an objective measure, sequential ws-confidence, is developed to detect correlated sequential patterns with weighted support affinity. To efficiently prune weak affinity patterns, it is proved that the ws-confidence measure satisfies the anti-monotone and cross-weighted-support properties, which can be applied to eliminate sequential patterns with dissimilar weighted support levels. Based on this framework, a weighted support affinity pattern mining algorithm (WSMiner) is suggested. The performance study shows that WSMiner is efficient and scalable for mining weighted support affinity patterns.
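The flavor of such an affinity measure can be sketched as below. The paper's exact ws-confidence definition is not reproduced here; the sketch uses one plausible variant, the ratio of the smallest to the largest weighted support among a pattern's items, which is 1.0 when items have identical weighted support and near 0 when they are dissimilar.

```python
# Illustrative weighted-support affinity. weighted_support scales plain
# support by a per-item weight; ws_affinity compares items within a
# pattern. Definitions and data are illustrative, not the paper's.
def weighted_support(item, sequences, weights):
    sup = sum(1 for s in sequences if item in s) / len(sequences)
    return sup * weights.get(item, 1.0)

def ws_affinity(pattern, sequences, weights):
    ws = [weighted_support(i, sequences, weights) for i in pattern]
    return min(ws) / max(ws) if max(ws) > 0 else 0.0

seqs = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
w = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.3}
print(ws_affinity(["a", "b"], seqs, w))  # similar weighted supports -> high
print(ws_affinity(["a", "c"], seqs, w))  # dissimilar weighted supports -> low
```

A pruning threshold on such a ratio discards patterns whose items have dissimilar weighted support levels, which is the role ws-confidence plays in the framework.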
    Tree Expressions for Information Systems
    Min Zhao, Su-Qing Han, and Jue Wang
    Journal of Data Acquisition and Processing, 2007, 22 (2): 297-307 . 
    Abstract   PDF(474KB) ( 3996 )  
    The discernibility matrix is one of the most important approaches to computing the positive region, reducts, the core and value reducts in rough sets. The subject of this paper is to develop a parallel counterpart of it, called the tree expression. Its computational complexity for the positive region and reducts is $O(m^{2}\times n)$ instead of the $O(m\times n^{2})$ of the discernibility-matrix-based approach, and does not exceed $O(n^{2})$ for the other concepts in rough sets, where $m$ and $n$ are the numbers of attributes and objects, respectively, in a given dataset (also called an ``{\it information system}'' in rough sets). This approach suits information systems with $n\gg m$ that contain over one million objects.
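For reference, the positive region itself (the concept both approaches compute) is the set of objects whose condition-attribute values determine the decision unambiguously. A plain partition-based computation is sketched below; the paper's tree expression is a faster organization of this kind of computation, not the naive version shown.

```python
# Positive region in rough sets: partition objects by their condition-
# attribute values; a block belongs to the positive region iff all its
# objects share one decision value. The toy table is illustrative.
from collections import defaultdict

def positive_region(objects, cond, dec):
    """objects: list of dicts; cond: condition attrs; dec: decision attr."""
    blocks = defaultdict(list)
    for i, obj in enumerate(objects):
        blocks[tuple(obj[a] for a in cond)].append(i)
    pos = []
    for members in blocks.values():
        decisions = {objects[i][dec] for i in members}
        if len(decisions) == 1:            # consistent block
            pos.extend(members)
    return sorted(pos)

table = [
    {"headache": 1, "temp": "high", "flu": "yes"},
    {"headache": 1, "temp": "high", "flu": "yes"},
    {"headache": 0, "temp": "high", "flu": "yes"},
    {"headache": 0, "temp": "high", "flu": "no"},   # conflicts with the row above
]
print(positive_region(table, ["headache", "temp"], "flu"))  # rows 2 and 3 excluded
```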
    Computational Mechanisms for Metaphor in Languages: A Survey
    Chang-Le Zhou, Yun Yang, and Xiao-Xi Huang
    Journal of Data Acquisition and Processing, 2007, 22 (2): 308-319 . 
    Abstract   PDF(447KB) ( 5701 )  
    Metaphor computation has attracted more and more attention because metaphor is, to some extent, the focus of mind and language mechanisms. However, it encounters problems due not only to the rich expressive power of natural language but also to the cognitive nature of human beings. Machine understanding of metaphor is therefore becoming a bottleneck in natural language processing and machine translation. This paper first suggests how a metaphor is understood and then presents a survey of current computational approaches, in terms of their linguistic historical roots, underlying foundations, methods and techniques currently used, advantages, limitations, and future trends. A comparison between metaphors in English and Chinese is also introduced; because Chinese metaphor computation is just at its starting stage compared with its development for English, a separate summary of current progress in Chinese metaphor computation is presented. In conclusion, a few suggestions are proposed for further research on metaphor computation, especially Chinese metaphor computation.
    A Heuristic Clustering Algorithm for Mining Communities in Signed Networks
    Bo Yang and Da-You Liu
    Journal of Data Acquisition and Processing, 2007, 22 (2): 320-328 . 
    Abstract   PDF(624KB) ( 4741 )  
    A signed network is an important kind of complex network that includes both positive and negative relations. Communities of a signed network are defined as groups of vertices within which positive relations are dense and between which negative relations are dense. Being able to identify the communities of a signed network is helpful for analyzing such networks. Many algorithms for detecting network communities have been developed, but most of them are designed exclusively for networks containing only positive relations and are not suitable for signed networks, so the problem of mining the communities of signed networks quickly and correctly has not been solved satisfactorily. In this paper, we propose a heuristic algorithm to address this issue. Compared with major existing methods, our approach has three distinct features. First, it is very fast, with roughly linear time with respect to network size. Second, it exhibits good clustering capability and, in particular, works well with complex networks without well-defined community structures. Finally, it is insensitive to its built-in parameters and requires no prior knowledge.
    A Coarse-to-Fine Method for Shape Recognition
    Hui-Xuan Tang and Hui Wei
    Journal of Data Acquisition and Processing, 2007, 22 (2): 329-333 . 
    Abstract   PDF(315KB) ( 4202 )  
    In this paper the deformation-invariant curve matching problem is addressed. The proposed approach exploits an image pyramid to constrain the correspondence search at a finer level using the correspondences found at a coarser level. In comparison to previous methods, this approach conveys much richer information: curve topology, affine geometry and local intensity are combined to seek correspondences. In experiments, the method is tested in two applications, contour matching and shape recognition, and the results show that the approach is effective under perspective and articulated deformations.
    Text Classification Using Sentential Frequent Itemsets
    Shi-Zhu Liu and He-Ping Hu
    Journal of Data Acquisition and Processing, 2007, 22 (2): 334-ver . 
    Abstract   PDF(274KB) ( 5383 )  
    Text classification techniques mostly rely on single-term analysis of the document data set, while many concepts, especially specific ones, are conveyed by sets of terms. To build a more accurate text classifier, more informative features, including frequently co-occurring words in the same sentence and their weights, are particularly important. In this paper, we propose a novel approach to text classification using sentential frequent itemsets, a concept from association rule mining: a sentence rather than a document is viewed as a transaction, and a variable-precision rough set based method is used to evaluate each sentential frequent itemset's contribution to the classification. Experiments over the Reuters and newsgroup corpora validate the practicability of the proposed system.
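The sentence-as-transaction idea can be sketched directly: count co-occurring word sets per sentence and keep those meeting a minimum support. The thresholds and the naive pairwise enumeration below are illustrative, not the paper's exact mining pipeline.

```python
# Each sentence is a transaction; word pairs that co-occur in at least
# min_support sentences become candidate features for the classifier.
from itertools import combinations
from collections import Counter

def sentential_itemsets(doc, min_support=2, size=2):
    counts = Counter()
    for sentence in doc.split("."):
        words = sorted(set(sentence.lower().split()))
        for combo in combinations(words, size):   # itemsets within one sentence
            counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

doc = ("interest rates rise. central bank raises interest rates. "
       "rates rise again")
print(sentential_itemsets(doc))  # pairs supported by at least two sentences
```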


Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China

E-mail: info@sjcjycl.cn
  Copyright ©2015 JCST, All Rights Reserved