Loading...
Bimonthly    Since 1986
ISSN 1004-9037
/
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Content
      10 July 2008, Volume 23 Issue 4   
    For Selected: View Abstracts Toggle Thumbnails
    Database and Knowledge-Based Systems
    Clustering by Pattern Similarity
    Haixun Wang and Jian Pei
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 481-496 . 
    Abstract   PDF(583KB) ( 1618 )  
    The task of clustering is to identify classes of {\em similar} objects among a set of objects. The definition of similarity varies from one clustering model to another. However, in most of these models the concept of similarity is often based on such metrics as Manhattan distance, Euclidean distance or other $L_p$ distances. In other words, similar objects must have {\em close values} in at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the {\it pCluster} model we proposed, two objects are similar if they exhibit a {\em coherent pattern} on a subset of dimensions. The new similarity concept models a wide range of applications. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, because it is able to capture not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. In addition to the novel similarity model, this paper also introduces an effective and efficient algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its performance.
    PGG: An Online Pattern Based Approach for Stream Variation Management
    Lu-An Tang, Bin Cui, Hong-Yan Li, Gao-Shan Miao, Dong-Qing Yang, and Xin-Biao Zhou
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 497-515 . 
    Abstract   PDF(7701KB) ( 1204 )  
    Many database applications require efficient processing of data streams with value variations and fluctuant sampling frequency. The variations typically imply fundamental features of the stream and important domain knowledge of underlying objects. In some data streams, successive events seem to recur in a certain time interval, but the data indeed evolves with tiny differences as time elapses. This feature, so called {\it pseudo periodicity}, poses a new challenge to stream variation management. This study focuses on the online management for variations over such streams. The idea can be applied to many scenarios such as patient vital signal monitoring in medical applications. This paper proposes a new method named Pattern Growth Graph (PGG) to detect and manage variations over evolving streams with following features: 1) adopts the wave-pattern to capture the major information of data evolution and represent them compactly; 2) detects the variations in a single pass over the stream with the help of wave-pattern matching algorithm; 3) only stores different segments of the pattern for incoming stream, and hence substantially compresses the data without losing important information; 4) distinguishes meaningful data changes from noise and reconstructs the stream with acceptable accuracy. Extensive experiments on real datasets containing millions of data items, as well as a prototype system, are carried out to demonstrate the feasibility and effectiveness of the proposed scheme.
    Updating Recursive XML Views of Relations
    Byron Choi, Gao Cong, Wenfei Fan, and Stratis D. Viglas
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 516-537 . 
    Abstract   PDF(808KB) ( 1752 )  
    This paper investigates the view update problem for {\sc XML} views published from relational data. We consider {\sc XML} views defined in terms of mappings directed by possibly recursive {\sc DTD}s compressed into {\sc DAG}s and stored in relations. We provide new techniques to efficiently support {\sc XML} view updates specified in terms of {\sc XP}ath expressions with recursion and complex filters. The interaction between {\sc XP}ath recursion and {\sc DAG} compression of {\sc XML} views makes the analysis of the {\sc XML} view update problem rather intriguing. Furthermore, many issues are still open even for relational view updates, and need to be explored. In response to these, on the {\sc XML} side, we revise the notion of side effects and update semantics based on the semantics of {\sc XML} views, and present efficient algorithms to translate {\sc XML} updates to relational view updates. On the relational side, we propose a mild condition on {\sc SPJ} views, and show that under this condition the analysis of deletions on relational views becomes { PTIME} while the insertion analysis is { NP}-complete. We develop an efficient algorithm to process relational view deletions, and a heuristic algorithm to handle view insertions. Finally, we present an experimental study to verify the effectiveness of our techniques.
    Continually Answering Constraint $\pmb k$-{\it\bfseries NN} Queries in Unstructured P2P Systems
    Bin Wang, Xiao-Chun Yang, Guo-Ren Wang, Ge Yu, Lei Chen, X. Sean Wang, and Xue-Min Lin
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 538-556 . 
    Abstract   PDF(3308KB) ( 1373 )  
    We consider the problem of efficiently computing distributed geographical $k$-{\it NN} queries in an unstructured peer-to-peer (P2P) system, in which each peer is managed by an individual organization and can only communicate with its logical neighboring peers. Such queries are based on local filter query statistics, and require as less communication cost as possible, which makes it more difficult than the existing distributed $k$-{\it NN} queries. Especially, we hope to reduce candidate peers and degrade communication cost. In this paper, we propose an efficient pruning technique to minimize the number of candidate peers to be processed to answer the $k$-{\it NN} queries. Our approach is especially suitable for continuous $k$-{\it NN} queries when updating peers, including changing ranges of peers, dynamically leaving or joining peers, and updating data in a peer. In addition, simulation results show that the proposed approach outperforms the existing Minimum Bounding Rectangle (MBR)-based query approaches, especially for continuous queries.
    Machine Learning and Data Mining
    New Information Distance Measure and Its Application in Question Answering System
    Xian Zhang, Yu Hao, Xiao-Yan Zhu, and Ming Li
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 557-572 . 
    Abstract   PDF(1124KB) ( 1796 )  
    In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, hence ranking different answers. We demonstrate that such a distance can be precisely and mathematically defined. Not only such a definition is possible, it is actually provably better than any other feasible definitions. Not only such an ultimate definition is possible, but also it can be conveniently and fruitfully applied to construct a QA system. We have built such a system ------ QUANTA. Extensive experiments are conducted to justify the new theory.
    Cluster-Based Nearest-Neighbour Classifier and Its Application on the Lightning Classification
    Loris Nanni and Alessandra Lumini
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 573-581 . 
    Abstract   PDF(3230KB) ( 1462 )  
    The problem addressed in this paper concerns the prototype generation for a cluster-based nearest-neighbour classifier. It considers, to classify a test pattern, the lines that link the patterns of the training set and a set of prototypes. An efficient method based on clustering is here used for finding subgroups of similar patterns with centroid being used as prototype. A learning method is used for iteratively adjusting both position and local-metric of the prototypes. Finally, we show that a simple adaptive distance measure improves the performance of our nearest-neighbour-based classifier. The performance improvement with respect to other nearest-neighbour-based classifiers is validated by testing our method on a lightning classification task using data acquired from the Fast On-orbit Recording of Transient Events (FORTE) satellite, moreover the performance improvement is validated through experiments with several benchmark datasets. The performance of the proposed methods are also validated using the Wilcoxon Signed-Rank test.
    Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis
    Xiao-Yong Fang, Zhi-Gang Luo, and Zheng-Hua Wang
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 582-589 . 
    Abstract   PDF(426KB) ( 2021 )  
    Stochastic context-free grammars (SCFGs) have been applied to predicting RNA secondary structure. The prediction of RNA secondary structure can be facilitated by incorporating with comparative sequence analysis. However, most of existing SCFG-based methods lack explicit phylogenic analysis of homologous RNA sequences, which is probably the reason why these methods are not ideal in practical application. Hence, we present a new SCFG-based method by integrating phylogenic analysis with the newly defined profile SCFG. The method can be summarized as: 1) we define a new profile SCFG, $M$, to depict consensus secondary structure of multiple RNA sequence alignment; 2) we introduce two distinct hidden Markov models, $\la$ and $\la'$, to perform phylogenic analysis of homologous RNA sequences. Here, $\la$ is for non-structural regions of the sequence and $\la'$ is for structural regions of the sequence; 3) we merge $\la$ and $\la'$ into $M$ to devise a combined model for prediction of RNA secondary structure. We tested our method on data sets constructed from the Rfam database. The {\it sensitivity} and {\it specificity} of our method are more accurate than those of the predictions by Pfold.
    Query Performance Prediction for Information Retrieval Based on Covering Topic Score
    Hao Lang, Bin Wang, Gareth Jones, Jin-Tao Li, Fan Ding, and Yi-Xuan Liu
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 590-601 . 
    Abstract   PDF(771KB) ( 1576 )  
    We present a statistical method called Covering Topic Score (CTS) to predict query performance for information retrieval. Estimation is based on how well the topic of a user's query is covered by documents retrieved from a certain retrieval system. Our approach is conceptually simple and intuitive, and can be easily extended to incorporate features beyond bag-of-words such as phrases and proximity of terms. Experiments demonstrate that CTS significantly correlates with query performance in a variety of TREC test collections, and in particular CTS gains more prediction power benefiting from features of phrases and proximity of terms. We compare CTS with previous state-of-the-art methods for query performance prediction including clarity score and robustness score. Our experimental results show that CTS consistently performs better than, or at least as well as, these other methods. In addition to its high effectiveness, CTS is also shown to have very low computational complexity, meaning that it can be practical for real applications.
    Predicting Chinese Abbreviations from Definitions: An Empirical Learning Approach Using Support Vector Regression
    Xu Sun, Hou-Feng Wang, and Bo Wang
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 602-611 . 
    Abstract   PDF(762KB) ( 1960 )  
    In Chinese, phrases and named entities play a central role in information retrieval. Abbreviations, however, make keyword-based approaches less effective. This paper presents an empirical learning approach to Chinese abbreviation prediction. In this study, each abbreviation is taken as a reduced form of the corresponding definition (expanded form), and the abbreviation prediction is formalized as a scoring and ranking problem among abbreviation candidates, which are automatically generated from the corresponding definition. By employing Support Vector Regression (SVR) for scoring, we can obtain multiple abbreviation candidates together with their SVR values, which are used for candidate ranking. Experimental results show that the SVR method performs better than the popular heuristic rule of abbreviation prediction. In addition, in abbreviation prediction, the SVR method outperforms the hidden Markov model (HMM).
    Scaling Conditional Random Fields by One-Against-the-Other Decomposition
    Hai Zhao and Chunyu Kit
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 612-619 . 
    Abstract   PDF(329KB) ( 1480 )  
    As a powerful sequence labeling model, conditional random fields (CRFs) have had successful applications in many natural language processing (NLP) tasks. However, the high complexity of CRFs training only allows a very small tag (or label) set, because the training becomes intractable as the tag set enlarges. This paper proposes an improved decomposed training and joint decoding algorithm for CRF learning. Instead of training a single CRF model for all tags, it trains a binary sub-CRF independently for each tag. An optimal tag sequence is then produced by a joint decoding algorithm based on the probabilistic output of all sub-CRFs involved. To test its effectiveness, we apply this approach to tackling Chinese word segmentation (CWS) as a sequence labeling problem. Our evaluation shows that it can reduce the computational cost of this language processing task by 40--50\% without any significant performance loss on various large-scale data sets.
    Architecture and High Performance Computer Systems
    Making Effective Decisions in Computer Architects Real-World: Lessons and Experiences with Godson-2 Processor Designs
    Wei-Wu Hu and Jian Wang
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 620-632 . 
    Abstract   PDF(5784KB) ( 1927 )  
    Although the design of many kinds of microprocessors has been under developing for several decades, the computer architecture R\&D community lacks well documented lessons and experiences about design decisions in the research literature. In this paper, we systematically present the design decisions we made during the designing and prototyping of Godson-2 series processors. The 250MHz Godson-2B, 450MHz Godson-2C, and 1GHz Godson-2E processors that implement 64-bit, four-issue, out-of-order architecture were taped out in 2003, 2004, and 2005, respectively. Each processor triples its predecessor in the SPEC CPU2000 rates. Our first-hand experiences and lessons gained from these designs would provide unique perspectives and insights that are not available in any existing text books and/or published papers. We summarize 10 critical lessons and experiences based on hundreds of our attempts at architectural and design optimizations for performance improvement of Godson-2 series processors. The issues include silicon-simulation correlation, design balancing, performance optimizing, and pico-architecture tuning. We conclude that persistent improvement, attitude towards work-on-silicon design, and insightful understanding of software and fabrication process are the three most important factors for designing a high performance processor with low energy consumption.
    Runtime Engine for Dynamic Profile Guided Stride Prefetching
    Qiong Zou, Xiao-Feng Li, and Long-Bing Zhang
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 633-643 . 
    Abstract   PDF(3812KB) ( 1218 )  
    Stride prefetching is recognized as an important technique to improve memory access performance. The prior work usually profiles and/or analyzes the program behavior offline, and uses the identified stride patterns to guide the compilation process by injecting the prefetch instructions at appropriate places. There are some researches trying to enable stride prefetching in runtime systems with online profiling, but they either cannot discover cross-procedural prefetch opportunity, or require special supports in hardware or garbage collection. In this paper, we present a prefetch engine for JVM (Java Virtual Machine). It firstly identifies the candidate load operations during just-in-time (JIT) compilation, and then instruments the compiled code to profile the addresses of those loads. The runtime profile is collected in a trace buffer, which triggers a prefetch controller upon a protection fault. The prefetch controller analyzes the trace to discover any stride patterns, then modifies the compiled code to inject the prefetch instructions in place of the instrumentations. One of the major advantages of this engine is that, it can detect striding loads in any virtual code places for both regular and irregular code, not being limited with plain loop or procedure scopes. Actually we found the cross-procedural patterns take about 30\% of all the prefetchings in the representative Java benchmarks. Another major advantage of the engine is that it has runtime overhead much smaller (the maximal is less than 4.0\%) than the benefits it brings. Our evaluation with Apache Harmony JVM shows that the engine can achieve an average 6.2\% speed-up with SPECJVM98 and DaCapo on Intel Pentium 4 platform, in spite of the runtime overhead.
    New Model and Algorithm for Hardware/Software Partitioning
    Ji-Gang Wu, Thambipillai Srikanthan, and Guang-Wei Zou
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 644-651 . 
    Abstract   PDF(1467KB) ( 1594 )  
    This paper focuses on the algorithmic aspects for the hardware/software (HW/SW) partitioning which searches a reasonable composition of hardware and software components which not only satisfies the constraint of hardware area but also optimizes the execution time. The computational model is extended so that all possible types of communications can be taken into account for the HW/SW partitioning. Also, a new dynamic programming algorithm is proposed on the basis of the computational model, in which source data, rather than speedup in previous work, of basic scheduling blocks are directly utilized to calculate the optimal solution. The proposed algorithm runs in $O(n\cdot A)$ for $n$ code fragments and the available hardware area $A$. Simulation results show that the proposed algorithm solves the HW/SW partitioning without increase in running time, compared with the algorithm cited in the literature.
    A General Approach to {\it\bfseries L$($h,k$)$}-Label Interconnection Networks
    Tiziana Calamoneri, Saverio Caminiti, and Rossella Petreschi
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 652-659 . 
    Abstract   PDF(904KB) ( 1396 )  
    Given two non-negative integers $h$ and $k$, an $L(h,k)$-{\em labeling} of a graph $G=(V,E)$ is a function from the set $V$ to a set of colors, such that adjacent nodes take colors at distance at least $h$, and nodes at distance 2 take colors at distance at least $k$. The aim of the $L(h,k)$-labeling problem is to minimize the greatest used color. Since the decisional version of this problem is NP-complete, it is important to investigate particular classes of graphs for which the problem can be efficiently solved. It is well known that the most common interconnection topologies, such as Butterfly-like, Bene\v{s}, CCC, Trivalent Cayley networks, are all characterized by a similar structure: they have nodes organized as a matrix and connections are divided into layers. So we naturally introduce a new class of graphs, called $(l \times n)$-{\em multistage graphs}, containing the most common interconnection topologies, on which we study the $L(h,k)$-labeling. A general algorithm for $L(h,k)$-labeling these graphs is presented, and from this method an efficient $L(2,1)$-labeling for Butterfly and CCC networks is derived. Finally we describe a possible generalization of our approach.
    Computer Network and Internet
    Survey on Anonymity in Unstructured Peer-to-Peer Systems
    Ren-Yi Xiao
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 660-671 . 
    Abstract   PDF(675KB) ( 4703 )  
    Although anonymizing Peer-to-Peer (P2P) networks often means extra cost in terms of transfer efficiency, many systems try to mask the identities of their users for privacy consideration. By comparison and analysis of existing approaches, we investigate the properties of unstructured P2P anonymity, and summarize current attack models on these designs. Most of these approaches are path-based, which require peers to pre-construct anonymous paths before transmission, thus suffering significant overhead and poor reliability. We also discuss the open problems in this field and propose several future research directions.
    Community Detection in Complex Networks
    Nan Du, Bai Wang, and Bin Wu
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 672-683 . 
    Abstract   PDF(12687KB) ( 2019 )  
    With the rapidly growing evidence that various systems in nature and society can be modeled as complex networks, community detection in networks becomes a hot research topic in physics, sociology, computer society, etc. Although this investigation of community structures has motivated many diverse algorithms, most of them are unsuitable when dealing with large networks due to their computational cost. In this paper, we present a faster algorithm {ComTector,} which is more efficient for the community detection in large complex networks based on the nature of overlapping cliques. This algorithm does not require any priori knowledge about the number or the original division of the communities. With respect to practical applications, {ComTector} is challenging with five different types of networks including the classic {Zachary Karate Club}, {Scientific Collaboration Network}, {South Florida Free Word Association Network}, {Urban Traffic Network}, {North America Power Grid} and the {Telecommunication Call Network}. Experimental results show that our algorithm can discover meaningful communities that meet both the objective basis and our intuitions.
    Research on Next-Generation Scalable Routers Implemented with H-Torus Topology
    You-Jian Zhao, Zu-Hui Yue, and Jian-Ping Wu
    Journal of Data Acquisition and Processing, 2008, 23 (4 ): 684-. 
    Abstract   PDF(4585KB) ( 2046 )  
    The exponential growth of user traffic has been driving routers to run at higher capacity. In a traditional router, the centralized switching fabric is becoming the bottleneck for its limited number of ports and complicated scheduling algorithms. Direct networks, such as 3-D Torus topology, have been successfully applied to the design of scalable routers. They show good scalability and fault tolerance. Unfortunately, its scalability is limited in practice. In this paper, we introduce another type of direct network, called H-Torus. This network shows excellent topological properties. On its basis, the designs of line card and routing algorithms are introduced. Extensive simulations show that the routing algorithm is very important in such a system and results in low latency with high throughput.
SCImago Journal & Country Rank
 

ISSN 1004-9037

         

Home
Editorial Board
Author Guidelines
Subscription
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China

E-mail: info@sjcjycl.cn
 
  Copyright ©2015 JCST, All Rights Reserved