Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Contents
      05 March 2012, Volume 27 Issue 2   
    Survey
    Computational Challenges in Characterization of Bacteria and Bacteria-Host Interactions Based on Genomic Data
    Chao Zhang (张潮), Member, ACM, IEEE, Guolu Zheng (郑国铝), Shun-Fu Xu (徐顺福), and Dong Xu (许东), Member, IEEE
    Journal of Data Acquisition and Processing, 2012, (2): 225-239. 
    Abstract   PDF(462KB) ( 2774 )  
    With the rapid development of next-generation sequencing technologies, bacterial identification has become an important and essential step in processing genomic data, especially metagenomic data. Many computational methods have been developed, and some of them are widely used to address the problems in bacterial identification. In this article, we review the algorithms behind these methods, discuss their drawbacks, and propose future computational methods that use genomic data to characterize bacteria. In addition, we tackle two specific computational problems in bacterial identification, namely the detection of host-specific bacteria and the detection of disease-associated bacteria, by offering potential solutions as a starting point for those interested in the area.
    Architecture
    PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs
    Xin-Hai Xu (徐新海), Student Member, CCF, ACM, Xue-Jun Yang (杨学军), Senior Member, CCF, Member, ACM, IEEE, Jing-Ling Xue (薛京灵), Senior Member, IEEE, Member, ACM, Yu-Fei Lin (林宇斐), Student Member, CCF, ACM, and Yi-Song Lin (林一松)
    Journal of Data Acquisition and Processing, 2012, (2): 240-255. 
    Abstract   PDF(893KB) ( 2286 )  
    GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when they occur later. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.
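    The partial-recomputing idea can be illustrated with a minimal CPU-side sketch in Python. The sub-region split, the fault detector and all names below are illustrative assumptions for exposition, not the authors' CUDA implementation:

        import copy, random

        def run_region_with_partial_recovery(state, subregions):
            """Execute a code region split into sub-regions; on a detected error,
            roll back only to the checkpoint of the current sub-region instead of
            re-running the whole region (as FullRC would)."""
            for step in subregions:
                while True:
                    saved = copy.deepcopy(state)   # lightweight intra-region checkpoint
                    step(state)
                    if not error_detected():
                        break                      # commit and move to the next sub-region
                    state.clear()
                    state.update(saved)            # partial recomputation: redo one sub-region only
            return state

        def error_detected():
            return random.random() < 0.05          # stand-in for a real fault detector

        state = {"acc": 0}
        subregions = [lambda s, k=k: s.update(acc=s["acc"] + k) for k in range(10)]
        print(run_region_with_partial_recovery(state, subregions))   # {'acc': 45}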
    Modeling a Dynamic Data Replication Strategy to Increase System Availability in Cloud Computing Environments
    Da-Wei Sun (孙大为), Student Member, CCF, ACM, Gui-Ran Chang (常桂然), Shang Gao (高尚), Li-Zhong Jin (靳立忠), and Xing-Wei Wang (王兴伟), Senior Member, CCF, ACM
    Journal of Data Acquisition and Processing, 2012, (2): 256-272. 
    Abstract   PDF(529KB) ( 6610 )  
    Failures are normal rather than exceptional in cloud computing environments. To improve system availability, replicating the popular data to multiple suitable locations is an advisable choice, as users can then access the data from a nearby site. This is, however, not the case for static replication, in which a fixed number of copies is kept at fixed locations. How to decide a reasonable number and the right locations for replicas has become a challenge in cloud computing. In this paper, a dynamic data replication strategy is put forward, together with a brief survey of replication strategies suitable for distributed computing environments. It includes: 1) analyzing and modeling the relationship between system availability and the number of replicas; 2) evaluating and identifying the popular data and triggering a replication operation when the popularity of the data passes a dynamic threshold; 3) calculating a suitable number of copies to meet a reasonable system byte effective rate requirement and placing replicas among data nodes in a balanced way; 4) designing the dynamic data replication algorithm in a cloud. Experimental results demonstrate the efficiency and effectiveness of the improvement brought about by the proposed strategy in a cloud.
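    Step 1) of the strategy can be made concrete with a back-of-the-envelope model. Assuming independent replica failures (an assumption of this sketch, not necessarily the paper's exact model), availability with r replicas is 1 - p^r when each copy is unavailable with probability p, which yields the minimum replica count for a target availability:

        import math

        def min_replicas(p_fail: float, target_availability: float) -> int:
            """Smallest r such that 1 - p_fail**r >= target_availability."""
            return math.ceil(math.log(1 - target_availability) / math.log(p_fail))

        print(min_replicas(0.05, 0.99999))   # 4 replicas when each copy is down 5% of the time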
    Checkpoint Management with Double Modular Redundancy Based on the Probability of Task Completion
    Seong Woo Kwak, Kwan-Ho You, and Jung-Min Yang
    Journal of Data Acquisition and Processing, 2012, (2): 273-280. 
    Abstract   PDF(1193KB) ( 1493 )  
    This paper proposes a checkpoint rollback strategy for real-time systems with double modular redundancy. Without built-in fault detection or spare processors, our scheme is able to recover from both transient and permanent faults. Two comparisons are conducted at each checkpoint. First, the states stored in two consecutive checkpoints of one processor are compared to check the integrity of that processor. The states of the two processors are also compared to detect faults, and the system rolls back to the previous checkpoint whenever required by the logic of the proposed scheme. A Markov model is induced from the fault recovery scheme and analyzed to provide the probability of task completion within its deadline. The optimal number of checkpoints is selected so as to maximize the probability of task completion.
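    The selection of the optimal checkpoint count can be sketched numerically. The Monte-Carlo estimate below stands in for the paper's Markov analysis, under an assumed fault model (Poisson transient faults, fixed checkpoint cost) with illustrative parameters:

        import random

        def completion_probability(n_ckpt, task=100.0, deadline=150.0,
                                   ckpt_cost=1.0, fault_rate=0.01, trials=5000):
            seg = task / n_ckpt
            ok = 0
            for _ in range(trials):
                t = 0.0
                for _ in range(n_ckpt):
                    while True:
                        fault_at = random.expovariate(fault_rate)
                        if fault_at >= seg:        # segment ran fault-free
                            t += seg + ckpt_cost
                            break
                        t += fault_at              # time lost, roll back to last checkpoint
                        if t > deadline:
                            break
                    if t > deadline:
                        break
                ok += t <= deadline
            return ok / trials

        best = max(range(1, 21), key=completion_probability)
        print(best, completion_probability(best))  # checkpoint count maximizing completion probability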
    Computer Network
    A Novel Intermediary Framework for Dynamic Edge Service Composition
    Claudia Canali, Member, IEEE, Michele Colajanni, Member, IEEE, Delfina Malandrino, Vittorio Scarano, Member, ACM, and Raffaele Spinelli
    Journal of Data Acquisition and Processing, 2012, (2): 281-297. 
    Abstract   PDF(1887KB) ( 1980 )  
    Multimedia content, user mobility and heterogeneous client devices require novel systems that are able to support ubiquitous access to Web resources. In this scenario, solutions that combine flexibility, efficiency and scalability in offering edge services for ubiquitous access are needed. We propose an original intermediary framework, namely the Scalable Intermediary Software Infrastructure (SISI), which is able to dynamically compose edge services on the basis of user preferences and device characteristics. The SISI framework exploits a per-user profiling mechanism, where each user can initially set his/her personal preferences through a simple Web interface, and the system is then able to compose the necessary components at run time. The basic framework can be enriched with new edge services that can be easily implemented through a programming model based on APIs and internal functions. Our experiments demonstrate that flexibility and edge service composition do not affect system performance. We show that this framework is able to chain multiple edge services and to guarantee stable performance.
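    The run-time composition can be pictured with a toy sketch: a user's profile selects which edge services to chain over a response. The service names and the profile format below are invented for illustration; SISI's actual programming model is richer:

        def compress(body: bytes) -> bytes:
            return body[: len(body) // 2] + b"..."       # toy stand-in for a compression service

        def html_to_text(body: bytes) -> bytes:
            return body.replace(b"<b>", b"").replace(b"</b>", b"")

        SERVICES = {"html_to_text": html_to_text, "compression": compress}

        def compose(profile: dict):
            """Build, at run time, the chain of edge services a user has enabled."""
            chain = [SERVICES[name] for name, enabled in profile.items() if enabled]
            def pipeline(response_body: bytes) -> bytes:
                for service in chain:
                    response_body = service(response_body)
                return response_body
            return pipeline

        pipeline = compose({"html_to_text": True, "compression": True})
        print(pipeline(b"<b>hello world, hello edge services</b>"))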
    Fault Tolerance and Recovery for Group Communication Services in Distributed Networks
    Yue-Hua Wang (王跃华), Student Member, IEEE, Zhong Zhou (周忠), Member, CCF, ACM, IEEE, Ling Liu, Senior Member, IEEE, and Wei Wu (吴威), Member, CCF
    Journal of Data Acquisition and Processing, 2012, (2): 298-312. 
    Abstract   PDF(2098KB) ( 1547 )  
    Group communication services (GCSs) are becoming increasingly important as a wide range of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make such services fault tolerant and scalable enough to fulfill the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been dedicated to addressing this challenge so as to accommodate changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders their practicability. In this paper, we present two schemes to address these challenges. The first is a location-aware replication scheme called NS, which places replicas in a dispersed fashion that enables the services on nodes to gain immunity to failures with different patterns (e.g., network partition and single point of failure) while keeping replication overhead low. The second is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in the time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.
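    The intuition behind dispersed, location-aware placement can be sketched with a farthest-point heuristic: pick replica holders that are far apart, so that a single regional failure or partition is unlikely to take out every copy. The coordinates and the greedy rule below are stand-ins for the NS scheme's actual placement logic:

        import numpy as np

        def place_replicas(nodes: np.ndarray, k: int) -> list:
            chosen = [0]                                  # seed with the source node
            while len(chosen) < k:
                best = max((i for i in range(len(nodes)) if i not in chosen),
                           key=lambda i: min(np.linalg.norm(nodes[i] - nodes[j]) for j in chosen))
                chosen.append(best)                       # maximize distance to the nearest replica
            return chosen

        nodes = np.array([[0, 0], [0.1, 0.2], [5, 5], [5.2, 4.9], [10, 0], [9.8, 0.3]])
        print(place_replicas(nodes, 3))                   # picks one node per "region": [0, 4, 2]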
    Diagnosing Traffic Anomalies Using a Two-Phase Model
    Bin Zhang (张宾), Jia-Hai Yang (杨家海), Member, CCF, ACM, IEEE, Jian-Ping Wu (吴建平), Fellow, IEEE, Member, CCF, ACM, and Ying-Wu Zhu (朱应武)
    Journal of Data Acquisition and Processing, 2012, (2): 313-327. 
    Abstract   PDF(721KB) ( 1704 )  
    Network traffic anomalies are unusual changes in a network, so diagnosing anomalies is important for network management. Feature-based anomaly detection models (ab)normal network traffic behavior by analyzing packet header features. The PCA subspace method (Principal Component Analysis) has been verified as an efficient feature-based approach to network-wide anomaly detection. Despite its power for network-wide traffic detection, however, the PCA subspace method cannot be effectively used for detection on a single link. In this paper, unlike most work focusing on detection in flow-level traffic, we propose a new approach, B6-SVM, to detect anomalies in packet-level traffic on a single link, based on observations of six traffic features. The basic idea of B6-SVM is to diagnose anomalies in a multi-dimensional view of traffic features using the Support Vector Machine (SVM). Through two-phase classification, B6-SVM can detect anomalies with a high detection rate and a low false alarm rate. The test results demonstrate the effectiveness and potential of our technique in diagnosing anomalies. Further, compared to previous feature-based anomaly detection approaches, B6-SVM provides a framework to automatically identify possible anomaly types. The framework of B6-SVM is generic, and we therefore expect the derived insights to be helpful for similar future research efforts.
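    The two-phase classification can be sketched with off-the-shelf SVMs: phase one decides whether a traffic bin is anomalous at all, and phase two labels the anomaly type. The six features, the synthetic data and the labels below are placeholders, not the paper's dataset:

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(400, 6))              # six per-bin traffic features
        is_anom = X[:, 0] + X[:, 3] > 1.5          # synthetic "anomalous bin" ground truth
        anom_type = (X[:, 1] > 0).astype(int)      # two synthetic anomaly types

        phase1 = SVC(kernel="rbf").fit(X, is_anom)
        phase2 = SVC(kernel="rbf").fit(X[is_anom], anom_type[is_anom])

        flagged = phase1.predict(X)                # phase 1: anomalous or normal
        labels = phase2.predict(X[flagged])        # phase 2: type of each flagged bin
        print(int(flagged.sum()), np.bincount(labels))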
    Delay and Capacity Trade-offs in Mobile Wireless Networks with Infrastructure Support
    Zhuo Li (李卓), Wen-Zhong Li (李文中), Member, CCF, ACM, IEEE, Song Guo (郭嵩), Senior Member, IEEE, Member, ACM, Sang-Lu Lu (陆桑璐), Senior Member, CCF, Member, ACM, IEEE, and Dao-Xu Chen (陈道蓄), Senior Member, CCF, Member, ACM, IEEE
    Journal of Data Acquisition and Processing, 2012, (2): 328-340. 
    Abstract   PDF(664KB) ( 9250 )  
    In this paper, we investigate the trade-offs between delay and capacity in mobile wireless networks with infrastructure support. We consider three different mobility models: the independent and identically distributed (i.i.d.) mobility model, the random walk mobility model with constant speed, and the Lévy flight mobility model. For the i.i.d. mobility model and the random walk mobility model with speed Θ(1/√n), we obtain theoretical results for the average packet delay when the capacity is Θ(1) and Θ(1/√n) respectively, where n is the number of nodes. We find that the optimal average packet delay is achieved at a capacity determined by K, the number of gateways. It is proved that the average packet delay D(n) divided by the capacity λ(n) is bounded below by n/(K·W). When ω(√n) ≤ K < n, the critical average delay for capacity, compared with static hybrid wireless networks, is Θ(K²/n). The Lévy flight mobility model is based on human mobility and is more sophisticated. For this model with parameter α, it is found that D(n)/λ(n) > O(n^((1-η)(α+1)/2) ln n) when K = O(n^η) (0 ≤ η < 1). We also prove that when ω(√n) ≤ K < n, the critical average delay is Θ(n^(α-1/2)·K).
    Machine Learning and Data Mining
    A Dimensionality Reduction Framework for Detection of Multiscale Structure in Heterogeneous Networks
    Hua-Wei Shen (沈华伟), Member, CCF, Xue-Qi Cheng (程学旗), Senior Member, CCF, Member, IEEE, Yuan-Zhuo Wang (王元卓), Senior Member, CCF, Member, ACM, IEEE, and Yixin Chen (陈一昕), Senior Member, IEEE
    Journal of Data Acquisition and Processing, 2012, (2): 341-357. 
    Abstract   PDF(918KB) ( 1660 )  
    Graph clustering has been widely applied in exploring regularities emerging in relational data. Recently, the rapid development of network theory has correlated graph clustering with the detection of community structure, a common and important topological characteristic of networks. Most existing methods investigate community structure at a single topological scale. However, as shown by empirical studies, the community structure of real-world networks often exhibits multiple topological descriptions, corresponding to clustering at different resolutions. Furthermore, the detection of multiscale community structure is heavily affected by the heterogeneous distribution of node degree, which makes it very challenging to detect multiscale community structure in heterogeneous networks. In this paper, we propose a novel, unified framework for detecting community structure from the perspective of dimensionality reduction. Based on the framework, we first prove that the well-known Laplacian matrix for network partition and the widely used modularity matrix for community detection are two kinds of covariance matrices used in dimensionality reduction. We then propose a novel method to detect communities at multiple topological scales within our framework. We further show that existing algorithms fail to deal with heterogeneous node degrees, and we develop a novel method to handle network heterogeneity by introducing a rescaling transformation into the covariance matrices in our framework. Extensive tests on real-world and artificial networks demonstrate that the proposed correlation matrices significantly outperform Laplacian and modularity matrices in terms of their ability to identify multiscale community structure in heterogeneous networks.
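    The two matrices the proof speaks about are easy to write down. The sketch below builds the Laplacian and modularity matrices of a toy graph with two natural communities and splits them on the leading eigenvector of the modularity matrix; the paper's rescaling transformation for heterogeneous degrees is not reproduced here:

        import numpy as np

        A = np.array([[0, 1, 1, 0, 0, 0],          # two triangles joined by one edge
                      [1, 0, 1, 0, 0, 0],
                      [1, 1, 0, 1, 0, 0],
                      [0, 0, 1, 0, 1, 1],
                      [0, 0, 0, 1, 0, 1],
                      [0, 0, 0, 1, 1, 0]], dtype=float)
        k = A.sum(axis=1)                          # node degrees
        m = k.sum() / 2                            # number of edges
        L = np.diag(k) - A                         # Laplacian matrix (network partition)
        B = A - np.outer(k, k) / (2 * m)           # modularity matrix (community detection)

        vals, vecs = np.linalg.eigh(B)
        labels = (vecs[:, -1] > 0).astype(int)     # split on the leading eigenvector of B
        print(labels)                              # the two triangles get different labels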
    Term-Dependent Confidence Normalisation for Out-of-Vocabulary Spoken Term Detection
    Dong Wang, Member, IEEE, Javier Tejedor, Simon King, Senior Member, IEEE, and Joe Frankel, Member, IEEE
    Journal of Data Acquisition and Processing, 2012, (2): 358-375. 
    Abstract   PDF(2350KB) ( 2375 )  
    An important component of a spoken term detection (STD) system involves estimating confidence measures of hypothesised detections. A potential problem of the widely used lattice-based confidence estimation, however, is that the confidence scores are treated uniformly for all search terms, regardless of how much they may differ in terms of phonetic or linguistic properties. This problem is particularly evident for out-of-vocabulary (OOV) terms which tend to exhibit high intra-term diversity. To address the impact of term diversity on confidence measures, we propose in this work a term-dependent normalisation technique which compensates for term diversity in confidence estimation. We first derive an evaluation-metric-oriented normalisation that optimises the evaluation metric by compensating for the diverse occurrence rates among terms, and then propose a linear bias compensation and a discriminative compensation to deal with the bias problem that is inherent in lattice-based confidence measurement and from which the Term Specific Threshold (TST) approach suffers. We tested the proposed technique on speech data from the multi-party meeting domain with two state-of-the-art STD systems based on phonemes and words respectively. The experimental results demonstrate that the confidence normalisation approach leads to a significant performance improvement in STD, particularly for OOV terms with phoneme-based systems.
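    Term-dependent normalisation can be pictured as a per-term shift of the raw lattice score. The bias form below, a weighted negative log of the term's prior occurrence rate, is our illustrative assumption and not the paper's exact evaluation-metric-oriented, linear or discriminative compensation:

        import math

        def normalise(score: float, term_prior: float, weight: float = 0.01) -> float:
            """Linear bias compensation: boost terms with low prior occurrence rates,
            e.g., OOV terms, so one global threshold treats all terms comparably."""
            return score + weight * -math.log(term_prior)

        detections = {("zurich", 0.42): 1e-6, ("meeting", 0.40): 1e-2}   # (term, raw score): prior
        for (term, score), prior in detections.items():
            print(term, round(normalise(score, prior), 3))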
    Fuzzy Distance-Based Range Queries over Uncertain Moving Objects
    Yi-Fei Chen (陈逸菲), Member, CCF, Xiao-Lin Qin (秦小麟), Senior Member, CCF, Liang Liu (刘亮), Student Member, CCF, and Bo-Han Li (李博涵), Member, CCF
    Journal of Data Acquisition and Processing, 2012, (2): 376-396. 
    Abstract   PDF(1318KB) ( 1596 )  
    Data obtained from the real world are imprecise or uncertain due to the accuracy of positioning devices, updating protocols or characteristics of applications. On the other hand, users sometimes prefer to qualitatively express their requests with vague conditions, and different parts of the search region are unequally important in some applications. We address the problem of efficiently processing fuzzy range queries for uncertain moving objects whose whereabouts in time are not known exactly, for which the basic syntax is: find objects always/sometimes near the query issuer with qualifying guarantees no less than a given threshold during a given temporal interval. We model the location uncertainty of moving objects using probability density functions and describe the indeterminate boundary of the query range with a fuzzy set. We present the qualifying guarantee evaluation of objects, and propose pruning techniques based on the α-cut of the fuzzy set to shrink the search space efficiently. We also design rules to reject non-qualifying objects and to validate qualifying objects, in order to avoid unnecessary costly numeric integrations in the refinement step. An extensive empirical study has been conducted to demonstrate the efficiency and effectiveness of the algorithms under various experimental settings.
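    The α-cut pruning step admits a compact illustration: if "near" is a fuzzy predicate whose membership decays with distance, the α-cut turns a threshold on qualifying guarantees into a crisp radius beyond which objects can be pruned. The trapezoidal membership function is an assumption of this sketch:

        def near_membership(d: float, full: float = 100.0, zero: float = 500.0) -> float:
            """Membership in the fuzzy set "near": 1 up to `full`, 0 beyond `zero`."""
            if d <= full:
                return 1.0
            if d >= zero:
                return 0.0
            return (zero - d) / (zero - full)

        def alpha_cut_radius(alpha: float, full: float = 100.0, zero: float = 500.0) -> float:
            """Largest distance whose membership is still >= alpha."""
            return zero - alpha * (zero - full)

        print(near_membership(300.0))     # 0.5
        print(alpha_cut_radius(0.8))      # objects farther than 180.0 can be pruned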
    Bug Prioritization to Facilitate Bug Report Triage
    Jaweria Kanwal and Onaiza Maqbool
    Journal of Data Acquisition and Processing, 2012, (2): 397-412. 
    Abstract   PDF(470KB) ( 1952 )  
    The large number of new bug reports received in the bug repositories of software systems makes their management a challenging task. Handling these reports manually is time consuming, and often results in delaying the resolution of important bugs. To address this issue, a recommender may be developed that automatically prioritizes new bug reports. In this paper, we propose and evaluate a classification-based approach to build such a recommender. We use the Naïve Bayes and Support Vector Machine (SVM) classifiers, and present a comparison to evaluate which classifier performs better in terms of accuracy. Since a bug report contains both categorical and text features, another evaluation we perform is to determine the combination of features that better determines the priority of a bug. To evaluate the bug priority recommender, we use precision and recall measures and also propose two new measures, Nearest False Negatives (NFN) and Nearest False Positives (NFP), which provide insight into the results produced by precision and recall. Our findings are that the results of SVM are better than those of the Naïve Bayes algorithm for text features, whereas for categorical features, the performance of Naïve Bayes is better than that of SVM. The highest accuracy is achieved with SVM when categorical and text features are combined for training.
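    A minimal version of the compared setup, with placeholder field names and toy data, might look as follows: text features via TF-IDF, categorical features one-hot encoded, and either classifier on top:

        import pandas as pd
        from sklearn.compose import ColumnTransformer
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import OneHotEncoder
        from sklearn.svm import LinearSVC

        reports = pd.DataFrame({
            "summary":   ["crash on save", "typo in dialog", "data loss on sync", "ui glitch"],
            "component": ["core", "ui", "core", "ui"],
            "priority":  ["P1", "P4", "P1", "P3"],
        })
        features = ColumnTransformer([
            ("text", TfidfVectorizer(), "summary"),        # text features
            ("cat",  OneHotEncoder(),   ["component"]),    # categorical features
        ])
        for clf in (MultinomialNB(), LinearSVC()):
            model = make_pipeline(features, clf).fit(reports, reports["priority"])
            print(type(clf).__name__, model.predict(reports))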
    HilAnchor: Location Privacy Protection in the Presence of Users' Preferences
    Wei-Wei Ni (倪巍伟), Member, CCF, Jin-Wang Zheng (郑锦旺), and Zhi-Hong Chong (崇志宏)
    Journal of Data Acquisition and Processing, 2012, (2): 413-427. 
    Abstract   PDF(788KB) ( 1389 )  
    Location privacy receives considerable attention in emerging location-based services. Most current practices, however, either ignore users' preferences or fulfill privacy preferences incompletely. In this paper, we propose a privacy protection solution that allows users' preferences in the fundamental query of k nearest neighbors (kNN). In particular, users are permitted to choose privacy preferences by specifying a minimum inferred region. Via a Hilbert curve based transformation, the additional workload from users' preferences is alleviated. Furthermore, this transformation reduces time-expensive region queries in 2-D space to range queries in 1-D space. Therefore, time efficiency, as well as communication efficiency, is greatly improved owing to the clustering properties of the Hilbert curve. Further, details of choosing anchor points are theoretically elaborated. The empirical studies demonstrate that our implementation delivers both flexibility for users' preferences and scalability for time and communication costs.
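    The 2-D to 1-D reduction rests on the standard Hilbert-curve index; the paper's anchor-point machinery is built on top of such an index. A sketch of the classic xy2d mapping is shown below: nearby cells receive nearby indices, so a 2-D region query becomes a small set of 1-D range queries:

        def xy2d(n: int, x: int, y: int) -> int:
            """Hilbert-curve distance of cell (x, y) on an n x n grid (n a power of two)."""
            d = 0
            s = n // 2
            while s > 0:
                rx = 1 if x & s else 0
                ry = 1 if y & s else 0
                d += s * s * ((3 * rx) ^ ry)
                if ry == 0:                        # rotate/flip the quadrant
                    if rx == 1:
                        x, y = n - 1 - x, n - 1 - y
                    x, y = y, x
                s //= 2
            return d

        print([xy2d(8, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])   # [0, 1, 2, 3]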
    Convex Decomposition Based Cluster Labeling Method for Support Vector Clustering
    Yuan Ping, Ying-Jie Tian, Ya-Jian Zhou, and Yi-Xian Yang
    Journal of Data Acquisition and Processing, 2012, (2): 428-442. 
    Abstract   PDF(748KB) ( 3437 )  
    Support vector clustering (SVC) is an important boundary-based clustering algorithm used in multiple applications for its capability of handling arbitrary cluster shapes. However, SVC's popularity is degraded by its highly intensive time complexity and poor labeling performance. To overcome such problems, we present a novel, efficient and robust convex decomposition based cluster labeling (CDCL) method based on the topological property of the dataset. CDCL decomposes each implicit cluster into convex hulls, each of which is composed of a subset of support vectors (SVs). Using a robust algorithm applied to the nearest neighboring convex hulls, the adjacency matrix of the convex hulls is built up to find the connected components, and the remaining data points are then assigned the label of the nearest convex hull appropriately. The approach's validity is guaranteed by geometric proofs. Time complexity analysis and comparative experiments suggest that CDCL significantly improves both efficiency and clustering quality.
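    The labeling flow can be summarized in a simplified sketch: treat each convex piece as a set of support vectors, connect pieces whose hulls come close enough, take connected components as clusters, and give each remaining point the label of its nearest piece. Plain point-to-point distances below stand in for the paper's geometric adjacency tests:

        import numpy as np

        def label_points(pieces, points, eps=1.0):
            """pieces: list of SV arrays, one per convex piece; points: remaining data."""
            n = len(pieces)
            parent = list(range(n))                    # union-find over convex pieces
            def find(i):
                while parent[i] != i:
                    i = parent[i]
                return i
            for i in range(n):
                for j in range(i + 1, n):
                    gap = min(np.linalg.norm(a - b) for a in pieces[i] for b in pieces[j])
                    if gap < eps:                      # adjacent hulls: same cluster
                        parent[find(i)] = find(j)
            labels = [find(i) for i in range(n)]
            return [labels[min(range(n),               # nearest convex piece wins
                              key=lambda i: min(np.linalg.norm(p - a) for a in pieces[i]))]
                    for p in points]

        pieces = [np.array([[0., 0.], [1., 0.]]), np.array([[1.5, 0.], [2., 1.]]),
                  np.array([[9., 9.], [10., 9.]])]
        print(label_points(pieces, [np.array([0.2, 0.1]), np.array([9.5, 9.0])]))   # [1, 2]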
    A Novel Similarity Measure to Induce Semantic Classes and Its Application for Language Model Adaptation in a Dialogue System
    Ya-Li Li (李亚丽), Wei-Qun Xu (徐为群), and Yong-Hong Yan (颜永红)
    Journal of Data Acquisition and Processing, 2012, (2): 443-450. 
    Abstract   PDF(683KB) ( 1781 )  
    In this paper, we propose a novel co-occurrence probabilities based similarity measure for inducing semantic classes. Clustering with the new similarity measure outperforms the widely used distance based on Kullback-Leibler divergence in precision, recall and F1 evaluation. In our experiments, we induced semantic classes from an unannotated in-domain corpus and then used the induced classes and structures to generate a large in-domain corpus, which was then used for language model adaptation. The character recognition rate was improved from 85.2% to 91%. We thus apply the new measure to mitigate the lack of domain data, by first inducing classes and then generating data, for a dialogue system.
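    A co-occurrence based similarity can be sketched directly from count data. The cosine over co-occurrence context distributions below is our stand-in; the paper's exact formula built on co-occurrence probabilities may differ:

        import numpy as np

        def context_distribution(word, cooc, vocab):
            row = np.array([cooc.get((word, c), 0) for c in vocab], dtype=float)
            return row / row.sum()                     # P(context | word)

        def similarity(p, q):
            return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

        vocab = ["city", "to", "from", "eat"]
        cooc = {("beijing", "city"): 8, ("beijing", "to"): 5, ("beijing", "from"): 4,
                ("shanghai", "city"): 7, ("shanghai", "to"): 6, ("shanghai", "from"): 3,
                ("noodles", "eat"): 9, ("noodles", "to"): 1}
        p, q, r = (context_distribution(w, cooc, vocab) for w in ("beijing", "shanghai", "noodles"))
        print(similarity(p, q), similarity(p, r))      # same semantic class vs. different class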
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190, P.R. China
E-mail: info@sjcjycl.cn
Copyright ©2015 JCST, All Rights Reserved