Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Contents
      05 January 2019, Volume 34 Issue 1   
    Special Section of Advances in Computer Science and Technology—Current Advances in the NSFC Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao 2014-2017 (Part 1)
    Preface
    Su Song, Ke Liu, Zhi-Yong Liu
    Journal of Data Acquisition and Processing, 2019, 34 (1): 1-2. 
    Decoding the Structural Keywords in Protein Structure Universe
    Wessam Elhefnawy, Min Li, Jian-Xin Wang, Yaohang Li
    Journal of Data Acquisition and Processing, 2019, 34 (1): 3-15. 
    Although the protein sequence-structure gap continues to widen due to the development of high-throughput sequencing tools, the protein structure universe appears to be nearly complete, as no proteins with novel structural folds have been deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues as the structural "keywords" that can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in PDB. We analyze the impact of clustering cut-offs on the performance of the fragment libraries. Then, the Frag-K fragments are employed as structural features to classify protein structures in major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with ~400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy.
    Controllability and Its Applications to Biological Networks
    Lin Wu, Min Li, Jian-Xin Wang, Fang-Xiang Wu
    Journal of Data Acquisition and Processing, 2019, 34 (1): 16-34. 
    Biological elements usually exert their functions by interacting with one another, forming various types of biological networks. The ability to control the dynamics of biological networks is of enormous benefit to the pharmaceutical and medical industries as well as to scientific research. Though there are many mathematical methods for steering dynamic systems towards desired states, these methods are usually not feasible for complex biological networks. The difficulties come from the lack of accurate models that can capture the dynamics of interactions between biological elements and from the fact that many mathematical methods are computationally intractable for large-scale networks. Recently, controllability, a concept from control theory, has been applied to investigate the dynamics of complex networks. In this article, recent advances in the controllability of complex networks and applications to biological networks are reviewed. Developing dynamic models is the primary concern in analyzing the dynamics of biological networks. First, we introduce a widely used dynamic model for investigating the controllability of complex networks. Then, recent studies of theorems and algorithms for making complex biological networks controllable in general or in specific application scenarios are reviewed. Finally, applications to real biological networks show that investigating the controllability of biological networks can shed light on many critical physiological or medical problems, such as revealing biological mechanisms and identifying drug targets, from a systematic perspective.
    Texture Feature Extraction from Thyroid MR Imaging Using High-Order Derived Mean CLBP
    Zhe Liu, Cheng-Jian Qiu, Yu-Qing Song, Xiao-Hong Liu, Juan Wang, Victor S. Sheng
    Journal of Data Acquisition and Processing, 2019, 34 (1): 35-46. 
    In the field of medical imaging, the traditional local binary pattern (LBP) and its improved algorithms are often sensitive to noise. Traditional LBPs are solely based on the sign information from local differences, and the binary quantization method oversimplifies the local texture features while disregarding the imaging information from the concave-convex regions between the high-order pixels and the neighboring sampling points. Therefore, we propose an improved Derived Mean Complete Local Binary Pattern (DM_CLBP) algorithm based on high-order derivatives. In the DM_CLBP method, the grey value of a single pixel is replaced by the mean grey value of the rectangular area block, and the difference between pixel values in the area is obtained using the second-order differentiation method. Based on the calculation concept of the complete local binary pattern (CLBP) algorithm, the cascade signs and magnitudes of the two components are encoded and recombined in DM_CLBP using a uniform pattern. The results from the experiments showed that the proposed DM_CLBP descriptors achieved a classification accuracy of 94.4%. Compared with LBP and other improved algorithms, the DM_CLBP algorithm presented in this study can effectively differentiate between lesion areas and normal areas in thyroid MR (magnetic resonance) images and shows improved accuracy of area classification.
    Privacy-Protective-GAN for Privacy Preserving Face De-Identification
    Yifan Wu, Fan Yang, Yong Xu, Haibin Ling
    Journal of Data Acquisition and Processing, 2019, 34 (1): 47-60. 
    Face de-identification has become increasingly important as image sources grow explosively and become easily accessible. Advances in face recognition techniques have also raised concerns about privacy leakage. The mainstream pipelines of face de-identification are mostly based on the k-same framework, which has been criticized for low effectiveness and poor visual quality. In this paper, we propose a new framework called Privacy-Protective-GAN (PP-GAN) that adapts GAN (generative adversarial network) with novel verificator and regulator modules specially designed for the face de-identification problem to ensure generating de-identified output with retained structure similarity according to a single input. We evaluate the proposed approach in terms of privacy protection, utility preservation, and structure similarity. Our approach not only outperforms existing face de-identification techniques but also provides a practical framework for adapting GAN with priors of domain knowledge.
    Computer Architecture and Systems
    ROCO: Using a Solid State Drive Cache to Improve the Performance of a Host-Aware Shingled Magnetic Recording Drive
    Wen-Guo Liu, Ling-Fang Zeng, Dan Feng, Kenneth B. Kent
    Journal of Data Acquisition and Processing, 2019, 34 (1): 61-76. 
    Shingled magnetic recording (SMR) can effectively increase the capacity of hard disk drives (HDDs). Host-aware SMR (HA-SMR) is expected to be more popular than other SMR models because of its backward compatibility and new SMR-specific APIs. However, an HA-SMR drive often suffers performance degradation under write-intensive workloads because of frequent non-sequential writes buffered in the disk cache. The non-sequential writes mainly come from update writes, small random writes and out-of-order writes. In this paper, we propose a hybrid storage system called ROCO which aims to use a solid state drive (SSD) cache to improve the performance of an HA-SMR drive. ROCO reorders out-of-order writes belonging to the same zone and uses the SSD cache to absorb update writes and small random writes. We also design a data replacement algorithm called CREA for the SSD cache. CREA first conducts zone-oriented hot/cold data identification to identify cold-cached zones and hot-cached zones, and then evicts data blocks belonging to colder zones with higher priorities that can be sequentially written or written through host-side read-modify-write operations. It gives the lowest priority to data blocks belonging to the hottest-cached zone that have to be non-sequentially written. Experimental results show that ROCO can effectively reduce non-sequential writes to the HA-SMR drive and improve the performance of the HA-SMR drive.
    Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight
    Min Li, Chao Yang, Qiao Sun, Wen-Jing Ma, Wen-Long Cao, Yu-Long Ao
    Journal of Data Acquisition and Processing, 2019, 34 (1): 77-93. 
    With the advent of the big data era, the amounts of sampling data and the dimensions of data features are rapidly growing. It is highly desirable to enable fast and efficient clustering of unlabeled samples based on feature similarities. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention. To achieve high-performance k-means computations on modern multi-core/many-core systems, we propose a matrix-based fused framework that achieves high performance by conducting computations on a distance matrix and at the same time improves memory reuse through the fusion of the distance-matrix computation and the nearest-centroid reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, which provides the main computing power of Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint and a register blocking strategy to increase the data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Discussions on block-size tuning and performance modeling are also presented. We show by experiments on both randomly generated and real-world datasets that our parallel implementation of k-means on SW26010 can sustain a double-precision performance of over 348.1 Gflops, which is 46.9% of the peak performance and 84% of the theoretical performance upper bound on a single core group, and can achieve a nearly ideal scalability to the whole SW26010 processor of four core groups. Performance comparisons with the previous state-of-the-art on both CPU and GPU are also provided to show the superiority of our optimized k-means kernel.
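As a rough illustration of what fusing the distance computation with the nearest-centroid reduction can mean, the sketch below assigns each sample to a centroid without ever storing the full distance matrix. It is a plain-Python sketch under stated assumptions, not the paper's SW26010 kernel: the function name `assign_fused` and the sample data are hypothetical, and the only technique used is the standard expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, where the ||x||^2 term is constant per sample and can be dropped from the comparison.

```python
def assign_fused(samples, centroids):
    """Return the index of the nearest centroid for each sample.

    Hypothetical helper: each distance score is consumed by the running
    minimum as soon as it is computed, so no n-by-k matrix is kept.
    """
    # Squared norms of the centroids are computed once and reused.
    c_norms = [sum(v * v for v in c) for c in centroids]
    labels = []
    for x in samples:
        best_j, best_score = -1, float("inf")
        for j, c in enumerate(centroids):
            # ||c||^2 - 2 x.c ranks centroids exactly like ||x - c||^2.
            score = c_norms[j] - 2.0 * sum(a * b for a, b in zip(x, c))
            if score < best_score:
                best_j, best_score = j, score
        labels.append(best_j)
    return labels


samples = [[0.0, 0.0], [10.0, 10.0], [1.0, 0.0], [9.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
labels = assign_fused(samples, centroids)
```

The inner products x.c are the part that a matrix-based formulation would compute as a matrix multiplication; here they are written as explicit loops only to keep the sketch dependency-free.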
    Scaling out NUMA-Aware Applications with RDMA-Based Distributed Shared Memory
    Yang Hong, Yang Zheng, Fan Yang, Bin-Yu Zang, Hai-Bing Guan, Hai-Bo Chen
    Journal of Data Acquisition and Processing, 2019, 34 (1): 94-112. 
    The multicore evolution has stimulated renewed interest in scaling up applications on shared-memory multiprocessors, significantly improving the scalability of many applications. However, this scalability is limited to a single node, so programmers still have to redesign applications to scale out over multiple nodes. This paper revisits the design and implementation of distributed shared memory (DSM) as a way to scale out applications optimized for non-uniform memory access (NUMA) architecture over a well-connected cluster. This paper presents MAGI, an efficient DSM system that provides a transparent shared address space with scalable performance on a cluster with fast network interfaces. MAGI is unique in that it presents a NUMA abstraction to fully harness the multicore resources in each node through hierarchical synchronization and memory management. MAGI also exploits the memory access patterns of big-data applications and leverages a set of optimizations for remote direct memory access (RDMA) to reduce the number of page faults and the cost of the coherence protocol. MAGI has been implemented as a user-space library with pthread-compatible interfaces and can run existing multithreaded applications with minimal modifications. We deployed MAGI over an 8-node RDMA-enabled cluster. Experimental evaluation shows that MAGI achieves up to 9.25x speedup compared with an unoptimized implementation, leading to scalable performance for large-scale data-intensive applications.
    Extending SSD Lifespan with Comprehensive Non-Volatile Memory-Based Write Buffers
    Ziqi Fan, Dongchul Park
    Journal of Data Acquisition and Processing, 2019, 34 (1): 113-132. 
    New non-volatile memory (NVM) technologies are expected to replace main memory DRAM (dynamic random access memory) in the near future. NAND flash technological breakthroughs have enabled wide adoption of solid state drives (SSDs) in storage systems. However, flash-based SSDs, by nature, cannot avoid low endurance problems because each cell only allows a limited number of erasures. This can give rise to critical SSD reliability issues. Since many SSD write operations eventually cause many SSD erase operations, reducing SSD write traffic plays a crucial role in SSD reliability. This paper proposes two NVM-based buffer cache policies which can work together in different layers to maximally reduce SSD write traffic: a main memory buffer cache design named Hierarchical Adaptive Replacement Cache (H-ARC) and an internal SSD write buffer design named Write Traffic Reduction Buffer (WRB). H-ARC considers four factors (dirty, clean, recency, and frequency) to reduce write traffic and improve cache hit ratios in the host. WRB reduces block erasures and write traffic further inside an SSD by effectively exploiting temporal and spatial localities. These two comprehensive schemes significantly reduce total SSD write traffic at each different layer (i.e., host and SSD) by up to 3x. Consequently, they help extend SSD lifespan without system performance degradation.
    Data Management and Data Mining
    Secure Inverted Index Based Search over Encrypted Cloud Data with User Access Rights Management
    Fateh Boucenna, Omar Nouali, Samir Kechid, M. Tahar Kechadi
    Journal of Data Acquisition and Processing, 2019, 34 (1): 133-154. 
    Cloud computing is a technology that provides users with a large storage space and an enormous computing power. However, the outsourced data are often sensitive and confidential, and hence must be encrypted before being outsourced. Consequently, classical search approaches have become obsolete and new approaches that are compatible with encrypted data have become a necessity. For privacy reasons, most of these approaches are based on the vector model, which is time-consuming because the entire index must be loaded and exploited during the search, given that the query vector must be compared with each document vector. To solve this problem, we propose a new method for constructing a secure inverted index using two key techniques, homomorphic encryption and the dummy documents technique. However, 1) homomorphic encryption generates very large ciphertexts which are thousands of times larger than their corresponding plaintexts, and 2) the dummy documents technique that enhances the index security produces lots of false positives in the search results. The proposed approach exploits the advantages of these two techniques by proposing two methods called the compressed table of encrypted scores and the double score formula. Moreover, we exploit a second secure inverted index in order to manage the users' access rights to the data. Finally, in order to validate our approach, we performed an experimental study using a data collection of one million documents. The experiments show that our approach is many times faster than any other approach based on the vector model.
    Learning to Generate Posters of Scientific Papers by Probabilistic Graphical Models
    Yu-Ting Qiang, Yan-Wei Fu, Xiao Yu, Yan-Wen Guo, Zhi-Hua Zhou, Leonid Sigal
    Journal of Data Acquisition and Processing, 2019, 34 (1): 155-169. 
    Researchers often summarize their work in the form of scientific posters. Posters provide a coherent and efficient way to convey core ideas expressed in scientific papers. Generating a good scientific poster, however, is a complex and time-consuming cognitive task, since such posters need to be readable, informative, and visually aesthetic. In this paper, for the first time, we study the challenging problem of learning to generate posters from scientific papers. To this end, a data-driven framework, which utilizes graphical models, is proposed. Specifically, given content to display, the key elements of a good poster, including attributes of each panel and arrangements of graphical elements, are learned and inferred from data. During the inference stage, the maximum a posteriori (MAP) estimation framework is employed to incorporate some design principles. In order to bridge the gap between panel attributes and the composition within each panel, we also propose a recursive page splitting algorithm to generate the panel layout for a poster. To learn and validate our model, we collect and release a new benchmark dataset, called NJU-Fudan Paper-Poster dataset, which consists of scientific papers and corresponding posters with exhaustively labelled panels and attributes. Qualitative and quantitative results indicate the effectiveness of our approach.
    Who Should Be Invited to My Party: A Size-Constrained k-Core Problem in Social Networks
    Yu-Liang Ma, Ye Yuan, Fei-Da Zhu, Guo-Ren Wang, Jing Xiao, Jian-Zong Wang
    Journal of Data Acquisition and Processing, 2019, 34 (1): 170-184. 
    In this paper, we investigate the problem of a size-constrained k-core group query (SCCGQ) in social networks, taking both user closeness and network topology into consideration. More specifically, SCCGQ intends to find a group of h users that has the highest social closeness while being a k-core. SCCGQ can be widely applied to event planning, task assignment, social analysis, and many other fields. In contrast to existing work on the k-core detection problem, which aims to find a k-core in a social network, SCCGQ not only focuses on k-core detection but also takes size constraints into consideration. Although the conventional k-core detection problem can be solved in linear time, SCCGQ has a higher complexity. To solve the problem of SCCGQ, we propose a Blast Scatter (BS) algorithm, which appoints the query node as the center to begin outward expansions via breadth search. In each outward expansion, BS finds a new center through a greedy strategy and then selects multiple neighbors of the center. To speed up the BS algorithm, we propose an advanced search algorithm, called Bounded Extension (BE). Specifically, BE combines an effective social distance pruning strategy and a tight upper bound of social closeness to prune the search space considerably. In addition, we propose an offline social-aware index to accelerate the query processing. Finally, our experimental results demonstrate the efficiency and effectiveness of our proposed algorithms on large real-world social networks.
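For readers unfamiliar with the conventional k-core detection problem mentioned in the abstract, the sketch below shows the standard peeling approach: repeatedly remove every node whose degree falls below k, and whatever remains is the k-core. This is only the textbook baseline, not the SCCGQ, BS, or BE algorithms of the paper; the function name `k_core` and the example graph are hypothetical.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the set of nodes in the k-core of an undirected graph.

    Each node is removed at most once and each edge is examined a
    constant number of times, so the work is linear in the graph size.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    # Seed the work list with every node that already violates the bound.
    stack = [node for node, d in degree.items() if d < k]
    removed = set(stack)
    while stack:
        node = stack.pop()
        for nbr in adj[node]:
            if nbr in removed:
                continue
            degree[nbr] -= 1
            if degree[nbr] < k:
                removed.add(nbr)
                stack.append(nbr)
    return set(adj) - removed


# A triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
core2 = k_core(edges, 2)
```

A size-constrained query such as SCCGQ additionally fixes the group size h and optimizes social closeness, which is why peeling alone does not solve it.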
    Regular Paper
    On Maximum Elastic Scheduling in Cloud-Based Data Center Networks for Virtual Machines with the Hose Model
    Shuai-Bing Lu, Jie Wu, Huan-Yang Zheng, Zhi-Yi Fang
    Journal of Data Acquisition and Processing, 2019, 34 (1): 185-206. 
    With the growing popularity of cloud-based data center networks (DCNs), task resource allocation has become more and more important to the efficient use of resources in DCNs. This paper considers provisioning the maximum admissible load (MAL) of virtual machines (VMs) in physical machines (PMs) with underlying tree-structured DCNs using the hose model for communication. The limitation of static load distribution is that it assigns tasks to nodes in a once-and-for-all manner, and thus requires a priori knowledge of program behavior. To avoid load redistribution during runtime when the load grows, we introduce maximum elasticity scheduling, which has the maximum growth potential subject to the node and link capacities. This paper aims to find the schedule with the maximum elasticity across nodes and links. We first propose a distributed linear solution based on message passing, and we discuss several properties and extensions of the model. Based on the assumptions and conclusions, we extend it to the multiple paths case with a fat tree DCN, and discuss the optimal solution for computing the MAL with both computation and communication constraints. After that, we present the provision scheme with the maximum elasticity for the VMs, which comes with a provable optimality guarantee for a fixed flow scheduling strategy in a fat tree DCN. We conduct the evaluations on our testbed and present various simulation results by comparing the proposed maximum elastic scheduling schemes with other methods. Extensive simulations validate the effectiveness of the proposed policies, and the results are shown from different perspectives to provide solutions based on our research.
    A Survey on the Moving Target Defense Strategies: An Architectural Perspective
    Jianjun Zheng, Akbar Siami Namin
    Journal of Data Acquisition and Processing, 2019, 34 (1): 207-233. 
    As the complexity and the scale of networks continue to grow, the management of network operations and security defense has become a challenging task for network administrators, and many network devices may not be updated in a timely manner, leaving the network vulnerable to potential attacks. Moreover, the static nature of our existing network infrastructure allows attackers to have enough time to study the static configurations of the network and to launch well-crafted attacks at their convenience while defenders have to work around the clock to defend the network. This asymmetry, in terms of time and money invested, has given attackers a greater advantage than defenders and has made security defense even more challenging. It calls for new and innovative ideas to fix the problem. Moving Target Defense (MTD) is one such idea: it implements diverse and dynamic configurations of network systems with the goal of obscuring the exact attack surfaces available to attackers. As a result, a system with the MTD strategy is unpredictable to attackers, hard to exploit, and more resilient to various forms of attacks. There are existing survey papers on various MTD techniques, but to the best of our knowledge, insufficient attention has been given to the architectural perspective of MTD strategies or to new technologies such as the Internet of Things (IoT). This paper presents a comprehensive survey on MTD and implementation strategies from the perspective of the architecture of the complete network system, covering the motivation for MTD, the explanation of main MTD concepts, ongoing research efforts of MTD and its implementation at each level of the network system, and the future research opportunities offered by new technologies such as Software-Defined Networking (SDN) and Internet of Things (IoT).
    A Cost-Efficient Approach to Storing Users' Data for Online Social Networks
    Jing-Ya Zhou, Jian-Xi Fan, Cheng-Kuan Lin, Bao-Lei Cheng
    Journal of Data Acquisition and Processing, 2019, 34 (1): 234-252. 
    As users increasingly befriend others and interact online via their social media accounts, online social networks (OSNs) are expanding rapidly. Confronted with the big data generated by users, it is imperative that data storage be distributed, scalable, and cost-efficient. Yet one of the most significant challenges is determining how to minimize the cost without degrading system performance. Although many storage systems use a distributed key-value store, it cannot be directly applied to OSN storage systems. Because users' data are highly correlated, hash storage leads to frequent inter-server communications, and the high inter-server traffic costs decrease the OSN storage system's scalability. Previous studies proposed conducting network partitioning and data replication based on social graphs. However, data replication increases storage costs and impacts traffic costs. Here, we consider how to minimize costs from the perspective of data storage, by combining partitioning and replication. Our cost-efficient data storage approach supports scalable OSN storage systems. The proposed approach co-locates frequently interacting users by conducting partitioning and replication simultaneously while meeting load-balancing constraints. Extensive experiments are undertaken on two real-world traces, and the results show that our approach achieves lower cost compared with state-of-the-art approaches. Thus we conclude that our approach enables economic and scalable OSN data storage.
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China

E-mail: info@sjcjycl.cn
 
  Copyright ©2015 JCST, All Rights Reserved