Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Contents
      05 January 2015, Volume 30 Issue 1   
    Special Section on Computer Architecture and Systems for Big Data
    Preface
    Wen-Guang Chen
    Journal of Data Acquisition and Processing, 2015, 30 (1): 1-2. 
    Abstract   PDF(316KB) ( 1264 )  
    CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications
    Yi Yang, Chao Li, Huiyang Zhou
    Journal of Data Acquisition and Processing, 2015, 30 (1): 3-19. 
    Abstract   PDF(1358KB) ( 1557 )  
    Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations to threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
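    One aspect mentioned above is that the CUDA-NP compiler explores different ways to distribute parallel loop iterations to the slave threads that are already active. The sketch below is a purely illustrative host-side Python model of that single aspect (it is not GPU code, and the "block"/"cyclic" scheme names are generic), showing two common mappings of N iterations onto K threads:

        # Illustrative model (not GPU code) of distributing N parallel-loop
        # iterations across K slave threads, one aspect of a CUDA-NP-style
        # transformation. "Block" and "cyclic" are generic scheme names.
        def block_distribution(n_iters, n_threads, tid):
            """Each thread gets one contiguous chunk of iterations."""
            chunk = (n_iters + n_threads - 1) // n_threads
            start = tid * chunk
            return list(range(start, min(start + chunk, n_iters)))

        def cyclic_distribution(n_iters, n_threads, tid):
            """Each thread gets a strided subset of iterations."""
            return list(range(tid, n_iters, n_threads))

        if __name__ == "__main__":
            N, K = 10, 4
            for tid in range(K):
                print(tid, block_distribution(N, K, tid), cyclic_distribution(N, K, tid))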
    Bipartite-Oriented Distributed Graph Partitioning for Big Learning
    Rong Chen, Jia-Xin Shi, Hai-Bo Chen, Bin-Yu Zang
    Journal of Data Acquisition and Processing, 2015, 30 (1): 20-29. 
    Abstract   PDF(2886KB) ( 1301 )  
    Many machine learning and data mining (MLDM) problems like recommendation, topic modeling, and medical diagnosis can be modeled as computing on bipartite graphs. However, most distributed graph-parallel systems are oblivious to the unique characteristics of such graphs, and existing online graph partitioning algorithms usually cause excessive replication of vertices as well as significant pressure on network communication. This article identifies the challenges and opportunities of partitioning bipartite graphs for distributed MLDM processing and proposes BiGraph, a set of bipartite-oriented graph partitioning algorithms. BiGraph leverages observations such as the skewed distribution of vertices, the distinct computation loads, and the imbalanced data sizes between the two subsets of vertices to derive a set of optimal graph partitioning algorithms that result in minimal vertex replication and network communication. BiGraph has been implemented on PowerGraph and is shown to deliver a performance boost of up to 17.75X (from 1.16X) for four typical MLDM algorithms, by reducing vertex replication by up to 80% and network traffic by up to 96%.
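    BiGraph's exact algorithms are not reproduced in the abstract; the following Python sketch only illustrates the underlying intuition, assigning every edge according to its vertex from one favored subset so that favored vertices are never replicated and only the other subset's vertices may appear in several partitions. The user/item naming, hash-based placement, and partition count are assumptions for illustration, not BiGraph itself.

        # Hypothetical sketch of a bipartite-aware edge partitioner: every edge
        # is placed according to its vertex from the "favored" subset (here,
        # the item side of a user-item graph), so favored vertices are never
        # replicated; only the other subset's vertices may span partitions.
        from collections import defaultdict

        def partition_bipartite(edges, num_parts):
            """edges: iterable of (user, item) pairs; place each edge by hash(item)."""
            parts = defaultdict(list)
            for user, item in edges:
                parts[hash(item) % num_parts].append((user, item))
            return parts

        def replication_factor(parts, side):
            """Average number of partitions a vertex of one side (0=user, 1=item) appears in."""
            seen = defaultdict(set)
            for pid, edges in parts.items():
                for e in edges:
                    seen[e[side]].add(pid)
            return sum(len(p) for p in seen.values()) / max(len(seen), 1)

        if __name__ == "__main__":
            edges = [(u, i) for u in range(6) for i in range(3)]  # tiny dense bipartite graph
            parts = partition_bipartite(edges, 3)
            print("item replication:", replication_factor(parts, 1))  # 1.0 by construction
            print("user replication:", replication_factor(parts, 0))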
    Using Memory in the Right Way to Accelerate Big Data Processing
    Dong Yan, Xu-Sen Yin, Cheng Lian, Xiang Zhong, Xin Zhou, Gan-Sha Wu
    Journal of Data Acquisition and Processing, 2015, 30 (1): 30-41. 
    Abstract   PDF(3877KB) ( 2413 )  
    Big data processing is becoming a standout part of data center computation. However, recent research has indicated that big data workloads cannot make full use of modern memory systems. We find that the dramatic inefficiency of big data processing stems from the enormous number of cache misses and from stalls on dependent memory accesses. In this paper, we introduce two optimizations to tackle these problems. The first one is the slice-and-merge strategy, which reduces the cache miss rate of the sort procedure. The second optimization is direct-memory-access, which reforms the data structure used in key/value storage. These optimizations are evaluated with both micro-benchmarks and the real-world benchmark HiBench. The results of our micro-benchmarks clearly demonstrate the effectiveness of our optimizations in terms of hardware event counts, and the additional results of HiBench show a 1.21X average speedup at the application level. Both results illustrate that careful hardware/software co-design will improve the memory efficiency of big data processing. Our work has already been integrated into the Intel Distribution for Apache Hadoop.
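    The slice-and-merge strategy can be pictured as sorting cache-sized slices first and then k-way merging them, so that most comparisons touch data that is still cache-resident. A minimal Python sketch follows; the slice length and the use of heapq.merge are illustrative stand-ins for the paper's Hadoop-level implementation, not its actual code.

        # Minimal sketch of slice-and-merge sorting: sort small, cache-sized
        # slices first, then k-way merge the sorted slices. The slice length
        # is an assumed, illustrative parameter.
        import heapq

        def slice_and_merge_sort(data, slice_len=4096):
            slices = [sorted(data[i:i + slice_len]) for i in range(0, len(data), slice_len)]
            return list(heapq.merge(*slices))

        if __name__ == "__main__":
            import random
            xs = [random.randrange(10**6) for _ in range(100000)]
            assert slice_and_merge_sort(xs) == sorted(xs)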
    An Efficient and Flexible Deterministic Framework for Multithreaded Programs
    Kai Lu, Xu Zhou, Xiao-Ping Wang, Tom Bergan, Chen Chen
    Journal of Data Acquisition and Processing, 2015, 30 (1): 42-56. 
    Abstract   PDF(750KB) ( 2031 )  
    Determinism is very useful for multithreaded programs in debugging, testing, etc. Many deterministic approaches have been proposed, such as deterministic multithreading (DMT) and deterministic replay. However, these systems are either inefficient or target a single purpose, which limits their flexibility. In this paper, we propose an efficient and flexible deterministic framework for multithreaded programs. Our framework implements determinism in two steps: relaxed determinism and strong determinism. Relaxed determinism solves data races efficiently by using a proper weak memory consistency model. After that, we implement strong determinism by solving lock contentions deterministically. Since we can apply different approaches to these two steps independently, our framework provides a spectrum of deterministic choices, including a nondeterministic system (fast), a weak deterministic system (fast and conditionally deterministic), a DMT system, and a deterministic replay system. Our evaluation shows that the DMT configuration of this framework can even outperform a state-of-the-art DMT system.
    System-Enforced Deterministic Streaming for Efficient Pipeline Parallelism
    Yu Zhang, Zhao-Peng Li, Hui-Fang Cao
    Journal of Data Acquisition and Processing, 2015, 30 (1): 57-73. 
    Abstract   PDF(720KB) ( 1400 )  
    Pipeline parallelism is a popular parallel programming pattern for emerging applications. However, programming pipelines directly on conventional multithreaded shared memory is difficult and error-prone. We present DStream, a C library that provides high-level abstractions of deterministic threads and streams for simply representing pipeline stage workers and their communications. The deterministic stream is established atop our proposed single-producer/multi-consumer (SPMC) virtual memory, which integrates synchronization with the virtual memory model to enforce determinism on shared memory accesses. We investigate various strategies for efficiently implementing DStream atop the SPMC memory, so that an infinite sequence of data items can be asynchronously published (fixed) and asynchronously consumed in order among adjacent stage workers. We have successfully transformed two representative pipeline applications, ferret and dedup, using DStream, and summarize the conversion rules. An empirical evaluation shows that the converted ferret performed on par with its Pthreads and TBB counterparts in terms of running time, while the converted dedup is close to 2.56X and 7.05X faster than the Pthreads counterpart, and 1.06X and 3.9X faster than the TBB counterpart, on 16 and 32 CPUs, respectively.
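    DStream itself is a C library built on SPMC virtual memory; the Python sketch below only models the ordering contract described above: a single producer publishes each item exactly once (items are immutable after publication), and every consumer reads the items in publication order at its own pace. The class and method names, and the lock-based implementation, are illustrative assumptions rather than DStream's mechanism.

        # Illustrative single-producer/multi-consumer (SPMC) stream: the
        # producer appends to an immutable, append-only log; each consumer
        # tracks its own read index and observes items in publication order.
        import threading

        class SPMCStream:
            def __init__(self):
                self._items = []                 # append-only log of published items
                self._cond = threading.Condition()

            def publish(self, item):             # single producer
                with self._cond:
                    self._items.append(item)
                    self._cond.notify_all()

            def read(self, index):               # blocks until item `index` exists
                with self._cond:
                    while index >= len(self._items):
                        self._cond.wait()
                    return self._items[index]

        def consumer(stream, n, name):
            for i in range(n):
                print(name, "got", stream.read(i))   # in-order consumption

        if __name__ == "__main__":
            s = SPMCStream()
            threads = [threading.Thread(target=consumer, args=(s, 5, f"c{k}")) for k in range(2)]
            for t in threads: t.start()
            for i in range(5): s.publish(i)          # producer publishes 5 items
            for t in threads: t.join()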
    Exploring Heterogeneous NoC Design Space in Heterogeneous GPU-CPU Architectures
    Juan Fang, Zhen-Yu Leng, Si-Tong Liu, Zhi-Cheng Yao, Xiu-Feng Sui
    Journal of Data Acquisition and Processing, 2015, 30 (1): 74-83. 
    Abstract   PDF(3097KB) ( 2271 )  
    Computer architecture is transitioning from the multicore era into the heterogeneous era, in which heterogeneous architectures use on-chip networks to access shared resources, and how a network is configured will likely have a significant impact on overall performance and power consumption. Recently, heterogeneous network-on-chip (NoC) designs have been proposed not only to achieve performance comparable to that of NoCs with buffered routers but also to reduce buffer cost and energy consumption. However, heterogeneous NoC design for heterogeneous GPU-CPU architectures has not been studied in depth. This paper first evaluates the performance and power consumption of a variety of static hot-potato based heterogeneous NoCs with different buffered and bufferless router placements, which helps explore the design space for heterogeneous GPU-CPU interconnection. It then proposes Unidirectional Flow Control (UFC), a simple credit-based flow control mechanism for heterogeneous NoCs in GPU-CPU architectures to control network congestion. UFC can guarantee that there are always unoccupied entries in buffered routers to receive flits coming from adjacent bufferless routers. Our evaluations show that, compared to hot-potato routing, UFC improves performance by an average of 14.1%, with energy increased by only 5.3% on average.
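    UFC's guarantee that a bufferless router only forwards a flit when the downstream buffered router still has a free entry is, in essence, credit-based flow control. The toy simulation below sketches that generic mechanism only; the buffer depth, drain rate, and timing are made-up parameters, not the paper's NoC model.

        # Generic sketch of credit-based flow control between a bufferless
        # sender and a buffered receiver: the sender forwards a flit only
        # while it holds a credit, and a credit returns whenever the receiver
        # drains an entry, so the receiver's buffer can never overflow.
        from collections import deque

        class BufferedRouter:
            def __init__(self, depth=4):
                self.depth = depth
                self.buffer = deque()

            def accept(self, flit):
                assert len(self.buffer) < self.depth   # guaranteed by the credit scheme
                self.buffer.append(flit)

            def drain(self):                           # forward one flit downstream
                return self.buffer.popleft() if self.buffer else None

        def simulate(n_flits=20, depth=4):
            rx = BufferedRouter(depth)
            credits = depth                            # one credit per free buffer entry
            sent = 0
            for cycle in range(100):
                if sent < n_flits and credits > 0:     # send only when a credit is available
                    rx.accept(("flit", sent)); sent += 1; credits -= 1
                if cycle % 3 == 0 and rx.drain() is not None:
                    credits += 1                       # credit returned when an entry frees up
            print("sent", sent, "final buffer occupancy", len(rx.buffer))

        if __name__ == "__main__":
            simulate()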
    CRAIS: A Crossbar-Based Interconnection Scheme on FPGA for Big Data
    Chao Wang, Xi Li, Xue-Hai Zhou
    Journal of Data Acquisition and Processing, 2015, 30 (1): 84-96. 
    Abstract   PDF(2198KB) ( 1465 )  
    On-chip interconnection has posed significant challenges in the multiprocessor system-on-chip (MPSoC) design paradigm, especially in the big data era. With respect to the state of the art, crossbar-based interconnection methodologies are still efficient for FPGA-based small-scale heterogeneous MPSoCs. This paper proposes a crossbar-based on-chip interconnection scheme, named CRAIS. CRAIS utilizes reconfigurable crossbar interconnections between microprocessors and intellectual property (IP) cores in an MPSoC. The hardware interconnection can be dynamically reconfigured during execution. Empirical results on an FPGA prototype demonstrate that CRAIS can achieve more than 7X speedup compared with the state-of-the-art StarNet approach, while utilizing only 21%~35% of StarNet's hardware resources.
    Adapting Memory Hierarchies for Emerging Datacenter Interconnects
    Tao Jiang, Rui Hou, Jian-Bo Dong, Lin Chai, Sally A. McKee, Bin Tian, Li-Xin Zhang, Ning-Hui Sun
    Journal of Data Acquisition and Processing, 2015, 30 (1): 97-109. 
    Abstract   PDF(2953KB) ( 2106 )  
    Efficient resource utilization requires that emerging datacenter interconnects support both high performance communication and efficient remote resource sharing. These goals require that the network be more tightly coupled with the CPU chips. Designing a new interconnection technology thus requires considering not only the interconnection itself, but also the design of the processors that will rely on it. In this paper, we study memory hierarchy implications for the design of high-speed datacenter interconnects, particularly as they affect remote memory access, and we use PCIe as the vehicle for our investigations. To that end, we build three complementary platforms: a PCIe-interconnected prototype server with which we measure and analyze current bottlenecks; a software simulator that lets us model microarchitectural and cache hierarchy changes; and an FPGA prototype system with a streamlined, switchless, customized protocol, Thunder, with which we study hardware optimizations outside the processor. We highlight several architectural modifications to better support remote memory access and communication, and quantify their impact and limitations.
    Improving the Performance and Energy Efficiency of Phase Change Memory Systems
    Qi Wang, Jia-Rui Li, Dong-Hui Wang
    Journal of Data Acquisition and Processing, 2015, 30 (1): 110-120. 
    Abstract   PDF(3271KB) ( 1260 )  
    Phase change memory (PCM) is a promising technology for future memory thanks to its better scalability and lower leakage power than DRAM (dynamic random-access memory). However, adopting PCM as main memory requires overcoming its write issues, such as long write latency and high write power. In this paper, we propose two techniques to improve the performance and energy efficiency of PCM memory systems. First, we propose a victim cache technique utilizing the existing buffer in the memory controller to reduce PCM memory accesses. The key idea is reorganizing the buffer into a victim cache structure (RBC) to provide additional hits for the LLC (last level cache). Second, we propose a chip parallelism-aware replacement policy (CPAR) for the victim cache to further improve performance. Instead of evicting one cache line at a time, CPAR evicts multiple cache lines that access different PCM chips. CPAR can reduce frequent victim cache evictions and improve the write parallelism of PCM chips. The evaluation results show that, compared with the baseline, RBC can improve PCM memory system performance by up to 9.4%, and by 5.4% on average. Combining CPAR with RBC (RBC+CPAR) can improve performance by up to 19.0%, and by 12.1% on average. Moreover, RBC and RBC+CPAR can reduce memory energy consumption by 8.3% and 6.6% on average, respectively.
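    The RBC/CPAR combination can be pictured as a small victim buffer in the memory controller that captures LLC evictions and, when it must make room, writes back a batch of victims chosen to target different PCM chips so the writes can proceed in parallel. The sketch below is only a behavioural toy model; the capacity, the address-to-chip mapping, and the oldest-first batch selection are assumptions, not the paper's controller design.

        # Toy model of a victim buffer (RBC-like) with a chip-parallelism-aware
        # replacement policy (CPAR-like): LLC victims are captured; when room
        # is needed, a batch of victims mapped to *different* PCM chips is
        # written back together so the writes can overlap.
        from collections import OrderedDict

        class VictimBuffer:
            def __init__(self, capacity=8, num_chips=4):
                self.capacity, self.num_chips = capacity, num_chips
                self.lines = OrderedDict()              # address -> data, oldest first
                self.pcm_writes = []                    # batches written back to PCM

            def chip_of(self, addr):
                return addr % self.num_chips            # assumed address interleaving

            def lookup(self, addr):                     # extra hit that avoids a PCM access
                return self.lines.pop(addr, None)

            def insert(self, addr, data):
                if len(self.lines) >= self.capacity:
                    self._evict_parallel_batch()
                self.lines[addr] = data

            def _evict_parallel_batch(self):
                batch, used_chips = [], set()
                for addr in list(self.lines):           # oldest lines first
                    chip = self.chip_of(addr)
                    if chip not in used_chips:          # at most one victim per chip
                        used_chips.add(chip)
                        batch.append((chip, addr, self.lines.pop(addr)))
                    if len(used_chips) == self.num_chips:
                        break
                self.pcm_writes.append(batch)           # written back in parallel

        if __name__ == "__main__":
            vb = VictimBuffer()
            for a in range(12):
                vb.insert(a, f"line{a}")
            print(vb.pcm_writes)   # each batch targets distinct chips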
    Computer Architecture and Systems
    A Survey of Phase Change Memory Systems
    Fei Xia, De-Jun Jiang, Jin Xiong, Ning-Hui Sun
    Journal of Data Acquisition and Processing, 2015, 30 (1): 121-144. 
    Abstract   PDF(2230KB) ( 3645 )  
    As applications scale up, the demand for main memory capacity increases in order to serve larger working sets. It is difficult for DRAM (dynamic random access memory) based memory systems to satisfy this memory capacity requirement due to DRAM's limited scalability and high energy consumption. Compared to DRAM, PCM (phase change memory) has better scalability, lower energy leakage, and non-volatility. PCM memory systems have become a hot topic of academic and industrial research. However, PCM technology has three drawbacks: long write latency, limited write endurance, and high write energy, which raise challenges to its adoption in practice. This paper surveys architectural research work on optimizing PCM memory systems. First, it introduces the background of PCM. Then, it surveys in detail research efforts on PCM memory systems in performance optimization, lifetime improvement, and energy saving, respectively. The paper also compares and summarizes these techniques along multiple dimensions. Finally, it summarizes these optimization techniques and discusses possible future research directions for PCM memory systems.
    Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture
    Fang Zheng, Hong-Liang Li, Hui Lv, Feng Guo, Xiao-Hong Xu, Xiang-Hui Xie
    Journal of Data Acquisition and Processing, 2015, 30 (1): 145-162. 
    Abstract   PDF(2342KB) ( 1784 )  
    Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features, with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of the CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on an FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of the CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS, and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
    Data Management and Data Mining
    Survey of Large-Scale Data Management Systems for Big Data Applications
    Lengdong Wu, Liyan Yuan, Jiahuai You
    Journal of Data Acquisition and Processing, 2015, 30 (1): 163-183. 
    Abstract   PDF(2181KB) ( 2084 )  
    Today, data is flowing into various organizations at an unprecedented scale. The ability to scale out for processing an enhanced workload has become an important factor for the proliferation and popularization of database systems. Big data applications demand and consequently lead to the development of diverse large-scale data management systems in different organizations, ranging from traditional database vendors to new emerging Internet-based enterprises. In this survey, we investigate, characterize, and analyze the large-scale data management systems in depth and develop comprehensive taxonomies for various critical aspects covering the data model, the system architecture, and the consistency model. We map the prevailing highly scalable data management systems to the proposed taxonomies, not only to classify the common techniques but also to provide a basis for analyzing current system scalability limitations. To overcome these limitations, we predict and highlight the principles that future efforts need to follow for the next generation of large-scale data management systems.
    Social Influence Study in Online Networks: A Three-Level Review
    Hui Li, Jiang-Tao Cui, Jian-Feng Ma
    Journal of Data Acquisition and Processing, 2015, 30 (1): 184-199. 
    Abstract   PDF(880KB) ( 1597 )  
    Social network analysis (SNA) views social relationships in terms of network theory, consisting of nodes and ties. Nodes are the individual actors within the networks; ties are the relationships between the actors. In the sequel, we will use the terms node and individual interchangeably. The relationship could be friendship, communication, trust, etc. These relationships and ties are driven by social influence, which is the most important phenomenon that distinguishes social networks from other networks. In this paper, we present an overview of the representative research work in social influence study. Those studies can be classified into three levels, namely the individual, community, and network levels. Throughout the study, we unveil a series of future research directions and possible applications based on the state of the art.
    Review Authorship Attribution in a Similarity Space
    Tie-Yun Qian, Bing Liu, Qing Li, Jianfeng Si
    Journal of Data Acquisition and Processing, 2015, 30 (1): 200-213. 
    Abstract   PDF(976KB) ( 2439 )  
    Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier using supervised learning. This approach has several issues which hurt its applicability. First, supervised learning needs a large set of documents from each author to serve as the training data. This can be difficult in practice. For example, in the online review domain, most reviewers (authors) only write a few reviews, which are not enough to serve as the training data. Second, the learned classifier cannot be applied to authors whose documents have not been used in training. In this article, we propose a novel solution to deal with these two problems. The core idea is that instead of learning in the original document space, we transform it to a similarity space. In the similarity space, learning naturally tackles both issues. Our experimental results based on online reviews and reviewers show that the proposed method outperforms the state-of-the-art supervised and unsupervised baseline methods significantly.
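    The similarity-space idea can be sketched as mapping each review to a vector of its similarities to every candidate author's profile and then classifying in that space. In the minimal Python sketch below, bag-of-words features, cosine similarity, and a nearest-profile decision rule are illustrative choices, not the paper's exact features or learner.

        # Minimal sketch of learning in a similarity space: each review is
        # mapped to a vector of similarities to every candidate author's
        # profile (built from that author's few known reviews), and the
        # decision is made on those similarity values.
        from collections import Counter
        import math

        def bow(text):
            return Counter(text.lower().split())

        def cosine(a, b):
            dot = sum(a[w] * b.get(w, 0) for w in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def author_profiles(training):
            """training: dict author -> list of documents; profile = merged word counts."""
            return {a: sum((bow(d) for d in docs), Counter()) for a, docs in training.items()}

        def to_similarity_space(doc, profiles):
            return {a: cosine(bow(doc), p) for a, p in profiles.items()}

        def attribute(doc, profiles):
            sims = to_similarity_space(doc, profiles)
            return max(sims, key=sims.get)          # nearest profile in similarity space

        if __name__ == "__main__":
            train = {"alice": ["great camera and lens", "lens quality is great"],
                     "bob":   ["terrible battery life", "battery died fast"]}
            profiles = author_profiles(train)
            print(attribute("the lens is great", profiles))   # -> alice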
    A Clustering Algorithm for Planning the Integration Process of a Large Number of Conceptual Schemas
    Carlo Batini, Paola Bonizzoni, Marco Comerio, Riccardo Dondi, Yuri Pirola, Francesco Salandra
    Journal of Data Acquisition and Processing, 2015, 30 (1): 214-224. 
    Abstract   PDF(1070KB) ( 1499 )  
    When tens or even hundreds of schemas are involved in the integration process, criteria are needed for choosing clusters of schemas to be integrated, so as to deal with the integration problem through an efficient iterative process. Schemas in clusters should be chosen according to cohesion and coupling criteria that are based on similarities and dissimilarities among schemas. In this paper, we propose an algorithm for a novel variant of the correlation clustering approach that addresses the problem of assisting a designer in integrating a large number of conceptual schemas. The novel variant introduces upper and lower bounds on the number of schemas in each cluster, in order to avoid too complex and too simple integration contexts, respectively. Since the problem is an NP-hard combinatorial problem, we give a heuristic for solving it. An experimental activity demonstrates an appreciable increase in the effectiveness of the schema integration process when clusters are computed by means of the proposed algorithm rather than manually defined by an expert.
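    A greedy heuristic in the same spirit, grouping schemas by pairwise similarity while keeping every cluster between a lower and an upper size bound, is sketched below; it only illustrates the size-bounded clustering setting and is not the paper's correlation-clustering variant.

        # Illustrative greedy heuristic for size-bounded clustering over a
        # schema similarity matrix: seed a cluster with the most similar
        # unassigned pair, grow it with the schema most similar to the cluster
        # until the upper bound is reached, and stop opening new clusters once
        # too few schemas remain; leftovers join the most similar cluster
        # (which may relax the upper bound).
        def greedy_bounded_clusters(sim, lower, upper):
            """sim: dict[(i, j)] -> similarity for i < j; returns a list of clusters."""
            ids = sorted({i for pair in sim for i in pair})
            unassigned, clusters = set(ids), []

            def s(a, b):
                return sim.get((min(a, b), max(a, b)), 0.0)

            while len(unassigned) >= max(lower, 2):
                i, j = max(((a, b) for a in unassigned for b in unassigned if a < b),
                           key=lambda p: s(*p))       # most similar remaining pair
                cluster = [i, j]
                unassigned -= {i, j}
                while unassigned and len(cluster) < upper:
                    best = max(unassigned, key=lambda c: sum(s(c, m) for m in cluster))
                    cluster.append(best)
                    unassigned.discard(best)
                clusters.append(cluster)
            for c in list(unassigned):                # assign leftovers
                best = max(clusters, key=lambda cl: sum(s(c, m) for m in cl))
                best.append(c)
            return clusters

        if __name__ == "__main__":
            sim = {(0, 1): .9, (0, 2): .8, (1, 2): .7, (3, 4): .9, (3, 5): .6, (4, 5): .8, (2, 3): .1}
            print(greedy_bounded_clusters(sim, lower=2, upper=3))  # -> [[0, 1, 2], [3, 4, 5]]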
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China

E-mail: info@sjcjycl.cn
 
  Copyright ©2015 JCST, All Rights Reserved