Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  Table of Contents
      05 July 2015, Volume 30 Issue 4
    Special Section on Data Management and Data Mining
    Preface
    Jian Pei
    Journal of Data Acquisition and Processing, 2015, 30 (4): 655-656. 
    Abstract   PDF(69KB) ( 769 )  
    We are in the era of Big Data, where data management and data mining take center stage. The importance of, and the challenges in, research and development on data management and mining cannot be overemphasized. To echo this emerging trend, the Journal of Data Acquisition and Processing made a strategic decision to organize a special section on the topic. I am very honored to have coordinated the selection process.
    ......
    Hetero-DB: Next Generation High-Performance Database Systems by Best Utilizing Heterogeneous Computing and Storage Resources
    Kai Zhang, Feng Chen, Xiaoning Ding, Yin Huai, Rubao Lee, Tian Luo, Kaibo Wang, Yuan Yuan, Xiaodong Zhang
    Journal of Data Acquisition and Processing, 2015, 30 (4): 657-678. 
    Abstract   PDF(1469KB) ( 1545 )  
    With recent advances in hardware technology, new general-purpose high-performance devices have been widely adopted, such as the graphics processing unit (GPU) and the solid state drive (SSD). A GPU may offer an order of magnitude higher throughput than a multicore CPU for applications with massive data parallelism. Moreover, the SSD, as a new storage device, offers much higher I/O throughput and lower latency than a traditional hard disk drive (HDD). These new hardware devices can significantly boost the performance of many applications; thus the database community has been actively engaged in adopting them into database systems. However, the performance benefit cannot be easily reaped if the new hardware is used improperly. In this paper, we propose Hetero-DB, a high-performance database system that exploits both the characteristics of the database system and the special properties of the new hardware devices in its design and implementation. Hetero-DB develops a GPU-aware query execution engine with GPU device memory management and a query scheduling mechanism to support concurrent query execution. Furthermore, with the SSD-HDD hybrid storage system, we redesign the storage engine by organizing the HDD and SSD into a two-level caching hierarchy. To best utilize the hybrid hardware devices, the semantic information that is critical for storage I/O is identified and passed to the storage manager, which has great potential to improve efficiency and performance. Hetero-DB aims to maximize the performance benefits of the GPU and SSD, and demonstrates the effectiveness of this approach for designing next-generation database systems.
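    As a rough illustration of the two-level caching idea above (not Hetero-DB's actual storage engine; the class and policy below are hypothetical, with a plain LRU standing in for the semantic-aware placement the paper describes), a minimal Python sketch:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Toy SSD-over-HDD cache: hot pages live on a size-limited SSD tier (LRU),
    and pages evicted from the SSD are demoted to the (unbounded) HDD tier."""

    def __init__(self, ssd_capacity):
        self.ssd = OrderedDict()   # page_id -> data, kept in LRU order
        self.hdd = {}              # page_id -> data
        self.ssd_capacity = ssd_capacity

    def read(self, page_id):
        if page_id in self.ssd:                 # SSD hit
            self.ssd.move_to_end(page_id)
            return self.ssd[page_id]
        data = self.hdd.pop(page_id)            # HDD hit: promote to SSD
        self._admit_to_ssd(page_id, data)
        return data

    def write(self, page_id, data):
        self._admit_to_ssd(page_id, data)

    def _admit_to_ssd(self, page_id, data):
        self.ssd[page_id] = data
        self.ssd.move_to_end(page_id)
        if len(self.ssd) > self.ssd_capacity:   # demote the coldest page
            cold_id, cold_data = self.ssd.popitem(last=False)
            self.hdd[cold_id] = cold_data

cache = TwoLevelCache(ssd_capacity=2)
for p in ["a", "b", "c"]:
    cache.write(p, f"page-{p}")
print(cache.read("a"))   # "a" was demoted to HDD and is promoted back on access
```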
    HAG: An Energy-Proportional Data Storage Scheme for Disk Array Systems
    Pei-Quan Jin, Xike Xie, Christian S. Jensen, Yong Jin, Li-Hua Yue
    Journal of Data Acquisition and Processing, 2015, 30 (4): 679-695. 
    Abstract   PDF(3873KB) ( 1206 )  
    Energy consumption has become a critical issue for data storage systems, especially for modern data centers. A recent survey shows that power costs amount to about 50% of the total cost of ownership in a typical data center, with about 27% of the system power being consumed by storage systems. This paper provides an effective solution for reducing the energy consumed by disk storage systems. Differing from previous approaches, we adopt two new designs: (1) we introduce a hotness-aware and group-based system model (HAG) to organize the disks, in which all disks are partitioned into a hot group and a cold group; files are migrated only between the two groups and never within a single group, which reduces the total cost of file migration; (2) we use an on-demand approach, driven by changes in the workload and in data hotness, to reorganize files among the disks. We conduct trace-driven experiments involving two real and nine synthetic traces and make detailed comparisons between our method and competing methods according to different metrics. The results show that our method can dynamically select hot files and disks when the workload changes and that it reduces energy consumption for all the traces. Furthermore, its time performance is comparable to that of the competing algorithms. In general, our method exhibits the best energy efficiency in all experiments, and it maintains an improved trade-off between performance and energy consumption.
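    A minimal sketch of the hot/cold grouping idea, assuming raw access counts as the hotness signal and a fixed hot fraction; the function and threshold are illustrative, not the paper's exact migration policy:

```python
def regroup(files, access_counts, hot_fraction=0.2):
    """Partition files into a hot group and a cold group by access count,
    and report only cross-group migrations (intra-group moves are avoided)."""
    ranked = sorted(files, key=lambda f: access_counts.get(f["name"], 0), reverse=True)
    n_hot = max(1, int(len(ranked) * hot_fraction))
    new_hot = {f["name"] for f in ranked[:n_hot]}

    migrations = []
    for f in files:
        was_hot = f["group"] == "hot"
        is_hot = f["name"] in new_hot
        if was_hot != is_hot:                      # only migrate across groups
            migrations.append((f["name"], f["group"], "hot" if is_hot else "cold"))
            f["group"] = "hot" if is_hot else "cold"
    return migrations

files = [{"name": "a.dat", "group": "cold"},
         {"name": "b.dat", "group": "hot"},
         {"name": "c.dat", "group": "cold"}]
print(regroup(files, access_counts={"a.dat": 90, "b.dat": 3, "c.dat": 10}))
# a.dat is promoted to the hot group, b.dat is demoted to the cold group
```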
    Mining Frequent Itemsets in Correlated Uncertain Databases
    Yong-Xin Tong, Lei Chen, Jieying She
    Journal of Data Acquisition and Processing, 2015, 30 (4): 696-712. 
    Abstract   PDF(930KB) ( 1442 )  
    Recently, with the growing popularity of the Internet of Things (IoT) and pervasive computing, a large amount of uncertain data, e.g., RFID data, sensor data, and real-time video data, has been collected. As one of the most fundamental issues in uncertain data mining, uncertain frequent pattern mining has attracted much attention in the database and data mining communities. Although there have been some solutions for uncertain frequent pattern mining, most of them assume that the data is independent, which is not true in most real-world scenarios. Therefore, current methods that rely on the independence assumption may generate inaccurate results for correlated uncertain data. In this paper, we focus on the problem of mining frequent itemsets over correlated uncertain data, where correlation can exist between any pair of uncertain data objects (transactions). We propose a novel probabilistic model, called the correlated frequent probability (CFP) model, to represent the probability distribution of support in a given correlated uncertain data set. Based on the distribution of support derived from the CFP model, we observe that some probabilistic frequent itemsets are frequent only in a few transactions with high positive correlation. To reduce such redundant frequent itemsets, we further propose a new type of pattern, called global probabilistic frequent itemsets, which identifies itemsets that remain frequent in each group of transactions when the whole correlated uncertain database is divided into disjoint groups based on correlation; such itemsets are more effective in eliminating the influence of noise and correlation in the data. To speed up the mining process, we also design a dynamic programming solution, as well as two pruning and bounding techniques. Extensive experiments on both real and synthetic datasets verify the effectiveness and efficiency of the proposed model and algorithms.
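    For intuition, the sketch below computes the distribution of an itemset's support under the simplifying assumption of independent transactions (a Poisson-binomial dynamic program); the CFP model generalizes exactly this distribution to correlated transactions, which the toy code does not capture:

```python
def support_distribution(probs):
    """probs[i] = probability that transaction i contains the itemset.
    Returns dist where dist[k] = P(support == k), assuming independence
    across transactions (Poisson-binomial dynamic program)."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)      # transaction does not contain the itemset
            new[k + 1] += mass * p        # transaction contains the itemset
        dist = new
    return dist

dist = support_distribution([0.9, 0.5, 0.4])
print(dist)           # P(support = 0), ..., P(support = 3)
print(sum(dist[2:]))  # P(support >= 2): the frequentness probability for min_sup = 2
```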
    A Structure Learning Algorithm for Bayesian Network Using Prior Knowledge
    Jun-Gang Xu, Yue Zhao, Jian Chen, Chao Han
    Journal of Data Acquisition and Processing, 2015, 30 (4): 713-724. 
    Abstract   PDF(1303KB) ( 1787 )  
    A Bayesian network is an important probabilistic graphical model that represents a set of random variables and their conditional dependencies. Bayesian networks have been increasingly used in a wide range of applications, such as natural language processing, hardware diagnostics, bioinformatics, statistical physics, and econometrics. Learning structure from data is one of the most fundamental tasks of Bayesian network research. In particular, learning the optimal structure of a Bayesian network is an NP-hard problem. To address this, many heuristic algorithms have been proposed, some of which learn the Bayesian network structure with the help of different types of prior knowledge. However, existing algorithms place restrictions on the prior knowledge, such as restrictions on its quality and on how it may be used, which makes it difficult to exploit prior knowledge well. In this paper, we introduce prior knowledge into the Markov chain Monte Carlo (MCMC) algorithm and propose a constrained MCMC (C-MCMC) algorithm to learn the structure of a Bayesian network. Three types of prior knowledge are defined: the existence of a parent node, the absence of a parent node, and distribution knowledge, including the conditional probability distribution (CPD) of edges and the probability distribution (PD) of nodes. All of these types of prior knowledge can be used easily in this algorithm. We conduct extensive experiments to demonstrate the feasibility and effectiveness of the proposed C-MCMC method.
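    A heavily simplified sketch of constrained structure search (generic Metropolis-Hastings with a placeholder score and no acyclicity check, so it is not the paper's C-MCMC; it only shows how existence/absence constraints can gate the proposals):

```python
import math, random

def constrained_mcmc(n, score, required, forbidden, iters=5000, temp=1.0):
    """Metropolis-Hastings over directed edge sets: propose toggling one edge and
    reject proposals that violate the prior knowledge (required = edges that must
    exist, forbidden = edges that must not). Acyclicity checks are omitted."""
    edges = set(required)                         # start from the required edges
    cur = best = score(edges)
    best_edges = set(edges)
    for _ in range(iters):
        u, v = random.sample(range(n), 2)
        cand = set(edges)
        cand ^= {(u, v)}                          # toggle edge u -> v
        if (u, v) in forbidden or not required <= cand:
            continue                              # violates prior knowledge
        cand_score = score(cand)
        if cand_score >= cur or random.random() < math.exp((cand_score - cur) / temp):
            edges, cur = cand, cand_score
            if cur > best:
                best, best_edges = cur, set(edges)
    return best_edges

# Toy score: reward edges of a "true" structure, penalize spurious ones.
truth = {(0, 1), (1, 2)}
score = lambda es: 2 * len(es & truth) - len(es - truth)
print(constrained_mcmc(4, score, required={(0, 1)}, forbidden={(2, 1)}))
```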
    VID Join: Mapping Trajectories to Points of Interest to Support Location-Based Services
    Shuo Shang, Kexin Xie, Kai Zheng, Jiajun Liu, Ji-Rong Wen
    Journal of Data Acquisition and Processing, 2015, 30 (4): 725-744. 
    Abstract   PDF(1162KB) ( 1239 )  
    Variable Influence Duration (VID) join is a novel spatio-temporal join operation between a set T of trajectories and a set P of spatial points. Here, trajectories are the travel histories of moving objects (e.g., travelers), and spatial points are POIs (e.g., restaurants). The VID join returns all pairs (τs, p) such that τs is spatially close to p for a long period of time, where τs is a segment of a trajectory τ∈T and p∈P. Each returned pair (τs, p) implies that the moving object associated with τs stayed at p (e.g., having dinner at a restaurant). Such information is useful in many settings, such as targeted advertising, social security, and social activity analysis. The concepts of influence and influence duration are introduced to measure the spatial closeness between τs and p and the time spanned, respectively. Compared with a conventional spatio-temporal join, the VID join is more challenging since the join condition varies for different POIs, and the additional temporal requirement cannot be indexed effectively. To process the VID join efficiently, three algorithms are developed and several optimization techniques are applied, including spatial duplication reuse and time-duration-based pruning. The performance of the developed algorithms is verified by extensive experiments on real spatial data.
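    The core join condition can be pictured with a toy duration check, assuming Euclidean distance and a fixed radius; the paper's influence model is richer and varies per POI:

```python
import math

def influence_duration(trajectory, poi, radius):
    """trajectory: list of (t, x, y) samples sorted by time.
    Returns the total time during which the object stays within `radius`
    of the POI (crude: an interval counts only if both endpoints qualify)."""
    px, py = poi
    near = [math.hypot(x - px, y - py) <= radius for _, x, y in trajectory]
    total = 0.0
    for i in range(len(trajectory) - 1):
        if near[i] and near[i + 1]:
            total += trajectory[i + 1][0] - trajectory[i][0]
    return total

traj = [(0, 0.0, 0.0), (60, 0.1, 0.0), (120, 0.1, 0.1), (180, 5.0, 5.0)]
print(influence_duration(traj, poi=(0.0, 0.0), radius=0.5))   # 120.0 seconds near the POI
```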
    Trip Oriented Search on Activity Trajectory
    Wei Chen, Lei Zhao, Jia-Jie Xu, Guan-Feng Liu, Kai Zheng, Xiaofang Zhou
    Journal of Data Acquisition and Processing, 2015, 30 (4): 745-761. 
    Abstract   PDF(714KB) ( 1326 )  
    Driven by the flourishing of location-based services, trajectory search has received significant attention in recent years. Different from existing studies that focus on searching trajectories with spatio-temporal information and text descriptions, we study the novel problem of searching trajectories with spatial distance, activities, and rating scores. Given a query q with a distance threshold, a set of activities, a start point S, and a destination E, trip-oriented search on activity trajectory (TOSAT) returns the k trajectories that cover the activities with the highest rating scores within the distance threshold. In addition, we extend the query with an order, i.e., order-sensitive trip-oriented search on activity trajectory (OTOSAT), which takes both the order of activities in the query q and the order of trajectories into consideration. It is very challenging to answer TOSAT and OTOSAT efficiently due to the structural complexity of trajectory data with rating information. To tackle the problem efficiently, we develop a hybrid index, the AC-tree, to organize trajectories. Moreover, an optimized variant, the RAC+-tree, and novel algorithms are introduced with the goal of achieving higher performance. Extensive experiments based on real trajectory datasets demonstrate that the proposed index structures and algorithms achieve high efficiency and scalability.
    Threshold-Based Shortest Path Query over Large Correlated Uncertain Graphs
    Yu-Rong Cheng, Ye Yuan, Lei Chen, Guo-Ren Wang
    Journal of Data Acquisition and Processing, 2015, 30 (4): 762-780. 
    Abstract   PDF(1405KB) ( 1332 )  
    With the popularity of uncertain data, queries over uncertain graphs have become a hot topic in the database community. As one of the important queries, the shortest path query over an uncertain graph has attracted much attention from researchers due to its wide applications. Although there are some efficient solutions addressing this problem, all existing models ignore an important property of uncertain graphs: the correlation among edges sharing the same vertex. In this paper, we apply Markov networks to model the hidden correlation in uncertain graphs and compute the shortest path. Unfortunately, calculating the shortest path and the corresponding probability over uncertain graphs modeled by Markov networks is a #P-hard problem. Thus, we propose a filtering-and-verification framework to accelerate the queries. In the filtering phase, we design a probabilistic shortest path index based on vertex cuts and blocks of the graph. We derive a series of upper bounds and prune the vertices and edges whose upper bounds on the shortest path probability are lower than the threshold. By carefully choosing the blocks and vertex cuts, the index is optimized to have the maximum pruning capability, so that we can filter out a large number of vertices that make no contribution to the final shortest path query results. In the verification phase, we develop an efficient sampling algorithm to determine the final query answers. Finally, we verify the efficiency and effectiveness of our solutions with extensive experiments.
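    The verification phase's sampling idea can be illustrated as follows, under the simplifying assumption of independent edge existence probabilities; the paper instead models correlation among edges sharing a vertex with a Markov network and applies index-based filtering before any sampling:

```python
import heapq, random

def dijkstra(adj, s, t):
    """Shortest path length from s to t in one sampled world, or inf if unreachable."""
    dist, heap = {s: 0.0}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def sp_probability(edges, s, t, threshold, samples=2000):
    """Estimate P(shortest s-t distance <= threshold) by sampling possible worlds.
    edges: list of (u, v, weight, existence_probability); independence assumed."""
    hits = 0
    for _ in range(samples):
        adj = {}
        for u, v, w, p in edges:
            if random.random() < p:
                adj.setdefault(u, []).append((v, w))
                adj.setdefault(v, []).append((u, w))   # undirected graph
        hits += dijkstra(adj, s, t) <= threshold
    return hits / samples

edges = [("A", "B", 2.0, 0.9), ("B", "C", 2.0, 0.8), ("A", "C", 5.0, 0.6)]
print(sp_probability(edges, "A", "C", threshold=4.5))  # about 0.72 (path A-B-C exists with prob 0.9*0.8)
```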
    On Efficient Aggregate Nearest Neighbor Query Processing in Road Networks
    Wei-Wei Sun, Chu-Nan Chen, Liang Zhu, Yun-Jun Gao, Yi-Nan Jing, Qing Li
    Journal of Data Acquisition and Processing, 2015, 30 (4): 781-798. 
    Abstract   PDF(662KB) ( 1649 )  
    An aggregate nearest neighbor (ANN) query returns the point of interest (POI) that minimizes an aggregate function over the distances to multiple query points. In this paper, we propose an efficient approach to ANN queries in road networks. Our approach consists of two phases: a searching phase and a pruning phase. In particular, we first continuously compute the nearest neighbors (NNs) of each query point in a specific order to obtain candidate POIs until all query points find a common POI. Next, we filter out the unqualified POIs based on the pruning strategy for the given aggregate function. The two-phase process is repeated until only one candidate POI remains, and that one is returned as the final result. In addition, we discuss partition strategies for query points and an approximate ANN query for the case where the number of query points is very large. Extensive experiments using real datasets demonstrate that our proposed approach outperforms its competitors significantly in most cases.
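    For illustration only, a brute-force baseline for the aggregate function itself (Euclidean distances instead of road-network distances, and none of the paper's searching or pruning phases):

```python
import math

def ann_query(pois, queries, agg=sum):
    """Return the POI minimizing agg(distance(poi, q) for q in queries).
    agg can be sum (minimize total travel) or max (minimize the worst case)."""
    def cost(p):
        return agg(math.dist(p, q) for q in queries)
    return min(pois, key=cost)

pois = [(0, 0), (5, 5), (10, 0)]
queries = [(1, 1), (9, 1), (5, 4)]
print(ann_query(pois, queries, agg=sum))   # POI with the smallest total distance
print(ann_query(pois, queries, agg=max))   # POI with the smallest maximum distance
```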
    Zip: An Algorithm Based on Loser Tree for Common Contacts Searching in Large Graphs
    Hong Tang, Shuai Mu, Jin Huang, Jia Zhu, Jian Chen, Rui Ding
    Journal of Data Acquisition and Processing, 2015, 30 (4): 799-809. 
    Abstract   PDF(1617KB) ( 1258 )  
    The problem of k-hop reachability between two vertices in a graph has received considerable attention in recent years. A substantial number of algorithms have been proposed with the goal of improving the efficiency of k-hop reachability search between two vertices. However, searching and traversal are challenging tasks, especially in large-scale graphs. Furthermore, existing algorithms are not satisfactory in terms of feasibility and scalability when applied to different kinds of graphs. In this work we propose a new algorithm, called Zip, that efficiently determines the common contacts between any two vertices in a large-scale graph. First, we describe a novel algorithm for constructing the graph index via binary search, which maintains the adjacency list of each vertex in sorted order. Second, we show how to obtain a sorted k-hop contact set by using a loser tree, a multi-way merge-sort structure. Finally, we develop an efficient algorithm for querying common contacts and an optimized strategy for k-hop contact set serialization. Experimental results on synthetic and real datasets show that the proposed Zip algorithm outperforms existing state-of-the-art algorithms (e.g., breadth-first search, GRAIL, and the graph stratification algorithm).
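    The merging step can be pictured with the standard library's heap-based k-way merge standing in for a loser tree (same asymptotics, different constants; this is not the paper's Zip implementation):

```python
import heapq

def k_hop_contacts(adj, source, k):
    """Vertices reachable from `source` within k hops, assuming every
    adjacency list in `adj` is already sorted (as Zip's index maintains)."""
    frontier, seen = [source], {source}
    for _ in range(k):
        nxt = []
        # Merge the sorted adjacency lists of the current frontier;
        # heapq.merge plays the role of the loser tree used in the paper.
        for w in heapq.merge(*(adj.get(v, []) for v in frontier)):
            if w not in seen:
                seen.add(w)
                nxt.append(w)
        frontier = nxt
    return sorted(seen - {source})

def common_contacts(adj, u, v, k=1):
    """Common k-hop contacts of u and v via sorted-set intersection."""
    return sorted(set(k_hop_contacts(adj, u, k)) & set(k_hop_contacts(adj, v, k)))

adj = {1: [2, 3, 4], 2: [1, 3, 5], 3: [1, 2], 4: [1, 5], 5: [2, 4]}
print(common_contacts(adj, 1, 5, k=1))   # [2, 4]
```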
    The Best Answers? Think Twice: Identifying Commercial Campaigns in the CQA Forums
    Cheng Chen, Kui Wu, Venkatesh Srinivasan, Kesav Bharadwaj R
    Journal of Data Acquisition and Processing, 2015, 30 (4): 810-828. 
    Abstract   PDF(1993KB) ( 1754 )  
    In an emerging trend, more and more Internet users search for information on Community Question and Answer (CQA) websites, as interactive communication on such websites gives users a rare feeling of trust. More often than not, end users look for instant help when they browse CQA websites for the best answers. Hence, it is imperative that they be warned of any potential commercial campaigns hidden behind the answers. Existing research focuses more on the quality of answers and does not meet this need. Textual similarity between questions and answers is widely used in previous research, but this feature is no longer effective when facing commercial paid posters. More context information, such as writing templates and a user's reputation track record, needs to be combined to form a new model that detects potential campaign answers. In this paper, we develop a system that automatically analyzes the hidden patterns of commercial spam and raises alarms instantaneously to end users whenever a potential commercial campaign is detected. Our detection method integrates semantic analysis with posters' track records and exploits features of CQA websites that are largely different from those of other types of forums such as microblogs or news reports. Our system is adaptive and accommodates new evidence uncovered by the detection algorithms over time. Validated with real-world trace data from a popular Chinese CQA website over a period of three months, our system shows great potential towards adaptive detection of CQA spam.
    Learning to Predict Links by Integrating Structure and Interaction Information in Microblogs
    Yan-Tao Jia, Yuan-Zhuo Wang, Xue-Qi Cheng
    Journal of Data Acquisition and Processing, 2015, 30 (4): 829-842. 
    Abstract   PDF(390KB) ( 1181 )  
    Link prediction in microblogs using unsupervised methods, which aims to find an appropriate similarity measure between users in the network, has been studied extensively in recent years. However, the measures used in existing work lack a simple way to incorporate both the structure of the network and the interactions between users. This leads to a gap between the predicted results and the ground truth; for example, the F1-measure achieved by the best such method is around 0.2. In this work, we first identify this gap and prove its existence. To narrow the gap, we define a retweet similarity to measure the interactions between users in Twitter, and propose a structural-interaction based matrix factorization model for following-link prediction. Experiments on real-world Twitter data show that our model outperforms state-of-the-art methods.
    Research on Trust Prediction from a Sociological Perspective
    Ying Wang, Xin Wang, Wan-Li Zuo
    Journal of Data Acquisition and Processing, 2015, 30 (4): 843-858. 
    Abstract   PDF(631KB) ( 1177 )  
    Trust, as a major part of human interactions, plays an important role in helping users collect reliable information and make decisions. However, in reality, user-specified trust relations are often very sparse and follow a power-law distribution; hence inferring unknown trust relations has attracted increasing attention in recent years. Social theories are frameworks of empirical evidence used to study and interpret social phenomena from a sociological perspective, while social networks reflect the correlations of users in the real world; hence, bringing the principles, rules, ideas, and methods of social theories into the analysis of social networks creates new opportunities for trust prediction. In this paper, we investigate how to exploit homophily and social status in trust prediction by modeling social theories. We first give several methods to compute the homophily coefficient and the status coefficient, then provide a principled way to model trust prediction mathematically, and propose a novel framework, hsTrust, which incorporates homophily theory and status theory. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the importance of homophily theory and status theory in trust prediction.
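    One plausible instantiation of the two coefficients (a Jaccard-style homophily score over trusted neighbors and an in-degree-based status signal); both are illustrative stand-ins, not the paper's definitions:

```python
def homophily_coefficient(trustees_u, trustees_v):
    """Jaccard-style overlap of the users that u and v already trust; a larger
    overlap suggests a higher chance of a trust relation between u and v."""
    u, v = set(trustees_u), set(trustees_v)
    return len(u & v) / len(u | v) if u | v else 0.0

def status_coefficient(in_degree_u, in_degree_v):
    """Crude status signal: 1.0 when v has higher status (more incoming trust)
    than u, so a trust link from u to v is consistent with status theory."""
    return 1.0 if in_degree_v > in_degree_u else 0.0

print(homophily_coefficient({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
print(status_coefficient(in_degree_u=2, in_degree_v=7))          # 1.0
```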
    Enhancing Time Series Clustering by Incorporating Multiple Distance Measures with Semi-Supervised Learning
    Jing Zhou, Shan-Feng Zhu, Xiaodi Huang, Yanchun Zhang
    Journal of Data Acquisition and Processing, 2015, 30 (4): 859-873. 
    Abstract   PDF(1447KB) ( 2227 )  
    Time series clustering is widely applied in various areas. Existing research focuses mainly on distance measures between two time series, such as DTW-based (dynamic time warping) methods, edit-distance-based methods, and shapelet-based methods. In this work, we experimentally demonstrate, for the first time, that no single distance measure performs significantly better than the others when clustering time series data sets with spectral clustering. A question therefore arises as to how to choose an appropriate measure for a given data set of time series. To answer this question, we propose an integration scheme that incorporates multiple distance measures using semi-supervised clustering. Our approach integrates all the measures by extracting the valuable underlying information for the clustering. To the best of our knowledge, this work demonstrates for the first time that a constraint-based semi-supervised clustering method can enhance time series clustering by combining multiple distance measures. Testing on the clustering of various time series data sets, we show that our method outperforms individual measures as well as typical integration approaches.
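    A minimal sketch of the integration idea: average two distance measures into one affinity matrix and run spectral clustering. The paper's scheme is semi-supervised and constraint-based, which is omitted here; equal-length series and the similarity scaling constant are simplifying assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def combined_clustering(series, n_clusters=2):
    n = len(series)
    # Two measures: DTW and plain Euclidean (assumes equal-length series).
    measures = [dtw, lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))]
    sims = []
    for d in measures:
        dist = np.array([[d(series[i], series[j]) for j in range(n)] for i in range(n)])
        sims.append(np.exp(-dist / (dist.std() + 1e-9)))   # distances -> similarities
    affinity = np.mean(sims, axis=0)                        # naive integration step
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(affinity)

series = [[0, 0, 1, 2], [0, 1, 1, 2], [5, 5, 4, 3], [5, 4, 4, 3]]
print(combined_clustering(series))   # e.g., [0 0 1 1] (label ids may be swapped)
```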
    Classifying Uncertain and Evolving Data Streams with Distributed Extreme Learning Machine
    Dong-Hong Han, Xin Zhang, Guo-Ren Wang
    Journal of Data Acquisition and Processing, 2015, 30 (4): 874-887. 
    Abstract   PDF(432KB) ( 1304 )  
    Conventional classification algorithms are not well suited for the inherent uncertainty, potential concept drift, volume, and velocity of streaming data. Specialized algorithms are needed to obtain efficient and accurate classifiers for uncertain data streams. In this paper, we first introduce Distributed Extreme Learning Machine (DELM), an optimization of ELM for large matrix operations over large datasets. We then present Weighted Ensemble Classifier Based on Distributed ELM (WE-DELM), an online and one-pass algorithm for efficiently classifying uncertain streaming data with concept drift. A probability world model is built to transform uncertain streaming data into certain streaming data. Base classifiers are learned using DELM. The weights of the base classifiers are updated dynamically according to classification results. WE-DELM improves both the efficiency in learning the model and the accuracy in performing classification. Experimental results show that WE-DELM achieves better performance on different evaluation criteria, including efficiency, accuracy, and speedup.
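    As background for the DELM building block, a single (non-distributed) extreme learning machine fits in a few lines: the random hidden-layer weights stay fixed and only the output weights are solved in closed form with a pseudoinverse. This is the standard ELM recipe, not the paper's distributed or ensemble machinery:

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid activations

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))  # random, never trained
        self.b = self.rng.normal(size=self.n_hidden)
        Y = np.eye(y.max() + 1)[y]                                   # one-hot targets
        self.beta = np.linalg.pinv(self._hidden(X)) @ Y              # closed-form output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [4, 4]])
y = np.array([0] * 50 + [1] * 50)
print((ELM().fit(X, y).predict(X) == y).mean())   # training accuracy, close to 1.0
```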
    Search Result Diversification Based on Query Facets
    Sha Hu, Zhi-Cheng Dou, Xiao-Jie Wang, Ji-Rong Wen
    Journal of Data Acquisition and Processing, 2015, 30 (4): 888-901. 
    Abstract   PDF(1370KB) ( 2030 )  
    In search engines, by issuing the same query, different users may search for different information. To satisfy more users with limited search results, search result diversification re-ranks the results to cover as many user intents as possible. Most existing intent-aware diversification algorithms recognize user intents as subtopics, each of which is usually a word, a phrase, or a piece of description. In this paper, we leverage query facets to understand user intents in diversification, where each facet contains a group of words or phrases that explain an underlying intent of a query. We generate subtopics based on query facets and propose faceted diversification approaches. Experimental results on the public TREC 2009 dataset show that our faceted approaches outperform state-of-the-art diversification models.
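    A simplified greedy re-ranker that trades relevance against covering not-yet-covered facets (a generic coverage heuristic with a hypothetical mixing weight, not the paper's faceted diversification models):

```python
def diversify(ranked_docs, facets, k=3, lam=0.5):
    """ranked_docs: list of (doc_id, relevance, set_of_terms), best first.
    facets: list of facets, each a set of words/phrases expressing one intent.
    Greedily pick k docs, scoring relevance plus coverage of uncovered facets."""
    covered, selected = set(), []
    pool = list(ranked_docs)
    for _ in range(min(k, len(pool))):
        def gain(doc):
            _, rel, terms = doc
            new_facets = sum(1 for i, f in enumerate(facets)
                             if i not in covered and terms & f)
            return lam * rel + (1 - lam) * new_facets
        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(best[0])
        covered |= {i for i, f in enumerate(facets) if best[2] & f}
    return selected

facets = [{"price", "cost"}, {"review", "rating"}, {"manual", "guide"}]
docs = [("d1", 0.9, {"price", "review"}),
        ("d2", 0.8, {"price", "cost"}),
        ("d3", 0.5, {"manual"})]
print(diversify(docs, facets, k=2))   # d1 first, then d3 adds an uncovered facet
```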
    Special Section on Social Media Processing — Part 1
    Preface
    Jie Tang, Xiao-Yan Zhu
    Journal of Data Acquisition and Processing, 2015, 30 (4): 902-902. 
    Abstract   PDF(87KB) ( 775 )  
    Leveraging Large Data with Weak Supervision for Joint Feature and Opinion Word Extraction
    Lei Fang, Biao Liu, Min-Lie Huang
    Journal of Data Acquisition and Processing, 2015, 30 (4): 903-916. 
    Abstract   PDF(3555KB) ( 938 )  
    Product feature and opinion word extraction is very important for fine-grained sentiment analysis. In this paper, we leverage large-scale unlabeled data for the joint extraction of feature and opinion words under a knowledge-poor setting, in which only a few feature-opinion pairs are used as weak supervision. Our major contributions are twofold: first, we propose a data-driven approach that represents product features and opinion words as lists of corpus-level syntactic relations, which captures rich language structures; second, we build a simple yet robust unsupervised model, with prior knowledge incorporated, to extract new feature and opinion words with robustly high performance. The extraction process is based on a bootstrapping framework which, to some extent, reduces error propagation over large data. Experimental results under various settings, compared with state-of-the-art baselines, demonstrate that our method is effective and promising.
    When Factorization Meets Heterogeneous Latent Topics: An Interpretable Cross-Site Recommendation Framework
    Xin Xin, Chin-Yew Lin, Xiao-Chi Wei, He-Yan Huang
    Journal of Data Acquisition and Processing, 2015, 30 (4): 917-932. 
    Abstract   PDF(1491KB) ( 1382 )  
    Data sparsity is a well-known challenge in recommender systems. Previous work alleviates this problem by incorporating information from within the corresponding social media site. In this paper, we address the challenge by exploring cross-site information. Specifically, we target two questions: 1) how to effectively and efficiently utilize cross-site ratings and content features to improve recommendation performance, and 2) how to make the recommendations interpretable by utilizing the content features. We propose a joint model of matrix factorization and latent topic analysis as the recommendation framework. In this model, heterogeneous content features are modeled by multiple kinds of latent topics, through which feature dimensionality reduction is conducted accurately to improve recommendation performance. In addition, the combination of matrix factorization and latent topics makes the recommendation results interpretable from many aspects. Therefore, the two issues above are solved simultaneously. Using a real-world dataset in which user behaviors on three social media sites are collected, we demonstrate that the proposed model is effective in improving recommendation performance and in interpreting the rationale of ratings.
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190, P.R. China
E-mail: info@sjcjycl.cn
Copyright ©2015 JCST, All Rights Reserved