Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Contents
      05 July 2016, Volume 31 Issue 4   
    Special Section on Data Management and Data Mining 2016
    Preface
    Jian Pei
    Journal of Data Acquisition and Processing, 2016, 31 (4): 635-636. 
    Abstract   PDF(64KB) ( 652 )  
    Dynamic Shortest Path Monitoring in Spatial Networks
    Shuo Shang, Lisi Chen, Zhe-Wei Wei, Dan-Huai Guo, Ji-Rong Wen
    Journal of Data Acquisition and Processing, 2016, 31 (4): 637-648. 
    Abstract   PDF(420KB) ( 1195 )  
    With the increasing availability of real-time traffic information, dynamic spatial networks are pervasive nowadays, and path planning in dynamic spatial networks has become an important issue. In this light, we propose and investigate a novel problem of dynamically monitoring shortest paths in spatial networks (DSPM query). When a traveler is heading to a destination, his/her shortest path to the destination may change for two reasons: 1) the travel costs of some edges have been updated, and 2) the traveler has deviated from the pre-planned path. Our target is to accelerate shortest path computation in dynamic spatial networks, and we believe that this study may be useful in many mobile applications, such as route planning and recommendation, car navigation and tracking, and location-based services in general. This problem is challenging for two reasons: 1) how to maintain and reuse existing computation results to accelerate subsequent computations, and 2) how to prune the search space effectively. To overcome these challenges, a filter-and-refinement paradigm is adopted. We maintain an expansion tree and define a pair of upper and lower bounds to prune the search space. A series of optimization techniques are developed to accelerate the shortest path computation. The performance of the developed methods is studied in extensive experiments based on real spatial data.
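As a minimal, illustrative sketch (not the authors' DSPM algorithm), the snippet below shows how the cost of a stale pre-planned path can serve as an upper bound that prunes a Dijkstra re-computation after an edge-cost update or a deviation; the graph layout and the bound interface are assumptions made for illustration.

```python
import heapq

def dijkstra_with_upper_bound(graph, source, target, upper_bound=float("inf")):
    """Plain Dijkstra, but nodes whose tentative distance already exceeds
    the cost of the previously planned path (upper_bound) are pruned."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd >= upper_bound:          # prune: cannot beat the known path
                continue
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[target]

# Usage: when an edge cost changes or the traveler deviates, re-plan from the
# current position, using the remaining cost of the old path as the bound.
graph = {"a": [("b", 2.0), ("c", 5.0)], "b": [("c", 1.0)], "c": []}
old_remaining_cost = 5.0   # cost of the stale pre-planned path (assumed known)
path, cost = dijkstra_with_upper_bound(graph, "a", "c", upper_bound=old_remaining_cost)
print(path, cost)          # ['a', 'b', 'c'] 3.0
```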
    Mining Object Similarity for Predicting Next Locations
    Meng Chen, Xiaohui Yu, Yang Liu
    Journal of Data Acquisition and Processing, 2016, 31 (4): 649-660. 
    Abstract   PDF(458KB) ( 1241 )  
    Next location prediction is of great importance for many location-based applications. With the virtue of solid theoretical foundations, Markov-based approaches have gained success along this direction. In this paper, we seek to enhance the prediction performance by understanding the similarity between objects. In particular, we propose a novel method, called the weighted Markov model (weighted-MM), which exploits both the sequence of just-passed locations and the object similarity in mining the mobility patterns. To this end, we first train a Markov model for each object with its own trajectory records, and then quantify the similarities between different objects from two aspects: spatial locality similarity and trajectory similarity. Finally, we incorporate the object similarity into the Markov model by treating the similarity as the weight of the probability of reaching each possible next location, and return the top-ranked locations as results. We have conducted extensive experiments on a real dataset, and the results demonstrate significant improvements in prediction accuracy over existing solutions.
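The sketch below illustrates the general idea of weighting per-object Markov transition probabilities by object similarity when ranking candidate next locations; the Jaccard-style similarity and the data layout are illustrative assumptions, not the paper's exact weighted-MM formulation.

```python
from collections import defaultdict

def train_markov(trajectory):
    """First-order Markov transition probabilities from one object's trajectory."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(trajectory, trajectory[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def predict_next(target, current_loc, trajectories, similarity, top_k=3):
    """Score each candidate next location by a similarity-weighted sum of the
    transition probabilities from all objects' Markov models."""
    scores = defaultdict(float)
    for obj, traj in trajectories.items():
        w = 1.0 if obj == target else similarity(trajectories[target], traj)
        model = train_markov(traj)
        for nxt, p in model.get(current_loc, {}).items():
            scores[nxt] += w * p
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def jaccard(t1, t2):
    """Toy spatial-locality similarity: overlap of visited locations."""
    s1, s2 = set(t1), set(t2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

trajectories = {"u1": ["A", "B", "C", "B", "D"],
                "u2": ["A", "B", "D", "B", "C"]}
print(predict_next("u1", "B", trajectories, jaccard))
```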
    Web News Extraction via Tag Path Feature Fusion Using DS Theory
    Gong-Qing Wu, Lei Li, Li Li, Xindong Wu
    Journal of Data Acquisition and Processing, 2016, 31 (4): 661-672. 
    Abstract   PDF(1909KB) ( 1348 )  
    Contents, layout styles, and parse structures of web news pages differ greatly from one page to another. In addition, the layout style and the parse structure of a web news page may change from time to time. For these reasons, designing features with excellent extraction performance for massive and heterogeneous web news pages is a challenging issue. Our extensive case studies indicate that there is a potential relevancy between web content layouts and their tag paths. Inspired by this observation, we design a series of tag path extraction features to extract web news. Because each feature has its own strength, we fuse all these features with the DS (Dempster-Shafer) evidence theory, and then design a content extraction method, CEDS. Experimental results on both CleanEval datasets and web news pages selected randomly from well-known websites show that the F1-score of CEDS is 8.08% and 3.08% higher than those of the existing popular content extraction methods CETR and CEPR-TPR, respectively.
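For readers unfamiliar with DS theory, the sketch below shows Dempster's rule of combination for two mass functions over a simple frame {content, noise}; the frame and mass values are illustrative, not taken from CEDS.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule: combine two basic probability assignments whose focal
    elements are frozensets over the same frame of discernment."""
    combined = {}
    conflict = 0.0
    for (a, w1), (b, w2) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + w1 * w2
        else:
            conflict += w1 * w2            # mass assigned to disjoint hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two tag-path features each give (hypothetical) evidence that a text node is content.
frame = frozenset({"content", "noise"})
m_feature1 = {frozenset({"content"}): 0.7, frame: 0.3}   # 0.3 left uncommitted
m_feature2 = {frozenset({"content"}): 0.6, frozenset({"noise"}): 0.1, frame: 0.3}
print(dempster_combine(m_feature1, m_feature2))
```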
    Online Feature Selection of Class Imbalance via PA Algorithm
    Chao Han, Yun-Kun Tan, Jin-Hui Zhu, Yong Guo, Jian Chen, Qing-Yao Wu
    Journal of Data Acquisition and Processing, 2016, 31 (4): 673-682. 
    Abstract   PDF(395KB) ( 1320 )  
    Imbalance classification techniques have been frequently applied in many machine learning application domains where the number of instances in the majority (or positive) class of a dataset is much larger than that in the minority (or negative) class. Meanwhile, feature selection (FS) is one of the key techniques for high-dimensional classification tasks, since it can greatly improve classification performance and computational efficiency. However, most studies of feature selection and imbalance classification are restricted to off-line batch learning, which is not well adapted to some practical scenarios. In this paper, we aim to solve the high-dimensional imbalanced classification problem accurately and efficiently with only a small number of active features in an online fashion, and we propose two novel online learning algorithms for this purpose. In our approach, a classifier which involves only a small and fixed number of features is constructed to classify a sequence of imbalanced data received in an online manner. We formulate the construction of such an online learner as an optimization problem and use an iterative approach to solve it based on the passive-aggressive (PA) algorithm as well as a truncated gradient (TG) method. We evaluate the performance of the proposed algorithms on several real-world datasets, and our experimental results demonstrate the effectiveness of the proposed algorithms in comparison with the baselines.
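A minimal sketch of the mechanism described above: a passive-aggressive style hinge-loss update followed by truncation so that only a fixed budget of features stays active; the class weighting and parameter choices are illustrative assumptions and do not reproduce the authors' exact algorithm.

```python
import numpy as np

def pa_truncated_online(stream, dim, budget=10, C=1.0, pos_weight=1.0):
    """Online binary classifier keeping at most `budget` non-zero weights.
    `stream` yields (x, y) with x a dense numpy vector and y in {-1, +1}."""
    w = np.zeros(dim)
    for x, y in stream:
        margin = y * np.dot(w, x)
        loss = max(0.0, 1.0 - margin)
        if loss > 0.0:
            # PA-I style step size, optionally weighting one class more heavily.
            cost = C * (pos_weight if y > 0 else 1.0)
            tau = min(cost, loss / (np.dot(x, x) + 1e-12))
            w += tau * y * x
            # Truncation: keep only the `budget` largest-magnitude weights.
            if np.count_nonzero(w) > budget:
                keep = np.argsort(np.abs(w))[-budget:]
                mask = np.zeros(dim, dtype=bool)
                mask[keep] = True
                w[~mask] = 0.0
    return w

# Toy usage on random data where only the first three features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.sign(X[:, 0] + X[:, 1] - X[:, 2])
w = pa_truncated_online(zip(X, y), dim=50, budget=3)
print(np.nonzero(w)[0])   # ideally a subset of {0, 1, 2}
```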
    Efficient Set-Correlation Operator Inside Databases
    Fei Gao, Shao-Xu Song, Lei Chen, Jian-Min Wang
    Journal of Data Acquisition and Processing, 2016, 31 (4): 683-701. 
    Abstract   PDF(1251KB) ( 819 )  
    Short text records, such as news highlights, scientific paper citations, and posted messages in discussion forums, are now prevalent at large scale and are often stored as set records in hidden-Web databases. Many interesting information retrieval tasks correspondingly arise from correlation queries over these short text records, such as finding hot topics over news highlights and searching for related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly support set correlation queries. Thus, in this paper, we address both the effectiveness and the efficiency issues of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. We therefore propose a novel correlation coefficient that extends Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies for setting the correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that with a proper setting of the filtering threshold, we can improve the query efficiency with little loss of effectiveness. Finally, we conduct extensive experiments to show the effectiveness and the efficiency of the proposed correlation query and optimization strategies.
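As a concrete illustration of token-level Pearson's correlation over set records, the sketch below computes the phi coefficient (Pearson's correlation of binary occurrence indicators) for every token pair. It is a plain-Python analogue of what could be expressed in SQL; the record layout is an assumption and the paper's extended coefficient is not reproduced here.

```python
from itertools import combinations
from math import sqrt

def token_pearson(records):
    """Pearson's correlation (phi coefficient) between binary token-occurrence
    indicators across a collection of set records."""
    n = len(records)
    vocab = sorted({t for r in records for t in r})
    occur = {t: sum(1 for r in records if t in r) for t in vocab}
    corr = {}
    for a, b in combinations(vocab, 2):
        n_ab = sum(1 for r in records if a in r and b in r)
        num = n * n_ab - occur[a] * occur[b]
        den = sqrt(occur[a] * (n - occur[a]) * occur[b] * (n - occur[b]))
        corr[(a, b)] = num / den if den else 0.0
    return corr

records = [{"storm", "flood", "rain"},
           {"storm", "rain"},
           {"election", "vote"},
           {"election", "vote", "storm"}]
for pair, c in sorted(token_pearson(records).items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(c, 3))
```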
    Keyword Query over Error-Tolerant Knowledge Bases
    Yu-Rong Cheng, Ye Yuan, Jia-Yu Li, Lei Chen, Guo-Ren Wang
    Journal of Data Acquisition and Processing, 2016, 31 (4): 702-719. 
    Abstract   PDF(604KB) ( 886 )  
    With more and more knowledge available on the WWW, querying and mining knowledge bases have attracted much research attention. Among all the queries over knowledge bases, which are usually modelled as graphs, a keyword query is the most widely used one. Although the problem of keyword query over graphs has been studied in depth for years, knowledge bases, as special error-tolerant graphs, cause the results of traditionally defined keyword queries to fall short of users' expectations. Thus, in this paper, we define a new keyword query, called the confident r-clique, specifically for knowledge bases; it builds on the r-clique definition for keyword query on general graphs, which has been shown to be the best one. However, as we prove in the paper, finding the confident r-cliques is #P-hard. We propose a filtering-and-verification framework to improve the search efficiency. In the filtering phase, we develop the tightest upper bound of the confident r-clique, and design an index together with its search algorithm, which scales well to large knowledge bases. In the verification phase, we develop an efficient sampling method to verify the final answers from the candidates remaining after the filtering phase. Extensive experiments demonstrate that the results derived from our new definition satisfy users' requirements better than the traditional r-clique definition, and that our algorithms are efficient.
    Determining the Real Data Completeness of a Relational Dataset
    Yong-Nan Liu, Jian-Zhong Li, Zhao-Nian Zou
    Journal of Data Acquisition and Processing, 2016, 31 (4): 720-740. 
    Abstract   PDF(556KB) ( 987 )  
    Low data quality is a serious problem in the new era of big data; it can severely reduce the usability of data, mislead or bias querying, analyzing, and mining, and lead to huge losses. Incomplete data is common in low-quality data, and it is necessary to determine the data completeness of a dataset to provide hints for follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work views all missing values as unknown values. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values from other tuples and to identify the attribute cells that are truly missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. Two optimal algorithms to determine the data completeness of a dataset in different cases are proposed. We empirically show the effectiveness and the scalability of our algorithms on both real-world data and synthetic data.
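The sketch below illustrates the core idea in a minimal form: for a functional dependency X→A, a null in A is recoverable whenever another tuple agrees on X and carries the A value, and only the remaining nulls count as truly missing. The relation layout and FD format are assumptions for illustration, not the paper's model or algorithms.

```python
def real_completeness(tuples, fds):
    """tuples: list of dicts with None for missing cells.
    fds: list of (lhs_attributes, rhs_attribute) functional dependencies.
    Returns (completeness ratio, number of truly missing cells)."""
    attrs = sorted({a for t in tuples for a in t})
    total = len(tuples) * len(attrs)
    truly_missing = 0
    for i, t in enumerate(tuples):
        for a in attrs:
            if t.get(a) is not None:
                continue
            determinable = False
            for lhs, rhs in fds:
                if rhs != a or any(t.get(x) is None for x in lhs):
                    continue
                # look for another tuple agreeing on the LHS with a known RHS value
                for j, s in enumerate(tuples):
                    if j != i and s.get(a) is not None and \
                       all(s.get(x) == t.get(x) for x in lhs):
                        determinable = True
                        break
                if determinable:
                    break
            if not determinable:
                truly_missing += 1
    return (total - truly_missing) / total, truly_missing

tuples = [{"zip": "100190", "city": "Beijing", "street": "Zhongguancun"},
          {"zip": "100190", "city": None,      "street": "Haidian Rd"},
          {"zip": "200030", "city": None,      "street": None}]
print(real_completeness(tuples, fds=[(("zip",), "city")]))
```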
    Content-Related Repairing of Inconsistencies in Distributed Data
    Yue-Feng Du, De-Rong Shen, Tie-Zheng Nie, Yue Kou, Ge Yu
    Journal of Data Acquisition and Processing, 2016, 31 (4): 741-758. 
    Abstract   PDF(1030KB) ( 834 )  
    Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies, but they may miss some potential inconsistencies because they do not consider the content relationship of the data. Content-related conditional functional dependencies (CCFDs) are a special type of CFDs that combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing are interactive: 1) detection catches inconsistencies, and 2) repairing corrects the caught inconsistencies but may introduce new ones. Besides, data are often fragmented and distributed across multiple sites, so cleaning inconsistencies incurs expensive data shipment. In this paper, our aim is to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistency detection method and an inconsistency repairing method, which work iteratively. The detection method marks the violated CCFDs to determine which inconsistencies should be repaired preferentially. Based on the repairing-cost model presented in this paper, we prove that minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method heuristically repairs the inconsistencies with minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rule sequences. Distinct values require less data shipment for communication than real data. Rule sequences determine appropriate repairing orders to avoid some incorrect repairs. Our solution is proved to be more effective than CFDs by empirical evaluation on two real-life datasets.
    Budget Allocation for Maximizing Viral Advertising in Social Networks
    Bo-Lei Zhang, Zhu-Zhong Qian, Wen-Zhong Li, Bin Tang, Sang-Lu Lu, Xiaoming Fu
    Journal of Data Acquisition and Processing, 2016, 31 (4): 759-775. 
    Abstract   PDF(890KB) ( 1154 )  
    Viral advertising in social networks has arisen as one of the most promising ways to increase brand awareness and product sales. By distributing a limited budget, we can incentivize a set of users as initial adopters so that the advertising can start from the initial adopters and spread via social links to become viral. Despite extensive research into how to target the most influential users, a key issue is often neglected: how to incentivize the initial adopters. In the problem of influence maximization, the assumption is that each user has a fixed cost for being an initial adopter, while in practice, user decisions on accepting the budget to become initial adopters are often probabilistic rather than deterministic. In this paper, we study optimal budget allocation in social networks to maximize the spread of viral advertising. In particular, a concave probability model is introduced to characterize each user's utility for being an initial adopter. Under this model, we show that it is NP-hard to find an optimal budget allocation for maximizing the spread of viral advertising. We then present a novel discrete greedy algorithm with near-optimal performance, and further propose scaling-up techniques to improve the time efficiency of our algorithm. Extensive experiments on real-world social graphs are conducted to validate the effectiveness of our algorithm in practice. The results show that our algorithm can significantly outperform other intuitive heuristics in almost all cases.
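A minimal sketch of a discrete greedy allocation under a concave adoption-probability model: each unit of budget goes to the user with the largest marginal gain in expected seeded value. The probability curve p_u(b) = 1 - exp(-a_u b) and the per-user influence values are illustrative assumptions, not the paper's model or its scaling-up techniques.

```python
import heapq
from math import exp

def greedy_budget_allocation(users, total_budget, step=1.0):
    """users: dict user -> (a, influence); adoption probability is concave,
    p(b) = 1 - exp(-a * b), so expected value is p(b) * influence.
    Greedily hands out `step`-sized budget units by largest marginal gain."""
    def value(a, infl, b):
        return (1.0 - exp(-a * b)) * infl

    alloc = {u: 0.0 for u in users}
    # max-heap (as negated gains) of the marginal gain of the next budget unit
    heap = [(-(value(a, infl, step) - value(a, infl, 0.0)), u)
            for u, (a, infl) in users.items()]
    heapq.heapify(heap)
    remaining = total_budget
    while remaining >= step and heap:
        gain, u = heapq.heappop(heap)
        a, infl = users[u]
        # lazy re-evaluation: re-check the gain against the current allocation
        current = value(a, infl, alloc[u] + step) - value(a, infl, alloc[u])
        if current + 1e-12 < -gain:
            heapq.heappush(heap, (-current, u))
            continue
        alloc[u] += step
        remaining -= step
        nxt = value(a, infl, alloc[u] + step) - value(a, infl, alloc[u])
        heapq.heappush(heap, (-nxt, u))
    return alloc

users = {"u1": (0.5, 100.0), "u2": (0.2, 300.0), "u3": (1.0, 40.0)}
print(greedy_budget_allocation(users, total_budget=10))
```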
    HUITWU: An Efficient Algorithm for High-Utility Itemset Mining in Transaction Databases
    Shi-Ming Guo, Hong Gao
    Journal of Data Acquisition and Processing, 2016, 31 (4): 776-786. 
    Abstract   PDF(569KB) ( 1027 )  
    Mining high-utility itemsets (HUIs) from a transaction database refers to the discovery of itemsets with high utilities such as profits. Most existing studies discover HUIs from a transaction database in two phases. In phase 1, different overestimation methods are applied to calculate upper bounds on the utilities of itemsets. Since the overestimated utilities are adopted, the itemsets whose overestimated utilities are no less than a user-specified threshold are selected as candidate HUIs, and they are verified by scanning the database one more time in phase 2. However, a large number of candidate HUIs incur two problems: 1) excessive memory is required to store these candidates, and 2) a large amount of running time is needed to calculate their exact utilities. The vertical data format has recently been applied to mine HUIs. However, this kind of method cannot deal effectively with transactions containing the same items, so the database size cannot be reduced sufficiently and the overall performance of the algorithms is degraded. Thus, an algorithm called HUITWU is proposed in this paper for mining HUIs. A novel data structure, HUITWU-Tree, is adopted to efficiently calculate the utilities of itemsets in a database. Extensive studies on both sparse and dense datasets have demonstrated that our proposed algorithm is more than an order of magnitude faster and consumes less memory than the state-of-the-art algorithms.
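For concreteness, the sketch below computes the exact utility of an itemset in a transaction database and the transaction-weighted utilization (TWU) upper bound that phase-1 style algorithms prune with; the data layout is an assumption, and the code does not reproduce the HUITWU-Tree structure.

```python
def itemset_utility(itemset, transactions, profit):
    """Exact utility: over transactions containing the whole itemset, sum
    quantity * unit profit for each item in the itemset."""
    total = 0
    for trans in transactions:               # trans: dict item -> quantity
        if all(i in trans for i in itemset):
            total += sum(trans[i] * profit[i] for i in itemset)
    return total

def twu(itemset, transactions, profit):
    """Transaction-weighted utilization: an upper bound on the utility of the
    itemset and all its supersets (sum of full-transaction utilities)."""
    bound = 0
    for trans in transactions:
        if all(i in trans for i in itemset):
            bound += sum(q * profit[i] for i, q in trans.items())
    return bound

transactions = [{"milk": 2, "bread": 1}, {"milk": 1, "beer": 6},
                {"bread": 3, "beer": 2, "milk": 1}]
profit = {"milk": 3, "bread": 2, "beer": 5}
print(itemset_utility({"milk", "beer"}, transactions, profit))  # 46
print(twu({"milk", "beer"}, transactions, profit))              # 52
```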
    Computer Graphics and Multimedia
    A Survey of Visual Analytic Pipelines
    Xu-Meng Wang, Tian-Ye Zhang, Yu-Xin Ma, Jing Xia, Wei Chen
    Journal of Data Acquisition and Processing, 2016, 31 (4): 787-804. 
    Abstract   PDF(5040KB) ( 2069 )  
    Visual analytics has been widely studied in the past decade. One key to making visual analytics practical for both research and industrial applications is the appropriate definition and implementation of the visual analytics pipeline, which provides effective abstractions for designing and implementing visual analytics systems. In this paper we review previous work on visual analytics pipelines and individual modules from multiple perspectives: data, visualization, model, and knowledge. For each module, we discuss various representations and descriptions of pipelines inside the module, and compare the commonalities and the differences among them.
    Computer Architecture and Systems
    Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization
    Bing Zhou, Jiang-Tao Wen
    Journal of Data Acquisition and Processing, 2016, 31 (4): 805-819. 
    Abstract   PDF(528KB) ( 833 )  
    We describe a data deduplication system for backup storage of PC disk images, named in-RAM metadata utilizing deduplication (IR-MUD). In-RAM hash granularity adaptation and miniLZO-based data compression are first proposed to reduce the in-RAM metadata size and thereby reduce the space overheads required by the in-RAM metadata caches. Second, an in-RAM metadata write cache, as opposed to the traditional metadata read cache, is proposed for further reducing metadata-related disk I/O operations and improving deduplication throughput. During deduplication, the metadata write cache is managed following the LRU caching policy. For each manifest that is hit in the metadata write cache, an expensive manifest reloading operation from the disk is avoided. After deduplication, all the manifests in the metadata write cache are cleared and stored on the disk. Our experimental results using a 1.5 TB real-world disk image dataset show that 1) IR-MUD achieves about a 95% size reduction for the deduplication metadata, with a small time overhead introduced, 2) when the metadata write cache is not utilized, with the same RAM space size for the metadata read cache, IR-MUD achieves a 400% higher RAM hit ratio and a 50% higher deduplication throughput than the classic Sparse Indexing deduplication system, which uses no metadata utilization approaches, and 3) when the metadata write cache is utilized and enough RAM space is available, IR-MUD achieves a 500% higher RAM hit ratio than Sparse Indexing and a 70% higher deduplication throughput than IR-MUD with only a single metadata read cache. The in-RAM metadata harnessing and metadata write caching approaches of IR-MUD can be applied in most parallel deduplication systems to improve metadata caching efficiency.
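A minimal sketch of an LRU-managed metadata write cache for manifests in the spirit described above: hits avoid reloading a manifest from disk, and everything left in the cache is flushed to disk after the deduplication run. The on-disk interface is a stub assumption, not IR-MUD's implementation.

```python
from collections import OrderedDict

class ManifestWriteCache:
    """LRU write cache for deduplication manifests. A hit avoids an expensive
    manifest reload from disk; evicted and remaining manifests are written out."""
    def __init__(self, capacity, load_from_disk, store_to_disk):
        self.capacity = capacity
        self.load = load_from_disk      # manifest_id -> manifest (expensive)
        self.store = store_to_disk      # (manifest_id, manifest) -> None
        self.cache = OrderedDict()

    def get(self, manifest_id):
        if manifest_id in self.cache:               # hit: no disk reload
            self.cache.move_to_end(manifest_id)
            return self.cache[manifest_id]
        manifest = self.load(manifest_id)           # miss: load from disk
        self.cache[manifest_id] = manifest
        if len(self.cache) > self.capacity:         # evict LRU entry to disk
            old_id, old_manifest = self.cache.popitem(last=False)
            self.store(old_id, old_manifest)
        return manifest

    def flush(self):
        """After deduplication, persist and clear all cached manifests."""
        for manifest_id, manifest in self.cache.items():
            self.store(manifest_id, manifest)
        self.cache.clear()

# Toy usage with in-memory stand-ins for the disk.
disk = {"m1": ["h1", "h2"], "m2": ["h3"]}
cache = ManifestWriteCache(1, disk.__getitem__, disk.__setitem__)
cache.get("m1"); cache.get("m1")   # second call is a cache hit
cache.get("m2")                    # evicts m1 back to disk
cache.flush()
```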
    A Data Deduplication Framework of Disk Images with Adaptive Block Skipping
    Bing Zhou, Jiang-Tao Wen
    Journal of Data Acquisition and Processing, 2016, 31 (4): 820-835. 
    Abstract   PDF(534KB) ( 710 )  
    We describe an efficient and easily applicable data deduplication framework with heuristic-prediction-based adaptive block skipping for real-world datasets such as disk images, which saves deduplication-related overheads and improves deduplication throughput while maintaining good deduplication efficiency. Under the framework, deduplication operations are skipped for data chunks determined as likely non-duplicates via heuristic prediction, in conjunction with a hit-and-matching extension process for duplication identification within skipped blocks and a hysteresis-mechanism-based hash indexing process to update the hash indices for re-encountered skipped chunks. For performance evaluation, the proposed framework was integrated and implemented in the existing Data Domain and Sparse Indexing deduplication algorithms. The experimental results based on a real-world dataset of 1.0 TB disk images show that the deduplication-related overheads were significantly reduced with adaptive block skipping, leading to a 30%-80% improvement in deduplication throughput when deduplication metadata were stored on the disk for Data Domain, and 25%-40% RAM space saving with a 15%-20% improvement in deduplication throughput when an in-RAM sparse index was used in Sparse Indexing. In both cases, the corresponding reductions in deduplication ratio were below 5%.
    Reducing Synchronization Cost for Single-Level Store in Mobile Systems
    Yuan-Chao Xu, Hu Wan, Ke-Ni Qiu, Tao Li, Wei-Gong Zhang
    Journal of Data Acquisition and Processing, 2016, 31 (4): 836-848. 
    Abstract   PDF(2794KB) ( 961 )  
    Emerging byte-addressable non-volatile memory technologies, such as phase change memory (PCM) and spin-transfer torque RAM (STT-RAM), offer both the byte-addressability of memory and the durability of storage, thus making it feasible to build single-level store systems. Ensuring the consistency of persistent data structures in the presence of power failures or system crashes requires flushing cache lines to persistent memory frequently, which incurs non-trivial synchronization overhead. To mitigate this issue, we propose two techniques. First, we use non-volatile STT-RAM as on-chip scratchpad memory to store recovery information, thereby eliminating the synchronization cost in the logging phase by avoiding off-chip logging operations. Second, we present an adaptive synchronization policy based on caching modes in terms of data access patterns, thereby eliminating unnecessary synchronization cost in the checkpoint phase. Evaluation results indicate that the two techniques improve the overall performance by 2.15x to 2.39x compared with conventional transactional persistent memory.
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China

E-mail: info@sjcjycl.cn
 
  Copyright ©2015 JCST, All Rights Reserved