Bimonthly    Since 1986
ISSN 1004-9037
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Distributed by:
China: All Local Post Offices
  • Table of Contents
      05 July 2021, Volume 36 Issue 4   
    Special Section on AI4DB and DB4AI
    Guo-Liang Li, Dong Deng, Ju Fan
    Journal of Data Acquisition and Processing, 2021, 36 (4): 719-720. 
    COLIN: A Cache-Conscious Dynamic Learned Index with High Read/Write Performance
    Zhou Zhang, Pei-Quan Jin, Xiao-Liang Wang, Yan-Qi Lv, Shou-Hong Wan, Xi-Ke Xie
    Journal of Data Acquisition and Processing, 2021, 36 (4): 721-740. 
    The recently proposed learned index offers higher query performance and space efficiency than the conventional B+-tree. However, the original learned index supports neither insertions nor bounded query complexity. Some variants of the learned index use an out-of-place strategy and a bottom-up build strategy to accelerate insertions and support bounded query complexity, but they introduce additional query costs and frequent node splitting operations. Moreover, none of the existing learned indices are cache-friendly. In this paper, aiming to support efficient queries and insertions while also offering bounded query complexity, we propose a new learned index called COLIN (Cache-cOnscious Learned INdex). Unlike previous solutions that use an out-of-place strategy, COLIN adopts an in-place approach to support insertions and reserves some empty slots in each node to optimize the node's data placement. In particular, through model-based data placement and a cache-conscious data layout, COLIN decouples the local-search boundary from the maximum error of the model. Experimental results on five workloads and three datasets show that COLIN achieves the best read/write performance among all compared indices, outperforming the second-best index by 18.4%, 6.2%, and 32.9% on the three datasets, respectively.
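    The model-based placement idea in this abstract (a learned model predicts a slot, and reserved empty slots let inserts land in place, so lookups only probe a small local window) can be illustrated with a toy node. Everything below, including the class name, the linear model, and the window size, is an assumption for illustration, not the authors' implementation:

    ```python
    # Toy sketch of model-based data placement in a gapped node: keys are stored
    # near the slot predicted by a simple linear model, so lookups only probe a
    # small window around the prediction.
    class GappedNode:
        def __init__(self, keys, slots_per_key=2, window=8):
            n = len(keys) * slots_per_key
            self.slots = [None] * n
            self.window = window
            # Fit a linear model key -> slot over the input keys.
            lo, hi = min(keys), max(keys)
            self.a = (n - 1) / (hi - lo) if hi > lo else 0.0
            self.b = -self.a * lo
            for k in sorted(keys):
                self._place(k)

        def _predict(self, key):
            return max(0, min(len(self.slots) - 1, int(self.a * key + self.b)))

        def _place(self, key):
            p = self._predict(key)
            # Probe outward from the predicted slot for an empty slot.
            for d in range(len(self.slots)):
                for q in (p - d, p + d):
                    if 0 <= q < len(self.slots) and self.slots[q] is None:
                        self.slots[q] = key
                        return True
            return False  # node full: a real index would split here

        def insert(self, key):
            return self._place(key)

        def lookup(self, key):
            # Bounded local search: only the window around the prediction is
            # scanned (assumes placement stayed within the window).
            p = self._predict(key)
            for d in range(self.window + 1):
                for q in (p - d, p + d):
                    if 0 <= q < len(self.slots) and self.slots[q] == key:
                        return q
            return None
    ```

    The reserved gaps are what make in-place insertion cheap here: a new key usually finds an empty slot within a few probes of its predicted position.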
    WATuning: A Workload-Aware Tuning System with Attention-Based Deep Reinforcement Learning
    Jia-Ke Ge, Yan-Feng Chai, Yun-Peng Chai
    Journal of Data Acquisition and Processing, 2021, 36 (4): 741-761. 
    Configuration tuning is essential for optimizing the performance of systems such as databases and key-value stores, where high performance usually means high throughput and low latency. At present, most tuning tasks are performed manually (e.g., by database administrators), but it is hard for humans to achieve high performance across diverse types of systems and environments. In recent years, there have been studies on tuning traditional database systems, but these methods all have limitations. In this article, we present WATuning, a tuning system based on attention-based deep reinforcement learning, which can adapt to changes in workload characteristics and optimize system performance efficiently and effectively. First, we design the core algorithm, ATT-Tune, for WATuning to accomplish the tuning task. The algorithm uses workload characteristics to generate a weight matrix that is applied to the system's internal metrics, and then uses the weighted metrics to select an appropriate configuration. Second, WATuning can generate multiple instance models according to workload changes, providing targeted recommendation services for different types of workloads. Finally, WATuning can dynamically fine-tune itself in response to the constantly changing workload in practical applications, so that its recommendations better fit the actual environment. Experimental results show that, compared with CDBTune, an existing state-of-the-art tuning method, WATuning improves throughput by 52.6% and reduces latency by 31%.
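    The weighting step this abstract describes (workload characteristics producing weights that are applied to internal metrics) can be sketched minimally. The softmax weighting and the `score` function below are illustrative assumptions, not the ATT-Tune algorithm itself:

    ```python
    import math

    # Toy sketch of workload-aware metric weighting: workload features yield
    # attention weights via softmax, which re-scale the system's internal
    # metrics before they are used to pick a configuration.
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    def weighted_metrics(workload_features, internal_metrics, score):
        # score(features, metric_index) -> relevance of that metric to the
        # current workload (a stand-in for the learned attention function).
        weights = softmax([score(workload_features, i)
                           for i in range(len(internal_metrics))])
        return [w * m for w, m in zip(weights, internal_metrics)]
    ```

    In the real system the `score` function would be learned jointly with the reinforcement-learning agent; here it is just a caller-supplied callable.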
    Cardinality Estimator: Processing SQL with a Vertical Scanning Convolutional Neural Network
    Shao-Jie Qiao, Guo-Ping Yang, Nan Han, Hao Chen, Fa-Liang Huang, Kun Yue, Yu-Gen Yi, Chang-An Yuan
    Journal of Data Acquisition and Processing, 2021, 36 (4): 762-777. 
    Although popular database systems perform well on query optimization, they still produce poor query execution plans when join operations across multiple tables are complex. Bad execution plans usually result from bad cardinality estimates. The cardinality estimation models in traditional databases cannot provide high-quality estimates, because they are not capable of effectively capturing the correlations between multiple tables. Recently, state-of-the-art learning-based cardinality estimators have been shown to work better than traditional empirical methods; basically, they use deep neural networks to model the relationships and correlations of tables. In this paper, we propose a vertical scanning convolutional neural network (abbreviated as VSCNN) to capture the relationships between words in the word vector in order to generate a feature map. The proposed learning-based cardinality estimator converts a Structured Query Language (SQL) query from a sentence to a word vector; we encode table names with one-hot encoding and the samples as bitmaps, separately, and then merge them to obtain sufficient semantic information from the data samples. In particular, the feature map obtained by VSCNN contains semantic information about the tables, joins, and predicates of SQL queries. Importantly, to improve the accuracy of cardinality estimation, we propose a negative sampling method for training the word vector by gradient descent from the base table and compress it into a bitmap. Extensive experiments show that the q-error of the proposed vertical scanning convolutional neural network based model is reduced by at least 14.6% compared with the estimators in traditional databases.
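    The input encoding this abstract describes (one-hot table names merged with per-sample bitmaps) might look roughly like the toy sketch below; the schema, function names, and predicate representation are assumptions for illustration, not the paper's code:

    ```python
    # Toy sketch of query featurization for a learned cardinality estimator:
    # tables referenced by the query become a one-hot vector, and a predicate
    # evaluated on sampled rows becomes a bitmap; the two are concatenated.
    TABLES = ["title", "cast_info", "movie_info"]  # assumed schema

    def one_hot_tables(query_tables):
        return [1 if t in query_tables else 0 for t in TABLES]

    def sample_bitmap(predicate, sample_rows):
        # One bit per sampled row: does the row satisfy the predicate?
        return [1 if predicate(row) else 0 for row in sample_rows]

    def encode_query(query_tables, predicate, sample_rows):
        return one_hot_tables(query_tables) + sample_bitmap(predicate, sample_rows)
    ```

    A real estimator would feed many such vectors (one per query) into the convolutional network; the sketch only shows the shape of one input.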
    TransGPerf: Exploiting Transfer Learning for Modeling Distributed Graph Computation Performance
    Songjie Niu, Shimin Chen
    Journal of Data Acquisition and Processing, 2021, 36 (4): 778-791. 
    It is challenging to model the performance of distributed graph computation. Explicit formulation cannot easily capture the diversified factors and complex interactions in the system. Statistical learning methods require a large number of training samples to generate an accurate prediction model. However, it is time-consuming to run the required graph computation tests to obtain the training samples. In this paper, we propose TransGPerf, a transfer learning based solution that can exploit prior knowledge from a source scenario and utilize a manageable amount of training data for modeling the performance of a target graph computation scenario. Experimental results show that our proposed method is capable of generating accurate models for a wide range of graph computation tasks on PowerGraph and GraphX. It outperforms transfer learning methods proposed for other applications in the literature.
    Efficient Model Store and Reuse in an OLML Database System
    Jian-Wei Cui, Wei Lu, Xin Zhao, Xiao-Yong Du
    Journal of Data Acquisition and Processing, 2021, 36 (4): 792-805. 
    Deep learning has shown significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet, these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in reality. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in new training tasks to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, through which a group of trained models with high reuse potential for the training task can be selected efficiently. Then, multiple selected models are trained iteratively to encourage model diversity, with which a better training effect can be achieved by ensemble. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse significantly improves the training effect compared with training models from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies show that the OLML database properly stores trained models and reuses them efficiently in new training tasks.
    Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation
    Zhi-Xin Qi, Hong-Zhi Wang, An-Jie Wang
    Journal of Data Acquisition and Processing, 2021, 36 (4): 806-821. 
    Data quality issues have attracted widespread attention due to the negative impact of dirty data on data mining and machine learning results. Understanding the relationship between data quality and the accuracy of results can guide the selection of an appropriate model under data quality considerations and the determination of how much data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that the impact of dirty data is related to the error type, the error rate, and the data size. Based on these findings, we suggest that users leverage our proposed metrics, sensibility and the data quality inflection point, for model selection and data cleaning.
    Mixed Hierarchical Networks for Deep Entity Matching
    Chen-Chen Sun, De-Rong Shen
    Journal of Data Acquisition and Processing, 2021, 36 (4): 822-838. 
    Entity matching is a fundamental problem in data integration: it groups records according to the underlying real-world entities. There is a growing trend of performing entity matching with deep learning techniques. We design mixed hierarchical deep neural networks (MHN) for entity matching, exploiting semantics from different abstraction levels in a record's internal hierarchy. A family of attention mechanisms is utilized at different stages of entity matching: self-attention focuses on internal dependencies, inter-attention targets alignments, and multi-perspective weight attention is devoted to importance discrimination. In particular, hybrid soft token alignment is proposed to address corrupted data, and attribute order is considered in deep entity matching for the first time. Then, to reduce the need for labeled training data, we propose an adversarial domain adaptation approach (DA-MHN) that transfers matching knowledge between entity matching tasks by maximizing classifier discrepancy. Finally, we conduct comprehensive experimental evaluations on 10 datasets (seven for MHN and three for DA-MHN), which demonstrate the superiority of our two proposed approaches. MHN clearly outperforms previous studies in accuracy, and each component of MHN is validated; DA-MHN greatly surpasses existing studies in transferability.
    Regular Paper
    Using Markov Chain Based Estimation of Distribution Algorithm for Model-Based Safety Analysis of Graph Transformation
    Einollah Pira
    Journal of Data Acquisition and Processing, 2021, 36 (4): 839-855. 
    The ability to assess the reliability of safety-critical systems is one of the most crucial requirements in the design of modern safety-critical systems, where even a minor failure can result in loss of life or irreparable damage to the environment. Model checking is an automatic technique that verifies or refutes system properties by exploring all reachable states (the state space) of a model. In large and complex systems, the state space explosion problem is likely to occur. When exploring the state space of systems modeled by graph transformations, the rule applied in the current state determines which rule can be applied in the next state. In other words, the rule allowed in the current state depends only on the rule applied in the previous state, not on those applied earlier. This fact motivates us to use a Markov chain (MC) to capture this type of dependency and to apply the Estimation of Distribution Algorithm (EDA) to improve the quality of the MC. EDA is an evolutionary algorithm that directs the search for the optimal solution by learning and sampling probabilistic models from the best individuals of a population at each generation. To show the effectiveness of the proposed approach, we implement it in GROOVE, an open-source toolset for designing and model checking graph transformation systems. Experimental results confirm that the proposed approach achieves higher speed and accuracy than existing meta-heuristic and evolutionary techniques for the safety analysis of systems specified formally through graph transformations.
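    The core loop this abstract describes (learning a first-order Markov chain over applied rules from the best individuals, then sampling new rule sequences from it) can be sketched as follows. This is an illustrative toy, not the GROOVE implementation:

    ```python
    import random
    from collections import defaultdict

    # Toy EDA sketch: estimate first-order transition probabilities between
    # rules from the best individuals, then sample new candidate sequences.
    def learn_chain(best_sequences, rules):
        counts = defaultdict(lambda: defaultdict(int))
        for seq in best_sequences:
            for prev, nxt in zip(seq, seq[1:]):
                counts[prev][nxt] += 1
        chain = {}
        for prev in rules:
            total = sum(counts[prev].values())
            # Fall back to uniform when a rule was never seen as predecessor.
            chain[prev] = {r: (counts[prev][r] / total if total else 1 / len(rules))
                           for r in rules}
        return chain

    def sample_sequence(chain, rules, length, rng):
        seq = [rng.choice(rules)]
        while len(seq) < length:
            probs = chain[seq[-1]]
            seq.append(rng.choices(rules, weights=[probs[r] for r in rules])[0])
        return seq
    ```

    In the full approach the sampled sequences would be evaluated against the safety property and the chain re-estimated each generation; the sketch shows only the learn-and-sample step.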
    An Empirical Comparison Between Tutorials and Crowd Documentation of Application Programming Interface
    Yi-Xuan Tang, Zhi-Lei Ren, He Jiang, Xiao-Chen Li, Wei-Qiang Kong
    Journal of Data Acquisition and Processing, 2021, 36 (4): 856-876. 
    API (application programming interface) documentation is critical for developers to learn APIs. However, it is unclear whether API documentation actually improves API learnability for developers. In this paper, we focus on two types of API documentation, i.e., official API tutorials and API crowd documentation. First, we analyze API coverage and check API consistency in API documentation based on API traceability. Then, we conduct a survey and extract several characteristics to analyze which type of API documentation helps developers learn APIs. Our findings show that: 1) API crowd documentation can, to some extent, be regarded as a supplement to official API tutorials; 2) the frequently-used APIs covered by the two types of API documentation differ greatly, which may prevent developers from deeply understanding API usage through only one type of documentation; 3) official API tutorials help developers seek API information on a long page, while API crowd documentation provides long code examples for particular programming tasks. These findings may help developers select suitable API documentation and find the useful information they need.
    Order-Revealing Encryption: File-Injection Attack and Forward Security
    Yuan Li, Xing-Chen Wang, Lin Huang, Yun-Lei Zhao
    Journal of Data Acquisition and Processing, 2021, 36 (4): 877-895. 
    Order-preserving encryption (OPE) and order-revealing encryption (ORE) are among the core ingredients of encrypted databases (EDBs). In this work, we study the leakage of OPE and ORE and their forward security. We propose generic yet powerful file-injection attacks (FIAs) on OPE/ORE, targeting scenarios with order-by and range queries. Our FIAs exploit only the ideal leakage of OPE/ORE (in particular, they require neither data denseness nor frequency information). We also improve their efficiency using frequency statistics and a hierarchical idea, so that high-frequency values are recovered more quickly. Experiments on real datasets show that our FIAs pose a severe threat to most existing OPEs and OREs, achieving high efficiency and a 100% recovery rate. We then formulate the forward security of ORE and propose a practical compilation framework for achieving forward-secure ORE that resists FIAs. The framework can transform most existing OPEs/OREs into forward-secure OREs while minimizing the extra computation and storage overhead. We also present its security proof and run experiments to analyze its performance; the proposed compilation is highly efficient and forward secure.
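    The flavor of a file-injection attack on an ideal order-preserving scheme can be conveyed with a toy binary search: the attacker injects chosen plaintexts and compares the resulting ciphertexts' order with a target ciphertext to pin down its plaintext. The toy cipher and all names below are illustrative assumptions, not the paper's attack:

    ```python
    # Toy "OPE": strictly increasing in the plaintext, so ciphertext order
    # leaks plaintext order -- exactly the leakage a FIA exploits.
    def toy_ope(key, x):
        return 3 * x + key

    def recover(target_ct, inject, lo, hi):
        # inject(x) returns the ciphertext of an attacker-chosen plaintext x;
        # binary search finds the smallest x whose ciphertext is >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            if inject(mid) < target_ct:
                lo = mid + 1
            else:
                hi = mid
        return lo
    ```

    Because order comparisons alone suffice, the attack needs only logarithmically many injected files per target value, which is why the abstract's 100% recovery rate on ideal-leakage schemes is unsurprising.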
    A Heuristic Sampling Method for Maintaining the Probability Distribution
    Jiao-Yun Yang, Jun-Da Wang, Yi-Fang Zhang, Wen-Juan Cheng, Lian Li
    Journal of Data Acquisition and Processing, 2021, 36 (4): 896-909. 
    Sampling is a fundamental method for generating data subsets. As many data analysis methods are developed based on probability distributions, maintaining these distributions when sampling helps ensure good data analysis performance. However, sampling a minimum subset while maintaining the probability distributions remains an open problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks and use the chi-square test to formulate a sampling problem that requires the sampled subset to pass the distribution test. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets with 60,000 samples show that when the significance level α is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1%, and 96.7% of the samples based on their Bayesian networks (ASIA, ALARM, HEPAR2, and ANDES, respectively). When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests with an average distribution difference of approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, with an average distribution difference of approximately 0.24.
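    The distribution test at the heart of this formulation can be sketched with a plain Pearson chi-square statistic over category counts; the function names and the caller-supplied critical value are assumptions for illustration, not the paper's algorithm:

    ```python
    from collections import Counter

    # Toy sketch of the distribution check: compare a sampled subset's category
    # counts against expectations derived from the full dataset.
    def chi_square_stat(population, subset):
        pop_counts = Counter(population)
        sub_counts = Counter(subset)
        n = len(subset)
        stat = 0.0
        for cat, c in pop_counts.items():
            expected = n * c / len(population)
            observed = sub_counts.get(cat, 0)
            stat += (observed - expected) ** 2 / expected
        return stat

    def passes_test(population, subset, critical_value):
        # critical_value comes from the chi-square table for the chosen
        # significance level and degrees of freedom.
        return chi_square_stat(population, subset) <= critical_value
    ```

    The paper's heuristic would use such a score (alongside a likelihood-based one) to decide which samples can be excluded while the subset still passes the test.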
    Method for Processing Graph Degeneracy in Dynamic Geometry Based on Domain Design
    Hao Guan, Yong-Sheng Rao, Jing-Zhong Zhang, Sheng Cao, Xiao-Lin Qin
    Journal of Data Acquisition and Processing, 2021, 36 (4): 910-921. 
    A dynamic geometry system, an important application of geometric constraint solving, is widely used in elementary mathematics education; it is also a fundamental environment for automated theorem proving in geometry. In the geometric constraint solving process, situations involving critical points are often encountered, at which a geometric element may degenerate. Such degeneracy situations usually deserve substantial attention during learning and exploration, yet many of them cannot be fully presented even by well-known dynamic geometry software. In this paper, the mechanisms causing the degeneracy of a geometric element are analyzed, and relevant definitions and formalized descriptions of the problem are provided according to modern Euclidean geometry theories. To solve the problem, the data structure is optimized, a domain model is designed for the geometric elements and their constraint relationships in the dynamic geometry system, and an update algorithm for the elements is proposed based on this domain model. Examples show that the proposed domain model and update algorithm can effectively cope with geometric element degeneracy in the geometric constraint solving process, thereby unifying dynamic geometry drawing with the user's geometric intuition.
    Discovering API Directives from API Specifications with Text Classification
    Jing-Xuan Zhang, Chuan-Qi Tao, Zhi-Qiu Huang, Xin Chen
    Journal of Data Acquisition and Processing, 2021, 36 (4): 922-943. 
    Application programming interface (API) libraries are extensively used by developers. To program correctly with APIs and avoid bugs, developers must pay attention to API directives, which state the constraints of APIs. Unfortunately, API directives usually have diverse morphologies, making it time-consuming and error-prone for developers to discover all the relevant ones. In this paper, we propose an approach that leverages text classification to discover API directives from API specifications. Specifically, given a set of training sentences from API specifications, our approach first characterizes each sentence by three groups of features. Then, to deal with the unequal distribution of API directives and non-directives, our approach employs an under-sampling strategy to split the imbalanced training set into several subsets and trains several classifiers. Given a new sentence in an API specification, our approach combines the trained classifiers to predict whether it is an API directive. We have evaluated our approach on a publicly available annotated API directive corpus. The experimental results show that our approach achieves an F-measure of up to 82.08% and statistically outperforms the state-of-the-art approach by up to 29.67% in terms of F-measure.
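    The under-sampling ensemble idea this abstract describes (splitting the majority class into chunks, training one classifier per balanced subset, and voting) can be sketched with a deliberately simple keyword classifier standing in for the paper's real features and models:

    ```python
    # Toy sketch of an under-sampling ensemble for imbalanced text
    # classification: each classifier sees all minority (directive) sentences
    # but only one chunk of the majority (non-directive) sentences.
    def train_keyword_classifier(pos, neg):
        # Toy model: flag sentences containing words seen only in positives.
        def words(s):
            return set(s.lower().split())
        pos_words = set().union(*map(words, pos)) - set().union(*map(words, neg))
        return lambda s: len(words(s) & pos_words) > 0

    def train_ensemble(directives, non_directives, n_splits):
        chunk = max(1, len(non_directives) // n_splits)
        return [train_keyword_classifier(directives, non_directives[i:i + chunk])
                for i in range(0, len(non_directives), chunk)]

    def predict(ensemble, sentence):
        votes = sum(clf(sentence) for clf in ensemble)
        return votes * 2 > len(ensemble)  # strict majority vote
    ```

    The paper's classifiers are trained on three richer feature groups; the sketch only shows how under-sampling plus voting turns an imbalanced set into several balanced learners.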
    Multi-Attribute Preferences Mining Method for Group Users with the Process of Noise Reduction
    Qing-Mei Tan, Xu-Na Wang
    Journal of Data Acquisition and Processing, 2021, 36 (4): 944-960. 
    Traditional research on user preference mining mainly explores users' overall preferences for a project but ignores that the fundamental motivation behind user preferences comes from their attitudes toward certain attributes of the project. In addition, traditional research seldom considers the typical preference combinations of group users, which may influence personalized services for them. To address this, a noise-reducing method for group user preference mining is proposed, which focuses on mining the multi-attribute preference tendencies of group users. First, both data availability and noise interference are considered in the algorithm design: when generating group user preferences, a new path is used to generate preference keywords so as to reduce noise interference. Second, the Gibbs sampling algorithm is used to estimate the parameters of the model. Finally, using user comment data from several online shopping websites as experimental objects, the method is applied to mine the multi-attribute preferences of different groups. The proposed method is compared with other methods in terms of predictive ability, preference mining ability, and preference topic similarity; experimental results show that it is significantly better than existing methods.

E-mail: info@sjcjycl.cn
  Copyright ©2015 JCST, All Rights Reserved