Loading...
Bimonthly    Since 1986
ISSN 1004-9037
/
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Content
      05 May 2014, Volume 29 Issue 3   
    For Selected: View Abstracts Toggle Thumbnails
    Data Management and Data Mining
    Online Induction of Probabilistic Real-Time Automata
    Jana Schmidt and Stefan Kramer
    Journal of Data Acquisition and Processing, 2014, 29 (3): 345-360. 
    Abstract   PDF(3333KB) ( 1439 )  
    The probabilistic real-time automaton (PRTA) is a representation of dynamic processes arising in the sciences and industry. Currently, the induction of automata is divided into two steps: the creation of the prefix tree acceptor (PTA) and the merge procedure based on clustering of the states. These two steps can be very time intensive when a PRTA is to be induced for massive or even unbounded datasets. The latter one can be effciently processed, as there exist scalable online clustering algorithms. However, the creation of the PTA still can be very time consuming. To overcome this problem, we propose a genuine online PRTA induction approach that incorporates new instances by first collapsing them and then using a maximum frequent pattern based clustering. The approach is tested against a predefined synthetic automaton and real world datasets, for which the approach is scalable and stable. Moreover, we present a broad evaluation on a real world disease group dataset that shows the applicability of such a model to the analysis of medical processes.
    Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network
    Rafael Geraldeli Rossi, Alneu de Andrade Lopes, Thiago de Paulo Faleiros, and Solange Oliveira Rezende
    Journal of Data Acquisition and Processing, 2014, 29 (3): 361-375. 
    Abstract   PDF(1440KB) ( 2746 )  
    Algorithms for numeric data classification have been applied for text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation such as sparsity and high dimensionality sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections, avoiding the high sparsity and allowing to model relationships among different objects that compose a text collection. Such networkbased representations can improve the quality of the classification results. One of the simplest ways to represent textual collections by a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Heterogeneous bipartite networks do not require computation of similarities or relations among the objects and can be used to model any type of text collection. Due to the advantages of representing text collections through bipartite heterogeneous networks, in this article we present a text classifier which builds a classification model using the structure of a bipartite heterogeneous network. Such an algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model assigning weights to objects that represent the terms for each class of the text collection. An empirical evaluation using a large amount of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than k-NN, C4.5, SVM, and Naive Bayes algorithms.
    Higher-Order Smoothing:A Novel Semantic Smoothing Method for Text Classification
    Mitat Poyraz, Zeynep Hilal Kilimci, and Murat Can Ganiz
    Journal of Data Acquisition and Processing, 2014, 29 (3): 376-391. 
    Abstract   PDF(500KB) ( 2326 )  
    It is known that latent semantic indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture "latent semantics". These finding have inspired a novel Bayesian framework for classification named Higher-Order Naive Bayes (HONB), which was introduced previously, that can explicitly make use of these higher-order relations. In this paper, we present a novel semantic smoothing method named Higher-Order Smoothing (HOS) for the Naive Bayes algorithm. HOS is built on a similar graph based data representation of the HONB which allows semantics in higher-order paths to be exploited. We take the concept one step further in HOS and exploit the relationships between instances of different classes. As a result, we move not only beyond instance boundaries, but also class boundaries to exploit the latent information in higher-order paths. This approach improves the parameter estimation when dealing with insufficient labeled data. Results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
    ConfDTree:Statistical Methods for Improving Decision Trees
    Gilad Katz, Asaf Shabtai, Lior Rokach, and Nir Ofek
    Journal of Data Acquisition and Processing, 2014, 29 (3): 392-407. 
    Abstract   PDF(1139KB) ( 2473 )  
    Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree)——a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) in order to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets in which an average improvement of 5%~9% in the AUC performance is reported.
    Harnessing the Power of GPUs to Speed Up Feature Selection for Outlier Detection
    Fatemeh Azmandian, Ayse Yilmazer, Jennifer G. Dy Javed A. Aslam, and David R. Kaeli
    Journal of Data Acquisition and Processing, 2014, 29 (3): 408-422. 
    Abstract   PDF(7259KB) ( 1478 )  
    Acquiring a set of features that emphasize the differences between normal data points and outliers can drastically facilitate the task of identifying outliers. In our work, we present a novel non-parametric evaluation criterion for filter-based feature selection which has an eye towards the final goal of outlier detection. The proposed method seeks the subset of features that represent the inherent characteristics of the normal dataset while forcing outliers to stand out, making them more easily distinguished by outlier detection algorithms. Experimental results on real datasets show the advantage of our feature selection algorithm compared with popular and state-of-the-art methods. We also show that the proposed algorithm is able to overcome the small sample space problem and perform well on highly imbalanced datasets. Furthermore, due to the highly parallelizable nature of the feature selection, we implement the algorithm on a graphics processing unit (GPU) to gain significant speedup over the serial version. The benefits of the GPU implementation are two-fold, as its performance scales very well in terms of the number of features, as well as the number of data points.
    Optimal Set Cover Formulation for Exclusive Row Biclustering of Gene Expression
    Amichai Painsky and Saharon Rosset
    Journal of Data Acquisition and Processing, 2014, 29 (3): 423-435. 
    Abstract   PDF(1843KB) ( 1248 )  
    The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem (also known as projected clustering) for gene expression, in which each row can only be a member of a single bicluster while columns can participate in multiple clusters. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters in the spirit of the optimal set cover problem. We present our algorithmic solution as a combination of existing biclustering algorithms and combinatorial auction techniques. Furthermore, we devise an approach for tuning the threshold of our algorithm based on comparison with a null model, inspired by the Gap statistic approach. We demonstrate our approach on both synthetic and real world gene expression data and show its power in identifying large span non-overlapping rows submatrices, while considering their unique nature.
    Continuous Outlier Monitoring on Uncertain Data Streams
    Ke-Yan Cao, Guo-Ren Wang, Dong-Hong Han, Guo-Hui Ding, Ai-Xia Wang, and Ling-Xu Shi
    Journal of Data Acquisition and Processing, 2014, 29 (3): 436-448. 
    Abstract   PDF(1063KB) ( 2199 )  
    Outlier detection on data streams is an important task in data mining. The challenges become even larger when considering uncertain data. This paper studies the problem of outlier detection on uncertain data streams. We propose Continuous Uncertain Outlier Detection (CUOD), which can quickly determine the nature of the uncertain elements by pruning to improve the effciency. Furthermore, we propose a pruning approach——Probability Pruning for Continuous Uncertain Outlier Detection (PCUOD) to reduce the detection cost. It is an estimated outlier probability method which can effectively reduce the amount of calculations. The cost of PCUOD incremental algorithm can satisfy the demand of uncertain data streams. Finally, a new method for parameter variable queries to CUOD is proposed, enabling the concurrent execution of different queries. To the best of our knowledge, this paper is the first work to perform outlier detection on uncertain data streams which can handle parameter variable queries simultaneously. Our methods are verified using both real data and synthetic data. The results show that they are able to reduce the required storage and running time.
    Computer Networks and Distributed Computing
    Efficient Data Access for Location-Dependent Spatial Queries
    Kwangjin Park
    Journal of Data Acquisition and Processing, 2014, 29 (3): 449-469. 
    Abstract   PDF(2712KB) ( 1549 )  
    When the mobile environment consists of light-weight devices, the loss of network connectivity and scarce resources, e.g., low battery power and limited memory, become primary issues of concern in order to effciently support portable wireless devices. In this paper, we propose an index-based peer-to-peer data access method that uses a new Hierarchical Location-Based Sequential (HLBS) index. We then propose a novel distributed Nearest First Broadcast (NFB) algorithm. Both HLBS and NFB are specifically designed for mobile peer-to-peer service in wireless broadcast environments. The system has a lower response time, because the client only contacts a qualified service provider by accessing the HLBS and quickly retrieves the data to answer the query by using NFB. HLBS and NFB design the index for spatial objects according to the positions of individual clients and transfer the index in the order arranged so that the spatial query can be processed even after the user tunes the partial index. Hence, this design can support rapid and energy-effcient service. A performance evaluation is conducted to compare the proposed algorithms with algorithms based on R-tree and Hilbert-curve air indexes. The results show that the proposed data dissemination algorithm with the HLBS index is scalable and energy effcient in both range queries and nearest neighbor queries.
    A Survey on Data Dissemination in Wireless Sensor Networks
    Xiao-Long Zheng and Meng Wan
    Journal of Data Acquisition and Processing, 2014, 29 (3): 470-486. 
    Abstract   PDF(2068KB) ( 2703 )  
    Wireless Sensor Networks (WSNs) have been applied in a variety of application areas. Most WSN systems, once deployed, are intended to operate unattended for a long period. During the lifetime, it is necessary to fix bugs, reconfigure system parameters, and upgrade the software in order to achieve reliable system performance. However, manually collecting all nodes back and reconfiguring through serial connections with computer is infeasible since it is labor-intensive and inconvenient due to the harsh deploying environments. Hence, data dissemination over multi-hop is desired to facilitate such tasks.
    This survey discusses the challenges and requirements of data dissemination in WSNs, reviews existing works, introduces some relevant techniques, presents the metrics of the performance and comparisons of the state-of-the-art works, and finally suggests the possible future directions in data dissemination studies. This survey elaborates and compares existing approaches by two categories: structure-less schemes and structure-based schemes, classified by whether or not the network structure information is used during the disseminating process. In existing literatures, different categories have definite boundary and limited analysis on the tradeoff between different categories. Besides, there is no survey has discussed the emerging techniques such as Constructive Interference (CI). But these emerging techniques have the chance to change the framework of data dissemination. In a word, even though many efforts have been made, data dissemination in WSNs still needs some works to embrace the new techniques and improve the efficiency and practicability further.
    Channel Aware Opportunistic Routing in Multi-radio Multi-channel Wireless Mesh Networks
    Shi-Ming He, Da-Fang Zhang, Kun Xie, Hong Qiao, and Ji Zhang
    Journal of Data Acquisition and Processing, 2014, 29 (3): 487-501. 
    Abstract   PDF(2445KB) ( 1851 )  
    Opportunistic routing (OR) involves multiple candidate forwarders to relay packets by taking advantage of the broadcast nature and multi-user diversity of the wireless medium. Compared with traditional routing (TR), OR is more suitable for the unreliable wireless link, and can evidently improve the end to end throughput. At present, there are many achievements concerning OR in the single radio wireless network. However, the study of OR in multi-radio wireless network stays the beginning stage. To demonstrate the benefit of OR in multi-radio multi-channel network, we propose a new route metric——multi-channel expected anypath transmission time (MEATT), which exploits the channel diversity and resource of multiple candidate forwarders for OR. Based on the new metric, a distributed algorithm named Channel Aware Opportunistic Routing (CAOR) is proposed. The simulation results demonstrate that MEATT improves 1.14 and 1.53 times of the average throughput than existing expected anypath transmission time (EATT)and metric of interference and channel switching cost (MIC) respectively. The average delay of MEATT is 17% and 40% lower than those of EATT, MIC, respectively.
    Artificial Intelligence and Pattern Recognition
    Exploring the Interactions of Storylines from Informative News Events
    Po Hu, Min-Lie Huang, and Xiao-Yan Zhu
    Journal of Data Acquisition and Processing, 2014, 29 (3): 502-518. 
    Abstract   PDF(5272KB) ( 1865 )  
    Today's news readers can be easily overwhelmed by the numerous news articles online. To cope with information overload, online news media publishes timelines for continuously developing news topics. However, the timeline summary does not show the relationship of storylines, and is not intuitive for readers to comprehend the development of a complex news topic. In this paper, we study a novel problem of exploring the interactions of storylines in a news topic. An interaction of two storylines is signified by informative news events that play a key role in both storylines. Storyline interactions can indicate key phases of a news topic, and reveal the latent connections among various aspects of the story. We address the coherence between news articles which is not considered in traditional similarity-based methods, and discover salient storyline interactions to form a clear, global picture of the news topic. User preference can be naturally integrated into our method to generate query-specific results. Comprehensive experiments on ten news topics show the effectiveness of our method over alternative approaches.
    Discovering High-Quality Threaded Discussions in Online Forums
    Jung-Tae Lee, Min-Chul Yang, and Hae-Chang Rim
    Journal of Data Acquisition and Processing, 2014, 29 (3): 519-531. 
    Abstract   PDF(1038KB) ( 1523 )  
    Archives of threaded discussions generated by users in online forums and discussion boards contain valuable knowledge on various topics. However, not all threads are useful because of deliberate abuses, such as trolling and flaming, that are commonly observed in online conversations. The existence of various users with different levels of expertise also makes it diffcult to assume that every discussion thread stored online contains high-quality contents. Although finding high-quality threads automatically can help both users and search engines sift through a huge amount of thread archives and make use of these potentially useful resources effectively, no previous work to our knowledge has performed a study on such task. In this paper, we propose an automatic method for distinguishing high-quality threads from low-quality ones in online discussion sites. We first suggest four different artificial measures for inducing overall quality of a thread based on ratings of its posts. We then propose two tasks involving prediction of thread quality without using post rating information. We adopt a popular machine learning framework to solve the two prediction tasks. Experimental results on a real world forum archive demonstrate that our method can significantly improve the prediction performance across all four measures of thread quality on both tasks. We also compare how different types of features derived from various aspects of threads contribute to the overall performance and investigate key features that play a crucial role in discovering high-quality threads in online discussion sites.
    Computer Architecture and Systems
    OpenMC:Towards Simplifying Programming for TianHe Supercomputers
    Xiang-Ke Liao, Can-Qun Yang, Tao Tang Hui-Zhan Yi, Feng Wang, Qiang Wu, and Jingling Xue
    Journal of Data Acquisition and Processing, 2014, 29 (3): 532-546. 
    Abstract   PDF(8219KB) ( 2450 )  
    Modern petascale and future exascale systems are massively heterogeneous architectures. Developing productive intra-node programming models is crucial toward addressing their programming challenge. We introduce a directive-based intra-node programming model, OpenMC, and show that this new model can achieve ease of programming, high performance, and the degree of portability desired for heterogeneous nodes, especially those in TianHe supercomputers. While existing models are geared towards offloading computations to accelerators (typically one), OpenMC aims to more uniformly and adequately exploit the potential offered by multiple CPUs and accelerators in a compute node. OpenMC achieves this by providing a unified abstraction of hardware resources as workers and facilitating the exploitation of asynchronous task parallelism on the workers. We present an overview of OpenMC, a prototyping implementation, and results from some initial comparisons with OpenMP and hand-written code in developing six applications on two types of nodes from TianHe supercomputers.
SCImago Journal & Country Rank
 

ISSN 1004-9037

         

Home
Editorial Board
Author Guidelines
Subscription
Journal of Data Acquisition and Processing
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China

E-mail: info@sjcjycl.cn
 
  Copyright ©2015 JCST, All Rights Reserved