Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Contents
      05 September 2015, Volume 30 Issue 5   
    Special Section on Software Systems
    Preface
    Tao Xie
    Journal of Data Acquisition and Processing, 2015, 30 (5): 933-934. 
    Abstract   PDF(74KB) ( 824 )  
    Roundtable: Research Opportunities and Challenges for Emerging Software Systems
    Xiangyu Zhang, Dongmei Zhang, Yves Le Traon, Qing Wang, Lu Zhang
    Journal of Data Acquisition and Processing, 2015, 30 (5): 935-941. 
    Abstract   PDF(244KB) ( 1144 )  
    For this special section on software systems, several research leaders in software systems, serving as guest editors of the section, discuss important issues that will shape the field's future directions. The essays included in this roundtable article cover research opportunities and challenges for emerging software systems such as data processing programs (Xiangyu Zhang) and online services (Dongmei Zhang), together with new directions in technologies such as unification in software testing (Yves Le Traon), data-driven and evidence-based software engineering (Qing Wang), and dynamic analysis of multiple traces (Lu Zhang). (Tao Xie, Leading Editor of the Special Section on Software Systems)
    Detecting Android Malware Using Clone Detection
    Jian Chen, Manar H. Alalfi, Thomas R. Dean, Ying Zou
    Journal of Data Acquisition and Processing, 2015, 30 (5): 942-956. 
    Abstract   PDF(3197KB) ( 2572 )  
    Android is currently one of the most popular smartphone operating systems. However, Android has the largest share of global mobile malware, and significant public attention has been drawn to its security issues. In this paper, we investigate the use of a clone detector to identify known Android malware. We collect a set of Android applications known to contain malware and a set of benign applications. We extract Java source code from the binary code of the applications and use NiCad, a near-miss clone detector, to find classes of clones in a small subset of the malicious applications. We then use these clone classes as a signature to find similar source files in the rest of the malicious applications. The benign collection is used as a control group. In our evaluation, we successfully decompile more than 1,000 malicious apps in 19 malware families. Our results show that using a small portion of the malicious applications as a training set can detect 95% of previously known malware with very few false positives and an accuracy of 96.88%. Our method can effectively and reliably pinpoint malicious applications that belong to certain malware families.
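As a rough illustration of the signature-matching step, the sketch below flags a decompiled source file when it is a near-miss clone of any file in a malware-family signature set. It uses Python's difflib as a stand-in for NiCad, and the 0.7 similarity threshold and the toy signature are assumptions, not values from the paper.

```python
# Illustrative sketch only: difflib stands in for NiCad; the threshold is an assumption.
from difflib import SequenceMatcher

def is_near_miss_clone(candidate_src: str, signature_srcs: list[str], threshold: float = 0.7) -> bool:
    """Return True if candidate_src is sufficiently similar to any known malicious source file."""
    for sig in signature_srcs:
        # ratio() is 0.0 for unrelated text and 1.0 for identical text.
        if SequenceMatcher(None, candidate_src, sig).ratio() >= threshold:
            return True
    return False

# Usage: build the signature set from a small training subset of a malware family,
# then scan the decompiled sources of the remaining apps.
signatures = ["class Dropper { void run() { install(); phoneHome(); } }"]  # toy signature source
print(is_near_miss_clone("class Dropper { void run() { install(); } }", signatures))
```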
    Balancing Frequencies and Fault Detection in the In-Parameter-Order Algorithm
    Shi-Wei Gao, Jiang-Hua Lv, Bing-Lei Du, Charles J. Colbourn, Shi-Long Ma
    Journal of Data Acquisition and Processing, 2015, 30 (5): 957-968. 
    Abstract   PDF(844KB) ( 1288 )  
    The In-Parameter-Order (IPO) algorithm is a widely used strategy for constructing software test suites for combinatorial testing (CT), whose goal is to reveal faults triggered by interactions among parameters. Variants of IPO have been shown to produce, within reasonable amounts of time, test suites that are often not much larger than the smallest test suites known. When an entire test suite is executed, all faults that arise from t-way interactions for some fixed t are surely found. However, when tests are executed one at a time, it is desirable to detect a fault as early as possible so that it can be repaired. The basic IPO strategies of horizontal and vertical growth address test suite size, but not the early detection of faults. In this paper, the growth strategies in IPO are modified to distribute the values of each parameter evenly across the tests. Together with a reordering strategy that we add, this modification to IPO improves the rate of fault detection dramatically (by 31% on average). Moreover, our modifications always reduce generation time (twice as fast on average) and in some cases also reduce test suite size.
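The essence of the modification can be sketched as a tie-breaking rule in horizontal growth: among the values that cover the most new t-way interactions, pick the one used least often so far for that parameter. A minimal sketch under stated assumptions follows; it is not the authors' full IPO implementation, and the coverage gains are assumed to be computed elsewhere.

```python
from collections import Counter

def pick_balanced_value(param_values, tests_so_far, param_index, coverage_gain):
    """Horizontal-growth choice for one parameter.

    coverage_gain[v] = number of new t-way interactions value v would cover
    (assumed precomputed). Among values with maximal gain, prefer the value that
    has appeared least often in the tests built so far, spreading values evenly."""
    freq = Counter(test[param_index] for test in tests_so_far)
    best_gain = max(coverage_gain[v] for v in param_values)
    candidates = [v for v in param_values if coverage_gain[v] == best_gain]
    return min(candidates, key=lambda v: freq[v])

# Toy usage: parameter 2 can take values 0/1/2; values 0 and 1 both cover two new
# interactions, but 1 has been used less often so far, so it is chosen.
tests = [(0, 1, 0), (1, 0, 0), (0, 0, 2)]
print(pick_balanced_value([0, 1, 2], tests, 2, {0: 2, 1: 2, 2: 1}))  # -> 1
```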
    A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction
    Duksan Ryu, Jong-In Jang, Jongmoon Baik
    Journal of Data Acquisition and Processing, 2015, 30 (5): 969-980. 
    Abstract   PDF(385KB) ( 1102 )  
    Software defect prediction (SDP) is an active research field in software engineering aimed at identifying defect-prone modules. Thanks to SDP, limited testing resources can be effectively allocated to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP), which uses external data to build classifiers. The major challenge of CPDP is the different distributions of training and test data. To tackle this, instances of the source data that are similar to the target data are selected to build classifiers. Software datasets also suffer from a class imbalance problem, meaning that the ratio of the defective class to the clean class is very low, which usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs hybrid classification, selectively learning local knowledge (via k-nearest neighbors) and global knowledge (via naive Bayes). Instances having strong local knowledge are identified via nearest neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which makes them impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
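A minimal sketch of the selective local/global rule, assuming scikit-learn: when a target instance's nearest neighbors unanimously agree (strong local knowledge), use their label; otherwise fall back to a global naive Bayes classifier. This illustrates the hybrid idea only, not the authors' HISNN implementation or its instance selection step.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import GaussianNB

def hybrid_predict(X_train, y_train, X_test, k=5):
    """Local k-NN label when neighbors unanimously agree, global naive Bayes otherwise."""
    X_train, y_train, X_test = map(np.asarray, (X_train, y_train, X_test))
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    nb = GaussianNB().fit(X_train, y_train)
    preds = []
    for x in X_test:
        _, idx = nn.kneighbors([x])
        neighbor_labels = y_train[idx[0]]
        if len(set(neighbor_labels)) == 1:   # strong local knowledge: neighbors agree
            preds.append(neighbor_labels[0])
        else:                                # locally ambiguous: use the global model
            preds.append(nb.predict([x])[0])
    return np.array(preds)
```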
    Multi-Factor Duplicate Question Detection in Stack Overflow
    Yun Zhang, David Lo, Xin Xia, Jian-Ling Sun
    Journal of Data Acquisition and Processing, 2015, 30 (5): 981-997. 
    Abstract   PDF(1642KB) ( 1541 )  
    Stack Overflow is a popular online question and answer site for software developers to share their experience and expertise. Among the numerous questions posted on Stack Overflow, two or more of them may express the same point and thus are duplicates of one another. Duplicate questions make Stack Overflow site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since there are thousands of questions submitted to Stack Overflow every day, manually identifying duplicate questions is laborious. Thus, there is a need for an automated approach that can help in detecting these duplicate questions. To address this need, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of this question by considering multiple factors. DupPredictor extracts the title and description of a question and also the tags that are attached to the question. These pieces of information (title, description, and a few tags) are mandatory information that a user needs to input when posting a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four similarity scores are finally combined into a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset which contains more than 2 million questions. The result shows that DupPredictor can achieve a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow; DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with approaches that only use title, description, topic, or tag similarity, and with Runeson et al.'s approach that has been used to detect duplicate bug reports; DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4%, respectively.
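The final ranking boils down to a weighted combination of the four per-pair similarities. A simplified sketch with placeholder weights follows; the paper estimates the composition weights on a sample rather than fixing them.

```python
def combined_similarity(title_sim, desc_sim, topic_sim, tag_sim,
                        weights=(0.4, 0.2, 0.2, 0.2)):
    """Fuse the four component similarities into one score for ranking candidate
    duplicates. The weights here are placeholders, not the paper's learned values."""
    a, b, c, d = weights
    return a * title_sim + b * desc_sim + c * topic_sim + d * tag_sim

def top_k_duplicates(candidate_scores, k=20):
    """candidate_scores: list of (question_id, (title_sim, desc_sim, topic_sim, tag_sim))."""
    ranked = sorted(candidate_scores, key=lambda p: combined_similarity(*p[1]), reverse=True)
    return [qid for qid, _ in ranked[:k]]

# Toy usage: question 42 ranks first because of its high title and tag similarity.
print(top_k_duplicates([(42, (0.9, 0.4, 0.5, 0.8)), (7, (0.2, 0.6, 0.3, 0.1))], k=1))
```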
    CoreDevRec: Automatic Core Member Recommendation for Contribution Evaluation
    Jing Jiang, Jia-Huan He, Xue-Yuan Chen
    Journal of Data Acquisition and Processing, 2015, 30 (5): 998-1016. 
    Abstract   PDF(1211KB) ( 1120 )  
    Pull-based software development helps developers make contributions flexibly and efficiently. Core members evaluate code changes submitted by contributors and decide whether to merge these code changes into repositories. Ideally, code changes are assigned to core members and evaluated within a short time after their submission. However, in reality, some popular projects receive many pull requests, and core members have difficulty choosing which pull requests to evaluate. Therefore, there is a growing need for automatic core member recommendation, which improves the evaluation process. In this paper, we investigate pull requests with manual assignment. Results show that 3.2%~40.6% of pull requests are manually assigned to specific core members. To assist with the manual assignment, we propose CoreDevRec to recommend core members for contribution evaluation in GitHub. CoreDevRec uses support vector machines to analyze different kinds of features, including file paths of modified code, relationships between contributors and core members, and activeness of core members. We evaluate CoreDevRec on 18,651 pull requests of five popular projects in GitHub. Results show that CoreDevRec achieves an accuracy from 72.9% to 93.5% for top-3 recommendation. Compared with a baseline approach, CoreDevRec improves the accuracy by 18.7% to 81.3% for top-3 recommendation. Moreover, CoreDevRec even has higher accuracy than manual assignment in the project TrinityCore. We believe that CoreDevRec can improve the assignment of pull requests.
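A hedged sketch of the classification core, assuming scikit-learn: train an SVM over historical pull requests (one feature vector each) labeled with the core member who handled them, then return the three most probable members for a new request. Feature extraction is omitted and the data and member names below are toy stand-ins, not values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: one feature vector per historical pull request (e.g., file-path
# overlap, contributor/core-member interaction history, reviewer activeness) and the
# core member who evaluated it. Real features would come from the repository history.
rng = np.random.default_rng(0)
X_train = rng.random((200, 6))
y_train = rng.choice(["alice", "bob", "carol", "dave"], size=200)

clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)

def recommend_top3(features):
    """Return the three core members with the highest predicted probability."""
    probs = clf.predict_proba([features])[0]
    return list(clf.classes_[np.argsort(probs)[::-1][:3]])

print(recommend_top3(rng.random(6)))
```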
    TagCombine: Recommending Tags to Contents in Software Information Sites
    Xin-Yu Wang, Xin Xia, David Lo
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1017-1035. 
    Abstract   PDF(1316KB) ( 1825 )  
    Nowadays, software engineers use a variety of online media to search for and learn about new and interesting technologies, and to learn from and help one another. We refer to these kinds of online media, which help software engineers improve their performance in software development, maintenance, and testing processes, as software information sites. In this paper, we propose TagCombine, an automatic tag recommendation method which analyzes objects in software information sites. TagCombine has three different components: 1) a multi-label ranking component which considers tag recommendation as a multi-label learning problem; 2) a similarity-based ranking component which recommends tags from similar objects; 3) a tag-term based ranking component which considers the relationship between different terms and tags, and recommends tags after analyzing the terms in the objects. We evaluate TagCombine on four software information sites: Ask Different, Ask Ubuntu, Freecode, and Stack Overflow. Averaging across the four sites, TagCombine achieves recall@5 and recall@10 scores of 0.6198 and 0.7625, respectively, improving on TagRec, proposed by Al-Kofahi et al., by 14.56% and 10.55%, and on the tag recommendation method proposed by Zangerle et al. by 12.08% and 8.16%, respectively.
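The combination step can be sketched as a weighted sum of the three component scores. The scorer functions and weights below are placeholders; the paper tunes the combination weights per site.

```python
def tag_combine_score(tag, obj, multilabel_score, sim_score, tagterm_score,
                      alpha=0.4, beta=0.3, gamma=0.3):
    """Final score of `tag` for object `obj`: weighted sum of the three component
    scores. alpha/beta/gamma are placeholders, not the paper's tuned weights."""
    return (alpha * multilabel_score(tag, obj)
            + beta * sim_score(tag, obj)
            + gamma * tagterm_score(tag, obj))

def recommend_tags(obj, candidate_tags, scorers, k=10):
    ranked = sorted(candidate_tags,
                    key=lambda t: tag_combine_score(t, obj, *scorers), reverse=True)
    return ranked[:k]

# Toy usage with trivial stand-in scorers.
scorers = (lambda t, o: 1.0 if t in o else 0.0,   # multi-label component stand-in
           lambda t, o: 0.5,                      # similarity component stand-in
           lambda t, o: len(t) / 10.0)            # tag-term component stand-in
print(recommend_tags("how to sort a list in python", ["python", "sorting", "java"], scorers, k=2))
```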
    Special Section on Social Media Processing
    Preface
    Jie Tang, Xiao-Yan Zhu
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1036-1038. 
    Abstract   PDF(109KB) ( 616 )  
    Social Trust Aware Item Recommendation for Implicit Feedback
    Lei Guo, Jun Ma, Hao-Ran Jiang, Zhu-Min Chen, Chang-Ming Xing
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1039-1053. 
    Abstract   PDF(1658KB) ( 2617 )  
    Social trust aware recommender systems have been well studied in recent years. However, most existing methods focus on recommendation scenarios where users can provide explicit feedback on items, whereas in most cases the feedback is not explicit but implicit. Moreover, most trust-aware methods assume the trust relationships among users are single-valued and homogeneous, whereas trust as a social concept is intrinsically multi-faceted and heterogeneous. Simply exploiting the raw values of trust relations cannot yield satisfactory results. Based on the above observations, we propose to learn a trust-aware personalized ranking method with multi-faceted trust relations for implicit feedback. Specifically, we first introduce the social trust assumption, namely that a user's taste is close to that of the neighbors he/she trusts, into the Bayesian Personalized Ranking model. To explore the impact of users' multi-faceted trust relations, we further propose a category-sensitive random walk method, CRWR, to infer the true trust value on each trust link. Finally, we arrive at our trust-strength-aware item recommendation method, SocialBPRCRWR, by replacing the raw binary trust matrix with the derived real-valued trust strength. Data analysis and experimental results on two real-world datasets demonstrate the existence of social trust influence and the effectiveness of our social ranking method SocialBPRCRWR in terms of AUC (area under the receiver operating characteristic curve).
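The modeling idea, in a nutshell: a standard BPR update for implicit feedback (user u prefers a consumed item i over an unobserved item j), plus a social term that pulls u's latent factors toward trusted neighbors in proportion to the inferred trust strength. The sketch below only illustrates that idea under assumed regularization weights; it is not the paper's SocialBPRCRWR, and the CRWR trust-inference step is not shown.

```python
import numpy as np

def social_bpr_step(U, V, u, i, j, trusted, trust_strength,
                    lr=0.05, reg=0.01, social_reg=0.1):
    """One SGD step: BPR pairwise loss for (u, i, j) plus a social pull of U[u]
    toward trusted neighbors, weighted by their (inferred) trust strength.
    U, V: user and item latent factor matrices; trusted[u]: list of trusted users;
    trust_strength[(u, v)]: real-valued strength (all assumed prepared elsewhere)."""
    u_f, i_f, j_f = U[u].copy(), V[i].copy(), V[j].copy()
    x_uij = u_f @ (i_f - j_f)
    sig = 1.0 / (1.0 + np.exp(x_uij))            # sigmoid(-x_uij)
    social_pull = sum(trust_strength.get((u, v), 0.0) * (u_f - U[v])
                      for v in trusted.get(u, []))
    U[u] -= lr * (-sig * (i_f - j_f) + reg * u_f + social_reg * social_pull)
    V[i] -= lr * (-sig * u_f + reg * i_f)
    V[j] -= lr * ( sig * u_f + reg * j_f)
```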
    Mining Intention-Related Products on Online Q&A Community
    Jun-Wen Duan, Yi-Heng Chen, Ting Liu, Xiao Ding
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1054-1062. 
    Abstract   PDF(605KB) ( 2924 )  
    User-generated content on social media has attracted much attention from service/product providers, as it contains plenty of potential commercial opportunities. However, previous work mainly focuses on identifying users' consumption intention (CI), and little effort has been spent on mining intention-related products. In this paper, focusing on the Baby & Child Care domain, we propose a novel approach to mine intention-related products in online question and answer (Q&A) communities. Using question-answer pairs as the data source, we first automatically extract candidate products with a dependency parser, and then identify the real intention-related products from the candidate set by means of a collocation extraction model. The experimental results on our carefully constructed evaluation dataset show that our approach achieves better performance than two natural baseline methods.
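One standard collocation measure that such a model can build on is pointwise mutual information between an intention expression and a candidate product. The sketch below shows PMI only, as a hedged stand-in for the paper's collocation extraction model; the counts are fabricated for illustration.

```python
import math

def pmi(pair_count: int, intent_count: int, product_count: int, total: int) -> float:
    """Pointwise mutual information between an intention expression and a candidate
    product, estimated from co-occurrence counts over question-answer pairs."""
    p_xy = pair_count / total
    p_x = intent_count / total
    p_y = product_count / total
    return math.log(p_xy / (p_x * p_y))

# Toy counts: "baby has a cold" co-occurs with "humidifier" far more than chance.
print(pmi(pair_count=30, intent_count=100, product_count=60, total=10_000))  # > 0
```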
    Tag Correspondence Model for User Tag Suggestion
    Cun-Chao Tu, Zhi-Yuan Liu, Mao-Song Sun
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1063-1072. 
    Abstract   PDF(321KB) ( 1288 )  
    Some microblog services encourage users to annotate themselves with multiple tags, indicating their attributes and interests. User tags play an important role in personalized recommendation and information retrieval. In order to better understand the semantics of user tags, we propose the Tag Correspondence Model (TCM) to identify complex correspondences of tags from the rich context of microblog users. A correspondence of a tag is a unique element in the context that is semantically correlated with the tag. In TCM, we divide the context of a microblog user into various sources (such as short messages, user profile, and neighbors). Given a collection of users with annotated tags, TCM can automatically learn the correspondences of user tags from multiple sources. With the learned correspondences, we are able to interpret the implicit semantics of tags. Moreover, for users who have not annotated any tags, TCM can suggest tags according to the users' context information. Extensive experiments on a real-world dataset demonstrate that our method can efficiently identify correspondences of tags, which may eventually represent the semantic meanings of tags.
    iBole: A Hybrid Multi-Layer Architecture for Doctor Recommendation in Medical Social Networks
    Ji-Bing Gong, Li-Li Wang, Sheng-Tao Sun, Si-Wei Peng
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1073-1081. 
    Abstract   PDF(323KB) ( 1118 )  
    In this paper, we systematically study how to perform doctor recommendation in medical social networks (MSNs). Specifically, employing a real-world medical dataset as the source in our work, we propose iBole, a novel hybrid multi-layer architecture, to solve this problem. First, we mine doctor-patient relationships/ties via a time-constrained probabilistic factor graph model (TPFG). Second, we extract network features for ranking nodes. Finally, we propose RWR-Model, a doctor recommendation model based on the random walk with restart method. Our real-world experiments validate the effectiveness of the proposed methods. Experimental results show that we obtain good accuracy in mining doctor-patient relationships from the network, and the doctor recommendation performance is better than that of the baseline algorithms: traditional Ranking SVM (RSVM) and the individual doctor recommendation model (IDR-Model). The results of our RWR-Model are more reasonable and satisfactory than those of the baseline approaches.
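A generic random walk with restart underlies the RWR-Model: relevance scores spread from a query node (e.g., a patient) over the mined doctor-patient network, and doctors are ranked by their converged scores. The iteration below is a textbook sketch with an assumed restart probability, not the paper's tuned model.

```python
import numpy as np

def random_walk_with_restart(A, seed, restart_prob=0.15, tol=1e-8, max_iter=1000):
    """A: adjacency matrix of the (doctor-patient) network, shape (n, n).
    seed: index of the query node. Returns one relevance score per node."""
    n = A.shape[0]
    col_sums = A.sum(axis=0)
    W = A / np.where(col_sums == 0, 1, col_sums)   # column-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0                 # restart distribution
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1 - restart_prob) * W @ r + restart_prob * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy usage on a 4-node network; rank candidate doctors by descending score.
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
print(random_walk_with_restart(A, seed=0))
```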
    Detecting Marionette Microblog Users for Improved Information Credibility
    Xian Wu, Wei Fan, Jing Gao, Zi-Ming Feng, Yong Yu
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1082-1096. 
    Abstract   PDF(604KB) ( 1105 )  
    In this paper, we propose to detect a special group of microblog users: the "marionette" users, who are created or employed by backstage "puppeteers", either through programs or manually. Unlike normal users that access microblogs for information sharing or social communication, marionette users perform specific tasks to earn financial profit. For example, they follow certain users to increase their "statistical popularity", or retweet some tweets to amplify their "statistical impact". The fabricated follower or retweet counts not only mislead normal users with wrong information, but also seriously impair microblog-based applications such as hot tweet selection and expert finding. In this paper, we study the important problem of detecting marionette users on microblog platforms. This problem is challenging because puppeteers employ complicated strategies to generate marionette users that exhibit behaviors similar to those of normal users. To tackle this challenge, we propose to take into account two types of discriminative information: 1) individual user tweeting behavior and 2) the social interactions among users. By integrating both kinds of information into a semi-supervised probabilistic model, we can effectively distinguish marionette users from normal ones. By applying the proposed model to one of the most popular microblog platforms in China (Sina Weibo), we find that the model can detect marionette users with an F-measure close to 0.9. In addition, we apply the proposed model to calculate the marionette ratio of the top 200 most followed microbloggers and the top 50 most retweeted posts in Sina Weibo. To accelerate detection and reduce feature generation cost, we further propose a light-weight model which utilizes fewer features to identify marionettes among retweeters.
    Anomaly Detection in Microblogging via Co-Clustering
    Wu Yang, Guo-Wei Shen, Wei Wang, Liang-Yi Gong, Miao Yu, Guo-Zhong Dong
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1097-1108. 
    Abstract   PDF(1988KB) ( 1457 )  
    Traditional anomaly detection on microblogging platforms mostly focuses on individual anomalous users or messages. Since anomalous users employ advanced, intelligent evasion techniques, such detection performs poorly. In this paper, we propose an innovative framework for anomaly detection based on a bipartite graph and co-clustering. A bipartite graph between users and messages is built to model the homogeneous and heterogeneous interactions. The proposed co-clustering algorithm, based on nonnegative matrix tri-factorization, can detect anomalous users and messages simultaneously. The homogeneous relations modeled by the bipartite graph are used as constraints to improve the accuracy of the co-clustering algorithm. Experimental results show that the proposed scheme can detect individual and group anomalies with high accuracy on a Sina Weibo dataset.
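The co-clustering step factorizes the user-message matrix into nonnegative factors, X ≈ F S G^T, where rows of F and G give soft cluster memberships for users and messages respectively. A minimal sketch of the unconstrained multiplicative updates follows; the paper additionally incorporates the homogeneous-relation constraints, which are omitted here.

```python
import numpy as np

def nmtf(X, k_users, k_msgs, n_iter=200, eps=1e-9):
    """Nonnegative matrix tri-factorization X ≈ F @ S @ G.T (no graph constraints).
    Rows of F / G give soft cluster memberships for users / messages."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    F = rng.random((m, k_users)); S = rng.random((k_users, k_msgs)); G = rng.random((n, k_msgs))
    for _ in range(n_iter):
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
    return F, S, G

# Anomalous users/messages can then be flagged as rows of F / G that fall into small
# or low-cohesion clusters (the paper's criterion is more elaborate than this).
```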
    Clustering Context-Dependent Opinion Target Words in Chinese Product Reviews
    Yu Zhang, Miao Liu, Hai-Xia Xia
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1109-1119. 
    Abstract   PDF(649KB) ( 942 )  
    In opinion mining of product reviews, an important task is to provide a summary of customers' opinions organized by different opinion targets. Due to various knowledge backgrounds or linguistic habits, customers use a variety of terms to describe the same opinion target. These terms are called context-dependent synonyms. In order to provide a comprehensive summary, the first step is to group these opinion target words. In this article, we focus on clustering context-dependent opinion target words in Chinese product reviews. We utilize three clustering methods based on distributional similarity and use four different co-occurrence matrices in our experiments. According to the experimental results on a large number of reviews, we find that our proposed heuristic k-means clustering method using the opinion-target-word co-occurrence matrix achieves the best clustering results with lower time complexity and less memory space. In addition, its accuracy is more stable when choosing different combinations of centroids. For some kinds of co-occurrence matrices, we also find that using small (low-dimensional) matrices achieves higher average clustering accuracy than using large (high-dimensional) matrices. Our findings provide a time-efficient and space-efficient way to cluster opinion targets with high accuracy.
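A minimal sketch of the clustering setup, assuming scikit-learn: each opinion target word is represented by its row in a co-occurrence matrix, and the rows are clustered with k-means. The toy matrix below is fabricated for illustration, and the paper's heuristic centroid selection is not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy co-occurrence matrix: rows/columns are opinion target words from reviews;
# entry (i, j) counts how often targets i and j co-occur. Synonymous targets
# (e.g., 屏幕/显示屏 "screen/display") share similar co-occurrence profiles.
targets = ["屏幕", "显示屏", "电池", "续航", "外观", "外形"]
cooc = np.array([
    [0, 9, 1, 0, 2, 1],
    [9, 0, 0, 1, 1, 2],
    [1, 0, 0, 8, 0, 1],
    [0, 1, 8, 0, 1, 0],
    [2, 1, 0, 1, 0, 7],
    [1, 2, 1, 0, 7, 0],
], dtype=float)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(cooc)
for cluster in range(3):
    print(cluster, [t for t, l in zip(targets, labels) if l == cluster])
```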
    Microblog Sentiment Analysis with Emoticon Space Model
    Fei Jiang, Yi-Qun Liu, Huan-Bo Luan, Jia-Shen Sun, Xuan Zhu, Min Zhang, Shao-Ping Ma
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1120-1129. 
    Abstract   PDF(734KB) ( 1600 )  
    Emoticons have been widely employed to express different types of moods, emotions, and feelings in microblog environments. They are therefore regarded as one of the most important signals for microblog sentiment analysis. Most existing studies use several emoticons that convey clear emotional meanings as noisy sentiment labels or similar sentiment indicators. However, in practical microblog environments, tens or even hundreds of emoticons are frequently adopted and all emoticons have their own unique emotional meanings. Besides, a considerable number of emoticons do not have clear emotional meanings. An improved sentiment analysis model should not overlook these phenomena. Instead of manually assigning sentiment labels to several emoticons that convey relatively clear meanings, we propose the emoticon space model (ESM) that leverages more emoticons to construct word representations from a massive amount of unlabeled data. By projecting words and microblog posts into an emoticon space, the proposed model helps identify subjectivity, polarity, and emotion in microblog environments. The experimental results for a public microblog benchmark corpus (NLP&CC 2013) indicate that ESM effectively leverages emoticon signals and outperforms previous state-of-the-art strategies and benchmark best runs.
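The projection idea can be sketched as follows: represent each word by its normalized co-occurrence with a fixed set of emoticons, and represent a post as the average of its words' vectors in that emoticon space; a standard classifier is then trained on the projected posts. This is a simplified illustration with toy data, not the paper's full ESM construction.

```python
import numpy as np
from collections import defaultdict

def build_emoticon_space(posts, emoticons):
    """posts: list of (tokens, post_emoticons). Returns word -> L1-normalized vector
    of co-occurrence counts with each tracked emoticon."""
    counts = defaultdict(lambda: np.zeros(len(emoticons)))
    index = {e: i for i, e in enumerate(emoticons)}
    for tokens, post_emos in posts:
        for e in post_emos:
            if e in index:
                for w in tokens:
                    counts[w][index[e]] += 1
    return {w: v / v.sum() for w, v in counts.items() if v.sum() > 0}

def project_post(tokens, word_vectors, dim):
    """Average of known word vectors = the post's position in emoticon space."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy usage; downstream, the projected post vectors feed a standard classifier
# for subjectivity, polarity, or emotion prediction.
posts = [(["great", "phone"], ["[smile]"]), (["terrible", "battery"], ["[cry]"])]
wv = build_emoticon_space(posts, ["[smile]", "[cry]"])
print(project_post(["great", "battery"], wv, dim=2))
```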
    Towards Better Understanding of App Functions
    Yong-Xin Tong, Jieying She, Lei Chen
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1130-1140. 
    Abstract   PDF(328KB) ( 1208 )  
    Apps are attracting more and more attention on both mobile and web platforms. Due to the self-organized nature of current app marketplaces, the descriptions of apps are not formally written and contain many noisy words and sentences. Thus, for most apps, their functions are not well documented and cannot easily be captured by app search engines. In this paper, we study the problem of inferring the real functions of an app by identifying the most informative words in its description. In order to utilize and integrate the diverse information of the app corpus in a proper way, we propose a probabilistic topic model to discover the latent structure of the app corpus. The outputs of the topic model are further used to identify the function of an app and its most informative words. We verify the effectiveness of the proposed methods through extensive experiments on two real app datasets crawled from Google Play and Windows Phone Store, respectively.
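A hedged sketch of the topic-modeling step with scikit-learn's LDA: fit latent topics over app descriptions, then take a description's dominant topic and its top words as a rough proxy for the most informative words. The descriptions below are toy stand-ins, and the paper's probabilistic model and selection criterion are more specific.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

descriptions = [
    "track your daily runs, calories and heart rate with gps",
    "edit photos with filters, stickers and collage layouts",
    "scan documents to pdf and sync them to the cloud",
]  # toy stand-ins for app store descriptions

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

def informative_words(doc_idx, top_n=3):
    """Top words of the document's dominant latent topic."""
    topic = lda.transform(X[doc_idx]).argmax()
    vocab = vec.get_feature_names_out()
    return [vocab[i] for i in lda.components_[topic].argsort()[::-1][:top_n]]

print(informative_words(0))
```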
    Discovering Family Groups in Passenger Social Networks
    Huai-Yu Wan, Zhi-Wei Wang, You-Fang Lin, Xu-Guang Jia, Yuan-Wei Zhou
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1141-1153. 
    Abstract   PDF(745KB) ( 1172 )  
    People usually travel together in groups for different purposes, such as family members visiting relatives, colleagues on business, or friends sightseeing. In particular, family groups, one of the most common consumer units, account for a considerable share of the passenger transportation market. Accurately identifying family groups can help carriers provide passengers with personalized travel services and precise product recommendations. This paper studies the problem of finding family groups in the field of civil aviation and proposes a family group detection method based on passenger social networks. First, we construct passenger social networks based on co-travel behaviors extracted from historical travel records; second, we use a collective classification algorithm to classify the social relationships between passengers as family or non-family relationships; finally, we employ a weighted community detection algorithm, which takes the relationship classification results as edge weights, to find family groups. Experimental results on a real dataset of passenger travel records in the field of civil aviation demonstrate that our method can effectively find family groups from historical travel records.
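The final step can be sketched with any weighted community detection algorithm: the relationship-classifier outputs become edge weights on the co-travel graph, and communities of strongly weighted edges are reported as family groups. The sketch below uses Louvain from networkx as a stand-in for the paper's algorithm, with fabricated passenger IDs and weights.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy passenger co-travel graph; edge weight = classifier's estimated probability
# that the two passengers are family members.
G = nx.Graph()
G.add_weighted_edges_from([
    ("p1", "p2", 0.95), ("p2", "p3", 0.90), ("p1", "p3", 0.85),  # likely one family
    ("p3", "p4", 0.10),                                          # weak, non-family tie
    ("p4", "p5", 0.92), ("p5", "p6", 0.88),                      # another likely family
])

communities = louvain_communities(G, weight="weight", seed=0)
family_groups = [c for c in communities if len(c) >= 2]
print(family_groups)
```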
    Regular Paper
    Constructing Edge-Colored Graph for Heterogeneous Networks
    Rui Hou, Ji-Gang Wu, Yawen Chen, Haibo Zhang, Xiu-Feng Sui
    Journal of Data Acquisition and Processing, 2015, 30 (5): 1154-1160. 
    Abstract   PDF(557KB) ( 1013 )  
    In order to build a fault-tolerant network, heterogeneous facilities are arranged in the network to prevent homogeneous faults from causing serious damage. This paper uses edge-colored graphs to investigate the features of a network topology that remains survivable after a set of homogeneous devices malfunctions. We propose an approach to designing such networks under arbitrary parameters. We also show that the proposed approach can be used to optimize inter-router connections in networks-on-chip to reduce additional energy consumption and time delay.
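The survivability property itself is easy to check directly: color each edge by its device type and verify that the graph stays connected when all edges of any single color are removed. A small illustration with networkx follows (a check of the property, not the paper's construction approach); the graph and colors are fabricated.

```python
import networkx as nx

def survives_homogeneous_faults(G: nx.Graph) -> bool:
    """True if removing all edges of any one color leaves G connected, i.e., no
    single homogeneous device-type failure disconnects the network."""
    colors = {d["color"] for _, _, d in G.edges(data=True)}
    for c in colors:
        H = nx.Graph((u, v) for u, v, d in G.edges(data=True) if d["color"] != c)
        H.add_nodes_from(G.nodes)
        if not nx.is_connected(H):
            return False
    return True

# Toy 4-node network using two device types, each of which spans the network on its own.
G = nx.Graph()
for u, v in [(0, 1), (1, 2), (2, 3)]:
    G.add_edge(u, v, color="red")    # one device type
for u, v in [(0, 2), (1, 3), (0, 3)]:
    G.add_edge(u, v, color="blue")   # a second device type
print(survives_homogeneous_faults(G))  # True: either color can fail without disconnection
```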