Journal of Data Acquisition and Processing

Table of Contents

05 March 2021, Volume 36 Issue 2

Special Section on AI and Big Data Analytics in Biology and Medicine

Preface

Yi Pan, De-Shuang Huang, Jian-Xin Wang, Fa Zhang

Journal of Data Acquisition and Processing, 2021, 36 (2): 231-233.

DeepHBSP: A Deep Learning Framework for Predicting Human Blood-Secretory Proteins Using Transfer Learning

Wei Du, Yu Sun, Hui-Min Bao, Liang Chen, Ying Li, Yan-Chun Liang

Journal of Data Acquisition and Processing, 2021, 36 (2): 234-247.

The identification of blood-secretory proteins and the detection of protein biomarkers in the blood have important clinical value. Existing methods for predicting blood-secretory proteins are mainly based on traditional machine learning algorithms and rely heavily on annotated protein features. Unlike traditional machine learning algorithms, deep learning algorithms can automatically learn feature representations from raw data and are therefore promising for predicting blood-secretory proteins. We present a novel deep learning model (DeepHBSP) combined with transfer learning, which integrates a binary classification network and a ranking network to identify blood-secretory proteins from amino acid sequence information alone. The loss function of DeepHBSP in the training step applies descriptive loss and compactness loss to the binary classification network and the ranking network, respectively. The feature extraction subnetwork of DeepHBSP is a multi-lane capsule network. Additionally, transfer learning is used to train a highly accurate generalized model from small samples of blood-secretory proteins. The main contributions of this study are as follows: 1) a novel deep learning architecture that integrates a binary classification network and a ranking network is proposed, which is superior to existing traditional machine learning algorithms and other state-of-the-art deep learning architectures for biological sequence analysis; 2) the proposed model for blood-secretory protein prediction uses only amino acid sequences, overcoming the heavy dependence of existing methods on annotated protein features; 3) the blood-secretory proteins predicted by our model are statistically significant compared with existing blood-based biomarkers of cancer.

Effective Identification and Annotation of Fungal Genomes

Jian Liu, Jia-Liang Sun, Yong-Zhuang Liu

Journal of Data Acquisition and Processing, 2021, 36 (2): 248-260.

In the past few decades, the dangers of mycosis have caused widespread concern. With the development of sequencing technology, the effective analysis of fungal sequencing data has become a research hotspot. As fungal sequencing data accumulate, approaches for the identification and functional annotation of fungal chromosomal genomes remain insufficient. To address this challenge, this paper first examines approaches for identifying and annotating fungal genomes based on short and long reads sequenced on multiple platforms such as Illumina and PacBio. It then develops an automated bioinformatics pipeline called PFGI for the identification and annotation task. An experimental evaluation on a real-world dataset from ENA (European Nucleotide Archive) shows that PFGI provides a user-friendly way to perform fungal identification and annotation from sequencing data, and that its results are accurate to the species level (97% sequence identity).

Synthetic Lethal Interactions Prediction Based on Multiple Similarity Measures Fusion

Lian-Lian Wu, Yu-Qi Wen, Xiao-Xi Yang, Bo-Wei Yan, Song He, Xiao-Chen Bo

Journal of Data Acquisition and Processing, 2021, 36 (2): 261-275.

A synthetic lethality (SL) relationship arises when a combination of deficiencies in two genes leads to cell death, whereas a deficiency in either one of the two genes does not. The survival of mutant tumor cells depends on the SL partners of the mutant gene, so inhibiting the SL partners of oncogenic genes can selectively kill cancer cells while sparing normal cells. Therefore, there is an urgent need to develop more efficient computational methods for identifying SL pairs for targeted cancer therapy. In this paper, we propose a new approach based on similarity fusion to predict SL pairs. Multiple types of gene similarity measures are integrated, and the k-nearest neighbors algorithm (k-NN) is applied to perform similarity-based classification of gene pairs. As a similarity-based method, our method demonstrated excellent performance in multiple experiments. Besides its effectiveness, the ease of use and extensibility of our method can make it more widely used in practice.
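
The fusion-plus-k-NN idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the averaging rule in fuse_similarities, the product-based pair_similarity, the majority vote, and the toy six-gene data are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
Pair = Tuple[int, int]


def fuse_similarities(matrices: Sequence[Matrix]) -> Matrix:
    """Average several gene-gene similarity matrices (an assumed fusion rule)."""
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
        for i in range(n)
    ]


def pair_similarity(sim: Matrix, p: Pair, q: Pair) -> float:
    """Similarity of two unordered gene pairs: the better of the two alignments."""
    (a, b), (c, d) = p, q
    return max(sim[a][c] * sim[b][d], sim[a][d] * sim[b][c])


def knn_predict(sim: Matrix, train: Dict[Pair, int], query: Pair, k: int = 3) -> int:
    """Majority vote among the k labelled pairs most similar to the query pair."""
    ranked = sorted(train, key=lambda p: pair_similarity(sim, p, query), reverse=True)
    votes = [train[p] for p in ranked[:k]]
    return int(2 * sum(votes) > len(votes))


def block_similarity(within: float, across: float) -> Matrix:
    """Toy 6-gene matrix in which genes {0,1}, {2,3} and {4,5} form similar groups."""
    return [
        [1.0 if i == j else within if i // 2 == j // 2 else across for j in range(6)]
        for i in range(6)
    ]


# Hypothetical data: two similarity measures and six labelled gene pairs
# (1 = synthetic lethal, 0 = not synthetic lethal).
fused = fuse_similarities([block_similarity(0.9, 0.1), block_similarity(0.7, 0.3)])
train = {(0, 2): 1, (1, 2): 1, (0, 3): 1, (0, 4): 0, (1, 4): 0, (0, 5): 0}

print(knn_predict(fused, train, (1, 3)))
print(knn_predict(fused, train, (1, 5)))
```

In this toy setting the query pair (1, 3) resembles the labelled SL pairs, while (1, 5) resembles the non-SL pairs, so the two calls return different labels.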

Logistic Weighted Profile-Based Bi-Random Walk for Exploring MiRNA-Disease Associations

Ling-Yun Dai, Jin-Xing Liu, Rong Zhu, Juan Wang, Sha-Sha Yuan

Journal of Data Acquisition and Processing, 2021, 36 (2): 276-287.

MicroRNAs (miRNAs) exert an enormous influence on cell differentiation, biological development and the onset of diseases. Because predicting potential miRNA-disease associations (MDAs) by biological experiments usually requires considerable time and money, a growing number of researchers are working on developing computational methods to predict MDAs. High accuracy is critical for prediction. To date, many algorithms have been proposed to infer novel MDAs. However, they may still have some drawbacks. In this paper, a logistic weighted profile-based bi-random walk method (LWBRW) is designed to infer potential MDAs based on known MDAs. In this method, three networks (i.e., a miRNA functional similarity network, a disease semantic similarity network and a known MDA network) are constructed first. In the process of building the miRNA network and the disease network, Gaussian interaction profile (GIP) kernel is computed to increase the kernel similarities, and the logistic function is used to extract valuable information and protect known MDAs. Next, the known MDA matrix is preprocessed by the weighted K-nearest known neighbours (WKNKN) method to reduce the number of false negatives. Then, the LWBRW method is applied to infer novel MDAs by bi-randomly walking on the miRNA network and the disease network. Finally, the predictive ability of the LWBRW method is confirmed by the average AUC of 0.939 3 (0.006 1) in 5-fold cross-validation (CV) and the AUC value of 0.976 3 in leave-one-out cross-validation (LOOCV). In addition, case studies also show the outstanding ability of the LWBRW method to explore potential MDAs.
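
The bi-random walk step can be sketched as follows. This is a generic illustration and not the LWBRW method itself: the row normalization, the restart weight alpha, the step counts, and the toy matrices are assumptions, and the GIP kernel, logistic weighting, and WKNKN preprocessing described in the abstract are omitted.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(m: Matrix) -> Matrix:
    """Scale each row to sum to 1 (an assumed normalization)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def blend(walk: Matrix, prior: Matrix, alpha: float) -> Matrix:
    """Mix one walk step with the known associations (the restart term)."""
    return [
        [alpha * w + (1 - alpha) * p for w, p in zip(walk_row, prior_row)]
        for walk_row, prior_row in zip(walk, prior)
    ]


def bi_random_walk(mirna_sim: Matrix, disease_sim: Matrix, known: Matrix,
                   alpha: float = 0.5, left_steps: int = 2,
                   right_steps: int = 2) -> Matrix:
    """Walk on the miRNA network (left) and the disease network (right), then average.

    Assumes `known` contains at least one non-zero entry.
    """
    m_norm = row_normalize(mirna_sim)
    d_norm = row_normalize(disease_sim)
    total = sum(map(sum, known))
    prior = [[v / total for v in row] for row in known]
    scores = prior
    for step in range(1, max(left_steps, right_steps) + 1):
        walks = []
        if step <= left_steps:
            walks.append(blend(matmul(m_norm, scores), prior, alpha))
        if step <= right_steps:
            walks.append(blend(matmul(scores, d_norm), prior, alpha))
        # Element-wise mean of whichever walks were taken at this step.
        scores = [[sum(vals) / len(walks) for vals in zip(*rows)] for rows in zip(*walks)]
    return scores


# Hypothetical data: miRNA 1 has no known disease but is very similar to miRNA 0.
mirna_sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
disease_sim = [[1.0, 0.2], [0.2, 1.0]]
known = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
scores = bi_random_walk(mirna_sim, disease_sim, known)
```

Here miRNA 1 inherits a higher score for disease 0 than for disease 1, because its closest neighbour in the miRNA network is associated with disease 0.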

Predicting CircRNA-Disease Associations Based on Improved Weighted Biased Meta-Structure

Xiu-Juan Lei, Chen Bian, Yi Pan

Journal of Data Acquisition and Processing, 2021, 36 (2): 288-298.

Circular RNAs (circRNAs) are RNAs with a special closed-loop structure, which play important roles in tumors and other diseases. Because biological experiments are time-consuming, computational methods for predicting associations between circRNAs and diseases have become an attractive alternative. Taking the limited number of verified circRNA-disease associations into account, we propose a method named CDWBMS, which integrates a small number of verified circRNA-disease associations with abundant circRNA information to discover novel circRNA-disease associations. CDWBMS adopts an improved weighted biased meta-structure search algorithm on a heterogeneous network to predict associations between circRNAs and diseases. In terms of leave-one-out cross-validation (LOOCV), 10-fold cross-validation and 5-fold cross-validation, CDWBMS yields area under the receiver operating characteristic curve (AUC) values of 0.921 6, 0.917 2 and 0.900 5, respectively. Furthermore, case studies show that CDWBMS can predict unknown circRNA-disease associations. In conclusion, CDWBMS is an effective method for exploring disease-related circRNAs.
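
The AUC values quoted above can be read as the probability that a true association is scored higher than a non-association. A minimal sketch of that rank-based definition follows; the function name and the toy scores are illustrative and do not come from the paper.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out scores: two true associations and two non-associations.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; larger evaluations usually sort the scores once instead.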

GAEBic: A Novel Biclustering Analysis Method for miRNA-Targeted Gene Data Based on Graph Autoencoder

Li Wang, Hao Zhang, Hao-Wu Chang, Qing-Ming Qin, Bo-Rui Zhang, Xue-Qing Li, Tian-Heng Zhao, Tian-Yue Zhang

Journal of Data Acquisition and Processing, 2021, 36 (2): 299-309.

Unlike traditional clustering analysis, a biclustering algorithm works simultaneously on the two dimensions of samples (rows) and variables (columns). In recent years, biclustering methods have developed rapidly and have been widely applied in biological data analysis, text clustering, recommendation systems and other fields. Traditional clustering algorithms are not well suited to processing high-dimensional and/or large-scale data. At present, most biclustering algorithms are designed for differentially expressed big biological data. However, there is little discussion of clustering binary data such as miRNA-targeted gene data. Here, we propose a novel biclustering method for miRNA-targeted gene data based on a graph autoencoder, named GAEBic. GAEBic applies a graph autoencoder to capture the similarity of sample sets or variable sets, and adopts a new irregular clustering strategy to mine biclusters with excellent generalization. Based on the miRNA-targeted gene data of soybean, we benchmark several types of biclustering algorithms and find that GAEBic performs better than Bimax, Bibit and the Spectral Biclustering algorithm in terms of target gene enrichment. This biclustering method achieves comparable performance on the high-throughput miRNA data of soybean, and it can also be used for other species.

Collaborative Matrix Factorization with Soft Regularization for Drug-Target Interaction Prediction

Li-Gang Gao, Meng-Yun Yang, Jian-Xin Wang

Journal of Data Acquisition and Processing, 2021, 36 (2): 310-322.

Identifying potential drug-target interactions (DTIs) is critical in drug discovery. Drug-target interaction prediction methods based on collaborative filtering have demonstrated attractive prediction performance. However, many such models cannot accurately express the relationship between similarity features and DTI features. To represent this correlation rationally, we propose a novel matrix factorization method called collaborative matrix factorization with soft regularization (SRCMF). SRCMF improves the prediction performance by combining drug and target similarity information with matrix factorization. In contrast to general collaborative matrix factorization, the fundamental idea of SRCMF is to make the similarity features and the potential features of DTI approximate, not identical. Specifically, SRCMF obtains low-rank feature representations of drug similarity and target similarity, and then uses a soft regularization term to constrain the approximation between drug (target) similarity features and drug (target) potential features of DTI. To comprehensively evaluate the prediction performance of SRCMF, we conduct cross-validation experiments under three different settings. In terms of the area under the precision-recall curve (AUPR), SRCMF achieves better prediction results than six state-of-the-art methods. Moreover, under different noise levels in the similarity data, the prediction performance of SRCMF is much better than that of collaborative matrix factorization. In conclusion, SRCMF is robust and improves performance in drug-target interaction prediction.

Seg-CapNet: A Capsule-Based Neural Network for the Segmentation of Left Ventricle from Cardiac Magnetic Resonance Imaging

Yang-Jie Cao, Shuang Wu, Chang Liu, Nan Lin, Yuan Wang, Cong Yang, Jie Li

Journal of Data Acquisition and Processing, 2021, 36 (2): 323-333.

Deep neural networks (DNNs) have been extensively studied in medical image segmentation. However, existing DNNs often need to train shape models for each object to be segmented, which may yield results that violate cardiac anatomical structure when segmenting cardiac magnetic resonance imaging (MRI). In this paper, we propose a capsule-based neural network, named Seg-CapNet, to model multiple regions simultaneously within a single training process. The Seg-CapNet model consists of an encoder and a decoder. The encoder transforms the input image into feature vectors that represent the objects to be segmented by means of convolutional layers, capsule layers, and fully-connected layers. The decoder transforms the feature vectors into segmentation masks by up-sampling. Feature maps of each down-sampling layer in the encoder are connected to the corresponding up-sampling layers, which is conducive to the backpropagation of the model. The output vectors of Seg-CapNet contain low-level image features such as grayscale and texture, as well as semantic features including the position and size of the objects, which is beneficial for improving the segmentation accuracy. The proposed model is validated on the open dataset of the Automated Cardiac Diagnosis Challenge 2017 (ACDC 2017) and the Sunnybrook Cardiac Magnetic Resonance Imaging (MRI) segmentation challenge. Experimental results show that the mean Dice coefficient of Seg-CapNet is increased by 4.7% and the average Hausdorff distance is reduced by 22%. The proposed model also reduces the number of model parameters and improves the training speed while obtaining accurate segmentation of multiple regions.
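
For readers unfamiliar with the two metrics reported above, here is a minimal sketch of the Dice coefficient and the Hausdorff distance. It assumes masks are given as sets of (row, column) pixel coordinates and computes the distance over whole masks rather than extracted contours; the paper's exact evaluation protocol is not stated in the abstract.

```python
import math
from typing import Set, Tuple

Pixel = Tuple[int, int]


def dice(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|); 1.0 means perfect overlap."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour gap."""
    if not pred or not truth:
        raise ValueError("both masks must be non-empty")

    def directed(src: Set[Pixel], dst: Set[Pixel]) -> float:
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(pred, truth), directed(truth, pred))


# Hypothetical 2x2 masks that overlap in one column.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice(pred, truth), hausdorff(pred, truth))
```

A higher Dice coefficient and a lower Hausdorff distance both indicate better agreement with the reference segmentation, which is why the abstract reports an increase in the former and a reduction in the latter.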

Robust Needle Localization and Enhancement Algorithm for Ultrasound by Deep Learning and Beam Steering Methods

Jun Gao, Paul Liu, Guang-Di Liu, Le Zhang

Journal of Data Acquisition and Processing, 2021, 36 (2): 334-346.

Ultrasound (US) imaging is clinically used to guide needle insertions because it is safe, real-time, and low cost. The localization of the needle in the ultrasound image, however, remains a challenging problem due to specular reflection off the smooth surface of the needle, speckle noise, and similar line-like anatomical features. This study presents a novel robust needle localization and enhancement algorithm based on deep learning and beam steering methods with three key innovations. First, we employ beam steering to maximize the reflection intensity of the needle, which helps to detect and locate the needle precisely. Second, we modify the U-Net, an end-to-end network commonly used in biomedical segmentation, by using two branches instead of one in the last up-sampling layer and adding three layers after the last down-sampling layer. The modified U-Net can thus, in real time and in a single pass, segment the needle shaft region, detect the needle tip landmark location, and determine whether an image frame contains the needle. Third, we develop a needle fusion framework that employs the outputs of the multi-task deep learning (MTL) framework to precisely locate the needle tip and enhance needle shaft visualization. The proposed algorithm can therefore not only greatly reduce the processing time, but also significantly increase the needle localization accuracy and enhance the needle visualization for real-time clinical intervention applications.

CytoBrain: Cervical Cancer Screening System Based on Deep Learning Technology

Hua Chen, Juan Liu, Qing-Man Wen, Zhi-Qun Zuo, Jia-Sheng Liu, Jing Feng, Bao-Chuan Pang, Di Xiao

Journal of Data Acquisition and Processing, 2021, 36 (2): 347-360.

Identification of abnormal cervical cells is a significant problem in computer-aided diagnosis of cervical cancer. In this study, we develop an artificial intelligence (AI) system, named CytoBrain, to automatically screen abnormal cervical cells and thereby facilitate the subsequent clinical diagnosis of the subjects. The system consists of three main modules: 1) the cervical cell segmentation module, which is responsible for efficiently extracting cell images from a whole slide image (WSI); 2) the cell classification module, based on a compact visual geometry group (VGG) network called CompactVGG, which is the key part of the system and is used for building the cell classifier; 3) the visualized human-aided diagnosis module, which can automatically diagnose a WSI based on the classification results of the cells in it, and provides two visual display modes for users to review and modify. For model construction and validation, we have developed a dataset containing 198 952 cervical cell images (60 238 positive, 25 001 negative, and 113 713 junk) from samples of 2 312 adult women. Since CompactVGG is the key part of CytoBrain, we conduct comparison experiments to evaluate its time and classification performance on our developed dataset and on two public datasets separately. The comparison with VGG11, the most efficient member of the VGG family, shows that CompactVGG takes less time for both model training and sample testing. Compared with three sophisticated deep learning models, CompactVGG consistently achieves the best classification performance. The results illustrate that the system based on CompactVGG is efficient and effective and can support large-scale cervical cancer screening.

An Efficient WRF Framework for Discovering Risk Genes and Abnormal Brain Regions in Parkinson's Disease Based on Imaging Genetics Data

Xia-An Bi, Zhao-Xu Xing, Rui-Hui Xu, Xi Hu

Journal of Data Acquisition and Processing, 2021, 36 (2): 361-374.

As an emerging research field of brain science, multimodal data fusion analysis has attracted broad attention in the study of complex brain diseases such as Parkinson's disease (PD). However, current studies focus primarily on detecting associations among different data modalities and on reducing data attributes. The data mining method applied after fusion and the overall analysis framework are neglected. In this study, we propose a weighted random forest (WRF) model as the feature screening classifier. The interactions between genes and brain regions are detected by a correlation analysis method and used as the input multimodal fusion features. We implement sample classification and optimal feature selection based on WRF, and construct a multimodal analysis framework for exploring the pathogenic factors of PD. The experimental results on the Parkinson's Progression Markers Initiative (PPMI) database show that WRF performs better than some advanced methods, and the brain regions and genes related to PD are detected. The fusion of multimodal data can improve the classification of PD patients and detect the pathogenic factors more comprehensively, which provides a novel perspective for the diagnosis and research of PD. We also show the great potential of WRF for the multimodal data fusion analysis of other brain diseases.

Regular Paper

Serendipity in Recommender Systems: A Systematic Literature Review

Reza Jafari Ziarani, Reza Ravanmehr

Journal of Data Acquisition and Processing, 2021, 36 (2): 375-396.

A recommender system is employed to accurately recommend items that are expected to attract the user's attention. Over-emphasis on the accuracy of recommendations can cause information over-specialization and make recommendations boring and even predictable. Novelty and diversity are two partly useful solutions to these problems. However, novel and diverse recommendations alone cannot ensure that users are attracted, since such recommendations may not be relevant to the user's interests. Hence, it is necessary to consider other criteria, such as unexpectedness and relevance. Serendipity is a criterion for making appealing and useful recommendations. The usefulness of serendipitous recommendations is the main advantage of this criterion over novelty and diversity. Many studies of recommender systems have focused on serendipity in recent years. Thus, this paper conducts a systematic literature review of previous studies of serendipity-oriented recommender systems. Accordingly, this paper focuses on the contextual convergence of serendipity definitions, datasets, serendipitous recommendation methods, and their evaluation techniques. Finally, the trends and existing potentials of serendipity-oriented recommender systems are discussed for future studies. The results of the systematic literature review show that both the quality and the quantity of articles on serendipity-oriented recommender systems are increasing.

SymPas: Symbolic Program Slicing

Ying-Zhou Zhang

Journal of Data Acquisition and Processing, 2021, 36 (2): 397-418.

Program slicing is a technique for simplifying programs by focusing on selected aspects of their behavior. Current mainstream static slicing methods operate on dependence graphs, namely the PDG (program dependence graph) or the SDG (system dependence graph), but these friendly graph representations may be a bit expensive for some users. In this paper we study a light-weight approach to static program slicing, called Symbolic Program Slicing (SymPas), which works as a dataflow analysis on LLVM (low-level virtual machine). In our SymPas approach, slices are stored in symbolic forms, not in procedures being re-analyzed (cf. procedure summaries). Instead of re-analyzing a procedure multiple times to find its slices for each calling context, we calculate a single symbolic slice which can be instantiated at call sites, avoiding re-analysis. SymPas is implemented with LLVM to perform slicing on LLVM intermediate representation (IR). For comparison, we systematically adapt IFDS (interprocedural finite distributive subset) analysis and the SDG-based slicing method (SDG-IFDS) to statically slice IR programs. Evaluated on open-source and benchmark programs, our backward SymPas shows a factor-of-6 reduction in time cost and a factor-of-4 reduction in space cost compared with backward SDG-IFDS, and is thus more efficient. In addition, the results show that, after studying slices from 66 programs ranging up to 336 800 IR instructions in size, SymPas is highly size-scalable.
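
To make the notion of a backward slice concrete, here is a minimal sketch for straight-line code only. It is not SymPas and does not handle branches, loops, procedure calls, or LLVM IR; the Stmt record and the toy program are hypothetical and exist only to show which statements a slicing criterion depends on.

```python
from typing import List, NamedTuple, Set


class Stmt(NamedTuple):
    text: str        # source text, kept only for display
    defs: Set[str]   # variables the statement writes
    uses: Set[str]   # variables the statement reads


def backward_slice(program: List[Stmt], criterion: Set[str]) -> List[int]:
    """Indices of statements that can affect `criterion` at the end of the program."""
    relevant = set(criterion)
    kept = []
    for index in range(len(program) - 1, -1, -1):
        stmt = program[index]
        if stmt.defs & relevant:
            kept.append(index)
            # The definition satisfies the demand; its operands become relevant.
            relevant -= stmt.defs
            relevant |= stmt.uses
    return sorted(kept)


# Hypothetical program: statements 1 and 3 never influence `e`.
program = [
    Stmt("a = 1", {"a"}, set()),
    Stmt("b = 2", {"b"}, set()),
    Stmt("c = a + 1", {"c"}, {"a"}),
    Stmt("d = b * 2", {"d"}, {"b"}),
    Stmt("e = c + a", {"e"}, {"c", "a"}),
]
for index in backward_slice(program, {"e"}):
    print(program[index].text)
```

An interprocedural slicer must additionally decide what to do at call sites, which is where the abstract's idea of computing one symbolic slice per procedure and instantiating it per call applies.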

A Secure IoT Firmware Update Scheme Against SCPA and DoS Attacks

Yan-Hong Fan, Mei-Qin Wang, Yan-Bin Li, Kai Hu, Mu-Zhou Li

Journal of Data Acquisition and Processing, 2021, 36 (2): 419-433.

At IEEE S&P 2017, Ronen et al. exploited side-channel power analysis (SCPA) and approximately 5 000 power traces to recover the global AES-CCM key that Philips Hue lamps use to decrypt and authenticate new firmware. Based on the recovered key, an attacker could create a malicious firmware update and load it onto Philips Hue lamps to cause Internet of Things (IoT) security issues. Inspired by the work of Ronen et al., we propose an AES-CCM-based firmware update scheme against SCPA and denial of service (DoS) attacks. The proposed scheme, applied in IoT terminal devices, includes two aspects of design (i.e., the bootloader and the application layer). Firstly, in the bootloader, the number of updates per unit time is limited to prevent the attacker from acquiring a sufficient number of useful traces in a short time, which can effectively counter an SCPA attack. Secondly, in the application layer, using the proposed handshake protocol, the IoT device can access the IoT server to regain update permission, which can defend against DoS attacks. Moreover, on the STM32F405+M25P40 hardware platform, we implement both Philips' scheme and the proposed modified scheme. Experimental results show that, compared with the firmware update scheme of Philips Hue smart lamps, the proposed scheme additionally requires only 2.35 KB of Flash memory and a maximum of 0.32 s of update time to effectively enhance the security of the AES-CCM-based firmware update process.
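
The bootloader-side idea of limiting the number of updates per unit time can be sketched as a sliding-window counter. This is an illustration only: the class name UpdateRateLimiter, its parameters, and the chosen limits are hypothetical; a real bootloader would be written for the target microcontroller and keep its state in non-volatile memory, and the handshake that restores update permission is not modelled.

```python
from collections import deque
from typing import Deque


class UpdateRateLimiter:
    """Allow at most `max_updates` firmware update attempts per time window."""

    def __init__(self, max_updates: int, window_seconds: float) -> None:
        self.max_updates = max_updates
        self.window_seconds = window_seconds
        self._attempts: Deque[float] = deque()  # timestamps of accepted attempts

    def allow(self, now: float) -> bool:
        """Return True and record the attempt if the quota is not yet used up."""
        # Forget attempts that have fallen out of the window.
        while self._attempts and now - self._attempts[0] >= self.window_seconds:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_updates:
            return False
        self._attempts.append(now)
        return True


# Hypothetical policy: two update attempts per hour.
limiter = UpdateRateLimiter(max_updates=2, window_seconds=3600.0)
decisions = [limiter.allow(t) for t in (0.0, 10.0, 20.0, 3600.0, 3605.0)]
print(decisions)
```

Capping the attempt rate bounds how many power traces an attacker can collect per unit time, which is the property the abstract relies on to counter SCPA.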

A Real-Time Multi-Stage Architecture for Pose Estimation of Zebrafish Head with Convolutional Neural Networks

Zhang-Jin Huang, Xiang-Xiang He, Fang-Jun Wang, Qing Shen

Journal of Data Acquisition and Processing, 2021, 36 (2): 434-444.

To conduct optical neurophysiology experiments on a freely swimming zebrafish, it is essential to quantify the zebrafish head to determine exact lighting positions. To efficiently quantify a zebrafish head's behaviors with limited resources, we propose a real-time multi-stage architecture based on convolutional neural networks for pose estimation of the zebrafish head on CPUs. Each stage is implemented with a small neural network. Specifically, a light-weight object detector named Micro-YOLO is used to detect a coarse region of the zebrafish head in the first stage. In the second stage, a tiny bounding box refinement network is devised to produce a high-quality bounding box around the zebrafish head. Finally, a small pose estimation network named tiny-hourglass is designed to detect keypoints in the zebrafish head. The experimental results show that using Micro-YOLO combined with RegressNet to predict the zebrafish head region is not only more accurate but also much faster than Faster R-CNN, a representative two-stage detector. Compared with DeepLabCut, a state-of-the-art method for estimating poses of user-defined body parts, our multi-stage architecture achieves higher accuracy and runs 19x faster on CPUs.

An Effective Discrete Artificial Bee Colony Based SPARQL Query Path Optimization by Reordering Triples

Zeynep Banu Ozger, Nurgul Yuzbasioglu Uslu

Journal of Data Acquisition and Processing, 2021, 36 (2): 445-462.

The Semantic Web has emerged to make web content machine-readable, and its importance has grown with the rapid increase in the number of web pages. Resource description framework (RDF) is a special data graph format in which Semantic Web data are stored, and it can be queried with the SPARQL query language. The challenge is to find the query order that results in the shortest processing time. In this paper, the discrete Artificial Bee Colony (dABCSPARQL) algorithm is proposed, based on a novel heuristic approach, namely reordering SPARQL queries. The processing time of queries with different shapes and sizes is minimized using the dABCSPARQL algorithm. The performance of the proposed method is evaluated on chain, star, cyclic, and chain-star queries of different sizes from the Lehigh University Benchmark (LUBM) dataset. The results obtained by the proposed method are compared with those of the ARQ (a SPARQL processor for Jena) query engine and the Ant System, Elitist Ant System, and MAX-MIN Ant System algorithms. The experiments demonstrate that the proposed method significantly reduces the processing time, and for most queries, the reduction rate is higher than that of the other optimization methods.