Paper reading: "Deep Learning in bioinformatics: introduction, application and perspective in big data era"

2019-09-08

Introduction

Computer Vision

image recognition
object detection
image inpainting
super-resolution

Natural Language Processing

text classification
speech recognition
machine translation

Bioinformatics

sequence analysis
- predict the effect of noncoding sequence variants
- model the transcription factor binding affinity landscape (note to self: there is a paper on this worth reading)
- improve DNA sequencing and peptide sequencing
- analyze DNA sequence modifications
- model various post-transcriptional regulation events
structure prediction and reconstruction
- predict protein secondary structure
- model the protein structure when it interacts with other molecules
- predict protein contact maps and the structure of membrane proteins
- accelerate fluorescence microscopy super-resolution
biomolecular property and function prediction
- predict detailed enzyme function by predicting the Enzyme Commission (EC) number
- predict protein Gene Ontology (GO) terms
- predict protein subcellular location
biomedical image processing and diagnosis
- classify skin cancer
- predict fluorescent labels from transmitted-light images of unlabeled biological samples
- analyze cell imaging data
biomolecule interaction prediction and systems biology
- model the hierarchical structure and function of the whole cell
- predict novel drug-target interactions
- model polypharmacy side effects with multi-modal graph convolutional networks

Deep Learning Methods

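Train a neural network (with nonlinear functions) to express the implicit relationship between features and labels.

The parameters W have to be trained so that the model fits the data. The algorithm for this is forward-backward propagation: it minimizes the difference (loss or error) between the forward output and the labels until the model converges.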

Commonly used activation functions:

hidden layer — ReLU

output layer — softmax

Commonly used loss functions:

classification — cross-entropy

regression — mean squared error
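The activation and loss functions listed above are short enough to write out. This is a minimal sketch in plain Python (no deep learning library assumed); the function names are my own, not from the paper.

```python
import math

def relu(x):
    """ReLU for hidden layers: max(0, v) element-wise."""
    return [max(0.0, v) for v in x]

def softmax(logits):
    """Softmax for the output layer; the max is subtracted for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Classification loss: negative log-probability of the true class index."""
    return -math.log(probs[label])

def mean_squared_error(pred, target):
    """Regression loss: average squared difference."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

hidden = relu([-1.5, 0.3, 2.0])       # negative activations are clipped to 0
probs = softmax([2.0, 0.5, -1.0])     # made-up output logits for 3 classes
loss = cross_entropy(probs, label=0)  # small when class 0 gets high probability
```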

Optimizers:

stochastic gradient descent (SGD)

Momentum with learning rate decay (a good choice when you understand the problem well)

Adam (a reasonable default when you are not familiar with the problem)

RMSprop
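To make the difference between these optimizers concrete, here is a rough sketch of the momentum and Adam update rules on a one-parameter toy loss f(w) = (w - 3)^2. The toy loss, the hyperparameter values and the particular decay schedule are assumptions chosen for illustration.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=300, lr=0.1, beta=0.9, decay=0.01):
    """SGD with momentum; the learning rate decays as lr / (1 + decay * t)."""
    v = 0.0
    for t in range(steps):
        step_lr = lr / (1.0 + decay * t)
        v = beta * v + grad(w)   # accumulate a running direction
        w -= step_lr * v
    return w

def adam(w, steps=300, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: step size adapted by running estimates of the gradient moments."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # first moment (mean)
        v = b2 * v + (1 - b2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(sgd_momentum(0.0), adam(0.0))  # both should end close to the minimum at w = 3
```

Momentum exposes a learning rate and a decay schedule that need tuning, while Adam rescales the step automatically, which is consistent with the advice in the list above.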

CNN Deep Learning Architecture

local connectivity

weight sharing
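Both properties show up in a few lines of plain Python. The sketch below is a 1D "valid" convolution (strictly a cross-correlation, which is what most deep learning libraries compute); the signal and the kernel are made-up values.

```python
def conv1d(signal, kernel):
    """Slide one kernel over the signal and return the list of responses."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        window = signal[i:i + k]  # local connectivity: each output sees only k inputs
        # weight sharing: the same kernel values are reused at every position
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

signal = [0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 1]  # responds to rising (+1) and falling (-1) edges
print(conv1d(signal, edge_kernel))  # [0, 1, 0, 0, -1, 0]
```

The layer has only `len(kernel)` parameters regardless of the input length, which is why CNNs need far fewer weights than fully connected layers on the same input.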

Model architectures:

AlexNet, VGG, GoogLeNet, ResNet, SENet, DenseNet, DPN

RNN Deep Learning Architecture

Model architectures:

LSTM, Bi-RNN, GRU

Graph Neural Networks

Primary task:

extract and encode the topological and connectivity information from the network

To preserve each node's information in the network (its neighborhood information), a neighbor tree is built for the node.

Generative models: GAN and VAE — unsupervised learning

Learn the data distribution and generate new data points with some variation.

GAN

Variational Autoencoder

A plain autoencoder cannot generate new data.

Variational autoencoder
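What lets a VAE generate new points is that its encoder outputs a distribution (a mean and a log-variance per latent dimension) instead of a single code. The sketch below shows only the sampling step (the reparameterization trick) and the KL term of the loss; the encoder outputs are hypothetical numbers and the encoder and decoder networks are omitted.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [-2.0, -2.0]  # hypothetical encoder outputs for one input
samples = [sample_latent(mu, log_var, rng) for _ in range(3)]  # three different codes
penalty = kl_to_standard_normal(mu, log_var)
```

Feeding each sampled code to the decoder would yield a slightly different output, which is the "new data points with some variation" that a plain autoencoder cannot provide.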

Applications of deep learning in bioinformatics

1. Identifying enzymes using multi-layer neural networks

identify enzyme sequences from sequence information using deep learning methods

encode the protein sequences into numbers → feedforward neural network
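A feedforward network needs a fixed-length numeric input, so the encoding step matters. One simple option is the amino acid composition vector sketched below; this is an assumption for illustration and may differ from the encoding used in the paper. The peptide is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequence):
    """Fraction of each standard amino acid; unknown letters are ignored."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[aa] / total for aa in AMINO_ACIDS]

features = composition("MKVLAAGIVK")  # 20 numbers, whatever the sequence length
```

The resulting 20-dimensional vector can be fed directly to the input layer; it discards residue order, which is the price of the fixed length.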

2. Gene expression regression

different genes’ expression can be highly correlated

profile around 1000 carefully selected landmark genes, then computationally predict the expression of the remaining target genes from the landmark gene expression

3. RNA-protein binding sites prediction with CNN

RNA-binding proteins (RBPs)

Encode the RNA sequences as 2D tensors.
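A common way to turn a sequence into a 2D tensor is one-hot encoding, giving a matrix of shape (sequence length, 4). The sketch below assumes that scheme and an A/C/G/U column order; the notes do not spell out the paper's exact layout.

```python
RNA_ALPHABET = "ACGU"

def one_hot_rna(sequence):
    """Return a len(sequence) x 4 matrix; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in RNA_ALPHABET]
            for base in sequence.upper()]

tensor = one_hot_rna("AUGGCN")  # 6 rows, one per base, ready for a CNN
```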

4. DNA sequence function prediction with CNN and RNN

predict the function of non-coding DNA sequences

5. Biomedical image classification using transfer learning and ResNet

6. Graph embedding for novel protein interaction prediction using GCN

graph embedding — PPI networks

Use a GCN to learn an embedding for each node (protein), then apply the interaction operation (inner product) to each pair of nodes.
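The two steps can be sketched in plain Python on a toy graph of three proteins. This assumes the widely used symmetric normalization D^-1/2 (A + I) D^-1/2 and a single layer; the adjacency matrix and the weights are made-up values, not trained parameters.

```python
import math

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """D^-1/2 (A + I) D^-1/2: add self-loops, then scale by node degrees."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(a_norm, features, weights):
    """One graph convolution: ReLU(A_norm X W)."""
    h = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in h]

def interaction_scores(embeddings):
    """Inner product between every pair of node embeddings (Z Z^T)."""
    transposed = [list(col) for col in zip(*embeddings)]
    return matmul(embeddings, transposed)

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]           # known interactions: 0-1 and 1-2
features = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]      # identity features, one per protein
weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.6]]  # hypothetical 3x2 weight matrix
a_norm = normalize_adjacency(adj)
embeddings = gcn_layer(a_norm, features, weights)
scores = interaction_scores(embeddings)           # scores[0][2]: candidate new interaction
```

A higher score for an unconnected pair suggests a candidate interaction; in practice the scores would be passed through a sigmoid and the weights trained against known edges.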

7. Biology image super-resolution using GAN

8. High dimensional biological data embedding and generation with VAE

Perspectives: limitations and suggestions

1. Lack of data

transfer learning

use a well-trained model from a similar task and fine-tune the last one or two layers on the limited real data

data augmentation

simulated data

2. Overfitting

Techniques that act on the model parameters and the model architecture:

dropout
batch normalization
weight decay
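Two of these are simple enough to sketch without a framework. The code below shows inverted dropout and an SGD step with an L2 weight decay term; batch normalization is left out for brevity. Function names and default values are illustrative assumptions.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # at test time the layer is the identity
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step_with_weight_decay(weights, grads, lr=0.1, decay=0.01):
    """SGD update with an L2 penalty: w <- w - lr * (grad + decay * w)."""
    return [w - lr * (g + decay * w) for w, g in zip(weights, grads)]

rng = random.Random(0)
hidden = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng)  # some units zeroed, others doubled
weights = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0])  # shrinks toward 0 even with zero gradient
```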

3. Imbalanced data

use appropriate criteria to evaluate the predictions and to define the loss
upsample smaller classes
downsample larger classes
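Resampling can be sketched with the standard library alone. The helper below brings every class to the same size: larger classes are sampled without replacement, smaller ones keep all their examples and are topped up with random duplicates. The function name and the toy data are assumptions.

```python
import random
from collections import Counter

def resample(samples, labels, target_size, rng):
    """Return a class-balanced copy of the data with `target_size` items per class."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, items in by_class.items():
        if len(items) >= target_size:
            chosen = rng.sample(items, target_size)  # downsample the larger class
        else:
            extra = [rng.choice(items) for _ in range(target_size - len(items))]
            chosen = items + extra                   # upsample the smaller class
        out_x.extend(chosen)
        out_y.extend([y] * target_size)
    return out_x, out_y

rng = random.Random(0)
xs = list(range(10))
ys = [0] * 8 + [1] * 2                 # 8 negatives, 2 positives
bx, by = resample(xs, ys, target_size=4, rng=rng)
print(Counter(by))                     # 4 examples of each class
```

Resampling should be applied to the training split only, so that evaluation still reflects the real class ratio.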

4. Interpretability

Inspect the importance score of each part of the input.

Perturbation-based approaches

Backpropagation-based methods
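The perturbation-based idea is the easier of the two to sketch: replace one input position at a time with a baseline value and record how much the model output changes. The `toy_model` below is a hypothetical stand-in for a trained network.

```python
def occlusion_importance(model, inputs, baseline=0.0):
    """Importance of position i = output drop when inputs[i] is replaced by `baseline`."""
    reference = model(inputs)
    scores = []
    for i in range(len(inputs)):
        perturbed = list(inputs)
        perturbed[i] = baseline
        scores.append(reference - model(perturbed))
    return scores

def toy_model(x):
    # Hypothetical model: feature 0 helps, feature 1 is ignored, feature 2 hurts.
    return 3.0 * x[0] + 0.0 * x[1] - 1.0 * x[2]

print(occlusion_importance(toy_model, [1.0, 1.0, 1.0]))  # [3.0, 0.0, -1.0]
```

This needs one forward pass per input position, which is the main cost of perturbation-based approaches; backpropagation-based methods obtain all scores from a single backward pass.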

5. Uncertainty scaling

legendary Platt scaling

histogram binning

isotonic regression

Bayesian Binning into Quantiles (BBQ)

temperature scaling
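Temperature scaling is the simplest of these to illustrate: divide the logits by a single scalar T before the softmax and choose T on a held-out set by minimizing the negative log-likelihood. The sketch below uses a grid search instead of gradient-based fitting, and the validation logits are made up to mimic an overconfident model.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; T > 1 flattens the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood on a held-out set."""
    return sum(-math.log(softmax(row, temperature)[y])
               for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

# Made-up validation set: very confident logits, but the last prediction is wrong.
val_logits = [[8.0, 0.0], [7.0, 0.0], [0.0, 9.0], [6.0, 0.0]]
val_labels = [0, 0, 1, 1]
best_t = fit_temperature(val_logits, val_labels, [0.5 + 0.25 * i for i in range(40)])
calibrated = softmax([8.0, 0.0], best_t)  # same predicted class, lower confidence
```

Because dividing by a positive T never changes the ranking of the logits, accuracy is unaffected; only the confidence values move.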

6. Catastrophic forgetting

regularization: EWC (elastic weight consolidation)
dynamic neural network architectures
rehearsal training methods: iCaRL

7. Reducing computational requirement and model compression

parameter pruning: removes redundant parameters
knowledge distillation
use compact convolutional filters to save parameters
low rank factorization
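As one concrete instance of parameter pruning, the sketch below zeroes out the weights with the smallest absolute value. Using magnitude as the redundancy criterion is an assumption here; other criteria exist, and the weights are made-up numbers.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])  # indices of the smallest-magnitude weights
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

print(magnitude_prune([0.9, -0.05, 0.3, 0.01, -0.7, 0.02], sparsity=0.5))
# [0.9, 0.0, 0.3, 0.0, -0.7, 0.0]
```

The zeroed weights can then be stored in a sparse format; a short round of fine-tuning afterwards is commonly used to recover any lost accuracy.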

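Author: Kelly Liu
Permalink: http://tiantianliu2018.github.io/2019/09/08/论文阅读《Deep-Learning-in-bioinformatics-introduction-application-and-perspective-in-big-data-era》/
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under the MIT License. Please credit the source when reposting.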