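Modeling approaches ranging from molecular similarity assessment to quantitative structure-activity relationship (QSAR) analysis with machine learning have been applied to datasets of varying size and composition (numbers of blockers and non-blockers). In this study, publicly available bioactivity data were used to develop robust classifiers for predicting hERG blockers. Random forest was used to build predictive models with different molecular descriptors, activity thresholds, and training set compositions. Compared with the results reported by the earlier studies from which datasets were extracted, the model showed superior performance in external validation.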
Code example
# Import dependencies
import pandas as pd
import numpy as np
import warnings; warnings.simplefilter('ignore')
from rdkit import Chem, DataStructs
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem, Draw
from sklearn.ensemble import RandomForestClassifier
#from sklearn.model_selection import StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import KFold, StratifiedKFold,StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from matplotlib import cm
import math
import pickle
import os
Define functions
class FP:
    """
    A fingerprint class that inserts molecular fingerprints into pandas data frame
    """
    def __init__(self, fp):
        self.fp = fp

    def __str__(self):
        return "%d bit FP" % len(self.fp)

    def __len__(self):
        return len(self.fp)


def get_morgan_fp(mol):
    """
    Returns the RDKit Morgan fingerprint for a molecule
    """
    info = {}
    arr = np.zeros((1,))
    # Radius-2 Morgan fingerprint folded to 1024 bits; bitInfo records which atom environments set each bit
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024, useFeatures=False, bitInfo=info)
    DataStructs.ConvertToNumpyArray(fp, arr)
    # Overwrite the 0/1 bit vector with per-bit counts of the environments that set each bit
    arr = np.array([len(info[x]) if x in info else 0 for x in range(1024)])
    return FP(arr)
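The count vector built on the last lines of get_morgan_fp can be illustrated without RDKit. The sketch below assumes a hand-written dictionary in the shape RDKit fills in for bitInfo (bit index mapped to a tuple of (atom index, radius) pairs). The dictionary contents and the helper name counts_from_bit_info are hypothetical and serve only as an illustration.

```python
def counts_from_bit_info(bit_info, n_bits):
    """Return a list of length n_bits holding the number of environments per bit."""
    return [len(bit_info.get(bit, ())) for bit in range(n_bits)]


# Hypothetical bitInfo-style mapping: bit index -> ((atom_idx, radius), ...)
example_bit_info = {
    1: ((0, 0), (3, 0)),  # bit 1 is set by two atom environments
    5: ((2, 1),),         # bit 5 is set by one atom environment
}

counts = counts_from_bit_info(example_bit_info, n_bits=8)
print(counts)  # [0, 2, 0, 0, 0, 1, 0, 0]
```

A plain bit vector would store 1 at positions 1 and 5; the count vector keeps the extra information that bit 1 was hit twice.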
Data preprocessing
df = pd.read_csv("chembl_training_T3.csv", index_col=0)
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='can_smiles')
df = df[~df.ROMol.isnull()]
df['fp'] = df.apply(lambda x: get_morgan_fp(x['ROMol']), axis=1)
df.head()  # inspect the data
Define X and y (fingerprint dataset)
X = np.array([x.fp for x in df.fp])
X.shape
y = np.array(df.ac)
y.shape
Cross-validation
# Initialize performance measures
sens = np.array([])
spec = np.array([])
auc = np.array([])
# 10-fold cross-validation split
kfolds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
kfolds.get_n_splits(X, y)
print(kfolds)
for train, test in kfolds.split(X, y):
    # Split data to training and test set
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    # Training a random forest classifier
    rf_clf = RandomForestClassifier(n_estimators=100, criterion='gini', n_jobs=1)
    rf_clf.fit(X_train, y_train)
    # Predicting the test set
    y_pred = rf_clf.predict(X_test)
    y_pred_proba = rf_clf.predict_proba(X_test).T[1]
    # Append performance measures
    auc = np.append(auc, roc_auc_score(y_test, y_pred_proba))
    sens = np.append(sens, recall_score(y_test, y_pred, pos_label=1))
    spec = np.append(spec, recall_score(y_test, y_pred, pos_label=0))
# 10-fold cross-validation performance
print('AUC:\t\t\t%.2f +/- %.2f' % (auc.mean(), auc.std()))
print('Sensitivity:\t\t%.2f +/- %.2f' % (sens.mean(), sens.std()))
print('Specificity:\t\t%.2f +/- %.2f' % (spec.mean(), spec.std()))
AUC:            0.95 +/- 0.01
Sensitivity:    0.84 +/- 0.03
Specificity:    0.91 +/- 0.03
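In the loop above, sensitivity and specificity are both obtained from recall_score, once with pos_label=1 and once with pos_label=0. The plain-Python sketch below shows why that works: recall for a label is the fraction of samples truly carrying that label that were also predicted with it. The helper name recall_for_label and the label lists are made up for illustration and are not taken from the hERG dataset.

```python
def recall_for_label(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predictions:
        raise ValueError("no samples with true label %r" % (label,))
    return sum(1 for p in predictions if p == label) / len(predictions)


# Made-up labels: 1 = active class, 0 = inactive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sensitivity = recall_for_label(y_true, y_pred, 1)  # recall of class 1: 3 of 4
specificity = recall_for_label(y_true, y_pred, 0)  # recall of class 0: 5 of 6

print('Sensitivity: %.2f' % sensitivity)  # Sensitivity: 0.75
print('Specificity: %.2f' % specificity)  # Specificity: 0.83
```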
Testing the prediction model (single molecule)
mySMILES ='Fc1ccc(cc1)n3c2ccc(Cl)cc2c(c3)C5CCN(CCN4C(=O)NCC4)CC5'
# Build the one-row input frame directly; DataFrame.append was removed in pandas 2.0
mySMILESinput = pd.DataFrame([{'ID': 123, 'my_smiles': mySMILES}])
PandasTools.AddMoleculeColumnToFrame(mySMILESinput,'my_smiles','ROMol')
mySMILESinput['fp'] = mySMILESinput.apply(lambda x: get_morgan_fp(x['ROMol']), axis=1)
mySMILESinput
resQuery = np.array([x.fp for x in mySMILESinput.fp])
y_pred = rf_clf.predict(resQuery)
y_pred
print(rf_clf.predict(resQuery))
print(rf_clf.predict_proba(resQuery))
[1]
[[0.04 0.96]]
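The output reads as follows: the predicted class is 1, and the probability row gives 0.04 for class 0 and 0.96 for class 1 (scikit-learn orders the columns of predict_proba by rf_clf.classes_). The sketch below shows how a label follows from such a row. It assumes that class 1 in the ac column marks the active (blocker) class; the helper label_from_proba and its threshold parameter are hypothetical and not part of scikit-learn.

```python
def label_from_proba(proba_row, classes=(0, 1), threshold=0.5):
    """Map one row of class probabilities to a binary label.

    proba_row lists probabilities in the order given by `classes`.
    Returns 1 when the probability of class 1 exceeds `threshold`, else 0.
    """
    positive_index = classes.index(1)
    return 1 if proba_row[positive_index] > threshold else 0


row = [0.04, 0.96]  # probability row printed above

print(label_from_proba(row))                  # 1
print(label_from_proba(row, threshold=0.97))  # 0
```

With the default threshold of 0.5 the result matches the class printed by predict for this molecule. Raising the threshold would call fewer molecules class 1, which trades sensitivity for specificity.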
Source: https://blog.csdn.net/weixin_39622901/article/details/111390907