sklearn - Data Features (Lecture 5)
Published: 2021-06-29 | Category: Technical Articles


1.1. Purpose: data and features determine the upper bound of machine learning; models and algorithms merely approach that bound. Feature processing is the core of feature engineering, and good features improve the accuracy of prediction results.
1.2. Goals: reduce overfitting by removing redundant data; improve algorithm accuracy; reduce training time.
1.3. Feature selection methods in scikit-learn: univariate feature selection; recursive feature elimination (RFE); principal component analysis (PCA); feature importance.
2.1. Univariate feature selection: SelectKBest() selects data features using statistical tests.
(1) Chi-squared test: tests the correlation between an independent variable and the dependent variable by measuring the deviation between the observed sample values and the theoretically expected values. The larger the chi-squared value, the larger the deviation; a value of 0 means the observations match the theoretical values exactly.
Example 1: use chi-squared to select the 4 features with the greatest influence on the result.

# Select data features with the chi-squared test
import pandas as pd
import numpy as np

np.set_printoptions(precision=3)

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)
array = data.values

# Split into input features and output target
X = array[:, 0:8]
Y = array[:, 8]

# Select features with the chi-squared test
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

def getFetures(X=X, Y=Y):
    # Feature selection
    test = SelectKBest(score_func=chi2, k=4)
    fit = test.fit(X, Y)
    scores = pd.DataFrame(fit.scores_, index=names[0:8])
    scores = scores.sort_values(by=0, ascending=True)
    print(scores)
    features = fit.transform(X)
    print(features)

getFetures()

"""
              0
pedi     5.3927
pres    17.6054
skin    53.1080
preg   111.5197
mass   127.6693
age    181.3037
plas  1411.8870
test  2175.5653
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 ...
 [121.  112.   26.2  30. ]
 [126.    0.   30.1  47. ]
 [ 93.    0.   30.4  23. ]]
"""

Other univariate scoring options include correlation coefficients and mutual information (see the sketch below).
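Chi-squared is only one choice of score_func. As a rough sketch (not part of the original example, and assuming X, Y and names from Example 1 are already defined), the same ranking can be done with mutual information, which also captures nonlinear dependence:

from sklearn.feature_selection import SelectKBest, mutual_info_classif
import pandas as pd

# Rank the same 8 features with mutual information instead of chi-squared
mi_selector = SelectKBest(score_func=mutual_info_classif, k=4)
mi_fit = mi_selector.fit(X, Y)

# Pair each score with its feature name and sort ascending, as in Example 1
mi_scores = pd.DataFrame(mi_fit.scores_, index=names[0:8]).sort_values(by=0)
print(mi_scores)
print(mi_fit.transform(X)[:3])  # first rows of the 4 selected features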
2.2. Feature selection via recursive feature elimination (RFE)
RFE trains a base model over multiple rounds; after each round it removes the features with the smallest weights and trains the next round on the remaining feature set. The accuracy of the base model across rounds identifies the features that matter most to the final prediction.
Example 2: use recursive elimination to find the 3 most influential features.

# Feature selection via recursive feature elimination - find the 3 most influential features
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def getFeturesRFE(X=X, Y=Y):
    # Feature selection
    model = LogisticRegression()
    rfe = RFE(estimator=model, n_features_to_select=3)  # keep the 3 most important features
    fit = rfe.fit(X, Y)
    print("Number of features:", fit.n_features_)
    print("Selected features:", fit.support_)
    print("Feature ranking:", fit.ranking_)

getFeturesRFE()

# Number of features: 3
# Selected features: [ True False False False False  True  True False]
# Feature ranking: [1 2 4 5 6 1 1 3]
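The boolean support_ mask is easier to read when mapped back to the column names, and RFECV can choose the number of features by cross-validation instead of fixing it at 3. A minimal sketch, assuming X, Y and names from the examples above are in scope:

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Map the boolean support_ mask back to column names
rfe = RFE(estimator=model, n_features_to_select=3).fit(X, Y)
selected = [name for name, keep in zip(names[0:8], rfe.support_) if keep]
print("Selected columns:", selected)

# Let cross-validation choose how many features to keep
rfecv = RFECV(estimator=model, cv=5).fit(X, Y)
print("Optimal number of features:", rfecv.n_features_)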

 

2.3. Principal component analysis (PCA)
PCA uses linear algebra to transform and compress the data, which is known as dimensionality reduction. Two common approaches:
PCA (principal component analysis): unsupervised dimensionality reduction that keeps the projected samples maximally spread out (maximum variance); used in clustering and similar tasks.
LDA (linear discriminant analysis): supervised dimensionality reduction that makes the projected samples as separable by class as possible.
Example 3:

# Select data features with principal component analysis
from sklearn.decomposition import PCA

def getFeturesPCA(X=X, Y=Y):
    # Feature selection
    pca = PCA(n_components=3)
    fit = pca.fit(X)
    print("Explained variance: %s" % fit.explained_variance_ratio_)
    print(fit.components_)

getFeturesPCA()

# Explained variance: [0.889 0.062 0.026]
# [[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
#    5.372e-04 -3.565e-03]
#  [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
#   -8.168e-04 -1.402e-01]
#  [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
#   -6.400e-04 -1.255e-01]]
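The section mentions LDA but only shows PCA code. A minimal sketch of the supervised counterpart, assuming the same X and Y as above; note that with a binary target LDA can keep at most n_classes - 1 = 1 component:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Supervised projection: at most n_classes - 1 components (here 1)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, Y)
print(X_lda.shape)                    # (n_samples, 1)
print(lda.explained_variance_ratio_)  # share of between-class variance kept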
2.4. Feature importance
Bagged decision trees, random forests, and extra trees (extremely randomized trees) can all compute the importance of each data feature.
Example 4:

# Compute feature importance with an extra-trees classifier
from sklearn.ensemble import ExtraTreesClassifier

def test_ExtraTreesClassifier(X=X, Y=Y):
    # Feature selection
    model = ExtraTreesClassifier()
    fit = model.fit(X, Y)
    print(fit.feature_importances_)

test_ExtraTreesClassifier()

# [0.11  0.23  0.1   0.082 0.076 0.142 0.118 0.142]
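To turn these importances into an actual feature subset, SelectFromModel keeps the features whose importance exceeds a threshold. A minimal sketch, assuming X, Y and names from the examples above (the random_state value is arbitrary):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance is above the mean importance
selector = SelectFromModel(ExtraTreesClassifier(random_state=7), threshold="mean")
selector.fit(X, Y)

kept = [name for name, keep in zip(names[0:8], selector.get_support()) if keep]
print("Kept features:", kept)
print("Reduced shape:", selector.transform(X).shape)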
3. Notes
1. Univariate feature selection
1.1. Description: univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step for an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method, applying a generic univariate statistical test to each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error (SelectFwe). GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which allows a hyperparameter search estimator to pick the best univariate selection strategy.
Scoring functions for regression: f_regression, mutual_info_regression. For classification: chi2, f_classif, mutual_info_classif.
F-test based methods estimate the degree of linear dependency between two random variables. Mutual information methods can capture any kind of statistical dependency, but, being nonparametric, they require more samples for accurate estimation.
Sparse data: chi2, mutual_info_regression, and mutual_info_classif handle sparse data without making it dense.
Warning: do not use a regression scoring function on a classification problem; the results will be useless.
=======================================================================
1.2. Function: SelectKBest(score_func=f_classif, *, k=10) removes all but the k highest-scoring features.
Parameters:
    k: int or "all", optional, default=10. Number of top features to select; "all" bypasses selection and is useful for parameter search.
    score_func: callable taking two arrays X and y and returning either a pair of arrays (scores, pvalues) or a single array of scores. The default, f_classif, is only suitable for classification tasks.
Options for score_func:
    f_classif: ANOVA F-value between label and features, for classification tasks.
    mutual_info_classif: mutual information for a discrete target.
    chi2: chi-squared statistics of non-negative features, for classification tasks.
    f_regression: F-value between label and features, for regression tasks.
    mutual_info_regression: mutual information for a continuous target.
Related selectors:
    SelectPercentile: select features according to a percentile of the highest scores.
    SelectFpr: select features based on a false positive rate test.
    SelectFdr: select features based on an estimated false discovery rate.
    SelectFwe: select features based on the family-wise error rate.
    GenericUnivariateSelect: univariate feature selector with a configurable mode.
Attributes:
    scores_: array-like of shape (n_features,). Feature scores.
    pvalues_: array of shape (n_features,). p-values of the feature scores; None if score_func returns only scores.
Example:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)
X.shape        # (1797, 64)
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape    # (1797, 20)

Note: ties between features with equal scores are broken in an unspecified way.
=======================================================================
SelectPercentile(score_func=f_classif, *, percentile=10) selects features according to a percentile of the highest scores; it removes all but the user-specified highest-scoring percentage of features.
Parameters:
    score_func: callable taking two arrays X and y and returning either a pair of arrays (scores, pvalues) or a single array of scores. The default, f_classif, is only suitable for classification tasks.
    percentile: int, optional, default=10. Percent of features to keep.
Attributes:
    scores_: array-like of shape (n_features,). Feature scores.
    pvalues_: array-like of shape (n_features,). p-values of the feature scores; None if score_func returns only scores.
=======================================================================
Methods (shared by these selectors):
    fit(X, y): run the score function on (X, y) and get the appropriate features.
    fit_transform(X[, y]): fit to the data, then transform it.
    get_params([deep]): get the parameters of this estimator.
    get_support([indices]): get a mask, or integer index, of the selected features.
    inverse_transform(X): reverse the transformation operation.
    set_params(**params): set the parameters of this estimator.
    transform(X): reduce X to the selected features.
For example, we can perform a chi-squared test on the samples to retrieve only the two best features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
X.shape        # (150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape    # (150, 2)

These objects take as input a scoring function that returns univariate scores and p-values (or only scores in the case of SelectKBest and SelectPercentile).
=======================================================================
sklearn.feature_selection.f_classif(X, y): compute the ANOVA F-value for the provided sample.
Parameters:
    X: {array-like, sparse matrix} of shape (n_samples, n_features). The set of regressors that will be tested sequentially.
    y: array of shape (n_samples,). The target vector.
Returns:
    F: array of shape (n_features,). The set of F values.
    pval: array of shape (n_features,). The set of p-values.
=======================================================================
sklearn.feature_selection.chi2(X, y): compute chi-squared statistics between each non-negative feature and class.
This score can be used to select the n_features features with the highest values of the chi-squared test statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes. Recall that the chi-squared test measures dependence between stochastic variables, so this function "weeds out" the features that are most likely to be independent of class and therefore irrelevant for classification.
Parameters:
    X: {array-like, sparse matrix} of shape (n_samples, n_features). Sample vectors.
    y: array-like of shape (n_samples,). Target vector (class labels).
Returns:
    chi2: array of shape (n_features,). Chi-squared statistics of each feature.
    pval: array of shape (n_features,). p-values of each feature.
=======================================================================
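GenericUnivariateSelect is listed above without an example. A minimal sketch on the iris data, using mode='k_best' to reproduce SelectKBest(chi2, k=2); the mode string ('percentile', 'k_best', 'fpr', 'fdr', 'fwe') can itself be tuned in a hyperparameter search:

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

X, y = load_iris(return_X_y=True)
# Same selection as SelectKBest(chi2, k=2), expressed through the generic API
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=2)
X_new = transformer.fit_transform(X, y)
print(X_new.shape)  # (150, 2)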
Univariate Feature Selection (worked example)
Noisy (non-informative) features are added to the iris data and univariate feature selection is applied. For each feature, we plot the p-values of the univariate feature selection and the corresponding weights of an SVM. Univariate feature selection picks out the informative features, and these have larger SVM weights. In the total set of features, only the first 4 are significant, and they obtain the highest univariate selection scores. The SVM assigns a large weight to one of these features, but also selects many of the non-informative ones. Applying univariate feature selection before the SVM increases the SVM weight attributed to the significant features, and thus improves classification.

[Figure sphx_glr_plot_feature_selection_001.png: bar chart comparing univariate scores and SVM weights per feature, with and without selection]

Out:
Classification accuracy without selecting features: 0.789
Classification accuracy after univariate feature selection: 0.868

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# #############################################################################
# Import some data to play with

# The iris dataset
X, y = load_iris(return_X_y=True)

# Some noisy data not correlated with the target
E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))

# Add the noisy data to the informative features
X = np.hstack((X, E))

# Split the dataset to select features and evaluate the classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])

# #############################################################################
# Univariate feature selection with F-test for feature scoring.
# We use the default selection function to select the four
# most significant features.
selector = SelectKBest(f_classif, k=4)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)')

# #############################################################################
# Compare to the weights of an SVM
clf = make_pipeline(MinMaxScaler(), LinearSVC())
clf.fit(X_train, y_train)
print('Classification accuracy without selecting features: {:.3f}'
      .format(clf.score(X_test, y_test)))

svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
svm_weights /= svm_weights.sum()

plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight')

clf_selected = make_pipeline(
    SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())
clf_selected.fit(X_train, y_train)
print('Classification accuracy after univariate feature selection: {:.3f}'
      .format(clf_selected.score(X_test, y_test)))

svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
svm_weights_selected /= svm_weights_selected.sum()

plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection')

plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()
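Note that the example fits SelectKBest on the training split only; selecting features on the full dataset before evaluation leaks information. A minimal sketch (same iris-plus-noise setup, not from the original article) that keeps the selector inside a Pipeline so it is refit on every cross-validation training fold:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Iris plus 20 noisy, uninformative columns, as in the example above
X, y = load_iris(return_X_y=True)
E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))
X = np.hstack((X, E))

# The selector lives inside the pipeline, so each CV fold refits it
# on its own training portion only
clf = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))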

 

 

Reposted from: https://chunyou.blog.csdn.net/article/details/106393609

