A Collection of Machine Learning Code

XGBoost

sklearn ships an implementation of gradient boosting, but there is a more powerful standalone package. Installing it takes no more than

pip3 install xgboost

although the installation itself can be a bit of an ordeal.

Once it is installed, we need to understand the general XGBoost workflow; none of the examples below steps outside this framework:

[Figure: the general XGBoost workflow]
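In code form the workflow looks roughly like this (a minimal sketch; the file paths are placeholders):

import xgboost as xgb

# 1. Load the data into DMatrix objects ('train.txt' / 'test.txt' are placeholder paths)
dtrain = xgb.DMatrix('train.txt')
dtest = xgb.DMatrix('test.txt')
# 2. Set the Booster parameters
param = {'max_depth': 3, 'eta': 0.3, 'objective': 'binary:logistic'}
# 3. Train for a fixed number of boosting rounds
bst = xgb.train(param, dtrain, num_boost_round=10)
# 4. Predict on new data
y_hat = bst.predict(dtest)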

Example 1

This example involves the agaricus data. Agaricus is a genus of mushrooms with many species, some poisonous and some not. Can we predict whether a given mushroom is poisonous?

import xgboost as xgb
import numpy as np

# 1. Basic usage of XGBoost
# 2. Custom loss function with its gradient and second derivative
train_data = 'xgboost_data/agaricus_train.txt'
test_data = 'xgboost_data/agaricus_test.txt'

# Define a loss function: logistic loss, returning gradient g and Hessian h
def log_reg(y_hat, y):
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y.get_label()
    h = p * (1.0 - p)
    return g, h

# Error rate; in this example, a prediction < 0.5 means "not poisonous"
def error_rate(y_hat, y):
    return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)

if __name__ == "__main__":
    # Load the data
    data_train = xgb.DMatrix(train_data)
    data_test = xgb.DMatrix(test_data)
    # Set the parameters
    param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}  # or logitraw
    watchlist = [(data_test, 'eval'), (data_train, 'train')]
    n_round = 7
    bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist,
                    obj=log_reg, feval=error_rate)
    # Compute the error rate
    y_hat = bst.predict(data_test)
    y = data_test.get_label()
    print('y_hat', y_hat)
    print('y', y)
    error = sum(y != (y_hat > 0.5))
    error_rate = float(error) / len(y_hat)
    print('Total samples:\t', len(y_hat))
    print('Errors:\t%4d' % error)
    print('Error rate:\t%.5f%%' % (100 * error_rate))

Notes:

The log_reg and error_rate functions defined at the top are used in the train call below, as its obj and feval arguments respectively. In other words: boosting is driven by the user-defined loss function log_reg, and the user-defined metric error_rate is used to measure the error rate.

About the train function used here:

def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
          maximize=None, early_stopping_rounds=None, evals_result=None,
          verbose_eval=True, xgb_model=None, callbacks=None)
"""
dtrain: the training data
num_boost_round: the number of boosting iterations
evals: evaluation sets, passed as a list of (DMatrix, name) pairs that label which set is the training set and which is the test set
"""

train takes a params argument, which brings us to the Booster parameters:

  • max_depth: the maximum depth of each decision tree

  • eta: the learning rate (default 0.3)

  • silent: silent mode; if set to 1, the model prints nothing while running

  • objective: the loss function to optimize, e.g. binary:logistic for binary classification or reg:linear for regression (reg:linear is the default in older versions)

XGBoost stores its data in the DMatrix data structure, a two-dimensional matrix that XGBoost optimizes internally.
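As a minimal sketch (the array values here are made up), a DMatrix can also be built directly from an in-memory numpy array rather than from a file:

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)             # 100 samples, 4 features (made-up data)
y = (X[:, 0] > 0.5).astype(int)        # a toy binary label
dmat = xgb.DMatrix(X, label=y)         # the labels travel with the matrix
print(dmat.num_row(), dmat.num_col())  # 100 4
print(dmat.get_label()[:5])            # the labels we attached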

The get_label method keeps appearing in the code above, so what exactly is a label?

There is an English explanation that puts it plainly:

The label is the name of some category. If you're building a machine learning system to distinguish fruits coming down a conveyor belt, labels for training samples might be "apple", "orange", "banana". The features are any kind of information you can extract about each sample. In our example, you might have one feature for colour, another for weight, another for length, and another for width. Maybe you would have some measure of concavity or linearity or ball-ness.

In short: in practice, the label says what the sample ultimately is, while the features are its individual attributes.

Example 2

This example uses the iris dataset. There are many species of iris; this dataset contains three (Setosa, Versicolor, Virginica), and their sepal and petal lengths and widths all differ. Let's train XGBoost on it and see whether it can predict the species effectively.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split   # formerly sklearn.cross_validation

def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'xgboost_data/iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[range(4)], data[4]
    y = pd.Categorical(y).codes
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=50)
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test, 'eval'), (data_train, 'train')]
    # tree depth 2, learning rate 0.3
    param = {'max_depth': 2, 'eta': 0.3, 'silent': 1, 'objective': 'multi:softmax', 'num_class': 3}
    bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
    y_hat = bst.predict(data_test)
    result = y_test.reshape(1, -1) == y_hat
    print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
    print('END.....\n')

Notes:

  • The code uses the pd.Categorical method, which provides categorization and (optional) ordering:

pandas.Categorical(val, categories=None, ordered=None, dtype=None)
"""
val       : [list-like] the values of the categorical.
categories: [index-like] the unique categories.
ordered   : [boolean] if False, the categorical is treated as unordered.
dtype     : [CategoricalDtype] an instance.
Errors -  ValueError: if the categories do not validate.
          TypeError: if ordered=True is given explicitly but the values cannot be sorted.
Returns - a Categorical variable.
"""
  • reshape(1, -1): reshape into one row

    reshape(2, -1): reshape into two rows

    reshape(-1, 1): reshape into one column

    reshape(-1, 2): reshape into two columns (a short demo follows this list)
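A quick demonstration of the four calls (the array contents are arbitrary):

import numpy as np

a = np.arange(6)                # [0 1 2 3 4 5]
print(a.reshape(1, -1).shape)   # (1, 6)  -> one row
print(a.reshape(2, -1).shape)   # (2, 3)  -> two rows
print(a.reshape(-1, 1).shape)   # (6, 1)  -> one column
print(a.reshape(-1, 2).shape)   # (3, 2)  -> two columns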

SVM

Example 1

We again take the classic iris dataset, this time making predictions with an SVM.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'

if __name__ == "__main__":
    path = "./iris.data"  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[range(4)], data[4]
    y = pd.Categorical(y).codes  # encode the species as integer class codes
    x = x[[0, 1]]  # keep only columns 0 and 1
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)

    # Classifier
    clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
    clf.fit(x_train, y_train.ravel())

    # Accuracy
    print(clf.score(x_train, y_train))
    print('Training accuracy:', accuracy_score(y_train, clf.predict(x_train)))
    print(clf.score(x_test, y_test))
    print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))

    # decision_function
    print('decision_function:\n', clf.decision_function(x_train))
    print('\npredict:\n', clf.predict(x_train))

    # Plot
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]  # grid of sample points
    grid_test = np.stack((x1.flat, x2.flat), axis=1)  # points to classify
    # Z = clf.decision_function(grid_test)  # distance of each point to the decision surface
    grid_hat = clf.predict(grid_test)      # predicted class of each grid point
    grid_hat = grid_hat.reshape(x1.shape)  # match the grid shape
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)  # all samples
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', zorder=10)  # circle the test samples
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris SVM', fontsize=16)
    plt.grid(True, ls=':')
    plt.tight_layout(pad=1.5)
    plt.show()

Code notes:

  • Parameters of svm.SVC:

    • C=1.0:

      The penalty parameter of the SVC. The larger C is, the higher the accuracy on the training set but the weaker the generalization; a smaller C lowers the penalty on misclassification, tolerates errors, and generalizes better.

    • kernel='rbf': the kernel function, 'rbf' by default; the options are 'linear' (linear kernel), 'poly' (polynomial kernel), 'rbf' (Gaussian kernel) and 'sigmoid' (sigmoid kernel)

    • degree: the degree of the 'poly' kernel, default 3; ignored when any other kernel is chosen.

    • gamma: the kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default is 'auto' (numerically, 1 divided by the number of features)

    • coef0: the constant term of the kernel. Only meaningful for 'poly' and 'sigmoid'.

    • probability: whether to enable probability estimates, default False

    • tol: the tolerance of the stopping criterion, default 1e-3

    • max_iter: the maximum number of iterations; -1 means no limit.

    • decision_function_shape: 'ovo', 'ovr' or None (None in older sklearn versions; newer versions default to 'ovr')

      ovo: one-vs-one

      ovr: one-vs-rest (a quick comparison of the two settings follows this list)
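As a minimal sketch of the difference (using sklearn's built-in iris data purely for illustration), the setting changes the shape of decision_function's output:

import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes

clf_ovr = svm.SVC(kernel='linear', decision_function_shape='ovr').fit(X, y)
clf_ovo = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)

print(clf_ovr.decision_function(X).shape)  # (150, 3): one score per class
print(clf_ovo.decision_function(X).shape)  # (150, 3): one score per pair of classes, C(3,2) = 3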

Example 2

# -*- coding:utf-8 -*-
import numpy as np
from sklearn import svm
from scipy import stats
from sklearn.metrics import accuracy_score
import matplotlib as mpl
import matplotlib.pyplot as plt

def extend(a, b, r):
    x = a - b
    m = (a + b) / 2
    return m - r * x / 2, m + r * x / 2

if __name__ == "__main__":
    # Build a synthetic sample set; the generation is explained in the code below
    np.random.seed(0)
    N = 20
    x = np.empty((4*N, 2))  # an uninitialized (4N, 2) matrix, to be filled in below
    means = [(-1, 1), (1, 1), (1, -1), (-1, -1)]
    sigmas = [np.eye(2), 2*np.eye(2), np.diag((1, 2)), np.array(((2, 1), (1, 2)))]  # four covariance matrices
    for i in range(4):
        mn = stats.multivariate_normal(means[i], sigmas[i]*0.3)
        x[i*N:(i+1)*N, :] = mn.rvs(N)
    a = np.array((0, 1, 2, 3)).reshape((-1, 1))
    y = np.tile(a, N).flatten()
    clf = svm.SVC(C=1, kernel='rbf', gamma=1, decision_function_shape='ovo')
    clf.fit(x, y)
    y_hat = clf.predict(x)
    acc = accuracy_score(y, y_hat)
    np.set_printoptions(suppress=True)
    print('Correctly classified samples: %d, accuracy: %.2f%%' % (round(acc*4*N), 100*acc))
    # decision_function
    print(clf.decision_function(x))
    print(y_hat)
    x1_min, x2_min = np.min(x, axis=0)
    x1_max, x2_max = np.max(x, axis=0)
    x1_min, x1_max = extend(x1_min, x1_max, 1.05)
    x2_min, x2_max = extend(x2_min, x2_max, 1.05)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    x_test = np.stack((x1.flat, x2.flat), axis=1)
    y_test = clf.predict(x_test)
    y_test = y_test.reshape(x1.shape)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#A0FFA0', '#6060FF', '#F080F0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g', 'b', 'm'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_test, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=40, c=y, cmap=cm_dark, alpha=0.7)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.grid(True)
    plt.tight_layout(pad=2.5)
    plt.title('SVM multi-class: one-vs-one / one-vs-rest', fontsize=18)
    plt.show()

Code notes:

  • scipy.stats.multivariate_normal

    Constructs a multivariate normal distribution with a manually specified mean and covariance.

  • The rvs method draws random samples from a distribution; for a one-dimensional normal this looks like scipy.stats.norm.rvs(loc=mean, scale=std, size=count). In the code above, mn.rvs(N) draws N random samples from the multivariate normal defined earlier (see the sketch below).
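A minimal sketch of both calls (the mean and covariance are made up):

import numpy as np
from scipy import stats

np.random.seed(0)
# A 2-D normal with a hand-picked mean and covariance
mn = stats.multivariate_normal(mean=[1, -1], cov=[[1.0, 0.3], [0.3, 2.0]])
samples = mn.rvs(5)      # draw 5 random samples, shape (5, 2)
print(samples.shape)
print(mn.pdf([1, -1]))   # density at the mean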

EM

Example 1

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the 3d projection on older matplotlib
from sklearn.metrics.pairwise import pairwise_distances_argmin

if __name__ == '__main__':
    style = 'myself'
    np.random.seed(0)
    mu1_fact = (0, 0, 0)            # true mean of component 1 (3-dimensional)
    cov1_fact = np.diag((1, 2, 3))  # true (diagonal) covariance of component 1
    data1 = np.random.multivariate_normal(mu1_fact, cov1_fact, 400)
    mu2_fact = (2, 2, 1)
    cov2_fact = np.array(((1, 1, 3), (1, 2, 1), (0, 0, 1)))
    data2 = np.random.multivariate_normal(mu2_fact, cov2_fact, 100)
    data = np.vstack((data1, data2))
    y = np.array([True] * 400 + [False] * 100)

    if style == 'sklearn':
        g = GaussianMixture(n_components=2, covariance_type='full', tol=1e-6, max_iter=1000)
        g.fit(data)
        print('Component weight:\t', g.weights_[0])
        print('Means:\n', g.means_, '\n')
        print('Covariances:\n', g.covariances_, '\n')
        mu1, mu2 = g.means_
        sigma1, sigma2 = g.covariances_
    else:
        num_iter = 100
        n, d = data.shape
        mu1 = data.min(axis=0)
        mu2 = data.max(axis=0)
        sigma1 = np.identity(d)
        sigma2 = np.identity(d)
        pi = 0.5
        # EM
        for i in range(num_iter):
            # E step: responsibilities under the current parameters
            norm1 = multivariate_normal(mu1, sigma1)
            norm2 = multivariate_normal(mu2, sigma2)
            tau1 = pi * norm1.pdf(data)
            tau2 = (1 - pi) * norm2.pdf(data)
            gamma = tau1 / (tau1 + tau2)
            # M step: re-estimate means, covariances and mixing weight
            mu1 = np.dot(gamma, data) / np.sum(gamma)
            mu2 = np.dot((1 - gamma), data) / np.sum((1 - gamma))
            sigma1 = np.dot(gamma * (data - mu1).T, data - mu1) / np.sum(gamma)
            sigma2 = np.dot((1 - gamma) * (data - mu2).T, data - mu2) / np.sum(1 - gamma)
            pi = np.sum(gamma) / n
            print(i, ":\t", mu1, mu2)
        print('Component weight:\t', pi)
        print('Means:\t', mu1, mu2)
        print('Covariances:\n', sigma1, '\n\n', sigma2, '\n')

    # Classify by comparing component densities
    norm1 = multivariate_normal(mu1, sigma1)
    norm2 = multivariate_normal(mu2, sigma2)
    tau1 = norm1.pdf(data)
    tau2 = norm2.pdf(data)

    fig = plt.figure(figsize=(13, 7), facecolor='w')
    ax = fig.add_subplot(121, projection='3d')
    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c='b', s=30, marker='o', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('Raw data', fontsize=18)
    ax = fig.add_subplot(122, projection='3d')
    # Match the estimated components to the true ones by distance between means
    order = pairwise_distances_argmin([mu1_fact, mu2_fact], [mu1, mu2], metric='euclidean')
    print(order)
    if order[0] == 0:
        c1 = tau1 > tau2
    else:
        c1 = tau1 < tau2
    c2 = ~c1
    acc = np.mean(y == c1)
    print('Accuracy: %.2f%%' % (100*acc))
    ax.scatter(data[c1, 0], data[c1, 1], data[c1, 2], c='r', s=30, marker='o', depthshade=True)
    ax.scatter(data[c2, 0], data[c2, 1], data[c2, 2], c='g', s=30, marker='^', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('EM classification', fontsize=18)
    plt.suptitle('An implementation of the EM algorithm', fontsize=21)
    plt.subplots_adjust(top=0.90)
    plt.tight_layout()
    plt.show()
  • np.vstack: stacks matrices vertically.

    np.hstack: stacks matrices horizontally.

  • np.identity(m): creates an m-by-m identity matrix. (A short demo of all three follows.)
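A quick demonstration (the matrices are arbitrary):

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.vstack((a, b)).shape)  # (4, 2): stacked vertically
print(np.hstack((a, b)).shape)  # (2, 4): stacked side by side
print(np.identity(3))           # the 3x3 identity matrix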

Example 2: GMM

The running example is the distribution of male and female heights and weights.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False
# from matplotlib.font_manager import FontProperties
# font_set = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=15)
# fontproperties=font_set

def expand(a, b):
    d = (b - a) * 0.05
    return a - d, b + d

if __name__ == '__main__':
    data = np.loadtxt('HeightWeight.csv', dtype=float, delimiter=',', skiprows=1)
    y, x = np.split(data, [1, ], axis=1)
    x, x_test, y, y_test = train_test_split(x, y, train_size=0.6, random_state=0)
    gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
    x_min = np.min(x, axis=0)
    x_max = np.max(x, axis=0)
    gmm.fit(x)
    print('Means = \n', gmm.means_)
    print('Covariances = \n', gmm.covariances_)
    y_hat = gmm.predict(x)
    y_test_hat = gmm.predict(x_test)
    acc = np.mean(y_hat.ravel() == y.ravel())
    acc_test = np.mean(y_test_hat.ravel() == y_test.ravel())
    acc_str = 'Training accuracy: %.2f%%' % (acc * 100)
    acc_test_str = 'Test accuracy: %.2f%%' % (acc_test * 100)
    print(acc_str)
    print(acc_test_str)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g'])
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.figure(figsize=(9, 7), facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=50, c=y.ravel(), marker='o', cmap=cm_dark, edgecolors='k')
    plt.scatter(x_test[:, 0], x_test[:, 1], s=60, c=y_test.ravel(), marker='^', cmap=cm_dark, edgecolors='k')
    p = gmm.predict_proba(grid_test)
    print(p)
    p = p[:, 0].reshape(x1.shape)
    CS = plt.contour(x1, x2, p, levels=(0.1, 0.5, 0.8), colors=list('rgb'), linewidths=2)
    plt.clabel(CS, fontsize=15, fmt='%.1f', inline=True)
    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    xx = 0.9*ax1_min + 0.1*ax1_max
    yy = 0.1*ax2_min + 0.9*ax2_max
    plt.text(xx, yy, acc_str, fontsize=18)
    yy = 0.15*ax2_min + 0.85*ax2_max
    plt.text(xx, yy, acc_test_str, fontsize=18)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.xlabel('Height (cm)', fontsize='large')
    plt.ylabel('Weight (kg)', fontsize='large')
    plt.title('GMM parameters estimated by EM', fontsize=20)
    plt.grid()
    plt.show()

Code notes:

  • np.ravel(): flattens an array to one dimension, without copying the underlying data unless it has to.

  • predict_proba:

    predict: after training, returns the predicted result as a label value.

    predict_proba: returns an n-row, k-column array in which the value at row i, column j is the model's predicted probability that sample i belongs to class j; each row sums to 1. (A short sketch follows.)
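A minimal sketch of the difference on a toy two-component mixture (the data are made up):

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
# Two well-separated blobs
X = np.vstack((np.random.randn(50, 2), np.random.randn(50, 2) + 5))
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict(X[:3]))         # hard labels, e.g. [0 0 0] or [1 1 1]
proba = gmm.predict_proba(X[:3])  # shape (3, 2): one probability per component
print(proba.sum(axis=1))          # every row sums to 1: [1. 1. 1.]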

Naive Bayes

Example 1

We use Gaussian naive Bayes to classify the iris data; the code itself is not difficult.

Note: this uses a Pipeline operation: standardization first, then polynomial feature expansion, then Gaussian naive Bayes.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

filePath = '/home/johnny/PycharmProjects/pythonProject1/Machine_Learning/Data/iris.data'

if __name__ == "__main__":
    data = pd.read_csv(filePath, header=None)
    x, y = data[np.arange(4)], data[4]
    y = pd.Categorical(values=y).codes
    feature_names = 'sepal length', 'sepal width', 'petal length', 'petal width'
    features = [0, 1]
    x = x[features]
    x, x_test, y, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
    priors = np.array((1, 2, 4), dtype=float)
    priors /= priors.sum()
    gnb = Pipeline([
        ('sc', StandardScaler()),
        ('poly', PolynomialFeatures(degree=1)),
        ('clf', GaussianNB(priors=priors))])  # the iris data is balanced, so setting priors is not really necessary
    gnb.fit(x, y.ravel())
    y_hat = gnb.predict(x)
    print('Training accuracy: %.2f%%' % (100 * accuracy_score(y, y_hat)))
    y_test_hat = gnb.predict(x_test)
    print('Test accuracy: %.2f%%' % (100 * accuracy_score(y_test, y_test_hat)))
    # Plot
    N, M = 500, 500  # number of sample points along each axis
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
    x_grid = np.stack((x1.flat, x2.flat), axis=1)  # points to classify
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_grid_hat = gnb.predict(x_grid)  # predictions on the grid
    y_grid_hat = y_grid_hat.reshape(x1.shape)
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_grid_hat, cmap=cm_light)  # show the decision regions
    plt.scatter(x[features[0]], x[features[1]], c=y, edgecolors='k', s=50, cmap=cm_dark)
    plt.scatter(x_test[features[0]], x_test[features[1]], c=y_test, marker='^', edgecolors='k', s=120, cmap=cm_dark)
    plt.xlabel(feature_names[features[0]], fontsize=13)
    plt.ylabel(feature_names[features[1]], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('GaussianNB for Iris', fontsize=18)
    plt.grid(True)
    plt.show()

LDA

This uses gensim, which needs to be installed; a plain pip install gensim will do.

from gensim import corpora, models, similarities
from pprint import pprint

path = './LDA_test.txt'

if __name__ == '__main__':
    f = open(path)
    stop_list = set('for a of the and to in'.split())
    print('After')
    texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
    print('Text = ')
    print(texts)
    # Build a dictionary of the distinct words in texts (the words end up in lexicographic order)
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vector for every document
    print("corpus", corpus)
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    corpus_tfidf = corpus  # note: this overwrites the TF-IDF corpus with the raw counts
    print('TF-IDF:')
    for c in corpus_tfidf:
        print(c)
    print('\nLDA Model:')
    num_topics = 2
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha='auto', eta='auto', minimum_probability=0.001, passes=10)
    doc_topic = [doc_t for doc_t in lda[corpus_tfidf]]
    print('Document-Topic:\n')
    pprint(doc_topic)
    for doc_topic in lda.get_document_topics(corpus_tfidf):
        print(doc_topic)
    for topic_id in range(num_topics):
        print('Topic', topic_id)
        pprint(lda.show_topic(topic_id))
    similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
    print('Similarity:')
    pprint(list(similarity))
    hda = models.HdpModel(corpus_tfidf, id2word=dictionary)
    topic_result = [a for a in hda[corpus_tfidf]]
    print('\n\nUSE WITH CARE--\nHDA Model:')
    pprint(topic_result)
    print('HDA Topics:')
    print(hda.print_topics(num_topics=2, num_words=5))

Code notes:

  • doc2bow

    Counts the number of occurrences of each distinct word, converts each word to its integer word id, and returns the result as a sparse vector in the format (word_id, count). Which id belongs to which word can be looked up in the dictionary (see the sketch below).
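A minimal sketch (the toy documents are made up):

from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['computer', 'system', 'human', 'human']]
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)           # the word -> integer id mapping
print(dictionary.doc2bow(texts[1]))  # sparse vector of (word_id, count) pairs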

  • similarities.MatrixSimilarity

    You only find out what this does by jumping into its source (Ctrl+B in PyCharm): it computes the cosine similarity of documents in a corpus. Cosine similarity judges how similar two vectors are by the cosine of the angle between them (a hand-rolled version is sketched below). From the gensim docstring:

    Compute cosine similarity against a corpus of documents by storing the index matrix in memory. Unless the entire matrix fits into main memory, use :class:`~gensim.similarities.docsim.Similarity` instead.
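A hand-rolled cosine similarity for reference (a sketch of the formula, not gensim's implementation):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal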

Reposted from: https://blog.csdn.net/johnny_love_1968/article/details/116524053