Kaggle Graduate Admissions (Part 1) - Bai Hongyu's personal blog


Published: 2021-07-01 02:16:42. Category: Technical articles


Browsing Kaggle every day.


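This appears to be a very well-known dataset. It contains the following columns:

GRE Score (290 to 340)

TOEFL Score (92 to 120)

University Rating (1 to 5)

Statement of Purpose, SOP (1 to 5)

Letter of Recommendation strength, LOR (1 to 5)

Undergraduate CGPA (6.8 to 9.92)

Research experience (0 or 1)

Chance of Admit (0.34 to 0.97)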

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import sys
import os

df = pd.read_csv("../input/Admission_Predict.csv",sep = ",")
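Some copies of this CSV reportedly ship with trailing spaces in a few header names (for example `LOR ` and `Chance of Admit `). That is an assumption about the file, not something this post states. A minimal sketch, using a small made-up frame in place of the real file, of trimming the headers so that later lookups such as df["Chance of Admit"] resolve:

```python
import pandas as pd

# Hypothetical stand-in for the CSV; the trailing spaces in two of the
# header names are an assumption about the file, and the values are made up.
raw = pd.DataFrame({
    "Serial No.": [1, 2],
    "GRE Score": [337, 324],
    "LOR ": [4.5, 4.5],
    "Chance of Admit ": [0.92, 0.76],
})


def strip_column_names(frame):
    """Return a copy of the frame with whitespace trimmed from column names."""
    cleaned = frame.copy()
    cleaned.columns = cleaned.columns.str.strip()
    return cleaned


df_clean = strip_column_names(raw)
print(list(df_clean.columns))
```

If the headers are already clean, the helper leaves them unchanged, so it should be harmless to apply either way.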


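The three most important features for admission to a master's program: CGPA, GRE score, and TOEFL score.

The three least important features for admission to a master's program: Research, LOR, and SOP.

Correlation matrix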

fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(), ax=ax, annot=True, linewidths=0.05, fmt='.2f', cmap="magma")
plt.show()
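The feature ranking stated above can also be read off numerically rather than from the heatmap colors. The following is a minimal sketch that sorts features by their correlation with the target; the five rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real values come from the CSV.
sample = pd.DataFrame({
    "CGPA": [7.5, 8.0, 8.5, 9.0, 9.5],
    "GRE Score": [300, 312, 318, 325, 338],
    "Research": [0, 1, 0, 1, 1],
    "Chance of Admit": [0.5, 0.6, 0.7, 0.8, 0.9],
})

# Take the target's column of the correlation matrix, drop the target's
# correlation with itself, and sort from strongest to weakest.
ranking = (
    sample.corr()["Chance of Admit"]
    .drop("Chance of Admit")
    .sort_values(ascending=False)
)
print(ranking.round(3))
```

On the real data, the same expression applied to df should give the ordering that the heatmap shows in its "Chance of Admit" row.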


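However, most candidates in the data have research experience.

Research is therefore a relatively unimportant feature for the chance of admission.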

print("Not Having Research:", len(df[df.Research == 0]))
print("Having Research:", len(df[df.Research == 1]))
y = np.array([len(df[df.Research == 0]), len(df[df.Research == 1])])
x = ["Not Having Research", "Having Research"]
plt.bar(x, y)
plt.title("Research Experience")
plt.xlabel("Candidates")
plt.ylabel("Frequency")
plt.show()
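The two counts above can also be obtained in one call with `value_counts`, which additionally gives the share of each group. A small sketch on made-up flags (the real column comes from the CSV):

```python
import pandas as pd

# Made-up 0/1 flags for illustration; not the real Research column.
research = pd.Series([1, 0, 1, 1, 0, 1], name="Research")

# Absolute counts per value, ordered by the value (0 first, then 1).
counts = research.value_counts().sort_index()

# Fractions instead of counts, which makes "most candidates" checkable.
shares = research.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.round(2).to_dict())
```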


In the data, the lowest TOEFL score is 92 and the highest is 120. The average is 107.41.

y = np.array([df["TOEFL Score"].min(), df["TOEFL Score"].mean(), df["TOEFL Score"].max()])
x = ["Worst", "Average", "Best"]
plt.bar(x, y)
plt.title("TOEFL Scores")
plt.xlabel("Level")
plt.ylabel("TOEFL Score")
plt.show()


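GRE scores:

This histogram shows the frequency of GRE scores.

Scores are densest between 310 and 330. Scoring above this range is a good way for a candidate to stand out.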

df["GRE Score"].plot(kind='hist', bins=200, figsize=(6, 6))
plt.title("GRE Scores")
plt.xlabel("GRE Score")
plt.ylabel("Frequency")
plt.show()


CGPA scores by university rating:

As the university rating goes up, CGPA scores tend to go up as well.

plt.scatter(df["University Rating"], df.CGPA)
plt.title("CGPA Scores for University Ratings")
plt.xlabel("University Rating")
plt.ylabel("CGPA")
plt.show()
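Because University Rating takes only five values, the scatter plot stacks points in vertical strips, and the trend can be easier to confirm with a per-rating mean. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "University Rating": [1, 1, 2, 3, 3, 4, 5, 5],
    "CGPA": [7.4, 7.8, 8.0, 8.3, 8.5, 8.9, 9.3, 9.5],
})

# Average CGPA within each rating level, indexed by rating 1..5.
mean_cgpa = sample.groupby("University Rating")["CGPA"].mean()
print(mean_cgpa)

# True when each rating's mean CGPA is at least the previous rating's.
print(mean_cgpa.is_monotonic_increasing)
```

Running the same groupby on df would show whether the claim holds for every rating level or only on average.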


Candidates with high GRE scores usually have high CGPA scores as well.

plt.scatter(df["GRE Score"], df.CGPA)
plt.title("CGPA for GRE Scores")
plt.xlabel("GRE Score")
plt.ylabel("CGPA")
plt.show()


df[df.CGPA >= 8.5].plot(kind='scatter', x='GRE Score', y='TOEFL Score', color="red")
plt.xlabel("GRE Score")
plt.ylabel("TOEFL SCORE")
plt.title("CGPA>=8.5")
plt.grid(True)
plt.show()


Candidates who graduated from good universities are more likely to be admitted.

s = df[df["Chance of Admit"] >= 0.75]["University Rating"].value_counts().head(5)
plt.title("University Ratings of Candidates with a 75% acceptance chance")
s.plot(kind='bar', figsize=(20, 10))
plt.xlabel("University Rating")
plt.ylabel("Candidates")
plt.show()


Candidates with high CGPA scores usually have high SOP scores as well.

plt.scatter(df["CGPA"], df.SOP)
plt.xlabel("CGPA")
plt.ylabel("SOP")
plt.title("SOP for CGPA")
plt.show()


Candidates with high GRE scores usually have high SOP scores as well.

plt.scatter(df["GRE Score"], df["SOP"])
plt.xlabel("GRE Score")
plt.ylabel("SOP")
plt.title("SOP for GRE Score")
plt.show()


That completes the data analysis. Next comes model training.

Drop the serial number in the first column.

# reading the dataset
df = pd.read_csv("../input/Admission_Predict.csv", sep=",")
# it may be needed in the future.
serialNo = df["Serial No."].values
df.drop(["Serial No."], axis=1, inplace=True)

y = df["Chance of Admit"].values
x = df.drop(["Chance of Admit"], axis=1)
# separating train (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
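To make the effect of `test_size=0.20` and `random_state=42` concrete, here is a small sketch on synthetic data. The 400-row size is an assumption about the dataset (it is consistent with the 80 test rows indexed later in this post), and the column values are random placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in; 400 rows is an assumed dataset size, values are random.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame({
    "GRE Score": rng.integers(290, 341, size=400),
    "CGPA": rng.uniform(6.8, 9.92, size=400),
})
y_demo = rng.uniform(0.34, 0.97, size=400)

# 20% of the rows are held out for testing, the rest are used for training.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=42
)
print(x_tr.shape, x_te.shape)

# Re-running with the same random_state selects the same rows again.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=42
)
print(x_te.index.equals(x_te2.index))
```

Fixing the seed is what makes the R² scores reported further down reproducible from run to run.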

Scale the features to a fixed range (0 to 1).

# normalization
from sklearn.preprocessing import MinMaxScaler
scalerX = MinMaxScaler(feature_range=(0, 1))
x_train[x_train.columns] = scalerX.fit_transform(x_train[x_train.columns])
x_test[x_test.columns] = scalerX.transform(x_test[x_test.columns])
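Min-max scaling to the range 0 to 1 maps each value to (x - min) / (max - min), where min and max are taken from the training set only. The sketch below reproduces that arithmetic by hand with pandas on made-up GRE values, to show why the scaler is fitted on the training data and then only applied to the test data.

```python
import pandas as pd

# Made-up values for illustration; one feature is enough to show the idea.
train = pd.DataFrame({"GRE Score": [290.0, 315.0, 340.0]})
test = pd.DataFrame({"GRE Score": [300.0, 345.0]})

# Statistics are learned from the training set only.
col_min = train.min()
col_range = train.max() - train.min()

# The same training statistics are applied to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled["GRE Score"].tolist())
print(test_scaled["GRE Score"].tolist())
```

A test value larger than the training maximum (345 here) scales to slightly more than 1. That is expected: the test set must not influence the scaling parameters.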

Linear model

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
y_head_lr = lr.predict(x_test)
print("real value of y_test[1]: " + str(y_test[1]) + " -> the predict: " + str(lr.predict(x_test.iloc[[1],:])))
print("real value of y_test[2]: " + str(y_test[2]) + " -> the predict: " + str(lr.predict(x_test.iloc[[2],:])))
from sklearn.metrics import r2_score
print("r_square score: ", r2_score(y_test, y_head_lr))
y_head_lr_train = lr.predict(x_train)
print("r_square score (train dataset): ", r2_score(y_train, y_head_lr_train))

real value of y_test[1]: 0.68 -> the predict: [0.72368741]

real value of y_test[2]: 0.9 -> the predict: [0.93536809]

r_square score: 0.821208259148699

r_square score (train dataset): 0.7951946003191085
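For reference, the R² score printed above is 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A minimal NumPy sketch of that formula follows; the helper name `r_squared` and the sample values are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only.
y_true_demo = [0.68, 0.90, 0.50, 0.75]
y_pred_demo = [0.72, 0.94, 0.55, 0.70]

print(r_squared(y_true_demo, y_pred_demo))
# Always predicting the mean of the targets scores 0; perfect predictions score 1.
print(r_squared(y_true_demo, [np.mean(y_true_demo)] * 4))
print(r_squared(y_true_demo, y_true_demo))
```

A score near 0.82 on the test set therefore means the model explains roughly 82% of the variance in Chance of Admit among the held-out candidates.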

Random forest

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
y_head_rfr = rfr.predict(x_test)
from sklearn.metrics import r2_score
print("r_square score: ", r2_score(y_test, y_head_rfr))
print("real value of y_test[1]: " + str(y_test[1]) + " -> the predict: " + str(rfr.predict(x_test.iloc[[1],:])))
print("real value of y_test[2]: " + str(y_test[2]) + " -> the predict: " + str(rfr.predict(x_test.iloc[[2],:])))
y_head_rf_train = rfr.predict(x_train)
print("r_square score (train dataset): ", r2_score(y_train, y_head_rf_train))

r_square score: 0.8074111823415694

real value of y_test[1]: 0.68 -> the predict: [0.7249]

real value of y_test[2]: 0.9 -> the predict: [0.9407]

r_square score (train dataset): 0.9634880602889714

Decision tree

from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(x_train, y_train)
y_head_dtr = dtr.predict(x_test)
from sklearn.metrics import r2_score
print("r_square score: ", r2_score(y_test, y_head_dtr))
print("real value of y_test[1]: " + str(y_test[1]) + " -> the predict: " + str(dtr.predict(x_test.iloc[[1],:])))
print("real value of y_test[2]: " + str(y_test[2]) + " -> the predict: " + str(dtr.predict(x_test.iloc[[2],:])))
y_head_dtr_train = dtr.predict(x_train)
print("r_square score (train dataset): ", r2_score(y_train, y_head_dtr_train))

r_square score: 0.6262105228127393

real value of y_test[1]: 0.68 -> the predict: [0.73]

real value of y_test[2]: 0.9 -> the predict: [0.94]

r_square score (train dataset): 1.0

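Linear regression and random forest regression outperform decision tree regression: their test R² scores are about 0.82 and 0.81, against 0.63 for the decision tree. The decision tree's training R² of 1.0 alongside its much lower test score suggests that it overfits the training data.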
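One simple way to compare the three models is to put the train and test R² values side by side and look at the gap between them. The sketch below uses only the numbers printed in this post; a large positive gap is a common sign of overfitting.

```python
# R^2 values copied from the outputs printed earlier in this post.
scores = {
    "LinearRegression": {"train": 0.7951946003191085, "test": 0.821208259148699},
    "RandomForestRegressor": {"train": 0.9634880602889714, "test": 0.8074111823415694},
    "DecisionTreeRegressor": {"train": 1.0, "test": 0.6262105228127393},
}

# Gap between training and test performance for each model.
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}

# Print models from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: train - test = {gap:+.3f}")

# Model with the highest R^2 on the held-out test set.
best_on_test = max(scores, key=lambda name: scores[name]["test"])
print("best on test:", best_on_test)
```

Linear regression is the only model here whose test score is not below its training score, while the unconstrained decision tree shows the widest gap.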

y = np.array([r2_score(y_test, y_head_lr), r2_score(y_test, y_head_rfr), r2_score(y_test, y_head_dtr)])
x = ["LinearRegression", "RandomForestReg.", "DecisionTreeReg."]
plt.bar(x, y)
plt.title("Comparison of Regression Algorithms")
plt.xlabel("Regressor")
plt.ylabel("r2_score")
plt.show()


Visualizing the three algorithms

red = plt.scatter(np.arange(0, 80, 5), y_head_lr[0:80:5], color="red")
green = plt.scatter(np.arange(0, 80, 5), y_head_rfr[0:80:5], color="green")
blue = plt.scatter(np.arange(0, 80, 5), y_head_dtr[0:80:5], color="blue")
black = plt.scatter(np.arange(0, 80, 5), y_test[0:80:5], color="black")
plt.title("Comparison of Regression Algorithms")
plt.xlabel("Index of Candidate")
plt.ylabel("Chance of Admit")
plt.legend((red, green, blue, black), ('LR', 'RFR', 'DTR', 'REAL'))
plt.show()


Reposted from: https://maoli.blog.csdn.net/article/details/91596325
