Pandas多维特征数据预处理及sklearn数据不均衡处理相关技术实践-大数据ML样本集案例实战...-白红宇的个人博客

1 机器学习调优步骤（第一行不平衡问题处理）

2 Pandas多维特征数据预处理

数据初始化展示

import pandas as pd    取出第一行  loans_2007 = pd.read_csv('C:\\ML\\MLData\\filtered_loans_2007.csv', skiprows=1)  print(len(loans_2007))  half_count = len(loans_2007) / 2  loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)  #loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)  loans_2007.to_csv('loans_2007.csv', index=False)    loans_2007.head(3)复制代码

显示第0行数据

import pandas as pd  loans_2007 = pd.read_csv("loans_2007.csv")  #loans_2007.drop_duplicates()  print(loans_2007.iloc[0])  print(loans_2007.shape[1])    id                                1077501  member_id                      1.2966e+06  loan_amnt                            5000  funded_amnt                          5000  funded_amnt_inv                      4975  term                            36 months  int_rate                           10.65%  installment                        162.87  grade                                   B  sub_grade                              B2  emp_title                             NaN  emp_length                      10+ years  home_ownership                       RENT  annual_inc                          24000  verification_status              Verified  issue_d                          Dec-2011  loan_status                    Fully Paid  pymnt_plan                              n  purpose                       credit_card  title                            Computer  zip_code                            860xx  addr_state                             AZ  dti                                 27.65  delinq_2yrs                             0  earliest_cr_line                 Jan-1985  inq_last_6mths                          1  open_acc                                3  pub_rec                                 0  revol_bal                           13648  revol_util                          83.7%  total_acc                               9  initial_list_status                     f  out_prncp                               0  out_prncp_inv                           0  total_pymnt                       5863.16  total_pymnt_inv                   5833.84  total_rec_prncp                      5000  total_rec_int                      863.16  total_rec_late_fee                      0  recoveries                              0  collection_recovery_fee                 0  last_pymnt_d                     Jan-2015  last_pymnt_amnt                    171.62  last_credit_pull_d               Nov-2016  collections_12_mths_ex_med              0  policy_code                             1  application_type               INDIVIDUAL  acc_now_delinq                          0  chargeoff_within_12_mths                0  delinq_amnt                             0  pub_rec_bankruptcies                    0  tax_liens                               0  Name: 0, dtype: object  52复制代码

删除无意义列

loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)  loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)  loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)  print(loans_2007.iloc[0])  print(loans_2007.shape[1])复制代码

查看预测值的状态类型

print(loans_2007['loan_status'].value_counts())    Fully Paid                                             33902  Charged Off                                             5658  Does not meet the credit policy. Status:Fully Paid      1988  Does not meet the credit policy. Status:Charged Off      761  Current                                                  201  Late (31-120 days)                                        10  In Grace Period                                            9  Late (16-30 days)                                          5  Default                                                    1  Name: loan_status, dtype: int64复制代码

根据贷款状态，舍弃部分不清晰结论，给出明确分类0和1，进行替换

loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]    status_replace = {      "loan_status" : {          "Fully Paid": 1,          "Charged Off": 0,      }  }    loans_2007 = loans_2007.replace(status_replace)复制代码

去除每一列值都相同的列

#let's look for any columns that contain only one unique value and remove them    orig_columns = loans_2007.columns  drop_columns = []  for col in orig_columns:      col_series = loans_2007[col].dropna().unique()      if len(col_series) == 1:          drop_columns.append(col)  loans_2007 = loans_2007.drop(drop_columns, axis=1)  print(drop_columns)  print loans_2007.shape  loans_2007.to_csv('filtered_loans_2007.csv', index=False)  ['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']  (39560, 24)复制代码

空值处理

import pandas as pd  loans = pd.read_csv('filtered_loans_2007.csv')  null_counts = loans.isnull().sum()  print(null_counts)    loan_amnt                 0  term                      0  int_rate                  0  installment               0  emp_length                0  home_ownership            0  annual_inc                0  verification_status       0  loan_status               0  pymnt_plan                0  purpose                   0  title                    10  addr_state                0  dti                       0  delinq_2yrs               0  earliest_cr_line          0  inq_last_6mths            0  open_acc                  0  pub_rec                   0  revol_bal                 0  revol_util               50  total_acc                 0  last_credit_pull_d        2  pub_rec_bankruptcies    697  dtype: int64    loans = loans.drop("pub_rec_bankruptcies", axis=1)  loans = loans.dropna(axis=0)复制代码

String类型分布

print(loans.dtypes.value_counts())  object     12  float64    10  int64       1  dtype: int64    object_columns_df = loans.select_dtypes(include=["object"])  print(object_columns_df.iloc[0])    term                     36 months  int_rate                    10.65%  emp_length               10+ years  home_ownership                RENT  verification_status       Verified  pymnt_plan                       n  purpose                credit_card  title                     Computer  addr_state                      AZ  earliest_cr_line          Jan-1985  revol_util                   83.7%  last_credit_pull_d        Nov-2016  Name: 0, dtype: object    cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']  for c in cols:      print(loans[c].value_counts())        RENT        18780  MORTGAGE    17574  OWN          3045  OTHER          96  NONE            3  Name: home_ownership, dtype: int64    Not Verified       16856  Verified           12705  Source Verified     9937  Name: verification_status, dtype: int64    10+ years    8821  < 1 year     4563  2 years      4371  3 years      4074  4 years      3409  5 years      3270  1 year       3227  6 years      2212  7 years      1756  8 years      1472  9 years      1254  n/a          1069  Name: emp_length, dtype: int64     36 months    29041   60 months    10457  Name: term, dtype: int64    CA    7070  NY    3788  FL    2856  TX    2714  NJ    1838  IL    1517  PA    1504  VA    1400  GA    1393  MA    1336  OH    1208  MD    1049  AZ     874  WA     834  CO     786  NC     780  CT     747  MI     722  MO     682  MN     611  NV     492  SC     470  WI     453  AL     446  OR     445  LA     435  KY     325  OK     298  KS     269  UT     256  AR     243  DC     211  RI     198  NM     188  WV     176  HI     172  NH     172  DE     113  MT      84  WY      83  AK      79  SD      63  VT      54  MS      19  TN      17  IN       9  ID       6  IA       5  NE       5  ME       3  Name: addr_state, dtype: int64   复制代码

String类型分布2

print(loans["purpose"].value_counts())print(loans["title"].value_counts())debt_consolidation    18533credit_card            5099other                  3963home_improvement       2965major_purchase         2181small_business         1815car                    1544wedding                 945medical                 692moving                  581vacation                379house                   378educational             320renewable_energy        103Name: purpose, dtype: int64Debt Consolidation                         2168Debt Consolidation Loan                    1706Personal Loan                               658Consolidation                               509debt consolidation                          502Credit Card Consolidation                   356Home Improvement                            354Debt consolidation                          333Small Business Loan                         322Credit Card Loan                            313Personal                                    308Consolidation Loan                          255Home Improvement Loan                       246personal loan                               234personal                                    220Loan                                        212Wedding Loan                                209consolidation                               200Car Loan                                    200Other Loan                                  190Credit Card Payoff                          155Wedding                                     152Major Purchase Loan                         144Credit Card Refinance                       143Consolidate                                 127Medical                                     122Credit Card                                 117home improvement                            111My Loan                                      94Credit Cards                                 93                                           ... DebtConsolidationn                            1 Freedom                                      1Credit Card Consolidation Loan - SEG          1SOLAR PV                                      1Pay on Credit card                            1To pay off balloon payments due               1Paying off the debt                           1Payoff ING PLOC                               1Josh CC Loan                                  1House payoff                                  1Taking care of Business                       1Gluten Free Bakery in ideal town for it       1Startup Money for Small Business              1FundToFinanceCar                              1getting ready for Baby                        1Dougs Wedding Loan                            1d rock                                        1LC Loan 2                                     1swimming pool repair                          1engagement                                    1Cut the credit cards Loan                     1vinman                                        1working hard to get out of debt               1consolidate the rest of my debt               1Medical/Vacation                              12BDebtFree                                    1Paying Off High Interest Credit Cards!        1Baby on the way!                              1cart loan                                     1Consolidaton                                  1Name: title, dtype: int64复制代码

类型转换

mapping_dict = {      "emp_length": {          "10+ years": 10,          "9 years": 9,          "8 years": 8,          "7 years": 7,          "6 years": 6,          "5 years": 5,          "4 years": 4,          "3 years": 3,          "2 years": 2,          "1 year": 1,          "< 1 year": 0,          "n/a": 0      }  }  loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)  loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")  loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")  loans = loans.replace(mapping_dict)复制代码

独热编码

cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]  dummy_df = pd.get_dummies(loans[cat_columns])  loans = pd.concat([loans, dummy_df], axis=1)  loans = loans.drop(cat_columns, axis=1)  loans = loans.drop("pymnt_plan", axis=1)复制代码

查看转换类型

import pandas as pd  loans = pd.read_csv("cleaned_loans2007.csv")  print(loans.info())    
      
         RangeIndex: 39498 entries, 0 to 39497  Data columns (total 37 columns):  loan_amnt                              39498 non-null float64  int_rate                               39498 non-null float64  installment                            39498 non-null float64  annual_inc                             39498 non-null float64  loan_status                            39498 non-null int64  dti                                    39498 non-null float64  delinq_2yrs                            39498 non-null float64  inq_last_6mths                         39498 non-null float64  open_acc                               39498 non-null float64  pub_rec                                39498 non-null float64  revol_bal                              39498 non-null float64  revol_util                             39498 non-null float64  total_acc                              39498 non-null float64  home_ownership_MORTGAGE                39498 non-null int64  home_ownership_NONE                    39498 non-null int64  home_ownership_OTHER                   39498 non-null int64  home_ownership_OWN                     39498 non-null int64  home_ownership_RENT                    39498 non-null int64  verification_status_Not Verified       39498 non-null int64  verification_status_Source Verified    39498 non-null int64  verification_status_Verified           39498 non-null int64  purpose_car                            39498 non-null int64  purpose_credit_card                    39498 non-null int64  purpose_debt_consolidation             39498 non-null int64  purpose_educational                    39498 non-null int64  purpose_home_improvement               39498 non-null int64  purpose_house                          39498 non-null int64  purpose_major_purchase                 39498 non-null int64  purpose_medical                        39498 non-null int64  purpose_moving                         39498 non-null int64  purpose_other                          39498 non-null int64  purpose_renewable_energy               39498 non-null int64  purpose_small_business                 39498 non-null int64  purpose_vacation                       39498 non-null int64  purpose_wedding                        39498 non-null int64  term_ 36 months                        39498 non-null int64  term_ 60 months                        39498 non-null int64  dtypes: float64(12), int64(25)  memory usage: 11.1 MB复制代码

3 准确率理论及不均衡处理

初始定义

import pandas as pd  # False positives.  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)  fp = len(predictions[fp_filter])    # True positives.  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)  tp = len(predictions[tp_filter])    # False negatives.  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)  fn = len(predictions[fn_filter])    # True negatives  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)  tn = len(predictions[tn_filter])复制代码

逻辑回归不处理不均衡

from sklearn.linear_model import LogisticRegression      lr = LogisticRegression()      cols = loans.columns      train_cols = cols.drop("loan_status")      features = loans[train_cols]      target = loans["loan_status"]      lr.fit(features, target)      predictions = lr.predict(features)            from sklearn.linear_model import LogisticRegression      from sklearn.cross_validation import cross_val_predict, KFold      lr = LogisticRegression()      kf = KFold(features.shape[0], random_state=1)      predictions = cross_val_predict(lr, features, target, cv=kf)      predictions = pd.Series(predictions)            # False positives.      fp_filter = (predictions == 1) & (loans["loan_status"] == 0)      fp = len(predictions[fp_filter])            # True positives.      tp_filter = (predictions == 1) & (loans["loan_status"] == 1)      tp = len(predictions[tp_filter])            # False negatives.      fn_filter = (predictions == 0) & (loans["loan_status"] == 1)      fn = len(predictions[fn_filter])            # True negatives      tn_filter = (predictions == 0) & (loans["loan_status"] == 0)      tn = len(predictions[tn_filter])            # Rates      tpr = tp / float((tp + fn))      fpr = fp / float((fp + tn))            print(tpr)      print(fpr)      print predictions[:20]            0.999084438406      0.998049299521      0     1      1     1      2     1      3     1      4     1      5     1      6     1      7     1      8     1      9     1      10    1      11    1      12    1      13    1      14    1      15    1      16    1      17    1      18    1      19    1      dtype: int64复制代码

逻辑回归balanced处理不均衡

from sklearn.linear_model import LogisticRegression  from sklearn.cross_validation import cross_val_predict  lr = LogisticRegression(class_weight="balanced")  kf = KFold(features.shape[0], random_state=1)  predictions = cross_val_predict(lr, features, target, cv=kf)  predictions = pd.Series(predictions)    # False positives.  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)  fp = len(predictions[fp_filter])    # True positives.  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)  tp = len(predictions[tp_filter])    # False negatives.  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)  fn = len(predictions[fn_filter])    # True negatives  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)  tn = len(predictions[tn_filter])    # Rates  tpr = tp / float((tp + fn))  fpr = fp / float((fp + tn))    print(tpr)  print(fpr)  print predictions[:20]    0.670781771464  0.400780280192  0     1  1     0  2     0  3     1  4     1  5     0  6     0  7     0  8     0  9     0  10    1  11    0  12    1  13    1  14    0  15    0  16    1  17    1  18    1  19    0  dtype: int64复制代码

逻辑回归penalty处理不均衡

from sklearn.linear_model import LogisticRegression  from sklearn.cross_validation import cross_val_predict  penalty = {      0: 5,      1: 1  }    lr = LogisticRegression(class_weight=penalty)  kf = KFold(features.shape[0], random_state=1)  predictions = cross_val_predict(lr, features, target, cv=kf)  predictions = pd.Series(predictions)    # False positives.  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)  fp = len(predictions[fp_filter])    # True positives.  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)  tp = len(predictions[tp_filter])    # False negatives.  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)  fn = len(predictions[fn_filter])    # True negatives  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)  tn = len(predictions[tn_filter])    # Rates  tpr = tp / float((tp + fn))  fpr = fp / float((fp + tn))    print(tpr)  print(fpr)  0.731799521545  0.478985635751复制代码

随机森林balanced处理不均衡

from sklearn.ensemble import RandomForestClassifier  from sklearn.cross_validation import cross_val_predict  rf = RandomForestClassifier(n_estimators=10,class_weight="balanced", random_state=1)  #print help(RandomForestClassifier)  kf = KFold(features.shape[0], random_state=1)  predictions = cross_val_predict(rf, features, target, cv=kf)  predictions = pd.Series(predictions)    # False positives.  fp_filter = (predictions == 1) & (loans["loan_status"] == 0)  fp = len(predictions[fp_filter])    # True positives.  tp_filter = (predictions == 1) & (loans["loan_status"] == 1)  tp = len(predictions[tp_filter])    # False negatives.  fn_filter = (predictions == 0) & (loans["loan_status"] == 1)  fn = len(predictions[fn_filter])    # True negatives  tn_filter = (predictions == 0) & (loans["loan_status"] == 0)  tn = len(predictions[tn_filter])    # Rates  tpr = tp / float((tp + fn))  fpr = fp / float((fp + tn))复制代码

4 总结

通过本文，对于数据特征工程，具有非常重要的意义。

秦凯新于深圳 20181220

转载地址：https://blog.csdn.net/weixin_34186950/article/details/88003764 如侵犯您的版权，请留言回复原文章的地址，我们会给您删除此文章，给您带来不便请您谅解！

1 机器学习调优步骤（第一行不平衡问题处理）

2 Pandas多维特征数据预处理

3 准确率理论及不均衡处理

4 总结

发表评论

最新留言

关于作者

推荐文章