
Scorecard Model

# -*- coding: utf-8 -*-
import pandas as pd
import datetime
import collections
import numpy as np
import numbers
import random
import sys
import pickle
from itertools import combinations
from sklearn.linear_model import LinearRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from importlib import reload
from matplotlib import pyplot as plt

Initial setup

Read the data files

folderOfData = './datasets/score/'
data1 = pd.read_csv(folderOfData+'PPD_LogInfo_3_1_Training_Set.csv', header = 0)
data2 = pd.read_csv(folderOfData+'PPD_Training_Master_GBK_3_1_Training_Set.csv', header = 0,encoding = 'gbk')
data3 = pd.read_csv(folderOfData+'PPD_Userupdate_Info_3_1_Training_Set.csv', header = 0)
data1.head()
Idx Listinginfo1 LogInfo1 LogInfo2 LogInfo3
0 10001 2014-03-05 107 6 2014-02-20
1 10001 2014-03-05 107 6 2014-02-23
2 10001 2014-03-05 107 6 2014-02-24
3 10001 2014-03-05 107 6 2014-02-25
4 10001 2014-03-05 107 6 2014-02-27

Login records of the credit customers; the fields are described below:

| Field | Meaning |
|:--|:--|
| Idx | Unique identifier of the user |
| LogInfo3 | Login date |
| LogInfo2 | Login event code |
data2.head().T
0 1 2 3 4
Idx 10001 10002 10003 10006 10007
UserInfo_1 1 1 1 4 5
UserInfo_2 深圳 温州 宜昌 南平 辽阳
UserInfo_3 4 4 3 1 1
UserInfo_4 深圳 温州 宜昌 南平 辽阳
WeblogInfo_1 NaN NaN NaN NaN NaN
WeblogInfo_2 1 0 0 NaN 0
WeblogInfo_3 NaN NaN NaN NaN NaN
WeblogInfo_4 1 1 2 NaN 1
WeblogInfo_5 1 1 2 NaN 1
WeblogInfo_6 1 1 2 NaN 1
WeblogInfo_7 14 14 9 2 3
WeblogInfo_8 0 0 3 0 0
WeblogInfo_9 0 0 0 0 0
WeblogInfo_10 0 0 0 0 0
WeblogInfo_11 0 0 0 0 0
WeblogInfo_12 0 0 0 0 0
WeblogInfo_13 0 0 0 0 0
WeblogInfo_14 6 0 0 0 0
WeblogInfo_15 6 0 0 0 0
WeblogInfo_16 0 7 3 0 0
WeblogInfo_17 6 7 4 2 3
WeblogInfo_18 2 0 2 0 0
UserInfo_5 2 2 2 2 2
UserInfo_6 2 2 2 2 2
UserInfo_7 广东 浙江 湖北 福建 辽宁
UserInfo_8 深圳 温州 宜昌 南平 辽阳
UserInfo_9 中国移动 中国移动 中国电信 中国移动 中国移动
UserInfo_10 0 1 0 0 0
UserInfo_11 NaN 0 0 0 NaN
ThirdParty_Info_Period7_7 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_8 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_9 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_10 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_11 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_12 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_13 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_14 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_15 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_16 -1 -1 -1 -1 -1
ThirdParty_Info_Period7_17 -1 -1 -1 -1 -1
SocialNetwork_1 0 0 0 0 0
SocialNetwork_2 0 0 0 0 0
SocialNetwork_3 -1 -1 -1 -1 -1
SocialNetwork_4 -1 -1 -1 -1 -1
SocialNetwork_5 -1 -1 -1 -1 -1
SocialNetwork_6 -1 -1 -1 -1 -1
SocialNetwork_7 -1 -1 -1 -1 -1
SocialNetwork_8 126 33 -1 -1 -1
SocialNetwork_9 234 110 -1 -1 -1
SocialNetwork_10 222 1 -1 -1 -1
SocialNetwork_11 -1 -1 -1 -1 -1
SocialNetwork_12 0 0 -1 -1 -1
SocialNetwork_13 0 0 1 0 0
SocialNetwork_14 0 0 0 0 0
SocialNetwork_15 0 0 0 0 0
SocialNetwork_16 0 0 0 0 0
SocialNetwork_17 1 2 0 0 0
target 0 0 0 0 0
ListingInfo 2014/3/5 2014/2/26 2014/2/28 2014/2/25 2014/2/27

228 rows × 5 columns


Application information submitted by the credit customers on the platform, some third-party data, and the target variable to be predicted.

| Field | Meaning |
|:--|:--|
| Idx | Unique identifier of the user |
| Target | Target variable; 1 = default, 0 = non-default |
| ThirdParty_Info_Period* | Third-party data; apart from -1, all values are non-negative integers, and -1 is probably a special (sentinel) value |
| ListingInfo | Loan listing (origination) date |
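Since -1 in the ThirdParty columns looks like a "no data" sentinel rather than a real measurement, a quick check is to look at how often -1 occurs in each column. A minimal sketch, assuming data2 is loaded as above:

# share of -1 values in each ThirdParty_Info_* column; -1 is assumed to be a
# sentinel for "missing third-party data" rather than an actual measurement
third_party_cols = [c for c in data2.columns if c.startswith('ThirdParty_Info_')]
print((data2[third_party_cols] == -1).mean().sort_values(ascending=False).head(10))

Columns where -1 dominates would typically be treated as a separate bin during later binning rather than as ordinary numeric values.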
data3.head()
Idx ListingInfo1 UserupdateInfo1 UserupdateInfo2
0 10001 2014/03/05 _EducationId 2014/02/20
1 10001 2014/03/05 _HasBuyCar 2014/02/20
2 10001 2014/03/05 _LastUpdateDate 2014/02/20
3 10001 2014/03/05 _MarriageStatusId 2014/02/20
4 10001 2014/03/05 _MobilePhone 2014/02/20

Information-update records for a subset of customers:

| Field | Meaning |
|:--|:--|
| Idx | Unique identifier of the user |
| UserupdateInfo1 | The field whose value was updated |
| UserupdateInfo2 | Update date |

Feature construction: deriving features from the data

Whether the user's location fields are consistent; the result is stored in the city_match field.

# 1 if the four city fields agree, else 0; then drop the raw city columns
data2['city_match'] = data2.apply(lambda x: int(x.UserInfo_2 == x.UserInfo_4 == x.UserInfo_8 == x.UserInfo_20), axis=1)
del data2['UserInfo_2']
del data2['UserInfo_4']
del data2['UserInfo_8']
del data2['UserInfo_20']
#data2['city_match']

Extract the listing date, compute the gap in days between login and listing (stored in the ListingGap field), and inspect the distribution of the gaps.

## login date
data1['logInfo'] = data1['LogInfo3'].map(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
## loan listing (origination) date
data1['Listinginfo'] = data1['Listinginfo1'].map(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
data1['ListingGap'] = data1[['logInfo', 'Listinginfo']].apply(lambda x: (x[1] - x[0]).days, axis=1)

plt.hist(data1['ListingGap'], bins=200)
plt.title('Days between login date and listing date')
## cap the gap at 365 days so the long tail does not dominate the histogram
ListingGap2 = data1['ListingGap'].map(lambda x: min(x, 365))
plt.hist(ListingGap2, bins=200)
(plt.hist also returns the 200 bin counts, the bin edges from 0 to 365, and the list of patches; the raw printout is omitted here.)

(figure: histogram of the number of days between login date and listing date)

The number of days between each login date and the loan listing date is computed above. The histogram shows that the vast majority of gaps fall within 180 days, so 180 days is taken as the largest time window for building the new features.
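As a quick numerical check of this claim, the share of login records within 180 days of the listing date can be computed directly from the ListingGap column built above (a one-line sketch):

# fraction of login records whose gap to the listing date is at most 180 days
print((data1['ListingGap'] <= 180).mean())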

The usable time windows are 7, 30, 60, 90, 120, 150 and 180 days. Within each window we compute the total number of logins, the number of distinct login types, and the average number of logins per type.

def TimeWindowSelection(df, daysCol, time_windows):
    '''
    :param df: the dataset containing the days variable
    :param daysCol: the column of days
    :param time_windows: the list of time windows
    :return: dict mapping each time window to the number of records falling inside it
    '''
    freq_tw = {}
    for tw in time_windows:
        freq = sum(df[daysCol].apply(lambda x: int(x <= tw)))
        freq_tw[tw] = freq
    return freq_tw

def DeivdedByZero(nominator, denominator):
    '''
    Return 0 when the denominator is 0; otherwise return the normal quotient.
    '''
    if denominator == 0:
        return 0
    else:
        return nominator * 1.0 / denominator

timeWindows = TimeWindowSelection(data1, 'ListingGap', range(30, 361, 30))

time_window = [7, 30, 60, 90, 120, 150, 180]
var_list = ['LogInfo1', 'LogInfo2']
data1GroupbyIdx = pd.DataFrame({'Idx': data1['Idx'].drop_duplicates()})

for tw in time_window:
    data1['TruncatedLogInfo'] = data1['Listinginfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data1.loc[data1['logInfo'] >= data1['TruncatedLogInfo']]
    for var in var_list:
        # count the frequencies of LogInfo1 and LogInfo2
        count_stats = temp.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_count'] = data1GroupbyIdx['Idx'].map(lambda x: count_stats.get(x, 0))

        # count the distinct values of LogInfo1 and LogInfo2
        Idx_UserupdateInfo1 = temp[['Idx', var]].drop_duplicates()
        uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_unique'] = data1GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x, 0))

        # calculate the average count of each value in LogInfo1 and LogInfo2
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_avg_count'] = data1GroupbyIdx[[str(var) + '_' + str(tw) + '_count', str(var) + '_' + str(tw) + '_unique']].\
            apply(lambda x: DeivdedByZero(x[0], x[1]), axis=1)


data3['ListingInfo'] = data3['ListingInfo1'].map(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d'))
data3['UserupdateInfo'] = data3['UserupdateInfo2'].map(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d'))
data3['ListingGap'] = data3[['UserupdateInfo', 'ListingInfo']].apply(lambda x: (x[1] - x[0]).days, axis=1)
collections.Counter(data3['ListingGap'])
hist_ListingGap = np.histogram(data3['ListingGap'])
hist_ListingGap = pd.DataFrame({'Freq': hist_ListingGap[0], 'gap': hist_ListingGap[1][1:]})
hist_ListingGap['CumFreq'] = hist_ListingGap['Freq'].cumsum()
hist_ListingGap['CumPercent'] = hist_ListingGap['CumFreq'].map(lambda x: x * 1.0 / hist_ListingGap.iloc[-1]['CumFreq'])

Unify field names that refer to the same information

def ChangeContent(x):
    y = x.upper()
    if y == '_MOBILEPHONE':
        y = '_PHONE'
    return y

data3['UserupdateInfo1'] = data3['UserupdateInfo1'].map(ChangeContent)
data3GroupbyIdx = pd.DataFrame({'Idx': data3['Idx'].drop_duplicates()})

Data consistency handling

Semantic consistency of field names is usually resolved manually:
unify QQ and qQ, Idnumber and idNumber, MOBILEPHONE and PHONE.
Within each time slice, compute
(1) the frequency of updates
(2) the number of distinct item types updated
(3) whether important items such as IDNUMBER, HASBUYCAR, MARRIAGESTATUSID and PHONE were updated

time_window = [7, 30, 60, 90, 120, 150, 180]
for tw in time_window:
    data3['TruncatedLogInfo'] = data3['ListingInfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data3.loc[data3['UserupdateInfo'] >= data3['TruncatedLogInfo']]

    # frequency of updating
    freq_stats = temp.groupby(['Idx'])['UserupdateInfo1'].count().to_dict()
    data3GroupbyIdx['UserupdateInfo_' + str(tw) + '_freq'] = data3GroupbyIdx['Idx'].map(lambda x: freq_stats.get(x, 0))

    # number of updated item types
    Idx_UserupdateInfo1 = temp[['Idx', 'UserupdateInfo1']].drop_duplicates()
    uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])['UserupdateInfo1'].count().to_dict()
    data3GroupbyIdx['UserupdateInfo_' + str(tw) + '_unique'] = data3GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x, 0))

    # average count per updated item type (0 when there was no update in the window)
    data3GroupbyIdx['UserupdateInfo_' + str(tw) + '_avg_count'] = data3GroupbyIdx[['UserupdateInfo_' + str(tw) + '_freq', 'UserupdateInfo_' + str(tw) + '_unique']]. \
        apply(lambda x: DeivdedByZero(x[0], x[1]), axis=1)

    # whether the applicant changed items like IDNUMBER, HASBUYCAR, MARRIAGESTATUSID, PHONE
    Idx_UserupdateInfo1['UserupdateInfo1'] = Idx_UserupdateInfo1['UserupdateInfo1'].map(lambda x: [x])
    Idx_UserupdateInfo1_V2 = Idx_UserupdateInfo1.groupby(['Idx'])['UserupdateInfo1'].sum()
    for item in ['_IDNUMBER', '_HASBUYCAR', '_MARRIAGESTATUSID', '_PHONE']:
        item_dict = Idx_UserupdateInfo1_V2.map(lambda x: int(item in x)).to_dict()
        data3GroupbyIdx['UserupdateInfo_' + str(tw) + str(item)] = data3GroupbyIdx['Idx'].map(lambda x: item_dict.get(x, 0))

# join the three tables on Idx
allData = pd.concat([data2.set_index('Idx'), data3GroupbyIdx.set_index('Idx'), data1GroupbyIdx.set_index('Idx')], axis=1)
allData.to_csv(folderOfData + 'allData_0.csv', encoding='gbk')

Missing value handling

If the missing share of a field is very high, drop the field (or the record);
if the missing share is moderate, either impute the gaps or treat missing as a special value.
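Before the per-type handling below, a quick overview of per-column missing rates helps decide which rule applies. A small sketch, assuming allData from the merge above (or re-read from allData_0.csv):

# overall missing rate per column, highest first (only columns with any missing values)
missing_rate = allData.isnull().mean().sort_values(ascending=False)
print(missing_rate[missing_rate > 0].head(20))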

allData = pd.read_csv(folderOfData+'allData_0.csv',header = 0,encoding = 'gbk')
allFeatures = list(allData.columns)
allFeatures
['Idx',
 'UserInfo_1',
 'UserInfo_3',
 'WeblogInfo_1',
 'WeblogInfo_2',
 'WeblogInfo_3',
 'WeblogInfo_4',
 'WeblogInfo_5',
 'WeblogInfo_6',
 'WeblogInfo_7',
 'WeblogInfo_8',
 'WeblogInfo_9',
 'WeblogInfo_10',
 'WeblogInfo_11',
 'WeblogInfo_12',
 'WeblogInfo_13',
 'WeblogInfo_14',
 'WeblogInfo_15',
 'WeblogInfo_16',
 'WeblogInfo_17',
 'WeblogInfo_18',
 'UserInfo_5',
 'UserInfo_6',
 'UserInfo_7',
 'UserInfo_9',
 'UserInfo_10',
 'UserInfo_11',
 'UserInfo_12',
 'UserInfo_13',
 'UserInfo_14',
 'UserInfo_15',
 'UserInfo_16',
 'UserInfo_17',
 'UserInfo_18',
 'UserInfo_19',
 'UserInfo_21',
 'UserInfo_22',
 'UserInfo_23',
 'UserInfo_24',
 'Education_Info1',
 'Education_Info2',
 'Education_Info3',
 'Education_Info4',
 'Education_Info5',
 'Education_Info6',
 'Education_Info7',
 'Education_Info8',
 'WeblogInfo_19',
 'WeblogInfo_20',
 'WeblogInfo_21',
 'WeblogInfo_23',
 'WeblogInfo_24',
 'WeblogInfo_25',
 'WeblogInfo_26',
 'WeblogInfo_27',
 'WeblogInfo_28',
 'WeblogInfo_29',
 'WeblogInfo_30',
 'WeblogInfo_31',
 'WeblogInfo_32',
 'WeblogInfo_33',
 'WeblogInfo_34',
 'WeblogInfo_35',
 'WeblogInfo_36',
 'WeblogInfo_37',
 'WeblogInfo_38',
 'WeblogInfo_39',
 'WeblogInfo_40',
 'WeblogInfo_41',
 'WeblogInfo_42',
 'WeblogInfo_43',
 'WeblogInfo_44',
 'WeblogInfo_45',
 'WeblogInfo_46',
 'WeblogInfo_47',
 'WeblogInfo_48',
 'WeblogInfo_49',
 'WeblogInfo_50',
 'WeblogInfo_51',
 'WeblogInfo_52',
 'WeblogInfo_53',
 'WeblogInfo_54',
 'WeblogInfo_55',
 'WeblogInfo_56',
 'WeblogInfo_57',
 'WeblogInfo_58',
 'ThirdParty_Info_Period1_1',
 'ThirdParty_Info_Period1_2',
 'ThirdParty_Info_Period1_3',
 'ThirdParty_Info_Period1_4',
 'ThirdParty_Info_Period1_5',
 'ThirdParty_Info_Period1_6',
 'ThirdParty_Info_Period1_7',
 'ThirdParty_Info_Period1_8',
 'ThirdParty_Info_Period1_9',
 'ThirdParty_Info_Period1_10',
 'ThirdParty_Info_Period1_11',
 'ThirdParty_Info_Period1_12',
 'ThirdParty_Info_Period1_13',
 'ThirdParty_Info_Period1_14',
 'ThirdParty_Info_Period1_15',
 'ThirdParty_Info_Period1_16',
 'ThirdParty_Info_Period1_17',
 'ThirdParty_Info_Period2_1',
 'ThirdParty_Info_Period2_2',
 'ThirdParty_Info_Period2_3',
 'ThirdParty_Info_Period2_4',
 'ThirdParty_Info_Period2_5',
 'ThirdParty_Info_Period2_6',
 'ThirdParty_Info_Period2_7',
 'ThirdParty_Info_Period2_8',
 'ThirdParty_Info_Period2_9',
 'ThirdParty_Info_Period2_10',
 'ThirdParty_Info_Period2_11',
 'ThirdParty_Info_Period2_12',
 'ThirdParty_Info_Period2_13',
 'ThirdParty_Info_Period2_14',
 'ThirdParty_Info_Period2_15',
 'ThirdParty_Info_Period2_16',
 'ThirdParty_Info_Period2_17',
 'ThirdParty_Info_Period3_1',
 'ThirdParty_Info_Period3_2',
 'ThirdParty_Info_Period3_3',
 'ThirdParty_Info_Period3_4',
 'ThirdParty_Info_Period3_5',
 'ThirdParty_Info_Period3_6',
 'ThirdParty_Info_Period3_7',
 'ThirdParty_Info_Period3_8',
 'ThirdParty_Info_Period3_9',
 'ThirdParty_Info_Period3_10',
 'ThirdParty_Info_Period3_11',
 'ThirdParty_Info_Period3_12',
 'ThirdParty_Info_Period3_13',
 'ThirdParty_Info_Period3_14',
 'ThirdParty_Info_Period3_15',
 'ThirdParty_Info_Period3_16',
 'ThirdParty_Info_Period3_17',
 'ThirdParty_Info_Period4_1',
 'ThirdParty_Info_Period4_2',
 'ThirdParty_Info_Period4_3',
 'ThirdParty_Info_Period4_4',
 'ThirdParty_Info_Period4_5',
 'ThirdParty_Info_Period4_6',
 'ThirdParty_Info_Period4_7',
 'ThirdParty_Info_Period4_8',
 'ThirdParty_Info_Period4_9',
 'ThirdParty_Info_Period4_10',
 'ThirdParty_Info_Period4_11',
 'ThirdParty_Info_Period4_12',
 'ThirdParty_Info_Period4_13',
 'ThirdParty_Info_Period4_14',
 'ThirdParty_Info_Period4_15',
 'ThirdParty_Info_Period4_16',
 'ThirdParty_Info_Period4_17',
 'ThirdParty_Info_Period5_1',
 'ThirdParty_Info_Period5_2',
 'ThirdParty_Info_Period5_3',
 'ThirdParty_Info_Period5_4',
 'ThirdParty_Info_Period5_5',
 'ThirdParty_Info_Period5_6',
 'ThirdParty_Info_Period5_7',
 'ThirdParty_Info_Period5_8',
 'ThirdParty_Info_Period5_9',
 'ThirdParty_Info_Period5_10',
 'ThirdParty_Info_Period5_11',
 'ThirdParty_Info_Period5_12',
 'ThirdParty_Info_Period5_13',
 'ThirdParty_Info_Period5_14',
 'ThirdParty_Info_Period5_15',
 'ThirdParty_Info_Period5_16',
 'ThirdParty_Info_Period5_17',
 'ThirdParty_Info_Period6_1',
 'ThirdParty_Info_Period6_2',
 'ThirdParty_Info_Period6_3',
 'ThirdParty_Info_Period6_4',
 'ThirdParty_Info_Period6_5',
 'ThirdParty_Info_Period6_6',
 'ThirdParty_Info_Period6_7',
 'ThirdParty_Info_Period6_8',
 'ThirdParty_Info_Period6_9',
 'ThirdParty_Info_Period6_10',
 'ThirdParty_Info_Period6_11',
 'ThirdParty_Info_Period6_12',
 'ThirdParty_Info_Period6_13',
 'ThirdParty_Info_Period6_14',
 'ThirdParty_Info_Period6_15',
 'ThirdParty_Info_Period6_16',
 'ThirdParty_Info_Period6_17',
 'ThirdParty_Info_Period7_1',
 'ThirdParty_Info_Period7_2',
 'ThirdParty_Info_Period7_3',
 'ThirdParty_Info_Period7_4',
 'ThirdParty_Info_Period7_5',
 'ThirdParty_Info_Period7_6',
 'ThirdParty_Info_Period7_7',
 'ThirdParty_Info_Period7_8',
 'ThirdParty_Info_Period7_9',
 'ThirdParty_Info_Period7_10',
 'ThirdParty_Info_Period7_11',
 'ThirdParty_Info_Period7_12',
 'ThirdParty_Info_Period7_13',
 'ThirdParty_Info_Period7_14',
 'ThirdParty_Info_Period7_15',
 'ThirdParty_Info_Period7_16',
 'ThirdParty_Info_Period7_17',
 'SocialNetwork_1',
 'SocialNetwork_2',
 'SocialNetwork_3',
 'SocialNetwork_4',
 'SocialNetwork_5',
 'SocialNetwork_6',
 'SocialNetwork_7',
 'SocialNetwork_8',
 'SocialNetwork_9',
 'SocialNetwork_10',
 'SocialNetwork_11',
 'SocialNetwork_12',
 'SocialNetwork_13',
 'SocialNetwork_14',
 'SocialNetwork_15',
 'SocialNetwork_16',
 'SocialNetwork_17',
 'target',
 'ListingInfo',
 'city_match',
 'UserupdateInfo_7_freq',
 'UserupdateInfo_7_unique',
 'UserupdateInfo_7_avg_count',
 'UserupdateInfo_7_IDNUMBER',
 'UserupdateInfo_7_HASBUYCAR',
 'UserupdateInfo_7_MARRIAGESTATUSID',
 'UserupdateInfo_7_PHONE',
 'UserupdateInfo_30_freq',
 'UserupdateInfo_30_unique',
 'UserupdateInfo_30_avg_count',
 'UserupdateInfo_30_IDNUMBER',
 'UserupdateInfo_30_HASBUYCAR',
 'UserupdateInfo_30_MARRIAGESTATUSID',
 'UserupdateInfo_30_PHONE',
 'UserupdateInfo_60_freq',
 'UserupdateInfo_60_unique',
 'UserupdateInfo_60_avg_count',
 'UserupdateInfo_60_IDNUMBER',
 'UserupdateInfo_60_HASBUYCAR',
 'UserupdateInfo_60_MARRIAGESTATUSID',
 'UserupdateInfo_60_PHONE',
 'UserupdateInfo_90_freq',
 'UserupdateInfo_90_unique',
 'UserupdateInfo_90_avg_count',
 'UserupdateInfo_90_IDNUMBER',
 'UserupdateInfo_90_HASBUYCAR',
 'UserupdateInfo_90_MARRIAGESTATUSID',
 'UserupdateInfo_90_PHONE',
 'UserupdateInfo_120_freq',
 'UserupdateInfo_120_unique',
 'UserupdateInfo_120_avg_count',
 'UserupdateInfo_120_IDNUMBER',
 'UserupdateInfo_120_HASBUYCAR',
 'UserupdateInfo_120_MARRIAGESTATUSID',
 'UserupdateInfo_120_PHONE',
 'UserupdateInfo_150_freq',
 'UserupdateInfo_150_unique',
 'UserupdateInfo_150_avg_count',
 'UserupdateInfo_150_IDNUMBER',
 'UserupdateInfo_150_HASBUYCAR',
 'UserupdateInfo_150_MARRIAGESTATUSID',
 'UserupdateInfo_150_PHONE',
 'UserupdateInfo_180_freq',
 'UserupdateInfo_180_unique',
 'UserupdateInfo_180_avg_count',
 'UserupdateInfo_180_IDNUMBER',
 'UserupdateInfo_180_HASBUYCAR',
 'UserupdateInfo_180_MARRIAGESTATUSID',
 'UserupdateInfo_180_PHONE',
 'LogInfo1_7_count',
 'LogInfo1_7_unique',
 'LogInfo1_7_avg_count',
 'LogInfo2_7_count',
 'LogInfo2_7_unique',
 'LogInfo2_7_avg_count',
 'LogInfo1_30_count',
 'LogInfo1_30_unique',
 'LogInfo1_30_avg_count',
 'LogInfo2_30_count',
 'LogInfo2_30_unique',
 'LogInfo2_30_avg_count',
 'LogInfo1_60_count',
 'LogInfo1_60_unique',
 'LogInfo1_60_avg_count',
 'LogInfo2_60_count',
 'LogInfo2_60_unique',
 'LogInfo2_60_avg_count',
 'LogInfo1_90_count',
 'LogInfo1_90_unique',
 'LogInfo1_90_avg_count',
 'LogInfo2_90_count',
 'LogInfo2_90_unique',
 'LogInfo2_90_avg_count',
 'LogInfo1_120_count',
 'LogInfo1_120_unique',
 'LogInfo1_120_avg_count',
 'LogInfo2_120_count',
 'LogInfo2_120_unique',
 'LogInfo2_120_avg_count',
 'LogInfo1_150_count',
 'LogInfo1_150_unique',
 'LogInfo1_150_avg_count',
 'LogInfo2_150_count',
 'LogInfo2_150_unique',
 'LogInfo2_150_avg_count',
 'LogInfo1_180_count',
 'LogInfo1_180_unique',
 'LogInfo1_180_avg_count',
 'LogInfo2_180_count',
 'LogInfo2_180_unique',
 'LogInfo2_180_avg_count']
allFeatures.remove('target')
if 'Idx' in allFeatures:
    allFeatures.remove('Idx')
allFeatures.remove('ListingInfo')

# check for constant variables, and classify the remaining ones as categorical or numerical
numerical_var = []
for col in list(allFeatures):   # iterate over a copy so that removals are safe
    if len(set(allData[col])) == 1:
        print('delete {} from the dataset because it is a constant'.format(col))
        del allData[col]
        allFeatures.remove(col)
    else:
        uniq_valid_vals = [i for i in allData[col] if i == i]   # drop NaN (NaN != NaN)
        uniq_valid_vals = list(set(uniq_valid_vals))
        if len(uniq_valid_vals) >= 10 and isinstance(uniq_valid_vals[0], numbers.Real):
            numerical_var.append(col)

categorical_var = [i for i in allFeatures if i not in numerical_var]
delete WeblogInfo_10 from the dataset because it is a constant

Handling value concentration

Value concentration is a common issue when developing credit risk models: within a variable, a single value may account for the overwhelming majority of all samples. For example, applicants with a bachelor's degree might make up 90% of a training set. Fields with such extreme concentration should be kept or dropped according to how the risk differs between the majority value and the minority values:
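The core of this check can be sketched compactly with value_counts; this is an equivalent shortcut for the first loop in the full code below, not a replacement for it:

# share of the single most frequent value in each candidate variable;
# value_counts() sorts descending, so .iloc[0] is the count of the dominant value
records_count = allData.shape[0]
concentration = {col: allData[col].value_counts().iloc[0] / records_count
                 for col in allFeatures}
highly_concentrated = [col for col, p in concentration.items() if p >= 0.9]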

# for each variable, find the most frequent value and its share of all records
records_count = allData.shape[0]
col_most_values, col_large_value = {}, {}
for col in allFeatures:
    value_count = allData[col].groupby(allData[col]).count()
    col_most_values[col] = max(value_count) / records_count
    large_value = value_count[value_count == max(value_count)].index[0]
    col_large_value[col] = large_value
col_most_values_df = pd.DataFrame.from_dict(col_most_values, orient='index')
col_most_values_df.columns = ['max percent']
col_most_values_df = col_most_values_df.sort_values(by='max percent', ascending=False)
pcnt = list(col_most_values_df[:500]['max percent'])
vars_sorted = list(col_most_values_df[:500].index)
plt.bar(range(len(pcnt)), height=pcnt)
plt.title('Largest Percentage of Single Value in Each Variable')

# for fields where the majority value covers 90% or more of the samples, check whether
# the minority values have a significantly higher bad rate than the majority value
large_percent_cols = list(col_most_values_df[col_most_values_df['max percent'] >= 0.9].index)
bad_rate_diff = {}
for col in large_percent_cols:
    large_value = col_large_value[col]
    temp = allData[[col, 'target']].copy()
    temp[col] = temp.apply(lambda x: int(x[col] == large_value), axis=1)
    bad_rate = temp.groupby(col).mean()
    if bad_rate.iloc[0]['target'] == 0:
        bad_rate_diff[col] = 0
        continue
    bad_rate_diff[col] = np.log(bad_rate.iloc[0]['target'] / bad_rate.iloc[1]['target'])
bad_rate_diff_sorted = sorted(bad_rate_diff.items(), key=lambda x: x[1], reverse=True)
bad_rate_diff_sorted_values = [x[1] for x in bad_rate_diff_sorted]
plt.bar(x=range(len(bad_rate_diff_sorted_values)), height=bad_rate_diff_sorted_values)

# since none of the minority values show a significantly higher bad rate than the
# majority value, these variables can simply be dropped
for col in large_percent_cols:
    if col in numerical_var:
        numerical_var.remove(col)
    else:
        categorical_var.remove(col)
    del allData[col]


'''
For categorical variables: delete the field if more than 80% is missing,
otherwise treat missing as a special category.
'''

def MissingCategorial(df, x):
    missing_vals = df[x].map(lambda x: int(x != x))
    return sum(missing_vals) * 1.0 / df.shape[0]

missing_pcnt_threshould_1 = 0.8
for col in list(categorical_var):   # iterate over a copy so that removals are safe
    missingRate = MissingCategorial(allData, col)
    print('{0} has missing rate as {1}'.format(col, missingRate))
    if missingRate > missing_pcnt_threshould_1:
        categorical_var.remove(col)
        del allData[col]
    if 0 < missingRate < missing_pcnt_threshould_1:
        # convert NaN to the string 'NAN', so that missing becomes an ordinary category
        allData[col] = allData[col].map(lambda x: str(x).upper())

allData_bk = allData.copy()
'''
For numerical variables: delete the field if more than 80% is missing,
otherwise fill the gaps by random sampling from the non-missing values.
'''

def MissingContinuous(df, x):
    missing_vals = df[x].map(lambda x: int(np.isnan(x)))
    return sum(missing_vals) * 1.0 / df.shape[0]

def MakeupRandom(x, sampledList):
    if x == x:
        return x
    else:
        randIndex = random.randint(0, len(sampledList) - 1)
        return sampledList[randIndex]

missing_pcnt_threshould_2 = 0.8
deleted_var = []
for col in numerical_var:
    missingRate = MissingContinuous(allData, col)
    print('{0} has missing rate as {1}'.format(col, missingRate))
    if missingRate > missing_pcnt_threshould_2:
        deleted_var.append(col)
        print('we delete variable {} because of its high missing rate'.format(col))
    else:
        if missingRate > 0:
            # fill missing entries with values randomly sampled from the observed ones
            not_missing = allData.loc[allData[col] == allData[col]][col]
            missing_position = allData.loc[allData[col] != allData[col]][col].index
            not_missing_sample = random.sample(list(not_missing), len(missing_position))
            allData.loc[missing_position, col] = not_missing_sample
            missingRate2 = MissingContinuous(allData, col)
            print('missing rate after making up is:{}'.format(str(missingRate2)))

if deleted_var != []:
    for col in deleted_var:
        numerical_var.remove(col)
        del allData[col]


allData.to_csv(folderOfData + 'allData_1.csv', header=True, encoding='gbk', columns=allData.columns, index=False)

UserInfo_1 has missing rate as 0.0002
UserInfo_3 has missing rate as 0.00023333333333333333
WeblogInfo_2 has missing rate as 0.055266666666666665
UserInfo_5 has missing rate as 0.0
UserInfo_6 has missing rate as 0.0
UserInfo_7 has missing rate as 0.0
UserInfo_9 has missing rate as 0.0
UserInfo_10 has missing rate as 0.0
UserInfo_11 has missing rate as 0.6303
UserInfo_12 has missing rate as 0.6303
UserInfo_13 has missing rate as 0.6303
UserInfo_14 has missing rate as 0.0
UserInfo_15 has missing rate as 0.0
UserInfo_16 has missing rate as 0.0
UserInfo_17 has missing rate as 0.0
UserInfo_19 has missing rate as 0.0
WeblogInfo_19 has missing rate as 0.09876666666666667
WeblogInfo_20 has missing rate as 0.2683333333333333
WeblogInfo_21 has missing rate as 0.10246666666666666
WeblogInfo_30 has missing rate as 0.008433333333333333
SocialNetwork_12 has missing rate as 0.0
SocialNetwork_13 has missing rate as 0.0
SocialNetwork_17 has missing rate as 0.0
WeblogInfo_1 has missing rate as 0.9676666666666667
we delete variable WeblogInfo_1 because of its high missing rate
WeblogInfo_3 has missing rate as 0.9676666666666667
we delete variable WeblogInfo_3 because of its high missing rate
WeblogInfo_4 has missing rate as 0.05503333333333333
missing rate after making up is:0.0
WeblogInfo_5 has missing rate as 0.05503333333333333
missing rate after making up is:0.0
WeblogInfo_6 has missing rate as 0.05503333333333333
missing rate after making up is:0.0
WeblogInfo_7 has missing rate as 0.0
WeblogInfo_8 has missing rate as 0.0
WeblogInfo_15 has missing rate as 0.0
WeblogInfo_16 has missing rate as 0.0
WeblogInfo_17 has missing rate as 0.0
WeblogInfo_18 has missing rate as 0.0
UserInfo_18 has missing rate as 0.0
WeblogInfo_24 has missing rate as 0.008433333333333333
missing rate after making up is:0.0
WeblogInfo_27 has missing rate as 0.008433333333333333
missing rate after making up is:0.0
WeblogInfo_33 has missing rate as 0.008433333333333333
missing rate after making up is:0.0
WeblogInfo_36 has missing rate as 0.008433333333333333
missing rate after making up is:0.0
ThirdParty_Info_Period1_1 has missing rate as 0.0
ThirdParty_Info_Period1_2 has missing rate as 0.0
ThirdParty_Info_Period1_3 has missing rate as 0.0
ThirdParty_Info_Period1_4 has missing rate as 0.0
ThirdParty_Info_Period1_5 has missing rate as 0.0
ThirdParty_Info_Period1_6 has missing rate as 0.0
ThirdParty_Info_Period1_7 has missing rate as 0.0
ThirdParty_Info_Period1_8 has missing rate as 0.0
ThirdParty_Info_Period1_9 has missing rate as 0.0
ThirdParty_Info_Period1_10 has missing rate as 0.0
ThirdParty_Info_Period1_11 has missing rate as 0.0
ThirdParty_Info_Period1_12 has missing rate as 0.0
ThirdParty_Info_Period1_13 has missing rate as 0.0
ThirdParty_Info_Period1_14 has missing rate as 0.0
ThirdParty_Info_Period1_15 has missing rate as 0.0
ThirdParty_Info_Period1_16 has missing rate as 0.0
ThirdParty_Info_Period1_17 has missing rate as 0.0
ThirdParty_Info_Period2_1 has missing rate as 0.0
ThirdParty_Info_Period2_2 has missing rate as 0.0
ThirdParty_Info_Period2_3 has missing rate as 0.0
ThirdParty_Info_Period2_4 has missing rate as 0.0
ThirdParty_Info_Period2_5 has missing rate as 0.0
ThirdParty_Info_Period2_6 has missing rate as 0.0
ThirdParty_Info_Period2_7 has missing rate as 0.0
ThirdParty_Info_Period2_8 has missing rate as 0.0
ThirdParty_Info_Period2_9 has missing rate as 0.0
ThirdParty_Info_Period2_10 has missing rate as 0.0
ThirdParty_Info_Period2_11 has missing rate as 0.0
ThirdParty_Info_Period2_12 has missing rate as 0.0
ThirdParty_Info_Period2_13 has missing rate as 0.0
ThirdParty_Info_Period2_14 has missing rate as 0.0
ThirdParty_Info_Period2_15 has missing rate as 0.0
ThirdParty_Info_Period2_16 has missing rate as 0.0
ThirdParty_Info_Period2_17 has missing rate as 0.0
ThirdParty_Info_Period3_1 has missing rate as 0.0
ThirdParty_Info_Period3_2 has missing rate as 0.0
ThirdParty_Info_Period3_3 has missing rate as 0.0
ThirdParty_Info_Period3_4 has missing rate as 0.0
ThirdParty_Info_Period3_5 has missing rate as 0.0
ThirdParty_Info_Period3_6 has missing rate as 0.0
ThirdParty_Info_Period3_7 has missing rate as 0.0
ThirdParty_Info_Period3_8 has missing rate as 0.0
ThirdParty_Info_Period3_9 has missing rate as 0.0
ThirdParty_Info_Period3_10 has missing rate as 0.0
ThirdParty_Info_Period3_11 has missing rate as 0.0
ThirdParty_Info_Period3_12 has missing rate as 0.0
ThirdParty_Info_Period3_13 has missing rate as 0.0
ThirdParty_Info_Period3_14 has missing rate as 0.0
ThirdParty_Info_Period3_15 has missing rate as 0.0
ThirdParty_Info_Period3_16 has missing rate as 0.0
ThirdParty_Info_Period3_17 has missing rate as 0.0
ThirdParty_Info_Period4_1 has missing rate as 0.0
ThirdParty_Info_Period4_2 has missing rate as 0.0
ThirdParty_Info_Period4_3 has missing rate as 0.0
ThirdParty_Info_Period4_4 has missing rate as 0.0
ThirdParty_Info_Period4_5 has missing rate as 0.0
ThirdParty_Info_Period4_6 has missing rate as 0.0
ThirdParty_Info_Period4_7 has missing rate as 0.0
ThirdParty_Info_Period4_8 has missing rate as 0.0
ThirdParty_Info_Period4_9 has missing rate as 0.0
ThirdParty_Info_Period4_10 has missing rate as 0.0
ThirdParty_Info_Period4_11 has missing rate as 0.0
ThirdParty_Info_Period4_12 has missing rate as 0.0
ThirdParty_Info_Period4_13 has missing rate as 0.0
ThirdParty_Info_Period4_14 has missing rate as 0.0
ThirdParty_Info_Period4_15 has missing rate as 0.0
ThirdParty_Info_Period4_16 has missing rate as 0.0
ThirdParty_Info_Period4_17 has missing rate as 0.0
ThirdParty_Info_Period5_1 has missing rate as 0.0
ThirdParty_Info_Period5_2 has missing rate as 0.0
ThirdParty_Info_Period5_3 has missing rate as 0.0
ThirdParty_Info_Period5_4 has missing rate as 0.0
ThirdParty_Info_Period5_5 has missing rate as 0.0
ThirdParty_Info_Period5_6 has missing rate as 0.0
ThirdParty_Info_Period5_7 has missing rate as 0.0
ThirdParty_Info_Period5_8 has missing rate as 0.0
ThirdParty_Info_Period5_9 has missing rate as 0.0
ThirdParty_Info_Period5_10 has missing rate as 0.0
ThirdParty_Info_Period5_11 has missing rate as 0.0
ThirdParty_Info_Period5_12 has missing rate as 0.0
ThirdParty_Info_Period5_13 has missing rate as 0.0
ThirdParty_Info_Period5_14 has missing rate as 0.0
ThirdParty_Info_Period5_15 has missing rate as 0.0
ThirdParty_Info_Period5_16 has missing rate as 0.0
ThirdParty_Info_Period5_17 has missing rate as 0.0
ThirdParty_Info_Period6_1 has missing rate as 0.0
ThirdParty_Info_Period6_2 has missing rate as 0.0
ThirdParty_Info_Period6_3 has missing rate as 0.0
ThirdParty_Info_Period6_4 has missing rate as 0.0
ThirdParty_Info_Period6_5 has missing rate as 0.0
ThirdParty_Info_Period6_6 has missing rate as 0.0
ThirdParty_Info_Period6_7 has missing rate as 0.0
ThirdParty_Info_Period6_8 has missing rate as 0.0
ThirdParty_Info_Period6_9 has missing rate as 0.0
ThirdParty_Info_Period6_10 has missing rate as 0.0
ThirdParty_Info_Period6_11 has missing rate as 0.0
ThirdParty_Info_Period6_12 has missing rate as 0.0
ThirdParty_Info_Period6_13 has missing rate as 0.0
ThirdParty_Info_Period6_14 has missing rate as 0.0
ThirdParty_Info_Period6_15 has missing rate as 0.0
ThirdParty_Info_Period6_16 has missing rate as 0.0
ThirdParty_Info_Period6_17 has missing rate as 0.0
SocialNetwork_8 has missing rate as 0.0
SocialNetwork_9 has missing rate as 0.0
SocialNetwork_10 has missing rate as 0.0
UserupdateInfo_7_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_7_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_7_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_7_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_7_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_7_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_7_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_30_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_60_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_90_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_120_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_150_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_freq has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_unique has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_avg_count has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_IDNUMBER has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_HASBUYCAR has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_MARRIAGESTATUSID has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
UserupdateInfo_180_PHONE has missing rate as 0.00016666666666666666
missing rate after making up is:0.0
LogInfo1_7_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_7_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_7_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_7_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_7_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_7_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_30_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_30_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_30_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_30_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_30_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_30_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_60_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_60_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_60_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_60_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_60_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_60_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_90_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_90_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_90_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_90_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_90_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_90_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_120_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_120_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_120_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_120_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_120_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_120_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_150_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_150_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_150_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_150_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_150_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_150_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_180_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_180_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo1_180_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_180_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_180_unique has missing rate as 0.03376666666666667
missing rate after making up is:0.0
LogInfo2_180_avg_count has missing rate as 0.03376666666666667
missing rate after making up is:0.0

(figure: bar chart of the largest single-value percentage in each variable)
