数据原型

广告特征 adFeature.csv

    aid  advertiserId  campaignId  creativeId  creativeSize  adCategoryId  \
0   177          8203       76104     1500666            59           282   
1  2050         19441      178687      245165            53             1   
2  1716          5552      158101     1080850            35            27   
3   336           370        4833      119845            22            67   
4   671         45705      352827      660519            42            67   

   productId  productType  
0          0            6  
1          0            6  
2        113            9  
3        113            9  
4          0            4

用户特征 user_feature_sample.csv

   Unnamed: 0    LBS  age appIdAction appIdInstall  carrier  \
0           0  950.0    4         NaN          NaN        1   
1           1  803.0    2         NaN          NaN        1   
2           2  927.0    1         NaN          NaN        3   
3           3  486.0    4         NaN          NaN        3   
4           4  112.0    5         NaN          NaN        3   

   consumptionAbility       ct  education  gender    ...     \
0                   2      3 1          7       2    ...      
1                   1      3 1          2       1    ...      
2                   1      3 1          5       1    ...      
3                   1      1 3          7       1    ...      
4                   2  3 1 4 2          6       1    ...      

                                           interest5  \
0  52 100 72 131 116 11 71 12 8 113 28 73 6 132 9...   
1                                                NaN   
2  77 72 80 116 101 13 1 109 8 50 6 42 76 9 46 36...   
3  100 80 92 37 116 13 47 4 71 8 50 28 98 115 6 4...   
4        131 116 13 8 6 132 42 9 59 18 58 64 129 103   

                                  kw1                             kw2  kw3  \
0  664359 276966 734911 103617 562294  11395 79112 115065 77033 36176  NaN   
1  338851 361151 542834 496283 229952     80263 39618 53539 180 38163  NaN   
2  746140 695808 126355 771775 411511  105115 71378 41409 74061 44005  NaN   
3   283399 402245 734509 654027 32061  74010 32918 67882 116802 20957  NaN   
4  562294 157603 589706 657719 495672   62764 97803 89066 55545 74061  NaN   

  marriageStatus   os                    topic1                    topic2  \
0             11    2   9826 105 8525 5488 7281  9708 5553 6745 7477 7150   
1           5 13    1  4391 9140 5669 1348 4388  9401 7724 1380 8890 7153   
2          13 10    1  1502 5488 9826 2187 8088  5005 9154 2756 5612 4209   
3             11  1 2  1619 7342 3064 9213 8525   810 2438 5659 1844 9262   
4             11    1    477 9826 5808 644 2747  5483 2199 5424 1511 7751   

  topic3       uid  
0    NaN  26325489  
1    NaN   1184123  
2    NaN  76072711  
3    NaN  63071413  
4    NaN  81294159

用户特征里出现了空值 NaN，以及用空格分开的一对多的值，在下一段介绍怎么处理

用户-广告关系 train.csv

    aid       uid  label
0   699  78508957     -1
1  1991   3637295     -1
2  1119  19229018     -1
3  2013  79277120     -1
4   692  41528441     -1

特征提取

代码参考的是参考资料里的 baseline 的代码，稍做改动

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from scipy import sparse

user_feature.fillna('-1')
data = pd.merge(train, ad_feature, on='aid', how='left')
data = pd.merge(data, user_feature, on='uid', how='left')
data.dropna()

# 只有数字的数据适合用 one hot
one_hot_feature=['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','marriageStatus','advertiserId','campaignId', 'creativeId','adCategoryId', 'productId', 'productType']
# 这些是关系型的数据，是用空格分开的数字，可以用词向量来处理
vector_feature=['appIdAction','appIdInstall','interest1','interest2','interest3','interest4','interest5','kw1','kw2','kw3','topic1','topic2','topic3']

for f in one_hot_feature:
    data[f] = LabelEncoder().fit_transform(data[f])

train_x=train[['creativeSize']]
test_x=train[['creativeSize']]
for f in one_hot_feature:
    enc.fit(data[f].values.reshape(-1,1))
    train_a = enc.transform(train[f].values.reshape(-1,1))
    train_x = sparse.hstack((train_x, train_a))

cv = CountVectorizer()
for f in vector_feature:
    cv.fit(data[f])
    train_a=cv.trainsform(data[f])
    train_x=sparse.hstack((train_x, train_a))

数字的特征提取

对于纯数字的特征，要先 LabelEncoder().fit_transform() 后再 OneHotEncoder().fit_transform()。

LanbelEncoder 会给值重新赋值，这样可以忽略值的大小。例如

1 2	In [3]: LabelEncoder().fit_transform([111,0.2,3000,42,25]) Out[3]: array([3, 0, 4, 2, 1])

关系的特征提取

用户特征分为三层

第一层是广告和用户的信息
第二层是广告和用户的关系，这里用pd.merge 合并成一个二维数组
第三层是用户和其他信息的关系，是字符串格式，内容是用空格隔开的数字，被当成单词后会按词向量处理，即忽略顺序。

第三层处理完后用 sparse.hstack 压成一个二维数组，就成了最终训练用的特征。

特殊的关系特征

用户特征里面有一类

1	['os','ct']

这类值是用空格隔开的数字，但是数字只有1位。例子里用 LabelEncoder，会被当成整个字符串，显然不合适。用 CountVectorizer 则会报

1	ValueError: empty vocabulary; perhaps the documents only contain stop words

我的做法是去掉空格（不去也一样）后用 CountVectorizer(analyzer=’char’)

关系的数量

关系特征里面还有一项是数量，比如 interest1，有些人只有一个，有些人有很多，更高分的选手把这项也计算出来作为特征。

参考资料：
2018腾讯广告算法大赛baseline：https://github.com/YouChouNoBB/2018-tencent-ad-competition-baseline