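WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This article applies association rule mining, a machine learning technique, to IC electronic product data. It covers:
- Data preprocessing: deduplication, missing-value handling, time field processing, and user age binning
- Word clouds: user preferences across brands (brand) and product categories (category_code)
- Association rule mining: associations mined by sex and by brand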
Basic data information
Importing the data
In [1]:
import pandas as pd
import numpy as np
# Show all columns
# pd.set_option('display.max_columns', None)
# Show all rows
# pd.set_option('display.max_rows', None)
# Set the display width of values to 100 (default is 50)
# pd.set_option('max_colwidth', 100)
import time
import os
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Configure a Chinese font and render minus signs correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
import missingno as ms
from pyecharts.globals import CurrentConfig, OnlineHostType
from pyecharts import options as opts  # configuration options
from pyecharts.charts import Bar, Scatter, Pie, Line, Map, WordCloud, Grid, Page  # chart classes
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType, SymbolType
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # subplots
import jieba
from snownlp import SnowNLP
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")
In [2]:
# The data contains Chinese text, so specify the encoding when reading
df = pd.read_csv("ic_sale.csv",
                 encoding="gb18030",  # needed on Windows; not needed on macOS
                 converters={"order_id": str, "product_id": str, "category_id": str, "user_id": str}
                 )
df.head()
Out[2]:
Basic information
In [3]:
# 1. Shape of the data
df.shape
Out[3]:
(564169, 11)
In [4]:
# 2. Column dtypes
df.dtypes
Out[4]:
event_time        object
order_id          object
product_id        object
category_id       object
category_code     object
brand             object
price            float64
user_id           object
age                int64
sex               object
local             object
dtype: object
In [5]:
# 3. Descriptive statistics
df.describe()
Out[5]:
| | price | age |
|---|---|---|
| count | 564169.000000 | 564169.000000 |
| mean | 208.269324 | 33.184388 |
| std | 304.559875 | 10.122088 |
| min | 0.000000 | 16.000000 |
| 25% | 23.130000 | 24.000000 |
| 50% | 87.940000 | 33.000000 |
| 75% | 277.750000 | 42.000000 |
| max | 18328.680000 | 50.000000 |
In [6]:
# 4. Number of distinct customers
df["user_id"].nunique()
Out[6]:
6908
Data preprocessing
Deduplication
In [7]:
df.shape  # before deduplication
Out[7]:
(564169, 11)
In [8]:
df.drop_duplicates(ignore_index=True, inplace=True)
In [9]:
df.shape  # after deduplication
Out[9]:
(561214, 11)
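As a small illustration of the deduplication step, the sketch below counts fully duplicated rows before dropping them. The toy DataFrame and its values are invented for illustration and are not taken from ic_sale.csv.

```python
import pandas as pd

# Hypothetical toy data; the second and third rows are exact duplicates
toy = pd.DataFrame({
    "order_id": ["1", "2", "2", "3"],
    "price": [10.0, 20.0, 20.0, 30.0],
})

# duplicated() marks every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# ignore_index=True renumbers the remaining rows from 0
deduped = toy.drop_duplicates(ignore_index=True)

print(n_duplicates, deduped.shape)
```

In the article's data the two shapes differ by 564169 - 561214 = 2955 rows, which is the number of duplicates removed.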
Feature information
In [10]:
stats = []
for col in df.columns:
    stats.append((col,
                  df[col].nunique(),
                  round(df[col].isnull().sum() * 100 / df.shape[0], 3),
                  round(df[col].value_counts(normalize=True, dropna=False).values[0] * 100, 3),
                  df[col].dtype)
                 )
stats_df = pd.DataFrame(stats,
                        columns=['feature', 'unique values', 'missing %', 'largest value share %', 'dtype'])
stats_df.sort_values('missing %', ascending=False, ignore_index=True)
Handling missing values
In [11]:
df = df[df["price"] > 0]  # keep only rows with a positive price
In [12]:
df.isnull().sum()
Out[12]:
event_time            0
order_id              0
product_id            0
category_id           0
category_code    128662
brand             27132
price                 0
user_id               0
age                   0
sex                   0
local                 0
dtype: int64
In [13]:
ms.bar(df, color="red")  # visualize missing values
plt.show()
Finally, fill the missing values directly with the string "missing":
In [14]:
df.fillna("missing", inplace=True)  # fill with "missing"
Processing the time field
In [15]:
df["event_time"].value_counts()
Out[15]:
1970-01-01 00:33:40 UTC    1302
2020-04-09 16:30:01 UTC      51
2020-04-08 16:30:01 UTC      49
2020-04-06 16:30:01 UTC      46
2020-04-05 16:30:01 UTC      44
                           ...
2020-07-28 13:10:35 UTC       1
2020-07-28 13:10:21 UTC       1
2020-07-28 13:09:37 UTC       1
2020-07-28 13:08:23 UTC       1
2020-08-13 17:16:24 UTC       1
Name: event_time, Length: 389813, dtype: int64
The output above shows that 1970-01-01 00:33:40 is the most frequent value; it is effectively the placeholder for missing timestamps.
In [16]:
# Strip the trailing " UTC"
df["event_time"] = df["event_time"].apply(lambda x: x[:19])
# Convert the string column to datetime with an explicit format
df['event_time'] = pd.to_datetime(df['event_time'], format="%Y-%m-%d %H:%M:%S")
# Extract additional time-related fields
# df['month'] = df['event_time'].dt.month
# df['day'] = df['event_time'].dt.day
# df['dayofweek'] = df['event_time'].dt.dayofweek
# df['hour'] = df['event_time'].dt.hour
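The conversion above can be tried on a few strings in isolation. The following is a minimal sketch, assuming the raw values look like "2020-04-09 16:30:01 UTC"; the `is_placeholder` flag is a hypothetical helper for spotting the 1970 placeholder and is not part of the original notebook.

```python
import pandas as pd

# Three sample strings in the assumed raw format
raw = pd.Series([
    "2020-04-09 16:30:01 UTC",
    "1970-01-01 00:33:40 UTC",
    "2020-07-28 13:10:35 UTC",
])

# Keep the first 19 characters ("YYYY-MM-DD HH:MM:SS"), then parse them
parsed = pd.to_datetime(raw.str[:19], format="%Y-%m-%d %H:%M:%S")

# Hypothetical flag: timestamps from 1970 are treated as missing values
is_placeholder = parsed.dt.year == 1970

print(parsed.dt.hour.tolist())
print(is_placeholder.tolist())
```

Flagging the placeholder rows this way would make it possible to exclude them before any analysis by month, day, or hour.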
User age binning
In [17]:
# Age distribution by sex
fig = px.box(df, y=["age"], color="sex")
fig.show()
# Number of records at each age
fig = plt.figure(figsize=(12, 6))
sns.countplot(x=df["age"])
plt.title("Counts of Different Age")
plt.show()
Binning the age field:
In [19]:
df["age"] = pd.cut(df["age"], bins=4, precision=0)
df["age"]  # the age field after binning
Out[19]:
0         (16.0, 24.0]
1         (33.0, 42.0]
2         (24.0, 33.0]
3         (16.0, 24.0]
4         (16.0, 24.0]
              ...
561209    (16.0, 24.0]
561210    (16.0, 24.0]
561211    (16.0, 24.0]
561212    (16.0, 24.0]
561213    (16.0, 24.0]
Name: age, Length: 561175, dtype: category
Categories (4, interval[float64, right]): [(16.0, 24.0] < (24.0, 33.0] < (33.0, 42.0] < (42.0, 50.0]]
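With an integer `bins` argument, pd.cut splits the observed range into equal-width intervals, so ages from 16 to 50 with bins=4 give intervals that are (50 - 16) / 4 = 8.5 wide, and precision=0 controls how the interval labels are rounded. The sketch below, on invented ages, uses retbins=True to inspect the underlying edges.

```python
import pandas as pd

# Hypothetical ages covering the same 16-50 range as the article's data
ages = pd.Series([16, 20, 25, 33, 38, 45, 50])

# retbins=True also returns the numeric bin edges used for the cut
binned, edges = pd.cut(ages, bins=4, precision=0, retbins=True)

print(binned.cat.categories)  # the four interval labels
print(edges)                  # the underlying edges, 8.5 apart
```

Because the labels are rounded while the edges are not, passing explicit edges such as bins=[15, 24, 33, 42, 50] may be easier to reason about when exact boundaries matter.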
Comparing spending levels of users across regions
In [22]:
fig = px.scatter(df[df["brand"] != "missing"],  # exclude rows with a missing brand
                 # x="local",
                 y="price",
                 facet_col="age",
                 color="local",
                 size="price"
                 )
fig.show()
Brand preferences by age group and sex
In [23]:
age_brand = df.groupby(["age", "sex", "brand"]).size().reset_index().rename(columns={0: "number"})
age_brand.head()
Out[23]:
| | age | sex | brand | number |
|---|---|---|---|---|
| 0 | (16.0, 24.0] | 女 | a-case | 32 |
| 1 | (16.0, 24.0] | 女 | acana | 0 |
| 2 | (16.0, 24.0] | 女 | accesstyle | 3 |
| 3 | (16.0, 24.0] | 女 | action | 0 |
| 4 | (16.0, 24.0] | 女 | activision | 3 |
In [24]:
# Sort: age ascending, number descending
age_brand = age_brand.sort_values(["age", "number"], ascending=[True, False], ignore_index=True)
age_brand.head()
Out[24]:
| | age | sex | brand | number |
|---|---|---|---|---|
| 0 | (16.0, 24.0] | 男 | samsung | 11884 |
| 1 | (16.0, 24.0] | 女 | samsung | 11882 |
| 2 | (16.0, 24.0] | 男 | apple | 4561 |
| 3 | (16.0, 24.0] | 女 | apple | 4283 |
| 4 | (16.0, 24.0] | 男 | missing | 3354 |
In [25]:
# Filter: keep positive counts and drop the missing brand
age_brand = age_brand.query("number > 0 & brand != 'missing'")
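The zero counts in Out[23] appear to come from grouping on a categorical column: after pd.cut, age is categorical, and depending on the pandas version the default groupby behavior can emit every combination of group keys, including unobserved ones. As an alternative to filtering with number > 0, the sketch below passes observed=True; the toy data and explicit bin edges are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same three grouping columns
toy = pd.DataFrame({
    "age": [18, 20, 30, 45],
    "sex": ["F", "M", "F", "M"],
    "brand": ["samsung", "apple", "samsung", "lg"],
})
toy["age"] = pd.cut(toy["age"], bins=[16, 24, 33, 50])  # categorical intervals

# observed=True keeps only combinations that actually occur in the data
counts = (
    toy.groupby(["age", "sex", "brand"], observed=True)
    .size()
    .reset_index(name="number")
)

print(counts)
```

With this option every row of the result has a positive count, so only the brand != 'missing' condition would still be needed.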
In [26]:
fig = px.treemap(
    age_brand,  # input data
    path=[px.Constant("all"), "age", "sex", "brand"],  # hierarchy of the treemap
    values="number"  # value shown for each tile
)
fig.update_traces(root_color="lightskyblue")
fig.update_layout(margin=dict(t=30, l=30, r=25, b=30))
fig.show()
Brand count word cloud
In [27]:
age_brand.head()
Out[27]:
| | age | sex | brand | number |
|---|---|---|---|---|
| 0 | (16.0, 24.0] | 男 | samsung | 11884 |
| 1 | (16.0, 24.0] | 女 | samsung | 11882 |
| 2 | (16.0, 24.0] | 男 | apple | 4561 |
| 3 | (16.0, 24.0] | 女 | apple | 4283 |
| 6 | (16.0, 24.0] | 男 | ava | 3317 |
In [28]:
brand_list = age_brand["brand"].value_counts().reset_index()
brand_list.columns = ["word", "number"]
brand_list.head(10)
Out[28]:
| | word | number |
|---|---|---|
| 0 | samsung | 8 |
| 1 | darina | 8 |
| 2 | huion | 8 |
| 3 | aquapick | 8 |
| 4 | amigami | 8 |
| 5 | sjcam | 8 |
| 6 | rockstar | 8 |
| 7 | franke | 8 |
| 8 | bridgestone | 8 |
| 9 | tailg | 8 |
In [29]:
information_zip = [tuple(z) for z in zip(brand_list["word"].tolist(), brand_list["number"].tolist())]
# Plot
c = (
    WordCloud()
    .add("", information_zip, word_size_range=[20, 80], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="Brand word cloud"))
)
c.render_notebook()
Product categories (category_code) by brand
Processing category_code
Check how many distinct category_code values there are and how often each occurs, using the value_counts() method:
In [30]:
df["category_code"].value_counts()
Out[30]:
missing                             128662
electronics.smartphone              101502
computers.notebook                   25917
appliances.kitchen.refrigerators     20296
electronics.audio.headphone          20049
                                     ...
kids.swing                               8
country_yard.watering                    5
sport.snowboard                          3
apparel.costume                          2
apparel.shoes                            2
Name: category_code, Length: 124, dtype: int64
Conclusion: leaving aside the missing entries, the most frequent category is electronics.smartphone (smartphones), followed by computers.notebook (laptops).
In [31]:
fig = px.bar(df["category_code"].value_counts()[1:30])  # the 29 most frequent category_code values after "missing"
fig.show()
Keep only the fields we need:
In [32]:
df = df[df["category_code"] != "missing"]  # drop rows with a missing category_code
df = df[["category_code", "brand", "age", "sex", "local"]]
Split the category_code field:
In [33]:
df["category_code"] = df["category_code"].apply(lambda x: x.split(".") if "." in x else [x])
df.head()
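To see what the split produces, here is a small standalone sketch on three sample strings. It also shows that `str.split(".")` already returns a one-element list when there is no dot, so the conditional form and the plain form give the same result.

```python
# Sample category_code strings; the last one contains no dot
codes = [
    "electronics.smartphone",
    "appliances.kitchen.refrigerators",
    "missing",
]

# The form used in the notebook: split only when a dot is present
with_check = [c.split(".") if "." in c else [c] for c in codes]

# Plain split: a string without the separator comes back as [string]
plain = [c.split(".") for c in codes]

print(with_check)
print(with_check == plain)
```

Each row then behaves like a small "transaction" of category levels, which is the input shape the association rule step expects.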
Out[33]:
| | category_code | brand | age | sex | local |
|---|---|---|---|---|---|
| 0 | [electronics, tablet] | samsung | (16.0, 24.0] | 女 | 海南 |
| 1 | [electronics, audio, headphone] | huawei | (33.0, 42.0] | 女 | 北京 |
| 3 | [furniture, kitchen, table] | maestro | (16.0, 24.0] | 男 | 重庆 |
| 4 | [electronics, smartphone] | apple | (16.0, 24.0] | 男 | 北京 |
| 5 | [appliances, kitchen, refrigerators] | lg | (16.0, 24.0] | 男 | 北京 |
category_code word cloud
In [34]:
data = df["category_code"].tolist()
data[:3]
Out[34]:
[['electronics', 'tablet'],
 ['electronics', 'audio', 'headphone'],
 ['furniture', 'kitchen', 'table']]
In [35]:
import itertools
# Use chain.from_iterable to flatten the nested lists into a single list
sum_data = list(itertools.chain.from_iterable(data))
sum_data[:10]
Out[35]:
['electronics', 'tablet', 'electronics', 'audio', 'headphone', 'furniture', 'kitchen', 'table', 'electronics', 'smartphone']
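The flattening step can be reproduced with the standard library alone. The sketch below reuses the three rows shown in Out[34] and counts the flattened tokens with collections.Counter, which is one possible alternative to counting with pandas.

```python
import itertools
from collections import Counter

# The first three rows from Out[34]
data = [
    ["electronics", "tablet"],
    ["electronics", "audio", "headphone"],
    ["furniture", "kitchen", "table"],
]

# Flatten the list of lists into one list of tokens
flat = list(itertools.chain.from_iterable(data))

# Count how often each token occurs
counts = Counter(flat)

print(flat)
print(counts.most_common(1))
```

Note that a token such as "kitchen" is counted wherever it appears, whether under appliances or furniture.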
In [36]:
category_code_number = pd.Series(sum_data).value_counts().to_frame().reset_index()
category_code_number.columns = ["category_code", "number"]
category_code_number.head()
Out[36]:
| | category_code | number |
|---|---|---|
| 0 | electronics | 156709 |
| 1 | appliances | 150331 |
| 2 | kitchen | 107852 |
| 3 | smartphone | 101502 |
| 4 | computers | 76877 |
In [37]:
information_zip = [tuple(z) for z in zip(category_code_number["category_code"].tolist(), category_code_number["number"].tolist())]
# Plot
c = (
    WordCloud()
    .add("", information_zip, word_size_range=[20, 80], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="Product category word cloud"))
)
c.render_notebook()
Modeling with association rules
By sex
Finding frequent itemsets: male
In [38]:
male = df[df["sex"] == "男"]  # "男" = male
male.head()
Out[38]:
| | category_code | brand | age | sex | local |
|---|---|---|---|---|---|
| 3 | [furniture, kitchen, table] | maestro | (16.0, 24.0] | 男 | 重庆 |
| 4 | [electronics, smartphone] | apple | (16.0, 24.0] | 男 | 北京 |
| 5 | [appliances, kitchen, refrigerators] | lg | (16.0, 24.0] | 男 | 北京 |
| 6 | [appliances, personal, scales] | polaris | (24.0, 33.0] | 男 | 广东 |
| 17 | [appliances, kitchen, kettle] | tefal | (33.0, 42.0] | 男 | 广东 |
In [39]:
import efficient_apriori as ea
male_list = male["category_code"].tolist()
# itemsets: frequent itemsets; rules: association rules
itemsets, rules = ea.apriori(male_list,
                             min_support=0.005,
                             min_confidence=1
                             )
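To make the shape of `itemsets` concrete (a dictionary keyed by itemset size, mapping sorted tuples to counts), the sketch below counts k-item combinations by brute force on four invented transactions. It is not the Apriori algorithm itself, which prunes candidates level by level, and `frequent_itemsets` is a hypothetical helper, not part of efficient_apriori.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, k, min_support):
    """Count every k-item combination and keep those meeting min_support."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        # sorted(set(t)) yields alphabetically ordered, duplicate-free tuples
        for combo in combinations(sorted(set(t)), k):
            counts[combo] += 1
    return {items: c for items, c in counts.items() if c / n >= min_support}

# Hypothetical transactions in the same format as male_list
transactions = [
    ["electronics", "smartphone"],
    ["electronics", "smartphone"],
    ["electronics", "audio", "headphone"],
    ["appliances", "kitchen", "kettle"],
]

singles = frequent_itemsets(transactions, k=1, min_support=0.5)
pairs = frequent_itemsets(transactions, k=2, min_support=0.5)

print(singles)
print(pairs)
```

Because the items inside each tuple are sorted alphabetically, a path such as electronics.audio.headphone shows up as ('audio', 'electronics', 'headphone'), which matches the ordering seen in the notebook output.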
Frequent 1-itemsets
In [40]:
len(itemsets[1])
Out[40]:
60
In [41]:
itemsets[1]  # frequent 1-itemsets
# Sort the dictionary by value in descending order
dict(sorted(itemsets[1].items(), key=lambda x: x[1], reverse=True))
Frequent 2-itemsets
In [43]:
len(itemsets[2])  # total count
Out[43]:
84
In [44]:
# Frequent 2-itemsets
dict(sorted(itemsets[2].items(), key=lambda x: x[1], reverse=True))
Frequent 3-itemsets
In [45]:
len(itemsets[3])  # total count
Out[45]:
32
In [46]:
# Frequent 3-itemsets
dict(sorted(itemsets[3].items(), key=lambda x: x[1], reverse=True))
Out[46]:
{('appliances','kitchen','refrigerators'):10209,
('audio','electronics','headphone'):10154,
('electronics','tv','video'):8876,
('appliances','environment','vacuum'):8069,
('appliances','kitchen','washer'):7235,
('appliances','kettle','kitchen'):6389,
('computers','mouse','peripherals'):6359,
('furniture','kitchen','table'):5626,
('appliances','hood','kitchen'):4487,
('appliances','blender','kitchen'):4439,
('appliances','kitchen','microwave'):3830,
('air_conditioner','appliances','environment'):3806,
('appliances','personal','scales'):3423,
('computers','network','router'):3318,
('components','computers','hdd'):2598,
('appliances','kitchen','meat_grinder'):2361,
('components','computers','cpu'):2055,
('appliances','kitchen','oven'):1958,
('appliances','environment','fan'):1952,
('computers','keyboard','peripherals'):1940,
('computers','peripherals','printer'):1802,
('appliances','environment','water_heater'):1753,
('computers','monitor','peripherals'):1733,
('components','computers','cooler'):1717,
('cabinet','furniture','living_room'):1550,
('chair','furniture','kitchen'):1513,
('appliances','hair_cutter','personal'):1388,
('air_heater','appliances','environment'):1341,
('appliances','dishwasher','kitchen'):1329,
('furniture','living_room','shelving'):1314,
('appliances','kitchen','mixer'):1288,
('construction','screw','tools'):1194}
Finding frequent itemsets: female
In [47]:
female = df[df["sex"] == "女"]  # "女" = female
female.head()
Out[47]:
| | category_code | brand | age | sex | local |
|---|---|---|---|---|---|
| 0 | [electronics, tablet] | samsung | (16.0, 24.0] | 女 | 海南 |
| 1 | [electronics, audio, headphone] | huawei | (33.0, 42.0] | 女 | 北京 |
| 7 | [electronics, video, tv] | samsung | (16.0, 24.0] | 女 | 北京 |
| 8 | [computers, components, cpu] | intel | (42.0, 50.0] | 女 | 浙江 |
| 10 | [computers, notebook] | asus | (42.0, 50.0] | 女 | 广东 |
In [48]:
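import efficient_apriori as ea
# Use the female subset here; the original cell passed male["category_code"],
# so the outputs shown below were computed on the male subset
female_list = female["category_code"].tolist()
# itemsets: frequent itemsets; rules: association rules
itemsets, rules = ea.apriori(female_list,
                             min_support=0.005,
                             min_confidence=1
                             )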
Frequent 1-itemsets
In [49]:
len(itemsets[1])  # total count
Out[49]:
60
In [50]:
# Frequent 1-itemsets
dict(sorted(itemsets[1].items(), key=lambda x: x[1], reverse=True))
Frequent 2-itemsets
In [51]:
# Frequent 2-itemsets
dict(sorted(itemsets[2].items(), key=lambda x: x[1], reverse=True))
Frequent 3-itemsets
In [52]:
# Frequent 3-itemsets
dict(sorted(itemsets[3].items(), key=lambda x: x[1], reverse=True))
By brand
In [53]:
# Summing a column of lists concatenates them, giving one long list per brand
brand_category = df.groupby(["brand"])["category_code"].sum().reset_index()
brand_category
# Deduplicate each brand's list with set
brand_category["category_code"] = brand_category["category_code"].apply(lambda x: list(set(x)))
brand_category
import efficient_apriori as ea
brand_list = brand_category["category_code"].tolist()
# itemsets: frequent itemsets; rules: association rules
itemsets, rules = ea.apriori(
    brand_list,
    min_support=0.05,
    min_confidence=1
)
# Frequent 3-itemsets
dict(sorted(itemsets[3].items(), key=lambda x: x[1], reverse=True))
# Frequent 2-itemsets
dict(sorted(itemsets[2].items(), key=lambda x: x[1], reverse=True))
# Frequent 1-itemsets
dict(sorted(itemsets[1].items(), key=lambda x: x[1], reverse=True))
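The per-brand step merges all category tokens of a brand into one deduplicated "transaction". The sketch below shows one way to build the same structure on invented rows by iterating over the groups explicitly; it is an alternative formulation, not the notebook's exact code.

```python
from itertools import chain

import pandas as pd

# Hypothetical rows: two purchases each for two brands
toy = pd.DataFrame({
    "brand": ["samsung", "samsung", "lg", "lg"],
    "category_code": [
        ["electronics", "smartphone"],
        ["electronics", "tablet"],
        ["appliances", "kitchen", "refrigerators"],
        ["appliances", "kitchen", "washer"],
    ],
})

# For each brand, flatten its lists and keep each token once (sorted for stable output)
brand_items = {
    brand: sorted(set(chain.from_iterable(lists)))
    for brand, lists in toy.groupby("brand")["category_code"]
}

# One transaction per brand, ready to pass to an itemset miner
brand_list = list(brand_items.values())

print(brand_items)
```

Here min_support is measured over brands rather than over orders, so a support of 0.05 means the itemset appears for at least 5% of brands.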
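Conclusions
- By age, the average consumer is about 33 years old, a core consumer group with some spending power;
- By product preference, users mainly favor Samsung, Apple, ava (mainly children's products such as children's helmets and motorcycles), and tefal (mainly home appliances such as steamers and non-stick pans)
- By the product categories users searched for, the most attention goes to smartphone, kitchen, and electronics; in other words, smartphones, kitchenware, and electronic products are what users focus on
- From the association rule mining:
  - For both men and women, the associated products are likely electronics with smartphone, appliances with kitchen, or computers with notebook
  - Within a single brand, appliances and kitchen, as well as audio--->electronics--->headphone, are the main associated products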
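The min_confidence=1 setting used above keeps only rules whose confidence is exactly 1. The sketch below computes confidence by hand on invented transactions; `confidence` is a hypothetical helper written for illustration. Since every transaction here is a single category path, one way to read a rule such as smartphone -> electronics is as a reflection of the category hierarchy.

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = count(A and B together) / count(A)."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_a if n_a else 0.0

# Hypothetical transactions in the split category_code format
transactions = [
    ["electronics", "smartphone"],
    ["electronics", "smartphone"],
    ["electronics", "audio", "headphone"],
    ["appliances", "kitchen", "kettle"],
]

# Every smartphone row also contains electronics
c_forward = confidence(transactions, ["smartphone"], ["electronics"])

# Only two of the three electronics rows contain smartphone
c_backward = confidence(transactions, ["electronics"], ["smartphone"])

print(c_forward, c_backward)
```

Under min_confidence=1 only the first direction would survive as a rule, which suggests the mined rules mostly run from the more specific category level to the more general one.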