信息纪录 - [Python Data Analysis] Chapter 10: Data Aggregation and Group Operations (Part 3)

Pivot Tables and Cross-Tabulations
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
--------------------------------------------------------------------------------
frame = pd.DataFrame({
    'name':['zs', 'li','ll', 'sd', 'ds', 'd2', 'as', 'jk', 'sw'],
    'gender':[1,   2,    2,    1,    1,   1,    2,     1,    2],
    'clazz':[1,   1,     3,    1,    2,    2,   2,    3,     1],
    'tall'  :[178, 155, 160, 185,   183, 167,  168,  164,  163],
    'score':[50,  69,  89, 92,      78,  40,    73,    65,  28]
})
The data holds each student's gender, class, height, and score.
--------------------------------------------------------------------------------
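Goal: compute the mean height and mean score for each gender within each class (group by class, then by gender, then take the mean).
frame.pivot_table() groups the data and aggregates it, and its default aggregation is the mean,
so pivot_table can do this in a single call.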

frame.pivot_table(index=['clazz', 'gender'])
Output:
              score   tall
clazz gender
1     1        71.0  181.5
      2        48.5  159.0
2     1        59.0  175.0
      2        73.0  168.0
3     1        65.0  164.0
      2        89.0  160.0

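# index sets the grouping keys; every other column is averaged. Nuisance columns that cannot be
# averaged (such as name) are dropped here; newer pandas versions may raise an error instead.
# You can also name the columns to aggregate; any column not listed is left out.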
frame.pivot_table(['score'], index=['clazz', 'gender'])
Output:

              score
clazz gender
1     1        71.0
      2        48.5
2     1        59.0
      2        73.0
3     1        65.0
      2        89.0
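
To make the "group, then aggregate" behaviour concrete, here is a minimal sketch comparing pivot_table with an explicit groupby on the same data. The variable names pivoted and grouped are only illustrative. The value columns are listed explicitly so that the non-numeric name column never reaches the mean.

```python
import pandas as pd

frame = pd.DataFrame({
    'name':   ['zs', 'li', 'll', 'sd', 'ds', 'd2', 'as', 'jk', 'sw'],
    'gender': [1, 2, 2, 1, 1, 1, 2, 1, 2],
    'clazz':  [1, 1, 3, 1, 2, 2, 2, 3, 1],
    'tall':   [178, 155, 160, 185, 183, 167, 168, 164, 163],
    'score':  [50, 69, 89, 92, 78, 40, 73, 65, 28],
})

# Mean of score and tall per (clazz, gender) via pivot_table.
pivoted = frame.pivot_table(values=['score', 'tall'], index=['clazz', 'gender'])

# The same computation written as groupby + mean.
grouped = frame.groupby(['clazz', 'gender'])[['score', 'tall']].mean()

print(pivoted)
print(grouped)
```

Both calls should print the same table, which suggests that for this simple case pivot_table can be read as a shorthand for groupby followed by mean.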
--------------------------------------------------------------------------------
frame.pivot_table(['score'], index=['clazz', 'gender'], columns='name')
This particular result is not very informative; it only shows that a columns parameter exists. The output looks like this:
             score
name            as    d2    ds    jk    li    ll    sd    sw    zs
clazz gender
1     1        NaN   NaN   NaN   NaN   NaN   NaN  92.0   NaN  50.0
      2        NaN   NaN   NaN   NaN  69.0   NaN   NaN  28.0   NaN
2     1        NaN  40.0  78.0   NaN   NaN   NaN   NaN   NaN   NaN
      2       73.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
3     1        NaN   NaN   NaN  65.0   NaN   NaN   NaN   NaN   NaN
      2        NaN   NaN   NaN   NaN   NaN  89.0   NaN   NaN   NaN
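
A perhaps more telling use of columns is to move one of the grouping keys from the rows to the columns. The sketch below is only an illustration on the same data (the names wide and long_form are made up here); it puts gender across the top and compares the result with the unstacked groupby means.

```python
import pandas as pd

frame = pd.DataFrame({
    'gender': [1, 2, 2, 1, 1, 1, 2, 1, 2],
    'clazz':  [1, 1, 3, 1, 2, 2, 2, 3, 1],
    'score':  [50, 69, 89, 92, 78, 40, 73, 65, 28],
})

# One row per class, one column per gender, cells hold the mean score.
wide = frame.pivot_table('score', index='clazz', columns='gender')

# The same numbers in long form, then reshaped so gender becomes the columns.
long_form = frame.groupby(['clazz', 'gender'])['score'].mean()
reshaped = long_form.unstack('gender')

print(wide)
```

Here index and columns play symmetric roles: both are grouping keys, and the only difference is where the key ends up in the resulting table.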
--------------------------------------------------------------------------------
frame.pivot_table(['score'], index=['clazz'], margins=True)
margins=True adds a total row (All) at the bottom, which is the mean over all rows of the data.
Output:

           score
clazz
1      59.750000
2      63.666667
3      77.000000
All    64.888889
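
It is easy to misread the All row as the average of the three class means. The following sketch, using only the clazz and score columns of the same data, checks which of the two it is; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'clazz': [1, 1, 3, 1, 2, 2, 2, 3, 1],
    'score': [50, 69, 89, 92, 78, 40, 73, 65, 28],
})

with_total = frame.pivot_table(['score'], index=['clazz'], margins=True)

# Value in the All row.
all_value = float(with_total.loc['All'].iloc[0])

# Mean over all nine students.
overall_mean = frame['score'].mean()

# Mean of the three per-class means (a different quantity when class sizes differ).
mean_of_class_means = frame.groupby('clazz')['score'].mean().mean()

print(all_value, overall_mean, mean_of_class_means)
```

Because the classes have 4, 3, and 2 students, the two candidates differ, and the All row matches the mean over all students.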
--------------------------------------------------------------------------------
pivot_table aggregates with the mean by default; to use a different function, pass aggfunc:
frame.pivot_table(['score'], index=['clazz'], margins=True, aggfunc=lambda x:x.max())
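
As a small sketch of other forms aggfunc accepts: a function name given as a string yields the same result as the lambda above, and a list of names yields one block of columns per function. The variable names below are illustrative, and margins is left out to keep the output short.

```python
import pandas as pd

frame = pd.DataFrame({
    'clazz': [1, 1, 3, 1, 2, 2, 2, 3, 1],
    'score': [50, 69, 89, 92, 78, 40, 73, 65, 28],
})

# A lambda, as in the example above.
by_lambda = frame.pivot_table(['score'], index=['clazz'], aggfunc=lambda x: x.max())

# The same aggregation requested by name.
by_name = frame.pivot_table(['score'], index=['clazz'], aggfunc='max')

# Several aggregations at once; the columns become (function, value column) pairs.
several = frame.pivot_table(['score'], index=['clazz'], aggfunc=['min', 'max'])

print(by_name)
print(several)
```

The string form is usually the more readable choice when the aggregation is one of the built-in reductions.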
--------------------------------------------------------------------------------
Fill NA values in the result with fill_value=0:
frame.pivot_table(['score'], index=['clazz'], margins=True, aggfunc=lambda x:x.max(),
                  fill_value=0, columns='name')
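
To show what fill_value changes, this sketch builds the same kind of sparse table with and without it. It passes 'score' as a plain string rather than a list so that the columns are just the names; that choice and the variable names are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'name':  ['zs', 'li', 'll', 'sd', 'ds', 'd2', 'as', 'jk', 'sw'],
    'clazz': [1, 1, 3, 1, 2, 2, 2, 3, 1],
    'score': [50, 69, 89, 92, 78, 40, 73, 65, 28],
})

# Each student belongs to one class only, so most (class, name) cells are empty.
without = frame.pivot_table('score', index='clazz', columns='name', aggfunc='max')

# The same table with the empty cells replaced by 0.
filled = frame.pivot_table('score', index='clazz', columns='name',
                           aggfunc='max', fill_value=0)

print(int(without.isna().sum().sum()))  # number of empty cells before filling
print(filled)
```

Note that after filling, a cell holding 0 can no longer be told apart from a student who really scored 0, so the fill value should be picked with the data in mind.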
--------------------------------------------------------------------------------
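Signature of pivot_table:
frame.pivot_table(
    values=None,             # column(s) to aggregate; if omitted, all columns not used as grouping keys
    index=None,              # grouping keys shown as rows
    columns=None,            # grouping keys shown as the columns of the new table
    aggfunc='mean',          # aggregation function; mean by default
    fill_value=None,         # value used to replace NA in the result
    margins=False,           # add row/column subtotals
    dropna=True,             # leave out columns whose entries are all NA
    margins_name='All',      # label used for the subtotals
    observed=False,          # only relevant when a grouping key is categorical
) -> 'DataFrame'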
================================================================================
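Cross-tabulation: crosstab
A cross-tabulation (crosstab for short) is a special kind of pivot table that computes group frequencies.
Frequency here means the number of occurrences. For example, grouping the data by gender gives the number of boys and the number of girls.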
--------------------------------------------------------------------------------
frame = pd.DataFrame({
    'name':['zs', 'li','ll', 'sd', 'ds', 'd2', 'as', 'jk', 'sw'],
    'gender':[1,   2,    2,    1,    1,   1,    2,     1,    2],
    'clazz':[1,   1,     3,    1,    2,    2,   2,    3,     1],
    'tall'  :[178, 155, 160, 185,   183, 167,  168,  164,  163],
    'score':[50,  69,  89, 92,      78,  40,    73,    65,  np.nan]
})

pd.crosstab([frame.clazz, frame.gender], 'mycount', margins=True)
Output: the number of samples in each group
col_0         mycount  All
clazz gender
1     1             2    2
      2             2    2
2     1             2    2
      2             1    1
3     1             1    1
      2             1    1
All                 9    9
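
The more common way to call crosstab is with one key for the rows and one for the columns. The sketch below is an illustration on the same data, with made-up variable names; it counts students per class and gender, reproduces the counts with groupby, and uses normalize to turn counts into row shares.

```python
import pandas as pd

frame = pd.DataFrame({
    'gender': [1, 2, 2, 1, 1, 1, 2, 1, 2],
    'clazz':  [1, 1, 3, 1, 2, 2, 2, 3, 1],
})

# Rows are classes, columns are genders, cells are counts.
counts = pd.crosstab(frame['clazz'], frame['gender'])

# The same counts from groupby: group sizes, with gender moved to the columns.
same = frame.groupby(['clazz', 'gender']).size().unstack(fill_value=0)

# normalize='index' divides each row by its row total.
share = pd.crosstab(frame['clazz'], frame['gender'], normalize='index')

print(counts)
print(share)
```

Read this way, crosstab is a counting shortcut: it needs only the key columns, not a column of values.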
--------------------------------------------------------------------------------
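Signature of crosstab:
pd.crosstab(
    index,                       # values to group by in the rows
    columns,                     # values to group by in the columns (the constant 'mycount' above yields a single column with that name)
    values=None,                 # data to aggregate; requires aggfunc
    rownames=None,
    colnames=None,
    aggfunc=None,                # if None, counts the samples in each group
    margins=False,               # whether to add subtotals
    margins_name: str = 'All',
    dropna: bool = True,
    normalize=False,
) -> 'DataFrame'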
--------------------------------------------------------------------------------
Since crosstab has an aggfunc parameter, the default behaviour of counting samples can be replaced with another aggregation.
--------------------------------------------------------------------------------
def fun(x):
    print(x)
    return x.max()
--------------------------------------------------------------------------------
pd.crosstab([frame.clazz, frame.gender], 'mycount', values=frame.tall, margins=True, aggfunc=fun)
--------------------------------------------------------------------------------
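If aggfunc is given, values=... must be given as well.
By default crosstab only receives the grouping columns, so a custom aggfunc=fun would have no data to aggregate.
That is why values=... is required: it is split by group and each piece is passed to fun. values can be a column of the original data,
or other data, as long as it has the same length.
The result is now the maximum height within each group.
Output: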
col_0         mycount  All
clazz gender
1     1           185  185
      2           163  163
2     1           183  183
      2           168  168
3     1           164  164
      2           160  160
All               185  185
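
The rule that aggfunc needs values can be checked directly, and the aggregated crosstab can be compared with the equivalent pivot_table call. This is a sketch on the same data that uses the string 'max' in place of the custom fun; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'gender': [1, 2, 2, 1, 1, 1, 2, 1, 2],
    'clazz':  [1, 1, 3, 1, 2, 2, 2, 3, 1],
    'tall':   [178, 155, 160, 185, 183, 167, 168, 164, 163],
})

# aggfunc without values is rejected by crosstab.
try:
    pd.crosstab(frame['clazz'], frame['gender'], aggfunc='max')
    raised = False
except ValueError:
    raised = True

# Tallest student per class and gender via crosstab ...
tallest_ct = pd.crosstab(frame['clazz'], frame['gender'],
                         values=frame['tall'], aggfunc='max')

# ... and the same table via pivot_table.
tallest_pt = frame.pivot_table('tall', index='clazz', columns='gender', aggfunc='max')

print(raised)
print(tallest_ct)
```

Once values and aggfunc are supplied, crosstab behaves like pivot_table applied to separately passed arrays, which is convenient when the keys and values do not live in one DataFrame.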
--------------------------------------------------------------------------------