[Python Data Analysis] Chapter 8: Data Wrangling, Part 2

Combining and merging datasets
--------------------------------------------------------------------------------
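Data held in pandas objects can be combined in several ways:
1. pandas.merge connects rows of different DataFrames based on one or more keys. Users of SQL or
   other relational databases will find it familiar, because it implements the database join operation.
2. pandas.concat stacks several objects together along an axis.
3. The instance method combine_first splices overlapping data together, filling the missing values
   of one object with values from another.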
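The third approach is not demonstrated elsewhere in these notes, so here is a minimal sketch. The Series `a` and `b` are made-up illustration data, not taken from the book. combine_first keeps the caller's values and falls back to the other object only where the caller has NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: `a` has gaps, `b` is complete.
a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, 9.0, 3.0], index=['x', 'y', 'z'])

# Values from `a` win; the NaN slots in `a` are filled from `b`.
filled = a.combine_first(b)
print(filled)
```

The value 9.0 in `b` is ignored because `a` already has a value at label 'y'.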
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

pd.merge(df1, df2)
Output:
   key	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1
3	a	2	0
4	a	4	0
5	a	5	0
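No key was specified, so merge used the overlapping column names as the key (both DataFrames have a key column, so that column is used).
It is better to name the key explicitly, for example pd.merge(df1, df2, on='key').

If the two frames share no column name, specify the key for each side separately: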
df1 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'rkey': ['a', 'b', 'd'],'data2': range(3)})
pd.merge(df1, df2, left_on='lkey', right_on='rkey')

The merges above keep only the intersection of the keys (an inner join). Use how to change this: 'inner', 'outer', 'left' or 'right'.
pd.merge(df1, df2, left_on='lkey', right_on='rkey', how='outer')
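To make the effect of how concrete, the following sketch counts the rows produced by each join type on the lkey/rkey frames defined above. The name `row_counts` is introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

# 'c' exists only on the left and 'd' only on the right, so:
#   inner keeps neither, left keeps 'c', right keeps 'd', outer keeps both.
row_counts = {
    how: len(pd.merge(df1, df2, left_on='lkey', right_on='rkey', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)
```

The six matched rows appear in every variant; left and right each add one unmatched row, and outer adds both.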
--------------------------------------------------------------------------------
Merging on multiple columns:

left = pd.DataFrame({
    'key1': ['foo', 'foo', 'bar'],
    'key2': ['one', 'two', 'one'],
    'lval': [1, 2, 3]})
right = pd.DataFrame({
    'key1': ['foo', 'foo', 'bar', 'bar'],
    'key2': ['one', 'one', 'one', 'two'],
    'rval': [4, 5, 6, 7]
})
pd.merge(left, right, on=['key1', 'key2'], how='outer')
Output:
	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

pd.merge(left, right, on='key1')
Output:
	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

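Comparing the two outputs: when merging on key1 alone, the non-key column that has the same name in both frames (key2) is renamed with a suffix, while the join key itself is not renamed.
The suffixes can be changed with the suffixes parameter; the default is ('_x', '_y').
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))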
--------------------------------------------------------------------------------
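pd.merge(
    left,                                    # left DataFrame
    right,                                   # right DataFrame
    how: str = 'inner',                      # join type: 'inner', 'outer', 'left', 'right'
    on=None,                                 # column name(s) to join on; must exist in both frames
    left_on=None,                            # column(s) of the left frame used as the join key
    right_on=None,                           # column(s) of the right frame used as the join key
    left_index: bool = False,                # use the left frame's index as the join key
    right_index: bool = False,               # use the right frame's index as the join key
    sort: bool = False,                      # sort the result by the join keys
    suffixes=('_x', '_y'),                   # suffixes added to overlapping non-key column names
    copy: bool = True,                       # whether to copy the data
    indicator: bool = False,                 # add a `_merge` column recording each row's origin
    validate=None,                           # check key uniqueness, e.g. 'one_to_one', 'many_to_one'
) -> 'DataFrame'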
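Two of the parameters in this signature, indicator and validate, are not exercised in these notes. The sketch below shows one way they might be used with the key frames from the first example; the variable names `tagged` and `one_to_one_ok` are introduced here only for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
tagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(tagged['_merge'].value_counts())

# validate checks key uniqueness. df2's keys are unique, so many-to-one passes.
pd.merge(df1, df2, on='key', validate='many_to_one')

# One-to-one fails because 'a' and 'b' repeat in df1.
try:
    pd.merge(df1, df2, on='key', validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Using validate this way can catch unexpected duplicate keys before they silently multiply rows in the result.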

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b','c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Use the key column on the left and the index on the right as the join key
pd.merge(left1, right1, left_on='key', right_index=True)
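The notes do not show the result of this call, so here is a small self-contained sketch that runs it and inspects the outcome. Because the default join is inner, the row with key 'c' (which has no matching label in right1's index) is dropped.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Left side joins on the 'key' column, right side on its index.
result = pd.merge(left1, right1, left_on='key', right_index=True)
print(result)

# Every 'a' row picks up 3.5 and every 'b' row picks up 7.0.
print(result.groupby('key')['group_val'].unique())
```

Passing how='outer' would keep the 'c' row with NaN in group_val.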
--------------------------------------------------------------------------------
Merging against a two-level (hierarchical) index
--------------------------------------------------------------------------------
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio','Nevada', 'Nevada'],
                       'key2': [2000, 2001, 2002, 2001,2002],
                       'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
        index=[
              ['Nevada', 'Nevada', 'Ohio', 'Ohio','Ohio', 'Ohio'],
              [2001, 2000, 2000, 2000, 2001, 2002]],
        columns=['event1', 'event2'])

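lefth has only the automatically generated index.
righth has a two-level index.
Give the key column names for the left side as a list, and use the index on the right: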
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
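As a quick check of what this merge produces, the sketch below compares the row counts of the inner and outer variants. Note that the pair ('Ohio', 2000) appears twice in righth's index, so the matching left row is duplicated.

```python
import numpy as np
import pandas as pd

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

# Inner: only (key1, key2) pairs present on both sides.
inner = pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
# Outer: also keeps ('Nevada', 2002) from the left and ('Nevada', 2000) from the right.
outer = pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
print(len(inner), len(outer))
```

The two extra rows in the outer result are the unmatched pairs, which carry NaN on the side that has no data for them.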
--------------------------------------------------------------------------------
Joining on the index of both sides also works:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
    index=['a', 'c', 'e'],
    columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.],[13, 14]],
    index=['b', 'c', 'd', 'e'],
    columns=['Missouri', 'Alabama'])

pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
--------------------------------------------------------------------------------
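DataFrame also has an instance method, join, defined as below. It joins on the index by default and requires that the two frames have no overlapping column names (unless lsuffix/rsuffix are given).
The on parameter names a column (or columns) of the calling DataFrame, which is matched against the index of other; if on is omitted, index is matched against index.
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) -> 'DataFrame'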
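A short sketch of both ways of calling join, reusing the frames from the merge examples above (the result names `by_index` and `by_column` are introduced here for illustration):

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

# Without `on`: the index of left2 is matched against the index of right2.
by_index = left2.join(right2, how='outer')
print(by_index)

# With `on`: left1's 'key' column is matched against right1's index.
# The default how='left' keeps every row of left1, including key 'c'.
by_column = left1.join(right1, on='key')
print(by_column)
```

The first call should give the same table as the pd.merge call with left_index=True and right_index=True shown earlier.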
================================================================================
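Concatenating along an axis:
The data is extended along one axis; for example, concatenating a 4-column array with itself horizontally gives 8 columns.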
--------------------------------------------------------------------------------
NumPy's concatenate() function

arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=0)      # stack a copy below (more rows)
np.concatenate([arr, arr], axis=1)      # append a copy to the right (more columns)
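Checking the shapes makes the difference between the two axes explicit. This is a minimal sketch; the names `stacked_rows` and `stacked_cols` are introduced here for illustration.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

stacked_rows = np.concatenate([arr, arr], axis=0)  # rows double, columns unchanged
stacked_cols = np.concatenate([arr, arr], axis=1)  # columns double, rows unchanged

print(stacked_rows.shape)  # (6, 4)
print(stacked_cols.shape)  # (3, 8)
```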
--------------------------------------------------------------------------------
The pandas counterpart: concat
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

pd.concat([s1, s2, s3])  # Series are one-dimensional, so by default they are stacked end to end (axis=0)
Output:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

pd.concat([s1, s2, s3], axis=1)  # each Series becomes a column; the indexes share no labels, so the result uses their union and fills the gaps with NaN
Output:
	 0	 1   2
a	0.0	NaN	NaN
b	1.0	NaN	NaN
c	NaN	2.0	NaN
d	NaN	3.0	NaN
e	NaN	4.0	NaN
f	NaN	NaN	5.0
g	NaN	NaN	6.0
--------------------------------------------------------------------------------
s4 = pd.concat([s1, s3])
s4 output:
a    0
b    1
f    5
g    6
dtype: int64

pd.concat([s1, s4], axis=1)   # s1 and s4 become columns 0 and 1; labels missing from s1 are filled with NaN
Output:
	0	1
a	0.0	0
b	1.0	1
f	NaN	5
g	NaN	6

pd.concat([s1, s4], axis=1, join='inner')   # keep only the labels present in both (here 'a' and 'b')

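The book also passes a join_axes argument to choose the labels of the result, but that parameter was removed in pandas 1.0, so the call below fails on current versions:
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])  # TypeError: unexpected keyword argument 'join_axes'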
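On pandas versions without join_axes, one possible substitute is to concatenate first and then call reindex with the desired labels. This is a sketch of that workaround, not the book's code; the list `wanted` is a name introduced here.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

# Labels we want in the result, in this order; 'c' and 'e' exist in neither Series.
wanted = ['a', 'c', 'b', 'e']

# Concatenate as columns, then select and reorder the rows with reindex.
result = pd.concat([s1, s4], axis=1).reindex(wanted)
print(result)
```

Labels that exist in neither input ('c' and 'e') come back as rows of NaN, and 'f' and 'g' are dropped because they are not in `wanted`.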

The function signature is:
pd.concat(
    objs: Union[Iterable[Union[ForwardRef('DataFrame'), ForwardRef('Series')]], 
          Mapping[Union[Hashable, NoneType], Union[ForwardRef('DataFrame'), ForwardRef('Series')]]],
    axis=0,
    join='outer',
    ignore_index: bool = False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity: bool = False,
    sort: bool = False,
    copy: bool = True,
) -> Union[ForwardRef('DataFrame'), ForwardRef('Series')]
--------------------------------------------------------------------------------
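pd.concat([s1, s2, s3]) stacks the values one after another, so the result no longer shows which values came from s1.
A two-level index solves this: it works like grouping, with the values from s1 in one group, those from s2 in another, and so on.
Pass the keys parameter to concat: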

res = pd.concat([s1, s3, s4], keys=['s1', 's3', 's4'])  # each object is paired with the key at the same position

res output (the type is still pandas.core.series.Series):
s1  a    0
    b    1
s3  f    5
    g    6
s4  a    0
    b    1
    f    5
    g    6
dtype: int64

res.unstack() turns it into a DataFrame
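A self-contained sketch of the unstack step: the outer index level (the keys) becomes the rows and the inner level (the original labels) becomes the columns. The name `wide` is introduced here for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

res = pd.concat([s1, s3, s4], keys=['s1', 's3', 's4'])

# Pivot the inner index level into columns; missing combinations become NaN.
wide = res.unstack()
print(type(wide).__name__)
print(wide)
```

Only the s4 row is fully populated, since s1 has no 'f'/'g' labels and s3 has no 'a'/'b' labels.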
================================================================================
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],columns=['three', 'four'])
pd.concat([df1, df2], axis=1)
Output:
	one	two	three	four
a	0	1	5.0	6.0
b	2	3	NaN	NaN
c	4	5	7.0	8.0

pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])  # keys again record which source each column came from
pd.concat({'df1': df1, 'df2': df2}, axis=1)  # equivalent: pass a dict instead of keys
Output:
  df1       df2
  one  two  three  four
a   0    1    5.0   6.0
b   2    3    NaN   NaN
c   4    5    7.0   8.0
pd.concat({'df1': df1, 'df2': df2}, axis=1, names=['level1', 'level2'])
Output:
level1  df1       df2
level2  one  two  three  four
a         0    1    5.0   6.0
b         2    3    NaN   NaN
c         4    5    7.0   8.0
--------------------------------------------------------------------------------
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

Both frames have their own default index, and concat keeps each of them, so the index labels in the result repeat:
pd.concat([df1, df2])
Output:  # the row labels repeat: 0 1 2 0 1
   a	     b	      c          	d
0	-1.575272	-1.107551	1.455479	0.834577
1	-0.250725	-0.326574	1.481303	0.654970
2	0.153223	1.489016	1.597061	1.814133
0	-0.125009	-0.378465	NaN	0.454676
1	0.098119	0.878741	NaN	0.647189

pd.concat([df1, df2], ignore_index=True)  # discard the original indexes and renumber the rows
Output:
       a	      b	        c        	d
0	-1.575272	-1.107551	1.455479	0.834577
1	-0.250725	-0.326574	1.481303	0.654970
2	0.153223	1.489016	1.597061	1.814133
3	-0.125009	-0.378465	NaN	0.454676
4	0.098119	0.878741	NaN	0.647189
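If repeated row labels should be treated as an error rather than silently kept, the verify_integrity parameter from the signature above can be used. A minimal sketch (the names `overlap_detected` and `renumbered` are introduced here for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

# verify_integrity=True raises ValueError when the combined index has duplicates.
try:
    pd.concat([df1, df2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# ignore_index=True avoids the problem by renumbering the rows 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(overlap_detected, list(renumbered.index))
```

Column 'c' exists only in df1, so the two rows that come from df2 hold NaN there in either case.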
--------------------------------------------------------------------------------