python - pandas DataFrame update/combine when indices don't align -
consider 2 dataframes store information on same characteristic of same observation, different time periods:
import pandas pd import numpy np df1 = pd.dataframe({"obs":["a","a","b","b"], "year":[1,2,1,2], "val":[3, np.nan, 3, np.nan]}) df1 out: obs val year 0 3 1 1 nan 2 2 b 3 1 3 b nan 2 df2 = pd.dataframe({"obs":["a","a","b","b"], "val":[np.nan, 4, np.nan, 4], "year":[1,2,1,2]}) df2.index = (range(5,9)) df2 out: obs val year 5 nan 1 6 4 2 7 b nan 1 8 b 4 2
now merge or combine these 2 data frames such values collected in single column, nan
in df1
replaced corresponding observation-year values df2
. can achieve doing:
merged = pd.merge(df1, df2, on=["obs", "year"], how="left") merged.loc[~np.isfinite(merged.val_x), 'val_x'] = merged[~np.isfinite(merged.val_x)].val_y
i.e. doing regular merge , replacing nan
s in 1 column values other column hand.
is there better/more succinct way of doing this? feel sort of df.combine
, df.combine_first
, df.update
i'm after, seem align on indices.
i'm going assume goal obtain merged['val_x']
, don't care other columns in merged
.
here options:
def using_merge(df1, df2): merged = pd.merge(df1, df2, on=["obs", "year"], how="left") mask = ~np.isfinite(merged.val_x) merged.loc[mask, 'val_x'] = merged.loc[mask, 'val_y'] return merged['val_x'] def using_update(df1, d2): merged = pd.merge(df1, df2, on=["obs", "year"], how="left") merged['val_y'].update(merged['val_x']) return merged['val_y'] def using_set_index(df1, df2): df1 = df1.set_index(['obs','year']) df2 = df2.set_index(['obs','year']) return df1['val'].combine_first(df2['val'])
none more succinct others. there bit of performance difference:
import numpy np import pandas pd import itertools # generate large-ish example np.random.seed(2015) n, m = 200, 200 df1 = pd.dataframe(list(it.product(np.arange(n), np.arange(m))), columns=['obs','year']) df1['val'] = np.random.choice([1,2,np.nan], size=len(df1)) df2 = pd.dataframe(list(it.product(np.arange(n), np.arange(m))), columns=['obs','year']) df2['val'] = np.random.choice([1,2,np.nan], size=len(df1)) df2.index = np.arange(len(df2)) + len(df1) m1 = using_merge(df1, df2) m2 = using_update(df1, df2) m3 = using_set_index(df1, df2) assert m3.reset_index(drop=true).equals(m1) assert m1.equals(m2)
in [158]: %timeit using_merge(df1, df2) 100 loops, best of 3: 13.6 ms per loop in [159]: %timeit using_update(df1, df2) 100 loops, best of 3: 12.3 ms per loop in [160]: %timeit using_set_index(df1, df2) 100 loops, best of 3: 8 ms per loop
so larger dataframes, pays set index, , use combine_first
.
Comments
Post a Comment