python - pandas DataFrame update/combine when indices don't align -


consider 2 dataframes store information on same characteristic of same observation, different time periods:

import pandas pd import numpy np  df1 = pd.dataframe({"obs":["a","a","b","b"],     "year":[1,2,1,2],     "val":[3, np.nan, 3, np.nan]})  df1  out:    obs  val  year 0       3     1 1     nan     2 2    b    3     1 3    b  nan     2  df2 = pd.dataframe({"obs":["a","a","b","b"],     "val":[np.nan, 4, np.nan, 4],     "year":[1,2,1,2]}) df2.index = (range(5,9))  df2  out:    obs  val  year 5     nan     1 6       4     2 7    b  nan     1 8    b    4     2 

now merge or combine these 2 data frames such values collected in single column, nan in df1 replaced corresponding observation-year values df2. can achieve doing:

merged = pd.merge(df1, df2, on=["obs", "year"], how="left") merged.loc[~np.isfinite(merged.val_x), 'val_x'] = merged[~np.isfinite(merged.val_x)].val_y 

i.e. doing regular merge , replacing nans in 1 column values other column hand.

is there better/more succinct way of doing this? feel sort of df.combine, df.combine_first, df.update i'm after, seem align on indices.

i'm going assume goal obtain merged['val_x'] , don't care other columns in merged.

here options:

def using_merge(df1, df2):     merged = pd.merge(df1, df2, on=["obs", "year"], how="left")     mask = ~np.isfinite(merged.val_x)     merged.loc[mask, 'val_x'] = merged.loc[mask, 'val_y']     return merged['val_x']  def using_update(df1, d2):     merged = pd.merge(df1, df2, on=["obs", "year"], how="left")     merged['val_y'].update(merged['val_x'])     return merged['val_y']  def using_set_index(df1, df2):     df1 = df1.set_index(['obs','year'])     df2 = df2.set_index(['obs','year'])     return df1['val'].combine_first(df2['val']) 

none more succinct others. there bit of performance difference:

import numpy np import pandas pd import itertools  # generate large-ish example np.random.seed(2015) n, m = 200, 200 df1 = pd.dataframe(list(it.product(np.arange(n), np.arange(m))),                     columns=['obs','year']) df1['val'] = np.random.choice([1,2,np.nan], size=len(df1))  df2 = pd.dataframe(list(it.product(np.arange(n), np.arange(m))),                     columns=['obs','year']) df2['val'] = np.random.choice([1,2,np.nan], size=len(df1)) df2.index = np.arange(len(df2)) + len(df1)  m1 = using_merge(df1, df2) m2 = using_update(df1, df2) m3 = using_set_index(df1, df2) assert m3.reset_index(drop=true).equals(m1) assert m1.equals(m2) 

in [158]: %timeit using_merge(df1, df2) 100 loops, best of 3: 13.6 ms per loop  in [159]: %timeit using_update(df1, df2) 100 loops, best of 3: 12.3 ms per loop  in [160]: %timeit using_set_index(df1, df2) 100 loops, best of 3: 8 ms per loop 

so larger dataframes, pays set index, , use combine_first.


Comments

Popular posts from this blog

python - pip install -U PySide error -

arrays - C++ error: a brace-enclosed initializer is not allowed here before ‘{’ token -

apache - setting document root in antoher partition on ubuntu -