I'm trying to construct a fast Pandas approach for dropping certain rows from a Dataframe when some condition is met. Specifically, I want to drop the first occurrence of some variable in the dataframe if some other value in that row is equal to 0. This is perhaps easiest explained by example:
foo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
bar = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
df = pd.DataFrame({'foo': foo, 'bar':bar})
# So df is:
idx | foo | bar
0 1 1
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 3 1
16 3 1
17 3 0
18 3 1
I want to look at the first row when the 'foo' column is a new value, then drop it from the dataframe if the 'bar' value in that row = 0.
I can find when this condition is met using groupby:
df.groupby('foo').first()
# Result:
bar
foo
1 1
2 0
3 1
So I see that I need to drop the first row when foo = 2 (i.e. just drop row with index = 11 in my original data frame). I cannot work out, however, how to use this groupby result as a mask for my original data frame, since the shapes / sizes are different.
I found a related question on groupby modifications (Drop pandas dataframe rows based on groupby() condition), but in this example they drop ALL rows when this condition is met, whereas I only want to drop the first row.
Is this possible please?
ansev :
Use Series.shift:\n\ndf.loc[~(df['foo'].ne(df['foo'].shift()) & df['bar'].eq(0))]\n\n\nor\n\ndf.loc[df.duplicated(subset = 'foo') | df['bar'].ne(0)]\n\n\n\n\nclearly much better\n\n%%timeit\ndf.loc[~(df['foo'].ne(df['foo'].shift()) & df['bar'].eq(0))]\n#970 µs ± 51.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) each)\n\n\n\n\n%%timeit\ndf.loc[df.duplicated(subset = 'foo') | df['bar'].ne(0)]\n#1.34 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n\n\n\n\n%%timeit\ndf.loc[~df.index.isin(df.drop_duplicates(subset='foo').loc[lambda x: x.bar==0].index)]\n#2.16 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n\n\nif foo is like in your example:\n\n%%timeit\ndf.loc[~(df['foo'].diff().ne(0)&df['bar'].eq(0))]\n908 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n\n\nor\n\n%%timeit\ndf.loc[df['foo'].duplicated().add(df['bar']).ne(0)]\n787 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n",
2020-02-11T00:04:12