ENH: Make DataFrame.filter accept filters in new formats #61317

Open
datapythonista opened this issue Apr 20, 2025 · 18 comments
Labels
Filters (e.g. head, tail, nth); Needs Discussion (requires discussion from core team before further action)

Comments

@datapythonista
Member

datapythonista commented Apr 20, 2025

I think it'd be very nice for users to get this working:

df.filter(df["age"] > 18)  # same as `df[df["age"] > 18]`

df.filter("age > 18")  # same as `df.query("age > 18")`, I think `.query` should be deprecated if this is implemented in `.filter`

df.filter(lambda df: df["age"] > 18)  # same as `df[df['age'].apply(lambda x: x > 18)]`, useful for method chaining
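
For reference, a minimal sketch of how each of the three proposed calls can be written in pandas today. The sample data is made up for illustration, and `.loc` with a callable is used here as a stand-in for the proposed lambda form; none of this is the proposed `filter` implementation.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [17, 42, 19]})

# Boolean mask, as in the first proposed call.
by_mask = df[df["age"] > 18]

# String expression, as in the second proposed call.
by_query = df.query("age > 18")

# Callable that receives the whole DataFrame, as in the third proposed call.
by_callable = df.loc[lambda d: d["age"] > 18]
```

All three select the same rows, which is why the question is mostly about API design rather than new functionality.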

I think implementing this is reasonably simple. The main challenge is designing the API so that filter is intuitive and still works with the current parameters, and in particular keeping backward compatibility. But personally, I think this would be so useful that it's worth finding a solution.

CC: @rhshadrach

datapythonista added the Needs Discussion and Filters labels Apr 20, 2025

@rhshadrach
Member

Partial proposal:

  • Accept items (will maybe want to rename this argument?) of type:
    • Series (will align on index)
    • Non-Series list-likes (must be same length as df)
    • strings a la query
    • UDFs to be discussed.
  • Deprecate like, regex; offer no alternatives.
  • Deprecate axis=1 but add DataFrame.select (somewhat talked about in ENH: Improve Filter function with Filter_Columns and Filter_Rows #55289).
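
A rough sketch of how the accepted input types in the list above could be dispatched. `filter_rows` is a hypothetical helper written for illustration, not a pandas API, and the choice to drop rows missing from an aligned Series is an assumption rather than something settled in this thread.

```python
import pandas as pd
from pandas.api.types import is_list_like


def filter_rows(df, cond):
    """Hypothetical helper illustrating the partial proposal; not a pandas API."""
    if isinstance(cond, str):
        # Strings are evaluated a la query.
        return df.query(cond)
    if isinstance(cond, pd.Series):
        # Series align on the index; labels missing from `cond` are
        # treated as False here (an assumption).
        aligned = cond.reindex(df.index, fill_value=False)
        return df[aligned.astype(bool)]
    if is_list_like(cond):
        # Non-Series list-likes must match the frame's length.
        if len(cond) != len(df):
            raise ValueError("list-like filter must have the same length as df")
        return df[pd.Series(list(cond), index=df.index, dtype=bool)]
    raise TypeError(f"unsupported filter type: {type(cond).__name__}")


df = pd.DataFrame({"age": [10, 20, 30]}, index=["a", "b", "c"])
from_string = filter_rows(df, "age > 18")
from_series = filter_rows(df, pd.Series([True, True], index=["c", "a"]))
from_list = filter_rows(df, [False, True, False])
```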

For UDFs, it seems to me that the usage in the OP can be readily handled by pipe. I would more expect passing a UDF to filter would operate row-wise similar to apply(..., by_row=True).

Another question is how strict we are on the values that will be filtered. Do we require these to be bool/np.bool_, or do we allow any value and have pandas internally evaluate its truthiness? I would lean toward the latter.
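
To make the two options concrete, here is a small sketch. `to_mask` and its `strict` flag are hypothetical names used only for illustration; they are not part of pandas or of any agreed design.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def to_mask(values, strict):
    """Hypothetical helper: turn filter values into a boolean mask."""
    s = pd.Series(values)
    if strict:
        # Strict option: only boolean data is accepted.
        if not is_bool_dtype(s):
            raise TypeError(f"expected boolean values, got dtype {s.dtype}")
        return s
    # Lenient option: evaluate the truthiness of each value.
    return s.map(bool).astype(bool)


lenient_numbers = to_mask([1, 0, 2], strict=False)
lenient_strings = to_mask(["", "x", "y"], strict=False)
```

Under the lenient option, integers and strings become masks by Python truthiness; under the strict option the same inputs raise a `TypeError`.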

@datapythonista
Member Author

I like the idea.

If I understand correctly, the main use of df.filter(cond) where cond is a Series will be equivalent to using df[cond] today. I think implementing the like and regex behaviors would be trivial with df.filter(df["col"].str.contains("xxx")) and same for regex, right? It does feel we're offering a very reasonable alternative.

I see your point for using .pipe to filter, and in a way kind of agree. But it feels like df.filter(lambda x: x["age"] > 18) will be, together with df.filter(df["age"] > 18), the most commonly used case by far. While df.pipe(lambda x: x[x["age"] > 18]) may seem a reasonable alternative, I think it will really make users' lives easier to support the former, as I think it's way more intuitive.
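
For comparison, a short sketch of the two ways to filter mid-chain that exist in pandas today, `.pipe` and `.loc` with a callable. The sample data and the `age_next_year` column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [17, 42, 19]})

# The intermediate frame has no name mid-chain, so the condition has to be
# expressed as a callable that receives it.
via_pipe = (
    df.assign(age_next_year=lambda d: d["age"] + 1)
      .pipe(lambda d: d[d["age_next_year"] > 18])
)

via_loc = (
    df.assign(age_next_year=lambda d: d["age"] + 1)
      .loc[lambda d: d["age_next_year"] > 18]
)
```

Both return the same rows; the proposal would let the same lambda be passed directly to `filter`.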

In any case, what you propose seems like a great improvement.

@rhshadrach
Member

rhshadrach commented Apr 22, 2025

I think implementing the like and regex behaviors would be trivial with df.filter(df["col"].str.contains("xxx")) and same for regex, right? It does feel we're offering a very reasonable alternative.

Agreed - I should have said no new alternatives. 😆

For UDFs, one reason not to have df.filter(lambda x: x["age"] > 18) operate by row is that it is effectively a transpose (x being a Series means it can only have one dtype), one of the behaviors I would love to remove from pandas across the board. Another is that agg, apply, transform all pass column (vertical) objects into the UDF. While it doesn't make sense for filter to act column-by-column, passing the entire DataFrame seems closer in behavior than operating horizontally.
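
The single-dtype point can be seen with a small example. This is only an illustration using made-up data and today's `iterrows` and `apply(..., axis=1)`, not the proposed `filter` behavior.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 15], "height": [1.8, 1.6]})

# Column-wise, each column keeps its own dtype (integer and float).
col_kinds = {name: col.dtype.kind for name, col in df.items()}

# Row-wise, each row is one Series, so the integer is upcast to float.
row_kinds = {idx: row.dtype.kind for idx, row in df.iterrows()}

# Row-wise filtering as it can be written today.
adults = df[df.apply(lambda row: row["age"] > 18, axis=1)]
```

With a string column in the mix, each row would fall back to a common dtype that can hold everything, which is the transpose-like cost described above.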

However I do not find it intuitive that in df.filter(lambda x: x["age"] > 18) the x is the same as df. I agree on the utility of having this for method chaining, but I immediately think of x as being a component (element / column / row) of df instead of the entire thing. Perhaps that's just me?

A bit of restatement of my previous post, but it seems like df.filter(lambda x: ...) acting by row provides new functionality otherwise not readily available (I think?) whereas having x be all of df is very close to duplicating pipe.

Finally, if we are to have x be the same as df in this case, what is the validation on the result? Must it be a Series with the same index as df, or are we going to allow alignment? Can users return list-likes of the same length?

Overall, I lean toward operate by-row here, but not strongly.

I think .query should be deprecated if this is implemented in .filter

I agree, but I would want the deprecation to be slow. That is, first introduce filter and change the docs to discourage the use of query. Then after 1 or 2 years, start the deprecation process.

cc @pandas-dev/pandas-core for any thoughts.

@simonjayhawkins
Member

Overall, I lean toward operate by-row here, but not strongly.

DataFrame.filter does not filter a DataFrame on its contents; the filter is applied to the labels of the index. The suggestion in the OP is to essentially add value-based conditional filtering to this method.
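
A short illustration of the current label-based behavior, using the existing `items`, `like`, `regex` and `axis` parameters. The sample data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [17, 42], "avg_score": [3.5, 4.0], "city": ["Oslo", "Lima"]},
    index=["ann", "bob"],
)

by_items = df.filter(items=["age", "city"])  # exact column labels
by_like = df.filter(like="c")                # column labels containing "c"
by_regex = df.filter(regex="^a")             # column labels matching a regex
rows_like = df.filter(like="bo", axis=0)     # same idea on the row labels
```

None of these calls look at the values in the frame; only the labels decide what is kept.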

If you operate by row, (or by column if the axis argument is retained), then if you passed a Series with the Series.name set to the index label then it would be easier to filter based on the index label and thereby potentially justify the removal of like and regex and offer no alternatives?

@datapythonista
Member Author

I see your point @rhshadrach, and I think what you propose is very reasonable and maybe even the best option in theory.

In practice, I would be very surprised if most users don't find the pyspark-like API of the function receiving the whole dataframe more intuitive. See this example in their docs:

df.filter(df.age > 3).show()

We can't compare directly with a lazy API, but I think what I propose is quite similar to this.

Also, it was discussed before about adding pandas.col("my_col") to avoid the lambda. I guess that would look like:

df.filter(pd.col('age') > 3)

Personally if filter will accept both this expression and a lambda, I think it's way more clear and intuitive that the lambda works the way I described.

Let's see what other people think, maybe what's clear and intuitive to me is not to others.

@Dr-Irv
Contributor

Dr-Irv commented Apr 23, 2025

Maybe I'm missing something, but why deprecate query()? I have LOTS of code that uses that.

Why not leave filter as is - it operates on labels - and maybe expand query() to take expressions as proposed here.

So that df.query(df["age"] > 18) and df.query("age > 18") would do the same thing

@datapythonista
Member Author

That's a reasonable option. I think filter is more clear, and is what everybody else is using. If we were to implement the API from scratch now, I think it would be the obvious choice. For backward compatibility query may be better, and we can surely consider it. But I would rather have a very long deprecation timeline than keep an API that is IMHO wrong because of a choice we made that is no longer ideal.

@rhshadrach
Member

Why not leave filter as is - it operates on labels

Because it's at odds with other DataFrame libraries.

The exceptions are Modin and dask, but I think they were designed to model the pandas API.

In addition, I would argue query is an odd choice of a name for a filtering method.

I'm fine with leaving query as it is for a long time, including indefinitely.

@Dr-Irv
Contributor

Dr-Irv commented Apr 25, 2025

Why not leave filter as is - it operates on labels

Because it's at odds with other DataFrame libraries.

But those libraries were introduced after pandas! So shouldn't THEY be modifying their APIs? (I say this somewhat facetiously)

In addition, I would argue query is an odd choice of a name for a filtering method.

Then maybe you think that SQL (Structured Query Language) should be called SFL (Structured Filtering Language)? (definitely said facetiously)

To me query is appropriate if you think of it from the perspective of how SQL works.

@datapythonista
Member Author

For a method that returns a subset of rows based on a condition, I think the standard terminology is filter. Query seems more appropriate for a more complex expression that can get data through operations that involve more than a filter. I think SQL is consistent with this, since it allows doing more than the WHERE clause. And I think query feels inappropriate. df.where would be consistent with SQL, but to me filter is clearly the right choice.

@rhshadrach
Member

rhshadrach commented Apr 25, 2025

To me query is appropriate if you think of it from the perspective of how SQL works.

Queries in SQL can do so much more than filter. DataFrame.query can only filter. I think this is supporting my contention that query is an odd name.

@simonjayhawkins
Member

Why not leave filter as is - it operates on labels - and maybe expand query() to take expressions as proposed here.

That makes sense to me as we would not be mixing label based "filtering" with value based "filtering"

I think filter is more clear, and is what everybody else is using. If we were to implement the API from scratch now, I think it would be the obvious choice.

This also makes sense to me.

So the issue is how to make this transition. If we don't mix the label based and value based "filtering" this surely makes the transition path more difficult.

I'm still not clear how we keep the current label based filtering functionality if we "Deprecate like, regex; offer no alternatives." When we deprecate we say something like "x is deprecated. use ... instead". @rhshadrach can you clarify what the ... would be?

@datapythonista
Member Author

When we deprecate we say something like "x is deprecated. use ... instead". @rhshadrach can you clarify what the ... would be?

df.filter(df["col"].index.str.contains("xxx")) (or same with square brackets)... this was discussed above in one of the many replies. Richard meant no new specific alternative, and all the existing filter method functionality is already possible (and personally I'd bet that the alternative using a boolean mask based on the index attribute may already be more popular than the filter method).
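
A minimal sketch of the boolean-mask alternatives that already work today with square brackets, compared against the current `like` and `regex` parameters. The sample data is made up, and `regex=False` is used for the `like` case because `like` is a plain substring match.

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 42, 19]}, index=["ann_x", "bob", "cy_x"])

# Equivalent of like="_x" on the row labels: plain substring match.
like_mask = df.index.str.contains("_x", regex=False)

# Equivalent of regex="^b" on the row labels: str.contains searches
# for the pattern anywhere in the label by default.
regex_mask = df.index.str.contains("^b")

via_like = df[like_mask]
via_regex = df[regex_mask]

# The current label-based calls, for comparison.
current_like = df.filter(like="_x", axis=0)
current_regex = df.filter(regex="^b", axis=0)
```

For column labels the same masks can be built from `df.columns` and applied with `df.loc[:, mask]`.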

@simonjayhawkins
Member

and personally I'd bet that the alternative using a boolean mask based on the index attribute may already be more popular than the filter method.

Yes, boolean indexing is one of the core strengths of pandas and remains the best practice for filtering data. Its clarity and explicit nature make it ideal for developers who want to see exactly which rows or columns are being selected—for example, using expressions like df[df['age'] > 18] directly leaves little room for ambiguity.

In contrast, convenience methods like DataFrame.filter and DataFrame.query were originally designed to offer syntactic sugar for specific filtering operations that might be less straightforward with boolean indexing. However, extending these methods to incorporate functionality already achievable through boolean indexing creates a duplicate API. This duplication tends to blur the clear separation of concerns: boolean indexing for explicit condition-based filtering, and the convenience methods for more specialized use cases such as label-based filtering or evaluating query expressions.

From the perspective of user-friendliness and maintainability, especially for newcomers, it is perhaps more intuitive to keep these methods unchanged. Retaining their original, focused design helps avoid confusion. New users won't have to decide between multiple approaches for the same operation, and experienced users can continue to leverage boolean indexing as a robust tool for data selection. Moreover, a stable and clear API encourages better code clarity and consistency, both of which are essential for long-term maintainability.

In summary, while extending these convenience methods might seem like a way to offer more flexibility, doing so risks introducing unnecessary redundancy and potential confusion. Maintaining the current API allows developers to choose the most appropriate filtering method, whether it's the explicit power of boolean indexing or the specialized convenience of filter and query, without overlapping functionality.

@datapythonista
Member Author

I think the existing API is already duplicated, as you mention, filter is syntactic sugar for 3 very particular use cases (I personally never used).

I don't think the square brackets is a good API for method chaining, so I'm happy with the duplication after the changes proposed here.

Also, after having used both pyspark and polars, I find the filter method with a condition one of the essential pieces of functionality of a dataframe library. If we manage to implement the syntax below, I think it'll be the most important and convenient API change to pandas since I started contributing:

df.filter(pd.col('age') > 3)

Of course others will have different points of view, but for a large part of our user base I think this would be a huge improvement. And as a first step it needs the changes proposed in this issue.

@simonjayhawkins
Member

And as a first step it needs the changes proposed in this issue.

If DataFrame.filter did not already exist and do something different it would definitely be more straightforward to implement this.

I think the existing API is already duplicated, as you mention, filter is syntactic sugar for 3 very particular use cases (I personally never used).

I don't disagree. Let me think on this some more.

I don't think the square brackets is a good API for method chaining, so I'm happy with the duplication after the changes proposed here.

noted.

@simonjayhawkins
Member

for some additional context, it seems we are covering some of the same ground as #12401

@simonjayhawkins
Member

pinging @jorisvandenbossche for input as a participant in #12401 and the follow-up open issue #26642.

I'm guessing from #26642 (comment) that @jorisvandenbossche may want to retain the syntactic sugar that .filter offers but is not averse to renaming the method.

4 participants