ENH: Make DataFrame.filter accept filters in new formats #61317
Comments
Partial proposal:
For UDFs, it seems to me that the usage in the OP can be readily handled by Another question is how strict we are on the values that will be filtered. Do we require these to be |
I like the idea. If I understand correctly, the main use of I see your point for using In any case, what you propose seems like a great improvement. |
Agreed - I should have said no new alternatives. 😆 For UDFs, one reason not to have However I do not find it intuitive that in A bit of restatement of my previous post, but it seems like Finally, if we are to have Overall, I lean toward operate by-row here, but not strongly.
I agree, but I would want the deprecation to be slow. That is, first introduce filter and change the docs to discourage the use of cc @pandas-dev/pandas-core for any thoughts.
If you operate by row, (or by column if the axis argument is retained), then if you passed a Series with the Series.name set to the index label then it would be easier to filter based on the index label and thereby potentially justify the removal of |
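To make the by-row idea concrete, here is a minimal sketch using only existing pandas calls, not the API proposed in this issue. It relies on the fact that `DataFrame.apply(..., axis=1)` hands each row to the function as a Series whose `name` is the index label. The frame contents and the `keep_row` function are made up for illustration.

```python
import pandas as pd

# Illustrative data; the labels and values are assumptions, not from the issue.
df = pd.DataFrame(
    {"age": [2, 5, 7], "height": [90, 110, 125]},
    index=["ann", "bob", "cy"],
)


def keep_row(row: pd.Series) -> bool:
    # With axis=1, `row.name` is the index label of the row being tested,
    # so one function can combine label-based and value-based conditions.
    return bool(row.name != "bob" and row["age"] > 3)


# Build a boolean mask row by row, then select with ordinary boolean indexing.
mask = df.apply(keep_row, axis=1)
by_row = df[mask]
```

A by-row function like this is easy to reason about, but it is evaluated once per row in Python, so it will generally be slower than a vectorized condition on the whole frame.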
I see your point @rhshadrach, and I think what you propose is very reasonable and maybe even the best option in theory. In practice, I would be very surprised if most users don't find the pyspark-like API, where the function receives the whole dataframe, more intuitive. See this example in their docs: df.filter(df.age > 3).show() We can't compare directly with a lazy API, but I think what I propose is quite similar to this. Also, it was discussed before about adding df.filter(pd.col('age') > 3) Personally, if filter will accept both this expression and a lambda, I think it's much clearer and more intuitive that the lambda works the way I described. Let's see what other people think; maybe what's clear and intuitive to me is not to others.
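As a rough illustration of the "function receives the whole dataframe" semantics, the sketch below emulates it with today's pandas. `filter_rows` is a hypothetical helper written for this example; it is not part of pandas and not necessarily how the proposal would be implemented.

```python
import pandas as pd

# Illustrative data; names and values are assumptions.
df = pd.DataFrame({"name": ["ann", "bob", "cy"], "age": [2, 5, 7]})


def filter_rows(frame: pd.DataFrame, predicate) -> pd.DataFrame:
    """Hypothetical helper (not pandas API).

    Calls `predicate` once with the whole frame, expects a boolean Series
    aligned with the frame's index, and keeps the rows where it is True.
    """
    mask = predicate(frame)
    return frame[mask]


# The lambda sees the entire DataFrame, so the condition stays vectorized.
older = filter_rows(df, lambda d: d["age"] > 3)
```

Under this reading, the lambda form and an expression such as the `pd.col('age') > 3` mentioned above would both describe a single column-wise condition, rather than a function called once per row.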
Maybe I'm missing something, but why deprecate Why not leave So that |
That's a reasonable option. I think filter is clearer, and it is what everybody else is using. If we were to implement the API from scratch now, I think it would be the obvious choice. For backward compatibility query may be better, and we can surely consider it. But I would rather have a very long deprecation timeline than keep an API that is, IMHO, wrong because of a choice we made that is no longer ideal.
Because it's at odds with other DataFrame libraries.
The exceptions are Modin and dask, but I think they were designed to model the pandas API. In addition, I would argue I'm fine with leaving query as it is for a long time, including indefinitely. |
But those libraries were introduced after
Then maybe you think that SQL (Structured Query Language) should be called SFL (Structured Filtering Language)? (definitely said facetiously) To me
For a method that returns a subset of rows based on a condition, I think the standard terminology is filter. Query seems more appropriate for a more complex expression that retrieves data through operations that involve more than a filter. I think SQL is consistent with this, since it allows doing more than the WHERE clause. And I think query feels inappropriate.
Queries in SQL can do so much more than filter. |
That makes sense to me as we would not be mixing label based "filtering" with value based "filtering"
This also makes sense to me. So the issue is how to make this transition. If we don't mix the label based and value based "filtering" this surely makes the transition path more difficult. I'm still not clear how we keep the current label based filtering functionality if we "Deprecate like, regex; offer no alternatives." When we deprecate we say something like "x is deprecated. use ... instead". @rhshadrach can you clarify what the ... would be? |
Yes, boolean indexing is one of the core strengths of pandas and remains the best practice for filtering data. Its clarity and explicit nature make it ideal for developers who want to see exactly which rows or columns are being selected—for example, using expressions like In contrast, convenience methods like From the perspective of user-friendliness and maintainability, especially for newcomers, it is perhaps more intuitive to keep these methods unchanged. Retaining their original, focused design helps avoid confusion. New users won't have to decide between multiple approaches for the same operation, and experienced users can continue to leverage boolean indexing as a robust tool for data selection. Moreover, a stable and clear API encourages better code clarity and consistency, both of which are essential for long-term maintainability. In summary, while extending these convenience methods might seem like a way to offer more flexibility, doing so risks introducing unnecessary redundancy and potential confusion. Maintaining the current API allows developers to choose the most appropriate filtering method—whether it's the explicit power of boolean indexing or the specialized convenience of |
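For readers less familiar with the two mechanisms being contrasted, here is a small sketch of what they do in current pandas. The data is made up; the calls (`df[mask]` and `DataFrame.filter` with `items`, `like`, `regex`) are existing pandas API.

```python
import pandas as pd

# Illustrative data; column names and values are assumptions.
df = pd.DataFrame(
    {"age": [2, 5, 7], "age_group": ["a", "b", "b"], "height": [90, 110, 125]}
)

# Boolean indexing: selects ROWS based on VALUES.
rows = df[df["age"] > 3]

# Current DataFrame.filter: selects by LABEL (columns by default for a DataFrame).
cols_items = df.filter(items=["age", "height"])  # exact labels
cols_like = df.filter(like="age")                # labels containing the substring
cols_regex = df.filter(regex="^h")               # labels matching the regex
```

The point of contention in this thread is visible here: today `filter` never looks at the values, which is why a value-based `filter` would change what the method means.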
I think the existing API is already duplicated; as you mention, filter is syntactic sugar for 3 very particular use cases (which I personally have never used). I don't think square brackets are a good API for method chaining, so I'm happy with the duplication after the changes proposed here. Also, after having used both pyspark and polars, I find the filter method with a condition to be one of the essential pieces of functionality of a dataframe library. If we manage to implement the syntax below, I think it'll be the most important and convenient API change to pandas since I started contributing: df.filter(pd.col('age') > 3) Of course others will have different points of view, but for a large part of our user base I think this would be a huge improvement. And as a first step it needs the changes proposed in this issue.
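To show why square brackets are awkward in a chain, the sketch below filters on a column that only exists mid-chain. It uses existing pandas only (`assign`, `.loc` with a callable, `query`); the data and the `age_next` column are made up for illustration.

```python
import pandas as pd

# Illustrative data; names and values are assumptions.
df = pd.DataFrame({"name": ["ann", "bob", "cy"], "age": [2, 5, 7]})

# `age_next` is created inside the chain, so `df[df["age_next"] > 3]` is not
# possible without an intermediate variable. A callable passed to .loc works.
chained_loc = (
    df.assign(age_next=lambda d: d["age"] + 1)
    .loc[lambda d: d["age_next"] > 3]
    .reset_index(drop=True)
)

# The same chain using DataFrame.query with a string expression.
chained_query = (
    df.assign(age_next=lambda d: d["age"] + 1)
    .query("age_next > 3")
    .reset_index(drop=True)
)
```

Both forms work today; the proposal in this issue would add a third spelling of the same row selection as a plain method call.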
If DataFrame.filter did not already exist and do something different it would definitely be more straightforward to implement this.
I don't disagree. Let me think on this some more.
noted. |
for some additional context, it seems we are covering some of the same ground as #12401 |
pinging @jorisvandenbossche for input as a participant in #12401 and the follow-up open issue #26642. I'm guessing from #26642 (comment) that @jorisvandenbossche may want to retain the syntactic sugar that .filter offers but is not averse to renaming the method.
I think it'd be very nice for users to get this working:
I think implementing this is reasonably simple. I think the main challenge is how to design the API so that filter can be intuitive and still work with the current parameters, and in particular how to keep backward compatibility. But personally, I think this would be so useful that it is worth finding a solution.
CC: @rhshadrach