-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
perf: improve membership check performance in column filtering #61046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pre-commit.ci autofix |
Do you have an example benchmark where this PR improves the performance of |
This optimization was flagged by a tool I’m developing, which performs code inspection to identify potential performance improvements. However, it doesn’t measure execution times, so I haven’t benchmarked the actual impact yet. From a theoretical perspective, the change makes sense since the previous implementation performed lookups in a list (O(n)) while the new approach uses a set (O(1)). Would you be able to help me create a proper benchmark for this case? |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
I'm busy with other things right now, but I'll try to create a benchmark as soon as possible. |
2977503
to
7e8c88d
Compare
Hello @mroeschke , sorry for the delay, I managed to do the benchmark and would like your opinion and review. Benchmark ExplanationThis benchmark script measures the performance of Since the change affects only specific code paths when
Benachmark resultsThis is the results in my branch:
this is the results in Main branch
The results show a performance improvement with modification, especially as the number of columns and selected percentage increase. In small datasets (10–100 columns), performance is similar between the branches, as expected the overhead of column filtering is minimal. However, as the number of columns grows (1000, 10000, 50000), the modification consistently outperforms the main branch, especially for larger filter percentage values (50%, 90%). For example:
How I ran the benchmark1. Synchronize branches
2. Create
|
Thanks @allrob23 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.