perf: improve membership check performance in column filtering #61046


Merged
merged 2 commits into pandas-dev:main from allrob23:perf-columns-set on Apr 26, 2025

Conversation

@allrob23 (Contributor) commented Mar 4, 2025

@allrob23 (Contributor, Author) commented Mar 4, 2025

pre-commit.ci autofix

@mroeschke added the Performance (memory or execution speed) and IO CSV (read_csv, to_csv) labels Mar 4, 2025
@mroeschke (Member) commented:
Do you have an example benchmark where this PR improves the performance of read_csv?

@allrob23 (Contributor, Author) commented Mar 4, 2025

This optimization was flagged by a tool I’m developing, which performs code inspection to identify potential performance improvements. However, it doesn’t measure execution times, so I haven’t benchmarked the actual impact yet.

From a theoretical perspective, the change makes sense: the previous implementation performed membership lookups in a list (O(n) per lookup), while the new approach uses a set (O(1) on average), so filtering n columns against k requested columns drops from O(n·k) to roughly O(n + k).
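
To make the pattern concrete, here is a minimal sketch (hypothetical column names; the actual pandas code paths differ, so treat this as an illustration of the idea, not the diff itself):

# Hypothetical illustration of the change, not the actual pandas diff.
header = [f"col_{i}" for i in range(10)]    # columns found in the CSV
usecols = ["col_1", "col_7", "col_9"]       # columns requested by the caller

# Before: list membership is O(len(usecols)) per column checked.
kept_before = [name for name in header if name in usecols]

# After: build a set once, then each membership check is O(1) on average.
usecols_set = set(usecols)
kept_after = [name for name in header if name in usecols_set]

assert kept_before == kept_after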

Would you be able to help me create a proper benchmark for this case?

github-actions bot commented Apr 4, 2025

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 4, 2025
@allrob23 (Contributor, Author) commented Apr 4, 2025

I'm busy with other things right now, but I'll try to create a benchmark as soon as possible.

@allrob23 (Contributor, Author) commented Apr 26, 2025

Hello @mroeschke, sorry for the delay. I managed to run the benchmark and would like your opinion and review.

Benchmark Explanation

This benchmark script measures the performance of pandas.read_csv with usecols filtering after a code modification that optimizes column selection using a set lookup.

Since the change affects only the code paths taken when usecols is provided, the benchmark was designed to (a condensed sketch follows the list):

  1. Vary the number of columns (10 to 50,000) to simulate different dataset sizes.
  2. Select different percentages of columns (1%, 10%, 50%, 90%) to trigger and measure the filtering logic.
  3. Run multiple iterations (5 by default, easily adjustable) to collect stable mean and median read times.
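
A condensed, illustrative version of the loop (the full script is in the gist linked under "How I ran the benchmark" below; the shapes and iteration counts here are placeholders):

# Condensed sketch of the benchmark; the full script lives in the gist below.
import statistics
import time

import numpy as np
import pandas as pd

def bench(n_cols, select_pct, n_rows=100, iterations=5):
    # Write a wide CSV, then time read_csv selecting a percentage of its columns.
    df = pd.DataFrame(np.random.rand(n_rows, n_cols),
                      columns=[f"col_{i}" for i in range(n_cols)])
    df.to_csv("bench_tmp.csv", index=False)
    usecols = [f"col_{i}" for i in range(int(n_cols * select_pct))]
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        pd.read_csv("bench_tmp.csv", usecols=usecols)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.median(times)

print(bench(n_cols=1000, select_pct=0.5))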

Benchmark results

These are the results on my branch (this PR):

 
 Number of Columns Select Percentage  Mean Time (s)  Median Time (s)
                10                1%         0.0008           0.0008
                10               10%         0.0008           0.0008
                10               50%         0.0013           0.0013
                10               90%         0.0018           0.0018
               100                1%         0.0009           0.0009
               100               10%         0.0019           0.0019
               100               50%         0.0062           0.0063
               100               90%         0.0106           0.0106
              1000                1%         0.0025           0.0025
              1000               10%         0.0155           0.0118
              1000               50%         0.0533           0.0527
              1000               90%         0.0958           0.0935
             10000                1%         0.0176           0.0173
             10000               10%         0.1173           0.1120
             10000               50%         0.5866           0.5692
             10000               90%         1.0830           1.0885
             50000                1%         0.0913           0.0907
             50000               10%         0.6784           0.7101
             50000               50%         3.3062           3.2970
             50000               90%         5.9495           5.8902

These are the results on the main branch:

 Number of Columns Select Percentage  Mean Time (s)  Median Time (s)
                10                1%         0.0009           0.0009
                10               10%         0.0008           0.0008
                10               50%         0.0013           0.0013
                10               90%         0.0017           0.0017
               100                1%         0.0009           0.0009
               100               10%         0.0020           0.0020
               100               50%         0.0065           0.0064
               100               90%         0.0110           0.0110
              1000                1%         0.0024           0.0024
              1000               10%         0.0160           0.0118
              1000               50%         0.0548           0.0547
              1000               90%         0.1046           0.1014
             10000                1%         0.0181           0.0174
             10000               10%         0.1259           0.1203
             10000               50%         0.8275           0.8285
             10000               90%         1.8634           1.8587
             50000                1%         0.0924           0.0917
             50000               10%         0.9305           0.9910
             50000               50%         9.2642           9.2397
             50000               90%        24.1806          24.3157

The results show a performance improvement with the modification, especially as the number of columns and the selected percentage increase.

On small datasets (10–100 columns), performance is similar between the branches; as expected, the overhead of column filtering is minimal there.

As the number of columns grows (1,000, 10,000, 50,000), however, the modification consistently outperforms the main branch, especially at larger selection percentages (50%, 90%).

For example:

  • With 10,000 columns and 90% selected, the mean time dropped from 1.86 s (main) to 1.08 s (new), a ~1.7× speedup.
  • With 50,000 columns and 90% selected, the mean time dropped from 24.18 s (main) to 5.94 s (new), a ~4× speedup.
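
A standalone membership micro-benchmark (illustrative, independent of pandas) shows why the gap widens with column count:

# Time list vs. set membership for every column name, as the filter does.
import timeit

n = 10_000
names = [f"col_{i}" for i in range(n)]
as_list = list(names)
as_set = set(names)

# List: O(n) per lookup, so roughly n*n/2 string comparisons in total.
list_time = timeit.timeit(lambda: [c in as_list for c in names], number=1)
# Set: O(1) average per lookup, so roughly n hashes in total.
set_time = timeit.timeit(lambda: [c in as_set for c in names], number=1)
print(f"list: {list_time:.3f}s  set: {set_time:.3f}s")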

How I ran the benchmark

1. Synchronize branches

git remote add allrob23-fork https://github.com/allrob23/pandas.git
git fetch allrob23-fork

2. Create the bench.py script

Get the code from this gist: https://gist.github.com/allrob23/dea86dba17bbfaa988e2b5a08b27db79

3. Run the code

python3 bench.py

4. Get the results printed at the end

5. Change to this PR branch

git checkout perf-columns-set

6. Run the code again

python3 bench.py

7. Get the results printed at the end

I ran this using Docker and built pandas from source:

# git
git clone https://github.com/pandas-dev/pandas/
cd pandas
# docker
docker build -t pandas-dev .
docker run -it --rm -v ${PWD}:/home/pandas pandas-dev
# build pandas from source (editable install)
python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true

@mroeschke mroeschke added this to the 3.0 milestone Apr 26, 2025
@mroeschke mroeschke removed the Stale label Apr 26, 2025
@mroeschke mroeschke merged commit 44c5613 into pandas-dev:main Apr 26, 2025
42 checks passed
@mroeschke (Member) commented:
Thanks @allrob23

Development

Successfully merging this pull request may close these issues.

PERF: Optimize membership check in column filtering for better performance