perf: improve membership check performance in column filtering #61046


Merged
merged 2 commits into pandas-dev:main from allrob23:perf-columns-set on Apr 26, 2025

Conversation

@allrob23 (Contributor) commented Mar 4, 2025

@allrob23 (Contributor, Author) commented Mar 4, 2025

pre-commit.ci autofix

@mroeschke added the Performance (memory or execution speed) and IO CSV (read_csv, to_csv) labels Mar 4, 2025
@mroeschke (Member) commented:
Do you have an example benchmark where this PR improves the performance of read_csv?

@allrob23 (Contributor, Author) commented Mar 4, 2025

This optimization was flagged by a tool I’m developing, which performs code inspection to identify potential performance improvements. However, it doesn’t measure execution times, so I haven’t benchmarked the actual impact yet.

From a theoretical perspective, the change makes sense: the previous implementation performed membership lookups in a list (O(n) per lookup), while the new approach uses a set (O(1) on average), so filtering n columns against k requested columns drops from O(n·k) to roughly O(n + k).
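
To make the pattern concrete, here is a minimal sketch (hypothetical column names; the actual pandas code paths differ, so treat this as an illustration of the idea, not the diff itself):

# Hypothetical illustration of the change, not the actual pandas diff.
header = [f"col_{i}" for i in range(10)]    # columns found in the CSV
usecols = ["col_1", "col_7", "col_9"]       # columns requested by the caller

# Before: list membership is O(len(usecols)) per column checked.
kept_before = [name for name in header if name in usecols]

# After: build a set once, then each membership check is O(1) on average.
usecols_set = set(usecols)
kept_after = [name for name in header if name in usecols_set]

assert kept_before == kept_after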

Would you be able to help me create a proper benchmark for this case?

github-actions bot commented Apr 4, 2025

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 4, 2025
@allrob23 (Contributor, Author) commented Apr 4, 2025

I'm busy with other things right now, but I'll try to create a benchmark as soon as possible.

@allrob23 (Contributor, Author) commented Apr 26, 2025

Hello @mroeschke, sorry for the delay. I managed to run the benchmark and would like your opinion and review.

Benchmark Explanation

This benchmark script measures the performance of pandas.read_csv with usecols filtering after a code modification that optimizes column selection using a set lookup.

Since the change affects only the code paths taken when usecols is provided, the benchmark was designed to (a condensed sketch follows the list):

  1. Vary the number of columns (10 to 50,000) to simulate different dataset sizes.
  2. Select different percentages of columns (1%, 10%, 50%, 90%) to trigger and measure the filtering logic.
  3. Run multiple iterations (5 by default, easily adjustable) to collect stable mean and median read times.
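
A condensed, illustrative version of the loop (the full script is in the gist linked under "How I ran the benchmark" below; the shapes and iteration counts here are placeholders):

# Condensed sketch of the benchmark; the full script lives in the gist below.
import statistics
import time

import numpy as np
import pandas as pd

def bench(n_cols, select_pct, n_rows=100, iterations=5):
    # Write a wide CSV, then time read_csv selecting a percentage of its columns.
    df = pd.DataFrame(np.random.rand(n_rows, n_cols),
                      columns=[f"col_{i}" for i in range(n_cols)])
    df.to_csv("bench_tmp.csv", index=False)
    usecols = [f"col_{i}" for i in range(int(n_cols * select_pct))]
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        pd.read_csv("bench_tmp.csv", usecols=usecols)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.median(times)

print(bench(n_cols=1000, select_pct=0.5))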

Benchmark results

These are the results on my branch (this PR):

 
 Number of Columns Select Percentage  Mean Time (s)  Median Time (s)
                10                1%         0.0008           0.0008
                10               10%         0.0008           0.0008
                10               50%         0.0013           0.0013
                10               90%         0.0018           0.0018
               100                1%         0.0009           0.0009
               100               10%         0.0019           0.0019
               100               50%         0.0062           0.0063
               100               90%         0.0106           0.0106
              1000                1%         0.0025           0.0025
              1000               10%         0.0155           0.0118
              1000               50%         0.0533           0.0527
              1000               90%         0.0958           0.0935
             10000                1%         0.0176           0.0173
             10000               10%         0.1173           0.1120
             10000               50%         0.5866           0.5692
             10000               90%         1.0830           1.0885
             50000                1%         0.0913           0.0907
             50000               10%         0.6784           0.7101
             50000               50%         3.3062           3.2970
             50000               90%         5.9495           5.8902

These are the results on the main branch:

 Number of Columns Select Percentage  Mean Time (s)  Median Time (s)
                10                1%         0.0009           0.0009
                10               10%         0.0008           0.0008
                10               50%         0.0013           0.0013
                10               90%         0.0017           0.0017
               100                1%         0.0009           0.0009
               100               10%         0.0020           0.0020
               100               50%         0.0065           0.0064
               100               90%         0.0110           0.0110
              1000                1%         0.0024           0.0024
              1000               10%         0.0160           0.0118
              1000               50%         0.0548           0.0547
              1000               90%         0.1046           0.1014
             10000                1%         0.0181           0.0174
             10000               10%         0.1259           0.1203
             10000               50%         0.8275           0.8285
             10000               90%         1.8634           1.8587
             50000                1%         0.0924           0.0917
             50000               10%         0.9305           0.9910
             50000               50%         9.2642           9.2397
             50000               90%        24.1806          24.3157

The results show a performance improvement with the modification, especially as the number of columns and the selected percentage increase.

On small datasets (10–100 columns), performance is similar between the branches; as expected, the overhead of column filtering is minimal there.

As the number of columns grows (1,000, 10,000, 50,000), however, the modification consistently outperforms the main branch, especially at larger selection percentages (50%, 90%).

For example:

  • With 10,000 columns and 90% selected, the mean time dropped from 1.86 s (main) to 1.08 s (new), a ~1.7× speedup.
  • With 50,000 columns and 90% selected, the mean time dropped from 24.18 s (main) to 5.94 s (new), a ~4× speedup.
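
A standalone membership micro-benchmark (illustrative, independent of pandas) shows why the gap widens with column count:

# Time list vs. set membership for every column name, as the filter does.
import timeit

n = 10_000
names = [f"col_{i}" for i in range(n)]
as_list = list(names)
as_set = set(names)

# List: O(n) per lookup, so roughly n*n/2 string comparisons in total.
list_time = timeit.timeit(lambda: [c in as_list for c in names], number=1)
# Set: O(1) average per lookup, so roughly n hashes in total.
set_time = timeit.timeit(lambda: [c in as_set for c in names], number=1)
print(f"list: {list_time:.3f}s  set: {set_time:.3f}s")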

How I ran the benchmark

1. Synchronize branches

git remote add allrob23-fork https://github.com/allrob23/pandas.git
git fetch allrob23-fork

2. Create the bench.py script

Get the code from this gist: https://gist.github.com/allrob23/dea86dba17bbfaa988e2b5a08b27db79

3. Run the code

python3 bench.py

4. Get the results printed at the end

5. Change to this PR branch

git checkout perf-columns-set

6. Run the code again

python3 bench.py

7. Get the results printed at the end

I ran this using Docker and built pandas from source:

# git
git clone https://github.com/pandas-dev/pandas/
cd pandas
# docker
docker build -t pandas-dev .
docker run -it --rm -v ${PWD}:/home/pandas pandas-dev
# build pandas from source (editable install)
python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true

@mroeschke mroeschke added this to the 3.0 milestone Apr 26, 2025
@mroeschke mroeschke removed the Stale label Apr 26, 2025
@mroeschke mroeschke merged commit 44c5613 into pandas-dev:main Apr 26, 2025
42 checks passed
@mroeschke (Member) commented:
Thanks @allrob23

Development

Successfully merging this pull request may close these issues.

PERF: Optimize membership check in column filtering for better performance