ENH: grep-like select columns of a DataFrame by a part of their names #61319

kirisakow · 2025-04-20T14:10:10Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I wish I could grep-like select columns of a DataFrame by a part of their names, and return a subset of the original DataFrame containing only columns that match the substring.

Feature Description

from typing import List, Union
import pandas as pd


class ExtendedDF(pd.DataFrame):
    @property
    def _constructor(self):
        return ExtendedDF

    def select_by_substr(self, substr: Union[str, List[str]], *, ignore_case: bool = True) -> Union[pd.DataFrame, 'ExtendedDF']:
        """grep-like select columns of a DataFrame by a part of their names.

        Args:
            substr (Union[str, List[str]]): a string or a list of strings to be used as search patterns
            ignore_case (bool): if True (default), ignore search pattern case

        Returns:
            pd.DataFrame: a subset of the original DataFrame containing only columns that match the substring

        Usage:

        Consider two DataFrame objects extracted from two different sources, and thus varying in their column names:

        ```py
        df1 = pd.DataFrame({
            'Distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'Distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'Mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'Mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'avg_diff': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'med_diff': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df2 = pd.DataFrame({
            'distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'diff_avg': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'diff_med': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df1 = ExtendedDF(df1)
        df2 = ExtendedDF(df2)
        ```
        ```
        df1
           Distance  Distance_percent  Mixed  Mixed_percent  avg_diff  med_diff
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0

        df2
           distance  distance_percent  mixed  mixed_percent  diff_avg  diff_med
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0
        ```

        As an analyst, I need to inspect which column is which between the two datasets:

        (a) either by defining a single string search pattern (`ignore_case=True` by default):

        ```py
        cols_to_select = 'diff'
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                         0         1         2        3        4
        avg_diff  121146.9  293246.3  212169.9  41299.8  29438.3
        med_diff   17544.0    1657.0   55205.0  95750.0   2577.0

        df2:
                         0         1         2        3        4
        diff_avg  121146.9  293246.3  212169.9  41299.8  29438.3
        diff_med   17544.0    1657.0   55205.0  95750.0   2577.0
        ```

        (b) or by defining a list of string search patterns (`ignore_case=True` by default):

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                              0    1     2     3       4
        Mixed             921.0  0.0  52.0   5.0     0.0
        Distance          105.0  0.0   4.0   1.0  1241.0
        Mixed_percent       1.9  0.0  67.5  55.6     0.0
        Distance_percent    0.2  0.0   5.2  11.1    92.8

        df2:
                              0    1     2     3       4
        mixed_percent       1.9  0.0  67.5  55.6     0.0
        mixed             921.0  0.0  52.0   5.0     0.0
        distance          105.0  0.0   4.0   1.0  1241.0
        distance_percent    0.2  0.0   5.2  11.1    92.8
        ```

        (c) or, same as (b) but with an explicit `ignore_case=False`:

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        ```
        ```
        df1:
                           0    1     2     3    4
        Mixed_percent    1.9  0.0  67.5  55.6  0.0
        Mixed          921.0  0.0  52.0   5.0  0.0

        df2:
                              0    1    2     3       4
        distance_percent    0.2  0.0  5.2  11.1    92.8
        distance          105.0  0.0  4.0   1.0  1241.0
        ```
        """
        substr = [substr] if isinstance(substr, str) else substr
        if ignore_case:
            selected_cols = [col_name for col_name in self.columns for s in substr if s.casefold() in col_name.casefold()]
        else:
            selected_cols = [col_name for col_name in self.columns for s in substr if s in col_name]
        selected_cols = list(set(selected_cols))
        return self[selected_cols]

Alternative Solutions

Idk

Additional Context

No response

snitish · 2025-04-21T01:58:55Z

@kirisakow does DataFrame.filter satisfy your use case?

df_new = df.filter(like=substr)

Delengowski · 2025-04-21T02:38:52Z

Filter is one way to do this. You could also pass a callable that takes in a dataframe, and returns the columns you care about, to __getitem__.

mroeschke · 2025-04-21T15:49:27Z

Thanks for the suggestion but per the 2 suggestions above, there are more primitive APIs available that would allow you to compose the functionality in this request, so I don't think this would require a dedicated API for this. Thanks but closing

kirisakow added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 20, 2025

snitish added Closing Candidate May be closeable, needs more eyeballs Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2025

snitish added Needs Info Clarification about behavior needed to assess issue and removed Closing Candidate May be closeable, needs more eyeballs labels Apr 21, 2025

mroeschke closed this as completed Apr 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: grep-like select columns of a DataFrame by a part of their names #61319

ENH: grep-like select columns of a DataFrame by a part of their names #61319

kirisakow commented Apr 20, 2025

snitish commented Apr 21, 2025 •

edited

Loading

Delengowski commented Apr 21, 2025

mroeschke commented Apr 21, 2025

ENH: grep-like select columns of a DataFrame by a part of their names #61319

ENH: grep-like select columns of a DataFrame by a part of their names #61319

Comments

kirisakow commented Apr 20, 2025

Feature Type

Problem Description

Feature Description

Alternative Solutions

Additional Context

snitish commented Apr 21, 2025 • edited Loading

Delengowski commented Apr 21, 2025

mroeschke commented Apr 21, 2025

snitish commented Apr 21, 2025 •

edited

Loading