Skip to content

ENH: grep-like select columns of a DataFrame by a part of their names #61319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 of 3 tasks
kirisakow opened this issue Apr 20, 2025 · 3 comments
Closed
1 of 3 tasks
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Needs Info Clarification about behavior needed to assess issue

Comments

@kirisakow
Copy link
Contributor

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could grep-like select columns of a DataFrame by a part of their names, and return a subset of the original DataFrame containing only columns that match the substring.

Feature Description

from typing import List, Union
import pandas as pd


class ExtendedDF(pd.DataFrame):
    @property
    def _constructor(self):
        return ExtendedDF

    def select_by_substr(self, substr: Union[str, List[str]], *, ignore_case: bool = True) -> Union[pd.DataFrame, 'ExtendedDF']:
        """grep-like select columns of a DataFrame by a part of their names.

        Args:
            substr (Union[str, List[str]]): a string or a list of strings to be used as search patterns
            ignore_case (bool): if True (default), ignore search pattern case

        Returns:
            pd.DataFrame: a subset of the original DataFrame containing only columns that match the substring

        Usage:

        Consider two DataFrame objects extracted from two different sources, and thus varying in their column names:

        ```py
        df1 = pd.DataFrame({
            'Distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'Distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'Mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'Mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'avg_diff': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'med_diff': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df2 = pd.DataFrame({
            'distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'diff_avg': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'diff_med': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df1 = ExtendedDF(df1)
        df2 = ExtendedDF(df2)
        ```
        ```
        df1
           Distance  Distance_percent  Mixed  Mixed_percent  avg_diff  med_diff
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0

        df2
           distance  distance_percent  mixed  mixed_percent  diff_avg  diff_med
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0
        ```

        As an analyst, I need to inspect which column is which between the two datasets:

        (a) either by defining a single string search pattern (`ignore_case=True` by default):

        ```py
        cols_to_select = 'diff'
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                         0         1         2        3        4
        avg_diff  121146.9  293246.3  212169.9  41299.8  29438.3
        med_diff   17544.0    1657.0   55205.0  95750.0   2577.0

        df2:
                         0         1         2        3        4
        diff_avg  121146.9  293246.3  212169.9  41299.8  29438.3
        diff_med   17544.0    1657.0   55205.0  95750.0   2577.0
        ```

        (b) or by defining a list of string search patterns (`ignore_case=True` by default):

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                              0    1     2     3       4
        Mixed             921.0  0.0  52.0   5.0     0.0
        Distance          105.0  0.0   4.0   1.0  1241.0
        Mixed_percent       1.9  0.0  67.5  55.6     0.0
        Distance_percent    0.2  0.0   5.2  11.1    92.8

        df2:
                              0    1     2     3       4
        mixed_percent       1.9  0.0  67.5  55.6     0.0
        mixed             921.0  0.0  52.0   5.0     0.0
        distance          105.0  0.0   4.0   1.0  1241.0
        distance_percent    0.2  0.0   5.2  11.1    92.8
        ```

        (c) or, same as (b) but with an explicit `ignore_case=False`:

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        ```
        ```
        df1:
                           0    1     2     3    4
        Mixed_percent    1.9  0.0  67.5  55.6  0.0
        Mixed          921.0  0.0  52.0   5.0  0.0

        df2:
                              0    1    2     3       4
        distance_percent    0.2  0.0  5.2  11.1    92.8
        distance          105.0  0.0  4.0   1.0  1241.0
        ```
        """
        substr = [substr] if isinstance(substr, str) else substr
        if ignore_case:
            selected_cols = [col_name for col_name in self.columns for s in substr if s.casefold() in col_name.casefold()]
        else:
            selected_cols = [col_name for col_name in self.columns for s in substr if s in col_name]
        selected_cols = list(set(selected_cols))
        return self[selected_cols]

Alternative Solutions

Idk

Additional Context

No response

@kirisakow kirisakow added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 20, 2025
@snitish snitish added Closing Candidate May be closeable, needs more eyeballs Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2025
@snitish
Copy link
Member

snitish commented Apr 21, 2025

@kirisakow does DataFrame.filter satisfy your use case?

df_new = df.filter(like=substr)

@snitish snitish added Needs Info Clarification about behavior needed to assess issue and removed Closing Candidate May be closeable, needs more eyeballs labels Apr 21, 2025
@Delengowski
Copy link
Contributor

Filter is one way to do this. You could also pass a callable that takes in a dataframe, and returns the columns you care about, to __getitem__.

@mroeschke
Copy link
Member

Thanks for the suggestion but per the 2 suggestions above, there are more primitive APIs available that would allow you to compose the functionality in this request, so I don't think this would require a dedicated API for this. Thanks but closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Needs Info Clarification about behavior needed to assess issue
Projects
None yet
4 participants