Update groupby().first() documentation to clarify behavior with missing data (#27578) #61345

Draft: wants to merge 2 commits into base: main
42 changes: 30 additions & 12 deletions pandas/core/groupby/groupby.py
@@ -3232,9 +3232,12 @@ def first(
self, numeric_only: bool = False, min_count: int = -1, skipna: bool = True
) -> NDFrameT:
"""
- Compute the first entry of each column within each group.
+ Compute the first non-null entry of each column within each group.
Review comment (Member): The pandas documentation consistently uses NA rather than null. Can you use NA throughout?

Review comment (Member): In addition, this line is incorrect, because skipna=False can be passed.
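
The point about skipna can be checked with a minimal sketch. The frame below is illustrative data chosen for this note (an assumption, since the docstring's own frame is collapsed in this view), and the signature check is only a guard for pandas versions whose first() does not yet accept the skipna keyword shown in the signature above.

```python
import inspect

import numpy as np
import pandas as pd

# Illustrative data (an assumption, not the docstring's own frame):
# group 1 has NA in the first row of column B.
df = pd.DataFrame({"A": [1, 1, 3], "B": [np.nan, 5.0, 6.0], "C": [1, 2, 3]})
grouped = df.groupby("A")

# Default (skipna=True): NA is skipped, so group 1 reports B == 5.0.
skipping = grouped.first()
print(skipping)

# skipna=False keeps the first entry of each column even when it is NA.
# Guarded because older pandas releases do not accept the keyword.
supports_skipna = "skipna" in inspect.signature(grouped.first).parameters
if supports_skipna:
    not_skipping = grouped.first(skipna=False)
    print(not_skipping)
```

With skipna=False the summary "first non-null entry" no longer holds, which is why the one-line description needs to stay neutral about NA handling.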


- Defaults to skipping NA elements.
+ This method operates column-wise, returning the first non-null value
+ in each column for every group. Unlike `nth(0)`, which returns the
+ first row (even if it contains nulls), `first()` skips over NA/null
+ values in each column independently.

Parameters
----------
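
The column-wise behaviour that the revised description in this hunk is trying to convey can be sketched as follows. The data is illustrative (an assumption, not the docstring's frame), and head(1) stands in for the row-wise side of the comparison.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): group 1 has NA in B on its first row.
df = pd.DataFrame({"A": [1, 1, 3], "B": [np.nan, 5.0, 6.0], "C": [1, 2, 3]})

# Column-wise: with the default skipna=True, each column contributes its own
# first non-NA value, so the result for group 1 takes B from the second row
# and C from the first row.
per_column = df.groupby("A").first()
print(per_column)

# Row-wise: head(1) returns the first row of each group unchanged, NA
# included, and keeps the original index and the grouping column.
per_row = df.groupby("A").head(1)
print(per_row)
```

The result of first() for group 1 therefore need not correspond to any single row of the input.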
@@ -3251,15 +3254,15 @@ def first(
Returns
-------
Series or DataFrame
- First values within each group.
+ First non-null values within each group, selected independently per column.

See Also
--------
- DataFrame.groupby : Apply a function groupby to each row or column of a
- DataFrame.
- core.groupby.DataFrameGroupBy.last : Compute the last non-null entry
- of each column.
- core.groupby.DataFrameGroupBy.nth : Take the nth row from each group.
+ DataFrame.groupby : Group DataFrame using a mapper or by a Series of columns.
+ Series.groupby : Group Series using a mapper or by a Series of values.
+ GroupBy.nth : Take the nth row from each group.
Review comment (Member): GroupBy is not public. Can you use DataFrameGroupBy instead?

+ GroupBy.head : Return the first `n` rows from each group.
+ GroupBy.last : Compute the last non-null entry of each column.

Examples
--------
@@ -3272,23 +3275,38 @@ def first(
... )
... )
>>> df["D"] = pd.to_datetime(df["D"])

>>> df.groupby("A").first()
- B C D
+ B C D
A
1 5.0 1 2000-03-11
3 6.0 3 2000-03-13

+ >>> df.groupby("A").nth(0)
+ B C D
+ A
+ 1 NaN 1 2000-03-11
+ 3 6.0 3 2000-03-13
Review comment (Member) on lines +3285 to +3289: I think this page should only include documentation on first. Can you remove the use of other methods?


+ >>> df.groupby("A").head(1)
+ A B C D
+ 0 1 NaN 1 2000-03-11
+ 2 3 6.0 3 2000-03-13

>>> df.groupby("A").first(min_count=2)
B C D
A
- 1 NaN 1.0 2000-03-11
- 3 NaN NaN NaT
+ 1 NaN 1.0 2000-03-11
+ 3 NaN NaN NaT
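
The min_count=2 output above may be easier to follow next to the per-group counts of non-NA values. The sketch below uses illustrative data (an assumption, with the datetime column left out) rather than the docstring's collapsed frame.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not the docstring's own frame).
df = pd.DataFrame({"A": [1, 1, 3], "B": [np.nan, 5.0, 6.0], "C": [1, 2, 3]})

# Non-NA counts per group and column:
#   group 1 -> B: 1, C: 2      group 3 -> B: 1, C: 1
counts = df.groupby("A").count()
print(counts)

# With min_count=2, only cells backed by at least two non-NA values keep
# a result; every other cell becomes NA.
result = df.groupby("A").first(min_count=2)
print(result)
```

Only group 1 in column C has two non-NA values, so it is the only cell that keeps a value; the column holds floats in the result because the other cell is NaN.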

>>> df.groupby("A").first(numeric_only=True)
- B C
+ B C
A
1 5.0 1
3 6.0 3
"""


def first_compat(obj: NDFrameT):
def first(x: Series):
"""Helper function for first item that isn't NA."""
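
The body of first_compat is collapsed in this view, so the following is only a sketch of what a "first item that isn't NA" helper could look like. The name first_non_na and its body are hypothetical and are not taken from pandas.

```python
import numpy as np
import pandas as pd


def first_non_na(x: pd.Series):
    """Return the first non-NA item of x, or NaN if there is none.

    Hypothetical helper written for illustration; not pandas' own code.
    """
    non_na = x[x.notna()]
    if non_na.empty:
        return np.nan
    return non_na.iloc[0]


# Illustrative data (an assumption, not the docstring's own frame).
df = pd.DataFrame({"A": [1, 1, 3], "B": [np.nan, 5.0, 6.0], "C": [1, 2, 3]})

# Applying the helper to each column of each group should agree with the
# default behaviour of first() on this frame.
via_helper = df.groupby("A").agg(first_non_na)
via_first = df.groupby("A").first()
print(via_helper)
print(via_first)
```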