The available samplers follow the scikit-learn API, using the base estimator and adding sampling functionality via the fit_resample method:
Estimator: | The base object; implements a fit method to learn from data: estimator = obj.fit(data, targets) |
---|---|
Resampler: | To resample a data set, each sampler implements a fit_resample method: data_resampled, targets_resampled = obj.fit_resample(data, targets) |
Imbalanced-learn samplers accept the same inputs as scikit-learn estimators:
- data, 2-dimensional array-like structures, such as:
  - Python lists of lists :class:`list`,
  - Numpy arrays :class:`numpy.ndarray`,
  - Pandas dataframes :class:`pandas.DataFrame`,
  - Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;
- targets, 1-dimensional array-like structures, such as:
  - Numpy arrays :class:`numpy.ndarray`,
  - Pandas series :class:`pandas.Series`.
The output will be of the following type:
- data_resampled, 2-dimensional array-like structures, such as:
  - Numpy arrays :class:`numpy.ndarray`,
  - Pandas dataframes :class:`pandas.DataFrame`,
  - Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;
- targets_resampled, 1-dimensional array-like structures, such as:
  - Numpy arrays :class:`numpy.ndarray`,
  - Pandas series :class:`pandas.Series`.
Pandas in/out
Unlike scikit-learn, imbalanced-learn provides support for pandas in/out. Therefore, providing a dataframe as input produces a dataframe as output.
Sparse input
For sparse input, the data is converted to the Compressed Sparse Row (CSR) representation (see scipy.sparse.csr_matrix) before being fed to the sampler. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
The learning and prediction phases of machine learning algorithms can be impacted by the issue of imbalanced datasets. This imbalance refers to the difference in the number of samples across different classes. We demonstrate the effect of training a Logistic Regression classifier with varying levels of class balancing by adjusting the class weights.

As expected, the decision function of the Logistic Regression classifier varies significantly depending on how imbalanced the data is. With a greater imbalance ratio, the decision function tends to favour the class with the larger number of samples, usually referred to as the majority class.
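One way to reproduce this kind of experiment is sketched below. It is not the code behind the original demonstration: the dataset parameters, the three imbalance levels, and the summary statistic (the fraction of samples predicted as the minority class) are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# For each imbalance level, record the true and predicted minority fractions.
results = {}
for minority_weight in (0.5, 0.1, 0.01):
    data, targets = make_classification(
        n_samples=5000,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=[1 - minority_weight, minority_weight],
        class_sep=0.8,
        random_state=0,
    )
    clf = LogisticRegression().fit(data, targets)
    predicted = clf.predict(data)
    results[minority_weight] = {
        "true_minority_fraction": float((targets == 1).mean()),
        "predicted_minority_fraction": float((predicted == 1).mean()),
    }

for weight, summary in results.items():
    print(weight, summary)
```

Comparing the predicted minority fraction with the true one at each level gives a rough numerical counterpart to the shift in the decision function described above.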