[WIP] [Initiative]: Batch Initiative #1629

Open
catblade opened this issue Apr 22, 2025 · 0 comments
catblade commented Apr 22, 2025

Name

Batch

Short description

Enhance collaboration among projects, improve interoperability, and empower users to use batch systems efficiently in cloud-native environments.

Responsible group

TAG Infrastructure, TAG Workloads Foundation

Does the initiative belong to a subproject?

No

Subproject name

No response

Primary contact

Alex Scammon ([email protected])

Additional contacts

Marlow Warnicke ([email protected])
Abishek Malvankar ([email protected])

Initiative description

In scope

To reduce fragmentation in the Kubernetes (k8s) batch ecosystem, bring together leads and users from internal and external projects and user groups (CNCF TAGs and k8s sub-projects focused on batch-related features such as topology-aware scheduling) to gather requirements, validate designs, and encourage reuse of core k8s APIs.

The initiative will recommend enhancements in the following areas:

  • Additions to the batch API group (which currently includes the Job and CronJob resources) that benefit batch use cases such as HPC, AI/ML, data analytics, and CI.
  • Primitives for job-level queueing, not limited to the k8s Job resource. Long-term, this could include multi-cluster support.
  • Primitives to control and maximize utilization of resources in fixed-size clusters (on-prem) and elastic clusters (cloud).
  • Benchmarking models for Batch systems
  • Data Locality
  • User Stories
  • Scheduling support for specialized hardware (Accelerators, NUMA, Networking, etc.)

Out of scope

  • Addition of new API kinds that serve a specialized type of workload. The focus should be on general APIs that specialized controllers can build on top of.
  • Uses of the batch APIs to support serving workloads (e.g., backups, upgrades, migrations). These can be served by existing SIGs.
  • Proposals that duplicate the functionality of core Kubernetes components (job-controller, kube-scheduler, cluster-autoscaler).
  • Job workflows or pipelines. Mature third-party frameworks already serve these use cases with the current Kubernetes primitives, although additional primitives to support these frameworks could be in scope.

Deliverable(s) or exit criteria

  • Maintaining a landscape document for currently available projects (already published; relocated and maintained)
  • Data Locality project: deliverables TBD, but something that helps in this space (already in process)
  • Benchmarking suite for Batch systems (already in process)
  • Published user stories document for Batch systems (already in process)