test: error handle, state mgmt, backoff, timeouts #1546

Draft: wants to merge 2 commits into base: develop

Conversation

samrose
Collaborator

@samrose samrose commented Apr 14, 2025

What kind of change does this PR introduce?

EC2 Test Resilience Improvements

Retry Wrapper Function

  • Added retry_with_backoff decorator that implements exponential backoff
  • Configurable retry attempts, delays, and exception types
  • Proper logging of retry attempts and failures
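
A minimal sketch of what a decorator like this could look like. It is not the PR's actual code: the parameter names (`max_attempts`, `initial_delay`, `max_delay`, `exceptions`, `sleep`) and the doubling factor are assumptions made for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(max_attempts=3, initial_delay=1.0, max_delay=30.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the delay after each failure.

    All parameter names are hypothetical; `sleep` is injectable so the
    decorator can be tested without real waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == max_attempts:
                        # Out of attempts: log and re-raise the last error.
                        logger.error("%s failed after %d attempts: %s",
                                     func.__name__, attempt, exc)
                        raise
                    logger.warning("%s attempt %d/%d failed: %s; retrying in %.1fs",
                                   func.__name__, attempt, max_attempts, exc, delay)
                    sleep(delay)
                    # Exponential growth, capped at max_delay.
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

Restricting `exceptions` to the errors that are actually transient keeps real bugs from being retried and hidden.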

Error Handling and Logging

  • Added comprehensive error handling throughout the code
  • Improved logging with detailed messages and error context
  • Added proper exception handling for AWS API calls

Instance State Management

  • Added wait_for_instance_running function with retries
  • Added proper state validation before proceeding
  • Added timeout for instance state transitions
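
As a rough illustration of state polling with a timeout, the sketch below waits until an instance reports `running`. The signature, defaults, and the choice to fail fast on stopped or terminated states are assumptions; only the `describe_instances` call and its response shape come from the boto3 EC2 client.

```python
import time


def wait_for_instance_running(ec2_client, instance_id, timeout=300.0,
                              poll_interval=5.0, sleep=time.sleep,
                              clock=time.monotonic):
    """Poll EC2 until the instance is running, or raise on timeout.

    `ec2_client` is expected to behave like a boto3 EC2 client; the other
    parameters are hypothetical and exist to make the loop testable.
    """
    deadline = clock() + timeout
    while True:
        response = ec2_client.describe_instances(InstanceIds=[instance_id])
        state = response["Reservations"][0]["Instances"][0]["State"]["Name"]
        if state == "running":
            return state
        # These states will never transition to "running" on their own.
        if state in ("shutting-down", "terminated", "stopping", "stopped"):
            raise RuntimeError(f"instance {instance_id} entered state {state!r}")
        if clock() >= deadline:
            raise TimeoutError(
                f"instance {instance_id} still {state!r} after {timeout}s")
        sleep(poll_interval)
```

boto3 also ships a built-in waiter, `ec2_client.get_waiter("instance_running")`, which may be enough when custom state validation is not needed.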

Backoff Strategy

  • Implemented exponential backoff in the retry decorator
  • Configurable initial delay and maximum delay
  • Proper sleep intervals between retries
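
To make the sleep schedule concrete, the helper below computes the delay before each retry. The function name and the factor of 2 are assumptions; the PR description only states that the backoff is exponential with a configurable initial and maximum delay.

```python
def backoff_delays(initial_delay, max_delay, retries):
    """Return the delay before each retry: initial_delay * 2**n, capped at max_delay."""
    return [min(initial_delay * 2 ** n, max_delay) for n in range(retries)]


print(backoff_delays(1, 30, 6))  # [1, 2, 4, 8, 16, 30]
```

The cap matters for long retry sequences: without it, the sixth delay here would be 32 seconds and would keep doubling.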

Resource Validation

  • Added validate_aws_resources function to check security groups and IAM roles
  • Validates resources before instance creation
  • Provides clear error messages for validation failures
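
A sketch of pre-flight validation along these lines, assuming boto3-style EC2 and IAM clients are passed in. The signature and error wording are hypothetical; `describe_security_groups` and `get_role` are the boto3 calls being relied on. A real implementation would probably catch `botocore.exceptions.ClientError` rather than the broad `Exception` used here to keep the example dependency-free.

```python
def validate_aws_resources(ec2_client, iam_client, security_group_name, role_name):
    """Check that the security group and IAM role exist before launching.

    Collects every problem and raises a single ValueError listing all of
    them, so one run reports all missing resources.
    """
    errors = []

    try:
        groups = ec2_client.describe_security_groups(GroupNames=[security_group_name])
        if not groups.get("SecurityGroups"):
            errors.append(f"security group {security_group_name!r} not found")
    except Exception as exc:  # assumption: narrow this to ClientError in real code
        errors.append(f"security group {security_group_name!r} lookup failed: {exc}")

    try:
        iam_client.get_role(RoleName=role_name)
    except Exception as exc:  # assumption: narrow this to ClientError in real code
        errors.append(f"IAM role {role_name!r} lookup failed: {exc}")

    if errors:
        raise ValueError("AWS resource validation failed: " + "; ".join(errors))
```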

Simplified Startup

  • Broke down the instance creation process into smaller, focused functions
  • Each function has a single responsibility
  • Better error isolation and handling

AWS API Timeouts

  • Added proper timeouts for SSH connections
  • Added timeout for health checks
  • Added timeout for instance state transitions
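
One way to bound the wait for SSH is to probe the TCP port with both a per-attempt timeout and an overall deadline. This is a sketch using only the standard library; the function name and defaults are assumptions, and the PR may well apply the timeout at the SSH client level instead.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, connect_timeout=5.0, poll_interval=1.0):
    """Return True once host:port accepts a TCP connection, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # connect_timeout bounds a single attempt; `timeout` bounds the whole wait.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)
```

Keeping the two timeouts separate avoids a single hung connection attempt consuming the whole budget.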

Robust Health Checks

  • Improved health check system with proper error handling
  • Added timeout for health checks
  • Better logging of health check failures
  • Separate function for checking individual services
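
The split between a per-service check and an overall timed loop could look roughly like this. Both function names and the `is_running` callable are hypothetical; with testinfra, the callable might be something like `lambda name: host.service(name).is_running`.

```python
import logging
import time

logger = logging.getLogger(__name__)


def check_service(is_running, name):
    """Check one service; an exception counts as a failure and is logged."""
    try:
        healthy = bool(is_running(name))
    except Exception as exc:
        logger.warning("health check for %s raised: %s", name, exc)
        return False
    if not healthy:
        logger.warning("service %s is not running", name)
    return healthy


def wait_for_healthy(is_running, services, timeout=120.0, poll_interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until every service is healthy, or raise TimeoutError naming the failures."""
    deadline = clock() + timeout
    while True:
        failing = [name for name in services if not check_service(is_running, name)]
        if not failing:
            return
        if clock() >= deadline:
            raise TimeoutError(f"services not healthy after {timeout}s: {failing}")
        sleep(poll_interval)
```

Naming the failing services in the exception makes a CI timeout much easier to diagnose than a bare "health check failed".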

Cleanup Code

  • Added proper cleanup in finally block
  • Ensures instance termination even on failures
  • Logs cleanup failures
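
The cleanup pattern described above can be sketched as follows, assuming a boto3-style EC2 client. `run_with_cleanup` and `test_body` are hypothetical names; in a pytest suite the same idea would more likely live in a fixture with `yield` and a `finally` block.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_cleanup(ec2_client, instance_id, test_body):
    """Run test_body(instance_id) and always attempt to terminate the instance."""
    try:
        return test_body(instance_id)
    finally:
        try:
            ec2_client.terminate_instances(InstanceIds=[instance_id])
            logger.info("terminated instance %s", instance_id)
        except Exception:
            # Log and swallow, so a cleanup failure never masks the test's own error.
            logger.exception("failed to terminate instance %s", instance_id)
```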

Detailed Logging

  • Added comprehensive logging throughout
  • Logs all major operations and state transitions
  • Logs errors with proper context
  • Helps diagnose failures

@samrose samrose requested review from a team as code owners April 14, 2025 18:01
@steve-chavez
Member

Will this help the sporadic timeouts we get from time to time on the testinfra CI job?

@samrose samrose marked this pull request as draft April 14, 2025 20:11
@samrose
Collaborator Author

samrose commented Apr 14, 2025

> Will this help the sporadic timeouts we get from time to time on the testinfra CI job?

Steve, yes, that is what I am trying to target. I am going to hold off on this until I finish #1547, since that will let me iterate on it locally.
