Persistent Data

Locations where we store data.

The Machine database is a simple PostgreSQL instance storing metadata about source runs over time, such as timing, status, connections to batches, and links to results files on S3.

Database tables:

  1. Processing results of single sources, including sample data and output CSVs, are added to the runs table.
  2. Groups of runs resulting from GitHub events sent to the Webhook are added to the jobs table.
  3. Groups of runs periodically enqueued as a batch are added to the sets table.
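As a rough illustration of how these tables relate, here is a minimal sketch that lists the runs belonging to one set using psycopg2. The connection string and column names (source_path, status, set_id) are assumptions for illustration, not the actual schema.

```python
# Minimal sketch, assuming a local "machine" database and illustrative
# column names (id, source_path, status, set_id); the real schema may differ.
import psycopg2

with psycopg2.connect('postgresql:///machine') as conn:
    with conn.cursor() as cur:
        # Runs belonging to one periodically-enqueued batch (the sets table).
        cur.execute('''
            SELECT runs.id, runs.source_path, runs.status
              FROM runs JOIN sets ON sets.id = runs.set_id
             WHERE sets.id = %s
             ORDER BY runs.id''', (1234,))
        for run_id, source_path, status in cur.fetchall():
            print(run_id, source_path, status)
```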

Queue:

The queue is used to schedule runs for Worker instances, and its size is used to grow and shrink the Worker pool. The queue is generally empty, and is used only to hold temporary data while runs are being scheduled. We use PQ to implement the queue in Python; queue data lives in the same PostgreSQL database as the tables above, but is treated as separate. A minimal usage sketch follows the list of queues below.

There are four queues:

  1. The tasks queue contains new runs to be handled.
  2. The done queue contains completed runs waiting to be acknowledged.
  3. The due queue contains delayed runs that may have exceeded their allotted time.
  4. The heartbeat queue contains pings from active workers.
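A minimal sketch of how the tasks and done queues might be used with PQ; the connection string, payload fields, and worker logic are illustrative assumptions, not the actual Machine code.

```python
# Minimal sketch using the PQ library; the DSN and payload fields are
# illustrative assumptions, and the queue table is assumed to exist already
# (pq.create() would set it up on a fresh database).
import psycopg2
from pq import PQ

conn = psycopg2.connect('postgresql:///machine')
pq = PQ(conn)

# Enqueue a new run for a worker to pick up.
tasks = pq['tasks']
tasks.put({'source': 'us/ca/san_francisco.json', 'run_id': 1234})

# A worker dequeues the run, processes it, then reports on the done queue.
task = tasks.get()
if task is not None:
    print('processing', task.data['source'])
    pq['done'].put({'run_id': task.data['run_id'], 'status': 'success'})
```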

S3 and Mapbox:

We use the S3 bucket data.openaddresses.io to store new and historical data.

  • S3 access is handled via the Boto library.
  • Boto expects current AWS credentials in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
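As a minimal sketch, an upload to that bucket with Boto might look like the following, assuming the two environment variables above are set; the key path and local filename are illustrative assumptions.

```python
# Minimal sketch using Boto (version 2); credentials are read from the
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
# The key path and local filename are illustrative assumptions.
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('data.openaddresses.io')

key = bucket.new_key('runs/1234/us/ca/san_francisco.zip')
key.set_contents_from_filename('out.zip', policy='public-read')
print(key.generate_url(expires_in=0, query_auth=False))
```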

We use the Mapbox API account open-addresses to store a tiled dot map with worldwide locations of address points.

  • Uploads are handled via the Boto3 library, using credentials granted by the Mapbox API.
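A rough sketch of that flow, assuming the Mapbox Uploads API: request temporary S3 credentials from Mapbox, stage the file with Boto3 using those credentials, then ask Mapbox to create the upload. The token variable, tileset id, and local file path are illustrative assumptions.

```python
# Minimal sketch of a Mapbox upload; the token variable, tileset id, and
# local file path are illustrative assumptions.
import os
import boto3
import requests

token = os.environ['MAPBOX_ACCESS_TOKEN']
base = 'https://api.mapbox.com/uploads/v1/open-addresses'

# 1. Ask Mapbox for temporary credentials to its S3 staging bucket.
creds = requests.post('{}/credentials?access_token={}'.format(base, token)).json()

# 2. Stage the prepared dot-map tiles on S3 using those credentials.
s3 = boto3.client('s3',
                  aws_access_key_id=creds['accessKeyId'],
                  aws_secret_access_key=creds['secretAccessKey'],
                  aws_session_token=creds['sessionToken'])
s3.upload_file('dotmap.mbtiles', creds['bucket'], creds['key'])

# 3. Tell Mapbox to create the tileset upload from the staged file.
upload = requests.post('{}?access_token={}'.format(base, token), json={
    'url': creds['url'],
    'tileset': 'open-addresses.dotmap',
}).json()
print(upload.get('id'), upload.get('complete'))
```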