Locations where we store data.
The Machine database is a simple PostgreSQL instance storing metadata about source runs over time, such as timing, status, connections to batches, and links to result files on S3.
Database tables:

- Processing results of single sources, including sample data and output CSVs, are added to the `runs` table.
- Groups of `runs` resulting from GitHub events sent to Webhook are added to the `jobs` table.
- Groups of `runs` periodically enqueued as a batch are added to the `sets` table.
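For orientation, here is a minimal sketch of reading run metadata with psycopg2. The connection string is a placeholder (see the public URL below), and the column and join names (`status`, `job_id`) are illustrative assumptions; the authoritative definitions live in `openaddr/ci/schema.pgsql`.

```python
import psycopg2

# Placeholder connection string; real credentials are not shown here.
conn = psycopg2.connect('postgres://user:pass@machine-db.openaddresses.io/machine')

with conn.cursor() as db:
    # Hypothetical query: the ten most recent runs and the job each belongs
    # to. Column and join names are assumptions, not the actual schema.
    db.execute('''SELECT runs.id, runs.status, jobs.id
                  FROM runs LEFT JOIN jobs ON jobs.id = runs.job_id
                  ORDER BY runs.id DESC LIMIT 10''')

    for run_id, status, job_id in db.fetchall():
        print(run_id, status, job_id)
```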
Other information:

- The complete schema can be found in `openaddr/ci/schema.pgsql` and in `openaddr/ci/coverage/schema.pgsql`.
- Public URL at `machine-db.openaddresses.io`.
- Lives on an RDS `db.t2.micro` instance.
- Two weeks of nightly backups are kept.
The queue is used to schedule runs for Worker instances, and its size is used to grow and shrink the Worker pool. The queue is generally empty and holds only temporary data for scheduling runs. We use the PQ library to implement the queue in Python; a sketch of its basic API follows the queue list below. Queue data is stored in the same PostgreSQL database as the tables above, but treated as logically separate.
There are four queues:

- The `tasks` queue contains new runs to be handled.
- The `done` queue contains complete runs to be recognized.
- The `due` queue contains delayed runs that may have gone overtime.
- The `heartbeat` queue contains pings from active workers.
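The PQ library exposes these database-backed queues through a small dictionary-like API. The sketch below shows the basic put/get cycle against the `tasks` queue; the connection string and the payload shape are illustrative, not the actual Machine task format.

```python
import psycopg2
from pq import PQ

# Placeholder connection string; the queues live in the same database.
conn = psycopg2.connect('postgres://user:pass@machine-db.openaddresses.io/machine')
pq = PQ(conn)

tasks = pq['tasks']

# Enqueue a new run for a Worker to pick up (illustrative payload).
tasks.put({'source': 'us/ca/san_francisco.json'})

# A Worker pulls the next task; get() may return None when the queue is empty.
task = tasks.get()
if task is not None:
    print(task.data)
```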
Other information:

- Database details are re-used, with the identical `machine-db.openaddresses.io` public URL.
- Queue metrics in CloudWatch are kept up-to-date by the dequeuer.
- Queue-length CloudWatch alarms determine the size of the Worker pool.
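As a rough illustration of the dequeuer's metric publishing, this boto3 sketch pushes a queue-length value to CloudWatch. The namespace and metric name here are assumptions, not the ones the real dequeuer uses.

```python
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
queue_length = 42  # would come from counting rows in the tasks queue

cloudwatch.put_metric_data(
    Namespace='OpenAddresses/Machine',   # assumed namespace
    MetricData=[{
        'MetricName': 'TasksQueueLength',  # assumed metric name
        'Value': queue_length,
        'Unit': 'Count',
    }],
)
```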
We use the S3 bucket `data.openaddresses.io` to store new and historical data.
- S3 access is handled via the Boto library.
- Boto expects current AWS credentials in the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.
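A minimal sketch of that Boto (v2) pattern: credentials are picked up from the environment variables above, so none appear in code. The object key is a hypothetical example, not a path the Machine actually writes.

```python
from boto.s3.connection import S3Connection

conn = S3Connection()  # credentials come from the environment variables
bucket = conn.get_bucket('data.openaddresses.io')

# Hypothetical object key for a run's output archive.
key = bucket.new_key('runs/12345/us/ca/san_francisco.zip')
key.set_contents_from_filename('san_francisco.zip')
```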
We use the Mapbox API account `open-addresses` to store a tiled dot map with worldwide locations of address points.
- Uploads are handled via the Boto3 library, using credentials granted by the Mapbox API.
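One plausible shape of that flow, based on Mapbox's public Uploads API: request temporary S3 credentials from Mapbox, then hand them to Boto3 for the upload. The token variable and local filename are placeholders.

```python
import os
import boto3
import requests

token = os.environ['MAPBOX_ACCESS_TOKEN']  # assumed variable name

# Ask the Mapbox Uploads API for temporary S3 staging credentials.
resp = requests.post(
    'https://api.mapbox.com/uploads/v1/open-addresses/credentials',
    params={'access_token': token},
)
creds = resp.json()

# Hand the granted credentials to Boto3 and upload to the staging location.
s3 = boto3.client(
    's3',
    aws_access_key_id=creds['accessKeyId'],
    aws_secret_access_key=creds['secretAccessKey'],
    aws_session_token=creds['sessionToken'],
)
s3.upload_file('dotmap.mbtiles', creds['bucket'], creds['key'])
```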