Dshackle Archive is a tool that efficiently extracts blockchain data in JSON format and archives it into Avro files for scalable analysis using traditional Big Data tools. It supports Bitcoin and Ethereum-compatible blockchains and can operate in both batch mode for historical data and streaming mode for real-time data extraction.
Dshackle Archive copies JSON data from a blockchain to a plain-file archive (i.e., it performs the Extraction step of ETL). The archive consists of Avro files which contain blocks and transaction details fetched from blockchain nodes via their API. The Avro files contain the JSON responses as is.
Features:

- Extracts data from Bitcoin and Ethereum-compatible blockchains
- Produces data in Avro format, where one file can cover a range of blocks or each block can be written to a separate file
- Archival can be scaled across multiple blockchain nodes by using the Dshackle Load Balancer
- Runs in:
  - batch archive mode, when historical data is archived,
  - or streaming mode, when only fresh data is added to the archive
- Archives to:
  - Filesystem
  - S3
- Produces notifications to:
  - Filesystem
  - Apache Pulsar
The general idea is to use Dshackle Archive to copy data from a blockchain to plain files, keeping the data structures as is, and then use traditional Big Data tools (Spark, Beam, etc.) to analyse the data in a scalable way.
Note: It takes several weeks to archive the whole blockchain. It's a very IO-intensive operation, but it can be scaled by using multiple nodes.
Note: It uses the Dshackle protocol to connect to the blockchain, not plain HTTP/JSON RPC, so a running Dshackle instance is required.
Usage: dshackle-archive [OPTIONS] --blockchain <BLOCKCHAIN> --connection <HOST:PORT> <COMMAND>

Arguments:
  <COMMAND>  [possible values: stream, fix, verify, archive, compact]

Options:
  -b, --blockchain <BLOCKCHAIN>            Blockchain
      --dryRun                             Do not modify the storage
  -c, --connection <HOST:PORT>             Connection (host:port)
      --connection.notls                   Disable TLS
      --parallel <PARALLEL>                How many requests to make in parallel. Range: 1..512 [default: 8]
      --notify.dir <NOTIFY_DIR>            Write notifications as JSON lines to the specified dir in a file <dshackle-archive-%STARTTIME.jsonl>
      --notify.pulsar.topic <PULSAR_TOPIC> Send notifications as JSON to Pulsar on the specified topic (notify.pulsar.url must be specified)
      --notify.pulsar.url <PULSAR_URL>     Send notifications as JSON to Pulsar at the specified URL (notify.pulsar.topic must be specified)
      --auth.aws.accessKey <ACCESS_KEY>    AWS / S3 Access Key
      --auth.aws.secretKey <SECRET_KEY>    AWS / S3 Secret Key
      --aws.endpoint <ENDPOINT>            AWS / S3 endpoint url instead of the default one
      --aws.region <REGION>                AWS / S3 region ID to use for requests
      --aws.s3.pathStyle                   Enable S3 Path Style access (default is false). Use this flag for a non-AWS service
      --aws.trustTls                       Trust any TLS certificate for AWS / S3 (default is false)
  -d, --dir <DIR>                          Target directory
      --continue                           Continue from the last file if set
      --tail <TAIL>                        Ensure that the latest T blocks are archived (with `fix` command)
  -r, --range <RANGE>                      Blocks Range (`N...M`)
      --rangeChunk <RANGE_CHUNK>           Range chunk size (default 1000)
  -h, --help                               Print help
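For example, a minimal dry run against a local Dshackle instance could look like this (the host, port, target directory, and blockchain code are placeholders; adjust them to your setup):

    dshackle-archive --blockchain ETH --connection localhost:2449 --dir ./archive --dryRun archive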
- archive - the main operation, copies data from a blockchain to the archive
- stream - append fresh blocks one by one to the archive
- compact - merge individual block files into larger range files
- fix - fix the archive by making new archives for missing chunks
- verify - verify that archive files contain the required data and delete incomplete files
The archive command is the main operation; it copies historical data from a blockchain to the archive.
The data is copied in ranges, and with the default range of 1000 blocks it produces two files per range: one for the blocks in that range, and another with all transactions in all blocks of that range.
See Archive Format.
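For example, an archival run over an explicit range of historical blocks might look like this (a sketch; the blockchain code, connection address, and paths are illustrative):

    dshackle-archive -b ETH -c dshackle.example.com:2449 -d /data/eth-archive -r 10000000...11000000 --rangeChunk 1000 archive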
The stream command continuously appends fresh blocks one by one to the archive. In addition to the copying, Dshackle Archive can be configured to notify an external system about new blocks in the archive.
Note that in streaming mode the archives are written on a per-block basis, i.e., each block comes as a separate pair of files: one for the block itself, and another for all transactions in that block. To merge the individual files into larger ranges use the compact command.
To notify an external system, there are the following options:

- --notify.dir - write notifications as JSON lines to the specified dir in a file <dshackle-archive-%STARTTIME.jsonl>
- --notify.pubsub - send notifications as JSON to the specified Google Pubsub topic
- --notify.pulsar.url + --notify.pulsar.topic - send notifications as JSON to the specified Apache Pulsar topic
See Notification format.
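A streaming run with filesystem notifications could be started like this (a sketch; host, directories, and the blockchain code are assumptions for illustration):

    dshackle-archive -b ETH -c dshackle.example.com:2449 -d /data/eth-archive --continue --notify.dir /data/notifications stream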
The fix command repairs the archive by checking for missing blocks and, if any are found, creating new archives for them.
Because Dshackle Archive copies and stores data as JSON responses from blockchain nodes, the resulting archive is much larger than the node's database, which keeps data in a compact format. Snappy compression is used for the Avro files, which gives a good compression ratio, but the resulting archive is still large.
Average size of a 1000-block range (w/o expensive JSON such as stateDiff and trace):

- ~300 MB for Ethereum
- ~400 MB for Bitcoin
And the whole archive (w/o expensive JSON such as stateDiff and trace):

- ~2.5 TB for Ethereum
- ~1.9 TB for Bitcoin
- Avro structure and Java stubs: https://github.com/emeraldpay/dshackle-archive-avro
- Dshackle load balancer: https://github.com/emeraldpay/dshackle
- First you need to archive the historical data, which may take several weeks depending on how many nodes you have and how fast they are.
- After finishing the initial archive, you run in Streaming mode, which appends new blocks to the archive as they are mined.
- Periodically (e.g., once a day) you run Compaction to merge individual block files into larger range files (see the example after this list).
- Also periodically (e.g., once a day) you run a pair of Verify and Fix commands to ensure the integrity of the archive.
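For example, the daily compaction mentioned above could be run as (paths, host, and the blockchain code are placeholders):

    dshackle-archive -b ETH -c dshackle.example.com:2449 -d /data/eth-archive compact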
Dshackle requires compatibility only on the JSON RPC level, so technically it can work with any blockchain that uses a similar API, i.e., it's compatible with all major blockchains, including Bitcoin, Ethereum, Binance Smart Chain, Polygon, etc.
It uses the Dshackle protocol to connect to the blockchain, not plain HTTP/JSON RPC, so a running Dshackle instance is required.
Dshackle is a Load Balancer for Blockchain APIs, and it can route requests to multiple nodes, which scales up the archival throughput.
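If your Dshackle instance balances requests across several nodes, you can raise the request parallelism accordingly; a sketch with illustrative values (--parallel accepts 1..512, default 8):

    dshackle-archive -b ETH -c dshackle.example.com:2449 -d /data/eth-archive --parallel 64 archive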
Dshackle Archive provides two commands to ensure the integrity of the archive:

- first you run the verify command, which checks the archive and deletes incomplete or corrupted files
- then you run the fix command, which copies the data again for the blocks deleted in the previous step

You can schedule these commands to run periodically, e.g. once a day.
To avoid scanning the whole archive every time, you can specify a range to check, e.g. --tail 1000.
This option makes it verify/fix only the last 1000 blocks, i.e., counting backward from the current head block.
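For example, a scheduled integrity check over the most recent blocks could be sketched as (host, paths, and the blockchain code are illustrative):

    dshackle-archive -b ETH -c dshackle.example.com:2449 -d /data/eth-archive --tail 1000 verify
    dshackle-archive -b ETH -c dshackle.example.com:2449 -d /data/eth-archive --tail 1000 fix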
For complete descriptions, schemas, and libraries to access the Avro files please refer to https://github.com/emeraldpay/dshackle-archive-avro
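For a quick manual inspection you can also dump records from an archive file as JSON, e.g. with the Apache Avro avro-tools jar (a sketch; the tool version is arbitrary and the file name is taken from the notification example later in this document):

    java -jar avro-tools-1.11.3.jar tojson 014813875.txes.avro | head -n 1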
Block files contain records with the following fields:

- blockchainType - type of blockchain, as a definition of which fields to expect. One of ETHEREUM or BITCOIN
- blockchainId - actual blockchain id (ETH, BTC, etc)
- archiveTimestamp - when the archive record was created. Milliseconds since epoch
- height - block height
- blockId - block hash
- timestamp - block timestamp. Milliseconds since epoch
- parentId - parent block hash
- json - JSON response for that block
Ethereum block records additionally contain:

- unclesCount - number of uncles for the current block
- uncle0Json - JSON for the first uncle (eth_getUncleByBlockHashAndIndex(0))
- uncle1Json - JSON for the second uncle (eth_getUncleByBlockHashAndIndex(1))

Bitcoin block records have no additional fields.
Transaction files contain records with the following fields:

- blockchainType - type of blockchain, as a definition of which fields to expect. One of ETHEREUM or BITCOIN
- blockchainId - blockchain id (ETH, BTC, etc)
- archiveTimestamp - when the archive record was created. Milliseconds since epoch
- height - block height
- blockId - block hash
- timestamp - block timestamp. Milliseconds since epoch
- index - index of the transaction in the block
- txid - hash or transaction id of the transaction
- json - JSON response for that transaction
- raw - raw bytes of the transaction
Ethereum transaction records additionally contain:

- from - from address
- to - to address
- receiptJson - JSON response for eth_getTransactionReceipt
- traceJson - JSON response for trace_replayTransaction(trace)
- stateDiffJson - JSON response for trace_replayTransaction(stateDiff)

Bitcoin transaction records have no additional fields.
An example notification message:

{
  "version": "https://schema.emrld.io/dshackle-archive/notify",
  "ts": "2022-05-20T23:14:24.481327Z",
  "blockchain": "ETH",
  "type": "transactions",
  "run": "stream",
  "heightStart": 14813875,
  "heightEnd": 14813875,
  "location": "gs://my-bucket/blockchain-archive/eth/014000000/014813000/014813875.txes.avro"
}
- version - id of the current JSON format
- ts - timestamp of the archive event
- blockchain - blockchain
- type - type of file (transactions or blocks)
- run - mode in which Dshackle Archive is run (archive, stream, copy or compact)
- heightStart and heightEnd - range of blocks in the archived files
- location - a URL to the archived file
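For example, a downstream job consuming filesystem notifications (written with --notify.dir) could follow the file and extract the archive locations with standard tools; a minimal sketch, assuming jq is installed and using a glob for the %STARTTIME part of the file name:

    tail -n +1 -f dshackle-archive-*.jsonl | jq -r .location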
Want to support the project, prioritize a specific feature, or get commercial help with using Dshackle in your project? Please contact [email protected] to discuss the possibility.
Copyright 2025 EmeraldPay
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.