Releases: dathere/qsv
4.0.0
[4.0.0] - 2025-04-13
Highlights:
This is a major release with numerous improvements!
- qsv can now read more file formats by leveraging the Polars engine:
  - Arrow/IPC, Avro, Parquet, JSON (JSON array) and JSONL
  - Automatic decompression support for compressed CSV file dialects (csv, tsv/tab & ssv) using gzip (.gz), zlib (.zlib), zstd (.zst) compression formats. (e.g. data.csv.gz, data.tsv.zst, data.ssv.zlib)

    qsv lens data.csv.gz
    qsv sample 1000 data.parquet | qsv stats | qsv lens
    qsv frequency data.tab.zlib | qsv lens
    qsv search Waldo data.ssv.zst | qsv table
    qsv select 2-5 data.jsonl | qsv lens

- New `geoconvert` command for converting spatial formats (GeoJSON and SHP) to CSV:

    # convert TX_cities.geojson to CSV, filter out the geometry column and browse with lens
    qsv geoconvert TX_cities.geojson geojson csv | qsv select '!geometry' | qsv lens

- Enhanced `split` command with new `--filter` option:
  - Similar to GNU split `--filter`
  - Spawns a subprocess for each chunk

    # split data.csv into outdir, each chunk having 100,000 rows, gzip compressing each chunk
    qsv split --size 100000 outdir data.csv --filter 'gzip $FILE'

- Expanded `to` command:
  - added LibreOffice/OpenOffice Calc (ODS) support
  - re-enabled `parquet` generation now that it's using Arrow instead of DuckDB (which made for very long compiles)
- New `uniqueCombinedWith` JSON Schema custom keyword in `validate` command:
  - Allows validating uniqueness across multiple columns
  - Useful for composite key validation
- QSV_DOTENV_PATH now supports the sentinel value "<NONE>" to disable dotenv processing altogether.
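
To make the composite key idea behind `uniqueCombinedWith` concrete, here is a minimal Python sketch of a uniqueness check across several columns. It is an illustration only, not qsv's implementation: the function name `find_duplicate_keys` and the sample data are hypothetical, and it does not show the keyword's JSON Schema syntax.

```python
import csv
import io


def find_duplicate_keys(rows, key_columns):
    """Return composite keys that occur more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for row in rows:
        # The composite key is the tuple of values from all key columns.
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            if key not in duplicates:
                duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


# Hypothetical data: neither column is unique on its own,
# but the pair (order_id, line_no) is expected to be.
SAMPLE = """order_id,line_no,sku
1001,1,A
1001,2,B
1002,1,A
1001,2,C
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(find_duplicate_keys(rows, ["order_id", "line_no"]))  # [('1001', '2')]
```

Adding a third column to the key (here `sku`) makes every row unique again, which is why the choice of key columns matters.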
Added
- `geoconvert`: new command to convert spatial formats to CSV by @rzmk in #2681 & #2688
- `split`: add `--filter` options #2660
- `sqlp`: add decimal type support #2646
- `to`: add back `to` parquet support #2665
- feat: Extended auto decompression support. In addition to snappy auto-decompression, auto-decompress CSV dialects (tsv/tab & ssv files) using gzip, zlib and zstd compression formats #2671
- `to`: add ODS support #2674
- `validate`: add uniqueCombinedWith custom JSON Schema Validation keyword #2636
- feat: `prompt` add file formats supported to dialog box filter when polars feature is enabled #2667
- feat: add `QSV_POLARS_FLOAT_PRECISION` env var #2678
- `tests`: add tests for https://100.dathere.com/lessons/3 by @rzmk in #2638
Changed
- `qsvdp` binary variant can now use the `geocode` & `geoconvert` commands 50f0046
- `geocode` feature now gates the `geocode` & `geoconvert` command 9d046e8
- `stats`: made stdin handling more robust by adding delimiter inferencing ddecd98
- feat: setting QSV_DOTENV_PATH to sentinel value "<NONE>" disables dotenv processing #2684
- refactor: polars special formats support #2683
- `contrib(completions)`: update completions to v3.3.0 by @rzmk in #2626
- `contrib(completions)`: update completions for qsv v4.0.0 by @rzmk in #2677
- deps: bump polars to 0.46.0 at py-1.27.1 tag #2675 and e5d29d7
- build(deps): bump actions/setup-python from 5.4.0 to 5.5.0 by @dependabot in #2627
- build(deps): bump arboard from 3.4.1 to 3.5.0 by @dependabot in #2653
- build(deps): bump chrono-tz from 0.10.2 to 0.10.3 by @dependabot in #2623
- build(deps): bump crossbeam-channel from 0.5.14 to 0.5.15 by @dependabot in #2672
- build(deps): bump csvs_convert from 0.11.0 to 0.11.1 by @dependabot in #2686
- build(deps): bump data-encoding from 2.8.0 to 2.9.0 by @dependabot in #2685
- build(deps): bump flate2 from 1.1.0 to 1.1.1 by @dependabot in #2649
- build(deps): bump flexi_logger from 0.29.8 to 0.30.0 by @dependabot in #2650
- build(deps): bump flexi_logger from 0.30.0 to 0.30.1 by @dependabot in #2651
- build(deps): bump governor from 0.8.1 to 0.9.0 by @dependabot in #2625
- build(deps): bump governor from 0.9.0 to 0.10.0 by @dependabot in #2631
- build(deps): bump jsonschema from 0.29.0 to 0.29.1 by @dependabot in #2635
- build(deps): bump log from 0.4.26 to 0.4.27 by @dependabot in #2622
- build(deps): bump mimalloc from 0.1.44 to 0.1.45 by @dependabot in #2652
- build(deps): bump minijinja from 2.8.0 to 2.9.0 by @dependabot in #2643
- build(deps): bump minijinja-contrib from 2.8.0 to 2.9.0 by @dependabot in #2642
- build(deps): bump pyo3 from 0.24.0 to 0.24.1 by @dependabot in #2645
- build(deps): bump qsv-dateparser from 0.12.1 to 0.13.0 by @dependabot in #2639
- build(deps): bump qsv-sniffer from 0.10.3 to 0.11.0 by @dependabot in #2640
- build(deps): bump redis from 0.29.2 to 0.29.4 by @dependabot in #2663
- build(deps): bump redis from 0.29.4 to 0.29.5 by @dependabot in #2666
- build(deps): bump smallvec from 1.14.0 to 1.15.0 by @dependabot in #2656
- build(deps): bump sysinfo from 0.34.0 to 0.34.1 by @dependabot in #2637
- build(deps): bump sysinfo from 0.34.1 to 0.34.2 by @dependabot in #2648
- build(deps): bump titlecase from 3.4.0 to 3.5.0 by @dependabot in #2669
- build(deps): bump tokio from 1.44.1 to 1.44.2 by @dependabot in #2662
- applied select clippy lint suggestions
- bumped indirect dependencies to latest version
Fixed
- fix: `select` panic when idx is out of bounds #2670
- fix: correct link to qsv-dateparser accepted date formats #2632
- fix: reset SIGPIPE handling #2664
- docs: fix typo it's -> its by @rzmk in #2680
Full Changelog: 3.3.0...4.0.0
3.3.0
[3.3.0] - 2025-03-23
Highlights:
- `stats` got another round of improvements:
  - boolean inferencing is now configurable!
    Before, it was limited to a simple, English-centric heuristic:
    - When a column's cardinality is 2; and the 2 values' first characters are `0/1`, `t/f` or `y/n` case-insensitive, the data type of the column is inferred as boolean
    - With the new `--boolean-patterns <arg>` option, we can now specify arbitrary `true_pattern:false_pattern` pattern pairs. Each pattern can be a string of one or more characters, case-insensitive. If a pattern ends with "*", it is treated as a prefix.
      For example, `t*:f*` matches "true", "Truthy" and "T" as boolean true so long as the corresponding false pattern (e.g. "Fake", "False", "f") is also matched. Bear in mind that the cardinality still needs to be 2, so multiple matches on the same column on different patterns will disqualify the field as boolean if cardinality > 2 (e.g. if a column's domain is "True", "truthy" and "False", it doesn't qualify as its cardinality is 3. On the other hand, if it's "True", "true", "False", "false", "FALSE", it still qualifies as they resolve to just "true/false" case-insensitive).
      For backwards compatibility, the default true/false pairs are `1:0,t*:f*,y*:n*`.
  - percentiles can now be computed!
    By enabling the `--percentiles` flag, `stats` will now return the 5th, 10th, 40th, 60th, 90th and 95th percentile by default using the nearest-rank method for all numeric and date/datetime columns. The returned percentiles can be configured to return different percentiles using the `--percentile-list <arg>` option.
    Note that the method for computing quartiles (Method 3) is basically a specialized implementation of the nearest rank method for q1 (25th), q2 (50th or median) and q3 (75th percentile), thus the choice of non-overlapping defaults for `--percentile-list`.
- `frequency`: now uses `qsv-stats` 0.32.0, which uses the more memory-efficient, often faster `foldhash` crate
- in the same vein, by replacing `ahash` with `foldhash` suite-wide, qsv got a lot more memory-efficient and often faster when doing hash lookups
- `sample`: "streaming" bernoulli sampling now works for any remotely hosted CSVs with servers that support chunked downloads, without requiring range request support.
- we're now using the latest Polars engine - v0.46.0 at the py-1.26.0 tag.
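
The two `stats` additions described above (pattern-based boolean inference and nearest-rank percentiles) can be sketched in a few lines of Python. This is a reading of the rules as stated in these notes, not the qsv-stats code: the function names are hypothetical, distinct values are compared case-insensitively as the notes describe, and the `o*:n*` pair is an invented example.

```python
import math


def parse_patterns(spec):
    """Split "1:0,t*:f*" into [("1", "0"), ("t*", "f*")], lowercased."""
    return [tuple(pair.lower().split(":", 1)) for pair in spec.split(",")]


def matches(pattern, value):
    """A trailing "*" makes the pattern a prefix; otherwise match exactly."""
    value = value.lower()
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def is_boolean_column(values, spec="1:0,t*:f*,y*:n*"):
    """True if there are exactly 2 case-insensitive distinct values and one
    pattern pair matches one of them as true and the other as false."""
    distinct = {v.lower() for v in values}
    if len(distinct) != 2:
        return False
    a, b = distinct
    for true_pat, false_pat in parse_patterns(spec):
        if (matches(true_pat, a) and matches(false_pat, b)) or (
            matches(true_pat, b) and matches(false_pat, a)
        ):
            return True
    return False


def nearest_rank_percentile(sorted_values, p):
    """Nearest-rank: the value at 1-based rank ceil(p / 100 * n).
    Expects a non-empty, ascending list and an integer percentile p."""
    n = len(sorted_values)
    rank = max(1, math.ceil(p * n / 100))
    return sorted_values[rank - 1]


print(is_boolean_column(["True", "true", "False", "false", "FALSE"]))  # True
print(is_boolean_column(["True", "truthy", "False"]))                  # False
print(is_boolean_column(["Oui", "Non"], spec="o*:n*"))                 # True

data = [15, 20, 35, 40, 50]
print([nearest_rank_percentile(data, p) for p in (5, 10, 40, 60, 90, 95)])
# [15, 15, 20, 35, 50, 50]
```

Because nearest-rank always returns a value that is present in the data, it needs no interpolation, which is what lets it apply to date/datetime columns as well as numeric ones.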
Added
Changed
- refactor: replace ahash with faster foldhash #2619
- replace std `assert_eq!` macro with `similar_asserts::assert_eq!` macro for easier debugging #2605
- deps: bump polars to 0.46.0 at py-1.25.2 tag #2604
- deps: bump Polars to v0.46.0 at py-1.26.0 tag #2621
- build(deps): bump actix-web from 4.9.0 to 4.10.2 by @dependabot in #2591
- build(deps): bump indexmap from 2.7.1 to 2.8.0 by @dependabot in #2592
- build(deps): bump mimalloc from 0.1.43 to 0.1.44 by @dependabot in #2608
- build(deps): bump qsv-stats from 0.30.0 to 0.31.0 by @dependabot in #2603
- build(deps): bump qsv-stats from 0.31.0 to 0.32.0 by @dependabot in #2620
- build(deps): bump reqwest from 0.12.12 to 0.12.13 by @dependabot in #2593
- build(deps): bump reqwest from 0.12.13 to 0.12.14 by @dependabot in #2596
- build(deps): bump reqwest from 0.12.14 to 0.12.15 by @dependabot in #2609
- build(deps): bump rfd from 0.15.2 to 0.15.3 by @dependabot in #2597
- build(deps): bump rust_decimal from 1.37.0 to 1.37.1 by @dependabot in #2616
- build(deps): bump simd-json from 0.14.3 to 0.15.0 by @dependabot in #2615
- build(deps): bump tempfile from 3.18.0 to 3.19.0 by @dependabot in #2602
- build(deps): bump tempfile from 3.19.0 to 3.19.1 by @dependabot in #2612
- build(deps): bump uuid from 1.15.1 to 1.16.0 by @dependabot in #2601
- build(deps): bump zip from 2.2.3 to 2.4.1 by @dependabot in #2607
- apply select clippy lint suggestions
- bumped indirect dependencies to latest version
- set Rust nightly to 2025-03-07, the same version Polars uses 17f6bdb
Fixed
- updated lock file, primarily to fix CVE-2025-29787 e44e5df
- `luau`: fix flaky register_lookup_table CI test that only intermittently fails in Windows by using buffered writer in lookup `write_cache_file` helper f494b46
- `sample`: refactor "streaming" Bernoulli sampling, so it actually works without requiring range requests support #2600
Full Changelog: 3.2.0...3.3.0
3.2.0
[3.2.0] - 2025-03-09
Added
- `sample`: "streaming" bernoulli sampling of remote files when hosted on servers with range requests support #2588
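
Bernoulli sampling lends itself to streaming because each row is kept or dropped independently of every other row, so rows can be processed as they arrive. The Python sketch below shows the general technique only. It is not qsv's implementation, and `bernoulli_sample` and its parameters are hypothetical names.

```python
import random


def bernoulli_sample(rows, probability, seed=None):
    """Yield each row independently with the given probability.

    Works on any iterable in a single pass and keeps no rows in memory,
    so the input can be a file, a socket, or a download in progress.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < probability:
            yield row


# A generator stands in for rows arriving from a remote server.
incoming = (f"row-{i}" for i in range(1_000))
sample = list(bernoulli_sample(incoming, probability=0.1, seed=42))
print(len(sample), "of 1000 rows kept")
```

The sample size is itself random (about `probability * n` rows), which is the trade-off against methods that return an exact count.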
Changed
- Updated benchmarks.sh to add Homebrew installation prompt by @ondohotola in #2575
- feat: migrate to Rust 2024 edition #2587
- deps: bump `luau` from 0.660 to 0.663 #2567
- deps: bump polars to 0.46.0 at py-1.24.0 tag f70ce71
- deps: replace deprecated `simple-home-dir` with `directories` crate 6768cd5
- deps: bump arrow from 54.2.0 to 54.2.1 fc479b2
- build(deps): bump bytemuck from 1.21.0 to 1.22.0 by @dependabot in #2570
- build(deps): bump console from 0.15.10 to 0.15.11 by @dependabot in #2569
- build(deps): bump governor from 0.8.0 to 0.8.1 by @dependabot in #2562
- build(deps): bump minijinja from 2.7.0 to 2.8.0 by @dependabot in #2573
- build(deps): bump minijinja-contrib from 2.7.0 to 2.8.0 by @dependabot in #2571
- build(deps): bump pyo3 from 0.23.4 to 0.23.5 by @dependabot in #2558
- build(deps): bump pyo3 from 0.23.5 to 0.24.0 by @dependabot in #2590
- build(deps): bump redis from 0.29.0 to 0.29.1 by @dependabot in #2568
- build(deps): bump robinraju/release-downloader from 1.11 to 1.12 by @dependabot in #2580
- build(deps): bump serde_json from 1.0.139 to 1.0.140 by @dependabot in #2572
- build(deps): bump tempfile from 3.17.1 to 3.18.0 by @dependabot in #2581
- build(deps): bump uuid from 1.14.0 to 1.15.0 by @dependabot in #2563
- build(deps): bump uuid from 1.15.0 to 1.15.1 by @dependabot in #2566
- applied select clippy lint suggestions
- bumped indirect dependencies to latest versions
Fixed
- `apply`: fix `currencytonum` handling of "0.00" value by adding parsing strictness control with `--formatstr` option #2586
- `describegpt`: fix panic by adding error handling when LLM API response is not in expected format #2577
- `tojsonl`: fix display of floats as per the JSON spec #2583
New Contributors
- @ondohotola made their first contribution in #2575
Full Changelog: 3.1.1...3.2.0
3.1.1
[3.1.1] - 2025-02-24
Highlights:
- `sample`: is now a "smart" command that uses the stats cache to validate and make sampling faster.
- With the QSV_STATSCACHE_MODE env var, you can now control the stats cache behavior suite-wide, making sure "smart" commands use it when appropriate.
- `luau` command's capabilities have been significantly expanded with:
  - New accumulate helper function for aggregating values across rows
  - Optional naming for cumulative helper functions
  - More robust error handling and improved docstrings
  - Enhanced scripting performance with fast-float parsing
  - new Wiki section with examples of using its helper functions
- `schema`: now does type-aware sorting of enum lists, making JSON Schema enum list customization easier when fine-tuning it for JSON Schema validation with `validate`.
- `lens`: adds `--freeze-columns` option with a default of 1, improving navigation of wide CSVs
- `stats`: adds `--dataset-stats` option to explicitly compute dataset-level statistics. Starting with qsv 2.0.0, it was computed automatically to support Datapusher+ and the DRUF workflow, but it was causing confusion with some command-line users.
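
Why type-aware sorting of enum lists matters is easiest to see with numeric values stored as text, as they are in a CSV. The Python sketch below is illustrative only: `sort_enum` is a hypothetical helper, and how `schema` actually orders values is not described in these notes.

```python
def sort_enum(values, json_type):
    """Sort enum values according to their JSON Schema type instead of as text."""
    if json_type == "integer":
        return sorted(values, key=int)
    if json_type == "number":
        return sorted(values, key=float)
    return sorted(values)  # strings and anything else: plain lexical order


raw = ["10", "9", "100", "2"]
print(sorted(raw))                # ['10', '100', '2', '9']  (lexical order)
print(sort_enum(raw, "integer"))  # ['2', '9', '10', '100']  (numeric order)
```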
Added
- `lens`: added `--freeze-columns` option #2552
- `luau`: added accumulate helper function #2537 #2539
- `luau`: added a new section in the Wiki with examples of using the new helper functions https://github.com/dathere/qsv/wiki/Luau-Helper-Functions-Examples
- `sample`: is now "smart" - using the stats cache to validate and make sampling faster #2529 #2530 71ec7ed
- `schema`: added type-aware sort of JSON Schema enum list #2551
- `stats`: added `--dataset-stats` option #2555
- `python`: added precompiled qsvpy binary for Python 3.13 c408778
- added QSV_STATSCACHE_MODE env var to control stats cache suite-wide 4afb98d 2adc313 ba75f08
- docs: updated PERFORMANCE docs and added a TLDR version 77ed167 c61c249 db0bb3f
- chore: added *.tab & *.ssv to typos config 5236675
Changed
- `frequency`: made error handling more robust b195519
- `luau`: refactored all cumulative helper functions (cum_) to take name as an optional argument #2540
- `schema`: refactored to use QSV_STATSCACHE_MODE env var 5771ff4
- `select`: refactored select helper bfbe64c
- `stats`: optimized memory layout of central Stats struct 52f697e
- `stats`: optimized record_count functionality 0e3114a 18791da
- `contrib(completions)`: update qsv completions for qsv 3.1 by @rzmk in #2556
- deps: bump arrow and tempfile 4cc2679
- deps: bump cached and redis crates e622d14
- deps: bump csvlens from 0.11 to 0.12 b2fd985
- deps: use our patched fork of csvlens with ability to freeze columns d66ec6d
- deps: bump polars to 0.46.0 at py-1.23.0 tag 6072aa2
- deps: bump flate2 from 1.0.35 to 1.1.0 eed471a
- deps: bump gzp from 0.11 to 1.0.0 43c8a4a
- build(deps): bump jaq-json from 1.1.0 to 1.1.1 by @dependabot in #2547
- build(deps): bump jaq-core from 2.1.0 to 2.1.1 by @dependabot in #2546
- build(deps): bump log from 0.4.25 to 0.4.26 by @dependabot in #2545
- build(deps): bump tempfile from 3.16.0 to 3.17.0 by @dependabot in #2532
- build(deps): bump tempfile from 3.17.0 to 3.17.1 by @dependabot in #2535
- build(deps): bump serde_json from 1.0.138 to 1.0.139 by @dependabot in #2541
- build(deps): bump serde from 1.0.217 to 1.0.218 by @dependabot in #2542
- build(deps): bump smallvec from 1.13.2 to 1.14.0 by @dependabot in #2528
- build(deps): bump strum from 0.27.0 to 0.27.1 by @dependabot in #2533
- build(deps): bump strum_macros from 0.27.0 to 0.27.1 by @dependabot in #2534
- build(deps): bump uuid from 1.13.1 to 1.13.2 by @dependabot in #2538
- build(deps): bump uuid from 1.13.2 to 1.14.0 by @dependabot in #2544
- chore: we now have ~1,800 tests! f5d09ed
- applied select clippy lint suggestions
- bumped indirect dependencies to latest versions
- bumped MSRV to latest Rust stable - v1.85
Fixed
- `count`: refactored to fall back to "regular" CSV reader when Polars counting returns a zero count fd39bcb
- `schema`: fixed off-by-one error 60de090
- ensured get_stats_record helper returns field/stats correctly ad86a37
- Fixed RUSTSEC-2025-0007: ring is unmaintained #2548
- `stats`: only add `qsv__value` column when `--dataset-stats` is enabled 64267d3
- skip format check when path starts with temp dir (indicating it's a file streamed from STDIN) or is a snappy file ff8957e
Removed
- `frequency`: removed `--stats-mode` option now that we have a suite-wide QSV_STATSCACHE_MODE env var ba75f08 416abb7
- chore: removed simdutf8 conditional directive for aarch64 architecture, now that it's no longer needed ec1e16c
- removed publish-linux-qsvpy-glibc-231-musl-123.yml workflow as it was getting cross compilation errors and we have another musl workflow that works 7c08617
Full Changelog: 3.0.0...3.1.1
3.0.0
[3.0.0] - 2025-02-13
Highlights:
- `sample`: Five new sampling methods! In addition to reservoir & indexed - added bernoulli, systematic, stratified, weighted & cluster sampling. And they're all memory efficient so you should be able to sample arbitrarily large datasets!
- `stats`: Added "sortiness" [-1 (Descending) to 1 (Ascending)] & "uniqueness_ratio" [0 (many repeated values) to 1 (All unique values)] stats.
  The qsv-stats engine was also optimized to squeeze out more performance, with `stats` now 2.6x faster while using less memory despite the addition of these new stats.
- `diff`: is now a "smart" command, so that it uses the stats cache to short-circuit diffs if files are identical per their fingerprint hashes, and to validate that the diff key column is all unique.
- The stats cache has been refactored, improving performance for "smart" commands:
  - `frequency` is not only 3.3x faster, it uses far less memory as it now doesn't need to maintain hashmaps for columns with all unique values.
  - `tojsonl` is 2.25x faster
  - `schema` is 1.4x faster
- `luau` got a major performance boost with the v0.660 engine upgrade, taking advantage of several compiler optimizations. `luau` is now up to 3.1x faster!
- `validate` had a major performance regression - slowing from 3.295 seconds in v2.1.0 to 13.159 seconds in v2.2.1 in the benchmarks. 4x slower! With the jsonschema 0.29 crate update, `validate` now clocks in at 3.022 seconds!
- `template` also got a big boost and is now 2.9x faster with the minijinja 2.7 crate update.
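
For intuition about the two new statistics, the Python sketch below computes values with the same ranges as those quoted above. The formulas are assumptions chosen to fit those ranges, not the qsv-stats definitions: sortiness is taken as (ascending adjacent pairs minus descending adjacent pairs) divided by the number of adjacent pairs, with ties counting toward neither, and uniqueness_ratio as distinct values divided by total values.

```python
def sortiness(values):
    """Assumed definition: +1 for fully ascending, -1 for fully descending."""
    if len(values) < 2:
        return 0.0
    ascending = descending = 0
    for previous, current in zip(values, values[1:]):
        if current > previous:
            ascending += 1
        elif current < previous:
            descending += 1
        # Equal neighbours count toward neither direction (an assumption).
    return (ascending - descending) / (len(values) - 1)


def uniqueness_ratio(values):
    """Assumed definition: number of distinct values over number of values."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


print(sortiness([1, 2, 3, 4]))                  # 1.0
print(sortiness([4, 3, 2, 1]))                  # -1.0
print(uniqueness_ratio(["a", "b", "c"]))        # 1.0
print(uniqueness_ratio(["a", "a", "a", "a"]))   # 0.25
```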
Added
- `joinp`: additional `joinp` `asof` join sort and match options #2486
- `stats`: add "sortiness" statistic #2499
- `stats`: add uniqueness_ratio #2521
- `stats` & `frequency`: add `--vis-whitespace` option. Fulfills #2501 #2503
- `sample`: add more sampling methods (in addition to indexed and reservoir - added bernoulli, systematic, stratified, weighted & cluster sampling) and made them all memory efficient so we can sample arbitrarily large datasets: #2507 & #2511
- `diff`: make `diff` a "smart" command. Fulfills #2493 and #2509 #2518
- `benchmarks`: added new benchmarks for `sample` for new sampling methods d758c54
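
As one example of how a sampling method can stay memory efficient, the Python sketch below implements textbook systematic sampling: pick a random starting offset, then keep every k-th row in a single pass. It illustrates the general technique under that definition and is not taken from qsv. `systematic_sample` and `interval` are hypothetical names.

```python
import random


def systematic_sample(rows, interval, seed=None):
    """Yield every `interval`-th row, starting at a random offset in [0, interval).

    Only the current row is held in memory, so the input can be arbitrarily large.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    start = random.Random(seed).randrange(interval)
    for index, row in enumerate(rows):
        if index % interval == start:
            yield row


picked = list(systematic_sample(range(100), interval=10, seed=3))
print(picked)  # ten evenly spaced row numbers
```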
Changed
- `luau`: bump from 0.653 to 0.660 and optimize for performance 4402df6 de429b4 07ff8b8 3211f5c
- `stats`: compute string len stats only for string columns #2495
- `contrib(completions)`: update qsv completions for qsv 2.2.1 by @rzmk in #2494
- deps: bump polars to latest upstream after its py-1.22.0 release
- deps: backported csv-core 0.1.12 fix to our qsv-optimized csv-core fork dathere/rust-csv@5d0916e
- build(deps): bump actions/setup-python from 5.3.0 to 5.4.0 by @dependabot in #2488
- build(deps): bump bytes from 1.9.0 to 1.10.0 by @dependabot in #2497
- build(deps): bump data-encoding from 2.7.0 to 2.8.0 by @dependabot in #2512
- build(deps): bump geosuggest-core from 0.6.5 to 0.6.6 by @dependabot in #2520
- build(deps): bump geosuggest-utils from 0.6.5 to 0.6.6 by @dependabot in #2519
- build(deps): bump jsonschema from 0.28.3 to 0.29.0 by @dependabot in #2510
- build(deps): bump minijinja from 2.6.0 to 2.7.0 by @dependabot in #2489
- build(deps): bump mlua from 0.10.2 to 0.10.3 by @dependabot in #2485
- build(deps): bump qsv-stats from 0.27.0 to 0.28.0 by @dependabot in #2496
- build(deps): bump qsv-stats from 0.28.0 to 0.29.0 by @dependabot in #2498
- build(deps): bump qsv-stats from 0.29.0 to 0.30.0 by @dependabot in #2505
- chore: Bump rand to 0.9 #2504
- build(deps): bump simple-home-dir from 0.4.6 to 0.4.7 by @dependabot in #2515
- build(deps): bump uuid from 1.12.1 to 1.13.1 by @dependabot in #2500
- bumped numerous indirect dependencies to latest versions
- applied select clippy lint suggestions
- bumped MSRV to latest Rust stable - v1.84.1
Fixed
- docs: QSV_AUTOINDEX => QSV_AUTOINDEX_SIZE typo. Fixes #2479 #2484
- fix: `search` & `searchset` off by 1 when using `--flag` option. Fixes #2508 #2513
Full Changelog: 2.2.1...3.0.0
2.2.1
[2.2.1] - 2025-01-27
Changed
- deps: bumped polars to 0.46.0. This will allow us to publish qsv to crates.io as qsv was using features that were not enabled in polars 0.45.1 275b2b8
Fixed
- `stats`: fix cache json processing bug. Fixes #2476 #2477
- benchmarks: v6.1.0 - ensured all `stats` cache benchmarks actually used the stats cache even if the default `--cache-threshold` is 5 seconds - too high to trigger stats cache creation ac33010
Full Changelog: 2.2.0...2.2.1
2.2.0
[2.2.0] - 2025-01-26
Highlights:
- `stats` - the ❤️ of qsv, got a little tune-up:
  - It got a tad faster now that we only compute string length stats for string types. Previously, we were also computing length for numbers, thinking it'll be useful for storage sizing purposes (as everything is stored as string with CSV). But as performance is goal number 1, we're no longer doing so. Besides, this sizing info can be derived using other stats.
  - Fixed the problem with the stats cache being deleted/ignored even when not necessary.
    This bug snuck in while implementing the `--cache-threshold` cache suppression option. With `stats` getting its cache mojo back - expect near-instant cache-backed response not only for `stats` but also other "automagical" smart commands 💪.
- `diff` - @janriemer squashed some bugs without sacrificing `diff`'s ludicrous speed!
- `validate`: added `dynamicEnum` custom JSON Schema keyword column specifier support.
  You can now specify which column to validate against (by name or by 0-based column index), instead of always using the first column. This works for local & remote lookup files using the `http/s://`, `ckan://` and `dathere://` URL schemes.
- `extdedup` now actually uses a proper memory-mapped, on-disk hash table.
  Previously, it was only deduping in-memory as the odht crate was not properly wired to a memory mapped file 🤦 (I took the name of the odht crate literally and thought it was handling it 🤷). Thanks for the detailed bug report @Svenskunganka!
- JSON query parsing overhaul.
  The `fetch`, `fetchpost` & `json` commands now use the latest `jaq` engine, making for faster performance especially now that we're precompiling and caching the jaq filter.
- Polars engine upgraded. 🐻‍❄️
  By two versions! py-polars 1.20.0 and 1.21.0 - giving the `sqlp`, `joinp`, `pivotp` & `count` commands a little boost.
NOTE: qsv v2.2.0 is not available on crates.io as it does not allow enabling unreleased features as we await a new version of Polars. As soon as Polars 0.46.0 is published, a new qsv patch release will be published to crates.io.
This means that installation option 3 using `cargo install` will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.2.0 still work.
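
The `dynamicEnum` column specifier described in the highlights (by name or by 0-based index, defaulting to the first column) comes down to a small lookup step. The Python sketch below shows that step only. `lookup_values` and the sample lookup table are hypothetical, and the keyword's actual URI syntax is not reproduced here.

```python
import csv
import io


def lookup_values(csv_text, column=0):
    """Collect the allowed values from one column of a lookup table.

    `column` is either a header name or a 0-based index; 0 (the first
    column) is the default, matching the previous behaviour.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    index = column if isinstance(column, int) else header.index(column)
    return {row[index] for row in reader}


# Hypothetical lookup table.
LOOKUP = "code,name\nTX,Texas\nNY,New York\n"

print(sorted(lookup_values(LOOKUP)))          # ['NY', 'TX']
print(sorted(lookup_values(LOOKUP, "name")))  # ['New York', 'Texas']
print(sorted(lookup_values(LOOKUP, 1)))       # ['New York', 'Texas']
```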
Added
- `diff`: add `--delimiter` "convenience" option. Fulfills #2447 #2464
- `slice`: add stdin and snappy compressed file support ab34a62
- `validate`: add dynamicEnum column specifier support. Fulfills #2470 #2472
Changed
- `fetch`, `fetchpost` & `json`: `jaq` dependency upgrade - from `jaq-interpret` & `jaq-parse` to `jaq-core`/`jaq-json`/`jaq-std` #2458
- `fetch` & `fetchpost`: cache compiled jaq filter #2467
- `joinp`: adjust asofby test to reflect Polars py-1.20.0 behavior 853a266
- `stats`: compute string length stats for string type only #2471
- `sqlp`: wordsmith fastpath explanation 4e3f853
- refactor: standardize -q and -Q shortcut options. Fulfills #2466 #2468
- deps: bump polars to 0.45.1 at py-polars-1.20.0 tag #2448
- deps: bump polars to 0.45.1 at py-polars-1.21.0 tag 4525d00
- deps: Bump csv-diff to 0.1.1 by @janriemer in #2456
- deps: Bump csvlens to latest upstream 27a723e
- deps: use latest strum upstream 2ca1b0d
- build(deps): bump base62 from 2.2.0 to 2.2.1 by @dependabot in #2440
- build(deps): bump chrono-tz from 0.10.0 to 0.10.1 by @dependabot in #2449
- build(deps): bump data-encoding from 2.6.0 to 2.7.0 by @dependabot in #2444
- build(deps): bump indexmap from 2.7.0 to 2.7.1 by @dependabot in #2461
- build(deps): bump jsonschema from 0.28.1 to 0.28.2 by @dependabot in #2469
- build(deps): bump jsonschema from 0.28.2 to 0.28.3 by @dependabot in #2473
- build(deps): bump log from 0.4.22 to 0.4.25 by @dependabot in #2439
- build(deps): bump semver from 1.0.24 to 1.0.25 by @dependabot in #2459
- build(deps): bump serde_json from 1.0.135 to 1.0.136 by @dependabot in #2455
- build(deps): bump serde_json from 1.0.136 to 1.0.137 by @dependabot in #2460
- build(deps): bump simple-home-dir from 0.4.5 to 0.4.6 by @dependabot in #2445
- build(deps): bump uuid from 1.11.1 to 1.12.0 by @dependabot in #2441
- build(deps): bump uuid from 1.12.0 to 1.12.1 by @dependabot in #2465
- tests: enabled Windows CI caching for faster CI tests
- bumped numerous indirect dependencies to latest versions
- applied select clippy lint suggestions
Fixed
- `count`: Sometimes, polars count returns zero even if there are rows. Fixed by doing a regular csv reader count when polars count returns zero abcd365
- `diff`: Fix name to index conversion by @janriemer. Fixes #2443 #2457
- `extdedup`: refactor/fix to actually have on-disk hash table backed by a mem-mapped file. Fixes #2462 #2475
- `stats`: fix stats caching as it was inadvertently deleting the stats cache even when not necessary 96e6d28
Removed
- `foreach`: refactored to remove unmaintained `local-encoding` dependency #2454
- remove `polars` feature from qsvdp binary variant. We'll use py-polars from DP+ directly.
Full Changelog: 2.1.0...2.2.0
2.1.0
[2.1.0] - 2025-01-12
Highlights:
- `join` & `joinp` fine-tuning continues, with several join key transformation options (`--ignore-leading-zeros` & `--norm-unicode`); `join` fixes for `--right-anti` and `--right-semi` joins; and reverting a `join` performance regression with 2.0.0.
- `pivotp` uses more summary statistics for even smarter aggregation suggestions
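
The join key transformations named above can be pictured as a normalization step applied to both sides before keys are compared. The Python sketch below is a generic illustration, not the `join`/`joinp` code: `normalize_key` is a hypothetical helper, the choice of Unicode normalization form is left to the caller, and keeping a single "0" for all-zero keys is an assumption.

```python
import unicodedata


def normalize_key(key, ignore_leading_zeros=False, unicode_form=None):
    """Return the join key after the requested transformations."""
    if unicode_form is not None:
        # e.g. "NFC" composes "e" + combining accent into a single "é".
        key = unicodedata.normalize(unicode_form, key)
    if ignore_leading_zeros:
        # Assumption: an all-zero key keeps one "0" rather than becoming empty.
        key = key.lstrip("0") or "0"
    return key


print(normalize_key("00123", ignore_leading_zeros=True))  # 123
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent
composed = "caf\u00e9"     # precomposed "é"
print(decomposed == composed)                                      # False
print(normalize_key(decomposed, unicode_form="NFC") == composed)   # True
```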
NOTE: qsv v2.1.0 is not available on crates.io. This was caused by qsv's use of a brand new `string_normalize` Polars feature that is not yet available on the latest release of Polars - v0.45.1. Once a new version of Polars is published with this feature, a new qsv patch release will be published to crates.io.
This means that installation option 3 using `cargo install` will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.1.0 still work.
Added
- `join`: add `--ignore-leading-zeros` option #2430
- `joinp`: add `--norm-unicode` option to unicode normalize join keys #2436
- `pivotp`: added more smart aggregation suggestions #2428
- `template`: added to qsvdp binary variant 9df85e6
- `benchmarks`: added `pivotp` benchmark 92e4c51
Changed
- `joinp`: refactored `--ignore-leading-zeros` handling #2433
- Migrate from unmaintained dynfmt to dynfmt2 #2421
- deps: bump csvlens to latest upstream 52c766d
- deps: bump to latest csv qsv-optimized fork 58ac650
- deps: bumped MiniJinja to 2.6.0 8176368
- deps: bump to latest Polars upstream
- deps: bump qsv-stats to 0.26.0
- build(deps): bump azure/trusted-signing-action from 0.5.0 to 0.5.1 by @dependabot in #2420
- build(deps): bump base62 from 2.0.3 to 2.1.0 by @dependabot in #2419
- build(deps): bump base62 from 2.1.0 to 2.2.0 by @dependabot in #2426
- build(deps): bump phf from 0.11.2 to 0.11.3 by @dependabot in #2417
- build(deps): bump pyo3 from 0.23.3 to 0.23.4 by @dependabot in #2431
- build(deps): bump serde_json from 1.0.134 to 1.0.135 by @dependabot in #2416
- build(deps): bump tokio from 1.42.0 to 1.43.0 by @dependabot in #2423
- build(deps): bump uuid from 1.11.0 to 1.11.1 by @dependabot in #2427
- apply several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped Rust nightly from 2024-12-19 to 2025-01-05 (same version used by Polars)
- bump MSRV to latest Rust stable - v1.84.0
Fixed
- `join`: revert optimization that actually resulted in a performance regression e42af2b
- `join`: `--right-anti` and `--right-semi` joins didn't swap headers properly #2435
- `count`: polars-powered `count` didn't use the right data type SQL count(*) d8c1524
Full Changelog: 2.0.0...2.1.0
2.0.0
qsv v2.0.0 is here!
It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!
Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!
- It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
  Under the hood, the `fetchpost`, `template`, `stats`, `validate` and `luau` commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
- It adds a new "smart" `pivotp` command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
- `stats` now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
- `join` and `joinp` got a lot of love in this release, with several new options:
  - `joinp`: non-equi join support! 🎉💯🥳
    See "Lightning Fast and Space Efficient Inequality Joins" paper and this Polars non-equi join tracking issue.
  - `join` & `joinp`: `--right-anti` and `--right-semi` joins
  - `joinp`: `--ignore-leading-zeros` option for join keys
  - `joinp`: `--maintain-order` option to maintain the order of either the left or right dataset in the output
  - `joinp`: expanded `--cache-schema` options to make `joinp` smarter/faster by leveraging the stats cache
  - `join`: `--keys-output` option to write successfully joined keys to a separate output file.
This release lays the groundwork for the `outliers` "smart" command to quickly identify outliers using stats/frequency info.
It also sets the stage for an initial implementation of our "Data Concierge" that leverages all the high-quality, high-res metadata we automagically compile with DRUF to enable Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights as it constantly grooms the Data Catalog - effectively making it a FAIR Data Factory.
Added
- `fetchpost`: add `--globals-json` option #2357
- `fixlengths`: add `--remove-empty` option; refactored for performance. Fulfills #2391. #2411
- `join`: add `--keys-output` option. Fulfills #2407. #2408
- `join`: add `--right-anti` and `--right-semi` options. Fulfills #2379. #2380
- `joinp`: add non-equi join support! 🎉💯🥳 #2409
- `joinp`: add `--ignore-leading-zeros` option. Fulfills #2398. #2400
- `joinp`: add `--maintain-order` option #2338
- `joinp`: add `--right-anti` and `--right-semi` options. Fulfills #2377. #2378
- `luau`: addl helper functions. Fulfills #1782. #2362
- `luau`: add `qsv_writejson` helper #2375
- `pivotp`: new polars-powered command. Fulfills #799. #2364
- `pivotp`: "smart" pivotp. #2367
- `stats`: add geometric mean and harmonic mean. Fulfills #2227. #2342
- `stats`: add string length stats to set stage for upcoming `outliers` "smart" command to quickly identify outliers using stats/frequency info #2390
- `template`: add `--globals-json` option #2356
- `tojsonl`: add `--quiet` option. Fulfills #2335. #2336
- `validate`: add `--validate-schema` option to check if the JSON Schema itself is valid #2393
- `contrib(completions)`: add joinp `--ignore-case` and slice `--invert` by @rzmk in #2322
- `contrib(completions)`: add `--quiet` to `tojsonl` by @rzmk in #2337
- `ci`: add qsv_glibc_2.31-headless to action by @rzmk in #2330
- Add license to MSI installer by @rzmk in #2321
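
For readers unfamiliar with the two means added to `stats` in this release, the Python sketch below gives their standard mathematical definitions. It is not the qsv-stats implementation. Both helpers are hypothetical and assume strictly positive inputs.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)


print(geometric_mean([1, 4, 16]))  # approximately 4.0
print(harmonic_mean([1, 2, 4]))    # approximately 1.714 (12/7)
```

The geometric mean suits growth rates and ratios, while the harmonic mean suits rates such as speeds; for positive data, harmonic <= geometric <= arithmetic.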
Changed
- `lens`: optimized csvlens library usage, dropping clap dependency #2403
- `pivotp`: an even smarter `pivotp` #2368
- `stats`: performance boost 51349ba
- Update deb package by @tino097 in #2226
- `ci`: attempt using files-folder instead of files by @rzmk in #2320
- Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
- build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
- build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
- build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
- build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
- build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
- build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
- build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
- build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
- build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
- build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
- build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
- bump polars from 0.44.2 to 0.45 #2340
- build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
- bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
- build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
- build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
- build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
- build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
- build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
- build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
- build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
- build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
- build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
- deps: bump tabwriter to 1.4.1 bbcbeba
- build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
- build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
- build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
- build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
- apply several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)
Fixed
1.0.0
qsv v1.0.0 is here!
After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!
What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!
To mark this major milestone, this larger-than-usual release includes major performance improvements, new features, and various optimizations!
Added
- `joinp`: add `--ignore-case` option #2287
- `py`: add ability to load python expression from file #2295
- `replace`: add `--not-one` flag (resolves #2305) by @rzmk in #2307
- `slice`: add `--invert` option #2298
- `stats`: add dataset-level stats #2297
- `sqlp`: auto-decompression of gzip, zstd & zlib compressed csv files with `read_csv` table function (implements suggestion from @wardi in #2301) #2315
- `template`: add lookup support #2313
- added `ui` feature to make it easier to make a headless build of qsv #2289
- added better panic handling #2304
- added new benchmark for `template` command cd7e480
- added lookup support legend b46de73
Changed
- move qsv from personal GitHub repo to datHere GitHub org #2317
- `template`: parallelized template rendering for significant speedups #2273
- simplify input format check #2309
- bump embedded `luau` from 0.650 to 0.653 986a1d3
- deps: Switch back to `simple-home-dir` from `simple-expand-tilde` #2319
- deps: Add minijinja contrib #2276
- deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
- build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
- build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
- build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
- build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
- build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
- build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
- build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
- build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
- build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
- build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
- build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
- build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
- build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
- build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
- build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
- build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
- build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
- build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
- build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
- build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
- applied several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped MSRV to latest Rust stable (1.83.0)
- bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars
Fixed
- fix `get_stats_records()` helper to handle input files with embedded spaces (fixes #2294) #2296
- added better panic handling (fixes #2301) #2304
- implement simple format check for input files (fixes #2301) #2308
Removed
- removed `simple-expand-tilde` dependency in favor of `simple-home-dir` #2318
- removed patched fork of `indicatif` now that 0.17.9 is released, fixing GH unmaintained advisory for `instant` 33fa54a
- removed `clipboard` command from `qsvlite` binary variant 9c663d8
Full Changelog: 0.138.0...1.0.0