-
Notifications
You must be signed in to change notification settings - Fork 0
developer.dataset.traps
Purpose: outline data set traps.
QDataSet is the result of looking at the limitations of data models attempted over the years, and trying to create a model that was flexible enough to work in many situations. With this flexibility it is able to adapt, but people have a tough time understanding it. Further, it has the problem that it's so unconstrained that the hundreds of methods that use it might as well be passing Objects around, you don't know which scheme of any dataset from the interface.
When using X, Y, and Z in your model, you will be trapped into locking your data model to the display, and you won't be able to model high dimensionality (and also single-dimensionality). Suppose you have Flux sampled at Time, Energy, and Pitch Angle. Maybe Time is clearly "X" and Flux is clearly "Z", but you can see why this breaks down quickly.
When using X, Y, and Z in your model, you will be trapped into locking your data model to the display, and it's not clear how you would describe scalars (single-dimensionality). Suppose you wish to model scalars, indicating an event time. Would the interface just have a getX() method? Suppose you want to indicate an Energy, so should this be getY()?
Tables have two indexes. Qubes have N indexes. In QDataSet, the rank function returns the number of indices, and you can say rank 2 dataset instead of Table. What should the thing with one index be called? In the old Das2 model we called these "Vectors" which was a bad name. Here are suggested names:
X | SDI | QDataSet |
---|---|---|
Number | Rank 0 DataSet | |
Array | XYData,Data1D | Rank 1 DataSet |
Table | Data2D | Rank 2 DataSet |
ThreeQube (?) | Rank 3 DataSet | |
Rank 4 DataSet |
And words to avoid, because they have other definitions that cause confusion: Scalar, Vector, Matrix
Note in QDataSet, the rank is quite different than the dimensionality. Both a simple spectrogram and a table of N columns are both rank 2, but the spectrogram occupies three dimensions, while the table occupies N dimensions. (The BUNDLE_1 property is used to differentiate.)
This still bugs me, where humans want to count things with natural numbers, but indexes should clearly start at 0. TODO: there must be guidelines for this, and this should be considered.
Cadence is the spacing between measurements, which is different than the duration of a measurement.
Ratiometric spacings come up a lot in our field, where you have channels whose centers increase logarithmically. "decibels" and "percentIncrease" are good units for describing the cadence with a single number.
Times can be modeled effectively with a double and a unit that contains a Datum, such as "seconds since 2010-01-01T00:00."
- When just one number is used to identify a location with an aliasing interval, it must be the center of the interval.
- If an aliasing interval bin must be described, then have a bin with a start and end so there is nothing implicit.
- The best case would be to have all three numbers.