developer.dataset.traps

Jeremy Faden edited this page Jun 14, 2024 · 3 revisions

Purpose: outline data set traps.

Introduction

QDataSet came out of looking at the limitations of the data models attempted over the years and trying to create one flexible enough to work in many situations. That flexibility lets it adapt, but people have a tough time understanding it. It is also so unconstrained that the hundreds of methods that use it might as well be passing Objects around: the interface does not tell you which scheme a dataset follows.

X,Y,Z,?

When using X, Y, and Z in your model, you will be trapped into locking your data model to the display, and you won't be able to model high dimensionality (and also single-dimensionality). Suppose you have Flux sampled at Time, Energy, and Pitch Angle. Maybe Time is clearly "X" and Flux is clearly "Z", but you can see why this breaks down quickly.
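One way out of this trap is to name each independent quantity and attach it to an index of the data rather than to a plot axis. The following is a minimal Python sketch of that idea, not the QDataSet API: the class `NamedData`, its fields, and the array sizes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NamedData:
    """Hypothetical container: a named quantity plus one named tag set per index."""
    name: str                 # the dependent quantity, e.g. "Flux"
    shape: Sequence[int]      # length of each index
    depend_names: List[str]   # the quantity each index is sampled against

    @property
    def rank(self) -> int:
        # The rank is simply the number of indices.
        return len(self.shape)


# Flux sampled at Time, Energy, and Pitch Angle: no axis letters required.
flux = NamedData(
    name="Flux",
    shape=(1440, 32, 18),  # illustrative sizes only
    depend_names=["Time", "Energy", "PitchAngle"],
)

# The display layer, not the data model, decides what goes on which axis.
plot_axes = {"x": flux.depend_names[0], "y": flux.depend_names[1], "color": flux.name}
```

Here the mapping to axes is a separate decision made at plot time, so the same data could be sliced or collapsed along Pitch Angle without changing the model.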

Scalars

When using X, Y, and Z in your model, you will be trapped into locking your data model to the display, and it's not clear how you would describe scalars (single-dimensionality). Suppose you wish to model a scalar that indicates an event time. Would the interface just have a getX() method? Suppose instead you want to indicate an Energy. Should that be getY()?
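As a rough illustration, a single value can carry its own name and units so that the accessor is the same whether it holds a time or an energy. This is a hypothetical Python sketch, not the QDataSet interface; the class `Rank0` and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rank0:
    """Hypothetical zero-index dataset: one value, its units, and a name."""
    name: str
    value: float
    units: str

    def rank(self) -> int:
        # No indices are needed to reach the value.
        return 0


# The same accessor (.value) serves both; neither is "X" or "Y".
event_time = Rank0(name="Time", value=86400.0, units="seconds since 2010-01-01T00:00")
energy = Rank0(name="Energy", value=1500.0, units="eV")
```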

Names for Dimensionality

Tables have two indices. Qubes have N indices. In QDataSet, the rank function returns the number of indices, so you can say "rank 2 dataset" instead of "Table". What should the thing with one index be called? In the old Das2 model we called these "Vectors", which was a bad name. Here are suggested names:

| X | SDI | QDataSet |
|---|---|---|
| Number | | Rank 0 DataSet |
| Array | XYData, Data1D | Rank 1 DataSet |
| Table | Data2D | Rank 2 DataSet |
| ThreeQube (?) | | Rank 3 DataSet |
| | | Rank 4 DataSet |

Words to avoid, because they have other definitions that cause confusion: Scalar, Vector, Matrix.

Note that in QDataSet, the rank is quite different from the dimensionality. A simple spectrogram and a table of N columns are both rank 2, but the spectrogram occupies three dimensions, while the table occupies N dimensions. (The BUNDLE_1 property is used to differentiate.)
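The distinction can be sketched with a small helper. This is illustrative Python only: the function `dimensionality` and its counting rule are assumptions made for this example (a non-bundle dataset is taken to have one independent quantity per index plus the dependent quantity, and a bundle is taken to occupy one dimension per column), not something QDataSet provides.

```python
def dimensionality(rank: int, is_bundle: bool, n_columns: int = 0) -> int:
    """Count the dimensions a dataset occupies (hypothetical helper)."""
    if is_bundle:
        # Each bundled column is its own quantity, hence its own dimension.
        return n_columns
    # One independent quantity per index, plus the dependent quantity.
    return rank + 1


# Rank 2 spectrogram: time, frequency, and intensity.
spectrogram_dims = dimensionality(rank=2, is_bundle=False)

# Rank 2 table that bundles five columns.
table_dims = dimensionality(rank=2, is_bundle=True, n_columns=5)
```

Both datasets have rank 2, yet the helper reports three dimensions for the spectrogram and five for the table.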

Zeroth vs First

This still bugs me, where humans want to count things with natural numbers, but indexes should clearly start at 0. TODO: there must be guidelines for this, and this should be considered.

Cadence vs Duration

Cadence is the spacing between measurements, which is different from the duration of a single measurement.
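A small sketch makes the difference concrete. The sample values below are made up: an instrument that starts a measurement every 60 seconds and integrates for 10 seconds each time.

```python
# Hypothetical measurements: (start time in seconds, integration time in seconds).
samples = [(0.0, 10.0), (60.0, 10.0), (120.0, 10.0)]

starts = [start for start, _ in samples]

# Cadence: spacing between successive measurements.
cadences = [later - earlier for earlier, later in zip(starts, starts[1:])]

# Duration: how long each individual measurement took.
durations = [duration for _, duration in samples]
```

The cadence here is 60 seconds while the duration is 10 seconds, so treating one as the other would misstate how much of each interval was actually observed.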

Ratiometric spacing

Ratiometric spacings come up a lot in our field, where you have channels whose centers increase logarithmically. "decibels" and "percentIncrease" are good units for describing the cadence with a single number.
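As a sketch of how a single number can describe such a spacing, the Python below computes the ratio between neighboring channel centers and expresses it both ways. The channel values are made up, and the conversions are assumptions: percent increase is taken as (ratio - 1) × 100 and decibels as 10 × log10(ratio); check these against the unit definitions actually in use.

```python
import math

# Hypothetical channel centers in Hz, each double the previous one.
centers = [10.0, 20.0, 40.0, 80.0]

# For ratiometric spacing the ratio between neighbors is constant,
# even though the difference between neighbors is not.
ratios = [hi / lo for lo, hi in zip(centers, centers[1:])]


def percent_increase(ratio: float) -> float:
    """Assumed definition: how much larger each center is than the last, in percent."""
    return (ratio - 1.0) * 100.0


def decibels(ratio: float) -> float:
    """Assumed definition: the 10*log10 convention."""
    return 10.0 * math.log10(ratio)


cadence_percent = percent_increase(ratios[0])
cadence_db = decibels(ratios[0])
```

For these doubling channels the cadence is a 100 percent increase, or roughly 3 dB under the assumed convention.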

Times can be modeled as a double since a Datum

Times can be modeled effectively with a double and a unit that contains a Datum, such as "seconds since 2010-01-01T00:00."
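A minimal Python sketch of this encoding, using the base from the example unit above. The helper names `to_seconds` and `from_seconds` are hypothetical, and the sketch uses naive datetimes and ignores leap seconds.

```python
from datetime import datetime, timedelta

# Base taken from the unit "seconds since 2010-01-01T00:00".
BASE = datetime(2010, 1, 1, 0, 0)


def to_seconds(t: datetime) -> float:
    """Encode a time as a double offset from the base, in seconds."""
    return (t - BASE).total_seconds()


def from_seconds(seconds: float) -> datetime:
    """Decode a double offset back into a time."""
    return BASE + timedelta(seconds=seconds)


one_day_later = from_seconds(86400.0)
one_minute = to_seconds(datetime(2010, 1, 1, 0, 1))
```

Because the base travels with the unit, the double alone is never ambiguous, and arithmetic such as differencing two times stays ordinary floating-point math.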

Tags should identify centers, Bins identify ranges, but the best is to identify all three

  • When just one number is used to identify a location with an aliasing interval, it must be the center of the interval.
  • If an aliasing interval bin must be described, then have a bin with a start and end so there is nothing implicit.
  • The best case would be to have all three numbers: start, center, and end.
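The guidelines above can be sketched as a small record that stores all three numbers explicitly. This is a hypothetical Python illustration; the class `Bin` and the helper `bin_from_edges` are assumptions, not part of any existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bin:
    """Hypothetical bin that stores start, center, and end so nothing is implicit."""
    start: float
    center: float
    end: float

    def __post_init__(self) -> None:
        # Reject bins whose center falls outside the interval.
        if not (self.start <= self.center <= self.end):
            raise ValueError("expected start <= center <= end")


def bin_from_edges(start: float, end: float) -> Bin:
    """Build a bin from its edges, assuming the center is the arithmetic midpoint."""
    return Bin(start=start, center=(start + end) / 2.0, end=end)


channel = bin_from_edges(100.0, 200.0)
```

The midpoint rule in `bin_from_edges` is only one choice. For logarithmically spaced channels a different center may be more appropriate, which is one reason storing all three numbers is safer than deriving one from the others.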
