
A Dataset is a key entity in the platform. Conceptually, a Dataset represents a container of data, but the kind and volume of data that different Datasets contain can vary widely. For example, all river discharge time series measurements from river gauging stations in the Czech Republic can be one Dataset, just as a single time series from a single gauging station can be a Dataset. These two Datasets are completely separate entities.

Datasets must be organized in Projects. A Dataset resides in exactly one Project and can be accessed only in the context of that Project.

Note that technically speaking, any Dataset entity itself contains only descriptive information, which is sometimes referred to as "dataset definition" or "metadata". Actual data values are stored and managed in independent Core Data Services. This complexity is mostly hidden from the user and in most cases one can treat the Dataset as a thing that contains data and descriptive information about the data.

Dataset entity structure

Basic properties

{
  "id": "be500af4-2c0f-4b57-a834-a63d8c97a2e6",
  "name": "test 1",
  "datasetType": "file",
  "projectId": "6b4c70c4-ef2f-4b39-acb0-1e4b24313e8e",  // optional
  "spatialInformation": {},  // content is GeoJSON, see below
  "createdAt": "2019-01-11T08:24:57.0289363",
  "createdBy": "862281a7-2e66-4c76-af8d-29c82c723b4b",
  "updatedAt": "2019-01-11T08:24:57.0289363",
  "updatedBy": "862281a7-2e66-4c76-af8d-29c82c723b4b"
}

Note: Slashes are not allowed in dataset names.

Supported DatasetType values

  • gisvectordata: GIS vector data, similar to data in Shapefile format
  • multidimensional: Multidimensional time series, similar to data in DFS2, DFSU, or NetCDF format
  • timeseries: Time series data, like those that can be stored in DIMS CORE software
  • file: Any files in their original form
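
The naming rule and the supported type values above can be checked on the client side before a definition is sent anywhere. The following is a minimal sketch, not part of the platform: the function name validate_dataset_definition is hypothetical, the type comparison is assumed to be case-sensitive, and only the forward slash is rejected because the text does not say whether backslashes are also disallowed.

```python
# Values taken from the "Supported DatasetType values" list.
SUPPORTED_DATASET_TYPES = {"gisvectordata", "multidimensional", "timeseries", "file"}


def validate_dataset_definition(definition: dict) -> list[str]:
    """Return a list of detected problems; an empty list means none were found.

    Hypothetical helper: it only checks the two rules stated in the text.
    """
    problems = []

    # Rule from the note: slashes are not allowed in dataset names.
    # Assumption: only the forward slash is checked here.
    name = definition.get("name", "")
    if "/" in str(name):
        problems.append("name must not contain a slash")

    # Assumption: the value must match one of the listed types exactly.
    dataset_type = definition.get("datasetType")
    if dataset_type not in SUPPORTED_DATASET_TYPES:
        problems.append(f"unsupported datasetType: {dataset_type!r}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a client report every issue in a definition at once.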

Categorization properties (optional)

  "properties": { // properties specific to the DatasetType
    "unit": "degC",
    "variable": "Temperature"
  },
  "metadata": // any additional information
  {
    "key1": "value1",
    "key2": {
       "key21": 123.4 
    }
  }

Spatial information structure

{
  location: {}, // GeoJson Geometry, see below
  primarySpatialReference: "<string>", 
  resolution: "<string>"
}

  • Location always contains a longitude/latitude representation (EPSG 4326), which allows fast area searches and other geography/map operations. When conversion from the data's spatial representation is not possible, the location is null.
  • PrimarySpatialReference contains the spatial reference in which the data are represented. It usually contains an SRID, but the string type allows more complex spatial reference definitions in the future.
  • Resolution is currently not used.

Sample location in GeoJson Geometry

"location": {
   "type": "Polygon",
   "coordinates": [
     [
       [ -180.0, -90.0 ],
       [ 180.0, -90.0 ],
       [ 180.0, 83.64513 ],
       [ -180.0, 83.64513 ],
       [ -180.0, -90.0 ]
     ]
   ]
}
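
Because the location is a GeoJSON geometry in longitude/latitude order, a client can derive a bounding box from it for a quick area test. The sketch below is illustrative only: the helper names bounding_box and bbox_contains are hypothetical, it handles only the Polygon type shown in the sample, and it treats a null location as "no match".

```python
from typing import Optional

# The sample location from this section.
SAMPLE_LOCATION = {
    "type": "Polygon",
    "coordinates": [
        [
            [-180.0, -90.0],
            [180.0, -90.0],
            [180.0, 83.64513],
            [-180.0, 83.64513],
            [-180.0, -90.0],
        ]
    ],
}


def bounding_box(location: dict) -> tuple[float, float, float, float]:
    """Return (min_lon, min_lat, max_lon, max_lat) of a Polygon's outer ring."""
    if location.get("type") != "Polygon":
        raise ValueError("this sketch handles only Polygon geometries")
    outer_ring = location["coordinates"][0]  # first ring is the outer boundary
    lons = [point[0] for point in outer_ring]  # GeoJSON positions are [lon, lat]
    lats = [point[1] for point in outer_ring]
    return min(lons), min(lats), max(lons), max(lats)


def bbox_contains(location: Optional[dict], lon: float, lat: float) -> bool:
    """Check whether a point lies within the bounding box of the location."""
    if location is None:  # location is null when conversion was not possible
        return False
    min_lon, min_lat, max_lon, max_lat = bounding_box(location)
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
```

A bounding-box test is only a coarse filter; an exact point-in-polygon test would be needed for non-rectangular shapes.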

Temporal information structure

Endpoints

Use GET /api/metadata/project/{id}/dataset/list to list datasets and PUT /api/metadata/project/{id}/dataset to update dataset metadata. Note that updating dataset metadata through this endpoint does not change properties of the data. You can edit only the dataset name, description, and metadata that are not directly derived from the data; derived values such as storage size cannot be edited. For further details of endpoints for manipulating datasets, see our Swagger specification.
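
As an illustration of the two routes above, the sketch below only builds the HTTP requests with the Python standard library and does not send them. The host in BASE_URL is a placeholder, {id} is assumed to be the project ID, and authentication headers are omitted because they are not described here; consult the Swagger specification for the actual request and response shapes.

```python
import json
from urllib.request import Request

BASE_URL = "https://platform.example.com"  # placeholder host, not a real endpoint


def list_datasets_request(project_id: str) -> Request:
    """Build (but do not send) the request that lists datasets in a project."""
    url = f"{BASE_URL}/api/metadata/project/{project_id}/dataset/list"
    return Request(url, method="GET", headers={"Accept": "application/json"})


def update_dataset_request(project_id: str, dataset: dict) -> Request:
    """Build (but do not send) the request that updates dataset metadata."""
    url = f"{BASE_URL}/api/metadata/project/{project_id}/dataset"
    body = json.dumps(dataset).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

A request built this way could be passed to urllib.request.urlopen once the real host and credentials are known.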

A Dataset can be created by importing data into the platform or by converting one platform dataset into a new dataset. See the sections Conversion pipeline and Transfers for further details and examples regarding import and conversion.

When updating a Dataset, clients must include the RowVersion property. This helps to resolve potential conflicts when multiple users try to update the same Dataset at the same time. This feature was introduced with Web API version 2.
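
One way a client might honor the RowVersion requirement is to copy the value from the most recently fetched Dataset into every update body. The sketch below assumes the JSON key is spelled "rowVersion" (matching the camelCase of the other properties) and that the update body also carries the dataset id; neither is confirmed here, and build_update_body is a hypothetical helper.

```python
ROW_VERSION_KEY = "rowVersion"  # assumed JSON spelling of the RowVersion property


def build_update_body(current: dict, changes: dict) -> dict:
    """Merge edits into an update body that carries the fetched row version.

    `current` is the Dataset as last read from the platform;
    `changes` holds only the fields the user wants to edit.
    """
    if ROW_VERSION_KEY not in current:
        raise ValueError("fetched dataset has no row version; re-read it first")

    body = dict(changes)
    # Set identity and version last so that `changes` cannot override them.
    body["id"] = current["id"]
    body[ROW_VERSION_KEY] = current[ROW_VERSION_KEY]
    return body
```

If the platform rejects an update because the row version is stale, the client would typically re-read the Dataset, reapply the edits, and try again.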

Limitations and desired enhancements

  • Note that properties of the Dataset entity are set by the import pipeline and can subsequently be edited by authorized users. However, as of May 2018 there is no mechanism for synchronization between the Dataset and the related data values. For example, users can edit the EndTime of a Dataset of type timeseries, but this has no effect on the time series values handled by the Time Series Service. Similarly, appending more values through the Time Series Service does not automatically update the relevant Dataset properties. We aim to implement a mechanism for eventual consistency between these entities.
  • The "schema", i.e. the property names and types, is fixed by the Dataset class, but there is no fixed schema for Dataset.Properties and Dataset.Metadata. This provides useful flexibility, but it means clients must develop their own conventions and write code that can cope when other clients break those conventions.
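
Since Dataset.Properties and Dataset.Metadata have no fixed schema, client code benefits from reading them defensively. The following is a minimal sketch of that idea; the helper names are hypothetical, and the keys used in the usage example are the ones from the sample in the Categorization properties section.

```python
from typing import Optional


def read_string_property(dataset: dict, key: str) -> Optional[str]:
    """Return properties[key] if it is a string, otherwise None."""
    properties = dataset.get("properties")
    if not isinstance(properties, dict):
        return None
    value = properties.get(key)
    return value if isinstance(value, str) else None


def read_nested_number(dataset: dict, *path: str) -> Optional[float]:
    """Walk metadata along `path` and return a number, or None if the shape differs."""
    node = dataset.get("metadata")
    for key in path:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(node, bool) or not isinstance(node, (int, float)):
        return None
    return float(node)
```

For example, read_string_property(dataset, "unit") yields "degC" for the sample above, and read_nested_number(dataset, "key2", "key21") yields 123.4; both return None instead of raising when another client has written a different shape.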