Dataset schema file specification 1.0
Dataset storage enables you to sequentially store and retrieve data records, in various formats. Each Actor run is assigned its own dataset, which is created when the first item is stored to it. Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel, or HTML formats.
This specification is also available in the Apify documentation.
A dataset can be assigned a schema that describes:
- Content of the dataset, i.e., the schema of objects that are allowed to be added
- Different views on how we can look at the data, aka transformations
- Visualization of the View using predefined components (grid, table, …), which improves the run view interface at Apify Console and also provides a better interface for datasets shared by Apify users
Basic properties
- Storage is immutable, i.e., if you want to change the structure, you need to create a new dataset.
- Its schema is weak, i.e., you can always push additional properties, but the schema ensures that all the listed ones are present with the correct type. This makes Actors more compatible: an Actor may expect a dataset to contain certain fields without caring about the additional ones.
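The "weak schema" rule can be sketched as a small check. This is only an illustration, not the platform's validator: the `checkWeakSchema` helper and its simplified `{ fieldName: typeName }` schema shape are hypothetical and exist only for this example.

```javascript
// Hypothetical helper: checks that every listed field is present with the
// expected type, while ignoring any additional properties on the item.
function checkWeakSchema(item, listedFields) {
  const errors = [];
  for (const [name, expectedType] of Object.entries(listedFields)) {
    if (!(name in item)) {
      errors.push(`missing field "${name}"`);
    } else if (typeof item[name] !== expectedType) {
      errors.push(`field "${name}" should be ${expectedType}`);
    }
  }
  return errors;
}

// Simplified stand-in for a schema: field name -> JavaScript typeof result.
const listed = { title: 'string', priceUsd: 'number' };

// The extra property "color" is accepted because the schema is weak.
const ok = checkWeakSchema({ title: 'Shoe', priceUsd: 10, color: 'red' }, listed);
// A listed field with the wrong type is reported.
const bad = checkWeakSchema({ title: 'Shoe', priceUsd: '10' }, listed);

console.log(ok);  // no errors
console.log(bad); // one error, for "priceUsd"
```

An Actor that only needs `title` and `priceUsd` can therefore consume any dataset that passes this kind of check, regardless of what else each item carries.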
There are two ways to create a dataset with a schema:
1. A user can start an Actor that has a dataset schema linked from its Output schema.
2. A user can do it programmatically via the API (for an empty dataset), either:
- by passing the schema as the payload to the create dataset API endpoint, or
- by using the SDK:
const dataset = await Apify.openDataset('my-new-dataset', { schema });
When you open an existing dataset with the schema parameter, the system ensures that the dataset is compatible with the Actor. Otherwise, you get an error:
Uncaught Error: Dataset schema is not compatible with the provided schema
Structure
{
  "actorDatasetSchemaVersion": 1,
  "title": "E-shop products",
  "description": "Dataset containing the whole product catalog including prices and stock availability.",
  // A JSON schema object describing the dataset fields, with our extensions: the "title", "description", and "example" properties.
  // "example" is used to generate code and API examples for the Actor output.
  // For details, see https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema
  "fields": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The name of the results"
      },
      "imageUrl": {
        "type": "string",
        "description": "URL of the product image"
      },
      "priceUsd": {
        "type": "integer",
        "description": "Price of the item"
      },
      "manufacturer": {
        "type": "object",
        "properties": {
          "title": { ... },
          "url": { ... }
        }
      },
      ...
    },
    "required": ["title"]
  },
  // Defines how the dataset is presented to users
  "views": {
    "overview": {
      "title": "Products overview",
      "description": "Displays only basic fields such as title and price",
      "transformation": {
        "flatten": ["stockInfo"],
        "fields": [
          "title",
          "imageUrl",
          "variants"
        ]
      },
      "display": {
        "component": "table",
        "properties": {
          "title": {
            "label": "Title"
          },
          "imageUrl": {
            "label": "Image",
            // Optional. Here the format is overridden to render the value as an image instead of link text.
            // The "image" format only works with URLs of .jpeg, .png, or other image files.
            "format": "image"
          },
          "stockInfo.availability": {
            "label": "Availability"
          }
        }
      }
    },
    "productVariants": {
      "title": "Product variants",
      "description": "Each product expanded into item per variant",
      "transformation": {
        "fields": [
          "title",
          "price",
          "productVariants"
        ],
        "unwind": "productVariants"
      },
      "display": {
        // Simply renders all the available fields.
        // This component is used by default when no display is specified.
        "component": "table"
      }
    }
  }
}
DatasetSchema object definition
Property | Type | Required | Description |
---|---|---|---|
actorSpecification | integer | true | Specifies the version of dataset schema structure document. Currently only version 1 is available. |
fields | JSON schema | true | JSON schema object describing the dataset items. More formats may be supported in the future. |
views | DatasetView object | true | An object whose keys are view names and whose values are DatasetView objects describing the API and UI views. |
JSON schema
Items of a dataset can be described by a JSON schema definition, passed into the fields property. The Actor system then ensures that each record added to the dataset complies with the provided schema.
{
  "type": "object",
  "required": ["name", "email"],
  "properties": {
    "id": {
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "email": {
      "type": "string"
    },
    "arr": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [],
        "properties": {
          "site": {
            "type": "string"
          },
          "url": {
            "type": "string"
          }
        }
      }
    }
  }
}
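
To make the compliance check concrete, the following sketch validates records against a schema like the one above. It is a deliberately small, hand-written subset of JSON Schema (only `type`, `required`, `properties`, and `items`) and not the validator the platform uses; `typeOf` and `validate` are hypothetical names.

```javascript
// Minimal, hypothetical validator covering only the JSON Schema keywords used
// in the example: "type", "required", "properties", and "items".
function typeOf(value) {
  if (Array.isArray(value)) return 'array';
  if (value === null) return 'null';
  if (Number.isInteger(value)) return 'integer';
  return typeof value; // 'string', 'number', 'boolean', 'object', ...
}

function validate(value, schema, path = '$') {
  const errors = [];
  const actual = typeOf(value);
  // In JSON Schema, an integer also satisfies "type": "number".
  const typeOk = !schema.type || schema.type === actual
    || (schema.type === 'number' && actual === 'integer');
  if (!typeOk) return [`${path}: expected ${schema.type}, got ${actual}`];

  if (actual === 'object') {
    for (const key of schema.required ?? []) {
      if (!(key in value)) errors.push(`${path}.${key}: required`);
    }
    for (const [key, sub] of Object.entries(schema.properties ?? {})) {
      // Properties that are not listed in the schema are ignored.
      if (key in value) errors.push(...validate(value[key], sub, `${path}.${key}`));
    }
  }
  if (actual === 'array' && schema.items) {
    value.forEach((el, i) => errors.push(...validate(el, schema.items, `${path}[${i}]`)));
  }
  return errors;
}

const schema = {
  type: 'object',
  required: ['name', 'email'],
  properties: {
    id: { type: 'string' },
    name: { type: 'string' },
    email: { type: 'string' },
    arr: {
      type: 'array',
      items: { type: 'object', properties: { site: { type: 'string' }, url: { type: 'string' } } },
    },
  },
};

// A record with both required fields passes.
console.log(validate({ name: 'Ann', email: 'ann@example.com' }, schema));
// This record is missing "email" and has a non-string "arr[0].site".
console.log(validate({ name: 'Ann', arr: [{ site: 42 }] }, schema));
```

For real validation, a full JSON Schema implementation is the safer choice; the sketch only shows which parts of a record the example schema constrains.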
DatasetView object definition
Property | Type | Required | Description |
---|---|---|---|
title | string | true | The title is visible in UI in the Output tab as well as in the API. |
description | string | false | The description is only available in the API response. The usage of this field is optional. |
transformation | ViewTransformation object | true | Defines the data transformation that is applied when dataset data is loaded from the Dataset API. |
display | ViewDisplay object | true | The definition of Output tab UI visualization. |
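
The "Required" columns of the DatasetSchema and DatasetView tables can be turned into a quick structural check. This is a sketch under assumptions: `checkSchemaStructure` is a hypothetical helper, it treats `views` as an object keyed by view name (as in the Structure example), and it does not check the version property.

```javascript
// Hypothetical structural check based on the "Required" columns of the
// DatasetSchema and DatasetView tables. The version property is not checked.
const REQUIRED_VIEW_PROPS = ['title', 'transformation', 'display'];

function checkSchemaStructure(datasetSchema) {
  const problems = [];
  for (const prop of ['fields', 'views']) {
    if (datasetSchema[prop] === undefined) problems.push(`missing "${prop}"`);
  }
  for (const [viewName, view] of Object.entries(datasetSchema.views ?? {})) {
    for (const prop of REQUIRED_VIEW_PROPS) {
      if (view[prop] === undefined) {
        problems.push(`view "${viewName}" is missing "${prop}"`);
      }
    }
  }
  return problems;
}

const problems = checkSchemaStructure({
  fields: { type: 'object', properties: {} },
  views: {
    overview: {
      title: 'Products overview',
      transformation: { fields: ['title'] },
      // "display" is intentionally left out to trigger a problem.
    },
  },
});

console.log(problems); // one problem, for the missing "display"
```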
ViewTransformation object definition
Property | Type | Required | Description |
---|---|---|---|
fields | string[] | true | Selects the fields that are presented in the output. The order of fields matches the order of columns in the visualization UI. If a field's value is missing, it is presented as "undefined" in the UI. |
unwind | string | false | Deconstructs nested children into the parent object, e.g., with unwind: ["foo"], the object {"foo": {"bar": "hello"}} is turned into {"bar": "hello"}. |
flatten | string[] | false | Transforms a nested object into a flat structure, e.g., with flatten: ["foo"], the object {"foo": {"bar": "hello"}} is turned into {"foo.bar": "hello"}. |
omit | string | false | Removes the specified fields from the output. Nested field names can be used as well. |
limit | integer | false | The maximum number of results returned. Default is all results. |
desc | boolean | false | By default, results are sorted in ascending order based on when they were written to the dataset. Setting desc: true returns the newest writes first. |
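
The effect of these options can be sketched in plain JavaScript. This is an illustrative re-implementation, not the Dataset API's code: `flattenField`, `unwindField`, and `applyTransformation` are hypothetical names, the order in which the options are applied is an assumption, flattening goes only one level deep, unwinding an array is assumed to yield one item per element, and `omit` is left out.

```javascript
// flatten: {"foo": {"bar": "hello"}} -> {"foo.bar": "hello"} (one level only)
function flattenField(item, field) {
  const nested = item[field];
  if (nested === null || typeof nested !== 'object' || Array.isArray(nested)) return item;
  const { [field]: _removed, ...rest } = item;
  for (const [key, value] of Object.entries(nested)) rest[`${field}.${key}`] = value;
  return rest;
}

// unwind: {"foo": {"bar": "hello"}} -> {"bar": "hello"}
// Assumption: an array of objects yields one output item per element.
function unwindField(item, field) {
  const nested = item[field];
  if (nested === null || typeof nested !== 'object') return [item];
  const { [field]: _removed, ...rest } = item;
  const children = Array.isArray(nested) ? nested : [nested];
  return children.map((child) => ({ ...rest, ...child }));
}

function applyTransformation(items, t) {
  // desc: true returns the newest writes first.
  let result = t.desc ? [...items].reverse() : [...items];
  // The table types unwind as a string; an array is accepted here as well.
  for (const field of [].concat(t.unwind ?? [])) {
    result = result.flatMap((item) => unwindField(item, field));
  }
  for (const field of t.flatten ?? []) {
    result = result.map((item) => flattenField(item, field));
  }
  // fields: keep only the selected fields, in the given order.
  result = result.map((item) => Object.fromEntries(t.fields.map((f) => [f, item[f]])));
  return t.limit === undefined ? result : result.slice(0, t.limit);
}

// Made-up sample items, listed in write order.
const items = [
  { title: 'Shoe', stockInfo: { availability: 'in stock' }, productVariants: [{ size: 41 }, { size: 42 }] },
  { title: 'Hat', stockInfo: { availability: 'sold out' }, productVariants: [{ size: 1 }] },
];

const overview = applyTransformation(items, {
  flatten: ['stockInfo'],
  fields: ['title', 'stockInfo.availability'],
});
// overview[0] is { title: 'Shoe', 'stockInfo.availability': 'in stock' }

const variants = applyTransformation(items, {
  unwind: 'productVariants',
  fields: ['title', 'size'],
  desc: true,
  limit: 2,
});
// variants is [{ title: 'Hat', size: 1 }, { title: 'Shoe', size: 41 }]

console.log(overview, variants);
```

Note that a field selected in `fields` but absent from an item ends up with the value `undefined`, which mirrors the "undefined" shown in the UI.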
ViewDisplay object definition
Property | Type | Required | Description |
---|---|---|---|
component | string | true | Only component “table” is available. |
properties | Object | false | Object with keys matching the transformation.fields and ViewDisplayProperty objects as values. If properties are not set, the table is rendered automatically with fields formatted as Strings, Arrays, or Objects. |
ViewDisplayProperty object definition
Property | Type | Required | Description |
---|---|---|---|
label | string | false | When the data is visualized in the table view, the label is shown as the table column's header. |
format | enum(text, number, date, link, boolean, image, array, object) | false | Describes how output data values are formatted in order to be rendered in the output tab UI. |
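
How a view's `transformation.fields` and `display.properties` combine into table columns can be sketched as follows. The `buildColumns` helper is hypothetical, and falling back to the field name as the label is an assumption made for this example; a `null` format here simply means that no explicit format was set.

```javascript
// Hypothetical helper: derives table columns from a view definition.
// Assumption: a field without a label falls back to its own name.
function buildColumns(view) {
  const properties = view.display.properties ?? {};
  return view.transformation.fields.map((field) => ({
    field,
    label: properties[field]?.label ?? field,
    // null means no explicit format was set for this field.
    format: properties[field]?.format ?? null,
  }));
}

const columns = buildColumns({
  title: 'Products overview',
  transformation: { fields: ['title', 'imageUrl', 'priceUsd'] },
  display: {
    component: 'table',
    properties: {
      title: { label: 'Title' },
      imageUrl: { label: 'Image', format: 'image' },
    },
  },
});

console.log(columns.map((c) => `${c.label} (${c.format ?? 'auto'})`));
```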