Dataset schema file specification 1.0

Dataset storage enables you to sequentially store and retrieve data records, in various formats. Each Actor run is assigned its own dataset, which is created when the first item is stored to it. Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel, or HTML formats.

This specification is also available in the Apify documentation.

A dataset can be assigned a schema that describes:

  • Content of the dataset, i.e., the schema of objects that are allowed to be added
  • Different views of the data, i.e., transformations
  • Visualization of each view using predefined components (grid, table, …), which improves the run view interface in Apify Console and also provides a better interface for datasets shared by Apify users

Dataset schema

Basic properties

  • Storage is immutable, i.e., if you want to change the structure, you need to create a new dataset.
  • The schema is weak, i.e., you can always push additional properties, but the schema ensures that all the listed ones are present with the correct type. This makes Actors more compatible: an Actor may expect a dataset to contain certain fields but not care about any additional ones.

There are two ways to create a dataset with a schema:

  1. The user can start an Actor that has the dataset schema linked from its Output schema.

  2. Or the user can do it programmatically via the API (for an empty dataset):

    • either by passing the schema as the payload to the create dataset API endpoint,
    • or by using the SDK:
    const dataset = await Apify.openDataset('my-new-dataset', { schema });

When you open an existing dataset with the schema parameter, the system checks that the dataset is compatible with the Actor; otherwise, you get an error:

Uncaught Error: Dataset schema is not compatible with the provided schema

Structure

{
    "actorDatasetSchemaVersion": 1,
    "title": "E-shop products",
    "description": "Dataset containing the whole product catalog including prices and stock availability.",

    // A JSON schema object describing the dataset fields, with our extensions: the "title", "description", and "example" properties.
    // "example" is used to generate code and API examples for the Actor output.
    // For details, see https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema
    "fields": {
        "type": "object",
        "properties": {
            "title": {
                "type": "string",
                "description": "The name of the results",
            },
            "imageUrl": {
                "type": "string",
                "description": "URL of the product image",
            },
            "priceUsd": {
                "type": "integer",
                "description": "Price of the item",
            },
            "manufacturer": {
                "type": "object",
                "properties": {
                    "title": { ... },
                    "url": { ... },
                }
            },
            ...
        },
        "required": ["title"],
    },

    // Defines how the dataset is presented to users
    "views": {
        "overview": {
            "title": "Products overview",
            "description": "Displays only basic fields such as title and price",
            "transformation": {
                "flatten": ["stockInfo"],
                "fields": [
                    "title",
                    "imageUrl",
                    "variants"
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    "title": {
                        "label": "Title"
                    },
                    "imageUrl": {
                        "label": "Image",
                        "format": "image" // Optional. Overrides the default "text" format so the value is rendered as an image instead of a link. The "image" format only works with URLs of .jpeg, .png, or other image files.
                    },
                    "stockInfo.availability": {
                        "label": "Availability"
                    }
                }
            }
        },
        "productVariants": {
            "title": "Product variants",
            "description": "Each product expanded into item per variant",
            "transformation": {
                "fields": [
                    "title",
                    "price",
                    "productVariants"
                ],
                "unwind": "productVariants"
            },
            "display": {
                // Simply renders all the available fields.
                // This component is used by default when no display is specified.
                "component": "table"
            }
        }
    },
}

DatasetSchema object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| actorSpecification | integer | true | Specifies the version of the dataset schema structure document. Currently only version 1 is available. |
| fields | JSON schema | true | JSON schema object with more formats in the future. |
| views | [DatasetView] | true | An array of objects with a description of an API and UI views. |

JSON schema

The items of a dataset can be described by a JSON schema definition passed in the fields property. The Actor system then ensures that each record added to the dataset complies with the provided schema.

{
    "type": "object",
    "required": ["name", "email"],
    "properties": {
        "id": {
            "type": "string"
        },
        "name": {
            "type": "string"
        },
        "email": {
            "type": "string"
        },
        "arr": {
            "type": "array",
            "items": {
                "type": "object",
                "required": [],
                "properties": {
                    "site": {
                        "type": "string"
                    },
                    "url": {
                        "type": "string"
                    }
                }
            }
        }
    }
}
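
To make the "weak schema" behaviour concrete, the following is a minimal sketch of how such a check could work. It is not the platform's validator: the check helper and its error messages are hypothetical, and only the type, required, properties, and items keywords are handled. It assumes that properties not listed in the schema are simply ignored.

```javascript
// Hypothetical sketch of weak validation; not the Apify platform's validator.
function typeOf(value) {
    if (Array.isArray(value)) return 'array';
    if (value === null) return 'null';
    if (Number.isInteger(value)) return 'integer';
    return typeof value; // 'string', 'number', 'boolean', 'object', ...
}

function matchesType(value, type) {
    const actual = typeOf(value);
    // In JSON schema, every integer is also a valid "number".
    return actual === type || (type === 'number' && actual === 'integer');
}

// Returns a list of error messages; an empty list means the value complies.
function check(value, schema, path) {
    if (schema.type && !matchesType(value, schema.type)) {
        return [`${path}: expected ${schema.type}, got ${typeOf(value)}`];
    }
    const errors = [];
    if (schema.type === 'object') {
        for (const name of schema.required || []) {
            if (!(name in value)) errors.push(`${path}.${name}: required property is missing`);
        }
        for (const [name, propSchema] of Object.entries(schema.properties || {})) {
            // Properties that are not listed in the schema are never inspected.
            if (name in value) errors.push(...check(value[name], propSchema, `${path}.${name}`));
        }
    }
    if (schema.type === 'array' && schema.items) {
        value.forEach((element, i) => errors.push(...check(element, schema.items, `${path}[${i}]`)));
    }
    return errors;
}

// The schema from the example above.
const schema = {
    type: 'object',
    required: ['name', 'email'],
    properties: {
        id: { type: 'string' },
        name: { type: 'string' },
        email: { type: 'string' },
        arr: {
            type: 'array',
            items: {
                type: 'object',
                required: [],
                properties: { site: { type: 'string' }, url: { type: 'string' } },
            },
        },
    },
};

// The extra, unlisted property "extra" is accepted.
const ok = check({ name: 'Jane', email: 'jane@example.com', extra: 42 }, schema, 'item');
// A missing required field and a wrongly typed nested field are reported.
const bad = check({ name: 'Jane', arr: [{ site: 1 }] }, schema, 'item');
console.log(ok);
console.log(bad);
```

Here ok is an empty list, while bad reports the missing email and the non-string arr[0].site.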

DatasetView object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| title | string | true | The title is visible in the UI in the Output tab as well as in the API. |
| description | string | false | The description is only available in the API response. The usage of this field is optional. |
| transformation | ViewTransformation object | true | The definition of the data transformation that is applied when dataset data are loaded from the Dataset API. |
| display | ViewDisplay object | true | The definition of the Output tab UI visualization. |

ViewTransformation object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| fields | string[] | true | Selects the fields that are going to be presented in the output. The order of fields matches the order of columns in the visualization UI. If a field's value is missing, it is presented as "undefined" in the UI. |
| unwind | string | false | Deconstructs nested children into the parent object, e.g., with unwind:["foo"], the object {"foo":{"bar":"hello"}} is turned into {"bar":"hello"}. |
| flatten | string[] | false | Transforms a nested object into a flat structure, e.g., with flatten:["foo"], the object {"foo":{"bar":"hello"}} is turned into {"foo.bar":"hello"}. |
| omit | string | false | Removes the specified fields from the output. Nested field names can be used as well. |
| limit | integer | false | The maximum number of results returned. Default is all results. |
| desc | boolean | false | By default, results are sorted in ascending order based on the write event into the dataset. With desc:true, the newest writes to the dataset are returned first. |
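
The flatten, unwind, and fields options can be illustrated with a short sketch. The helpers below are hypothetical and are not the platform's implementation. They assume that flatten goes one level deep, that unwind on an array yields one output item per element (as the "Product variants" view description suggests), and that unwind is applied before flatten and fields.

```javascript
// Hypothetical helpers illustrating ViewTransformation options.
function isPlainObject(value) {
    return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// flatten: {"foo":{"bar":"hello"}} becomes {"foo.bar":"hello"} (one level only).
function flatten(item, names) {
    const out = {};
    for (const [key, value] of Object.entries(item)) {
        if (names.includes(key) && isPlainObject(value)) {
            for (const [childKey, childValue] of Object.entries(value)) {
                out[`${key}.${childKey}`] = childValue;
            }
        } else {
            out[key] = value;
        }
    }
    return out;
}

// unwind: {"foo":{"bar":"hello"}} becomes {"bar":"hello"}.
// Assumption: an array value produces one output item per element.
function unwind(item, name) {
    const { [name]: value, ...rest } = item;
    if (Array.isArray(value)) {
        return value.map((element) => (
            isPlainObject(element) ? { ...rest, ...element } : { ...rest, [name]: element }
        ));
    }
    if (isPlainObject(value)) return [{ ...rest, ...value }];
    return [item]; // nothing to unwind
}

// fields: keeps only the listed fields, in the listed order.
// A field that is absent from the item ends up as undefined.
function pickFields(item, fields) {
    const out = {};
    for (const field of fields) out[field] = item[field];
    return out;
}

// The two examples from the table above.
const flattened = flatten({ foo: { bar: 'hello' } }, ['foo']);
const unwound = unwind({ foo: { bar: 'hello' } }, 'foo');

// A made-up product item, expanded into one row per variant.
const product = {
    title: 'T-shirt',
    stockInfo: { availability: 'in stock' },
    productVariants: [{ size: 'M' }, { size: 'L' }],
};
const rows = unwind(product, 'productVariants').map((row) => (
    pickFields(flatten(row, ['stockInfo']), ['title', 'size', 'stockInfo.availability'])
));

console.log(flattened);
console.log(unwound);
console.log(rows);
```

With the sample product, rows contains two items, one for each variant, each carrying the parent title and the flattened stockInfo.availability column.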

ViewDisplay object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| component | string | true | Only component "table" is available. |
| properties | Object | false | Object with keys matching the transformation.fields and ViewDisplayProperty as values. In case properties are not set the table will be rendered automatically with fields formatted as Strings, Arrays or Objects. |

ViewDisplayProperty object definition

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| label | string | false | When the data are visualized in the table view, the label is shown in the table column's header. |
| format | enum(text, number, date, link, boolean, image, array, object) | false | Describes how output data values are formatted in order to be rendered in the Output tab UI. |
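
As a rough illustration of how label and format might be resolved for one table column, consider the sketch below. Both helpers are hypothetical. The fallback to text, array, or object follows the ViewDisplay description of automatic rendering, while using the field name as the header when no label is set is purely an assumption.

```javascript
// Hypothetical helpers; not part of the Apify SDK or Console.
const FORMATS = ['text', 'number', 'date', 'link', 'boolean', 'image', 'array', 'object'];

// Picks the format for a cell: an explicit "format" wins, otherwise the
// format is inferred from the value as a string, array, or object.
function resolveFormat(value, property = {}) {
    if (property.format !== undefined) {
        if (!FORMATS.includes(property.format)) {
            throw new Error(`Unknown format: ${property.format}`);
        }
        return property.format;
    }
    if (Array.isArray(value)) return 'array';
    if (value !== null && typeof value === 'object') return 'object';
    return 'text';
}

// Assumption: without a label, the field name is used as the column header.
function columnHeader(fieldName, property = {}) {
    return property.label ?? fieldName;
}

const displayProperties = {
    title: { label: 'Title' },
    imageUrl: { label: 'Image', format: 'image' },
};

console.log(resolveFormat('https://example.com/a.png', displayProperties.imageUrl));
console.log(resolveFormat('T-shirt', displayProperties.title));
console.log(resolveFormat(['M', 'L']));
console.log(columnHeader('imageUrl', displayProperties.imageUrl));
console.log(columnHeader('priceUsd'));
```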