Data Schema

A data schema is a YAML document that describes a data set in detail.

# example data schema
name: some_dataset
description: >-
  An in-depth description of what the data set represents.
  It's important to put as much information in here as is
  reasonable. One of the most difficult things about
  working with data is making sure everyone using it
  understands what it means.
labels:
  custom: value
fields:
  field_name:
    type: integer
    required: true
    nullable: false
    description: A description of this one field in particular.
    labels:
      another_custom: value
    checks:
    - name: minimum
      args:
        limit: 0

Top level

A schema has four top-level keys.

name:

The name of the data set. This should be a unique short name that can be used as an identifier for the data set. Required.

description:

A long-form description of the data set. It should be thorough enough that someone who isn’t directly familiar with the data set can read it and understand how to use it.

labels:

The labels section of a data schema stores custom key-value pairs. It gives organizations a place to add custom data to a schema to power internal processes. Potential uses include retention schedules (retention: 4yr), alerting targets (alert: product@example.com), or storage destinations (store_to: cold_storage). The only requirement placed on this section is that it is a dictionary.
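
As a sketch, the three uses mentioned above could sit together in one labels block. The key names are only the illustrations from the text, not names the format reserves; any keys an organization chooses would be equally valid.

```yaml
# illustrative top-level labels; keys are arbitrary, organization-defined names
labels:
  retention: 4yr
  alert: product@example.com
  store_to: cold_storage
```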

fields:

The fields section describes the fields (also called columns) within a data set. This is the section of the data schema that powers record validation. It is a dictionary that maps each field name to a field entry describing the properties of that field.

Field Entry

A field entry describes a specific field within a data set in detail. It accepts seven keys.

type:

The type of value expected for the field. Allowed types are string, integer, float, or boolean. More information on types and casting can be found in the types section. Required.

required:

Whether the field must be present in a record being processed. If false, the field can be omitted and should be filled with a null value. If true, an omitted field should trigger a processing error, even if null values would otherwise be tolerated. Default: true.

nullable:

Whether the field is allowed to be null. If false, a processing error should be raised when null is received. If true, null values are accepted without further processing. Default: false.
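
The interaction between required and nullable can be easier to see side by side. The sketch below uses three hypothetical fields (user_id, nickname, referrer) and notes in comments how the rules described above would treat an omitted field and an explicit null. It is an illustration of those rules, not output from any particular validator.

```yaml
fields:
  user_id:            # hypothetical field
    type: integer
    required: true    # omitted -> processing error
    nullable: false   # null    -> processing error
  nickname:           # hypothetical field
    type: string
    required: true    # omitted -> processing error, even though null is tolerated
    nullable: true    # null    -> accepted without further processing
  referrer:           # hypothetical field
    type: string
    required: false   # omitted -> filled with a null value
    nullable: true    # null    -> accepted without further processing
```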

description:

A long-form description of the field, the values it can take, and what those values mean in the real world.

labels:

The labels section of a field entry serves a similar purpose to the labels section at the top level of the document. It allows organizations to add custom data to a field to power internal processes. A potential use case would be to mark a field as containing personally identifiable information (pii: true). The only requirement placed on this section is that it is a dictionary.
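
A minimal sketch of the personally identifiable information case might look like the following. The field name email and its description are hypothetical; pii is the example key from the text, not a reserved name.

```yaml
fields:
  email:              # hypothetical field name
    type: string
    description: The address the user signed up with.
    labels:
      pii: true       # custom label; its meaning is up to the organization
```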

checks:

A list of checks to run against an incoming value before accepting it. Each check can be stated as a terse-form check entry (a string) if the checker needs no arguments, or as an extended-form check entry (a dictionary) if arguments need to be given. More information on the available built-in checks can be found in the checks section.

derived:

Whether the field is explicitly declared or derived from context or other field values. This provides a means to document data that may not be directly present in the data when received. By default, a derived field is not considered during record validation. Examples of derived fields might include a processed_at field recording when the record was validated, or a platform field that is filled in by taking the user agent from another field and mapping it to web, ios, or android.
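
The platform example could be documented along the following lines. This is a sketch that assumes derived takes a boolean value, which the wording above suggests but does not state outright; the field names user_agent and platform are illustrative.

```yaml
fields:
  user_agent:         # received directly in the record
    type: string
    description: The raw user agent string sent by the client.
  platform:           # not present in the incoming record
    type: string
    derived: true     # assumption: boolean flag; skipped during validation by default
    description: Either web, ios, or android, mapped from user_agent.
```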

Check Entry

Check entries have two forms: the terse form, which is a single string, and the extended form, which is a dictionary. An example of the terse form would be:

# terse form
checks:
- json

This can be stated in extended form as:

# extended form
checks:
- name: json
  args: {}

Since the json check requires no arguments, the terse form is preferred. Some checks, such as maximum, require arguments, which prevents the terse form from being used.
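
For comparison, a check that takes arguments has to be written in extended form. The sketch below assumes that maximum accepts a limit argument, by analogy with the minimum check in the example schema at the top of this page; the actual argument name should be confirmed in the checks section.

```yaml
# extended form with arguments (the limit argument for maximum is an assumption)
checks:
- name: maximum
  args:
    limit: 100
```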

An extended-form check entry has exactly two keys.

name:

The name of the check. A list of built-in checks can be found in the checks section.

args:

A dictionary of arguments specific to the check being used. These are used to configure the check.