Data Schema
===========

A data schema is a yaml document that provides a detailed description of a
data set.

.. code-block:: yaml

   # example data schema
   name: some_dataset
   description: >-
     An in depth description of what the data set represents.
     It's important to put as much information in here as is
     reasonable, one of the most difficult things about
     working with data is making sure everyone using it
     understands what it means.
   labels:
     custom: value
   fields:
     field_name:
       type: integer
       required: true
       nullable: false
       description: A description of this one field in particular.
       labels:
         another_custom: value
       checks:
       - name: minimum
         args:
           limit: 0

Top level
---------
There are 4 top level keys within a schema.

**name:**
  The name of the data set. This should be a unique short name that can
  be used as an identifier for the data set. *Required.*
**description:**
  A long form description of the data set. This should be extensive enough
  that a person who isn't directly familiar with the data set should be
  able to read it and understand how to use it.
**labels:**
  The :code:`labels` section of a data schema is used to store custom
  key-value pairs. This section provides space for organizations to add
  custom data to a schema to power internal processes. Potential uses
  of this may include retention schedules (:code:`retention: 4yr`),
  alerting targets (:code:`alert: product@example.com`), or storage
  destination (:code:`store_to: cold_storage`). The only requirement
  placed on this section is that it is a dictionary.
**fields:**
  The :code:`fields` section describes the fields (also called columns)
  within a data set. This is the section of the data schema that powers
  record validation. It is formatted as a dictionary mapping field names
  to a detailed field entry that describes various properties of
  the field.

Field Entry
-----------
A field entry describes a specific field within a data set in detail. It
accepts 7 keys.

**type:**
  The type of value expected for the field. Allowed types are
  :code:`string`, :code:`integer`, :code:`float`, or :code:`boolean`.
  More information on types and casting can be found in the
  :doc:`types section <types>`. *Required.*
**required:**
  Whether the field is required to be present in a record being processed.
  If :code:`false` the field can be omitted and it should be filled with a
  null value. If :code:`true` an omitted field should trigger a
  processing error, even if null values would be otherwise tolerated.
  *Default:* :code:`true`.
**nullable:**
  Whether the field is allowed to be null. If :code:`false` a processing
  error should be raised if null is received. If :code:`true` null values
  are accepted without further processing. *Default:* :code:`false`.
**description:**
  A long form description of the field, the values it can take, and what
  the real world meaning of these values is.
**labels:**
  The :code:`labels` section of a field entry shares a similiar purpose with
  the :code:`labels` section at the top level of the document. This section
  allows organizations to add custom data to a field to power internal
  processes. A potential use case would be to mark a field as containing
  personally identifiable information (:code:`pii: true`). The only
  requirement placed on this section is that it is a dictionary.
**checks:**
  A list of checks to run against an incoming value before accepting it.
  These can be stated in a terse form check entry as a string if no arguments
  need to be provided for the checker, or an extended form check entry
  as a dictionary if arguments need to be given. More information on
  available built in checks can be found in the :doc:`checks section <checks>`.
**derived:**
  Whether the field is explicitly declared or derived from context or other
  field values. This provides a means to document data that may not be
  directly present in the data when received. By default a derived field will
  not be considered during record validation. Examples of potential derived
  fields might include a :code:`processed_at` field recording when it was
  validated or a :code:`platform` field that is filled in by taking the user
  agent from another field and mapping it to :code:`web`, :code:`ios`, or
  :code:`android`.

Check Entry
-----------
Check entries have two forms: the terse form that is a single string and
the extended form that is a whole dictionary. An example of the terse form
would be:

.. code-block:: yaml

   # terse form
   checks:
   - json

This can be stated in extended form as:

.. code-block:: yaml

   # extended form
   checks:
   - name: json
     args: {}

Since the :code:`json` check requires to arguments the terse form is
preferred. Some checks, such as `maximum`, require arguments preventing
the terse form from being used.

An extended form check entry will only have two keys.

**name:**
  The name of the check. A list of built-in checks can be found in the
  :doc:`checks section <checks>`.
**args:**
  A dictionary of arguments specific to the check being used. These are
  used to configure the check.