Data Schema
===========

A data schema is a yaml document that provides a detailed description of a
data set.

.. code-block:: yaml

    # example data schema
    name: some_dataset
    description: >-
      An in depth description of what the data set represents. It's important
      to put as much information in here as is reasonable, one of the most
      difficult things about working with data is making sure everyone using
      it understands what it means.
    labels:
      custom: value
    fields:
      field_name:
        type: integer
        required: true
        nullable: false
        description: A description of this one field in particular.
        labels:
          another_custom: value
        checks:
          - name: minimum
            args:
              limit: 0

Top level
---------

There are 4 top level keys within a schema.

**name:** The name of the data set. This should be a unique short name that
can be used as an identifier for the data set. *Required.*

**description:** A long form description of the data set. This should be
extensive enough that a person who isn't directly familiar with the data set
should be able to read it and understand how to use it.

**labels:** The :code:`labels` section of a data schema is used to store
custom key-value pairs. This section provides space for organizations to add
custom data to a schema to power internal processes. Potential uses of this
may include retention schedules (:code:`retention: 4yr`), alerting targets
(:code:`alert: product@example.com`), or storage destination
(:code:`store_to: cold_storage`). The only requirement placed on this section
is that it is a dictionary.

**fields:** The :code:`fields` section describes the fields (also called
columns) within a data set. This is the section of the data schema that powers
record validation. It is formatted as a dictionary mapping field names to a
detailed field entry that describes various properties of the field.

Field Entry
-----------

A field entry describes a specific field within a data set in detail. It
accepts 7 keys.

**type:** The type of value expected for the field.
Allowed types are :code:`string`, :code:`integer`, :code:`float`, or
:code:`boolean`. More information on types and casting can be found in the
:doc:`types section `. *Required.*

**required:** Whether the field is required to be present in a record being
processed. If :code:`false`, the field can be omitted and it should be filled
with a null value. If :code:`true`, an omitted field should trigger a
processing error, even if null values would otherwise be tolerated.
*Default:* :code:`true`.

**nullable:** Whether the field is allowed to be null. If :code:`false`, a
processing error should be raised if null is received. If :code:`true`, null
values are accepted without further processing. *Default:* :code:`false`.

**description:** A long form description of the field, the values it can take,
and what the real world meaning of these values is.

**labels:** The :code:`labels` section of a field entry serves a similar
purpose to the :code:`labels` section at the top level of the document. This
section allows organizations to add custom data to a field to power internal
processes. A potential use case would be to mark a field as containing
personally identifiable information (:code:`pii: true`). The only requirement
placed on this section is that it is a dictionary.

**checks:** A list of checks to run against an incoming value before accepting
it. Each check can be stated as a terse form check entry (a string) if the
checker needs no arguments, or as an extended form check entry (a dictionary)
if arguments need to be given. More information on available built-in checks
can be found in the :doc:`checks section `.

**derived:** Whether the field is explicitly declared or derived from context
or other field values. This provides a means to document data that may not be
directly present in the data when received. By default a derived field will
not be considered during record validation.
Examples of potential derived fields might include a :code:`processed_at`
field recording when it was validated or a :code:`platform` field that is
filled in by taking the user agent from another field and mapping it to
:code:`web`, :code:`ios`, or :code:`android`.

Check Entry
-----------

Check entries have two forms: the terse form, which is a single string, and
the extended form, which is a dictionary. An example of the terse form would
be:

.. code-block:: yaml

    # terse form
    checks:
      - json

This can be stated in extended form as:

.. code-block:: yaml

    # extended form
    checks:
      - name: json
        args: {}

Since the :code:`json` check requires no arguments, the terse form is
preferred. Some checks, such as :code:`maximum`, require arguments, which
prevents the terse form from being used.

An extended form check entry has exactly two keys.

**name:** The name of the check. A list of built-in checks can be found in the
:doc:`checks section `.

**args:** A dictionary of arguments specific to the check being used. These
are used to configure the check.
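
As an illustrative sketch, the two forms might sit side by side in one schema
as shown below. The field names :code:`payload` and :code:`retry_count` are
hypothetical and chosen only for this example. The :code:`json` check and the
:code:`minimum` check with its :code:`limit` argument are the ones already
used in the examples on this page.

.. code-block:: yaml

    # sketch only: field names are hypothetical
    fields:
      payload:
        type: string
        checks:
          - json            # terse form, no arguments needed
      retry_count:
        type: integer
        checks:
          - name: minimum   # extended form, arguments required
            args:
              limit: 0

Here the terse entry keeps the argument-free check short, while the extended
entry carries the configuration the check needs.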