---
no_site_title: true
title: "Preserves Schema"
---

Tony Garnock-Jones
June 2021. Version 0.1.3.

[abnf]: https://tools.ietf.org/html/rfc7405

This document proposes a Schema language for the [Preserves data model](./preserves.html).

## Introduction

A Preserves schema connects Preserves `Value`s to host-language data structures. Each definition within a schema can be processed by a compiler to produce

 - a host-language *type definition*;
 - a partial *parsing* function from `Value`s to instances of the produced type; and
 - a total *serialization* function from instances of the type to `Value`s.

Every parsed `Value` retains enough information to always be able to be serialized again, and every instance of a host-language data structure contains, by construction, enough information to be successfully serialized.

**Example.** Sending the schema

    version 1 .
    Date = .
    Person = .

to the TypeScript schema compiler produces types,

    type Date = {"year": number, "month": number, "day": number};
    type Person = {"name": string, "birthday": Date};

constructors,

    function Date({year, month, day}: {year: number, month: number, day: number}): Date;
    function Person({name, birthday}: {name: string, birthday: Date}): Person;

partial parsing functions which throw on parse failure,

    function asDate(v: _val): Date;
    function asPerson(v: _val): Person;

total parsing functions which yield `undefined` on parse failure,

    function toDate(v: _val): undefined | Date;
    function toPerson(v: _val): undefined | Person;

and total serialization functions,

    function fromDate(_v: Date): _val;
    function fromPerson(_v: Person): _val;

## Concepts

**Bundle.** A collection of schemas, each named by a module path.

**Definition.** A named pattern within a schema. When compiled, a definition will usually produce a type (plus associated constructors and predicates), a parser function, and a serializer function.

**Metaschema.** The Preserves metaschema is a schema describing the abstract syntax of all schema instances (including itself).

**Module path.** A sequence of symbols, denoting a leaf in a tree with symbol-labelled edges.

**Pattern.** A pattern describes a collection of `Value`s as well as providing names for the portions of matching `Value`s that should be captured in a host-language data type.

**Schema abstract syntax tree (AST).** Schema-manipulating tools will usually work with schema AST; that is, with `Value`s conforming to the metaschema or instances of the corresponding host-language data structures.

**Schema domain-specific language (DSL).** While human beings *can* work directly with Preserves documents matching the metaschema, the schema DSL provides an easier-to-read and -write language for working with schemas that can be translated into instances of the metaschema.

**Schema.** A collection of definitions, plus an optional schema-wide reference to a schema describing embedded values.

## Identifiers and Capitalization Conventions

Throughout, `id` is used in the grammar to denote an *identifier*, which is a symbol that matches the regular expression `^[a-zA-Z][a-zA-Z_0-9]*$`. This is a lowest-common-denominator constraint that allows for a reasonable mapping to the identifiers of many programming languages.

Identifiers are case-sensitive. Schemas should be written with an awareness of the fact that some programming languages cannot preserve case differences. Avoid using two identifiers in the same context that differ only in case.

Schemas should be written using the following capitalization conventions:

 - `UpperCamelCase` for *definition* names.
 - Either `lowerCamelCase` or `UpperCamelCase` for definition-unique names for alternatives within a union definition.
 - `lowerCamelCase` for *module* names (schema names, package names) and *field* or *variable* names.

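
The identifier rule and the advice about case-only differences can be checked mechanically. The following is a minimal sketch, not part of any Preserves tooling: the helper names `isIdentifier` and `caseCollisions` are hypothetical and introduced only for illustration. The regular expression is the one given above.

```typescript
// Hypothetical helpers for illustration; not part of any Preserves library.

// The identifier pattern quoted in this section.
const ID_PATTERN = /^[a-zA-Z][a-zA-Z_0-9]*$/;

// True if `s` is an acceptable `id` according to the pattern above.
function isIdentifier(s: string): boolean {
  return ID_PATTERN.test(s);
}

// Groups the given names by their lower-cased form and returns every
// group containing two or more distinct spellings, i.e. names in the
// same context that differ only in case.
function caseCollisions(names: string[]): string[][] {
  const groups = new Map<string, string[]>();
  for (const name of names) {
    const key = name.toLowerCase();
    const group = groups.get(key) || [];
    if (group.indexOf(name) === -1) {
      group.push(name);
    }
    groups.set(key, group);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}
```

A schema checker built along these lines might reject a name such as `2fast` or `my-name` outright, and merely warn when `caseCollisions` reports a group such as `Date` and `date` within one schema.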
## Concrete (DSL) Syntax

In this section, we use an [ABNF][abnf]-like notation to define a textual syntax that is easy for people to read and write. Most of the examples in this document are written using this syntax. In the following section, we will define the abstract syntax that this surface syntax translates into.

### Schema files and bundles.

Each schema should be placed in a single file. Schema files usually end with extension `.prs`, and consist of a sequence of Preserves `Value`s[^like-sexps] separated into *clauses* by the Preserves `Symbol` "`.`".

[^like-sexps]: That is, schema files use Preserves as a kind of S-expression!

A bundle of schema files is a directory tree containing `.prs` files.

### Clauses.

    Clause = (Version / EmbeddedTypeName / Include / Definition) "."
    Version = "version" "1"
    EmbeddedTypeName = "embeddedType" ("#f" / Ref)
    Include = "include" string
    Definition = id "=" (OrPattern / AndPattern / Pattern)

**Version specification.** Mandatory. Names the version of the schema language used in the file. This version of the specification is referred to in schema files as `version 1`.

**Embedded type name.** Optional. If given as `#f` (the default), it declares that values parsed by the schema do not contain embedded `Value`s of any particular type. If given as a `Ref`, a reference to a definition in this or a neighbouring schema, it declares that embedded `Value`s must themselves conform to the named definition.

**Include.** *Experimental.* Includes the contents of a neighbouring file as if it were textually inserted in place of this clause. The file path may be relative to the current file, or absolute.

**Definition.** Each definition clause implicitly connects a pattern with a type name and a set of associated functions.

### Union definitions.

    OrPattern = AltPattern "/" AltPattern *("/" AltPattern)

The right-hand-side of a definition may supply two or more *alternatives*.
When parsing, the alternatives are tried in order; the result of the first successful alternative is the result of the entire parse.

The type corresponding to an `OrPattern` is a union type, a variant type, or an algebraic sum type, depending on the host language.

Each alternative within an `OrPattern` must have a definition-unique *name*. The name can either be given explicitly as `@name` (see discussion of `NamedPattern` below) or inferred. It can only be inferred from the label of a record pattern, from the name of a reference to another definition, or from the text of a "sufficiently identifierlike" literal pattern (one that matches a string, symbol, number or boolean):

    AltPattern = "@" id SimplePattern
               / "<" id PatternSequence ">"
               / Ref
               / LiteralPattern -- with a side condition

### Intersection definitions.

    AndPattern = NamedPattern "&" NamedPattern *("&" NamedPattern)

The right-hand-side of a definition may supply two or more patterns, the *intersection* of whose denotations is the denotation of the overall definition. When parsing, every pattern is tried; if all succeed, the resulting information is combined into a single record type. When serializing, the terms resulting from serializing at each pattern are *merged* together.

#### Experimental.

Intersections are an experimental feature. They can be used to express *optional dictionary entries*:[^not-ideal-optional-encoding]

    MyDict = {a: int, b: string} & @c MaybeC .
    MaybeC = @present {c: symbol} / @invalid {c: any} / @absent {} .

It is not yet clear whether they pull their weight. In particular, the semantics of serializing a value defined by intersection are not completely clear.

[^not-ideal-optional-encoding]: This encoding is not ideal. It passes responsibility for checking for invalid inputs up to the user, rather than handling it completely at the Schema layer.

### Patterns.

    Pattern = SimplePattern / CompoundPattern

Patterns come in two kinds:

 - the parsers for *simple patterns* yield a single host-language value; for example, a string, an array, a pointer, and so on.
 - the parsers for *compound patterns* yield zero or more *fields* which combine into an overall record type associated with a definition.

#### Simple patterns

    SimplePattern = AnyPattern
                  / AtomKindPattern
                  / EmbeddedPattern
                  / LiteralPattern
                  / SequenceOfPattern
                  / SetOfPattern
                  / DictOfPattern
                  / Ref

The `any` pattern matches any input `Value`:

    AnyPattern = "any"

Specifying the name of a kind of `Atom` matches that kind of atom:

    AtomKindPattern = "bool" / "float" / "double" / "int" / "string" / "bytes" / "symbol"

Embedded input `Value`s are matched with embedded patterns. The portion under the `#!` prefix is the *interface* schema for the embedded value.[^interface-schema] The result of a match is an instance of the schema-wide `embeddedType`, if one is supplied.

    EmbeddedPattern = "#!" SimplePattern

A literal pattern may be expressed in any of three ways: non-symbol atoms stand for themselves directly; symbols, prefixed with an equal sign, are matched literally; and any `Value` at all may be quoted by placing it in a `< ... >` record:

    LiteralPattern = "="symbol / "<" value ">" / non-symbol-atom

Brackets containing an item pattern and a literal ellipsis match a sequence of items, each matching the nested item pattern. Sets and uniform dictionaries are similar.

    SequenceOfPattern = "[" SimplePattern "..." "]"
    SetOfPattern = "#{" SimplePattern "}"
    DictOfPattern = "{" SimplePattern ":" SimplePattern "...:..." "}"

Finally, a reference to some other definition, in this schema or a neighbouring schema within this bundle, is made by mentioning the possibly-qualified name of the definition as a bare symbol:

    Ref = symbol

Periods "`.`" in such symbols are special:

 - `Name` refers to the definition named `Name` in the current schema.
 - `Mod.Submod.Name` refers to definition `Name` in `Mod.Submod`, some other schema in the bundle.

Each period-separated portion of a reference name must be an `id`, an identifier.

[^interface-schema]: Embedded patterns are experimental. One interpretation is that an embedded value denotes a reference to some stateful actor in a potentially-distributed system, and that the interface schema associated with an embedded value describes the messages that may be sent to that actor.

    **Examples.** `#!any` may denote a reference to an Actor able to receive any value as a message; `#!#t`, a reference to an Actor expecting *only* the "true" message; `#!Session`, a reference to an Actor expecting any message matching a schema defined as `Session` in this file.

#### Compound patterns

    CompoundPattern = RecordPattern
                    / TuplePattern
                    / VariableTuplePattern
                    / DictionaryPattern

A record pattern matches an input record. It may be specified as a record with a literal in the label position, or as a quoted `< ... >` record with a pattern for each of the label and field-sequence positions:[^record-shorthand]

    RecordPattern = "<" NamedPattern NamedPattern ">"
                  / "<" value PatternSequence ">"
    PatternSequence = *(NamedPattern) [NamedSimplePattern "..."]

[^record-shorthand]: Note that `