---
no_site_title: true
title: "Preserves: Text Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
{{ site.version_date }}. Version {{ site.version }}.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [abnf]: https://tools.ietf.org/html/rfc7405

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a textual syntax for `Value`s
from the [Preserves data model](preserves.html) that is easy for people
to read and write. An [equivalent machine-oriented binary
syntax](preserves-binary.html) also exists.

## Preliminaries

The definition uses [case-sensitive ABNF][abnf].

ABNF allows easy definition of US-ASCII-based languages. However,
Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
a grammar for recognising sequences of Unicode scalar values.

<a id="encoding"></a>
**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
UTF-8 where possible.

<a id="whitespace"></a>
**Whitespace.** Whitespace `ws` is defined as any number of spaces, tabs,
carriage returns, or line feeds.

                ws = *(%x20 / %x09 / CR / LF)

<a id="commas"></a>
**Commas.** In some positions inside compound terms, commas are permitted and ignored.

            commas = *(ws ",") ws
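
As a minimal sketch of the `commas` rule, the Python below recognises it with a single regular expression. The helper name `skip_commas` is hypothetical and not part of this specification; it returns the position just past any ignorable whitespace and commas.

```python
import re

# commas = *(ws ",") ws, where ws is any run of space, tab, CR, or LF.
COMMAS = re.compile(r"(?:[ \t\r\n]*,)*[ \t\r\n]*")

def skip_commas(text, pos=0):
    """Return the index of the first character after any commas/whitespace."""
    # The pattern can match the empty string, so match() never returns None.
    return COMMAS.match(text, pos).end()
```

A reader for compound terms could call such a helper before each element and again before the closing bracket.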

<a id="delimiters"></a>
**Delimiters.** Some tokens (`Boolean`, `SymbolOrNumber`) *MUST* be
followed by a `delimiter` or by the end of the input.[^delimiters-lookahead]

         delimiter = ws
                   / "<" / ">" / "[" / "]" / "{" / "}"
                   / "#" / ":" / DQUOTE / "|" / "@" / ";" / ","
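
The lookahead requirement can be sketched as follows, assuming a reader that works over an in-memory string. The names `DELIMITERS` and `read_boolean` are illustrative only.

```python
# Characters that may follow a Boolean or SymbolOrNumber token:
# the whitespace characters plus the punctuation in the delimiter rule.
DELIMITERS = set(' \t\r\n<>[]{}#:"|@;,')

def read_boolean(text, pos=0):
    """Return (value, next_pos) if a Boolean token starts at pos, else None."""
    token = text[pos:pos + 2]
    if token not in ("#t", "#f"):
        return None
    end = pos + 2
    # One character of lookahead: the token must be followed by a
    # delimiter or by the end of the input.
    if end < len(text) and text[end] not in DELIMITERS:
        return None
    return (token == "#t", end)
```

The lookahead character is inspected but not consumed, so the caller resumes reading at `next_pos`.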

[^delimiters-lookahead]: The addition of this constraint means that
    implementations must now use some kind of lookahead to make sure a
    delimiter follows a `Boolean`; this should not be onerous, as
    something similar is required to read `SymbolOrNumber`s correctly.

## Grammar

Standalone documents may have trailing whitespace.

          Document = Value ws

Any `Value` may be preceded by whitespace.

             Value = ws (Record / Collection / Atom / Embedded)
        Collection = Sequence / Set / Dictionary
              Atom = Boolean / String / ByteString /
                     QuotedSymbol / SymbolOrNumber

Each `Record` is an angle-bracket enclosed grouping of its
label-`Value` followed by its field-`Value`s.

            Record = "<" Value *Value ws ">"

`Sequence`s are enclosed in square brackets. `Set`s are written as
values enclosed by the tokens `#{` and `}`. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of
values.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys. When
printing sets and dictionaries, implementations *SHOULD* order elements
(for sets) or keys (for dictionaries) according to the [total order over
`Value`s](preserves.html#total-order).[^rationale-print-ordering]

          Sequence =  "[" *(commas Value)              commas "]"
               Set = "#{" *(commas Value)              commas "}"
        Dictionary =  "{" *(commas Value ws ":" Value) commas "}"

  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

  [^rationale-print-ordering]: **Rationale.** Consistently printing
    the elements of unordered collections in some arbitrary but stable
    order helps, for example, keep diffs small and somewhat meaningful
    when Preserves values are pretty-printed to text documents under
    source control.

`Boolean`s are the simple literal strings `#t` and `#f` for true and
false, respectively.

           Boolean = %s"#t" / %s"#f"

`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7),
possibly escaped text surrounded by double quotes. The escaping rules are
the same as for JSON,[^string-json-correspondence]
[^escaping-surrogate-pairs]
[except](https://tools.ietf.org/html/rfc8259#section-8.2) that unpaired
[surrogate code points](https://unicode.org/glossary/#surrogate_code_point)
*MUST NOT* be generated or accepted.[^unpaired-surrogates]

            String = DQUOTE *char DQUOTE
              char = <any unicode scalar value except "\" or DQUOTE> / escaped / "\" DQUOTE
           escaped = "\\" / "\/" / %s"\b" / %s"\f" / %s"\n" / %s"\r" / %s"\t"
                   / %s"\u" 4HEXDIG
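
The following Python sketch decodes the text between the double quotes of a `String`, combining `\u` surrogate pairs and rejecting unpaired surrogates. The function names are hypothetical, and the sketch assumes the enclosing quotes have already been removed.

```python
import string

_SIMPLE = {'"': '"', "\\": "\\", "/": "/", "b": "\b", "f": "\f",
           "n": "\n", "r": "\r", "t": "\t"}
_HEX = set(string.hexdigits)

def _read_u(body, i):
    """Read a \\uXXXX escape at index i; return (code unit, next index)."""
    digits = body[i + 2:i + 6]
    if body[i:i + 2] != "\\u" or len(digits) != 4 or not set(digits) <= _HEX:
        raise ValueError("expected \\uXXXX escape")
    return int(digits, 16), i + 6

def unescape_string_body(body):
    """Decode the text between the double quotes of a String."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == '"' or 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError("character not allowed unescaped")
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        kind = body[i + 1:i + 2]
        if kind in _SIMPLE:
            out.append(_SIMPLE[kind])
            i += 2
        elif kind == "u":
            unit, i = _read_u(body, i)
            if 0xD800 <= unit <= 0xDBFF:
                # A high surrogate must be followed by an escaped low surrogate.
                low, i = _read_u(body, i)
                if not 0xDC00 <= low <= 0xDFFF:
                    raise ValueError("unpaired high surrogate")
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            out.append(chr(unit))
        else:
            raise ValueError("unknown escape")
    return "".join(out)
```

Raising on unpaired surrogates, rather than passing them through, is what distinguishes this from a permissive JSON string decoder.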

  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for scalar values not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle scalar values outside
    the ASCII range correctly.

  [^unpaired-surrogates]: Because Preserves forbids unpaired surrogates in
    its text syntax, any valid JSON text including an unpaired [surrogate
    code point](https://unicode.org/glossary/#surrogate_code_point) will
    not be parseable using the Preserves text syntax rules.

A `ByteString` may be written in any of three different forms.[^rationale-bytestring]

  [^rationale-bytestring]: **Rationale.** While the [machine-oriented
    syntax](preserves-binary.html) defines just one representation for
    binary data, the text syntax is intended primarily for humans to use,
    and so it defines three. Different usages of binary data will be more
    naturally expressed in text as hexadecimal, Base 64, or almost-ASCII.
    Accepting multiple syntax variations improves the ergonomics of the
    text syntax.

The first is similar to a `String`, but prepended with a hash sign `#`.
Many bytes map directly to printable 7-bit ASCII; the remainder must be
escaped, either as `\x` followed by a two-digit hexadecimal number, or
following the usual rules for double quote and backslash.

        ByteString = "#" DQUOTE *binchar DQUOTE
           binchar = <any unicode scalar value ≥32 and ≤126 except "\" or DQUOTE>
                   / "\" ("\" / "/" / %s"b" / %s"f" / %s"n" / %s"r" / %s"t")
                   / %s"\x" 2HEXDIG
                   / "\" DQUOTE
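
A printer for this first form might look like the sketch below. The name `print_bytes_quoted` is hypothetical, and the choice to write every non-printable byte as `\x` (rather than using `\n`, `\t`, and so on) is just one option the grammar permits.

```python
def print_bytes_quoted(data):
    """Render bytes in the string-like ByteString form."""
    out = ['#"']
    for b in data:
        ch = chr(b)
        if ch == '"' or ch == "\\":
            out.append("\\" + ch)        # escape double quote and backslash
        elif 32 <= b <= 126:
            out.append(ch)               # printable 7-bit ASCII maps directly
        else:
            out.append("\\x%02x" % b)    # everything else as two hex digits
    out.append('"')
    return "".join(out)
```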

The second is pairs of hexadecimal digits interleaved with whitespace
and surrounded by `#x"` and `"`.

       ByteString =/ %s"#x" DQUOTE *(ws 2HEXDIG) ws DQUOTE

The third is a sequence of [Base64](https://tools.ietf.org/html/rfc4648)
characters, interleaved with whitespace and surrounded by `#[` and `]`.
[Plain](https://datatracker.ietf.org/doc/html/rfc4648#section-4) (`+`,`/`)
and [URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
(`-`,`_`) Base64 characters are accepted;
[URL-safe](https://datatracker.ietf.org/doc/html/rfc4648#section-5)
(`-`,`_`) characters *SHOULD* be generated by default. Padding characters
(`=`) may be omitted.

       ByteString =/ "#[" *(ws base64char) ws "]"
        base64char = ALPHA / DIGIT / "+" / "/" / "-" / "_" / "="
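
The hexadecimal and Base64 forms can be decoded with Python's standard library, as in the sketch below. It operates on the text between the opening and closing tokens, and all function names are hypothetical.

```python
import base64
import re

_HEX_BODY = re.compile(r"(?:[ \t\r\n]*[0-9a-fA-F]{2})*[ \t\r\n]*")
_WS = re.compile(r"[ \t\r\n]+")

def decode_hex_bytes(body):
    """Decode the text between #x" and " into bytes."""
    # Whitespace may separate pairs of digits but may not split a pair.
    if not _HEX_BODY.fullmatch(body):
        raise ValueError("malformed hexadecimal ByteString")
    return bytes.fromhex(_WS.sub("", body))

def decode_base64_bytes(body):
    """Decode the text between #[ and ] into bytes."""
    compact = _WS.sub("", body).rstrip("=")
    # Accept both alphabets by mapping URL-safe characters to plain ones.
    compact = compact.translate(str.maketrans("-_", "+/"))
    # Padding may be omitted in the source text, so restore it here.
    padded = compact + "=" * (-len(compact) % 4)
    return base64.b64decode(padded, validate=True)

def encode_base64_bytes(data):
    """Encode bytes using the URL-safe alphabet, as suggested for output."""
    return base64.urlsafe_b64encode(data).decode("ascii")
```

For instance, `decode_hex_bytes("68 65 6c6c 6F")` and `decode_base64_bytes("aGVs bG8")` both yield `b"hello"`.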

A `Symbol` may be written in either of two forms.

The first is a quoted form, much the same as the syntax for `String`s,
including embedded escape syntax, except using a bar or pipe character
(`|`) instead of a double quote mark.

      QuotedSymbol = "|" *symchar "|"
           symchar = <any unicode scalar value except "\" or "|"> / escaped / "\|"

Alternatively, a `Symbol` may be written in a “bare” form.[^cf-sexp-token]
The grammar for numeric data is a subset of the grammar for bare `Symbol`s,
so if a `SymbolOrNumber` also matches the grammar for `Double` or
`SignedInteger` then it must be interpreted as one of those, and otherwise
it must be interpreted as a bare `Symbol`.

    SymbolOrNumber = 1*(ALPHA / DIGIT / sympunct / symuchar)
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "-" / "/" / "."
          symuchar = <any scalar value ≥128 whose Unicode category is
                      Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
                      Nl, No, Pc, Pd, Po, Sc, Sm, Sk, So, or Co>

  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

Numeric data follow the [JSON
grammar](https://tools.ietf.org/html/rfc8259#section-6) except that leading
zeros are permitted and an optional leading `+` sign is allowed.
`Double`s always have either a fractional part or an exponent
part, whereas `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

            Double = flt
     SignedInteger = int

               nat = 1*DIGIT
               int = ["-"/"+"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)

  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    implementation may (but, ideally, should not) truncate precision
    when reading or writing a `SignedInteger`; however, if it does so,
    it should (a) signal its client that truncation has occurred, and
    (b) make it clear to the client that comparing such truncated
    values for equality or ordering will not yield results that match
    the expected semantics of the data model.

Some valid IEEE 754 `Double`s are not covered by the grammar
above, namely, the many distinct NaNs and the two infinities. These are
represented as raw hexadecimal strings similar to hexadecimal
`ByteString`s. Implementations are free to use hexadecimal floating-point
syntax wherever convenient, even for values representable using the
grammar above.[^rationale-no-general-machine-syntax]

           Double =/ "#xd" DQUOTE 8(ws 2HEXDIG) ws DQUOTE
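
The sketch below converts between Python floats and the hexadecimal `Double` form. It **assumes** that the eight bytes are the IEEE 754 binary64 representation in big-endian (most significant byte first) order; this section does not state the byte order, so check it against the binary syntax before relying on it. The function names are hypothetical.

```python
import re
import struct

_HEX_DOUBLE = re.compile(r'#xd"((?:[ \t\r\n]*[0-9a-fA-F]{2}){8})[ \t\r\n]*"')
_WS = re.compile(r"[ \t\r\n]")

def double_to_hex_text(x):
    """Render a float in the hexadecimal Double form."""
    # ASSUMPTION: big-endian byte order (">d").
    return '#xd"' + struct.pack(">d", x).hex() + '"'

def hex_text_to_double(text):
    """Parse the hexadecimal Double form back into a float."""
    m = _HEX_DOUBLE.fullmatch(text)
    if m is None:
        raise ValueError("malformed hexadecimal Double")
    raw = bytes.fromhex(_WS.sub("", m.group(1)))
    # ASSUMPTION: same big-endian byte order as above.
    return struct.unpack(">d", raw)[0]
```

A printer could use the decimal grammar for finite values and fall back to this form only when `math.isnan(x)` or `math.isinf(x)` holds.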

  [^rationale-no-general-machine-syntax]: **Rationale.** Previous versions
    of this specification included an escape to the [machine-oriented
    binary syntax](preserves-binary.html) by prefixing a `ByteString`
    containing the binary representation of a `Value` with `#=`. The only
    true need for this feature was to represent otherwise-unrepresentable
    floating-point values. Instead, this specification allows such
    floating-point values to be written directly. Removing the `#=` syntax
    simplifies implementations (there is no longer any need to support the
    machine-oriented syntax) and avoids complications around treatment of
    annotations potentially contained within machine-encoded values.

Finally, an `Embedded` is written as a `Value` chosen to represent the
denoted object, prefixed with `#:`.

           Embedded = "#:" Value

## <a id="annotations"></a>Annotations and Comments

When written down, a `Value` may have an associated sequence of
*annotations* carrying “out-of-band” contextual metadata about the
value. Each annotation is, in turn, a `Value`, and may itself have
annotations. The ordering of annotations attached to a `Value` is
significant.

            Value =/ ws Annotation Value
        Annotation = "@" Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s.

**Comments.** Strings annotating a `Value` are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.

       Annotation =/ "#" [(%x20 / %x09) linecomment] (CR / LF)
       linecomment = *<any unicode scalar value except CR or LF>

When written this way, everything between the hash-space or hash-tab and
the end of the line is included in the string annotating the `Value`.
Comments that are just hash `#` followed immediately by newline yield an
empty-string annotation.
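
The comment rule can be sketched in Python as below. The helper name `read_line_comment` is hypothetical; it returns the annotation string and the position after the line terminator.

```python
import re

# "#" [(%x20 / %x09) linecomment] (CR / LF)
LINE_COMMENT = re.compile(r"#(?:[ \t]([^\r\n]*))?[\r\n]")

def read_line_comment(text, pos=0):
    """Return (comment_string, next_pos), or None if no comment starts at pos."""
    m = LINE_COMMENT.match(text, pos)
    if m is None:
        return None
    # A bare "#" followed by a newline yields the empty string.
    return (m.group(1) or "", m.end())
```

Only the single space or tab after the hash is dropped; any further leading whitespace remains part of the comment string.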

**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.

## Security Considerations

**Whitespace.** The textual format allows arbitrary whitespace in many
positions. Consider optional restrictions on the amount of consecutive
whitespace that may appear.

**Annotations.** Similarly, in modes where a `Value` is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.
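
One way to apply such a restriction is sketched below. The helper `skip_ws_bounded` and its default limit of 4096 characters are illustrative assumptions, not values recommended by this specification.

```python
def skip_ws_bounded(text, pos=0, limit=4096):
    """Skip whitespace starting at pos, refusing runs longer than limit."""
    end = pos
    while end < len(text) and text[end] in " \t\r\n":
        end += 1
        if end - pos > limit:
            raise ValueError("too much consecutive whitespace")
    return end
```

A similar counter on the number of annotations skipped before reaching a `Value` would address the second concern.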

## Acknowledgements

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
directly inspired by [Racket](https://racket-lang.org/)'s lexical
syntax.

## Appendix. Regular expressions for bare symbols and numbers

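When parsing, if a token matches both `SymbolOrNumber` and `Number`, it is a
number; use `Double` and `SignedInteger` to decide which kind. If it
matches `SymbolOrNumber` but not `Number`, it is a "bare" `Symbol`. Note that
the `SymbolOrNumber` expression below covers only the ASCII part of the
grammar; it omits the `symuchar` alternative.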

    SymbolOrNumber: ^[-a-zA-Z0-9~!$%^&*?_=+/.]+$
            Number: ^([-+]?\d+)((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))?$
            Double: ^([-+]?\d+)((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))$
     SignedInteger: ^([-+]?\d+)$
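
A token classifier built on these expressions might look like the following Python sketch. The function name `classify` is hypothetical. The patterns use `[0-9]` instead of `\d` because, in Python, `\d` also matches non-ASCII decimal digits.

```python
import re

SYMBOL_OR_NUMBER = re.compile(r"[-a-zA-Z0-9~!$%^&*?_=+/.]+")
SIGNED_INTEGER = re.compile(r"[-+]?[0-9]+")
DOUBLE = re.compile(
    r"[-+]?[0-9]+(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)")

def classify(token):
    """Return 'SignedInteger', 'Double', 'Symbol', or None for a bare token."""
    if not SYMBOL_OR_NUMBER.fullmatch(token):
        return None                # not a SymbolOrNumber at all
    if SIGNED_INTEGER.fullmatch(token):
        return "SignedInteger"
    if DOUBLE.fullmatch(token):
        return "Double"
    return "Symbol"                # matches SymbolOrNumber but not Number
```

Tokens such as `1e` or `1.` resemble numbers but do not match `Number`, so they are classified as bare symbols.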

When printing, if a symbol matches both `SymbolOrNumber` and `Number` or
neither `SymbolOrNumber` nor `Number`, it must be quoted (`|...|`). If it
matches `SymbolOrNumber` but not `Number`, it may be printed as a "bare"
`Symbol`.
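
The printing rule can be sketched as below, with a hypothetical `print_symbol` helper. Because the ASCII-only `SymbolOrNumber` expression is used, this sketch conservatively quotes any symbol containing non-ASCII characters, even where the grammar would allow a bare form.

```python
import re

SYMBOL_OR_NUMBER = re.compile(r"[-a-zA-Z0-9~!$%^&*?_=+/.]+")
NUMBER = re.compile(
    r"[-+]?[0-9]+(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)?")

def print_symbol(name):
    """Render a Symbol's name, quoting it whenever the bare form is unsafe."""
    if SYMBOL_OR_NUMBER.fullmatch(name) and not NUMBER.fullmatch(name):
        return name
    # Quoted form: escape backslash first, then the bar delimiter.
    escaped = name.replace("\\", "\\\\").replace("|", "\\|")
    return "|" + escaped + "|"
```

Quoting a symbol whose name looks like a number, such as `42`, keeps it from being read back as a `SignedInteger`.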

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes