---
no_site_title: true
title: "Preserves: an Expressive Data Language"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
May 2021. Version 0.6.0.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405
  [canonical]: canonical-binary.html

This document proposes a data model and serialization format called
*Preserves*.

Preserves supports *records* with user-defined *labels*, embedded
*references*, and the usual suite of atomic and compound data types,
including *binary* data as a distinct type from text strings. Its
*annotations* allow separation of data from metadata such as
[comments](conventions.html#comments), trace information, and
provenance information.

Preserves departs from many other data languages in defining how to
*compare* two values. Comparison is based on the data model, not on
syntax or on data structures of any particular implementation
language.

## Starting with Semantics

Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax.

Our `Value`s fall into two broad categories: *atomic* and *compound*
data. Every `Value` is finite and non-cyclic. Embedded values, called
`Embedded`s, are a third, special-case category.

                          Value = Atom
                                | Compound
                                | Embedded

                           Atom = Boolean
                                | Float
                                | Double
                                | SignedInteger
                                | String
                                | ByteString
                                | Symbol

                       Compound = Record
                                | Sequence
                                | Set
                                | Dictionary

**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:

            (Values)        Atom < Compound < Embedded

            (Compounds)     Record < Sequence < Set < Dictionary

            (Atoms)         Boolean < Float < Double < SignedInteger
                              < String < ByteString < Symbol

**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.
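
The cross-kind part of the total order can be sketched as a rank
lookup. The Python below is only an illustration: the names
`KIND_ORDER` and `compare_kinds` are hypothetical and not part of
Preserves, and two values of the same kind still need the
kind-specific rules given in the following sections.

```python
# Kinds listed from least to greatest, following the total order above.
KIND_ORDER = [
    "Boolean", "Float", "Double", "SignedInteger",
    "String", "ByteString", "Symbol",             # Atoms
    "Record", "Sequence", "Set", "Dictionary",    # Compounds
    "Embedded",
]
KIND_RANK = {kind: rank for rank, kind in enumerate(KIND_ORDER)}

def compare_kinds(a: str, b: str) -> int:
    """Return -1, 0 or 1 as kind `a` sorts before, with, or after kind `b`."""
    ra, rb = KIND_RANK[a], KIND_RANK[b]
    return (ra > rb) - (ra < rb)
```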

### Signed integers.

A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers.

### Unicode strings.

A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]

  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!
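
The agreement between code-point order and UTF-8 byte order can be
checked with a small Python sketch. The function names and the sample
strings are illustrative choices, not part of the specification.

```python
def by_code_points(a: str, b: str) -> int:
    """Compare two strings lexicographically, code point by code point."""
    ka, kb = [ord(c) for c in a], [ord(c) for c in b]
    return (ka > kb) - (ka < kb)

def by_utf8_bytes(a: str, b: str) -> int:
    """Compare two strings by the bytes of their UTF-8 encodings."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)

# Hypothetical samples covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
SAMPLES = ["", "a", "ab", "z", "\u00e9", "\uffff", "\U00010000", "a\U0001f600"]

def orders_agree(samples=SAMPLES) -> bool:
    """True when both comparisons give the same answer for every pair."""
    return all(by_code_points(a, b) == by_utf8_bytes(a, b)
               for a in samples for b in samples)
```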

### Binary data.

A `ByteString` is a sequence of octets. `ByteString`s are compared
lexicographically.

### Symbols.

Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.

### Booleans.

There are two `Boolean`s, “false” and “true”. The “false” value is
less-than the “true” value.

### IEEE floating-point values.

`Float`s and `Double`s are single- and double-precision IEEE 754
floating-point values, respectively. `Float`s, `Double`s and
`SignedInteger`s are disjoint; by the rules [above](#total-order),
every `Float` is less than every `Double`, and every `SignedInteger`
is greater than both. Two `Float`s or two `Double`s are to be ordered
by the `totalOrder` predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
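
One common way to realise `totalOrder` for the binary formats is to
compare adjusted bit patterns. The Python sketch below does this for
`Double`s (Python's `float`); the names `total_order_key` and
`compare_doubles` are hypothetical, and the bit-pattern technique is an
implementation choice assumed here rather than something Preserves
prescribes.

```python
import struct

def total_order_key(x: float) -> int:
    """Map a Double to an unsigned integer whose natural order is totalOrder."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Sign bit set: invert every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so non-negative values sort above negatives.
    return bits | (1 << 63)

def compare_doubles(a: float, b: float) -> int:
    """Return -1, 0 or 1 according to the key ordering of `a` and `b`."""
    ka, kb = total_order_key(a), total_order_key(b)
    return (ka > kb) - (ka < kb)
```

Under this ordering negative zero sorts before positive zero, and a
positive NaN sorts after positive infinity.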

### Records.

A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
[^iri-labels] `Record`s are compared lexicographically: first by
label, then by field sequence.

  [^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
    structure types, which map well to our `Record`s. Racket supports
    record extensibility by encoding record supertypes into record
    labels as specially-formatted lists.

  [^iri-labels]: It is occasionally (but seldom) necessary to
    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
    label can be read as a relative IRI, it is notionally interpreted
    with respect to the IRI
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
    for itself—for its own `Value`.

### Sequences.

A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
lexicographically.

### Sets.

A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s.

### Dictionaries.

A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key.
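
The comparison rules for `Set`s and `Dictionary`s can be sketched in
Python as follows. This is a simplification under stated assumptions:
elements, keys and values are all `SignedInteger`s, so that Python's
built-in integer ordering coincides with the total order, and pairs
are compared key first, then value. The function names are
hypothetical.

```python
def compare_sequences(a: list, b: list) -> int:
    """Lexicographic comparison of two lists; a proper prefix sorts first."""
    return (a > b) - (a < b)

def compare_sets(a: set, b: set) -> int:
    """Sort each set ascending, then compare the resulting sequences."""
    return compare_sequences(sorted(a), sorted(b))

def compare_dictionaries(a: dict, b: dict) -> int:
    """Order each dictionary's pairs by key, then compare the sequences."""
    # Keys are pairwise distinct, so sorting the (key, value) tuples
    # orders them by key alone.
    return compare_sequences(sorted(a.items()), sorted(b.items()))
```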

### Embeddeds.

An `Embedded` allows inclusion of *domain-specific*, potentially
*stateful* or *located* data into a `Value`.[^embedded-rationale]
`Embedded`s may be used to denote stateful objects, network services,
object capabilities, file descriptors, Unix processes, or other
possibly-stateful things. Because each `Embedded` is a domain-specific
datum, comparison of two `Embedded`s is done according to
domain-specific rules.

  [^embedded-rationale]: **Rationale.** Why include `Embedded`s as a
    special class, distinct from, say, a specially-labeled `Record`?
    First, a `Record` can only hold other `Value`s: in order to embed
    values such as live pointers to Java objects, some means of
    "escaping" from the `Value` data type must be provided. Second,
    `Embedded`s are meant to be able to denote stateful entities, for
    which comparison by address is appropriate; however, we do not
    wish to place restrictions on the *nature* of these entities: if
    we had used `Record`s instead of distinct `Embedded`s, users would
    have to invent an encoding of domain data into `Record`s that
    reflected domain ordering into `Value` ordering. This is often
    difficult and may not always be possible. Finally, because
    `Embedded`s are intended to be able to represent network and
    memory *locations*, they must be able to be rewritten at network
    and process boundaries. Having a distinct class allows generic
    `Embedded` rewriting without the quotation-related complications
    of encoding references as, say, `Record`s.

*Examples.* In a Java or Python implementation, an `Embedded` may
denote a reference to a Java or Python object; comparison would be
done via the language's own rules for equivalence and ordering. In a
Unix application, an `Embedded` may denote an open file descriptor or
a process ID. In an HTTP-based application, each `Embedded` might be a
URL, compared according to
[RFC 6943](https://tools.ietf.org/html/rfc6943#section-3.3). When a
`Value` is serialized for storage or transfer, `Embedded`s will
usually be represented as ordinary `Value`s, in which case the
ordinary rules for comparing `Value`s will apply.

## Textual Syntax

Now that we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.

In this section, we use [case-sensitive ABNF][abnf] to define a
textual syntax that is easy for people to read and
write.[^json-superset] Most of the examples in this document are
written using this syntax. In the following section, we will define an
equivalent compact machine-readable syntax.

  [^json-superset]: The grammar of the textual syntax is a superset of
    JSON, with the slightly unusual feature that `true`, `false`, and
    `null` are all read as `Symbol`s, and that `SignedInteger`s are
    never read as `Double`s.

### Character set.

[ABNF][abnf] allows easy definition of US-ASCII-based languages.
However, Preserves is a Unicode-based language. Therefore, we
reinterpret ABNF as a grammar for recognising sequences of Unicode
code points.

Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
possible.

### Whitespace.

Whitespace is defined as any number of spaces, tabs, carriage returns,
line feeds, or commas.

                ws = *(%x20 / %x09 / newline / ",")
           newline = CR / LF

### Grammar.

Standalone documents may have trailing whitespace.

          Document = Value ws

Any `Value` may be preceded by whitespace.

             Value = ws (Record / Collection / Atom / Embedded / Compact)
        Collection = Sequence / Dictionary / Set
              Atom = Boolean / Float / Double / SignedInteger /
                     String / ByteString / Symbol

Each `Record` is an angle-bracket enclosed grouping of its
label-`Value` followed by its field-`Value`s.

            Record = "<" Value *Value ws ">"

`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written as values enclosed by the tokens `#{` and
`}`.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.

          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
               Set = "#{" *Value ws "}"

  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

`Boolean`s are the simple literal strings `#t` and `#f` for true and
false, respectively.

           Boolean = %s"#t" / %s"#f"

Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing “f” distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have either a fractional part or
an exponent part, whereas `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

             Float = flt %i"f"
            Double = flt
     SignedInteger = int

          digit1-9 = %x31-39
               nat = %x30 / ( digit1-9 *DIGIT )
               int = ["-"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)

  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    implementation may (but, ideally, should not) truncate precision
    when reading or writing a `SignedInteger`; however, if it does so,
    it should (a) signal its client that truncation has occurred, and
    (b) make it clear to the client that comparing such truncated
    values for equality or ordering will not yield results that match
    the expected semantics of the data model.
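
The numeric grammar above can be transcribed into regular expressions
to see which kind a token falls into. The Python sketch below is an
illustration only; `classify` is a hypothetical helper that decides
the kind of an isolated token and does not convert it to a number.

```python
import re

NAT = r"(?:0|[1-9][0-9]*)"                  # nat  = %x30 / ( digit1-9 *DIGIT )
INT = rf"-?{NAT}"                           # int  = ["-"] nat
FRAC = r"\.[0-9]+"                          # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"                    # exp  = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})" # flt  = int (frac exp / frac / exp)

def classify(token: str):
    """Return 'Float', 'Double', 'SignedInteger', or None for a token."""
    if re.fullmatch(FLT + "[fF]", token):
        return "Float"
    if re.fullmatch(FLT, token):
        return "Double"
    if re.fullmatch(INT, token):
        return "SignedInteger"
    return None
```

For instance, `1.0f` is a `Float`, `1e3` is a `Double`, and `1` is a
`SignedInteger`, while `01`, `1.` and `1f` match none of the three
productions.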

`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
            escape = %x5C              ; \
           escaped = ( %x5C /          ; \    reverse solidus U+005C
                       %x2F /          ; /    solidus         U+002F
                       %x62 /          ; b    backspace       U+0008
                       %x66 /          ; f    form feed       U+000C
                       %x6E /          ; n    line feed       U+000A
                       %x72 /          ; r    carriage return U+000D
                       %x74 )          ; t    tab             U+0009

  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle non-ASCII
    codepoints correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.

        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#x"` and `"`.

       ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22

The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
Base64 characters are allowed.

       ByteString =/ "#[" *(ws / base64char) ws "]"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
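
As an illustration of the second and third forms, the Python sketch
below decodes the text found between the delimiters. It is
deliberately lenient and makes some assumptions: it strips whitespace
(including commas) before decoding, so it does not reject whitespace
that splits a pair of hexadecimal digits, and it treats `=` padding as
optional. The function names are hypothetical.

```python
import base64

WS = " \t\r\n,"   # whitespace as defined by the textual syntax

def strip_ws(text: str) -> str:
    """Remove every whitespace character, commas included."""
    return "".join(c for c in text if c not in WS)

def decode_hex_form(body: str) -> bytes:
    """Decode the text between `#x"` and `"`."""
    return bytes.fromhex(strip_ws(body))

def decode_base64_form(body: str) -> bytes:
    """Decode the text between `#[` and `]`, plain or URL-safe alphabet."""
    chars = strip_ws(body).rstrip("=")
    chars = chars.replace("-", "+").replace("_", "/")
    chars += "=" * (-len(chars) % 4)     # restore padding for b64decode
    return base64.b64decode(chars)
```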

A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.

            Symbol = symstart *symcont / "|" *symchar "|"
          symstart = ALPHA / sympunct / symustart
           symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "/" / "."
           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
         symustart = <any code point greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                      Pc, Po, Sc, Sm, Sk, So, or Co>
          symucont = <any code point greater than 127 whose Unicode
                      category is Nd, Nl, No, or Pd>

  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

An `Embedded` is written as a `Value` chosen to represent the denoted
object, prefixed with `#!`.

           Embedded = "#!" Value

Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#=`.[^rationale-switch-to-binary]
[^no-literal-binary-in-text] [^compact-value-annotations]

           Compact = "#=" ws ByteString

  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the compact binary format for `Value`s expresses
    each `Value` with precision, embedding binary `Value`s solves the
    problem.

  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to
    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
    any *representation* of text, the text *itself* is logically a
    sequence of *code points* and is not *intrinsically* a binary
    structure at all. It would be incoherent to expect to be able to
    access the representation of the text from within the text itself.

  [^compact-value-annotations]: Any text-syntax annotations preceding
    the `#=` are prepended to any binary-syntax annotations yielded by
    decoding the `ByteString`.

### Annotations.

**Syntax.** When written down, a `Value` may have an associated
sequence of *annotations* carrying “out-of-band” contextual metadata
about the value. Each annotation is, in turn, a `Value`, and may
itself have annotations.

            Value =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s.

**Comments.** Strings annotating a `Value` are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.

            Value =/ ws
                     ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
                     Value

When written this way, everything between the `;` and the newline is
included in the string annotating the `Value`.

**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.

## Compact Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of `v`.

### Type and Length representation.

Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.

    tag                          (simple atomic data and small integers)
    tag ++ binarydata            (most integers)
    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)

The unique end tag is byte value `0x84`.

If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,

  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
    signed integers, which are not used in Preserves.

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
| ------      | -------------------                       | ------------      |
| 15          | `0001111`                                 | 15                |
| 300         | `0000010 0101100`                         | 172 2             |
| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |

It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
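
A minimal Python sketch of `varint(m)` and a matching reader follows.
The function names are illustrative; the reader enforces the
shortest-encoding rule by rejecting a multi-byte encoding whose final
byte is `0`.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint."""
    if m < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)   # low 7 bits, continuation bit set
        m >>= 7
    out.append(m)                       # final byte, continuation bit clear
    return bytes(out)

def read_varint(data: bytes, pos: int = 0):
    """Decode a varint starting at data[pos]; return (value, next position)."""
    m = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        m |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte == 0 and shift > 7:
                raise ValueError("varint is not the shortest encoding")
            return m, pos
```

With these definitions, `varint(300)` yields the bytes `172 2` from
the table above, and `read_varint` raises an error on the two-byte
input `0x80 0x00`.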

### Records, Sequences, Sets and Dictionaries.

          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.

  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with a
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.

    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.
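
The compound encodings, together with the recommended default of
sorting by `Repr`, might be sketched in Python as follows. The
functions are hypothetical and take already-encoded `Repr`s as
`bytes`. Note one simplification: distinctness is checked on the
`Repr`s themselves, which only approximates the requirement that the
underlying `Value`s be pairwise distinct.

```python
END = b"\x84"

def encode_record(label: bytes, fields: list) -> bytes:
    return b"\xB4" + label + b"".join(fields) + END

def encode_sequence(items: list) -> bytes:
    return b"\xB5" + b"".join(items) + END

def encode_set(elements: list) -> bytes:
    if len(set(elements)) != len(elements):
        raise ValueError("set elements must be pairwise distinct")
    # Python compares bytes lexicographically, byte by byte.
    return b"\xB6" + b"".join(sorted(elements)) + END

def encode_dictionary(pairs: list) -> bytes:
    keys = [k for k, _ in pairs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    ordered = sorted(pairs, key=lambda kv: kv[0])   # sort by the key's Repr
    return b"\xB7" + b"".join(k + v for k, v in ordered) + END
```

Encoding the set `#{-4 0}` this way writes `90` (zero) before `A0 FC`
(minus four), which shows that `Repr` order differs from semantic
order.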

### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
                                 ([0xA0] + x)                        if  (-3≤x≤-1)
                                 ([0x90] + x)                        if  ( 0≤x≤12)
                               where m =        |intbytes(x)|

Integers in the range [-3,12] are compactly represented with tags
between `0x90` and `0x9F` because they are so frequently used.
Integers up to 16 bytes long are represented with a single-byte tag
encoding the length of the integer. Larger integers are represented
with an explicit varint length. Every `SignedInteger` *MUST* be
represented with its shortest possible encoding.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,

      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00

      «-257» = A1 FE FF        «-3» = 9D          «128» = A1 00 80
      «-256» = A1 FF 00        «-2» = 9E          «255» = A1 00 FF
      «-255» = A1 FF 01        «-1» = 9F          «256» = A1 01 00
      «-254» = A1 FF 02         «0» = 90        «32767» = A1 7F FF
      «-129» = A1 FF 7F         «1» = 91        «32768» = A2 00 80 00
      «-128» = A0 80           «12» = 9C        «65535» = A2 00 FF FF
      «-127» = A0 81           «13» = A0 0D     «65536» = A2 01 00 00
        «-4» = A0 FC          «127» = A0 7F    «131072» = A2 02 00 00

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.
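
The `SignedInteger` rules can be sketched in Python, whose integers
are already arbitrary-precision. The function names are illustrative,
and `varint` is repeated here only to keep the sketch self-contained.

```python
def varint(m: int) -> bytes:
    """Base 128 varint of a non-negative integer."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; one more bit is needed for the sign.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)

def encode_signed_integer(x: int) -> bytes:
    if 0 <= x <= 12:
        return bytes([0x90 + x])
    if -3 <= x <= -1:
        return bytes([0xA0 + x])
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + varint(m) + body
```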

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.

    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
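
A Python sketch of these three encodings follows. The function names
are hypothetical, and `varint` is repeated so that the example stands
alone. Note that the length is the number of bytes in the UTF-8
encoding, not the number of code points.

```python
def varint(m: int) -> bytes:
    """Base 128 varint of a non-negative integer."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return b"\xB1" + varint(len(data)) + data

def encode_byte_string(data: bytes) -> bytes:
    return b"\xB2" + varint(len(data)) + data

def encode_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return b"\xB3" + varint(len(data)) + data
```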

### Booleans.

    «#f» = [0x80]
    «#t» = [0x81]

### Floats and Doubles.

    «F» when F ∈ Float  = [0x82] ++ binary32(F)
    «D» when D ∈ Double = [0x83] ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
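
In Python, the standard `struct` module can produce these big-endian
forms. The sketch below uses hypothetical function names; because a
Python `float` is a double, `encode_float` rounds its argument to
single precision when packing.

```python
import struct

def encode_float(f: float) -> bytes:
    """Tag 0x82 followed by the big-endian binary32 form of f."""
    return b"\x82" + struct.pack(">f", f)

def encode_double(d: float) -> bytes:
    """Tag 0x83 followed by the big-endian binary64 form of d."""
    return b"\x83" + struct.pack(">d", d)
```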

### Embeddeds.

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.

    «#!V» = [0x86] ++ «V»

### Annotations.

To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is

    «@a @b []»
      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
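
The same construction, sketched in Python with a hypothetical
`annotate` helper that takes already-encoded `Repr`s:

```python
def annotate(repr_: bytes, *annotations: bytes) -> bytes:
    """Prefix a Repr with each annotation's Repr, in the order given."""
    prefix = b"".join(b"\x85" + a for a in annotations)
    return prefix + repr_

# «@a @b []»: the symbols `a` and `b` annotating an empty sequence.
EXAMPLE = annotate(b"\xB5\x84", b"\xB3\x01a", b"\xB3\x01b")
```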

## Examples

### Ordering.

The total ordering specified [above](#total-order) means that the following statements are true:

    "bzz" < "c" < "caa" < #!"a"
    #t < 3.0f < 3.0 < 3 < "3" < |3| < [] < #!#t

### Simple examples.

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

| Value                       | Encoded byte sequence                                                           |
|-----------------------------|---------------------------------------------------------------------------------|
| `<capture <discard>>`       | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
| `[1 2 3 4]`                 | B5 91 92 93 94 84                                                               |
| `[-2 -1 0 1]`               | B5 9E 9F 90 91 84                                                               |
| `"hello"`                   | B1 05 'h' 'e' 'l' 'l' 'o'                                                       |
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84                           |
| `-257`                      | A1 FE FF                                                                        |
| `-1`                        | 9F                                                                              |
| `0`                         | 90                                                                              |
| `1`                         | 91                                                                              |
| `255`                       | A1 00 FF                                                                        |
| `1.0f`                      | 82 3F 80 00 00                                                                  |
| `1.0`                       | 83 3F F0 00 00 00 00 00 00                                                      |
| `-1.202e300`                | 83 FE 3C B7 B7 59 BF 04 26                                                      |

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    <[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

    B4                                ;; Record
      B5                                ;; Sequence
        B3 06 74 69 74 6C 65 64           ;; Symbol, "titled"
        B3 06 70 65 72 73 6F 6E           ;; Symbol, "person"
        92                                ;; SignedInteger, "2"
        B3 05 74 68 69 6E 67              ;; Symbol, "thing"
        91                                ;; SignedInteger, "1"
      84                                ;; End (sequence)
      A0 65                             ;; SignedInteger, "101"
      B1 09 42 6C 61 63 6B 77 65 6C 6C  ;; String, "Blackwell"
      B4                                ;; Record
        B3 04 64 61 74 65                 ;; Symbol, "date"
        A1 07 1D                          ;; SignedInteger, "1821"
        92                                ;; SignedInteger, "2"
        93                                ;; SignedInteger, "3"
      84                                ;; End (record)
      B1 02 44 72                       ;; String, "Dr"
    84                                ;; End (record)

  [^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

---

### JSON examples.

The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:

    {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  100
          },
          "Animated" : false,
          "IDs": [116, 943, 234, 38793]
        }
    }

encodes to binary as follows:

    B7
      B1 05 "Image"
      B7
        B1 03 "IDs"      B5
                           A0 74
                           A1 03 AF
                           A1 00 EA
                           A2 00 97 89
                         84
        B1 05 "Title"    B1 14 "View from 15th Floor"
        B1 05 "Width"    A1 03 20
        B1 06 "Height"   A1 02 58
        B1 08 "Animated" B3 05 "false"
        B1 09 "Thumbnail"
          B7
            B1 03 "Url"    B1 26 "http://www.example.com/image/481989943"
            B1 05 "Width"  A0 64
            B1 06 "Height" A0 7D
          84
      84
    84

and the second example:

    [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
    ]

encodes to binary as follows:

    B5
      B7
        B1 03 "Zip"        B1 05 "94107"
        B1 04 "City"       B1 0D "SAN FRANCISCO"
        B1 05 "State"      B1 02 "CA"
        B1 07 "Address"    B1 00
        B1 07 "Country"    B1 02 "US"
        B1 08 "Latitude"   83 40 42 E2 26 80 9D 49 52
        B1 09 "Longitude"  83 C0 5E 99 56 6C F4 1F 21
        B1 09 "precision"  B1 03 "zip"
      84
      B7
        B1 03 "Zip"        B1 05 "94085"
        B1 04 "City"       B1 09 "SUNNYVALE"
        B1 05 "State"      B1 02 "CA"
        B1 07 "Address"    B1 00
        B1 07 "Country"    B1 02 "US"
        B1 08 "Latitude"   83 40 42 AF 9D 66 AD B4 03
        B1 09 "Longitude"  83 C0 5E 81 AA 4F CA 42 AF
        B1 09 "precision"  B1 03 "zip"
      84
    84
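
In both listings the dictionary entries appear in a different order from the JSON source. The order shown is consistent with sorting entries by the bytes of each encoded key, which is our reading of the [canonical form][canonical]; treat that as an assumption rather than a statement of the specification. A Python sketch (the helper names are hypothetical, and only `String` keys shorter than 128 bytes are handled):

```python
def encode_string(text: str) -> bytes:
    """Encode a String as tag 0xB1, a one-byte length, then UTF-8 bytes."""
    utf8 = text.encode("utf-8")
    if len(utf8) >= 0x80:
        raise ValueError("this sketch only handles one-byte lengths")
    return bytes([0xB1, len(utf8)]) + utf8


def order_keys(keys):
    """Sort keys by their encoded bytes (bytewise lexicographic order)."""
    return sorted(keys, key=encode_string)


image_keys = ["Width", "Height", "Title", "Thumbnail", "Animated", "IDs"]
print(order_keys(image_keys))
# ['IDs', 'Title', 'Width', 'Height', 'Animated', 'Thumbnail']
```

Because the length byte precedes the text, shorter keys sort before longer ones, and keys of equal length fall back to comparing their UTF-8 bytes.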

## Security Considerations

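**Whitespace.** The textual format allows arbitrary whitespace in many
positions, so a reader can be made to consume input indefinitely
without ever producing a `Value`. Implementations should consider
optional restrictions on the amount of consecutive whitespace that may
appear.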

**Annotations.** Similarly, in modes where a `Value` is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.

**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.

## Acknowledgements

The treatment of commas as whitespace in the text syntax is inspired
by the same feature of [EDN](https://github.com/edn-format/edn).

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
directly inspired by [Racket](https://racket-lang.org/)'s lexical
syntax.

## Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.

Conversely, a UTF-8 document must start with a valid code point,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.

Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.
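
The check described above amounts to a single mask-and-compare on the first byte. A small Python sketch (the function name `detect_syntax` is illustrative):

```python
def detect_syntax(document: bytes) -> str:
    """Return "binary" if the first byte has top bits 10, else "text"."""
    if not document:
        raise ValueError("cannot detect the syntax of an empty document")
    # 0xC0 selects the top two bits; 0x80 is the bit pattern 10xxxxxx.
    return "binary" if document[0] & 0xC0 == 0x80 else "text"


print(detect_syntax(bytes.fromhex("B4B3046461746584")))  # binary
print(detect_syntax("<date 1821 2 3>".encode("utf-8")))  # text
```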

## Appendix. Table of tag values

     80 - False
     81 - True
     82 - Float
     83 - Double
     84 - End marker
     85 - Annotation
     86 - Embedded
    (8x)  RESERVED 87-8F

     9x - Small integers 0..12,-3..-1
     An - Small integers, (n+1) bytes long
     B0 - Large integers, variable length
     B1 - String
     B2 - ByteString
     B3 - Symbol

     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary

## Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.

| Integer range                              | Bytes required | Encoding (hex)                               |
| ---                                        | ---            | ---                                          |
| -3 ≤ n ≤ 12                                | 1              | `9X`                                         |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A0` `XX`                                    |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A1` `XX` `XX`                               |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A2` `XX` `XX` `XX`                          |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A4` `XX` `XX` `XX` `XX` `XX`                |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A5` `XX` `XX` `XX` `XX` `XX` `XX`           |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
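
The table can be turned into an encoder by choosing the smallest two's-complement width that holds the value. The Python sketch below is illustrative only: the function name is hypothetical, negative small integers are assumed to occupy tags `9D` to `9F`, and, like the table, it stops at 64-bit values.

```python
def encode_signed_integer(n: int) -> bytes:
    """Encode n using the single-byte or A0..A7 forms from the table."""
    if -3 <= n <= 12:
        # 0..12 map to 0x90..0x9C; -3..-1 map to 0x9D..0x9F.
        return bytes([0x90 + (n & 0x0F)])
    # Bits needed for the magnitude, plus one sign bit, rounded up to bytes.
    magnitude_bits = (n if n >= 0 else ~n).bit_length()
    size = magnitude_bits // 8 + 1
    if size > 8:
        raise ValueError("this sketch stops at 64-bit integers")
    return bytes([0xA0 + size - 1]) + n.to_bytes(size, "big", signed=True)


print(encode_signed_integer(101).hex())   # a065
print(encode_signed_integer(1821).hex())  # a1071d
print(encode_signed_integer(234).hex())   # a100ea
```

Note that 234 needs two bytes even though it fits in an unsigned octet: the leading `00` keeps the sign bit clear.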

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								---
-												Proper layouting

											
										
										
											2019-08-18 21:08:55 +00:00
+								no_site_title: true
 								title: "Preserves: an Expressive Data Language"
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								---
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								Tony Garnock-Jones <tonyg@leastfixedpoint.com>
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								May 2021. Version 0.6.0.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
 								  [spki]: http://world.std.com/~cme/html/spki.html
 								  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
-												LEB128

											
										
										
											2019-11-22 15:27:59 +00:00
+								  [LEB128]: https://en.wikipedia.org/wiki/LEB128
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								  [abnf]: https://tools.ietf.org/html/rfc7405
-												MUCH simpler binary format, inspired by Syrup; alterations to text format

											
										
										
											2020-12-28 22:25:02 +00:00
+								  [canonical]: canonical-binary.html
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								This document proposes a data model and serialization format called
 								*Preserves*.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								Preserves supports *records* with user-defined *labels*, embedded
 								*references*, and the usual suite of atomic and compound data types,
 								including *binary* data as a distinct type from text strings. Its
 								*annotations* allow separation of data from metadata such as
 								[comments](conventions.html#comments), trace information, and
-												Split out inessential text from the spec

											
										
										
											2019-08-18 16:51:26 +00:00
+								provenance information.
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								Preserves departs from many other data languages in defining how to
 								*compare* two values. Comparison is based on the data model, not on
 								syntax or on data structures of any particular implementation
 								language.
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								## Starting with Semantics
 								Taking inspiration from functional programming, we start with a
 								definition of the *values* that we want to work with and give them
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								meaning independent of their syntax.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								Our `Value`s fall into two broad categories: *atomic* and *compound*
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								data. Every `Value` is finite and non-cyclic. Embedded values, called
 								`Embedded`s, are a third, special-case category.
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								                          Value = Atom
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								                                | Compound
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								                                | Embedded
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Fixes

											
										
										
											2018-09-23 21:44:43 +00:00
+								                           Atom = Boolean
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								                                | Float
 								                                | Double
 								                                | SignedInteger
 								                                | String
 								                                | ByteString
 								                                | Symbol
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								                       Compound = Record
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								                                | Sequence
 								                                | Set
 								                                | Dictionary
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								**Total order.**<a name="total-order"></a> As we go, we will
 								incrementally specify a total order over `Value`s. Two values of the
 								same kind are compared using kind-specific rules. The ordering among
 								values of different kinds is essentially arbitrary, but having a total
 								order is convenient for many tasks, so we define it as
-												Remove pointless footnote remark

											
										
										
											2019-10-23 21:58:47 +00:00
+								follows:
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								            (Values)        Atom < Compound < Embedded
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								            (Compounds)     Record < Sequence < Set < Dictionary
-												Fixes

											
										
										
											2018-09-23 21:44:43 +00:00
+								            (Atoms)         Boolean < Float < Double < SignedInteger
 								                              < String < ByteString < Symbol
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
 								neither is less than the other according to the total order.
 								### Signed integers.
 								A `SignedInteger` is a signed integer of arbitrary width.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								`SignedInteger`s are compared as mathematical integers.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Unicode strings.
 								A `String` is a sequence of Unicode
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
 								are compared lexicographically, code-point by
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								code-point.[^utf8-is-awesome]
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
 								    gives the same result as a lexicographic byte-by-byte comparison
 								    of the UTF-8 encoding of a string!
 								### Binary data.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								A `ByteString` is a sequence of octets. `ByteString`s are compared
 								lexicographically.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								### Symbols.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								Programming languages like Lisp and Prolog frequently use string-like
 								values called *symbols*. Here, a `Symbol` is, like a `String`, a
-												Minor print layout tweaks, and minor content fixes

											
										
										
											2018-09-24 15:08:48 +00:00
+								sequence of Unicode code-points representing an identifier of some
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								kind. `Symbol`s are also compared lexicographically by code-point.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Booleans.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								There are two `Boolean`s, “false” and “true”. The “false” value is
 								less-than the “true” value.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### IEEE floating-point values.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								`Float`s and `Double`s are single- and double-precision IEEE 754
 								floating-point values, respectively. `Float`s, `Double`s and
 								`SignedInteger`s are disjoint; by the rules [above](#total-order),
 								every `Float` is less than every `Double`, and every `SignedInteger`
 								is greater than both. Two `Float`s or two `Double`s are to be ordered
 								by the `totalOrder` predicate defined in section 5.10 of
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
 								### Records.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
 								label can be any `Value`, but is usually a `Symbol`.[^extensibility]
 								[^iri-labels] `Record`s are compared lexicographically: first by
 								label, then by field sequence.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								  [^extensibility]: The [Racket](https://racket-lang.org/) programming
 								    language defines
-												Tweaks; python mapping

											
										
										
											2018-09-24 17:34:07 +00:00
+								    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
+								    structure types, which map well to our `Record`s. Racket supports
 								    record extensibility by encoding record supertypes into record
 								    labels as specially-formatted lists.
 								  [^iri-labels]: It is occasionally (but seldom) necessary to
 								    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
 								    label can be read as a relative IRI, it is notionally interpreted
 								    with respect to the IRI
 								    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
 								    be read as an absolute IRI, it stands for that IRI; and otherwise,
 								    it cannot be read as an IRI at all, and so the label simply stands
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								    for itself—for its own `Value`.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Sequences.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
 								lexicographically.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								### Sets.
 								A `Set` is an unordered finite set of `Value`s. It contains no
 								duplicate values, following the [equivalence relation](#equivalence)
 								induced by the total order on `Value`s. Two `Set`s are compared by
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								sorting their elements ascending using the [total order](#total-order)
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								and comparing the resulting `Sequence`s.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								### Dictionaries.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								A `Dictionary` is an unordered finite collection of pairs of `Value`s.
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
 								pairwise distinct. Instances of `Dictionary` are compared by
-												Trim and improve

											
										
										
											2018-09-24 11:59:22 +00:00
+								lexicographic comparison of the sequences resulting from ordering each
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								`Dictionary`'s pairs in ascending order by key.
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								### Embeddeds.
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								An `Embedded` allows inclusion of *domain-specific*, potentially
 								*stateful* or *located* data into a `Value`.[^embedded-rationale]
 								`Embedded`s may be used to denote stateful objects, network services,
 								object capabilities, file descriptors, Unix processes, or other
 								possibly-stateful things. Because each `Embedded` is a domain-specific
 								datum, comparison of two `Embedded`s is done according to
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								domain-specific rules.
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								  [^embedded-rationale]: **Rationale.** Why include `Embedded`s as a
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								    special class, distinct from, say, a specially-labeled `Record`?
 								    First, a `Record` can only hold other `Value`s: in order to embed
 								    values such as live pointers to Java objects, some means of
 								    "escaping" from the `Value` data type must be provided. Second,
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								    `Embedded`s are meant to be able to denote stateful entities, for
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								    which comparison by address is appropriate; however, we do not
 								    wish to place restrictions on the *nature* of these entities: if
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								    we had used `Record`s instead of distinct `Embedded`s, users would
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								    have to invent an encoding of domain data into `Record`s that
 								    reflected domain ordering into `Value` ordering. This is often
 								    difficult and may not always be possible. Finally, because
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								    `Embedded`s are intended to be able to represent network and
 								    memory *locations*, they must be able to be rewritten at network
 								    and process boundaries. Having a distinct class allows generic
 								    `Embedded` rewriting without the quotation-related complications
 								    of encoding references as, say, `Record`s.
 								*Examples.* In a Java or Python implementation, an `Embedded` may
 								denote a reference to a Java or Python object; comparison would be
 								done via the language's own rules for equivalence and ordering. In a
 								Unix application, an `Embedded` may denote an open file descriptor or
 								a process ID. In an HTTP-based application, each `Embedded` might be a
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								URL, compared according to
 								[RFC 6943](https://tools.ietf.org/html/rfc6943#section-3.3). When a
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								`Value` is serialized for storage or transfer, `Embedded`s will
 								usually be represented as ordinary `Value`s, in which case the
-												Introduce pointers

											
										
										
											2021-01-29 11:03:28 +00:00
+								ordinary rules for comparing `Value`s will apply.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								## Textual Syntax
-												preserve.md based on codec.md which I'm about to check in

											
										
										
											2018-09-23 13:37:20 +00:00
 								Now we have discussed `Value`s and their meanings, we may turn to
 								techniques for *representing* `Value`s for communication or storage.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								In this section, we use [case-sensitive ABNF][abnf] to define a
 								textual syntax that is easy for people to read and
 								write.[^json-superset] Most of the examples in this document are
 								written using this syntax. In the following section, we will define an
 								equivalent compact machine-readable syntax.
 								  [^json-superset]: The grammar of the textual syntax is a superset of
 								    JSON, with the slightly unusual feature that `true`, `false`, and
 								    `null` are all read as `Symbol`s, and that `SignedInteger`s are
 								    never read as `Double`s.
-												Cosmetic.

											
										
										
											2019-07-03 23:35:56 +00:00
+								### Character set.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								[ABNF][abnf] allows easy definition of US-ASCII-based languages.
 								However, Preserves is a Unicode-based language. Therefore, we
 								reinterpret ABNF as a grammar for recognising sequences of Unicode
 								code points.
 								Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
 								possible.
-												Cosmetic.

											
										
										
											2019-07-03 23:35:56 +00:00
+								### Whitespace.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								Whitespace is defined as any number of spaces, tabs, carriage returns,
-												Remove comments, in prep for annotations replacing them

											
										
										
											2019-07-01 20:31:49 +00:00
+								line feeds, or commas.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
-												Remove comments, in prep for annotations replacing them

											
										
										
											2019-07-01 20:31:49 +00:00
+								                ws = *(%x20 / %x09 / newline / ",")
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								           newline = CR / LF
-												Cosmetic.

											
										
										
											2019-07-03 23:35:56 +00:00
+								### Grammar.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
-												Delete misleading, incorrect, or unnecessary text

											
										
										
											2018-11-08 12:35:50 +00:00
+								Standalone documents may have trailing whitespace.
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
 								          Document = Value ws
 								Any `Value` may be preceded by whitespace.
-												The Great Renaming: Pointer -> Embedded

											
										
										
											2021-05-17 12:54:06 +00:00
+								             Value = ws (Record / Collection / Atom / Embedded / Compact)
-												WIP from the early hours of this morning, adding textual syntax

											
										
										
											2018-09-27 10:42:55 +00:00
+								        Collection = Sequence / Dictionary / Set
 								              Atom = Boolean / Float / Double / SignedInteger /
 								                     String / ByteString / Symbol
Each `Record` is an angle-bracket enclosed grouping of its
label-`Value` followed by its field-`Value`s.

            Record = "<" Value *Value ws ">"

`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written as values enclosed by the tokens `#{` and
`}`.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.

          Sequence = "[" *Value ws "]"
        Dictionary = "{" *(Value ws ":" Value) ws "}"
               Set = "#{" *Value ws "}"

  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

`Boolean`s are the simple literal strings `#t` and `#f` for true and
false, respectively.

           Boolean = %s"#t" / %s"#f"

Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing “f” distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have either a fractional part or
an exponent part, whereas `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

             Float = flt %i"f"
            Double = flt
     SignedInteger = int
          digit1-9 = %x31-39
               nat = %x30 / ( digit1-9 *DIGIT )
               int = ["-"] nat
              frac = "." 1*DIGIT
               exp = %i"e" ["-"/"+"] 1*DIGIT
               flt = int (frac exp / frac / exp)

  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    implementation may (but, ideally, should not) truncate precision
    when reading or writing a `SignedInteger`; however, if it does so,
    it should (a) signal its client that truncation has occurred, and
    (b) make it clear to the client that comparing such truncated
    values for equality or ordering will not yield results that match
    the expected semantics of the data model.

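
The numeric grammar can be transcribed almost rule for rule into regular expressions. The following Python sketch classifies a numeric token as `Float`, `Double` or `SignedInteger`. The function name `classify_number` and the pattern constants are hypothetical, introduced only for illustration; they are not part of any Preserves implementation.

```python
import re

# Patterns transcribed from the ABNF rules nat, int, frac, exp and flt.
NAT = r"(?:0|[1-9][0-9]*)"
INT = rf"-?{NAT}"
FRAC = r"\.[0-9]+"
EXP = r"[eE][-+]?[0-9]+"          # %i"e" is case-insensitive
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"

FLOAT_RE = re.compile(rf"{FLT}[fF]")  # %i"f" is case-insensitive
DOUBLE_RE = re.compile(FLT)
INTEGER_RE = re.compile(INT)


def classify_number(token: str) -> str:
    """Name the grammar rule that matches a numeric token (hypothetical helper)."""
    if FLOAT_RE.fullmatch(token):
        return "Float"
    if DOUBLE_RE.fullmatch(token):
        return "Double"
    if INTEGER_RE.fullmatch(token):
        return "SignedInteger"
    raise ValueError(f"not a numeric token: {token!r}")
```

Under these patterns, `1.0f` is a `Float`, `1e3` is a `Double`, and `-12` is a `SignedInteger`, while `1f` and `01` are rejected because the first lacks a fractional or exponent part and the second has a leading zero.
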
`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

            String = %x22 *char %x22
              char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
         unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
            escape = %x5C              ; \
           escaped = ( %x5C /          ; \    reverse solidus U+005C
                       %x2F /          ; /    solidus         U+002F
                       %x62 /          ; b    backspace       U+0008
                       %x66 /          ; f    form feed       U+000C
                       %x6E /          ; n    line feed       U+000A
                       %x72 /          ; r    carriage return U+000D
                       %x74 )          ; t    tab             U+0009

  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle non-ASCII
    codepoints correctly.

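
As an illustration of the surrogate-pair rule, the Python sketch below combines a pair of `\u` escape values into a single code point, and writes a string without `\u` escapes for non-ASCII text. Both function names are hypothetical. The writer leans on Python's standard `json` module, on the assumption that its string escaping is acceptable wherever the JSON rules apply.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def write_string(text: str) -> str:
    """Quote text as a string literal, leaving non-ASCII code points unescaped."""
    # ensure_ascii=False keeps non-ASCII characters literal, so the UTF-8
    # encoding of the whole document carries them instead of \u escapes.
    return json.dumps(text, ensure_ascii=False)
```

For example, the escapes `\ud83d\ude00` denote the single code point U+1F600, and `write_string` emits that code point directly rather than as a pair of escapes.
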
A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.

        ByteString = "#" %x22 *binchar %x22
           binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
      binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#x"` and `"`.

       ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22

The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
Base64 characters are allowed.

       ByteString =/ "#[" *(ws / base64char) ws "]"
        base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="

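
The three `ByteString` forms can be read with a small amount of code. The Python sketch below is a minimal illustration, not a reference reader: `parse_bytestring` and its helper tables are hypothetical names, and accepting Base64 input with missing `=` padding is an assumption of this sketch rather than something the grammar states.

```python
import base64
import re

WS = " \t\r\n,"  # characters matched by the `ws` rule
SIMPLE_ESCAPES = {"\\": 0x5C, "/": 0x2F, "b": 0x08, "f": 0x0C,
                  "n": 0x0A, "r": 0x0D, "t": 0x09, '"': 0x22}
HEX_BODY = re.compile(r"(?:[ \t\r\n,]|[0-9A-Fa-f]{2})*")
B64_BODY = re.compile(r"[ \t\r\n,A-Za-z0-9+/\-_=]*")


def _strip_ws(body: str) -> str:
    return "".join(ch for ch in body if ch not in WS)


def parse_bytestring(text: str) -> bytes:
    """Decode one textual ByteString literal (hypothetical helper)."""
    if text.startswith('#x"') and text.endswith('"') and len(text) >= 4:
        body = text[3:-1]
        if not HEX_BODY.fullmatch(body):
            raise ValueError("malformed hexadecimal ByteString")
        return bytes.fromhex(_strip_ws(body))
    if text.startswith("#[") and text.endswith("]"):
        body = text[2:-1]
        if not B64_BODY.fullmatch(body):
            raise ValueError("malformed Base64 ByteString")
        # Map the URL-safe alphabet onto the plain one, then normalise padding.
        digits = _strip_ws(body).replace("-", "+").replace("_", "/").rstrip("=")
        digits += "=" * (-len(digits) % 4)  # assumption: padding is optional
        return base64.b64decode(digits, validate=True)
    if text.startswith('#"') and text.endswith('"') and len(text) >= 3:
        body, out, i = text[2:-1], bytearray(), 0
        while i < len(body):
            ch = body[i]
            if ch == "\\":
                esc = body[i + 1:i + 2]
                if esc == "x":
                    pair = body[i + 2:i + 4]
                    if not re.fullmatch(r"[0-9A-Fa-f]{2}", pair):
                        raise ValueError("bad \\x escape")
                    out.append(int(pair, 16))
                    i += 4
                elif esc in SIMPLE_ESCAPES:
                    out.append(SIMPLE_ESCAPES[esc])
                    i += 2
                else:
                    raise ValueError("bad escape")
            elif ch != '"' and 0x20 <= ord(ch) <= 0x7E:
                out.append(ord(ch))  # printable 7-bit ASCII, unescaped
                i += 1
            else:
                raise ValueError("character must be escaped")
        return bytes(out)
    raise ValueError("not a ByteString literal")
```

With this sketch, `#"hello"`, `#x"68 65 6c6c 6f"` and `#[aGVsbG8=]` all decode to the same five bytes.
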
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.

            Symbol = symstart *symcont / "|" *symchar "|"
          symstart = ALPHA / sympunct / symustart
           symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
          sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                     "?" / "_" / "=" / "+" / "/" / "."
           symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
         symustart = <any code point greater than 127 whose Unicode
                      category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                      Pc, Po, Sc, Sm, Sk, So, or Co>
          symucont = <any code point greater than 127 whose Unicode
                      category is Nd, Nl, No, or Pd>

  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

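
A writer has to decide whether a given symbol can be printed bare or needs the quoted `|...|` form. The Python sketch below applies the `symstart` and `symcont` rules using the standard `unicodedata` module. The name `is_bare_symbol` and the helper names are hypothetical, and the Unicode categories reported depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

SYMPUNCT = set("~!$%^&*?_=+/.")
USTART_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me",
                     "Pc", "Po", "Sc", "Sm", "Sk", "So", "Co"}
UCONT_CATEGORIES = {"Nd", "Nl", "No", "Pd"}


def _is_symstart(ch: str) -> bool:
    if "A" <= ch <= "Z" or "a" <= ch <= "z":      # ALPHA
        return True
    if ch in SYMPUNCT:                            # sympunct
        return True
    # symustart: code points above 127 in the listed categories
    return ord(ch) > 127 and unicodedata.category(ch) in USTART_CATEGORIES


def _is_symcont(ch: str) -> bool:
    if _is_symstart(ch) or "0" <= ch <= "9" or ch == "-":
        return True
    # symucont: code points above 127 in the listed categories
    return ord(ch) > 127 and unicodedata.category(ch) in UCONT_CATEGORIES


def is_bare_symbol(text: str) -> bool:
    """True if text matches `symstart *symcont` (hypothetical helper)."""
    return (len(text) > 0
            and _is_symstart(text[0])
            and all(_is_symcont(ch) for ch in text[1:]))
```

Symbols for which `is_bare_symbol` returns `False`, such as `9lives`, `-x` or `a b`, would be written in the quoted form instead.
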
An `Embedded` is written as a `Value` chosen to represent the denoted
object, prefixed with `#!`.

          Embedded = "#!" Value

Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#=`.[^rationale-switch-to-binary]
[^no-literal-binary-in-text] [^compact-value-annotations]

           Compact = "#=" ws ByteString

  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the compact binary format for `Value`s expresses
    each `Value` with precision, embedding binary `Value`s solves the
    problem.

  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to
    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
    any *representation* of text, the text *itself* is logically a
    sequence of *code points* and is not *intrinsically* a binary
    structure at all. It would be incoherent to expect to be able to
    access the representation of the text from within the text itself.

  [^compact-value-annotations]: Any text-syntax annotations preceding
    the `#` are prepended to any binary-syntax annotations yielded by
    decoding the `ByteString`.

### Annotations.

**Syntax.** When written down, a `Value` may have an associated
sequence of *annotations* carrying “out-of-band” contextual metadata
about the value. Each annotation is, in turn, a `Value`, and may
itself have annotations.

            Value =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s.

**Comments.** Strings annotating a `Value` are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.

            Value =/ ws
                     ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
                     Value

When written this way, everything between the `;` and the newline is
included in the string annotating the `Value`.

**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of `Value`s.

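
One way an implementation might honour this rule is to carry annotations in a wrapper whose equality and hashing look only at the underlying value. The Python sketch below is one such design under that assumption; the class name `Annotated` and its fields are hypothetical and are not taken from any Preserves library.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(eq=False)
class Annotated:
    """A value together with annotations that never affect comparison."""
    item: Any
    annotations: List[Any] = field(default_factory=list)

    def __eq__(self, other: object) -> bool:
        # Compare the denoted values only; annotations are syntax, not data.
        other_item = other.item if isinstance(other, Annotated) else other
        return self.item == other_item

    def __hash__(self) -> int:
        # Hash consistently with __eq__ so annotated values work as set
        # elements and dictionary keys.
        return hash(self.item)
```

With this wrapper, `Annotated(1, ["a comment"])` compares equal to `Annotated(1)` and to the plain value `1`, while a reflective tool can still read the `annotations` list.
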
Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.

## Compact Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of `v`.

### Type and Length representation.

Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.

    tag                          (simple atomic data and small integers)
    tag ++ binarydata            (most integers)
    tag ++ length ++ binarydata  (large integers, strings, symbols, and binary)
    tag ++ repr ++ ... ++ endtag (compound data)

The unique end tag is byte value `0x84`.

If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,

  [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
    integers. Varints and LEB128-encoded integers differ only for
    signed integers, which are not used in Preserves.

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `varint(m)` bytes |
| ------      | -------------------                       | ------------      |
| 15          | `0001111`                                 | 15                |
| 300         | `0000010 0101100`                         | 172 2             |
| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |

It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.

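
The varint rules above, including the shortest-encoding requirement, can be sketched in Python as follows. The function names `varint` and `read_varint` are chosen for illustration; `read_varint` returns the decoded number together with the offset of the byte following the varint, which is a design choice of this sketch.

```python
from typing import Tuple


def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint."""
    if m < 0:
        raise ValueError("varints encode unsigned integers only")
    out = bytearray()
    while True:
        group = m & 0x7F          # least significant 7-bit group first
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: more bytes follow
        else:
            out.append(group)         # msb clear: last byte
            return bytes(out)


def read_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at offset; return (value, next offset)."""
    result, shift, pos = 0, 0, offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    # A multi-byte varint whose final byte is 0 is not the shortest encoding.
    if byte == 0 and pos - offset > 1:
        raise ValueError("varint is not the unique shortest encoding")
    return result, pos
```

For instance, `varint(300)` yields the bytes 172 2, matching the table, and `read_varint` rejects the two-byte sequence 128 0 as a non-shortest encoding of zero.
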
### Records, Sequences, Sets and Dictionaries.

          «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
            «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
           «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
    «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.

  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with
    a byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.

    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.

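
The recommended default ordering amounts to encoding each element first and then sorting the resulting byte strings. The Python sketch below does this for a set whose elements have already been encoded. The function name `encode_set` is hypothetical, and detecting duplicates by comparing `Repr`s is an assumption that holds only when each element value has a single possible `Repr`.

```python
from typing import Iterable

SET_TAG = b"\xB6"   # opening tag for sets
END_TAG = b"\x84"   # the unique end tag


def encode_set(element_reprs: Iterable[bytes]) -> bytes:
    """Frame already-encoded set elements, sorted lexicographically by Repr."""
    reprs = list(element_reprs)
    if len(set(reprs)) != len(reprs):
        raise ValueError("set elements must be pairwise distinct")
    # Python compares bytes objects lexicographically, byte by byte.
    return SET_TAG + b"".join(sorted(reprs)) + END_TAG
```

Given the `Repr`s `A0 FC`, `91` and `90`, which encode the integers -4, 1 and 0, the result is `B6 90 91 A0 FC 84`: the element -4 is written last even though it is semantically the smallest.
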
### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x)  if ¬(-3≤x≤12) ∧ m>16
                                 ([0xA0] + m - 1) ++ intbytes(x)     if ¬(-3≤x≤12) ∧ m≤16
                                 ([0xA0] + x)                        if  (-3≤x≤-1)
                                 ([0x90] + x)                        if  ( 0≤x≤12)
                               where m =        |intbytes(x)|

+								Integers in the range [-3,12] are compactly represented with tags
 								between `0x90` and `0x9F` because they are so frequently used.
 								Integers up to 16 bytes long are represented with a single-byte tag
 								encoding the length of the integer. Larger integers are represented
 								with an explicit varint length. Every `SignedInteger` *MUST* be
 								represented with its shortest possible encoding.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and
`m = |intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,

      «87112285931760246646623899502532662132736»
        = B0 12 01 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00
                00 00

      «-257» = A1 FE FF        «-3» = 9D          «128» = A1 00 80
      «-256» = A1 FF 00        «-2» = 9E          «255» = A1 00 FF
      «-255» = A1 FF 01        «-1» = 9F          «256» = A1 01 00
      «-254» = A1 FF 02         «0» = 90        «32767» = A1 7F FF
      «-129» = A1 FF 7F         «1» = 91        «32768» = A2 00 80 00
      «-128» = A0 80           «12» = 9C        «65535» = A2 00 FF FF
      «-127» = A0 81           «13» = A0 0D     «65536» = A2 01 00 00
        «-4» = A0 FC          «127» = A0 7F    «131072» = A2 02 00 00

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.
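
The rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names are chosen for the sketch, and `varint` is assumed to be the usual base-128 encoding with the least-significant group first, as in [LEB128][].

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement bytes for x; empty for 0."""
    if x == 0:
        return b""
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    # For negative x, ~x has the same number of significant bits.
    bits = (x if x > 0 else ~x).bit_length() + 1
    return x.to_bytes((bits + 7) // 8, "big", signed=True)


def varint(n: int) -> bytes:
    """Base-128 varint, least-significant group first (an assumption here)."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)
            return bytes(out)


def encode_signed_integer(x: int) -> bytes:
    """Pick the shortest of the four cases in the definition above."""
    if 0 <= x <= 12:
        return bytes([0x90 + x])
    if -3 <= x <= -1:
        return bytes([0xA0 + x])
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + varint(m) + body


print(encode_signed_integer(-257).hex(" ").upper())
```

Checking the small-integer cases first is what guarantees the shortest encoding: a value such as 12 would otherwise also fit the one-byte `A0` form.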

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.

    «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
          [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
          [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol
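
As a sketch of how the three cases differ only in their tag, consider the following Python. The helper names are hypothetical, and `varint` is again assumed to be the base-128, least-significant-group-first encoding described by [LEB128][].

```python
def varint(n: int) -> bytes:
    """Base-128 varint, least-significant group first (an assumption here)."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)
            return bytes(out)


def _tagged(tag: int, data: bytes) -> bytes:
    """Tag byte, then the byte length as a varint, then the data."""
    return bytes([tag]) + varint(len(data)) + data


def encode_string(s: str) -> bytes:
    return _tagged(0xB1, s.encode("utf-8"))


def encode_bytestring(b: bytes) -> bytes:
    return _tagged(0xB2, b)


def encode_symbol(name: str) -> bytes:
    return _tagged(0xB3, name.encode("utf-8"))


print(encode_string("hello"))
```

Note that the length counts UTF-8 bytes, not code points, so a one-character `String` such as `"é"` carries a length of 2.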

### Booleans.

    «#f» = [0x80]
    «#t» = [0x81]

### Floats and Doubles.

    «F» when F ∈ Float  = [0x82] ++ binary32(F)
    «D» when D ∈ Double = [0x83] ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
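
In Python, for instance, `binary32` and `binary64` correspond to the standard `struct` formats `">f"` and `">d"`. The function names below are illustrative only.

```python
import struct


def encode_float(f: float) -> bytes:
    """Tag 0x82 followed by the big-endian IEEE 754 binary32 form."""
    return b"\x82" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """Tag 0x83 followed by the big-endian IEEE 754 binary64 form."""
    return b"\x83" + struct.pack(">d", d)


print(encode_float(1.0).hex(" ").upper())
print(encode_double(1.0).hex(" ").upper())
```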

### Embeddeds.

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0x86]`.

    «#!V» = [0x86] ++ «V»

### Annotations.

To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is

    «@a @b []»
      = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
      = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
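
The same construction can be written as a small Python sketch. The helper `annotate` is a name invented for this illustration; it operates on already-encoded `Repr`s.

```python
ANNOTATION_TAG = b"\x85"


def annotate(value_repr: bytes, annotation_repr: bytes) -> bytes:
    """Prepend one annotation to an already-encoded Repr."""
    return ANNOTATION_TAG + annotation_repr + value_repr


empty_sequence = bytes([0xB5, 0x84])   # «[]»
symbol_a = bytes([0xB3, 0x01, 0x61])   # «a»
symbol_b = bytes([0xB3, 0x01, 0x62])   # «b»

# Each call prepends, so `b` is attached first and `a` ends up outermost.
annotated = annotate(annotate(empty_sequence, symbol_b), symbol_a)

print(annotated.hex(" ").upper())
```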

## Examples

### Ordering.

The total ordering specified [above](#total-order) means that the following statements are true:

    "bzz" < "c" < "caa" < #!"a"
    #t < 3.0f < 3.0 < 3 < "3" < |3| < [] < #!#t

### Simple examples.

<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->

| Value                       | Encoded byte sequence                                                           |
 								|-----------------------------|---------------------------------------------------------------------------------|
 								| `<capture <discard>>`       | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
 								| `[1 2 3 4]`                 | B5 91 92 93 94 84                                                               |
 								| `[-2 -1 0 1]`               | B5 9E 9F 90 91 84                                                               |
 								| `"hello"` (format B)        | B1 05 'h' 'e' 'l' 'l' 'o'                                                       |
 								| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84                           |
 								| `-257`                      | A1 FE FF                                                                        |
 								| `-1`                        | 9F                                                                              |
 								| `0`                         | 90                                                                              |
 								| `1`                         | 91                                                                              |
 								| `255`                       | A1 00 FF                                                                        |
 								| `1.0f`                      | 82 3F 80 00 00                                                                  |
 								| `1.0`                       | 83 3F F0 00 00 00 00 00 00                                                      |
 								| `-1.202e300`                | 83 FE 3C B7 B7 59 BF 04 26                                                      |

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    <[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

    B4                                ;; Record
      B5                                ;; Sequence
        B3 06 74 69 74 6C 65 64           ;; Symbol, "titled"
        B3 06 70 65 72 73 6F 6E           ;; Symbol, "person"
        92                                ;; SignedInteger, "2"
        B3 05 74 68 69 6E 67              ;; Symbol, "thing"
        91                                ;; SignedInteger, "1"
      84                                ;; End (sequence)
      A0 65                             ;; SignedInteger, "101"
      B1 09 42 6C 61 63 6B 77 65 6C 6C  ;; String, "Blackwell"
      B4                                ;; Record
        B3 04 64 61 74 65                 ;; Symbol, "date"
        A1 07 1D                          ;; SignedInteger, "1821"
        92                                ;; SignedInteger, "2"
        93                                ;; SignedInteger, "3"
      84                                ;; End (record)
      B1 02 44 72                       ;; String, "Dr"
    84                                ;; End (record)
 								  [^extensibility2]: It happens to line up with Racket's
 								    representation of a record label for an inheritance hierarchy
 								    where `titled` extends `person` extends `thing`:
 								        (struct date (year month day) #:prefab)
 								        (struct thing (id) #:prefab)
 								        (struct person thing (name date-of-birth) #:prefab)
 								        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
 								    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

---

### JSON examples.
 								The examples from
 								[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
 								valid Preserves, though the JSON literals `true`, `false` and `null`
 								read as `Symbol`s. The first example:
 								    {
 								      "Image": {
 								          "Width":  800,
 								          "Height": 600,
 								          "Title":  "View from 15th Floor",
 								          "Thumbnail": {
 								              "Url":    "http://www.example.com/image/481989943",
 								              "Height": 125,
 								              "Width":  100
 								          },
 								          "Animated" : false,
 								          "IDs": [116, 943, 234, 38793]
 								        }
 								    }
 								encodes to binary as follows:

    B7
      B1 05 "Image"
      B7
        B1 03 "IDs"      B5
                           A0 74
                           A1 03 AF
                           A1 00 EA
                           A2 00 97 89
                         84
        B1 05 "Title"    B1 14 "View from 15th Floor"
        B1 05 "Width"    A1 03 20
        B1 06 "Height"   A1 02 58
        B1 08 "Animated" B3 05 "false"
        B1 09 "Thumbnail"
          B7
            B1 03 "Url"    B1 26 "http://www.example.com/image/481989943"
            B1 05 "Width"  A0 64
            B1 06 "Height" A0 7D
          84
      84
    84
 								and the second example:
 								    [
 								      {
 								         "precision": "zip",
 								         "Latitude":  37.7668,
 								         "Longitude": -122.3959,
 								         "Address":   "",
 								         "City":      "SAN FRANCISCO",
 								         "State":     "CA",
 								         "Zip":       "94107",
 								         "Country":   "US"
 								      },
 								      {
 								         "precision": "zip",
 								         "Latitude":  37.371991,
 								         "Longitude": -122.026020,
 								         "Address":   "",
 								         "City":      "SUNNYVALE",
 								         "State":     "CA",
 								         "Zip":       "94085",
 								         "Country":   "US"
 								      }
 								    ]
 								encodes to binary as follows:

    B5
      B7
        B1 03 "Zip"        B1 05 "94107"
        B1 04 "City"       B1 0D "SAN FRANCISCO"
        B1 05 "State"      B1 02 "CA"
        B1 07 "Address"    B1 00
        B1 07 "Country"    B1 02 "US"
        B1 08 "Latitude"   83 40 42 E2 26 80 9D 49 52
        B1 09 "Longitude"  83 C0 5E 99 56 6C F4 1F 21
        B1 09 "precision"  B1 03 "zip"
      84
      B7
        B1 03 "Zip"        B1 05 "94085"
        B1 04 "City"       B1 09 "SUNNYVALE"
        B1 05 "State"      B1 02 "CA"
        B1 07 "Address"    B1 00
        B1 07 "Country"    B1 02 "US"
        B1 08 "Latitude"   83 40 42 AF 9D 66 AD B4 03
        B1 09 "Longitude"  83 C0 5E 81 AA 4F CA 42 AF
        B1 09 "precision"  B1 03 "zip"
      84
    84

## Security Considerations

**Whitespace.** The textual format allows arbitrary whitespace in many
positions. Consider optional restrictions on the amount of consecutive
whitespace that may appear.

**Annotations.** Similarly, in modes where a `Value` is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.

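
One way a reader might bound consecutive whitespace is sketched below. Everything here is an assumption made for illustration: the set of whitespace characters (commas are included because the text syntax treats them as whitespace), the default limit, and the names `skip_whitespace` and `TooMuchWhitespace`.

```python
# Assumed whitespace set for this sketch; commas count as whitespace.
WHITESPACE = " \t\r\n,"


class TooMuchWhitespace(ValueError):
    """Raised when a run of whitespace exceeds the configured limit."""


def skip_whitespace(text: str, pos: int, limit: int = 4096) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    start = pos
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
        if pos - start > limit:
            raise TooMuchWhitespace(
                f"more than {limit} consecutive whitespace characters"
            )
    return pos


print(skip_whitespace("  ,\n[1]", 0))
```

A similar counter could cap the number of annotations skipped before a `Value` is reached.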
 								**Canonical form for cryptographic hashing and signing.** No canonical
 								textual encoding of a `Value` is specified. A
 								[canonical form][canonical] exists for binary encoded `Value`s, and
 								implementations *SHOULD* produce canonical binary encodings by
 								default; however, an implementation *MAY* permit two serializations of
 								the same `Value` to yield different binary `Repr`s.

## Acknowledgements
 								The treatment of commas as whitespace in the text syntax is inspired
 								by the same feature of [EDN](https://github.com/edn-format/edn).

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
directly inspired by [Racket](https://racket-lang.org/)'s lexical
syntax.

## Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.

Conversely, a UTF-8 document must start with a valid codepoint,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.

Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.

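
The check amounts to a single mask-and-compare on the first byte. A minimal Python sketch follows; the function name is hypothetical, and treating an empty input as an error is a choice made here rather than something the text specifies.

```python
def detect_syntax(document: bytes) -> str:
    """Classify a document as "binary" or "text" from its first byte."""
    if not document:
        raise ValueError("cannot detect the syntax of an empty document")
    # Top two bits `10` means the byte lies in [0x80, 0xBF].
    return "binary" if document[0] & 0xC0 == 0x80 else "text"


print(detect_syntax(bytes([0xB5, 0x84])))  # binary Repr of []
print(detect_syntax(b"[]"))                # textual syntax for []
```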
 								## Appendix. Table of tag values

     80 - False
     81 - True
     82 - Float
     83 - Double
     84 - End marker
     85 - Annotation
     86 - Embedded
    (8x)  RESERVED 87-8F

     9x - Small integers 0..12,-3..-1
     An - Medium integers, (n+1) bytes long
     B0 - Large integers, variable length
     B1 - String
     B2 - ByteString
     B3 - Symbol
     B4 - Record
     B5 - Sequence
     B6 - Set
     B7 - Dictionary

## Appendix. Binary SignedInteger representation
 								Languages that provide fixed-width machine word types may find the
 								following table useful in encoding and decoding binary `SignedInteger`
 								values.

| Integer range                              | Bytes required | Encoding (hex)                               |
| ---                                        | ---            | ---                                          |
| -3 ≤ n ≤ 12                                | 1              | `9X`                                         |
 								| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A0` `XX`                                    |
 								| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A1` `XX` `XX`                               |
 								| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A2` `XX` `XX` `XX`                          |
 								| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
 								| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A4` `XX` `XX` `XX` `XX` `XX`                |
 								| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A5` `XX` `XX` `XX` `XX` `XX` `XX`           |
 								| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
 								| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
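
A decoder for the rows of this table might look like the following Python sketch. It handles only the single-byte and fixed-length tags (`0x90` to `0xAF`) and leaves the varint-length `B0` form out; the function name and the error handling are assumptions of the sketch.

```python
from typing import Tuple


def decode_signed_integer(data: bytes) -> Tuple[int, int]:
    """Return (value, bytes consumed) for tags 0x90..0xAF."""
    if not data:
        raise ValueError("no input")
    tag = data[0]
    if 0x90 <= tag <= 0x9C:          # 0..12
        return tag - 0x90, 1
    if 0x9D <= tag <= 0x9F:          # -3..-1
        return tag - 0xA0, 1
    if 0xA0 <= tag <= 0xAF:          # 1..16 bytes of two's complement
        length = tag - 0xA0 + 1
        body = data[1:1 + length]
        if len(body) != length:
            raise ValueError("truncated integer")
        return int.from_bytes(body, "big", signed=True), 1 + length
    raise ValueError(f"tag {tag:#04x} is not handled by this sketch")


print(decode_signed_integer(bytes.fromhex("A1FEFF")))
```

In a language with fixed-width integers, the `length` computed from the tag indicates which machine type is wide enough to hold the result.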

<!-- Heading to visually offset the footnotes from the main document: -->
 								## Notes