---
---
<title>Preserves: an Expressive Data Language</title>
<link rel="stylesheet" href="preserves.css">

# Preserves: an Expressive Data Language

Tony Garnock-Jones <tonyg@leastfixedpoint.com>
June 2019. Version 0.0.5.

[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
[abnf]: https://tools.ietf.org/html/rfc7405

This document proposes a data model and serialization format called
*Preserves*.

Preserves supports *records* with user-defined *labels*. This relieves
the confusion caused by encoding records as dictionaries, seen in most
data languages in use on the web. It also allows Preserves to easily
represent the *labelled sums of products* as seen in many functional
programming languages.

Preserves also supports the usual suite of atomic and compound data
types, in particular including *binary* data as a distinct type from
text strings. Its *annotations* allow separation of data from metadata
such as comments, trace information, and provenance information.

Finally, Preserves defines precisely how to *compare* two values.
Comparison is based on the data model, not on syntax or on data
structures of any particular implementation language.

## Starting with Semantics

Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax.

Our `Value`s fall into two broad categories: *atomic* and *compound*
data.

    Value = Atom
          | Compound

    Atom = Boolean
         | Float
         | Double
         | SignedInteger
         | String
         | ByteString
         | Symbol

    Compound = Record
             | Sequence
             | Set
             | Dictionary

**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:[^ordering-by-syntax]

    (Values)        Atom < Compound

    (Compounds)     Record < Sequence < Set < Dictionary

    (Atoms)         Boolean < Float < Double < SignedInteger
                      < String < ByteString < Symbol

[^ordering-by-syntax]: The observant reader may note that the
    ordering here is the same as that implied by the tagging scheme
    used in the concrete binary syntax for `Value`s.

**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.

### Signed integers.

A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers.

### Unicode strings.

A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
are compared lexicographically, code-point by
code-point.[^utf8-is-awesome]

[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!

### Binary data.

A `ByteString` is a sequence of octets. `ByteString`s are compared
lexicographically.

### Symbols.

Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point.

### Booleans.

There are two `Boolean`s, “false” and “true”. The “false” value is
less-than the “true” value.

### IEEE floating-point values.

`Float`s and `Double`s are single- and double-precision IEEE 754
floating-point values, respectively. `Float`s, `Double`s and
`SignedInteger`s are disjoint; by the rules [above](#total-order),
every `Float` is less than every `Double`, and every `SignedInteger`
is greater than both. Two `Float`s or two `Double`s are to be ordered
by the `totalOrder` predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).

### Records.

A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A
label can be any `Value`, but is usually a `Symbol`.[^extensibility]
[^iri-labels] `Record`s are compared lexicographically: first by
label, then by field sequence.

[^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
    “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))”
    structure types, which map well to our `Record`s. Racket supports
    record extensibility by encoding record supertypes into record
    labels as specially-formatted lists.

[^iri-labels]: It is occasionally (but seldom) necessary to
    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
    label can be read as a relative IRI, it is notionally interpreted
    with respect to the IRI
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
    for itself—for its own `Value`.

### Sequences.

A `Sequence` is a sequence of `Value`s. `Sequence`s are compared
lexicographically.

### Sets.

A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s.

### Dictionaries.

A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are
pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key.

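The ordering rules of this section can be collected into a single comparison function. The following Python sketch is illustrative only: the `Symbol`, `Float32` and `Record` classes, and the mapping of the other kinds onto built-in Python types, are assumptions made for this example and are not part of the specification. It also assumes that each `Dictionary` pair is compared as a two-element sequence, key first.

```python
import struct
from functools import cmp_to_key

class Symbol(str):
    """Hypothetical marker type: text, but a different kind from String."""

class Float32(float):
    """Hypothetical marker type for single-precision Floats."""

class Record:
    """Hypothetical record type: a label plus a list of fields."""
    def __init__(self, label, *fields):
        self.label, self.fields = label, list(fields)

def kind(v):
    """Rank of v's kind: Boolean=0 ... Symbol=6, Record=7 ... Dictionary=10."""
    # Subclasses (bool, Float32, Symbol) are tested before their base types.
    checks = [(bool, 0), (Float32, 1), (float, 2), (int, 3), (Symbol, 6),
              (str, 4), (bytes, 5), (Record, 7), ((list, tuple), 8),
              ((set, frozenset), 9), (dict, 10)]
    for types, rank in checks:
        if isinstance(v, types):
            return rank
    raise TypeError(f"not a Value: {v!r}")

def float_key(x, double):
    """Map a float to an integer that sorts like IEEE 754 totalOrder."""
    fmt, width = (">d", 64) if double else (">f", 32)
    bits = int.from_bytes(struct.pack(fmt, x), "big")
    sign = 1 << (width - 1)
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return bits ^ ((1 << width) - 1) if bits & sign else bits | sign

def cmp(a, b):
    return (a > b) - (a < b)

def compare_seq(xs, ys):
    """Lexicographic comparison; a proper prefix sorts first."""
    for x, y in zip(xs, ys):
        c = compare(x, y)
        if c:
            return c
    return cmp(len(xs), len(ys))

def ascending(values):
    return sorted(values, key=cmp_to_key(compare))

def compare(a, b):
    """Return -1, 0 or 1 according to the total order on Values."""
    ka, kb = kind(a), kind(b)
    if ka != kb:
        return cmp(ka, kb)
    if ka in (1, 2):
        return cmp(float_key(a, ka == 2), float_key(b, ka == 2))
    if ka <= 6:  # Boolean, SignedInteger, String, ByteString, Symbol
        return cmp(a, b)  # Python compares str by code point, bytes by octet
    if ka == 7:
        return compare(a.label, b.label) or compare_seq(a.fields, b.fields)
    if ka == 8:
        return compare_seq(a, b)
    if ka == 9:
        return compare_seq(ascending(a), ascending(b))
    pairs = lambda d: [[k, d[k]] for k in ascending(d)]
    return compare_seq(pairs(a), pairs(b))
```

Under these assumptions, two `Value`s are equivalent exactly when `compare` returns 0, and `ascending` gives the sorted element sequence used when comparing `Set`s.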
## Textual Syntax

Now that we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.

In this section, we use [case-sensitive ABNF][abnf] to define a
textual syntax that is easy for people to read and
write.[^json-superset] Most of the examples in this document are
written using this syntax. In the following section, we will define an
equivalent compact machine-readable syntax.

[^json-superset]: The grammar of the textual syntax is a superset of
    JSON, with the slightly unusual feature that `true`, `false`, and
    `null` are all read as `Symbol`s, and that `SignedInteger`s are
    never read as `Double`s.

### Character set.

[ABNF][abnf] allows easy definition of US-ASCII-based languages.
However, Preserves is a Unicode-based language. Therefore, we
reinterpret ABNF as a grammar for recognising sequences of Unicode
code points.

Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
possible.

### Whitespace.

Whitespace is defined as any number of spaces, tabs, carriage returns,
line feeds, or commas.

    ws            = *(%x20 / %x09 / newline / ",")
    newline       = CR / LF

### Grammar.

Standalone documents may have trailing whitespace.

    Document      = Value ws

Any `Value` may be preceded by whitespace.

    Value         = ws (Record / Collection / Atom / Compact)
    Collection    = Sequence / Dictionary / Set
    Atom          = Boolean / Float / Double / SignedInteger /
                    String / ByteString / Symbol

Each `Record` is its label-`Value` followed by a parenthesised
grouping of its field-`Value`s. Whitespace is not permitted between
the label and the open-parenthesis.

    Record        = Value "(" *Value ws ")"

`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as one or more values enclosed in curly braces, or zero
or more values enclosed by the tokens `#set{` and
`}`.[^printing-collections]

    Sequence      = "[" *Value ws "]"
    Dictionary    = "{" *(Value ws ":" Value) ws "}"
    Set           = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"

[^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

The special cases of records with a single field, which is in turn a
sequence or dictionary, may be written omitting the parentheses.

    Record        =/ Value Sequence
    Record        =/ Value Dictionary

`Boolean`s are the simple literal strings `#true` and `#false`.

    Boolean       = %s"#true" / %s"#false"

Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing "f" distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have either a fractional part or
an exponent part, where `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

    Float         = flt %i"f"
    Double        = flt
    SignedInteger = int

    digit1-9      = %x31-39
    nat           = %x30 / ( digit1-9 *DIGIT )
    int           = ["-"] nat
    frac          = "." 1*DIGIT
    exp           = %i"e" ["-"/"+"] 1*DIGIT
    flt           = int (frac exp / frac / exp)

[^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

[^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    I/O routines must not truncate precision either when reading or
    writing a `SignedInteger`.

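As an illustration of the numeric grammar, the following Python sketch classifies a numeric token as a `Float`, `Double` or `SignedInteger`. The function name `read_number` and the `(kind, value)` result shape are assumptions of this example. It relies on Python's built-in `float` and arbitrary-precision `int` for the conversions, and rounds `Float`s to single precision with `struct`.

```python
import re
import struct

# Regular-expression transcriptions of the `int` and `flt` productions.
INT = r"-?(?:0|[1-9][0-9]*)"
FLT = INT + r"(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)"
NUMBER = re.compile(rf"(?P<flt>{FLT})(?P<f>[fF])?|(?P<int>{INT})")

def read_number(text):
    """Classify and convert one complete numeric token."""
    m = NUMBER.fullmatch(text)
    if m is None:
        raise ValueError(f"not a Preserves number: {text!r}")
    if m.group("int") is not None:
        # Python ints are arbitrary-precision, so nothing is truncated.
        return ("SignedInteger", int(m.group("int")))
    value = float(m.group("flt"))
    if m.group("f"):
        # Round to the nearest single-precision value.
        # struct raises OverflowError if the value is out of range.
        value = struct.unpack(">f", struct.pack(">f", value))[0]
        return ("Float", value)
    return ("Double", value)
```

For instance, `read_number("1.0f")` yields a `Float`, `read_number("1e3")` a `Double`, and `read_number("42")` a `SignedInteger`. Tokens such as `42f`, `01` or `1.` are rejected because the grammar requires a `Float` to carry a fractional or exponent part and forbids leading zeros.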
`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

    String        = %x22 *char %x22
    char          = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
    unescaped     = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
    escape        = %x5C              ; \
    escaped       = ( %x5C /          ; \    reverse solidus U+005C
                      %x2F /          ; /    solidus         U+002F
                      %x62 /          ; b    backspace       U+0008
                      %x66 /          ; f    form feed       U+000C
                      %x6E /          ; n    line feed       U+000A
                      %x72 /          ; r    carriage return U+000D
                      %x74 )          ; t    tab             U+0009

[^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.

[^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid escaping
    such characters when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle them correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.

    ByteString    = "#" %x22 *binchar %x22
    binchar       = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
    binunescaped  = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`.

    ByteString    =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"

The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and
URL-safe Base64 characters are allowed.

    ByteString    =/ %s"#base64{" *(ws / base64char) ws "}"
    base64char    = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="

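The `#hex{...}` and `#base64{...}` forms can be decoded with a few lines of code once the surrounding tokens have been removed. The sketch below is one possible reading of the grammar in Python; the function names are hypothetical, and the handling of `=` padding (ignored wherever it appears, then re-added as needed) is an assumption of this example, since the grammar does not constrain where `=` may occur.

```python
import base64
import re

WS = r"[ \t\r\n,]"  # the `ws` characters: space, tab, CR, LF and comma

def decode_hex_body(body):
    """Decode the text between `#hex{` and `}`."""
    # Digits must come in adjacent pairs; whitespace may separate pairs.
    if not re.fullmatch(rf"(?:{WS}|[0-9A-Fa-f]{{2}})*", body):
        raise ValueError("malformed #hex{} body")
    return bytes.fromhex(re.sub(WS, "", body))

def decode_base64_body(body):
    """Decode the text between `#base64{` and `}`."""
    if not re.fullmatch(r"[ \t\r\n,A-Za-z0-9+/\-_=]*", body):
        raise ValueError("malformed #base64{} body")
    # Drop whitespace and padding, then map the URL-safe alphabet
    # onto the plain alphabet.
    chars = re.sub(r"[ \t\r\n,=]", "", body)
    chars = chars.replace("-", "+").replace("_", "/")
    # Restore the padding that the standard decoder expects.
    return base64.b64decode(chars + "=" * (-len(chars) % 4))
```

For example, `decode_hex_body("68 65 6c,6C 6f")` and `decode_base64_body("aGVs bG8=")` both produce the bytes of `hello`.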
A `Symbol` may be written in a "bare" form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.

    Symbol        = symstart *symcont / "|" *symchar "|"
    symstart      = ALPHA / sympunct / symunicode
    symcont       = ALPHA / sympunct / symunicode / DIGIT / "-"
    sympunct      = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                    "?" / "_" / "=" / "+" / "<" / ">" / "/" / "."
    symchar       = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
    symunicode    = <any code point greater than 127 whose Unicode
                     category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
                     Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co>

[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of "token representation", and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text]

    Compact       = %s"#value" ws ByteString

[^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    several million floating-point NaNs, or the two floating-point
    Infinities. Since the compact binary format for `Value`s expresses
    each `Value` with precision, embedding binary `Value`s solves the
    problem.

[^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to
    the raw binary form of compact binary encoding from within a
    piece of textual syntax. However, while bytes must be involved in
    any *representation* of text, the text *itself* is logically a
    sequence of *code points* and is not *intrinsically* a binary
    structure at all. It would be incoherent to expect to be able to
    access the representation of the text from within the text itself.

### Annotations.

**Syntax.** When written down, a `Value` may have an associated
sequence of *annotations* carrying “out-of-band” contextual metadata
about the value. Each annotation is, in turn, a `Value`, and may
itself have annotations.

    Value         =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named "`Value`" without altering the semantic class of `Value`s.

**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.

## Compact Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of either

 - a `Value`,
 - a "placeholder" for a `Value`, or
 - an annotation on a `Repr`.

Each `Repr` comprises one or more bytes describing the kind of
represented information and the length of the representation, followed
by the encoded details.

For a value `v`, we write `[[v]]` for the `Repr` of `v`.

### Type and Length representation.

Each `Repr` takes one of three possible forms:

 - (A) type-specific form, used for simple values such as `Boolean`s
   or `Float`s, for placeholders, and for introducing annotations.

 - (B) a variable-length form with length specified up-front, used for
   compound and variable-length atomic data structures when their
   sizes are known at the time serialization begins.

 - (C) a variable-length streaming form with unknown or unpredictable
   length, used in cases when serialization begins before the number
   of elements or bytes in the corresponding `Value` is known.

Applications may choose between formats B and C depending on their
needs at serialization time.

#### The lead byte.

Every `Repr` starts with a *lead byte*, constructed by
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:

    leadbyte(t,n,m) = [t*64 + n*16 + m]

The arguments `t`, `n` and `m` describe the rest of the
representation.[^some-encodings-unused]

[^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

| `t` | `n` | `m` | Meaning |
| --- | --- | --- | ------- |
| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation |
| 0 | 0 | 4 | (format C) Stream end |
| 0 | 0 | 5 | (format A) Annotation |
| 0 | 1 | | (format A) Placeholder for an application-specific `Value` |
| 0 | 2 | | (format C) Stream start |
| 0 | 3 | | (format A) Certain small `SignedInteger`s |
| 1 | | | (format B) An `Atom` with variable-length binary representation |
| 2 | | | (format B) A `Compound` with variable-length representation |

#### Encoding data of type-specific length (format A).

Each type of data defines its own rules for this format.

#### Encoding data of known length (format B).

Format B is used where the length `l` of the `Value` to be encoded is
known when serialization begins. Format B `Repr`s use `m` in
`leadbyte` to encode `l`. The length counts *bytes* for atomic
`Value`s, but counts *contained values* for compound `Value`s.

 - A length `l` between 0 and 14 is represented using `leadbyte` with
   `m=l`.
 - A length of 15 or greater is represented by `m=15` and additional
   bytes describing the length following the lead byte.

The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:

    header(t,n,m) = leadbyte(t,n,m)                when m < 15
                 or leadbyte(t,n,15) ++ varint(m)  otherwise

The additional length bytes are formatted as
[base 128 varints][varint]. We write `varint(m)` for the
varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
definition,

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |

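The definitions of `leadbyte`, `varint` and `header` translate almost directly into code. The following Python sketch is a straightforward transcription; the use of `bytes` as the result type is a choice made for this example.

```python
def leadbyte(t, n, m):
    """Pack t (2 bits), n (2 bits) and m (4 bits) into a single byte."""
    assert 0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16
    return bytes([t * 64 + n * 16 + m])

def varint(m):
    """Base 128 varint: 7 bits per byte, least significant group first."""
    assert m >= 0
    out = bytearray()
    while True:
        group, m = m & 0x7F, m >> 7
        if m:
            out.append(group | 0x80)  # msb set: more bytes follow
        else:
            out.append(group)         # msb clear: this is the last byte
            return bytes(out)

def header(t, n, m):
    """Lead byte, plus a varint length when m does not fit in 4 bits."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)
```

With these definitions, `varint(300)` gives the bytes 172 and 2, matching the table, and `header(1,1,15)` gives the two bytes `0x5F 0x0F`: a lead byte with `m=15` followed by the varint encoding of the length 15.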
#### Streaming data of unknown length (format C).

A `Repr` where the length of the `Value` to be encoded is variable and
not known at the time serialization of the `Value` starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
*chunks*, followed by a matching Stream End (“close”) byte:

    open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
    close()   = leadbyte(0,0, 4)       = [0x04]

For a format C `Repr` of an atomic `Value`, each chunk is to be a
format B `Repr` of a `ByteString`, no matter the type of the overall
`Value`. Annotations are not allowed on these individual chunks.

For a format C `Repr` of a compound `Value`, each chunk is to be a
single `Repr`, which may itself be annotated.

Each chunk within a format C `Repr` *MUST* have non-zero length.
Software that decodes `Repr`s *MUST* reject `Repr`s that include
zero-length chunks.

### Records.

Format B (known length):

    [[ L(F_1...F_m) ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.

Format C (streaming):

    [[ L(F_1...F_m) ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()

Applications *SHOULD* prefer the known-length format for encoding
`Record`s.

### Placeholders.

Applications may define an interpretation for numbered *placeholders*
in the binary syntax, mapping each *placeholder number* `n` to a
specific `Value`. For example, a placeholder number may be assigned
for a frequently-used `Record` label.

A `Value` `v` for which placeholder number `n` has been assigned may
be tersely encoded as

    [[v]] = header(0,1,n)     when n is a placeholder number for v

**Examples.** For example, a protocol may choose to assign placeholder
number 4 to the symbol `void`, making

    [[void]] = header(0,1,4) = [0x14]
    [[void()]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]

or it may map symbol `person` to placeholder number 102, making

    [[person]] = header(0,1,102) = [0x1F, 0x66]

and so

    [[person("Dr", "Elizabeth", "Blackwell")]]
      = header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
      = [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

for format B, or

      open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
      = [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]

for format C.

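The record and placeholder encodings above can be checked mechanically. In the Python sketch below, the helper names (`open_`, `placeholder`, `record_b`, `record_c`) are hypothetical, and labels and fields are passed in as already-encoded `Repr`s so that the example stays independent of how individual atoms are encoded.

```python
def varint(m):
    """Base 128 varint, least significant 7-bit group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    """Lead byte for (t, n), with a varint length when m >= 15."""
    lead = t * 64 + n * 16
    if m < 15:
        return bytes([lead + m])
    return bytes([lead + 15]) + varint(m)

def open_(t, n):
    """Stream Start byte for a format C Repr."""
    return bytes([0x20 + t * 4 + n])

CLOSE = bytes([0x04])  # Stream End byte

def placeholder(n):
    """Repr of the Value that an application has assigned number n."""
    return header(0, 1, n)

def record_b(label_repr, field_reprs):
    """Format B: the count passed to header includes the label."""
    body = label_repr + b"".join(field_reprs)
    return header(2, 0, len(field_reprs) + 1) + body

def record_c(label_repr, field_reprs):
    """Format C: no count, just open, label, fields, close."""
    return open_(2, 0) + label_repr + b"".join(field_reprs) + CLOSE
```

Assuming placeholder 4 stands for `void`, `record_b(placeholder(4), [])` produces the bytes `0x81 0x14`, the `Repr` given above for `void()`.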
### Sequences, Sets and Dictionaries.

Format B (known length):

    [[ [X_1...X_m] ]]         = header(2,1,m)   ++ [[X_1]] ++...++ [[X_m]]
    [[ #set{X_1...X_m} ]]     = header(2,2,m)   ++ [[X_1]] ++...++ [[X_m]]
    [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                                ++ [[K_m]] ++ [[V_m]]

Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.

Format C (streaming):

    [[ [X_1...X_m] ]]         = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
    [[ #set{X_1...X_m} ]]     = open(2,2) ++ [[X_1]] ++...++ [[X_m]] ++ close()
    [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
                                          ++ [[K_m]] ++ [[V_m]] ++ close()

Applications may use whichever format suits their needs on a
case-by-case basis.

There is *no* ordering requirement on the `X_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order.

[^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s, because (a)
    where canonicalization is used for cryptographic signatures, it is
    more reliable to simply retain the exact binary form of the signed
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

    However, a quality implementation may wish to offer the programmer
    the option of serializing with set elements and dictionary keys in
    sorted order.

### SignedIntegers.

Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
                                     header(0,3,x+16)              if -3≤x<0
                                     header(0,3,x)                 if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
using format B.

Format C *MUST NOT* be used for `SignedInteger`s.

The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and
`m = |intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]

[^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.

For example,

    [[ -257 ]] = 42 FE FF    [[  -3 ]] = 3D       [[    128 ]] = 42 00 80
    [[ -256 ]] = 42 FF 00    [[  -2 ]] = 3E       [[    255 ]] = 42 00 FF
    [[ -255 ]] = 42 FF 01    [[  -1 ]] = 3F       [[    256 ]] = 42 01 00
    [[ -254 ]] = 42 FF 02    [[   0 ]] = 30       [[  32767 ]] = 42 7F FF
    [[ -129 ]] = 42 FF 7F    [[   1 ]] = 31       [[  32768 ]] = 43 00 80 00
    [[ -128 ]] = 41 80       [[  12 ]] = 3C       [[  65535 ]] = 43 00 FF FF
    [[ -127 ]] = 41 81       [[  13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
    [[   -4 ]] = 41 FC       [[ 127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the value of `n` supplied
to `header` and `open`. In each case, the payload following the header
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
encoding of the `Value`'s code points, while for `ByteString` it is
the raw data contained within the `Value` unmodified.

Format B (known length):

    [[ S ]] = header(1,n,m) ++ encode(S)
      where m = |encode(S)|
        and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                            (2,S)        if S ∈ ByteString
                            (3,utf8(S))  if S ∈ Symbol

To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close()`. Every chunk must be a `ByteString`, and no chunk may be
annotated.

While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.

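Both encodings of these three types can be sketched in a few lines of Python. The kind names used as dictionary keys, and the function names `encode_b` and `encode_c`, are hypothetical. The streaming encoder takes the payload already split into byte chunks and rejects empty chunks, since chunks within a format C `Repr` must have non-zero length.

```python
def varint(m):
    """Base 128 varint, least significant 7-bit group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    """Lead byte for (t, n), with a varint length when m >= 15."""
    lead = t * 64 + n * 16
    if m < 15:
        return bytes([lead + m])
    return bytes([lead + 15]) + varint(m)

def open_(t, n):
    """Stream Start byte for a format C Repr."""
    return bytes([0x20 + t * 4 + n])

CLOSE = bytes([0x04])  # Stream End byte
KIND_N = {"String": 1, "ByteString": 2, "Symbol": 3}

def encode_b(kind, value):
    """Format B: header with the byte length, then the payload."""
    data = value if kind == "ByteString" else value.encode("utf-8")
    return header(1, KIND_N[kind], len(data)) + data

def encode_c(kind, chunks):
    """Format C: open, then each chunk as a format B ByteString, then close."""
    out = open_(1, KIND_N[kind])
    for chunk in chunks:
        if not chunk:
            raise ValueError("zero-length chunks are not allowed")
        out += header(1, 2, len(chunk)) + chunk
    return out + CLOSE
```

For example, `encode_c("String", [b"he", b"llo"])` yields `25 62 'h' 'e' 63 'l' 'l' 'o' 04`. A chunk boundary may fall inside a multi-byte UTF-8 sequence, as in `encode_c("String", [b"\xc3", b"\xa9"])`, because only the concatenated content has to be valid UTF-8.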
### Fixed-length Atoms.

Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
for any `n`.

#### Booleans.

    [[ #false ]] = header(0,0,0) = [0x00]
    [[ #true  ]] = header(0,0,1) = [0x01]

#### Floats and Doubles.

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.

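In Python, `binary32` and `binary64` correspond to `struct.pack` with the big-endian format codes `">f"` and `">d"`. The sketch below uses that correspondence; the three function names are assumptions of this example.

```python
import struct

def encode_boolean(b):
    """#false is the single byte 0x00, #true the single byte 0x01."""
    return bytes([0x01 if b else 0x00])

def encode_float(f):
    """Lead byte 0x02, then the big-endian IEEE 754 binary32 form."""
    # struct raises OverflowError for finite values outside binary32 range.
    return bytes([0x02]) + struct.pack(">f", f)

def encode_double(d):
    """Lead byte 0x03, then the big-endian IEEE 754 binary64 form."""
    return bytes([0x03]) + struct.pack(">d", d)
```

Because the payload is the raw IEEE 754 bit pattern, values such as the infinities, which the textual numeric grammar cannot write, encode without any special handling: `encode_double(float("inf"))` is `03 7F F0 00 00 00 00 00 00`.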
### Annotations.

To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x05] ++ [[v]]`.

For example, the `Repr` corresponding to textual syntax `@a @b []`,
i.e. an empty sequence annotated with two symbols, `a` and `b`, is

    [[ @a @b [] ]]
      = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
      = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]

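Annotation is plain prefixing, so several annotations nest from the inside out. A minimal Python sketch, with hypothetical function names and with all arguments supplied as already-encoded `Repr`s:

```python
ANNOTATION = bytes([0x05])  # lead byte introducing an annotation

def annotate(repr_, annotation_repr):
    """Prefix a Repr with one annotation, itself given as a Repr."""
    return ANNOTATION + annotation_repr + repr_

def annotate_all(repr_, annotation_reprs):
    """Apply annotations so that they appear in the order given."""
    out = repr_
    # The last annotation sits closest to the value, so apply it first.
    for annotation_repr in reversed(annotation_reprs):
        out = annotate(out, annotation_repr)
    return out
```

Taking `[0x71, 0x61]` and `[0x71, 0x62]` as the `Repr`s of the symbols `a` and `b`, and `[0x90]` as the `Repr` of the empty sequence, `annotate_all` reproduces the seven bytes shown above for `@a @b []`.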
## Examples
|
||
|
||
### Simple examples.
|
||
|
||
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
||
<!-- translated from various JSON blobs floating around the internet. -->
|
||
|
||
For the following examples, imagine an application that maps
|
||
placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to
|
||
`observe`.
|
||
|
||
| Value | Encoded byte sequence |
|
||
|---------------------------------------------------|-------------------------------------------------------------------------------------|
|
||
| `capture(discard())` | 82 11 81 10 |
|
||
| `observe(speak(discard(), capture(discard())))` | 82 12 83 75 's' 'p' 'e' 'a' 'k' 81 10 82 11 81 11 |
|
||
| `[1 2 3 4]` (format B) | 94 31 32 33 34 |
|
||
| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 |
|
||
| `[-2 -1 0 1]` | 94 3E 3F 30 31 |
|
||
| `"hello"` (format B) | 55 'h' 'e' 'l' 'l' 'o' |
|
||
| `"hello"` (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 35 |
|
||
| `"hello"` (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 |
|
||
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
|
||
| `-257` | 42 FE FF |
|
||
| `-1` | 3F |
|
||
| `0` | 30 |
|
||
| `1` | 31 |
|
||
| `255` | 42 00 FF |
|
||
| `1.0f` | 02 3F 80 00 00 |
|
||
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
|
||
| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
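The integer rows of the table can be reproduced with a short Python sketch. It assumes that the body of a `SignedInteger` is the shortest big-endian two's-complement form, which matches the rows above; `encode_integer` is a hypothetical name, and lengths that would need a varint are deliberately left out.

```python
def encode_integer(n: int) -> bytes:
    """Encode an integer, preferring the one-byte small-integer form."""
    if -3 <= n <= 12:
        # Lead bytes 0x30..0x3C hold 0..12; 0x3D..0x3F hold -3..-1.
        return bytes([0x30 | (n & 0x0F)])
    # Shortest two's-complement length in bytes (an assumption, see above).
    magnitude = n if n >= 0 else ~n
    length = magnitude.bit_length() // 8 + 1
    if length >= 15:
        raise NotImplementedError("bodies of 15 or more bytes need a varint length")
    return bytes([0x40 | length]) + n.to_bytes(length, "big", signed=True)


for value in (-257, -1, 0, 1, 255):
    print(value, encode_integer(value).hex(" "))
```

The extra leading `00` in the encoding of `255` is what keeps the value from being read back as a negative number.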

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    [titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")

encodes to

    85                                 ;; Record, generic, 4+1
      95                               ;; Sequence, 5
        76 74 69 74 6C 65 64           ;; Symbol, "titled"
        76 70 65 72 73 6F 6E           ;; Symbol, "person"
        32                             ;; SignedInteger, "2"
        75 74 68 69 6E 67              ;; Symbol, "thing"
        31                             ;; SignedInteger, "1"
      41 65                            ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C    ;; String, "Blackwell"
      84                               ;; Record, generic, 3+1
        74 64 61 74 65                 ;; Symbol, "date"
        42 07 1D                       ;; SignedInteger, "1821"
        32                             ;; SignedInteger, "2"
        33                             ;; SignedInteger, "3"
      52 44 72                         ;; String, "Dr"

[^extensibility2]: It happens to line up with Racket's
    representation of a record label for an inheritance hierarchy
    where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see
    [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29).

---

### JSON examples.

The examples from
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
valid Preserves, though the JSON literals `true`, `false` and `null`
read as `Symbol`s. The first example:

    {
      "Image": {
        "Width": 800,
        "Height": 600,
        "Title": "View from 15th Floor",
        "Thumbnail": {
          "Url": "http://www.example.com/image/481989943",
          "Height": 125,
          "Width": 100
        },
        "Animated" : false,
        "IDs": [116, 943, 234, 38793]
      }
    }

encodes to binary as follows:

    B2
      55 "Image"
      BC
        55 "Width" 42 03 20
        55 "Title" 5F 14 "View from 15th Floor"
        58 "Animated" 75 "false"
        56 "Height" 42 02 58
        59 "Thumbnail"
          B6
            55 "Width" 41 64
            53 "Url" 5F 26 "http://www.example.com/image/481989943"
            56 "Height" 41 7D
        53 "IDs" 94
                   41 74
                   42 03 AF
                   42 00 EA
                   43 00 97 89

and the second example:

    [
      {
        "precision": "zip",
        "Latitude": 37.7668,
        "Longitude": -122.3959,
        "Address": "",
        "City": "SAN FRANCISCO",
        "State": "CA",
        "Zip": "94107",
        "Country": "US"
      },
      {
        "precision": "zip",
        "Latitude": 37.371991,
        "Longitude": -122.026020,
        "Address": "",
        "City": "SUNNYVALE",
        "State": "CA",
        "Zip": "94085",
        "Country": "US"
      }
    ]

encodes to binary as follows:

    92
      BF 10
        59 "precision" 53 "zip"
        58 "Latitude" 03 40 42 E2 26 80 9D 49 52
        59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
        57 "Address" 50
        54 "City" 5D "SAN FRANCISCO"
        55 "State" 52 "CA"
        53 "Zip" 55 "94107"
        57 "Country" 52 "US"
      BF 10
        59 "precision" 53 "zip"
        58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
        59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
        57 "Address" 50
        54 "City" 59 "SUNNYVALE"
        55 "State" 52 "CA"
        53 "Zip" 55 "94085"
        57 "Country" 52 "US"

## Conventions for Common Data Types

The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.[^why-not-spki-sexps]

[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
    similar to Preserves. However, while they include binary data and
    sequences, and an obvious equivalence for them exists, they lack
    numbers *per se* as well as any kind of unordered structure such
    as sets or maps. In addition, while "display hints" allow
    labelling of binary data with an intended interpretation, they
    cannot be attached to any other kind of structure, and the "hint"
    itself can only be a binary blob.

However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.

Appropriately-labelled `Record`s denote these domain-specific data
types.[^why-dictionaries]

[^why-dictionaries]: Given `Record`'s existence, it may seem odd
    that `Dictionary`, `Set`, `Float`, etc. are given special
    treatment. Preserves aims to offer a useful basic equivalence
    predicate to programmers, and so if a data type demands a special
    equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
    then the type should be included in the base language. Otherwise,
    it can be represented as a `Record` and treated separately.
    `Boolean`, `String` and `Symbol` are seeming exceptions. The first
    two merit inclusion because of their cultural importance, while
    `Symbol`s are included to allow their use as `Record` labels.
    Primitive `Symbol` support avoids a bootstrapping issue.

All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.

**Validity.** Many of the labels we will describe in this section come
with side-conditions on the contents of labelled `Record`s. It is
possible to construct an instance of `Value` that violates these
side-conditions without ceasing to be a `Value` or becoming
unrepresentable. However, we say that such a `Value` is *invalid*
because it fails to honour the necessary side-conditions.
Implementations *SHOULD* allow two modes of working: one which
treats all `Value`s identically, without regard for side-conditions,
and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing `Value`s.

### IOLists.

Inspired by Erlang's notions of
[`iolist()` and `iodata()`](http://erlang.org/doc/reference_manual/typespec.html),
an `IOList` is any tree constructed from `ByteString`s and
`Sequence`s. Formally, an `IOList` is either a `ByteString` or a
`Sequence` of `IOList`s.

`IOList`s can be useful for
[vectored I/O](https://en.wikipedia.org/wiki/Vectored_I/O).
Additionally, the flexibility of `IOList` trees allows annotation of
interior portions of a tree.
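The recursive definition translates directly into a traversal. The Python sketch below is only an illustration: it models a `ByteString` as `bytes` and a `Sequence` as a `list` or `tuple`, and the names `iter_chunks` and `flatten` are hypothetical.

```python
from typing import Iterator, Sequence, Union

# An IOList is either a ByteString or a Sequence of IOLists.
IOList = Union[bytes, Sequence["IOList"]]


def iter_chunks(iolist: IOList) -> Iterator[bytes]:
    """Yield the ByteStrings of an IOList in left-to-right order."""
    if isinstance(iolist, (bytes, bytearray)):
        yield bytes(iolist)
    elif isinstance(iolist, (list, tuple)):
        for item in iolist:
            yield from iter_chunks(item)
    else:
        raise TypeError(f"not an IOList: {iolist!r}")


def flatten(iolist: IOList) -> bytes:
    """Concatenate all chunks into a single ByteString."""
    return b"".join(iter_chunks(iolist))


tree = [
    bytes.fromhex("00010203"),
    [bytes.fromhex("04050607"), bytes.fromhex("08090A0B")],
    bytes.fromhex("0C0D0E0F"),
]
print(flatten(tree).hex())  # 000102030405060708090a0b0c0d0e0f
```

For vectored I/O, the chunks from `iter_chunks` could be handed to a gathering write without ever building the flattened copy.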

### Comments.

`String` values used as annotations are conventionally interpreted as
comments.

    @"I am a comment for the Dictionary"
    {
      @"I am a comment for the key"
      key: @"I am a comment for the value"
           value
    }

    @"I am a comment for this entire IOList"
    [
      #hex{00010203}
      @"I am a comment for the middle half of the IOList"
      @"A second comment for the same portion of the IOList"
      [
        @"I am a comment for the following ByteString"
        #hex{04050607}
        #hex{08090A0B}
      ]
      #hex{0C0D0E0F}
    ]

### MIME-type tagged binary data.

Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.

While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types following the general rules for ordering of
`Record`s.

**Examples.**

| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `mime(application/octet-stream #"abcde")` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `mime(application/xml #"<xhtml/>")` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `mime(text/csv #"123,234,345")` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |

Applications making heavy use of `mime` records may choose to use a
placeholder number for the symbol `mime` as well as the symbols for
individual media types. For example, if placeholder number 1 were
chosen for `mime`, and placeholder number 7 for `text/plain`, the
second example above, `mime(text/plain #"ABC")`, would be encoded as
`83 11 17 63 41 42 43`.
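To make the byte sequences above easier to check, here is a small Python sketch that builds them from the lead-byte layout. All helper names (`varint`, `header`, `symbol`, `bytestring`, `placeholder`, `record`) are hypothetical, and the varint is assumed to be the base-128, least-significant-group-first format linked from the introduction.

```python
def varint(n: int) -> bytes:
    """Base-128 varint, least-significant group first (assumed format)."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)


def header(tt: int, nn: int, m: int) -> bytes:
    """Lead byte tt nn mmmm; values of 15 or more spill into a varint."""
    lead = (tt << 6) | (nn << 4)
    if m < 15:
        return bytes([lead | m])
    return bytes([lead | 15]) + varint(m)


def symbol(name: str) -> bytes:
    body = name.encode("utf-8")
    return header(1, 3, len(body)) + body


def bytestring(data: bytes) -> bytes:
    return header(1, 2, len(data)) + data


def placeholder(number: int) -> bytes:
    return header(0, 1, number)


def record(label_repr: bytes, *field_reprs: bytes) -> bytes:
    # The count covers the label plus the fields.
    return header(2, 0, 1 + len(field_reprs)) + label_repr + b"".join(field_reprs)


plain = record(symbol("mime"), symbol("text/plain"), bytestring(b"ABC"))
compact = record(placeholder(1), placeholder(7), bytestring(b"ABC"))
print(plain.hex(" "))
print(compact.hex(" "))  # 83 11 17 63 41 42 43
```

The placeholder form shrinks the 21-byte record to 7 bytes, at the cost of both sides having to agree on the placeholder table in advance.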

### Unicode normalization forms.

Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text.
While no particular normalization form is required for `String`s,
users may need to unambiguously signal or require a particular
normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
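A validity check for the side-condition might look like the following Python sketch, built on the standard `unicodedata` module. The function name is hypothetical, and treating any form symbol other than the four listed as invalid is an assumption of this sketch, not something the text above requires.

```python
import unicodedata

# Map from the Symbol used in the record to the name unicodedata expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form_symbol: str, text: str) -> bool:
    """True if text is already normalized according to the named form."""
    form = FORMS.get(form_symbol)
    if form is None:
        return False  # assumption: unknown forms are rejected
    return unicodedata.normalize(form, text) == text


composed = "p\u00e4ron"     # precomposed U+00E4
decomposed = "pa\u0308ron"  # "a" followed by combining diaeresis U+0308

print(is_valid_normalized_string("nfc", composed))    # True
print(is_valid_normalized_string("nfc", decomposed))  # False
print(is_valid_normalized_string("nfd", decomposed))  # True
```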

### IRIs (URIs, URLs, URNs, etc.).

An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.

### Machine words.

The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.

A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,
 - in `i8(`*x*`)`, -128 <= *x* <= 127.
 - in `u8(`*x*`)`, 0 <= *x* <= 255.
 - in `i16(`*x*`)`, -32768 <= *x* <= 32767.
 - etc.
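The ranges for the whole family follow from the bit width, as this Python sketch shows. The function names are hypothetical; the sketch only checks the single field, not the shape of the surrounding `Record`.

```python
import re
from typing import Tuple

_LABEL = re.compile(r"([iu])(8|16|32|64)")


def machine_word_range(label: str) -> Tuple[int, int]:
    """Return the inclusive (low, high) range for a label such as 'i8' or 'u32'."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a machine-word label: {label!r}")
    bits = int(match.group(2))
    if match.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_machine_word(label: str, x: int) -> bool:
    """True if x satisfies the range restriction named by label."""
    low, high = machine_word_range(label)
    return low <= x <= high


print(machine_word_range("i8"))          # (-128, 127)
print(machine_word_range("u8"))          # (0, 255)
print(machine_word_range("i16"))         # (-32768, 32767)
print(is_valid_machine_word("u8", 256))  # False
```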

### Anonymous Tuples and Unit.

A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.

The 0-ary tuple, `tuple()`, denotes the empty tuple, sometimes called
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).

### Null and Undefined.

Tony Hoare's
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
can be represented with the 0-ary `Record` `null()`. An "undefined"
value can be represented as `undefined()`.

### Dates and Times.

Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).

## Security Considerations

**Empty chunks.** Chunks of zero length are prohibited in streamed
(format C) `Repr`s. However, a malicious or broken encoder may include
them nonetheless. This opens up a possibility for denial-of-service:
an attacker may begin streaming a `String`, for example, sending an
endless sequence of zero length chunks, appearing to make progress but
not actually doing so. Implementations *MUST* reject zero length
chunks when decoding, and *MUST NOT* produce them when encoding.
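A decoder can enforce this rule at the point where it reads each chunk header. The Python sketch below handles only byte-chunked streams (Start Stream with `tt`=`01`), takes the lead bytes from the appendix tables (`0x04` for End Stream, `0x6m` for a `ByteString` chunk), and uses hypothetical function names. It is a sketch under those assumptions, not a complete decoder.

```python
from typing import Tuple


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_streamed_bytes(data: bytes) -> bytes:
    """Join the chunks of a byte-chunked stream, rejecting empty chunks."""
    lead = data[0]
    if lead >> 4 != 0x2 or (lead >> 2) & 0x3 != 0b01:
        raise ValueError("not a byte-chunked Start Stream")
    pos, body = 1, bytearray()
    while True:
        if pos >= len(data):
            raise ValueError("truncated stream")
        lead = data[pos]
        pos += 1
        if lead == 0x04:  # End Stream
            return bytes(body)
        if lead >> 4 != 0x6:
            raise ValueError("chunk is not a ByteString")
        length = lead & 0x0F
        if length == 15:
            length, pos = read_varint(data, pos)
        if length == 0:
            raise ValueError("zero-length chunk rejected")
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        body += data[pos:pos + length]
        pos += length


stream = bytes([0x25, 0x62]) + b"he" + bytes([0x63]) + b"llo" + bytes([0x04])
print(decode_streamed_bytes(stream))  # b'hello'
```

Because the check happens before any chunk body is consumed, a peer that sends nothing but empty chunks is rejected on its first one.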

**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace that may appear in a serialized `Value`.

**Annotations.** Also similarly, in modes where a `Value` is being
read while annotations are skipped, an endless sequence of annotations
may give an illusion of progress.

**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.

## Appendix. Table of lead byte values

     00 - False
     01 - True
     02 - Float
     03 - Double
     04 - End stream
     05 - Annotation
    (0x)  RESERVED 06-0F
     1x - Placeholder
     2x - Start Stream
     3x - Small integers 0..12,-3..-1

     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol

     8x - Record
     9x - Sequence
     Ax - Set
     Bx - Dictionary

    (Cx)  RESERVED C0-CF
    (Dx)  RESERVED D0-DF
    (Ex)  RESERVED E0-EF
    (Fx)  RESERVED F0-FF

## Appendix. Bit fields within lead byte values

    tt nn mmmm  contents
    ---------- ---------

    00 00 0000  False
    00 00 0001  True
    00 00 0010  Float, 32 bits big-endian binary
    00 00 0011  Double, 64 bits big-endian binary
    00 00 0100  End Stream (to match a previous Start Stream)
    00 00 0101  Annotation; two more Reprs follow

    00 01 mmmm  Placeholder; m is the placeholder number

    00 10 ttnn  Start Stream <tt,nn>
                  When tt = 00 --> error
                            01 --> each chunk is a ByteString
                            10 --> each chunk is a single encoded Value
                            11 --> error (RESERVED)

    00 11 xxxx  Small integers 0..12,-3..-1

    01 00 mmmm  SignedInteger, big-endian binary
    01 01 mmmm  String, UTF-8 binary
    01 10 mmmm  ByteString
    01 11 mmmm  Symbol, UTF-8 binary

    10 00 mmmm  Record
    10 01 mmmm  Sequence
    10 10 mmmm  Set
    10 11 mmmm  Dictionary

    11 nn mmmm  error, RESERVED

Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
decoding the varint that follows.

Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
is the length of the body that follows, counted in bytes for `tt`=`01`
and in `Repr`s for `tt`=`10`.
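The rules for splitting a lead byte can be sketched in Python as follows. The function names are hypothetical, the varint is assumed to be base-128 with the least-significant group first, and for the rows whose low four bits are not an `mmmm` field the raw nibble is returned unchanged.

```python
from typing import Tuple


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_lead(data: bytes, pos: int = 0) -> Tuple[int, int, int, int]:
    """Return (tt, nn, l, next_pos) for the lead byte at pos."""
    lead = data[pos]
    pos += 1
    tt, nn, m = lead >> 6, (lead >> 4) & 0x3, lead & 0x0F
    if tt == 3:
        raise ValueError("reserved lead byte")
    if tt == 0 and nn != 1:
        # Fixed values, Start Stream and small integers: the low nibble
        # is not a length or placeholder number, so no varint follows.
        return tt, nn, m, pos
    l = m
    if m == 15:
        l, pos = read_varint(data, pos)
    return tt, nn, l, pos


print(decode_lead(bytes.fromhex("5F14")))  # (1, 1, 20, 2): String, 20 bytes
print(decode_lead(bytes.fromhex("BF10")))  # (2, 3, 16, 2): Dictionary, 16 Reprs
print(decode_lead(bytes.fromhex("11")))    # (0, 1, 1, 1): placeholder number 1
```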

<!-- Not yet ready

## Appendix. Representing Values in Programming Languages

We have given a definition of `Value` and its semantics, and proposed
a concrete syntax for communicating and storing `Value`s. We now turn
to **suggested** representations of `Value`s as *programming-language
values* for various programming languages.

When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.

Also, the presence or absence of annotations on a `Value` should not
affect comparisons of that `Value` to others in any way.

### JavaScript.

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ numbers
 - `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
 - `String` ↔ strings
 - `ByteString` ↔ `Uint8Array`
 - `Symbol` ↔ `Symbol.for(...)`
 - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
    - `(undefined)` ↔ the undefined value
    - `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
 - `Sequence` ↔ `Array`
 - `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
 - `Dictionary` ↔ a `Map`

### Scheme/Racket.

 - `Boolean` ↔ booleans
 - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
 - `SignedInteger` ↔ exact numbers
 - `String` ↔ strings
 - `ByteString` ↔ byte vector (Racket: "Bytes")
 - `Symbol` ↔ symbols
 - `Record` ↔ structures (Racket: prefab struct)
 - `Sequence` ↔ lists
 - `Set` ↔ Racket: sets
 - `Dictionary` ↔ Racket: hash-table

### Java.

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ `Float` and `Double`
 - `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
 - `String` ↔ `String`
 - `ByteString` ↔ `byte[]`
 - `Symbol` ↔ a simple data class wrapping a `String`
 - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
    - `(mime T B)` ↔ an implementation of `javax.activation.DataSource`?
 - `Sequence` ↔ an implementation of `java.util.List`
 - `Set` ↔ an implementation of `java.util.Set`
 - `Dictionary` ↔ an implementation of `java.util.Map`

### Erlang.

 - `Boolean` ↔ `true` and `false`
 - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
 - `SignedInteger` ↔ integers
 - `String` ↔ pair of `utf8` and a binary
 - `ByteString` ↔ a binary
 - `Symbol` ↔ pair of `atom` and a binary
 - `Record` ↔ triple of `obj`, label, and field list
 - `Sequence` ↔ a list
 - `Set` ↔ a `sets` set
 - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)

This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't
garbage-collect its atoms, meaning that (a.1) representing `Symbol`s
as atoms could lead to denial-of-service and (a.2) representing
`Symbol`-labelled `Record`s as Erlang records must be rejected for the
same reason; (b) even if it did, Erlang's boolean values are atoms,
which would then clash with the `Symbol`s `true` and `false`; and (c)
Erlang has no distinct string type, making for a trilemma where
`String`s are in danger of clashing with `ByteString`s, `Sequence`s,
or `Record`s.

### Python.

 - `Boolean` ↔ `True` and `False`
 - `Float` ↔ a `Float` wrapper-class for a double-precision value
 - `Double` ↔ float
 - `SignedInteger` ↔ int and long
 - `String` ↔ `unicode`
 - `ByteString` ↔ `bytes`
 - `Symbol` ↔ a simple data class wrapping a `unicode`
 - `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
 - `Sequence` ↔ `tuple` (but accept `list` during encoding)
 - `Set` ↔ `frozenset` (but accept `set` during encoding)
 - `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)

### Squeak Smalltalk.

 - `Boolean` ↔ `true` and `false`
 - `Float` ↔ perhaps a subclass of `Float`?
 - `Double` ↔ `Float`
 - `SignedInteger` ↔ `Integer`
 - `String` ↔ `WideString`
 - `ByteString` ↔ `ByteArray`
 - `Symbol` ↔ `WideSymbol`
 - `Record` ↔ a simple data class
 - `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
 - `Set` ↔ `Set`
 - `Dictionary` ↔ `Dictionary`

-->

## Appendix. Why not Just Use JSON?

<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->

JSON offers *syntax* for numbers, strings, booleans, null, arrays and
string-keyed maps. However, it suffers from two major problems. First,
it offers no *semantics* for the syntax: it is left to each
implementation to determine how to treat each JSON term. This causes
[interoperability](http://seriot.ch/parsing_json.php) and even
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
issues. Second, JSON's lack of support for type tags leads to awkward
and incompatible *encodings* of type information in terms of the fixed
suite of constructors on offer.

There are other minor problems with JSON having to do with its syntax.
Examples include its relative verbosity and its lack of support for
binary data.

### JSON syntax doesn't *mean* anything

When are two JSON values the same? When are they different?
<!-- When is one JSON value "less than" another? -->

The specifications are largely silent on these questions. Different
JSON implementations give different answers.

Specifically, JSON does not:

 - assign any meaning to numbers,[^meaning-ieee-double]
 - determine how strings are to be compared,[^string-key-comparison]
 - determine whether object key ordering is significant,[^json-member-ordering] or
 - determine whether duplicate object keys are permitted, what it
   would mean if they were, or how to determine a duplicate in the
   first place.[^json-key-uniqueness]

In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]

[^meaning-ieee-double]:
    [Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6)
    does go so far as to indicate “good interoperability can be
    achieved” by imagining that parsers are able reliably to
    understand the syntax of numbers as denoting an IEEE 754
    double-precision floating-point value.

[^string-key-comparison]:
    [Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3)
    suggests that *if* an implementation compares strings used as
    object keys “code unit by code unit”, then it will interoperate
    with *other such implementations*, but neither requires this
    behaviour nor discusses comparisons of strings used in other
    contexts.

[^json-member-ordering]:
    [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
    remarks that “[implementations] differ as to whether or not they
    make the ordering of object members visible to calling software.”

[^json-key-uniqueness]:
    [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4)
    is the only place in the specification that mentions the issue. It
    explicitly sanctions implementations supporting duplicate keys,
    noting only that “when the names within an object are not unique,
    the behavior of software that receives such an object is
    unpredictable.” Implementations are free to choose any behaviour
    at all in this situation, including signalling an error, or
    discarding all but one of a set of duplicates.

[^xml-infoset]: The XML world has the concept of
    [XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
    speaking, XML infoset is the *denotation* of an XML document; the
    *meaning* of the document.

[^other-formats]: Most other recent data languages are like JSON in
    specifying only a syntax with no associated semantics. While some
    do make a sketch of a semantics, the result is often
    underspecified (e.g. in terms of how strings are to be compared),
    overly machine-oriented (e.g. treating 32-bit integers as
    fundamentally distinct from 64-bit integers and from
    floating-point numbers), overly fine (e.g. giving visibility to
    the order in which map entries are written), or all three.

Some examples:

 - are the JSON values `1`, `1.0`, and `1e0` the same or different?
 - are the JSON values `1.0` and `1.0000000000000001` the same or different?
 - are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
   (UTF-8 `7061cc88726f6e`) the same or different?
 - are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
   or different?
 - which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
   same? Are all three legal?
 - are `{"päron":1}` and `{"päron":1}` the same or different?
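To see how one implementation happens to answer these questions, the Python sketch below runs them through the standard `json` module. The answers are specific to that module and to Python's own notion of equality; other implementations are free to answer differently, which is the point being made above.

```python
import json
import unicodedata

# Numbers: Python parses "1" as an int and "1.0" / "1e0" as floats,
# then its == operator treats all three as equal.
numbers_equal = json.loads("1") == json.loads("1.0") == json.loads("1e0")
number_types = (type(json.loads("1")), type(json.loads("1.0")))
close_doubles_equal = json.loads("1.0") == json.loads("1.0000000000000001")

# Strings: compared code point by code point, with no normalization.
composed = json.loads('"p\\u00e4ron"')
decomposed = json.loads('"pa\\u0308ron"')
strings_equal = composed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Objects: key order is ignored, and the last duplicate key wins.
order_ignored = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
duplicates = json.loads('{"a":1, "a":2}')

print(numbers_equal, number_types, close_doubles_equal)
print(strings_equal, equal_after_nfc)
print(order_ignored, duplicates)
```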

### JSON can multiply nicely, but it can't add very well

JSON includes a fixed set of types: numbers, strings, booleans, null,
arrays and string-keyed maps. Domain-specific data must be *encoded*
into these types. For example, dates and email addresses are often
represented as strings with an implicit internal structure.

There is no convention for *labelling* a value as belonging to a
particular category. Instead, JSON-encoded data are often labelled in
an ad-hoc way. Multiple incompatible approaches exist. For example, a
"money" structure containing a `currency` field and an `amount` may be
represented in any number of ways:

    { "_type": "money", "currency": "EUR", "amount": 10 }
    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
    [ "money", { "currency": "EUR", "amount": 10 } ]
    { "@money": { "currency": "EUR", "amount": 10 } }

This causes particular problems when JSON is used to represent *sum*
or *union* types, such as "either a value or an error, but not both".
Again, multiple incompatible approaches exist.

For example, imagine an API for depositing money in an account. The
response might be either a "success" response indicating the new
balance, or one of a set of possible errors.

Sometimes, a *pair* of values is used, with `null` marking the option
not taken.[^interesting-failure-mode]

    { "ok": { "balance": 210 }, "error": null }
    { "ok": null, "error": "Unauthorized" }

[^interesting-failure-mode]: What is the meaning of a document where
    both `ok` and `error` are non-null? What might happen when a
    program is presented with such a document?

The branch not chosen is sometimes present, sometimes omitted as if it
were an optional field:

    { "ok": { "balance": 210 } }
    { "error": "Unauthorized" }

Sometimes, an array of a label and a value is used:

    [ "ok", { "balance": 210 } ]
    [ "error", "Unauthorized" ]

Sometimes, the shape of the data is sufficient to distinguish among
the alternatives, and the label is left implicit:

    { "balance": 210 }
    "Unauthorized"

JSON itself does not offer any guidance for which of these options to
choose. In many real cases on the web, poor choices have led to
encodings that are irrecoverably ambiguous.

# Open questions

Q. Should "symbols" instead be URIs? Relative, usually; relative to
what? Some domain-specific base URI?

Q. Literal small integers: are they pulling their weight? They're not
absolutely necessary.

Q. Should we go for trying to make the data ordering line up with the
encoding ordering? We'd have to only use streaming forms, and avoid
the small integer encoding, and not store record arities, and sort
sets and dictionaries, and mask floats and doubles (perhaps
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
and perhaps pick a specific `NaN`, and I don't know what to do about
SignedIntegers. Perhaps make them more like float formats, with the
byte count acting as a kind of exponent underneath the sign bit.

  - Perhaps define separate additional canonicalization restrictions?
    Doesn't help the ordering, but does help the equivalence.

  - Canonicalization and early-bailout-equivalence-checking are in
    tension with support for streaming values.

Q. The postfix fields in the textual syntax come unannounced: "oh, and
another thing, what you just read is a label, and here are some
fields." This is a problem for interactive reading of textual syntax,
because after a complete term, it needs to see the next character to
tell whether it is an open-parenthesis or not! For this reason, I've
disallowed whitespace between a label `Value` and the open-parenthesis
of the fields. Is this reasonable??

Q. To remain compatible with JSON, portions of the text syntax have to
remain case-insensitive (`%i"..."`). However, non-JSON extensions do
not. There's only one (?) at the moment, the `%i"f"` in `Float`;
should it be changed to case-sensitive?

Q. Should `IOList`s be wrapped in an identifying unary record constructor?

TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`

TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of placeholder number
zero.))

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes