---
---
<style>
  body { padding-top: 2rem; font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", serif; max-width: 40em; margin: auto; font-size: 120%; }
  h1, h2, h3, h4, h5, h6 { margin-left: -1rem; color: #4f81bd; }
  h2 { border-bottom: solid #4f81bd 1px; }
  pre, code { background-color: #eee; }
  pre { padding: 0.33rem; }
</style>

# Preserves: Semantic Serialization of Node-labelled Data

      _________
     <_________>     Tony Garnock-Jones <tonyg@leastfixedpoint.com>
      | FRμIT |      September 2018
     |Preserves|     Version 0.0.2
     \_________/

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map

## Introduction

Most data serialization formats used on the web represent
*edge-labelled* semi-structured data.

This document proposes a data model and serialization format that
takes a *node-labelled* approach.

This makes it both extensible and much more like S-expressions, making
it easily able to represent the *labelled sums of products* as seen in
Rust, Haskell, OCaml, and other functional programming languages.

## Starting with Semantics

Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax. We will treat syntax separately,
later in this document.

       Value = Atom
             | Compound

        Atom = SignedInteger
             | String
             | ByteString
             | Symbol
             | Boolean
             | Float
             | Double

    Compound = Record
             | Sequence
             | Set
             | Dictionary

Our `Value`s fall into two broad categories: *atomic* and *compound*
data.[^zephyr-asdl]

[^zephyr-asdl]: This design was loosely inspired by S-expressions,
    as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others,
    and by the ML type system, as seen in languages such as SML,
    OCaml, Haskell, Rust, and many others. It is also related to
    Zephyr ASDL (h/t
    [Darius Bacon](https://twitter.com/abecedarius/status/993545767884226561)),
    which doesn't offer much in the way of atoms, but offers
    general-purpose labelled sums and products. See D. C. Wang, A. W.
    Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax
    Description Language,” in USENIX Conference on Domain-Specific
    Languages, 1997, pp. 213–228.
    [PDF available.](https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf)

**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:[^ordering-by-syntax]

    (Values)        Compound < Atom

    (Compounds)     Record < Sequence < Set < Dictionary

    (Atoms)         SignedInteger < String < ByteString < Symbol
                  < Boolean < Float < Double

[^ordering-by-syntax]: The observant reader may note that the
    ordering here is the same as that implied by the tagging scheme
    used in the concrete binary syntax for `Value`s.

**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.

<!-- We should avoid unnecessary restrictions such as machine-oriented -->
<!-- fixed-width integer or floating-point values where possible. -->

### Signed integers.

A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers. We will write
examples of `SignedInteger`s using standard mathematical notation.

**Examples.** 10; -6; 0.

**Non-examples.** NaN (the clue is in the name!); ∞ (not finite); 0.2
(not an integer); 1/7 (likewise); 2+*i*3 (likewise); √2 (likewise).

### Unicode strings.

A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. Two
`String`s are compared lexicographically, code-point by
code-point.[^utf8-is-awesome] We will write examples of `String`s as text
surrounded by double-quotes “`"`” using a monospace font.

[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
    gives the same result as a lexicographic byte-by-byte comparison
    of the UTF-8 encoding of a string!

**Examples.** `"Hello world"`, an eleven-code-point string; `"z水𝄞"`,
the string containing the three Unicode code-points `z` (0x7A), `水`
(0x6C34) and `𝄞` (0x1D11E); `""`, the empty string.
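
The footnote's claim about UTF-8 ordering can be checked directly. The following is a minimal sketch in Python (chosen here purely for illustration; the helper name `same_order` and the sample strings are assumptions, not part of the specification). It sorts some strings once code-point by code-point and once by the bytes of their UTF-8 encodings, and compares the two results.

```python
def same_order(strings):
    """Return True if code-point order and UTF-8 byte order agree."""
    # Python compares str values code-point by code-point.
    by_code_point = sorted(strings)
    # bytes values compare lexicographically, byte by byte.
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


# Arbitrary sample strings mixing 1-, 3- and 4-byte UTF-8 sequences.
samples = ["", "z", "水", "𝄞", "z水𝄞", "Hello world", "zz", "水z"]
print(same_order(samples))
```

The same check would not hold in general for UTF-16 code units, which is one reason the comparison is specified in terms of code-points.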

**Normalization forms.** Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text. No
particular normalization form is required for `String`s;
[see below](#normalization-forms).

### Binary data.

A `ByteString` is an ordered sequence of zero or more integers in the
inclusive range [0..255]. `ByteString`s are compared
lexicographically, byte by byte. We will only write examples of
`ByteString`s that contain bytes mapping to printable ASCII
characters, using “`#"`” as an opening quote mark and “`"`” as a
closing quote mark.

**Examples.** The `ByteString` containing the integers 65, 66 and 67
(corresponding to ASCII characters `A`, `B` and `C`) is written as
`#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite
appearances, these are *binary* data.

### Symbols or identifiers.

Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points, intended to represent an identifier
of some kind. `Symbol`s are also compared lexicographically by
code-point. We will write examples including only non-empty sequences
of non-whitespace characters, using a monospace font without quotation
marks.

**Examples.** `hello-world`; `utf8-string`; `exact-integer?`.

### Booleans.

There are exactly two `Boolean` values, “false” and “true”. The
“false” value compares less-than the “true” value. We write `#f` for
“false”, and `#t` for “true”.

**Examples.** `#f`; `#t`.

### IEEE floating-point values.

A `Float` is a single-precision IEEE 754 floating-point value; a
`Double` is a double-precision IEEE 754 floating-point value.
`Float`s, `Double`s and `SignedInteger`s are considered disjoint, and
so by the rules [above](#total-order), every `Float` is less than
every `Double`, and every `SignedInteger` is less than both. Two
`Float`s or two `Double`s are to be ordered by the `totalOrder`
predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
We write examples using standard mathematical notation, avoiding NaN
and infinities, using a suffix `f` or `d` to indicate `Float` or
`Double`, respectively.

**Examples.** 10f; -6d; 0f; 0.5d; -1.202e300d.

**Non-examples.** 10, -6, and 0, because writing them this way
indicates `SignedInteger`s, not `Float`s or `Double`s.

### Records.

A `Record` is a *labelled* tuple of zero or more `Value`s, called the
record's *fields*. A record's label is, itself, a `Value`, though it
will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
are compared lexicographically as if they were just tuples; that is,
first by their labels, and then by the remainder of their fields. We
will only write examples of `Record`s having labels that are `Symbol`s
entirely composed of ASCII characters. Such `Record`s will be written
as a parenthesised, space-separated sequence of their label followed
by their fields.

[^extensibility]: The [Racket](https://racket-lang.org/) programming
    language defines
    [“prefab”](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))
    structure types, which map well to our `Record`s. Racket supports
    record extensibility by encoding record supertypes into record
    labels as specially-formatted lists.

[^iri-labels]: It is occasionally (but seldom) necessary to
    interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
    label can be read as a relative IRI, it is notionally interpreted
    with respect to the IRI
    `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
    be read as an absolute IRI, it stands for that IRI; and otherwise,
    it cannot be read as an IRI at all, and so the label simply stands
    for itself - for its own `Value`.

**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is
written `(foo 1 2 3)`; the `Record` with label `void` and no fields is
written `(void)`.

### Sequences.

A `Sequence` is a general-purpose, variable-length ordered sequence of
zero or more `Value`s. `Sequence`s are compared lexicographically,
appealing to the ordering on `Value`s for comparisons at each position
in the `Sequence`s. We write examples space-separated, surrounded with
square brackets.

**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
`SignedInteger`s 1, 2 and 3.

### Sets.

A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements using the [total order](#total-order) and
comparing the resulting sequences as `Sequence`s. We write examples
space-separated, surrounded with curly braces, prefixed by `#set`.

**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
containing 4, the string `"hello"`, the record with label `void` and
no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
the set containing a `SignedInteger` and a `Float`, both denoting the
number 1; `#set{(mime application/xml #"<x/>") (mime
application/xml #"<x />")}`, a set containing two different
type-labelled byte arrays.[^mime-xml-difference]

[^mime-xml-difference]: The two XML documents `<x/>` and `<x />`
    differ by bytewise comparison, and thus yield different record
    values, even though under the semantics of XML they denote
    the same XML infoset.

**Non-examples.** `#set{1 1 1}`, because it contains multiple
equivalent `Value`s.

### Dictionaries, hash-tables or maps.

A `Dictionary` is an unordered finite collection of zero or more pairs
of `Value`s. Each pair comprises a *key* and a *value*. Keys in a
`Dictionary` must be pairwise distinct. Instances of `Dictionary` are
compared by lexicographic comparison of the sequences resulting from
ordering each `Dictionary`'s pairs in ascending order by key. Examples
are written as a `#dict`-prefixed, curly-brace-surrounded sequence of
space-separated key-value pairs, each written with a colon between the
key and value.

**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the
dictionary mapping the `Symbol` `a` to the `SignedInteger` 1;
`#dict{1:a}`, mapping 1 to `a`; `#dict{"hi":0 hi:0 there:[]}`, having
a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
values.

**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate
keys; `#dict{[]:[] []:99}`, for the same reason.

## Syntax

Now that we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.

The syntax we have used for the examples so far is inadequate in many
ways, not least of which is that it cannot represent every `Value`.

Separation of the meaning of a piece of syntax from the syntax itself
opens the door to domain-specific syntaxes, all equivalent and
interconvertible.[^asn1] With a robust semantic foundation,
connections to other data languages can also be made.

[^asn1]: Those who remember
    [ASN.1](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx)
    will recall BER, DER, PER, CER, XER and so on, each appropriate to
    a different setting. Similarly,
    [Rivest's S-Expression design][sexp.txt] offers a human-friendly
    syntax, a syntax robust to network-induced message corruption, and
    an unambiguous, simple and easily-parsed machine-friendly syntax
    for the same underlying values.

### Binary syntax

For now, we limit our attention to an easily-parsed, easily-produced
machine-readable syntax.

A `Repr` is an encoding, or representation, of a specific `Value`.
Each `Repr` comprises one or more bytes describing first the kind of
represented `Value` and the length of the representation, and then the
encoded details of the `Value` itself.

For a value `v`, we write `[[v]]` for the `Repr` of `v`.

The following figure summarises the definitions below:

    tt nn mmmm  varint(m)  contents
    -------------------------------

    00 00 0000  False
    00 00 0001  True
    00 00 0010  Float, 32 bits big-endian binary
    00 00 0011  Double, 64 bits big-endian binary
    00 00 x1xx  RESERVED
    00 00 1xxx  RESERVED
    00 01 xxxx  RESERVED
    00 10 ttnn  Start Stream <tt,nn>
                  When tt = 00 --> error
                            01 --> each chunk is a <tt,nn> piece
                            1x --> each chunk is a single encoded Value
    00 11 ttnn  End Stream <tt,nn> (must match preceding Start Stream)

    01 00 mmmm  ...  SignedInteger, big-endian binary
    01 01 mmmm  ...  String, UTF-8 binary
    01 10 mmmm  ...  Bytes
    01 11 mmmm  ...  Symbol, UTF-8 binary

    10 00 mmmm  ...  application-specific Record
    10 01 mmmm  ...  application-specific Record
    10 10 mmmm  ...  application-specific Record
    10 11 mmmm  ...  Record

    11 00 mmmm  ...  Sequence
    11 01 mmmm  ...  Set
    11 10 mmmm  ...  Dictionary
    11 11 xxxx  RESERVED

    If mmmm = 1111, varint(m) is present; otherwise, m is the length

#### Type and Length representation

Each `Repr` takes one of three possible forms:

- (A) a fixed-length form, used for simple values such as `Boolean`s
  or `Float`s.

- (B) a variable-length form with length specified up-front, used for
  almost all `Record`s as well as for most `Sequence`s and `String`s,
  when their sizes are known at the time serialization begins.

- (C) a variable-length streaming form with unknown or unpredictable
  length, used only seldom for `Record`s, since the number of fields
  in a `Record` is usually statically known, but sometimes used for
  `Sequence`s, `String`s etc., such as in cases when serialization
  begins before the number of elements or bytes in the corresponding
  `Value` is known.

Applications may choose between formats (B) and (C) depending on their
needs at serialization time.

Every `Repr`, however, starts with a *lead byte* describing the
remainder of the representation.

##### The lead byte

The lead byte is constructed by a function `leadbyte`:

    leadbyte(t,n,m) = [t*64 + n*16 + m]

Both `t` and `n` are two-bit unsigned numbers; `m` is a four-bit
unsigned number.

The lead byte describes the rest of the representation as
follows:[^some-encodings-unused]

[^some-encodings-unused]: Some encodings are unused. All such
    encodings are reserved for future versions of this specification.

- `leadbyte(0,0,-)` (format A) represents an Atom with fixed-length binary representation.
- `leadbyte(0,1,-)` (format A) is RESERVED.
- `leadbyte(0,2,-)` (format C) is a Stream Start byte.
- `leadbyte(0,3,-)` (format C) is a Stream End byte.
- `leadbyte(1,-,-)` (format B) represents an Atom with variable-length binary representation.
- `leadbyte(2,-,-)` (format B) represents a Record.
- `leadbyte(3,-,-)` (format B) represents a Sequence, Set or Dictionary.
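
The bit layout of the lead byte can be sketched as follows. This is an illustrative Python rendering of `leadbyte` only; the inverse helper `split_leadbyte` is a hypothetical name introduced here and does not appear in the specification.

```python
from typing import Tuple


def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack two 2-bit fields (t, n) and one 4-bit field (m) into one byte."""
    if not (0 <= t <= 3 and 0 <= n <= 3 and 0 <= m <= 15):
        raise ValueError("t and n must fit in 2 bits, m in 4 bits")
    return bytes([t * 64 + n * 16 + m])


def split_leadbyte(b: int) -> Tuple[int, int, int]:
    """Hypothetical inverse of leadbyte: recover (t, n, m) from a byte."""
    return (b >> 6) & 0b11, (b >> 4) & 0b11, b & 0b1111


print(leadbyte(2, 1, 3).hex())   # Record, short form label 1, three fields
print(split_leadbyte(0x55))      # String of five bytes
```

A decoder would typically dispatch on `t` first and then on `n`, mirroring the list above.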

##### Encoding data of fixed length (format A)

Each specific type of data defines its own rules for this format.

##### Encoding data of known length (format B)

A `Repr` where the length of the `Value` to be encoded is variable but
known uses the value of `m` in `leadbyte` to encode its length. The
length counts *bytes* for atomic `Value`s, but counts *contained
values* for compound `Value`s.

- A length `l` between 0 and 14 is represented using `leadbyte` with
  `m=l`.
- A length of 15 or greater is represented by `m=15` and additional
  bytes describing the length following the lead byte.

The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:

    header(t,n,m) =    leadbyte(t,n,m)                when m < 15
                    or leadbyte(t,n,15) ++ varint(m)  otherwise

The additional length bytes are formatted as
[base 128 varints][varint]. We write `varint(m)` for the
varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint]
definition,

> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.

**Examples.**

- The varint representation of 15 is just the byte 15.
- 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
- 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
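
The `varint` and `header` definitions translate almost line for line into code. The following Python sketch is one possible rendering, assuming non-negative lengths only; it is meant as an illustration rather than a reference implementation.

```python
def varint(m: int) -> bytes:
    """Base-128 varint: 7 bits per byte, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # more groups follow: set the msb
        else:
            out.append(group)         # last group: msb clear
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte alone when m < 15, else lead byte with m=15 plus varint(m)."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)


print(list(varint(300)))          # the two-byte example from the text
print(header(1, 1, 300).hex())    # String header for a 300-byte payload
```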

##### Streaming data of unknown length (format C)

A `Repr` where the length of the `Value` to be encoded is variable and
not known at the time serialization of the `Value` starts is encoded
by a single Stream Start byte, followed by zero or more *chunks*,
followed by a matching Stream End byte:

    startbyte(t,n) = leadbyte(0,2, t*4 + n)
    endbyte(t,n)   = leadbyte(0,3, t*4 + n)

For a `Repr` of a `Value` containing binary data, each chunk is to be
a format B `Repr` of the same type as the overall `Repr`.

For a `Repr` of a `Value` containing other `Value`s, each chunk is to
be a single `Repr`.
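
As a sketch of the two stream-delimiting bytes, the Python below computes `startbyte` and `endbyte` from the formulas above. Rejecting `t = 0` is an assumption drawn from the summary figure, which marks a stream of type `00` as an error.

```python
def _stream_byte(kind: int, t: int, n: int) -> bytes:
    """kind is 2 for Stream Start and 3 for Stream End."""
    if not (1 <= t <= 3 and 0 <= n <= 3):
        # t = 0 would describe a fixed-length atom, which cannot be streamed.
        raise ValueError("t must be in 1..3 and n in 0..3")
    return bytes([0 * 64 + kind * 16 + (t * 4 + n)])


def startbyte(t: int, n: int) -> bytes:
    return _stream_byte(2, t, n)


def endbyte(t: int, n: int) -> bytes:
    return _stream_byte(3, t, n)


# A streamed String is bracketed by startbyte(1,1) and endbyte(1,1).
print(startbyte(1, 1).hex(), endbyte(1, 1).hex())
```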

#### Records

Format B (known length):

    [[ (L F_1 ... F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]

For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.

Format C (streaming):

    [[ (L F_1 ... F_m) ]]
      = startbyte(2,3) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,3)

Applications *SHOULD* prefer the known-length format for encoding
`Record`s.

##### Application-specific short form for labels

Any given protocol using Preserves may additionally define an
interpretation for `n ∈ {0,1,2}`, mapping each *short form label
number* `n` to a specific record label. When encoding `m` fields with
short form label number `n`, format B becomes

    header(2,n,m) ++ [[F_1]] ++ ... ++ [[F_m]]

and format C becomes

    startbyte(2,n) ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,n)

**Examples.** A protocol may choose to map records
labelled `void` to `n=0`, making

    [[(void)]] = header(2,0,0) = [0x80]

or it may map records labelled `person` to short form label number 1,
making

    [[(person "Dr" "Elizabeth" "Blackwell")]]
      = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
      = [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

for format B, or

      = startbyte(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ endbyte(2,1)
      = [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]

for format C.
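
The format B record rules, with and without a short form label, might be sketched as below. The function `record` and its keyword arguments are hypothetical names chosen for this illustration; fields and labels are passed in already encoded, and the mapping of short form numbers 0, 1 and 2 to `discard`, `capture` and `observe` follows the example application described in the Examples section.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        group, m = m & 0x7F, m >> 7
        out.append(group | 0x80 if m else group)
        if not m:
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)


def record(fields, label=None, short=None) -> bytes:
    """Format B Record. Give either an encoded label or a short form number."""
    if (label is None) == (short is None):
        raise ValueError("give exactly one of label or short")
    body = b"".join(fields)
    if short is not None:
        if short not in (0, 1, 2):
            raise ValueError("short form label numbers are 0, 1 or 2")
        return header(2, short, len(fields)) + body   # label is implied
    return header(2, 3, len(fields) + 1) + label + body  # +1 for the label


discard = record([], short=0)
capture_discard = record([discard], short=1)
speak = bytes.fromhex("75737065616b")  # the Symbol `speak`, already encoded
observe = record([record([discard, capture_discard], label=speak)], short=2)
print(observe.hex(" "))
```

Passing pre-encoded bytes keeps the sketch independent of how atoms are encoded.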

#### Sequences, Sets and Dictionaries

Format B (known length):

    [[ [X_1 ... X_m] ]] = header(3,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]

    [[ #set{X_1 ... X_m} ]] = header(3,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]

    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
      = header(3,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]

Format C (streaming):

    [[ [X_1 ... X_m] ]] = startbyte(3,0) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,0)

    [[ #set{X_1 ... X_m} ]] = startbyte(3,1) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,1)

    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
      = startbyte(3,2) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]] ++ endbyte(3,2)

Applications may use whichever format suits their needs on a
case-by-case basis.
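
To make the difference between the two formats concrete, here is a small Python sketch that encodes a `Sequence` both ways and a `Dictionary` in format B. The function names are assumptions made for this example, elements are supplied already encoded, and only lengths below 15 are handled so that every header is a single byte. The dictionary header follows the formula above literally, with `m` counting key-value pairs.

```python
def header(t: int, n: int, m: int) -> bytes:
    """Single-byte header; this sketch only handles lengths below 15."""
    if not 0 <= m < 15:
        raise ValueError("sketch limited to lengths 0..14")
    return bytes([t * 64 + n * 16 + m])


def sequence_b(items) -> bytes:
    """Format B: the element count is written up-front."""
    return header(3, 0, len(items)) + b"".join(items)


def sequence_c(items) -> bytes:
    """Format C: elements are bracketed by startbyte(3,0) and endbyte(3,0)."""
    start = bytes([2 * 16 + (3 * 4 + 0)])
    end = bytes([3 * 16 + (3 * 4 + 0)])
    return start + b"".join(items) + end


def dictionary_b(pairs) -> bytes:
    """Format B Dictionary, with m counting pairs as in the formula above."""
    return header(3, 2, len(pairs)) + b"".join(k + v for k, v in pairs)


ints = [bytes([0x41, i]) for i in (1, 2, 3, 4)]  # encoded 1, 2, 3 and 4
print(sequence_b(ints).hex(" "))
print(sequence_c(ints).hex(" "))
```

Format C costs one extra byte here but never needs the element count in advance.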

There is *no* ordering requirement on the `X_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order.

[^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s, because (a)
    where canonicalization is used for cryptographic signatures, it is
    more reliable to simply retain the exact binary form of the signed
    document than to depend on canonical de- and re-serialization, and
    (b) sorting keys or elements makes no sense in streaming
    serialization formats.

Note that `header(3,3,m)` and `startbyte(3,3)`/`endbyte(3,3)` are unused and reserved.

#### Variable-length Atoms

##### SignedInteger

Format B (known length):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)
      where m = |intbytes(x)|
      and intbytes(x) = a big-endian two's-complement representation
                        of the signed integer x, taking exactly as
                        many whole bytes as needed to unambiguously
                        identify the value

Format C *MUST NOT* be used for `SignedInteger`s.

The value 0 needs zero bytes to identify the value, so `intbytes(0)`
is the empty byte string. Non-zero values need at least one byte; the
most-significant bit in the first byte in `intbytes(x)` for `x≠0` is
the sign bit.

For example,

    [[   -257 ]] = [0x42, 0xFE, 0xFF]
    [[   -256 ]] = [0x42, 0xFF, 0x00]
    [[   -255 ]] = [0x42, 0xFF, 0x01]
    [[   -254 ]] = [0x42, 0xFF, 0x02]
    [[   -129 ]] = [0x42, 0xFF, 0x7F]
    [[   -128 ]] = [0x41, 0x80]
    [[   -127 ]] = [0x41, 0x81]
    [[     -2 ]] = [0x41, 0xFE]
    [[     -1 ]] = [0x41, 0xFF]
    [[      0 ]] = [0x40]
    [[      1 ]] = [0x41, 0x01]
    [[    127 ]] = [0x41, 0x7F]
    [[    128 ]] = [0x42, 0x00, 0x80]
    [[    255 ]] = [0x42, 0x00, 0xFF]
    [[    256 ]] = [0x42, 0x01, 0x00]
    [[  32767 ]] = [0x42, 0x7F, 0xFF]
    [[  32768 ]] = [0x43, 0x00, 0x80, 0x00]
    [[  65535 ]] = [0x43, 0x00, 0xFF, 0xFF]
    [[  65536 ]] = [0x43, 0x01, 0x00, 0x00]
    [[ 131072 ]] = [0x43, 0x02, 0x00, 0x00]
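
The "exactly as many whole bytes as needed" rule can be made precise in code. The sketch below is one way to compute `intbytes` in Python, using `int.bit_length` and `int.to_bytes`; the wrapper name `encode_int` is an assumption made for this example.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        group, m = m & 0x7F, m >> 7
        out.append(group | 0x80 if m else group)
        if not m:
            return bytes(out)


def intbytes(x: int) -> bytes:
    """Minimal big-endian two's-complement bytes; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude, plus one more for the sign bit.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    nbytes = (magnitude_bits + 1 + 7) // 8
    return x.to_bytes(nbytes, "big", signed=True)


def encode_int(x: int) -> bytes:
    """header(1,0,m) ++ intbytes(x), with m the byte count."""
    body = intbytes(x)
    m = len(body)
    if m < 15:
        return bytes([1 * 64 + 0 * 16 + m]) + body
    return bytes([1 * 64 + 0 * 16 + 15]) + varint(m) + body


for value in (-257, -128, 0, 127, 128, 32768):
    print(value, encode_int(value).hex(" "))
```

Because the integers are of arbitrary width, the varint branch is reachable: any value needing 15 or more bytes uses it.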

##### String

Format B (known length):

    [[ S ]] when S ∈ String = header(1,1,m) ++ utf8(S)
      where m = |utf8(S)|
      and utf8(S) = the UTF-8 encoding of S

To stream a `String`, emit `startbyte(1,1)` and then a sequence of
zero or more format B `String` chunks, followed by `endbyte(1,1)`.

While the overall content of a streamed `String` must be valid UTF-8,
individual chunks do not have to conform to UTF-8; a chunk boundary
may, for instance, fall in the middle of a multi-byte code-point.
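
A streamed `String` can be sketched as follows. The function `stream_string` and its fixed `chunk_size` parameter are assumptions made for this illustration; a real producer could cut chunks wherever it likes. Chunk sizes are limited to 14 bytes so that each chunk header stays a single lead byte.

```python
def stream_string(s: str, chunk_size: int) -> bytes:
    """Format C String: startbyte(1,1), format B chunks, endbyte(1,1)."""
    if not 1 <= chunk_size <= 14:
        raise ValueError("this sketch keeps every chunk header to one byte")
    data = s.encode("utf-8")
    out = bytearray([0x25])                         # startbyte(1,1)
    for i in range(0, len(data), chunk_size):
        piece = data[i:i + chunk_size]
        out.append(1 * 64 + 1 * 16 + len(piece))    # header(1,1,len(piece))
        out += piece                                # may split a code-point
    out.append(0x35)                                # endbyte(1,1)
    return bytes(out)


print(stream_string("hello", 2).hex(" "))
# The three UTF-8 bytes of 水 end up split across two chunks.
print(stream_string("水", 2).hex(" "))
```

A consumer therefore has to concatenate the chunk payloads before decoding them as UTF-8.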

##### ByteString

Format B (known length):

    [[ B ]] when B ∈ ByteString = header(1,2,m) ++ B
      where m = |B|

To stream a `ByteString`, emit `startbyte(1,2)` and then a sequence of
zero or more format B `ByteString` chunks, followed by `endbyte(1,2)`.

##### Symbol

Format B (known length):

    [[ S ]] when S ∈ Symbol = header(1,3,m) ++ utf8(S)
      where m = |utf8(S)|
      and utf8(S) = the UTF-8 encoding of S

To stream a `Symbol`, emit `startbyte(1,3)` and then a sequence of
zero or more format B `Symbol` chunks, followed by `endbyte(1,3)`.

#### Fixed-length Atoms

Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with
`startbyte(0,n)` or `endbyte(0,n)` for any `n`.

##### Booleans

    [[ #f ]] = header(0,0,0) = [0x00]
    [[ #t ]] = header(0,0,1) = [0x01]

##### Floats and Doubles

    [[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
      where binary32(F) and binary64(D) are big-endian 4- and 8-byte
            IEEE 754 binary representations
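
The fixed-length encodings for `Float` and `Double` can be illustrated with Python's standard `struct` module, whose `">f"` and `">d"` formats produce big-endian IEEE 754 binary32 and binary64 values. The function names below are assumptions made for this sketch.

```python
import struct


def encode_float(f: float) -> bytes:
    """header(0,0,2) followed by the big-endian binary32 form."""
    return bytes([0x02]) + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """header(0,0,3) followed by the big-endian binary64 form."""
    return bytes([0x03]) + struct.pack(">d", d)


print(encode_float(1.0).hex(" "))
print(encode_double(1.0).hex(" "))
print(encode_double(-1.202e300).hex(" "))
```

Note that packing a Python `float` with `">f"` rounds it to single precision, so `encode_float` is only exact for values representable as binary32.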
|
|
|
|
|
|
|
|
|
|
## Examples
|
|
|
|
|
|
|
|
|
|
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
|
|
|
|
<!-- translated from various JSON blobs floating around the internet. -->
|
|
|
|
|
|
|
|
|
|
For the following examples, imagine an application that maps `Record`
|
|
|
|
|
short form label number 0 to label `discard`, 1 to `capture`, and 2 to
|
|
|
|
|
`observe`.
|
|
|
|
|
|
|
|
|
|
| Value | Encoded hexadecimal byte sequence |
|
|
|
|
|
|--------------------------------------------------------------------|----------------------------------------------------|
|
2018-09-23 21:35:00 +00:00
|
|
|
|
| `(capture (discard))` | 91 80 |
|
|
|
|
|
| `(observe (speak (discard) (capture (discard))))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
|
|
|
|
|
| `[1 2 3 4]` (format B) | C4 41 01 41 02 41 03 41 04 |
|
|
|
|
|
| `[1 2 3 4]` (format C) | 2C 41 01 41 02 41 03 41 04 3C |
|
|
|
|
|
| `[-2 -1 0 1]` | C4 41 FE 41 FF 40 41 01 |
|
|
|
|
|
| `"hello"` (format B) | 55 68 65 6C 6C 6F |
|
|
|
|
|
| `"hello"` (format C, 2 chunks) | 25 52 68 65 53 6C 6C 6F 35 |
|
|
|
|
|
| `"hello"` (format C, 5 chunks) | 25 52 68 65 52 6C 6C 50 50 51 6F 35 |
|
|
|
|
|
| `["hello" there #"world" [] #set{} #t #f]`                         | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
|
|
|
|
|
| `-257` | 42 FE FF |
|
|
|
|
|
| `-1` | 41 FF |
|
|
|
|
|
| `0` | 40 |
|
|
|
|
|
| `1` | 41 01 |
|
|
|
|
|
| `255` | 42 00 FF |
|
|
|
|
|
| `1f` | 02 3F 80 00 00 |
|
|
|
|
|
| `1d` | 03 3F F0 00 00 00 00 00 00 |
|
|
|
|
|
| `-1.202e300d` | 03 FE 3C B7 B7 59 BF 04 26 |
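
The `SignedInteger` rows above can be reproduced with the following Python sketch. It assumes, based only on the table, that integers are written as minimal-length big-endian two's complement and that `0` has an empty body. The function name is hypothetical, and integers needing 15 or more bytes are left out.

```python
def encode_signed_integer(n: int) -> bytes:
    if n == 0:
        body = b""  # the table encodes 0 as the bare lead byte 40
    else:
        # Bits needed for the value itself, plus one sign bit.
        value_bits = n.bit_length() if n > 0 else (~n).bit_length()
        nbytes = (value_bits + 1 + 7) // 8
        body = n.to_bytes(nbytes, "big", signed=True)
    if len(body) >= 15:
        raise NotImplementedError("long integers need the extended length form")
    return bytes([0x40 | len(body)]) + body


for n in (-257, -1, 0, 1, 255):
    print(n, encode_signed_integer(n).hex())
```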
|
|
|
|
|
|
|
|
|
|
Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record`
|
2018-09-23 13:37:20 +00:00
|
|
|
|
|
|
|
|
|
([titled person 2 thing 1]
|
|
|
|
|
101
|
|
|
|
|
"Blackwell"
|
|
|
|
|
(date 1821 2 3)
|
|
|
|
|
"Dr")
|
|
|
|
|
|
|
|
|
|
encodes to
|
|
|
|
|
|
2018-09-23 21:35:00 +00:00
|
|
|
|
B5 ;; Record, generic, 4+1
|
|
|
|
|
C5 ;; Sequence, 5
|
|
|
|
|
76 74 69 74 6C 65 64 ;; Symbol, "titled"
|
|
|
|
|
76 70 65 72 73 6F 6E ;; Symbol, "person"
|
|
|
|
|
41 02 ;; SignedInteger, "2"
|
|
|
|
|
75 74 68 69 6E 67 ;; Symbol, "thing"
|
|
|
|
|
41 01 ;; SignedInteger, "1"
|
|
|
|
|
41 65 ;; SignedInteger, "101"
|
|
|
|
|
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
|
|
|
|
|
B4 ;; Record, generic, 3+1
|
|
|
|
|
74 64 61 74 65 ;; Symbol, "date"
|
|
|
|
|
42 07 1D ;; SignedInteger, "1821"
|
|
|
|
|
41 02 ;; SignedInteger, "2"
|
|
|
|
|
41 03 ;; SignedInteger, "3"
|
|
|
|
|
52 44 72 ;; String, "Dr"
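
The byte listing above can be cross-checked with a small recursive encoder. The Python sketch below is only an illustration: the `Symbol` and `Record` classes and the `lead` and `encode` helpers are hypothetical names, only the value kinds used in this example are handled, and counts of 15 or more are not supported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Record:
    label: object
    fields: tuple


def lead(kind: int, count: int) -> bytes:
    # kind is the high nibble from the table of lead byte values
    if count >= 15:
        raise NotImplementedError("counts >= 15 need the extended length form")
    return bytes([(kind << 4) | count])


def encode(v) -> bytes:
    if isinstance(v, Record):
        items = [v.label, *v.fields]  # a generic record counts its label too
        return lead(0xB, len(items)) + b"".join(encode(i) for i in items)
    if isinstance(v, list):
        return lead(0xC, len(v)) + b"".join(encode(i) for i in v)
    if isinstance(v, Symbol):
        data = v.name.encode("utf-8")
        return lead(0x7, len(data)) + data
    if isinstance(v, str):
        data = v.encode("utf-8")
        return lead(0x5, len(data)) + data
    if isinstance(v, int) and not isinstance(v, bool):
        bits = v.bit_length() if v >= 0 else (~v).bit_length()
        body = v.to_bytes((bits + 8) // 8, "big", signed=True) if v else b""
        return lead(0x4, len(body)) + body
    raise TypeError(f"unsupported value: {v!r}")


S = Symbol
blackwell = Record(
    [S("titled"), S("person"), 2, S("thing"), 1],
    (101, "Blackwell", Record(S("date"), (1821, 2, 3)), "Dr"),
)
print(encode(blackwell).hex())
```

The printed hexadecimal string should match the listing above when read top to bottom.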
|
2018-09-23 13:37:20 +00:00
|
|
|
|
|
|
|
|
|
[^extensibility2]: It happens to line up with Racket's
|
|
|
|
|
representation of a record label for an inheritance hierarchy
|
|
|
|
|
where `titled` extends `person` extends `thing`:
|
|
|
|
|
|
|
|
|
|
(struct date (year month day) #:prefab)
|
|
|
|
|
(struct thing (id) #:prefab)
|
|
|
|
|
(struct person thing (name date-of-birth) #:prefab)
|
|
|
|
|
(struct titled person (title) #:prefab)
|
|
|
|
|
|
|
|
|
|
## Conventions for Common Data Types
|
|
|
|
|
|
|
|
|
|
The `Value` data type is essentially an S-Expression, able to
|
|
|
|
|
represent semi-structured data over `ByteString`, `String`,
|
|
|
|
|
`SignedInteger` atoms and so on.
|
|
|
|
|
|
|
|
|
|
However, users need a wide variety of data types for representing
|
|
|
|
|
domain-specific values such as various kinds of encoded and normalized
|
|
|
|
|
text, calendrical values, machine words, and so on.
|
|
|
|
|
|
|
|
|
|
We use appropriately-labelled `Record`s to denote these
|
|
|
|
|
domain-specific data types.
|
|
|
|
|
|
|
|
|
|
All of these conventions are optional. They form a layer atop the core
|
|
|
|
|
`Value` structure. Non-domain-specific tools do not in general need to
|
|
|
|
|
treat them specially.
|
|
|
|
|
|
|
|
|
|
**Validity.** Many of the labels we will describe in this section come
|
|
|
|
|
with side-conditions on the contents of labelled `Record`s. It is
|
|
|
|
|
possible to construct an instance of `Value` that violates these
|
|
|
|
|
side-conditions without ceasing to be a `Value` or becoming
|
|
|
|
|
unrepresentable. However, we say that such a `Value` is *invalid*
|
|
|
|
|
because it fails to honour the necessary side-conditions.
|
|
|
|
|
Implementations *SHOULD* allow two modes of working: one which
|
|
|
|
|
treats all `Value`s identically, without regard for side-conditions,
|
|
|
|
|
and one which enforces validity (i.e. side-conditions) when reading,
|
|
|
|
|
writing, or constructing `Value`s.
|
|
|
|
|
|
2018-09-23 17:14:58 +00:00
|
|
|
|
### MIME-type tagged binary data
|
|
|
|
|
|
|
|
|
|
Many internet protocols use
|
|
|
|
|
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
|
|
|
|
|
to indicate the format of some associated binary data. For this
|
|
|
|
|
purpose, we define `MIMEData` to be a record labelled `mime` with two
|
|
|
|
|
fields, the first being a `Symbol`, the media type, and the second
|
|
|
|
|
being a `ByteString`, the binary data.
|
|
|
|
|
|
|
|
|
|
While each media type may define its own rules for comparing
|
|
|
|
|
documents, we define ordering among `MIMEData` *representations* of
|
|
|
|
|
such media types lexicographically over the (`Symbol`, `ByteString`)
|
|
|
|
|
pair.
|
|
|
|
|
|
|
|
|
|
**Examples.**
|
|
|
|
|
|
2018-09-23 21:35:00 +00:00
|
|
|
|
| `(mime application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
|
|
|
|
|
| `(mime text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
|
|
|
|
|
| `(mime application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
|
|
|
|
|
| `(mime text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
|
2018-09-23 17:14:58 +00:00
|
|
|
|
|
|
|
|
|
Applications making heavy use of `mime` records may choose to use a
|
|
|
|
|
short form label number for the record type. For example, if short
|
|
|
|
|
form label number 1 were chosen, the second example above, `(mime
|
2018-09-23 21:35:00 +00:00
|
|
|
|
text/plain #"ABC")`, would be encoded with "92" in place of "B3 74 6D
|
2018-09-23 17:14:58 +00:00
|
|
|
|
69 6D 65".
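
The difference between the generic and short form encodings of a `mime` record can be sketched in Python as follows. The helper names are hypothetical, and the short form layout (label number in the `n` bits, with only the fields counted) is an assumption inferred from the examples in this document.

```python
def varint(n: int) -> bytes:
    """Base-128 varint, least significant 7-bit group first."""
    out = bytearray()
    while True:
        low7, n = n & 0x7F, n >> 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    # Assumed: lengths of 15 or more are sent as m = 15 followed by a varint.
    if m < 15:
        return bytes([(t << 6) | (n << 4) | m])
    return bytes([(t << 6) | (n << 4) | 15]) + varint(m)


def encode_mime(media_type: str, data: bytes, short_label=None) -> bytes:
    symbol = media_type.encode("utf-8")
    fields = header(1, 3, len(symbol)) + symbol + header(1, 2, len(data)) + data
    if short_label is None:
        # Generic record: the count covers the label plus the two fields.
        return header(2, 3, 3) + header(1, 3, 4) + b"mime" + fields
    # Short form: the label number takes the place of n; only fields are counted.
    return header(2, short_label, 2) + fields


print(encode_mime("text/plain", b"ABC").hex())
print(encode_mime("text/plain", b"ABC", short_label=1).hex())
```

The second line of output is five bytes shorter than the first, which is the saving described above.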
|
|
|
|
|
|
2018-09-23 13:37:20 +00:00
|
|
|
|
### Text
|
|
|
|
|
|
|
|
|
|
#### Normalization forms
|
|
|
|
|
|
|
|
|
|
In order for users to unambiguously signal or require a particular
|
|
|
|
|
[normalization form](http://unicode.org/reports/tr15/), we define a
|
|
|
|
|
`NormalizedString`, which is a `Record` labelled with
|
|
|
|
|
`unicode-normalization` and having two fields, the first of which is a
|
|
|
|
|
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
|
|
|
|
|
`nfkc`, `nfkd`), and the second of which is a `String` whose
|
|
|
|
|
underlying code point representation *MUST* be normalized according to
|
|
|
|
|
the named normalization form.
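
As an illustration of this side-condition, a validity check for the two fields of a `NormalizedString` might look like the Python sketch below, which relies on the standard `unicodedata` module. The function name, and the choice to treat unknown form names as invalid, are assumptions.

```python
import unicodedata

FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form: str, text: str) -> bool:
    """True if `text` is already in the normalization form named by `form`."""
    if form not in FORMS:
        return False  # assumption: unrecognised form names fail validation
    return unicodedata.normalize(FORMS[form], text) == text


composed = "p\u00e4ron"     # "ä" as a single code point
decomposed = "pa\u0308ron"  # "a" followed by a combining diaeresis

print(is_valid_normalized_string("nfc", composed))    # True
print(is_valid_normalized_string("nfc", decomposed))  # False
print(is_valid_normalized_string("nfd", decomposed))  # True
```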
|
|
|
|
|
|
|
|
|
|
#### IRIs (URIs, URLs, URNs, etc.)
|
|
|
|
|
|
|
|
|
|
An `IRI` is a `Record` labelled with `iri` and having one field, a
|
|
|
|
|
`String` which is the IRI itself and which *MUST* be a valid absolute
|
|
|
|
|
or relative IRI.
|
|
|
|
|
|
|
|
|
|
### Machine words
|
|
|
|
|
|
|
|
|
|
The definition of `SignedInteger` captures all integers. However, in
|
|
|
|
|
certain circumstances it can be valuable to assert that a number
|
|
|
|
|
inhabits a particular range, such as a fixed-width machine word.
|
|
|
|
|
|
|
|
|
|
A family of labels `i`*n* and `u`*n* for *n* ∈ {16,32,64} denotes
|
|
|
|
|
*n*-bit-wide signed and unsigned range restrictions, respectively.
|
|
|
|
|
Records with these labels *MUST* have one field, a `SignedInteger`,
|
|
|
|
|
which *MUST* fall within the appropriate range. That is, to be valid,
|
|
|
|
|
- in `(i16 `*x*`)`, -32768 <= *x* <= 32767.
|
|
|
|
|
- in `(u16 `*x*`)`, 0 <= *x* <= 65535.
|
|
|
|
|
- in `(i32 `*x*`)`, -2147483648 <= *x* <= 2147483647.
|
|
|
|
|
- etc.
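
The range side-conditions can be computed from the label itself. The following Python sketch shows one way to do so; the function names are hypothetical.

```python
def word_range(label: str) -> range:
    """Permitted values for a machine-word label such as "i16" or "u64"."""
    signedness, width = label[:1], label[1:]
    if signedness not in ("i", "u") or width not in ("16", "32", "64"):
        raise ValueError(f"not a machine-word label: {label!r}")
    bits = int(width)
    if signedness == "i":
        return range(-(1 << (bits - 1)), 1 << (bits - 1))
    return range(0, 1 << bits)


def is_valid_word(label: str, x) -> bool:
    # The single field must be an integer that lies within the label's range.
    return isinstance(x, int) and not isinstance(x, bool) and x in word_range(label)


print(is_valid_word("i16", -32768))  # True
print(is_valid_word("i16", 32768))   # False
print(is_valid_word("u16", 65535))   # True
```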
|
|
|
|
|
|
|
|
|
|
### Anonymous Tuples and Unit
|
|
|
|
|
|
|
|
|
|
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
|
|
|
|
|
denoting an anonymous tuple of values.
|
|
|
|
|
|
|
|
|
|
The 0-ary tuple, `(tuple)`, denotes the empty tuple, sometimes called
|
|
|
|
|
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).
|
|
|
|
|
|
|
|
|
|
### Null and Undefined
|
|
|
|
|
|
|
|
|
|
Tony Hoare's
|
|
|
|
|
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
|
|
|
|
|
can be represented with the 0-ary `Record` `(null)`. An "undefined"
|
|
|
|
|
value can be represented as `(undefined)`.
|
|
|
|
|
|
|
|
|
|
### Dates and Times
|
|
|
|
|
|
|
|
|
|
Dates, times, moments, and timestamps can be represented with a
|
|
|
|
|
`Record` with label `rfc3339` having a single field, a `String`, which
|
|
|
|
|
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
|
|
|
|
|
or `date-time` productions of
|
|
|
|
|
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
|
|
|
|
|
|
|
|
|
|
## Representing Values in Programming Languages
|
|
|
|
|
|
|
|
|
|
We have given a definition of `Value` and its semantics, and proposed
|
|
|
|
|
a concrete syntax for communicating and storing `Value`s. We now turn
|
|
|
|
|
to **suggested** representations of `Value`s as *programming-language
|
|
|
|
|
values* for various programming languages.
|
|
|
|
|
|
|
|
|
|
When designing a language mapping, an important consideration is
|
|
|
|
|
roundtripping: serialization after deserialization, and vice versa,
|
|
|
|
|
should both be identities.
|
|
|
|
|
|
|
|
|
|
### JavaScript
|
|
|
|
|
|
|
|
|
|
- `SignedInteger` ↔ numbers or `BigInt` [[1](https://developers.google.com/web/updates/2018/05/bigint), [2](https://github.com/tc39/proposal-bigint)]
|
|
|
|
|
- `String` ↔ strings
|
|
|
|
|
- `ByteString` ↔ `Uint8Array`
|
|
|
|
|
- `Symbol` ↔ `Symbol.for(...)`
|
|
|
|
|
- `Boolean` ↔ `Boolean`
|
|
|
|
|
- `Float` and `Double` ↔ numbers
|
|
|
|
|
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
|
|
|
|
|
- `(undefined)` ↔ the undefined value
|
|
|
|
|
- `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
|
|
|
|
|
- `Sequence` ↔ `Array`
|
|
|
|
|
- `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
|
|
|
|
|
- `Dictionary` ↔ a `Map`
|
|
|
|
|
|
|
|
|
|
### Scheme/Racket
|
|
|
|
|
|
|
|
|
|
- `SignedInteger` ↔ exact numbers
|
|
|
|
|
- `String` ↔ strings
|
|
|
|
|
- `ByteString` ↔ byte vector (Racket: "Bytes")
|
|
|
|
|
- `Symbol` ↔ symbols
|
|
|
|
|
- `Boolean` ↔ booleans
|
|
|
|
|
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
|
|
|
|
|
- `Record` ↔ structures (Racket: prefab struct)
|
|
|
|
|
- `Sequence` ↔ lists
|
|
|
|
|
- `Set` ↔ Racket: sets
|
|
|
|
|
- `Dictionary` ↔ Racket: hash-table
|
|
|
|
|
|
|
|
|
|
### Java
|
|
|
|
|
|
|
|
|
|
- `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
|
|
|
|
|
- `String` ↔ `String`
|
|
|
|
|
- `ByteString` ↔ `byte[]`
|
|
|
|
|
- `Symbol` ↔ a simple data class wrapping a `String`
|
|
|
|
|
- `Boolean` ↔ `Boolean`
|
|
|
|
|
- `Float` and `Double` ↔ `Float` and `Double`
|
|
|
|
|
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
|
|
|
|
|
- `Sequence` ↔ an implementation of `java.util.List`
|
|
|
|
|
- `Set` ↔ an implementation of `java.util.Set`
|
|
|
|
|
- `Dictionary` ↔ an implementation of `java.util.Map`
|
|
|
|
|
|
|
|
|
|
### Erlang
|
|
|
|
|
|
|
|
|
|
- `SignedInteger` ↔ integers
|
|
|
|
|
- `String` ↔ tuple of `utf8` and a binary
|
|
|
|
|
- `ByteString` ↔ a binary
|
|
|
|
|
- `Symbol` ↔ the underlying string converted to an Erlang atom, if
|
|
|
|
|
some kind of an "unsafe" mode is set on the decoder (because Erlang
|
|
|
|
|
atoms are not GC'd); otherwise perhaps a tuple of `symbol` and a
|
|
|
|
|
binary containing its UTF-8 encoding
|
|
|
|
|
- `Boolean` ↔ `true` and `false`
|
|
|
|
|
- `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
|
|
|
|
|
- `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions
|
|
|
|
|
- `Sequence` ↔ a list
|
|
|
|
|
- `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
|
|
|
|
|
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP 17)
|
|
|
|
|
|
|
|
|
|
## Appendix. Table of lead byte values
|
|
|
|
|
|
2018-09-23 21:35:00 +00:00
|
|
|
|
00 - False
|
|
|
|
|
01 - True
|
|
|
|
|
02 - Float
|
|
|
|
|
03 - Double
|
|
|
|
|
(0x) RESERVED 04-0F
|
|
|
|
|
(1x) RESERVED 10-1F
|
|
|
|
|
2x - Start Stream
|
|
|
|
|
3x - End Stream
|
|
|
|
|
|
|
|
|
|
4x - SignedInteger
|
|
|
|
|
5x - String
|
|
|
|
|
6x - Bytes
|
|
|
|
|
7x - Symbol
|
|
|
|
|
|
|
|
|
|
8x - short form Record label index 0
|
|
|
|
|
9x - short form Record label index 1
|
|
|
|
|
Ax - short form Record label index 2
|
|
|
|
|
Bx - Record
|
|
|
|
|
|
|
|
|
|
Cx - Sequence
|
|
|
|
|
Dx - Set
|
|
|
|
|
Ex - Dictionary
|
|
|
|
|
(Fx) RESERVED F0-FF
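
A decoder could use this table to dispatch on the first byte of a value. The Python sketch below simply transcribes the table; the names are illustrative.

```python
FIXED = {0x00: "False", 0x01: "True", 0x02: "Float", 0x03: "Double"}

BY_HIGH_NIBBLE = {
    0x2: "Start Stream",
    0x3: "End Stream",
    0x4: "SignedInteger",
    0x5: "String",
    0x6: "Bytes",
    0x7: "Symbol",
    0x8: "short form Record label index 0",
    0x9: "short form Record label index 1",
    0xA: "short form Record label index 2",
    0xB: "Record",
    0xC: "Sequence",
    0xD: "Set",
    0xE: "Dictionary",
}


def classify_lead_byte(b: int) -> str:
    if not 0 <= b <= 0xFF:
        raise ValueError("a lead byte must be in the range 0..255")
    if b in FIXED:
        return FIXED[b]
    # 04-0F, 10-1F and F0-FF have no entry and are therefore reserved.
    return BY_HIGH_NIBBLE.get(b >> 4, "RESERVED")


print(classify_lead_byte(0x75))  # Symbol
print(classify_lead_byte(0x04))  # RESERVED
```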
|
2018-09-23 13:37:20 +00:00
|
|
|
|
|
|
|
|
|
## Appendix. Why not Just Use JSON?
|
|
|
|
|
|
|
|
|
|
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
|
|
|
|
|
|
|
|
|
|
JSON offers *syntax* for numbers, strings, booleans, null, arrays and
|
|
|
|
|
string-keyed maps. However, it suffers from two major problems. First,
|
|
|
|
|
it offers no *semantics* for the syntax: it is left to each
|
|
|
|
|
implementation to determine how to treat each JSON term. This causes
|
|
|
|
|
[interoperability](http://seriot.ch/parsing_json.php) and even
|
|
|
|
|
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
|
|
|
|
|
issues. Second, JSON's lack of support for type tags leads to awkward
|
|
|
|
|
and incompatible *encodings* of type information in terms of the fixed
|
|
|
|
|
suite of constructors on offer.
|
|
|
|
|
|
|
|
|
|
There are other minor problems with JSON having to do with its syntax.
|
|
|
|
|
Examples include its relative verbosity and its lack of support for
|
|
|
|
|
binary data.
|
|
|
|
|
|
|
|
|
|
### JSON syntax doesn't *mean* anything
|
|
|
|
|
|
|
|
|
|
When are two JSON values the same? When are they different?
|
|
|
|
|
<!-- When is one JSON value "less than" another? -->
|
|
|
|
|
|
|
|
|
|
The specifications are largely silent on these questions. Different
|
|
|
|
|
JSON implementations give different answers.
|
|
|
|
|
|
|
|
|
|
Specifically, JSON does not:
|
|
|
|
|
|
|
|
|
|
- assign any meaning to numbers,[^meaning-ieee-double]
|
|
|
|
|
- determine how strings are to be compared,[^string-key-comparison]
|
|
|
|
|
- determine whether object key ordering is significant,[^json-member-ordering] or
|
|
|
|
|
- determine whether duplicate object keys are permitted, what it
|
|
|
|
|
would mean if they were, or how to determine a duplicate in the
|
|
|
|
|
first place.[^json-key-uniqueness]
|
|
|
|
|
|
|
|
|
|
In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
|
|
|
|
|
|
|
|
|
|
[^meaning-ieee-double]:
|
|
|
|
|
[Section 6 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-6)
|
|
|
|
|
does go so far as to indicate “good interoperability can be
|
|
|
|
|
achieved” by imagining that parsers are able reliably to
|
|
|
|
|
understand the syntax of numbers as denoting an IEEE 754
|
|
|
|
|
double-precision floating-point value.
|
|
|
|
|
|
|
|
|
|
[^string-key-comparison]:
|
|
|
|
|
[Section 8.3 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-8.3)
|
|
|
|
|
suggests that *if* an implementation compares strings used as
|
|
|
|
|
object keys “code unit by code unit”, then it will interoperate
|
|
|
|
|
with *other such implementations*, but neither requires this
|
|
|
|
|
behaviour nor discusses comparisons of strings used in other
|
|
|
|
|
contexts.
|
|
|
|
|
|
|
|
|
|
[^json-member-ordering]:
|
|
|
|
|
[Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4)
|
|
|
|
|
remarks that “[implementations] differ as to whether or not they
|
|
|
|
|
make the ordering of object members visible to calling software.”
|
|
|
|
|
|
|
|
|
|
[^json-key-uniqueness]:
|
|
|
|
|
[Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4)
|
|
|
|
|
is the only place in the specification that mentions the issue. It
|
|
|
|
|
explicitly sanctions implementations supporting duplicate keys,
|
|
|
|
|
noting only that “when the names within an object are not unique,
|
|
|
|
|
the behavior of software that receives such an object is
|
|
|
|
|
unpredictable.” Implementations are free to choose any behaviour
|
|
|
|
|
at all in this situation, including signalling an error, or
|
|
|
|
|
discarding all but one of a set of duplicates.
|
|
|
|
|
|
|
|
|
|
[^xml-infoset]: The XML world has the concept of
|
|
|
|
|
[XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
|
|
|
|
|
speaking, XML infoset is the *denotation* of an XML document; the
|
|
|
|
|
*meaning* of the document.
|
|
|
|
|
|
|
|
|
|
[^other-formats]: Most other recent data languages are like JSON in
|
|
|
|
|
specifying only a syntax with no associated semantics. While some
|
|
|
|
|
do make a sketch of a semantics, the result is often
|
|
|
|
|
underspecified (e.g. in terms of how strings are to be compared),
|
|
|
|
|
overly machine-oriented (e.g. treating 32-bit integers as
|
|
|
|
|
fundamentally distinct from 64-bit integers and from
|
|
|
|
|
floating-point numbers), overly fine (e.g. giving visibility to
|
|
|
|
|
the order in which map entries are written), or all three.
|
|
|
|
|
|
|
|
|
|
Some examples:
|
|
|
|
|
|
|
|
|
|
- are the JSON values `1`, `1.0`, and `1e0` the same or different?
|
|
|
|
|
- are the JSON values `1.0` and `1.0000000000000001` the same or different?
|
|
|
|
|
- are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
|
|
|
|
|
(UTF-8 `7061cc88726f6e`) the same or different?
|
|
|
|
|
- are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
|
|
|
|
|
or different?
|
|
|
|
|
- which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
|
|
|
|
|
same? Are all three legal?
|
|
|
|
|
- are `{"päron":1}` and `{"päron":1}` the same or different?
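
To make these questions concrete, the sketch below shows how one particular implementation, Python's standard `json` module, happens to answer them. It is not a statement of what JSON means: the specification does not mandate these answers, and another parser is free to choose differently.

```python
import json

# 1, 1.0 and 1e0: equal under ==, yet parsed to different Python types.
one, one_point_zero, one_e_zero = (json.loads(t) for t in ("1", "1.0", "1e0"))
print(type(one).__name__, type(one_point_zero).__name__, type(one_e_zero).__name__)
# int float float
print(one == one_point_zero == one_e_zero)  # True

# The extra digits are lost when the text is read as a binary64 value.
print(json.loads("1.0") == json.loads("1.0000000000000001"))  # True

# Key order is ignored when comparing; the last duplicate key silently wins.
print(json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}'))  # True
print(json.loads('{"a":1, "a":2}'))  # {'a': 2}

# Strings are compared code point by code point, with no normalization.
composed = json.loads('{"p\\u00e4ron": 1}')
decomposed = json.loads('{"pa\\u0308ron": 1}')
print(composed == decomposed)  # False
```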
|
|
|
|
|
|
|
|
|
|
### JSON can multiply nicely, but it can't add very well
|
|
|
|
|
|
|
|
|
|
JSON includes a fixed set of types: numbers, strings, booleans, null,
|
|
|
|
|
arrays and string-keyed maps. Domain-specific data must be *encoded*
|
|
|
|
|
into these types. For example, dates and email addresses are often
|
|
|
|
|
represented as strings with an implicit internal structure.
|
|
|
|
|
|
|
|
|
|
There is no convention for *labelling* a value as belonging to a
|
|
|
|
|
particular category. This makes it difficult to extract, say, all
|
|
|
|
|
email addresses, or all URLs, from an arbitrary JSON document.
|
|
|
|
|
|
|
|
|
|
Instead, JSON-encoded data are often labelled in an ad-hoc way.
|
|
|
|
|
Multiple incompatible approaches exist. For example, a "money"
|
|
|
|
|
structure containing a `currency` field and an `amount` may be
|
|
|
|
|
represented in any number of ways:
|
|
|
|
|
|
|
|
|
|
{ "_type": "money", "currency": "EUR", "amount": 10 }
|
|
|
|
|
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
|
|
|
|
|
[ "money", { "currency": "EUR", "amount": 10 } ]
|
|
|
|
|
{ "@money": { "currency": "EUR", "amount": 10 } }
|
|
|
|
|
|
|
|
|
|
This causes particular problems when JSON is used to represent *sum*
|
|
|
|
|
or *union* types, such as "either a value or an error, but not both".
|
|
|
|
|
Again, multiple incompatible approaches exist.
|
|
|
|
|
|
|
|
|
|
For example, imagine an API for depositing money in an account. The
|
|
|
|
|
response might be either a "success" response indicating the new
|
|
|
|
|
balance, or one of a set of possible errors.
|
|
|
|
|
|
|
|
|
|
Sometimes, a *pair* of values is used, with `null` marking the option
|
|
|
|
|
not taken.[^interesting-failure-mode]
|
|
|
|
|
|
|
|
|
|
{ "ok": { "balance": 210 }, "error": null }
|
|
|
|
|
{ "ok": null, "error": "Unauthorized" }
|
|
|
|
|
|
|
|
|
|
[^interesting-failure-mode]: What is the meaning of a document where
|
|
|
|
|
both `ok` and `error` are non-null? What might happen when a
|
|
|
|
|
program is presented with such a document?
|
|
|
|
|
|
|
|
|
|
Sometimes, the branch not chosen is omitted altogether, as if it
|
|
|
|
|
were an optional field:
|
|
|
|
|
|
|
|
|
|
{ "ok": { "balance": 210 } }
|
|
|
|
|
{ "error": "Unauthorized" }
|
|
|
|
|
|
|
|
|
|
Sometimes, an array of a label and a value is used:
|
|
|
|
|
|
|
|
|
|
[ "ok", { "balance": 210 } ]
|
|
|
|
|
[ "error", "Unauthorized" ]
|
|
|
|
|
|
|
|
|
|
Sometimes, the shape of the data is sufficient to distinguish among
|
|
|
|
|
the alternatives, and the label is left implicit:
|
|
|
|
|
|
|
|
|
|
{ "balance": 210 }
|
|
|
|
|
"Unauthorized"
|
|
|
|
|
|
|
|
|
|
JSON itself does not offer any guidance for which of these options to
|
|
|
|
|
choose. In many real cases on the web, poor choices have led to
|
|
|
|
|
encodings that are irrecoverably ambiguous.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
# Open questions
|
|
|
|
|
|
|
|
|
|
Q. Should "symbols" instead be URIs? Relative, usually; relative to
|
|
|
|
|
what? Some domain-specific base URI?
|
|
|
|
|
|
|
|
|
|
Q. What about general rationals, subsuming integers and IEEE floats
|
|
|
|
|
(except NaN and the Infinities)?
|
|
|
|
|
|
|
|
|
|
Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexps]
|
|
|
|
|
|
|
|
|
|
[^why-not-spki-sexps]: Why not just use Rivest's S-Expressions as
|
|
|
|
|
they are? While they include binary data and sequences, and an
|
|
|
|
|
obvious equivalence for them exists, they lack numbers *per se* as
|
|
|
|
|
well as any kind of unordered structure such as sets or maps. In
|
|
|
|
|
addition, while "display hints" allow labelling of binary data
|
|
|
|
|
with an intended interpretation, they cannot be attached to any
|
|
|
|
|
other kind of structure, and the "hint" itself can only be a
|
|
|
|
|
binary blob.
|
|
|
|
|
|
|
|
|
|
Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol`
|
|
|
|
|
label (recursive!?) and a single `String` field?
|
|
|
|
|
|
|
|
|
|
Q. Should `String` be a special syntax for `(utf8 Bytes)`? Again,
|
|
|
|
|
recursiveness problems...?
|
|
|
|
|
|
|
|
|
|
Q. Should `Dictionary` be a special syntax for etc etc.? `Set`?
|
|
|
|
|
`Float`? `Double`?
|
|
|
|
|
|
|
|
|
|
--> Rule of thumb: if there's a special equivalence predicate for it,
|
|
|
|
|
it needs to be built-in syntax. Otherwise it can be a regular
|
|
|
|
|
record. (So: `Boolean` might not make the cut for special
|
|
|
|
|
treatment?? Likewise `String`...? Ugh those are psychologically
|
|
|
|
|
important perhaps)
|
|
|
|
|
|
|
|
|
|
Q. Are the language mappings reasonable? How about one for Python?
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
2018-09-23 17:14:58 +00:00
|
|
|
|
Literal small integers: could be nice? Not absolutely necessary.
|
2018-09-23 13:37:20 +00:00
|
|
|
|
|
|
|
|
|
---
|