Trim and improve
parent
b2eb53e664
commit
b4d4092b90
|
@ -1,14 +1,16 @@
|
|||
---
|
||||
---
|
||||
<title>Preserves: an Expressive Data Language</title>
|
||||
<style>
|
||||
body { font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif; }
|
||||
@media screen {
|
||||
body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; }
|
||||
}
|
||||
@media print {
|
||||
body { margin-left: 2rem; margin-right: 2rem; }
|
||||
h2 { page-break-before: always }
|
||||
h2:first-of-type { page-break-before: auto; }
|
||||
@page { margin: 1.5cm; }
|
||||
body { margin-left: 2rem; margin-right: 2rem; }
|
||||
h1, h2 { page-break-before: always }
|
||||
h1:first-of-type, h2:first-of-type { page-break-before: auto; }
|
||||
}
|
||||
h1, h2, h3, h4, h5, h6 { margin-left: -1rem; color: #4f81bd; }
|
||||
h2 { border-bottom: solid #4f81bd 1px; }
|
||||
|
@ -17,29 +19,45 @@ code { font-size: 75%; }
|
|||
pre { padding: 0.33rem; }
|
||||
</style>
|
||||
|
||||
# Preserves: Semantic Serialization of Node-labelled Data
|
||||
# Preserves: an Expressive Data Language
|
||||
|
||||
_________
|
||||
<_________> Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||
| FRμIT | September 2018
|
||||
|Preserves| Version 0.0.2
|
||||
\_________/
|
||||
|
||||
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||
September 2018. Version 0.0.2.
|
||||
|
||||
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
||||
[spki]: http://world.std.com/~cme/html/spki.html
|
||||
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
||||
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
|
||||
|
||||
Most data serialization formats used on the web represent
|
||||
*edge-labelled* semi-structured data.
|
||||
This document proposes a data model and serialization format called
|
||||
*Preserves*.
|
||||
|
||||
This document proposes a data model and serialization format that
|
||||
takes a *node-labelled* approach.
|
||||
Preserves supports *records* with user-defined *labels*. This makes it
|
||||
more expressive[^macro-expressiveness] than most data languages in use
|
||||
on the web and allows it to easily represent the *labelled sums of
|
||||
products* as seen in many functional programming languages.
|
||||
|
||||
This makes it both extensible and much more like S-expressions, making
|
||||
it easily able to represent the *labelled sums of products* as seen in
|
||||
Rust, Haskell, OCaml, and other functional programming languages.
|
||||
Preserves also supports the usual suite of atomic and compound data
|
||||
types, in particular including *binary* data as a distinct type from
|
||||
text strings.
|
||||
|
||||
Finally, Preserves defines precisely how to compare two values with
|
||||
each other in terms of the data model, not in terms of syntax or of
|
||||
the data structures of any particular implementation language.
|
||||
|
||||
[^macro-expressiveness]: By "expressive" I mean *macro-expressive*
|
||||
in the sense of Felleisen's 1991 paper, "On the Expressive Power
|
||||
of Programming Languages".
|
||||
|
||||
Roughly speaking, there's no way in a JSON document to introduce a
|
||||
new kind of information (such as binary data, or a date-stamp, or
|
||||
a "person" object) in an *unambiguous way* without *global
|
||||
agreement* from every potential consumer of the document. With an
|
||||
extensible labelled record type, there is.
|
||||
|
||||
Felleisen, Matthias. “On the Expressive Power of Programming
|
||||
Languages.” Science of Computer Programming 17, no. 1--3 (1991):
|
||||
35–75.
|
||||
|
||||
## Starting with Semantics
|
||||
|
||||
|
@ -65,20 +83,12 @@ later in this document.
|
|||
| Dictionary
|
||||
|
||||
Our `Value`s fall into two broad categories: *atomic* and *compound*
|
||||
data.[^zephyr-asdl]
|
||||
data.[^inspiration]
|
||||
|
||||
[^zephyr-asdl]: This design was loosely inspired by S-expressions,
|
||||
[^inspiration]: This design was loosely inspired by S-expressions,
|
||||
as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others,
|
||||
and by the ML type system, as seen in languages such as SML,
|
||||
OCaml, Haskell, Rust, and many others. It is also related to
|
||||
Zephyr ASDL (h/t
|
||||
[Darius Bacon](https://twitter.com/abecedarius/status/993545767884226561)),
|
||||
which doesn't offer much in the way of atoms, but offers
|
||||
general-purpose labelled sums and products. See D. C. Wang, A. W.
|
||||
Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax
|
||||
Description Language,” in USENIX Conference on Domain-Specific
|
||||
Languages, 1997, pp. 213–228.
|
||||
[PDF available.](https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf)
|
||||
as well as by the ML type system, as seen in languages such as
|
||||
SML, OCaml, Haskell, Rust, and many others.
|
||||
|
||||
**Total order.**<a name="total-order"></a> As we go, we will
|
||||
incrementally specify a total order over `Value`s. Two values of the
|
||||
|
@ -101,9 +111,6 @@ follows:[^ordering-by-syntax]
|
|||
**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
|
||||
neither is less than the other according to the total order.
|
||||
|
||||
<!-- We should avoid unnecessary restrictions such as machine-oriented -->
|
||||
<!-- fixed-width integer or floating-point values where possible. -->
|
||||
|
||||
### Signed integers.
|
||||
|
||||
A `SignedInteger` is a signed integer of arbitrary width.
|
||||
|
@ -120,8 +127,8 @@ examples of `SignedInteger`s using standard mathematical notation.
|
|||
A `String` is a sequence of Unicode
|
||||
[code-point](http://www.unicode.org/glossary/#code_point)s. Two
|
||||
`String`s are compared lexicographically, code-point by
|
||||
code-point.[^utf8-is-awesome] We will write examples of `String`s text
|
||||
surrounded by double-quotes “`"`” using a monospace font.
|
||||
code-point.[^utf8-is-awesome] We will write examples of `String`s as
|
||||
text surrounded by double-quotes “`"`” using a monospace font.
|
||||
|
||||
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
|
||||
gives the same result as a lexicographic byte-by-byte comparison
|
||||
|
@ -176,7 +183,7 @@ A `Float` is a single-precision IEEE 754 floating-point value; a
|
|||
`Double` is a double-precision IEEE 754 floating-point value.
|
||||
`Float`s, `Double`s and `SignedInteger`s are considered disjoint, and
|
||||
so by the rules [above](#total-order), every `Float` is less than
|
||||
every `Double`, and every `SignedInteger` is less than both. Two
|
||||
every `Double`, and every `SignedInteger` is greater than both. Two
|
||||
`Float`s or two `Double`s are to be ordered by the `totalOrder`
|
||||
predicate defined in section 5.10 of
|
||||
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
|
||||
|
@ -196,10 +203,8 @@ record's *fields*. A record's label is, itself, a `Value`, though it
|
|||
will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
|
||||
are compared lexicographically as if they were just tuples; that is,
|
||||
first by their labels, and then by the remainder of their fields. We
|
||||
will only write examples of `Record`s having labels that are `Symbol`s
|
||||
entirely composed of ASCII characters. Such `Record`s will be written
|
||||
as a parenthesised, space-separated sequence of their label followed
|
||||
by their fields.
|
||||
will write examples of `Record`s as a parenthesised, space-separated
|
||||
sequence of their label `Value` followed by their field `Value`s.
|
||||
|
||||
[^extensibility]: The [Racket](https://racket-lang.org/) programming
|
||||
language defines
|
||||
|
@ -215,19 +220,19 @@ by their fields.
|
|||
`urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
|
||||
be read as an absolute IRI, it stands for that IRI; and otherwise,
|
||||
it cannot be read as an IRI at all, and so the label simply stands
|
||||
for itself - for its own `Value`.
|
||||
for itself—for its own `Value`.
|
||||
|
||||
**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is
|
||||
written `(foo 1 2 3)`; the `Record` with label `void` and no fields is
|
||||
written `(void)`.
|
||||
|
||||
**Non-examples.** `()`, because it lacks a label.
|
||||
|
||||
### Sequences.
|
||||
|
||||
A `Sequence` is a general-purpose, variable-length ordered sequence of
|
||||
zero or more `Value`s. `Sequence`s are compared lexicographically,
|
||||
appealing to the ordering on `Value`s for comparisons at each position
|
||||
in the `Sequence`s. We write examples space-separated, surrounded with
|
||||
square brackets.
|
||||
zero or more `Value`s. `Sequence`s are compared lexicographically. We
|
||||
write examples space-separated, surrounded with square brackets.
|
||||
|
||||
**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
|
||||
`SignedInteger`s 1, 2 and 3.
|
||||
|
@ -237,18 +242,18 @@ square brackets.
|
|||
A `Set` is an unordered finite set of `Value`s. It contains no
|
||||
duplicate values, following the [equivalence relation](#equivalence)
|
||||
induced by the total order on `Value`s. Two `Set`s are compared by
|
||||
sorting their elements using the [total order](#total-order) and
|
||||
comparing the resulting sequences as `Sequence`s. We write examples
|
||||
sorting their elements ascending using the [total order](#total-order)
|
||||
and comparing the resulting `Sequence`s. We write examples
|
||||
space-separated, surrounded with curly braces, prefixed by `#set`.
|
||||
|
||||
**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
|
||||
containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
|
||||
containing 4, the string `"hello"`, the record with label `void` and
|
||||
no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
|
||||
the set containing a `SignedInteger` and a `Float`, both denoting the
|
||||
number 1; `#set{(mime application/xml #"<x/>") (mime
|
||||
application/xml #"<x />")}`, a set containing two different
|
||||
type-labelled byte arrays.[^mime-xml-difference]
|
||||
the set containing a `SignedInteger` and a `Float`; `#set{(mime
|
||||
application/xml #"<x/>") (mime application/xml #"<x />")}`, a set
|
||||
containing two different type-labelled byte
|
||||
arrays.[^mime-xml-difference]
|
||||
|
||||
[^mime-xml-difference]: The two XML documents `<x/>` and `<x />`
|
||||
differ by bytewise comparison, and thus yield different record
|
||||
|
@ -258,50 +263,31 @@ type-labelled byte arrays.[^mime-xml-difference]
|
|||
**Non-examples.** `#set{1 1 1}`, because it contains multiple
|
||||
equivalent `Value`s.
|
||||
|
||||
### Dictionaries, hash-tables or maps.
|
||||
### Dictionaries.
|
||||
|
||||
A `Dictionary` is an unordered finite collection of zero or more pairs
|
||||
of `Value`s. Each pair comprises a *key* and a *value*. Keys in a
|
||||
`Dictionary` must be pairwise distinct. Instances of `Dictionary` are
|
||||
compared by lexicographic comparison of the sequences resulting from
|
||||
ordering each `Dictionary`'s pairs in ascending order by key. Examples
|
||||
are written as a `#dict`-prefixed, curly-brace-surrounded sequence of
|
||||
A `Dictionary` is an unordered finite collection of pairs of `Value`s.
|
||||
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` must
|
||||
be pairwise distinct. Instances of `Dictionary` are compared by
|
||||
lexicographic comparison of the sequences resulting from ordering each
|
||||
`Dictionary`'s pairs in ascending order by key. Examples are written
|
||||
as a `#dict`-prefixed, curly-brace-surrounded sequence of
|
||||
space-separated key-value pairs, each written with a colon between the
|
||||
key and value.
|
||||
|
||||
**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the
|
||||
dictionary mapping the `Symbol` `a` to the `SignedInteger` 1;
|
||||
`#dict{1:a}`, mapping 1 to `a`; `#dict{"hi":0 hi:0 there:[]}`, having
|
||||
a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
|
||||
values.
|
||||
`#dict{[1 2 3]:a}`, mapping `[1 2 3]` to `a`; `#dict{"hi":0 hi:0
|
||||
there:[]}`, having a `String` and two `Symbol` keys, and
|
||||
`SignedInteger` and `Sequence` values.
|
||||
|
||||
**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate
|
||||
keys; `#dict{[]:[] []:99}`, for the same reason.
|
||||
keys; `#dict{[7 8]:[] [7 8]:99}`, for the same reason.
|
||||
|
||||
## Syntax
|
||||
|
||||
Now we have discussed `Value`s and their meanings, we may turn to
|
||||
techniques for *representing* `Value`s for communication or storage.
|
||||
|
||||
The syntax we have used for the examples so far is inadequate in many
|
||||
ways, not least of which is that it cannot represent every `Value`.
|
||||
|
||||
Separation of the meaning of a piece of syntax from the syntax itself
|
||||
opens the door to domain-specific syntaxes, all equivalent and
|
||||
interconvertible.[^asn1] With a robust semantic foundation,
|
||||
connections to other data languages can also be made.
|
||||
|
||||
[^asn1]: Those who remember
|
||||
[ASN.1](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx)
|
||||
will recall BER, DER, PER, CER, XER and so on, each appropriate to
|
||||
a different setting. Similarly,
|
||||
[Rivest's S-Expression design][sexp.txt] offers a human-friendly
|
||||
syntax, a syntax robust to network-induced message corruption, and
|
||||
an unambiguous, simple and easily-parsed machine-friendly syntax
|
||||
for the same underlying values.
|
||||
|
||||
### Binary syntax
|
||||
|
||||
For now, we limit our attention to an easily-parsed, easily-produced
|
||||
machine-readable syntax.
|
||||
|
||||
|
@ -312,42 +298,7 @@ encoded details of the `Value` itself.
|
|||
|
||||
For a value `v`, we write `[[v]]` for the `Repr` of `v`.
|
||||
|
||||
The following figure summarises the definitions below:
|
||||
|
||||
tt nn mmmm varint(m) contents
|
||||
-------------------------------
|
||||
|
||||
00 00 0000 False
|
||||
00 00 0001 True
|
||||
00 00 0010 Float, 32 bits big-endian binary
|
||||
00 00 0011 Double, 64 bits big-endian binary
|
||||
00 00 x1xx RESERVED
|
||||
00 00 1xxx RESERVED
|
||||
00 01 xxxx RESERVED
|
||||
00 10 ttnn Start Stream <tt,nn>
|
||||
When tt = 00 --> error
|
||||
01 --> each chunk is a <tt,nn> piece
|
||||
1x --> each chunk is a single encoded Value
|
||||
00 11 ttnn End Stream <tt,nn> (must match preceding Start Stream)
|
||||
|
||||
01 00 mmmm ... SignedInteger, big-endian binary
|
||||
01 01 mmmm ... String, UTF-8 binary
|
||||
01 10 mmmm ... ByteString
|
||||
01 11 mmmm ... Symbol, UTF-8 binary
|
||||
|
||||
10 00 mmmm ... application-specific Record
|
||||
10 01 mmmm ... application-specific Record
|
||||
10 10 mmmm ... application-specific Record
|
||||
10 11 mmmm ... Record
|
||||
|
||||
11 00 mmmm ... Sequence
|
||||
11 01 mmmm ... Set
|
||||
11 10 mmmm ... Dictionary
|
||||
11 11 xxxx RESERVED
|
||||
|
||||
If mmmm = 1111, varint(m) is present; otherwise, m is the length
|
||||
|
||||
#### Type and Length representation
|
||||
### Type and Length representation
|
||||
|
||||
Each `Repr` takes one of three possible forms:
|
||||
|
||||
|
@ -365,13 +316,13 @@ Each `Repr` takes one of three possible forms:
|
|||
begins before the number of elements or bytes in the corresponding
|
||||
`Value` is known.
|
||||
|
||||
Applications may choose between formats (B) and (C) depending on their
|
||||
Applications may choose between formats B and C depending on their
|
||||
needs at serialization time.
|
||||
|
||||
Every `Repr`, however, starts with a *lead byte* describing the
|
||||
remainder of the representation.
|
||||
Every `Repr` starts with a *lead byte* describing the remainder of the
|
||||
representation.
|
||||
|
||||
##### The lead byte
|
||||
#### The lead byte
|
||||
|
||||
The lead byte is constructed by a function `leadbyte`:
|
||||
|
||||
|
@ -387,18 +338,18 @@ follows:[^some-encodings-unused]
|
|||
encodings are reserved for future versions of this specification.
|
||||
|
||||
- `leadbyte(0,0,-)` (format A) represents an Atom with fixed-length binary representation.
|
||||
- `leadbyte(0,1,-)` (format A) is RESERVED.
|
||||
- `leadbyte(0,1,-)` (format A) is reserved.
|
||||
- `leadbyte(0,2,-)` (format C) is a Stream Start byte.
|
||||
- `leadbyte(0,3,-)` (format C) is a Stream End byte.
|
||||
- `leadbyte(1,-,-)` (format B) represents an Atom with variable-length binary representation.
|
||||
- `leadbyte(2,-,-)` (format B) represents a Record.
|
||||
- `leadbyte(3,-,-)` (format B) represents a Sequence, Set or Dictionary.
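
The cases listed above can be cross-checked with a small Python sketch. It assumes, based on the `tt nn mmmm` bit-field layout used elsewhere in this document, that `leadbyte(t,n,m)` packs `t` into the top two bits, `n` into the next two, and `m` into the low four. The range checks are illustrative and not part of the specification.

```python
def leadbyte(t: int, n: int, m: int) -> int:
    """Pack t (2 bits), n (2 bits) and m (4 bits) into one lead byte.

    Assumption: the layout is tt nn mmmm, most significant bits first.
    """
    if not (0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16):
        raise ValueError("field out of range")
    return (t << 6) | (n << 4) | m


# Format A atoms: False and True.
print(hex(leadbyte(0, 0, 0)), hex(leadbyte(0, 0, 1)))
# A Record lead byte with m = 3.
print(hex(leadbyte(2, 3, 3)))
```

Under this assumption, `leadbyte(1,0,3)` is `0x43`, which agrees with the lead byte used for three-byte `SignedInteger`s in the examples of this document.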
|
||||
|
||||
##### Encoding data of fixed length (format A)
|
||||
#### Encoding data of fixed length (format A)
|
||||
|
||||
Each specific type of data defines its own rules for this format.
|
||||
|
||||
##### Encoding data of known length (format B)
|
||||
#### Encoding data of known length (format B)
|
||||
|
||||
A `Repr` where the length of the `Value` to be encoded is variable but
|
||||
known uses the value of `m` in `leadbyte` to encode its length. The
|
||||
|
@ -434,15 +385,15 @@ definition,
|
|||
- 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
|
||||
- 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
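
The two varint examples above can be reproduced with a short Python sketch. It assumes the usual base-128 scheme from the [varint] reference: seven payload bits per byte, least-significant group first, with the top bit set on every byte except the last. The function name `varint` mirrors the notation used here; the error handling is an illustrative choice.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if m < 0:
        raise ValueError("varint is defined for non-negative integers only")
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        if m:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # final byte: continuation bit clear
            return bytes(out)


print(list(varint(300)))
print(list(varint(1000000000)))
```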
|
||||
|
||||
##### Streaming data of unknown length (format C)
|
||||
#### Streaming data of unknown length (format C)
|
||||
|
||||
A `Repr` where the length of the `Value` to be encoded is variable and
|
||||
not known at the time serialization of the `Value` starts is encoded
|
||||
by a single Stream Start byte, followed by zero or more *chunks*,
|
||||
followed by a matching Stream End byte:
|
||||
by a single Stream Start (“open”) byte, followed by zero or more
|
||||
*chunks*, followed by a matching Stream End (“close”) byte:
|
||||
|
||||
startbyte(t,n) = leadbyte(0,2, t*4 + n)
|
||||
endbyte(t,n) = leadbyte(0,3, t*4 + n)
|
||||
open(t,n) = leadbyte(0,2, t*4 + n)
|
||||
close(t,n) = leadbyte(0,3, t*4 + n)
|
||||
|
||||
For a `Repr` of a `Value` containing binary data, each chunk is to be
|
||||
a format B `Repr` of the same type as the overall `Repr`.
|
||||
|
@ -450,7 +401,7 @@ a format B `Repr` of the same type as the overall `Repr`.
|
|||
For a `Repr` of a `Value` containing other `Value`s, each chunk is to
|
||||
be a single `Repr`.
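
As a hedged illustration of format C for binary data, the following Python sketch streams a `ByteString` as a sequence of format B chunks. It assumes the `tt nn mmmm` lead byte layout, and it names the stream delimiters `open_byte` and `close_byte` only to avoid shadowing Python's built-in `open`. For brevity it handles only chunks shorter than 15 bytes, so that no varint length is needed.

```python
def leadbyte(t, n, m):
    # Assumed layout: tt nn mmmm, most significant bits first.
    return (t << 6) | (n << 4) | m


def open_byte(t, n):
    return leadbyte(0, 2, t * 4 + n)


def close_byte(t, n):
    return leadbyte(0, 3, t * 4 + n)


def stream_bytestring(chunks):
    """Format C for a ByteString: Stream Start, format B chunks, Stream End."""
    out = bytearray([open_byte(1, 2)])
    for chunk in chunks:
        if len(chunk) >= 15:
            raise ValueError("sketch only handles chunks under 15 bytes")
        out.append(leadbyte(1, 2, len(chunk)))  # format B header for the chunk
        out += chunk
    out.append(close_byte(1, 2))
    return bytes(out)


print(stream_bytestring([b"abc", b"de"]).hex(" "))
```

The Stream End byte repeats the same `(t,n)` pair as the Stream Start byte, which is what lets a decoder check that the two match.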
|
||||
|
||||
#### Records
|
||||
### Records
|
||||
|
||||
Format B (known length):
|
||||
|
||||
|
@ -461,13 +412,12 @@ encoding of the record label.
|
|||
|
||||
Format C (streaming):
|
||||
|
||||
[[ (L F_1 ... F_m) ]]
|
||||
= startbyte(2,3) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,3)
|
||||
[[ (L F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
|
||||
|
||||
Applications *SHOULD* prefer the known-length format for encoding
|
||||
`Record`s.
|
||||
|
||||
##### Application-specific short form for labels
|
||||
#### Application-specific short form for labels
|
||||
|
||||
Any given protocol using Preserves may additionally define an
|
||||
interpretation for `n ∈ {0,1,2}`, mapping each *short form label
|
||||
|
@ -478,7 +428,7 @@ short form label number `n`, format B becomes
|
|||
|
||||
and format C becomes
|
||||
|
||||
startbyte(2,n) ++ [[F_1]] ++ ... ++ [[F_m]] ++ endbyte(2,n)
|
||||
open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)
|
||||
|
||||
**Examples.** A protocol may choose to map records
|
||||
labelled `void` to `n=0`, making
|
||||
|
@ -494,30 +444,29 @@ making
|
|||
|
||||
for format B, or
|
||||
|
||||
= startbyte(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ endbyte(2,1)
|
||||
= open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
|
||||
= [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
|
||||
|
||||
for format C.
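
The format C short form above can be spelled out byte by byte in Python. This is a sketch under assumptions: the lead byte layout is taken to be `tt nn mmmm`, the helper names `string_repr` and `short_form_stream` are hypothetical, and only `String` fields shorter than 15 bytes are handled.

```python
def leadbyte(t, n, m):
    # Assumed layout: tt nn mmmm, most significant bits first.
    return (t << 6) | (n << 4) | m


def string_repr(s):
    """Format B Repr of a short String (UTF-8 length under 15 bytes)."""
    data = s.encode("utf-8")
    if len(data) >= 15:
        raise ValueError("sketch omits the varint length form")
    return bytes([leadbyte(1, 1, len(data))]) + data


def short_form_stream(n, field_reprs):
    """Format C Record using short form label number n (0, 1 or 2)."""
    if n not in (0, 1, 2):
        raise ValueError("short form label numbers are 0, 1 and 2")
    start = leadbyte(0, 2, 2 * 4 + n)
    end = leadbyte(0, 3, 2 * 4 + n)
    return bytes([start]) + b"".join(field_reprs) + bytes([end])


fields = [string_repr(s) for s in ("Dr", "Elizabeth", "Blackwell")]
print(short_form_stream(1, fields).hex(" "))
```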
|
||||
|
||||
#### Sequences, Sets and Dictionaries
|
||||
### Sequences, Sets and Dictionaries
|
||||
|
||||
Format B (known length):
|
||||
|
||||
[[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]]
|
||||
|
||||
[[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]]
|
||||
[[ #dict{K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
|
||||
++ [[K_m]] ++ [[V_m]]
|
||||
|
||||
[[ #dict{K_1:V_1 ... K_m:V_m} ]]
|
||||
= header(3,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]
|
||||
Note that `m*2` is given to `header` for a `Dictionary`, since there
|
||||
are two `Value`s in each key-value pair.
|
||||
|
||||
Format C (streaming):
|
||||
|
||||
[[ [X_1 ... X_m] ]] = startbyte(3,0) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,0)
|
||||
|
||||
[[ #set{X_1 ... X_m} ]] = startbyte(3,1) ++ [[X_1]] ++ ... ++ [[X_m]] ++ endbyte(3,1)
|
||||
|
||||
[[ #dict{K_1:V_1 ... K_m:V_m} ]]
|
||||
= startbyte(3,2) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]] ++ endbyte(3,2)
|
||||
[[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
|
||||
[[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
|
||||
[[ #dict{K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
|
||||
++ [[K_m]] ++ [[V_m]] ++ close(3,2)
|
||||
|
||||
Applications may use whichever format suits their needs on a
|
||||
case-by-case basis.
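
The compound encodings can be sketched in Python as functions over already-encoded element `Repr`s. Several details here are assumptions rather than quotations from this specification: the `tt nn mmmm` lead byte layout, the rule that `header` stores lengths below 15 in the lead byte and otherwise emits `mmmm = 1111` followed by a varint, and the use of `m*2` for a `Dictionary`. The sketch does not check for duplicate set elements or dictionary keys.

```python
def leadbyte(t, n, m):
    return (t << 6) | (n << 4) | m


def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t, n, m):
    # Assumption: small lengths live in the lead byte; larger ones in a varint.
    if m < 15:
        return bytes([leadbyte(t, n, m)])
    return bytes([leadbyte(t, n, 15)]) + varint(m)


def encode_sequence(reprs):
    return header(3, 0, len(reprs)) + b"".join(reprs)


def encode_set(reprs):
    return header(3, 1, len(reprs)) + b"".join(reprs)


def encode_dictionary(pairs):
    # Two Values per key-value pair, hence m*2.
    return header(3, 2, 2 * len(pairs)) + b"".join(k + v for k, v in pairs)


def stream_sequence(reprs):
    # Format C: Stream Start (3,0), the element Reprs, Stream End (3,0).
    start, end = leadbyte(0, 2, 3 * 4 + 0), leadbyte(0, 3, 3 * 4 + 0)
    return bytes([start]) + b"".join(reprs) + bytes([end])


print(encode_dictionary([(b"\x01", b"\x00")]).hex(" "))
```

Taking pre-encoded `Repr`s as input keeps the sketch independent of any particular in-memory model of `Value`.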
|
||||
|
@ -538,26 +487,30 @@ order.
|
|||
(b) sorting keys or elements makes no sense in streaming
|
||||
serialization formats.
|
||||
|
||||
Note that `header(3,3,m)` and `startbyte(3,3)`/`endbyte(3,3)` are unused and reserved.
|
||||
However, a quality implementation may wish to offer the programmer
|
||||
the option of serializing with set elements and dictionary keys in
|
||||
sorted order.
|
||||
|
||||
#### Variable-length Atoms
|
||||
Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.
|
||||
|
||||
##### SignedInteger
|
||||
### Variable-length Atoms
|
||||
|
||||
#### SignedInteger
|
||||
|
||||
Format B (known length):
|
||||
|
||||
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)
|
||||
where m = |intbytes(x)|
|
||||
and intbytes(x) = a big-endian two's-complement representation
|
||||
of the signed integer x, taking exactly as
|
||||
many whole bytes as needed to unambiguously
|
||||
identify the value
|
||||
|
||||
Format C *MUST NOT* be used for `SignedInteger`s.
|
||||
|
||||
The function `intbytes(x)` gives the big-endian two's-complement
|
||||
binary representation of `x`, taking exactly as many whole bytes as
|
||||
needed to unambiguously identify the value and its sign, and `m =
|
||||
|intbytes(x)|`.
|
||||
|
||||
The value 0 needs zero bytes to identify the value, so `intbytes(0)`
|
||||
is the empty byte string. Non-zero values need at least one byte; the
|
||||
most-significant bit in the first byte in `intbytes(x)` for `x≠0` is
|
||||
most-significant bit in the first byte in `intbytes(x)` for `x`≠0 is
|
||||
the sign bit.
|
||||
|
||||
For example,
|
||||
|
@ -583,59 +536,49 @@ For example,
|
|||
[[ 65536 ]] = [0x43, 0x01, 0x00, 0x00]
|
||||
[[ 131072 ]] = [0x43, 0x02, 0x00, 0x00]
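
A Python sketch of `intbytes` may help make "exactly as many whole bytes as needed" concrete. It relies on the standard `int.bit_length` and `int.to_bytes` methods; the lead byte value `0x40 | m` assumes the `tt nn mmmm` layout, and integers needing 15 or more bytes are left out because they would require the varint length form.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement encoding of x; empty for 0."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; for negative x, ~x is -x - 1.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    nbytes = magnitude_bits // 8 + 1  # leaves room for one sign bit
    return x.to_bytes(nbytes, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    body = intbytes(x)
    if len(body) >= 15:
        raise ValueError("sketch omits the varint length form")
    return bytes([0x40 | len(body)]) + body  # header(1,0,m) for m < 15


for value in (0, -1, 127, 128, 65536, 131072):
    print(value, encode_signed_integer(value).hex(" "))
```

Note that 128 needs two bytes (`00 80`), because a single byte `80` would have its sign bit set and so denote -128.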
|
||||
|
||||
##### String
|
||||
#### String, ByteString and Symbol
|
||||
|
||||
Syntax for these three types varies only in the value of `n` supplied
|
||||
to `header`, `open`, and `close`. In each case, the payload following
|
||||
the header is a binary sequence; for `String` and `Symbol`, it is a
|
||||
UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
|
||||
is the raw data contained within the `Value` unmodified.
|
||||
|
||||
Format B (known length):
|
||||
|
||||
[[ S ]] when S ∈ String = header(1,1,m) ++ utf8(S)
|
||||
where m = |utf8(S)|
|
||||
and utf8(S) = the UTF-8 encoding of S
|
||||
[[ S ]] = header(1,n,m) ++ encode(S)
|
||||
where m = |encode(S)|
|
||||
and (n,encode(S)) = (1,utf8(S)) if S ∈ String
|
||||
(2,S) if S ∈ ByteString
|
||||
(3,utf8(S)) if S ∈ Symbol
|
||||
|
||||
To stream a `String`, emit `startbyte(1,1)` and then a sequence of
|
||||
zero or more format B `String` chunks, followed by `endbyte(1,1)`.
|
||||
To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
|
||||
then a sequence of zero or more format B chunks, followed by
|
||||
`close(1,n)`. For a `String`, every chunk must be a `String`;
|
||||
likewise, for `ByteString` and `Symbol`.
|
||||
|
||||
While the overall content of a streamed `String` must be valid UTF-8,
|
||||
individual chunks do not have to conform to UTF-8.
|
||||
While the overall content of a streamed `String` or `Symbol` must be
|
||||
valid UTF-8, individual chunks do not have to conform to UTF-8.
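
The three variable-length textual and binary atoms differ only in `n`, which a Python sketch can show directly. The lead byte layout (`tt nn mmmm`) and the threshold at which `header` switches to a varint length (15) are assumptions; the function names are hypothetical. The final helper illustrates that a streamed `String` may split a multi-byte character across chunks.

```python
def leadbyte(t, n, m):
    return (t << 6) | (n << 4) | m


def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t, n, m):
    if m < 15:
        return bytes([leadbyte(t, n, m)])
    return bytes([leadbyte(t, n, 15)]) + varint(m)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return header(1, 1, len(data)) + data


def encode_bytestring(b: bytes) -> bytes:
    return header(1, 2, len(b)) + b


def encode_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return header(1, 3, len(data)) + data


def stream_string(chunks):
    """Format C String; chunks are byte strings whose concatenation is UTF-8."""
    body = b"".join(header(1, 1, len(c)) + c for c in chunks)
    return bytes([leadbyte(0, 2, 1 * 4 + 1)]) + body + bytes([leadbyte(0, 3, 1 * 4 + 1)])


utf8 = "é".encode("utf-8")  # two bytes: C3 A9
print(stream_string([utf8[:1], utf8[1:]]).hex(" "))
```

Neither chunk in the usage example is valid UTF-8 on its own, but their concatenation is.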
|
||||
|
||||
##### ByteString
|
||||
|
||||
Format B (known length):
|
||||
|
||||
[[ B ]] when B ∈ ByteString = header(1,2,m) ++ B
|
||||
where m = |B|
|
||||
|
||||
To stream a `ByteString`, emit `startbyte(1,2)` and then a sequence of
|
||||
zero or more format B `ByteString` chunks, followed by `endbyte(1,2)`.
|
||||
|
||||
##### Symbol
|
||||
|
||||
Format B (known length):
|
||||
|
||||
[[ S ]] when S ∈ Symbol = header(1,3,m) ++ utf8(S)
|
||||
where m = |utf8(S)|
|
||||
and utf8(S) = the UTF-8 encoding of S
|
||||
|
||||
To stream a `Symbol`, emit `startbyte(1,3)` and then a sequence of
|
||||
zero or more format B `Symbol` chunks, followed by `endbyte(1,3)`.
|
||||
|
||||
#### Fixed-length Atoms
|
||||
### Fixed-length Atoms
|
||||
|
||||
Fixed-length atoms all use format A, and do not have a length
|
||||
representation. They repurpose the bits that format B `Repr`s use to
|
||||
specify lengths. Applications *MUST NOT* use format C with
|
||||
`startbyte(0,n)` or `endbyte(0,n)` for any `n`.
|
||||
`open(0,n)` or `close(0,n)` for any `n`.
|
||||
|
||||
##### Booleans
|
||||
#### Booleans
|
||||
|
||||
[[ #f ]] = header(0,0,0) = [0x00]
|
||||
[[ #t ]] = header(0,0,1) = [0x01]
|
||||
|
||||
##### Floats and Doubles
|
||||
#### Floats and Doubles
|
||||
|
||||
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
|
||||
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
|
||||
where binary32(F) and binary64(D) are big-endian 4- and 8-byte
|
||||
IEEE 754 binary representations
|
||||
|
||||
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
||||
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
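
The fixed-length atoms are simple enough to sketch with Python's standard `struct` module, whose `">f"` and `">d"` formats produce big-endian IEEE 754 binary32 and binary64 encodings. The lead byte values `0x00` to `0x03` follow the table of lead byte values; the function names are hypothetical.

```python
import struct


def encode_boolean(b: bool) -> bytes:
    return bytes([0x01 if b else 0x00])


def encode_float(f: float) -> bytes:
    # header(0,0,2) followed by 4 bytes of big-endian binary32.
    return bytes([0x02]) + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    # header(0,0,3) followed by 8 bytes of big-endian binary64.
    return bytes([0x03]) + struct.pack(">d", d)


print(encode_float(1.0).hex(" "))
print(encode_double(1.0).hex(" "))
```

Python has no separate single-precision type, so in this sketch the caller chooses between `Float` and `Double` by choosing the function.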
|
||||
|
||||
## Examples
|
||||
|
||||
|
@ -705,14 +648,33 @@ encodes to
|
|||
|
||||
The `Value` data type is essentially an S-Expression, able to
|
||||
represent semi-structured data over `ByteString`, `String`,
|
||||
`SignedInteger` atoms and so on.
|
||||
`SignedInteger` atoms and so on.[^why-not-spki-sexps]
|
||||
|
||||
[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
|
||||
similar to Preserves. However, while they include binary data and
|
||||
sequences, and an obvious equivalence for them exists, they lack
|
||||
numbers *per se* as well as any kind of unordered structure such
|
||||
as sets or maps. In addition, while "display hints" allow
|
||||
labelling of binary data with an intended interpretation, they
|
||||
cannot be attached to any other kind of structure, and the "hint"
|
||||
itself can only be a binary blob.
|
||||
|
||||
However, users need a wide variety of data types for representing
|
||||
domain-specific values such as various kinds of encoded and normalized
|
||||
text, calendrical values, machine words, and so on.
|
||||
|
||||
We use appropriately-labelled `Record`s to denote these
|
||||
domain-specific data types.
|
||||
Appropriately-labelled `Record`s denote these domain-specific data
|
||||
types.[^why-dictionaries]
|
||||
|
||||
[^why-dictionaries]: Given `Record`'s existence, it may seem odd
|
||||
that `Dictionary`, `Set`, `Float`, etc. are given special
|
||||
treatment. Preserves aims to offer a useful basic equivalence
|
||||
predicate to programmers, and so if a data type demands a special
|
||||
equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
|
||||
then the type should be included in the base language. Otherwise,
|
||||
it can be represented as a `Record` and treated separately. Both
|
||||
`Boolean` and `String` are seeming exceptions: they merit
|
||||
inclusion because of their cultural importance.
|
||||
|
||||
All of these conventions are optional. They form a layer atop the core
|
||||
`Value` structure. Non-domain-specific tools do not in general need to
|
||||
|
@ -740,11 +702,13 @@ being a `ByteString`, the binary data.
|
|||
|
||||
While each media type may define its own rules for comparing
|
||||
documents, we define ordering among `MIMEData` *representations* of
|
||||
such media types lexicographically over the (`Symbol`, `ByteString`)
|
||||
pair.
|
||||
such media types following the general rules for ordering of
|
||||
`Record`s.
|
||||
|
||||
**Examples.**
|
||||
|
||||
| Value | Encoded hexadecimal byte sequence |
|
||||
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
|
||||
| `(mime application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
|
||||
| `(mime text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
|
||||
| `(mime application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
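
The rows of the table above can be regenerated with a Python sketch. It assumes, consistent with the `B3` lead byte in every row, that a format B `Record` with two fields uses `header(2,3,3)`, counting the label together with the fields; it also assumes the `tt nn mmmm` layout and a varint length once `m` reaches 15. The helper names are hypothetical.

```python
def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t, n, m):
    lead = (t << 6) | (n << 4)
    if m < 15:
        return bytes([lead | m])
    return bytes([lead | 15]) + varint(m)


def symbol(name):
    data = name.encode("utf-8")
    return header(1, 3, len(data)) + data


def bytestring(data):
    return header(1, 2, len(data)) + data


def record(label_repr, field_reprs):
    # Assumption: the count covers the label plus the fields.
    return header(2, 3, 1 + len(field_reprs)) + label_repr + b"".join(field_reprs)


def mime(media_type, data):
    return record(symbol("mime"), [symbol(media_type), bytestring(data)])


print(mime("text/plain", b"ABC").hex(" ").upper())
```

The `application/xml` row is a useful check of the length rule: the media type is exactly 15 bytes long, so its header is `7F 0F` rather than a single byte.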
|
||||
|
@ -813,7 +777,84 @@ Dates, times, moments, and timestamps can be represented with a
|
|||
or `date-time` productions of
|
||||
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
|
||||
|
||||
## Representing Values in Programming Languages
|
||||
## Security Considerations
|
||||
|
||||
**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
|
||||
`Symbol`s may include chunks of zero length. This opens up a
|
||||
possibility for denial-of-service: an attacker may begin streaming a
|
||||
string, sending an endless sequence of zero-length chunks, appearing
|
||||
to make progress but not actually doing so. Implementations may place
|
||||
optional reasonable restrictions on the number of consecutive empty
|
||||
chunks that may appear in a stream, and may even supply an optional
|
||||
mode that rejects empty chunks entirely.
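
One possible shape for such a restriction is sketched below in Python. The function name `check_chunk_lengths` and the default limit of 8 are hypothetical choices for illustration; the specification does not prescribe either. Setting the limit to 0 gives the stricter mode that rejects empty chunks entirely.

```python
def check_chunk_lengths(lengths, max_consecutive_empty=8):
    """Raise ValueError if too many zero-length chunks arrive in a row.

    `lengths` may be any iterable, including a lazy one fed by a decoder,
    so an endless run of empty chunks is rejected after a bounded number
    of steps.
    """
    run = 0
    for length in lengths:
        run = run + 1 if length == 0 else 0
        if run > max_consecutive_empty:
            raise ValueError("too many consecutive empty chunks")


# A stream with occasional empty chunks passes.
check_chunk_lengths([3, 0, 0, 2], max_consecutive_empty=2)
```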
|
||||
|
||||
**Canonical form for cryptographic hashing and signing.** As
|
||||
specified, the encoding rules for `Value`s do not force canonical
|
||||
serializations for `Set` or `Dictionary` values. Two serializations of
|
||||
the same `Value` may yield different binary `Repr`s.
|
||||
|
||||
## Appendix. Table of lead byte values
|
||||
|
||||
00 - False
|
||||
01 - True
|
||||
02 - Float
|
||||
03 - Double
|
||||
(0x) RESERVED 04-0F
|
||||
(1x) RESERVED 10-1F
|
||||
2x - Start Stream
|
||||
3x - End Stream
|
||||
|
||||
4x - SignedInteger
|
||||
5x - String
|
||||
6x - ByteString
|
||||
7x - Symbol
|
||||
|
||||
8x - short form Record label index 0
|
||||
9x - short form Record label index 1
|
||||
Ax - short form Record label index 2
|
||||
Bx - Record
|
||||
|
||||
Cx - Sequence
|
||||
Dx - Set
|
||||
Ex - Dictionary
|
||||
(Fx) RESERVED F0-FF
|
||||
|
||||
## Appendix. Bit fields within lead byte values
|
||||
|
||||
tt nn mmmm contents
|
||||
---------- ---------
|
||||
|
||||
00 00 0000 False
|
||||
00 00 0001 True
|
||||
00 00 0010 Float, 32 bits big-endian binary
|
||||
00 00 0011 Double, 64 bits big-endian binary
|
||||
|
||||
00 10 ttnn Start Stream <tt,nn>
|
||||
When tt = 00 --> error
|
||||
01 --> each chunk is a <tt,nn> piece
|
||||
1x --> each chunk is a single encoded Value
|
||||
00 11 ttnn End Stream <tt,nn> (must match preceding Start Stream)
|
||||
|
||||
01 00 mmmm SignedInteger, big-endian binary
|
||||
01 01 mmmm String, UTF-8 binary
|
||||
01 10 mmmm ByteString
|
||||
01 11 mmmm Symbol, UTF-8 binary
|
||||
|
||||
10 00 mmmm application-specific Record
|
||||
10 01 mmmm application-specific Record
|
||||
10 10 mmmm application-specific Record
|
||||
10 11 mmmm Record
|
||||
|
||||
11 00 mmmm Sequence
|
||||
11 01 mmmm Set
|
||||
11 10 mmmm Dictionary
|
||||
|
||||
If mmmm = 1111, a varint(m) follows, giving the length, before
|
||||
the body; otherwise, m is the length of the body to follow.
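
Read in the other direction, the bit-field table gives a decoder's first step. The following Python sketch splits a lead byte into its `tt`, `nn` and `mmmm` fields and names the kind of `Repr` that follows. The function names are hypothetical, and the sketch does not validate the `<tt,nn>` payload of Stream Start and Stream End bytes.

```python
FIXED = {0: "False", 1: "True", 2: "Float", 3: "Double"}
KINDS = {
    (1, 0): "SignedInteger", (1, 1): "String",
    (1, 2): "ByteString", (1, 3): "Symbol",
    (2, 3): "Record",
    (3, 0): "Sequence", (3, 1): "Set", (3, 2): "Dictionary",
}


def split_leadbyte(b: int):
    """Return the (tt, nn, mmmm) fields of a lead byte."""
    return (b >> 6) & 0x3, (b >> 4) & 0x3, b & 0xF


def classify(b: int) -> str:
    t, n, m = split_leadbyte(b)
    if t == 0:
        if n == 0:
            return FIXED.get(m, "RESERVED")
        if n == 1:
            return "RESERVED"
        return "Start Stream" if n == 2 else "End Stream"
    if t == 2 and n < 3:
        return "application-specific Record"
    return KINDS.get((t, n), "RESERVED")


for byte in (0x01, 0x29, 0x43, 0x7F, 0xB3, 0xF0):
    print(hex(byte), split_leadbyte(byte), classify(byte))
```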
|
||||
|
||||
|
||||
|
||||
## Appendix. Representing Values in Programming Languages
|
||||
|
||||
We have given a definition of `Value` and its semantics, and proposed
|
||||
a concrete syntax for communicating and storing `Value`s. We now turn
|
||||
|
@ -881,32 +922,6 @@ should both be identities.
|
|||
- `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
|
||||
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)
|
||||
|
||||
## Appendix. Table of lead byte values
|
||||
|
||||
00 - False
|
||||
01 - True
|
||||
02 - Float
|
||||
03 - Double
|
||||
(0x) RESERVED 04-0F
|
||||
(1x) RESERVED 10-1F
|
||||
2x - Start Stream
|
||||
3x - End Stream
|
||||
|
||||
4x - SignedInteger
|
||||
5x - String
|
||||
6x - ByteString
|
||||
7x - Symbol
|
||||
|
||||
8x - short form Record label index 0
|
||||
9x - short form Record label index 1
|
||||
Ax - short form Record label index 2
|
||||
Bx - Record
|
||||
|
||||
Cx - Sequence
|
||||
Dx - Set
|
||||
Ex - Dictionary
|
||||
(Fx) RESERVED F0-FF
|
||||
|
||||
## Appendix. Why not Just Use JSON?
|
||||
|
||||
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
|
||||
|
@ -1060,47 +1075,13 @@ JSON itself does not offer any guidance for which of these options to
|
|||
choose. In many real cases on the web, poor choices have led to
|
||||
encodings that are irrecoverably ambiguous.
|
||||
|
||||
---
|
||||
---
|
||||
|
||||
# Open questions
|
||||
|
||||
Q. Should "symbols" instead be URIs? Relative, usually; relative to
|
||||
what? Some domain-specific base URI?
|
||||
|
||||
Q. What about general rationals, subsuming integers and IEEE floats
|
||||
(except NaN and the Infinities)?
|
||||
|
||||
Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexps]
|
||||
|
||||
[^why-not-spki-sexps]: Why not just use Rivest's S-Expressions as
|
||||
they are? While they include binary data and sequences, and an
|
||||
obvious equivalence for them exists, they lack numbers *per se* as
|
||||
well as any kind of unordered structure such as sets or maps. In
|
||||
addition, while "display hints" allow labelling of binary data
|
||||
with an intended interpretation, they cannot be attached to any
|
||||
other kind of structure, and the "hint" itself can only be a
|
||||
binary blob.
|
||||
|
||||
Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol`
|
||||
label (recursive!?) and a single `String` field?
|
||||
|
||||
Q. Should `String` be a special syntax for `(utf8 ByteString)`? Again,
|
||||
recursiveness problems...?
|
||||
|
||||
Q. Should `Dictionary` be a special syntax for etc etc.? `Set`?
|
||||
`Float`? `Double`?
|
||||
|
||||
--> Rule of thumb: if there's a special equivalence predicate for it,
|
||||
it needs to be built-in syntax. Otherwise it can be a regular
|
||||
record. (So: `Boolean` might not make the cut for special
|
||||
treatment?? Likewise `String`...? Ugh those are psychologically
|
||||
important perhaps)
|
||||
|
||||
Q. Are the language mappings reasonable? How about one for Python?
|
||||
|
||||
---
|
||||
Q. Literal small integers: could be nice? Not absolutely necessary.
|
||||
|
||||
Literal small integers: could be nice? Not absolutely necessary.
|
||||
|
||||
---
|
||||
## Notes
|
||||
|
|