Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks

Tony Garnock-Jones 2019-07-13 22:20:22 -04:00
parent d349e89ea4
commit e2b859e55d
2 changed files with 224 additions and 172 deletions


@ -1,5 +1,6 @@
body {
font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif;
font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif;
box-sizing: border-box;
}
@media screen {
body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; }
@ -59,3 +60,20 @@ h2#notes:before {
}
.footnotes > ol { padding: 0; font-size: 90%; }
table {
border-collapse: collapse;
width: 100%;
}
thead tr {
border-bottom: solid black 1px;
}
th {
font-weight: normal;
text-align: left;
padding-right: 0.5rem;
padding-bottom: 0.3rem;
}
td {
padding-right: 0.5rem;
}


@ -6,7 +6,7 @@
# Preserves: an Expressive Data Language
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
November 2018. Version 0.0.4.
June 2019. Version 0.0.5.
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html
@ -72,9 +72,8 @@ follows:[^ordering-by-syntax]
< String < ByteString < Symbol
[^ordering-by-syntax]: The observant reader may note that the
ordering here is (almost) the same as that implied by the tagging
scheme used in the concrete binary syntax for `Value`s. (The
exception is the syntax for small integers near zero.)
ordering here is the same as that implied by the tagging scheme
used in the concrete binary syntax for `Value`s.
**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.
@ -400,7 +399,8 @@ itself have annotations.
Value =/ ws "@" Value Value
Each annotation is preceded by `@`; the underlying annotated value
follows its annotations.
follows its annotations. Here we extend only the syntactic nonterminal
named "`Value`" without altering the semantic class of `Value`s.
**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
@ -411,19 +411,24 @@ Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or specific value of an
annotation should not change the control flow or output of the
program. Annotations are data *describing* `Value`s, and are not in
the domain of any specific application of `Value`s. That is, an
annotation will almost never cause a non-reflective program to do
anything observably different.
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.
## Compact Binary Syntax
A `Repr` is an encoding, or representation, of a specific `Value`.
Each `Repr` comprises one or more bytes describing first the kind of
represented `Value` and the length of the representation, and then the
encoded details of the `Value` itself.
A `Repr` is a binary-syntax encoding, or representation, of either
- a `Value`,
- a "placeholder" for a `Value`, or
- an annotation on a `Repr`.
Each `Repr` comprises one or more bytes describing the kind of
represented information and the length of the representation, followed
by the encoded details.
For a value `v`, we write `[[v]]` for the `Repr` of `v`.
@ -431,19 +436,16 @@ For a value `v`, we write `[[v]]` for the `Repr` of v.
Each `Repr` takes one of three possible forms:
- (A) a fixed-length form, used for simple values such as `Boolean`s
or `Float`s.
- (A) a type-specific form, used for simple values such as `Boolean`s
or `Float`s, for placeholders, and for introducing annotations.
- (B) a variable-length form with length specified up-front, used for
almost all `Record`s as well as for most `Sequence`s and `String`s,
when their sizes are known at the time serialization begins.
compound and variable-length atomic data structures when their
sizes are known at the time serialization begins.
- (C) a variable-length streaming form with unknown or unpredictable
length, used only seldom for `Record`s, since the number of fields
in a `Record` is usually statically known, but sometimes used for
`Sequence`s, `String`s etc., such as in cases when serialization
begins before the number of elements or bytes in the corresponding
`Value` is known.
length, used in cases when serialization begins before the number
of elements or bytes in the corresponding `Value` is known.
Applications may choose between formats B and C depending on their
needs at serialization time.
@ -455,30 +457,33 @@ Every `Repr` starts with a *lead byte*, constructed by
leadbyte(t,n,m) = [t*64 + n*16 + m]
The arguments `t` and `n` describe the rest of the
representation:[^some-encodings-unused]
The arguments `t`, `n` and `m` describe the rest of the
representation.[^some-encodings-unused]
[^some-encodings-unused]: Some encodings are unused. All such
encodings are reserved for future versions of this specification.
- `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation.
- `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s.
- `t`=0, `n`=2 (format C) is a Stream Start byte.
- `t`=0, `n`=3 (format C) is a Stream End byte.
- `t`=1 (format B) represents an `Atom` with variable-length binary representation.
- `t`=2 (format B) represents a `Record`.
- `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`.
| `t` | `n` | `m` | Meaning |
| --- | --- | --- | ------- |
| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation |
| 0 | 0 | 4 | (format C) Stream end |
| 0 | 0 | 5 | (format A) Annotation |
| 0 | 1 | | (format A) Placeholder for an application-specific `Value` |
| 0 | 2 | | (format C) Stream start |
| 0 | 3 | | (format A) Certain small `SignedInteger`s |
| 1 | | | (format B) An `Atom` with variable-length binary representation |
| 2 | | | (format B) A `Compound` with variable-length representation |
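
The lead byte layout can be sketched in a few lines of Python. This is a minimal illustration of the `leadbyte` formula and the table above, not a reference implementation; the helper name `split_leadbyte` is an assumption introduced here for illustration.

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack t (2 bits), n (2 bits) and m (4 bits) into one lead byte."""
    if not (0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16):
        raise ValueError("t and n must fit in 2 bits, m in 4 bits")
    return bytes([t * 64 + n * 16 + m])


def split_leadbyte(b: int):
    """Hypothetical inverse helper: recover (t, n, m) from a lead byte value."""
    return (b >> 6) & 3, (b >> 4) & 3, b & 15


# Stream end is leadbyte(0,0,4); an annotation is introduced by leadbyte(0,0,5).
stream_end = leadbyte(0, 0, 4)
annotation = leadbyte(0, 0, 5)
```

Because `t` and `n` occupy the top four bits, the high hexadecimal digit of a lead byte identifies the kind of `Repr`, and the low digit carries `m`.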
#### Encoding data of fixed length (format A).
#### Encoding data of type-specific length (format A).
Each specific type of data defines its own rules for this format.
Each type of data defines its own rules for this format.
#### Encoding data of known length (format B).
A `Repr` where the length of the `Value` to be encoded is variable but
known uses the value of `m` in `leadbyte` to encode its length. The
length counts *bytes* for atomic `Value`s, but counts *contained
values* for compound `Value`s.
Format B is used where the length `l` of the `Value` to be encoded is
known when serialization begins. Format B `Repr`s use `m` in
`leadbyte` to encode `l`. The length counts *bytes* for atomic
`Value`s, but counts *contained values* for compound `Value`s.
- A length `l` between 0 and 14 is represented using `leadbyte` with
`m=l`.
@ -503,11 +508,13 @@ definition,
> two's complement representation of the number in groups of 7 bits,
> least significant group first.
**Examples.**
The following table illustrates varint-encoding.
- The varint representation of 15 is just the byte 15.
- 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
- 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
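
The rows of the table can be reproduced with a short Python sketch. It assumes, consistently with the byte values shown, that every group except the last has its high bit set as a continuation marker; it handles only the non-negative numbers that arise as lengths and placeholder numbers.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    if m < 0:
        raise ValueError("this sketch only encodes non-negative integers")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # final group: high bit clear
            return bytes(out)
```

For instance, `varint(300)` yields the two bytes 172 and 2, matching the second row.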
#### Streaming data of unknown length (format C).
@ -516,61 +523,69 @@ not known at the time serialization of the `Value` starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
*chunks*, followed by a matching Stream End (“close”) byte:
open(t,n) = leadbyte(0,2, t*4 + n)
close(t,n) = leadbyte(0,3, t*4 + n)
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
close() = leadbyte(0,0, 4) = [0x04]
For a `Repr` of a `Value` containing binary data, each chunk is to be
a format B `Repr` of a `ByteString`, no matter the type of the overall
`Repr`.
For a format C `Repr` of an atomic `Value`, each chunk is to be a
format B `Repr` of a `ByteString`, no matter the type of the overall
`Value`. Annotations are not allowed on these individual chunks.
For a `Repr` of a `Value` containing other `Value`s, each chunk is to
be a single `Repr`.
For a format C `Repr` of a compound `Value`, each chunk is to be a
single `Repr`, which may itself be annotated.
Each chunk within a format C `Repr` *MUST* have non-zero length.
Software that decodes `Repr`s *MUST* reject `Repr`s that include
zero-length chunks.
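
As an illustration of the rejection rule, the following Python sketch decodes a format C `Repr` of an atomic `Value` and refuses zero-length chunks. The function names are hypothetical, and the sketch assumes the whole `Repr` is already in memory; a real decoder would read incrementally.

```python
def read_varint(data: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return result, pos


def read_streamed_atom(data: bytes) -> bytes:
    """Return the joined payload of a streamed String, ByteString or Symbol."""
    lead = data[0]
    # Stream start is 0010 ttnn; an atomic Value has tt = 01.
    if lead >> 4 != 0x2 or (lead >> 2) & 3 != 1:
        raise ValueError("not a Stream Start byte for an atomic Value")
    pos, payload = 1, bytearray()
    while data[pos] != 0x04:          # 0x04 is the Stream End byte
        chunk_lead = data[pos]
        pos += 1
        if chunk_lead >> 4 != 0x6:    # every chunk is a format B ByteString
            raise ValueError("chunk is not a format B ByteString")
        length = chunk_lead & 0x0F
        if length == 15:
            length, pos = read_varint(data, pos)
        if length == 0:
            raise ValueError("zero-length chunk")
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        payload += data[pos:pos + length]
        pos += length
    return bytes(payload)
```

Note that a stream with no chunks at all is still accepted, since the rule forbids empty chunks rather than empty streams.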
### Records.
Format B (known length):
[[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
[[ L(F_1...F_m) ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.
Format C (streaming):
[[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
[[ L(F_1...F_m) ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
Applications *SHOULD* prefer the known-length format for encoding
`Record`s.
#### Application-specific short form for labels.
### Placeholders.
Any given protocol using Preserves may additionally define an
interpretation for `n`∈{0,1,2}, mapping each *short form label
number* `n` to a specific record label. When encoding `m` fields with
short form label number `n`, format B becomes
Any given protocol using Preserves may define an interpretation for
numbered *placeholders* in the binary syntax, mapping each
*placeholder number* `n` to a specific `Value`. For example, a
placeholder number may be assigned for a frequently-used `Record`
label.
header(2,n,m) ++ [[F_1]] ++...++ [[F_m]]
A `Value` `v` for which placeholder number `n` has been assigned may
be tersely encoded as
and format C becomes
[[v]] = header(0,1,n) when n is a placeholder number for v
open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)
**Examples.** A protocol may choose to assign placeholder
number 4 to the symbol `void`, making
**Examples.** For example, a protocol may choose to map records
labelled `void` to `n=0`, making
[[void]] = header(0,1,4) = [0x14]
[[void()]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]
[[void()]] = header(2,0,0) = [0x80]
or it may map symbol `person` to placeholder number 102, making
or it may map records labelled `person` to short form label number 1,
making
[[person]] = header(0,1,102) = [0x1F, 0x66]
and so
[[person("Dr", "Elizabeth", "Blackwell")]]
= header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
for format B, or
= open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
= [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
= [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]
for format C.
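
The byte sequences in these examples can be checked with a small Python sketch. The helper names (`placeholder`, `string`, `record_b`, `record_c`) are assumptions made for illustration; `header` is written here as a lead byte with `m` set to the length when it is below 15, and otherwise `m`=15 followed by a varint, which agrees with `header(0,1,102) = [0x1F, 0x66]` above.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while True:
        group, m = m & 0x7F, m >> 7
        out.append(group | 0x80 if m else group)
        if not m:
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Lead byte carrying m directly when m < 15, else 15 plus a varint."""
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)


def placeholder(number: int) -> bytes:
    return header(0, 1, number)


def string(s: str) -> bytes:
    body = s.encode("utf-8")
    return header(1, 1, len(body)) + body


def record_b(label: bytes, fields: list) -> bytes:
    """Format B: the count covers the label plus the fields."""
    return header(2, 0, len(fields) + 1) + label + b"".join(fields)


def record_c(label: bytes, fields: list) -> bytes:
    """Format C: open(2,0), label, fields, then close()."""
    return bytes([0x20 + 2 * 4 + 0]) + label + b"".join(fields) + bytes([0x04])


person = placeholder(102)  # assumes the protocol assigned 102 to `person`
fields = [string("Dr"), string("Elizabeth"), string("Blackwell")]
known_length = record_b(person, fields)
streamed = record_c(person, fields)
```

Both encodings share the same label and field bytes; only the framing differs.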
@ -578,9 +593,9 @@ for format C.
Format B (known length):
[[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
Note that `m*2` is given to `header` for a `Dictionary`, since there
@ -588,10 +603,10 @@ are two `Value`s in each key-value pair.
Format C (streaming):
[[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
[[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
[[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close(3,2)
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ #set{X_1...X_m} ]] = open(2,2) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close()
Applications may use whichever format suits their needs on a
case-by-case basis.
@ -616,15 +631,13 @@ order.
the option of serializing with set elements and dictionary keys in
sorted order.
Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.
### SignedIntegers.
Format B/A (known length/fixed-size):
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x
header(0,1,x+16) if -3≤x<0
header(0,1,x) if 0≤x<13
header(0,3,x+16) if -3≤x<0
header(0,3,x) if 0≤x<13
Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
@ -644,22 +657,22 @@ needed to unambiguously identify the value and its sign, and `m =
For example,
[[ -257 ]] = 42 FE FF [[ -3 ]] = 1D [[ 128 ]] = 42 00 80
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 1E [[ 255 ]] = 42 00 FF
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 1F [[ 256 ]] = 42 01 00
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 10 [[ 32767 ]] = 42 7F FF
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 11 [[ 32768 ]] = 43 00 80 00
[[ -128 ]] = 41 80 [[ 12 ]] = 1C [[ 65535 ]] = 43 00 FF FF
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the value of `n` supplied
to `header`, `open`, and `close`. In each case, the payload following
the header is a binary sequence; for `String` and `Symbol`, it is a
UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
is the raw data contained within the `Value` unmodified.
to `header` and `open`. In each case, the payload following the header
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
encoding of the `Value`'s code points, while for `ByteString` it is
the raw data contained within the `Value` unmodified.
Format B (known length):
@ -671,7 +684,8 @@ Format B (known length):
To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close(1,n)`. Every chunk must be a `ByteString`.
`close()`. Every chunk must be a `ByteString`, and no chunk may be
annotated.
While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.
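
A minimal Python sketch of streaming a `String` follows. It splits the UTF-8 bytes at fixed offsets, so a multi-byte code point may straddle two chunks, which the rule above permits. The function name and the fixed `chunk_size` parameter are assumptions for illustration; chunk sizes are limited to 14 bytes only to keep the sketch free of varint handling.

```python
def stream_string(s: str, chunk_size: int) -> bytes:
    """Encode a String in format C using ByteString chunks of chunk_size bytes."""
    if not 1 <= chunk_size <= 14:
        raise ValueError("this sketch supports chunk sizes from 1 to 14 bytes")
    data = s.encode("utf-8")
    out = bytearray([0x20 + 1 * 4 + 1])          # open(1,1): streamed String
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]           # never empty inside this loop
        out.append(1 * 64 + 2 * 16 + len(chunk)) # format B ByteString header
        out += chunk
    out.append(0x04)                             # close()
    return bytes(out)
```

Encoding `"é"` with `chunk_size=1` produces two one-byte chunks, `C3` and `A9`, neither of which is valid UTF-8 in isolation, while the concatenated payload is.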
@ -680,8 +694,8 @@ valid UTF-8, individual chunks do not have to conform to UTF-8.
Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with
`open(0,n)` or `close(0,n)` for any `n`.
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
for any `n`.
#### Booleans.
@ -696,6 +710,18 @@ specify lengths. Applications *MUST NOT* use format C with
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
### Annotations.
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x05] ++ [[v]]`.
For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e.
an empty sequence annotated with two symbols, `a` and `b`, is
[[ @a @b [] ]]
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
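
The same construction can be written as a short Python sketch. The helpers `symbol` and `annotate` are hypothetical names introduced for illustration, and `symbol` handles only names shorter than 15 bytes.

```python
def symbol(name: str) -> bytes:
    """Format B Symbol; this sketch supports names of fewer than 15 bytes."""
    body = name.encode("utf-8")
    if len(body) >= 15:
        raise ValueError("longer symbols need a varint length")
    return bytes([1 * 64 + 3 * 16 + len(body)]) + body


def annotate(repr_bytes: bytes, *annotations: bytes) -> bytes:
    """Prefix a Repr with [0x05] ++ annotation for each annotation, in order."""
    prefix = b"".join(b"\x05" + a for a in annotations)
    return prefix + repr_bytes


empty_sequence = bytes([0x90])  # header(2,1,0)
annotated = annotate(empty_sequence, symbol("a"), symbol("b"))
```

Since each annotation is itself a `Repr`, a decoder that wants only the underlying `Value` can skip one complete `Repr` after each `0x05` byte.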
## Examples
### Simple examples.
@ -703,25 +729,25 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->
For the following examples, imagine an application that maps `Record`
short form label number 0 to label `discard`, 1 to `capture`, and 2 to
For the following examples, imagine an application that maps
placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to
`observe`.
| Value | Encoded hexadecimal byte sequence |
|---------------------------------------------------|----------------------------------------------------------------------|
| `capture(discard())` | 91 80 |
| `observe(speak(discard(), capture(discard())))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
| `[1 2 3 4]` (format B) | C4 11 12 13 14 |
| `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C |
| `[-2 -1 0 1]` | C4 1E 1F 10 11 |
| `capture(discard())` | 82 11 81 10 |
| `observe(speak(discard(), capture(discard())))` | 82 12 83 75 73 70 65 61 6B 81 10 82 11 81 10 |
| `[1 2 3 4]` (format B) | 94 31 32 33 34 |
| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 |
| `[-2 -1 0 1]` | 94 3E 3F 30 31 |
| `"hello"` (format B) | 55 68 65 6C 6C 6F |
| `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 04 |
| `"hello"` (format C, 5 chunks) | 25 62 68 65 62 6C 6C 60 60 61 6F 35 |
| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `"hello"` (format C, 5 chunks) | 25 61 68 61 65 61 6C 61 6C 61 6F 04 |
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 90 A0 01 00 |
| `-257` | 42 FE FF |
| `-1` | 1F |
| `0` | 10 |
| `1` | 11 |
| `-1` | 3F |
| `0` | 30 |
| `1` | 31 |
| `255` | 42 00 FF |
| `1.0f` | 02 3F 80 00 00 |
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
@ -733,20 +759,20 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
encodes to
B5 ;; Record, generic, 4+1
C5 ;; Sequence, 5
85 ;; Record, generic, 4+1
95 ;; Sequence, 5
76 74 69 74 6C 65 64 ;; Symbol, "titled"
76 70 65 72 73 6F 6E ;; Symbol, "person"
12 ;; SignedInteger, "2"
32 ;; SignedInteger, "2"
75 74 68 69 6E 67 ;; Symbol, "thing"
11 ;; SignedInteger, "1"
31 ;; SignedInteger, "1"
41 65 ;; SignedInteger, "101"
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
B4 ;; Record, generic, 3+1
84 ;; Record, generic, 3+1
74 64 61 74 65 ;; Symbol, "date"
42 07 1D ;; SignedInteger, "1821"
12 ;; SignedInteger, "2"
13 ;; SignedInteger, "3"
32 ;; SignedInteger, "2"
33 ;; SignedInteger, "3"
52 44 72 ;; String, "Dr"
[^extensibility2]: It happens to line up with Racket's
@ -787,19 +813,19 @@ read as `Symbol`s. The first example:
encodes to binary as follows:
E2
B2
55 "Image"
EC
BC
55 "Width" 42 03 20
55 "Title" 5F 14 "View from 15th Floor"
58 "Animated" 75 "false"
56 "Height" 42 02 58
59 "Thumbnail"
E6
B6
55 "Width" 41 64
53 "Url" 5F 26 "http://www.example.com/image/481989943"
56 "Height" 41 7D
53 "IDs" C4
53 "IDs" 94
41 74
42 03 AF
42 00 EA
@ -832,8 +858,8 @@ and the second example:
encodes to binary as follows:
C2
EF 10
92
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
@ -842,7 +868,7 @@ encodes to binary as follows:
55 "State" 52 "CA"
53 "Zip" 55 "94107"
57 "Country" 52 "US"
EF 10
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
@ -957,16 +983,17 @@ such media types following the general rules for ordering of
| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `mime(application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `mime(application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `mime(text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
| `mime(application/octet-stream #"abcde")` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `mime(application/xml #"<xhtml/>")` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `mime(text/csv #"123,234,345")` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of `mime` records may choose to use a
short form label number for the record type. For example, if short
form label number 1 were chosen, the second example above,
`mime(text/plain "ABC")`, would be encoded with "92" in place of "B3
74 6D 69 6D 65".
placeholder number for the symbol `mime` as well as the symbols for
individual media types. For example, if placeholder number 1 were
chosen for `mime`, and placeholder number 7 for `text/plain`, the
second example above, `mime(text/plain #"ABC")`, would be encoded as
`83 11 17 63 41 42 43`.
### Unicode normalization forms.
@ -1027,20 +1054,23 @@ or `date-time` productions of
## Security Considerations
**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
`Symbol`s may include chunks of zero length. This opens up a
possibility for denial-of-service: an attacker may begin streaming a
string, sending an endless sequence of zero length chunks, appearing
to make progress but not actually doing so. Implementations may place
optional reasonable restrictions on the number of consecutive empty
chunks that may appear in a stream, and may even supply an optional
mode that rejects empty chunks entirely.
**Empty chunks.** Chunks of zero length are prohibited in streamed
(format C) `Repr`s. However, a malicious or broken encoder may include
them nonetheless. This opens up a possibility for denial-of-service:
an attacker may begin streaming a `String`, for example, sending an
endless sequence of zero length chunks, appearing to make progress but
not actually doing so. Implementations *MUST* reject zero length
chunks when decoding, and *MUST NOT* produce them when encoding.
**Whitespace.** Similarly, the textual format for `Value`s allows
arbitrary whitespace in many positions. In streaming transfer
situations, consider optional restrictions on the amount of
consecutive whitespace that may appear in a serialized `Value`.
**Annotations.** Also similarly, in modes where a `Value` is being
read while annotations are skipped, an endless sequence of annotations
may give an illusion of progress.
**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
@ -1052,24 +1082,26 @@ same `Value` may yield different binary `Repr`s.
01 - True
02 - Float
03 - Double
(0x) RESERVED 04-0F
1x - Small integers 0..12,-3..-1
04 - End stream
05 - Annotation
(0x) RESERVED 06-0F
1x - Placeholder
2x - Start Stream
3x - End Stream
3x - Small integers 0..12,-3..-1
4x - SignedInteger
5x - String
6x - ByteString
7x - Symbol
8x - short form Record label index 0
9x - short form Record label index 1
Ax - short form Record label index 2
Bx - Record
8x - Record
9x - Sequence
Ax - Set
Bx - Dictionary
Cx - Sequence
Dx - Set
Ex - Dictionary
(Cx) RESERVED C0-CF
(Dx) RESERVED D0-DF
(Ex) RESERVED E0-EF
(Fx) RESERVED F0-FF
## Appendix. Bit fields within lead byte values
@ -1081,31 +1113,40 @@ same `Value` may yield different binary `Repr`s.
00 00 0001 True
00 00 0010 Float, 32 bits big-endian binary
00 00 0011 Double, 64 bits big-endian binary
00 00 0100 End Stream (to match a previous Start Stream)
00 00 0101 Annotation; two more Reprs follow
00 01 xxxx Small integers 0..12,-3..-1
00 01 mmmm Placeholder; m is the placeholder number
00 10 ttnn Start Stream <tt,nn>
When tt = 00 --> error
01 --> each chunk is a ByteString
1x --> each chunk is a single encoded Value
00 11 ttnn End Stream <tt,nn> (must match preceding Start Stream)
10 --> each chunk is a single encoded Value
11 --> error (RESERVED)
00 11 xxxx Small integers 0..12,-3..-1
01 00 mmmm SignedInteger, big-endian binary
01 01 mmmm String, UTF-8 binary
01 10 mmmm ByteString
01 11 mmmm Symbol, UTF-8 binary
10 00 mmmm application-specific Record
10 01 mmmm application-specific Record
10 10 mmmm application-specific Record
10 11 mmmm Record
10 00 mmmm Record
10 01 mmmm Sequence
10 10 mmmm Set
10 11 mmmm Dictionary
11 00 mmmm Sequence
11 01 mmmm Set
11 10 mmmm Dictionary
11 nn mmmm error, RESERVED
If mmmm = 1111, a varint(m) follows, giving the length, before
the body; otherwise, m is the length of the body to follow.
Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
decoding the varint that follows.
Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
is the length of the body that follows, counted in bytes for `tt`=`01`
and in `Repr`s for `tt`=`10`.
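
The interpretation rules above can be sketched as a small Python function. The name `read_length` and the string labels it returns are assumptions made for illustration; lead bytes that carry no `mmmm` length field (fixed-length atoms, stream markers, small integers, reserved values) are rejected by this sketch rather than decoded.

```python
def read_length(data: bytes, pos: int = 0):
    """Return (kind, l, next_pos) for a lead byte that carries an mmmm field."""
    lead = data[pos]
    pos += 1
    tt, nn, m = lead >> 6, (lead >> 4) & 3, lead & 15
    if (tt, nn) == (0, 1):
        kind = "placeholder"   # l is the placeholder number
    elif tt == 1:
        kind = "bytes"         # l counts bytes of an atomic Value
    elif tt == 2:
        kind = "reprs"         # l counts contained Reprs
    else:
        raise ValueError("lead byte 0x%02X has no length field" % lead)
    l = m
    if m == 15:                # the real length follows as a varint
        l, shift = 0, 0
        while True:
            b = data[pos]
            pos += 1
            l |= (b & 0x7F) << shift
            shift += 7
            if b < 0x80:
                break
    return kind, l, pos
```

For example, the bytes `BF 10` decode as a compound `Repr` containing 16 nested `Repr`s, with the body starting at offset 2.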
<!-- Not yet ready
## Appendix. Representing Values in Programming Languages
@ -1118,6 +1159,9 @@ When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.
Also, the presence or absence of annotations on a `Value` should not
affect comparisons of that `Value` to others in any way.
### JavaScript.
- `Boolean` ↔ `Boolean`
@ -1211,6 +1255,8 @@ or `Record`s.
- `Set` ↔ `Set`
- `Dictionary` ↔ `Dictionary`
-->
## Appendix. Why not Just Use JSON?
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
@ -1367,18 +1413,14 @@ Q. Should "symbols" instead be URIs? Relative, usually; relative to
what? Some domain-specific base URI?
Q. Literal small integers: are they pulling their weight? They're not
absolutely necessary. They mess up the connection between
value-type-ordering and repr-tag-ordering! (The connection between
*value* ordering and *repr* ordering is already irretrievably messed
up: length prefixes blow lexicographic ordering away, sign bits are
the wrong way around, floats are sign-magnitude, etc etc.)
absolutely necessary.
Q. Should we go for trying to make the data ordering line up with the
encoding ordering? We'd have to only use streaming forms, and avoid
the small integer encoding, and not store record arities, and sort
sets and dictionaries, and mask floats and doubles (perhaps
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
and pick a specific `NaN`, and I don't know what to do about
and perhaps pick a specific `NaN`, and I don't know what to do about
SignedIntegers. Perhaps make them more like float formats, with the
byte count acting as a kind of exponent underneath the sign bit.
@ -1413,11 +1455,3 @@ link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))
IN PROGRESS: Remove the special short syntax for application-specific record
label usage? Then perhaps 8x, 9x, Ax and Bx would work for Record,
Sequence, Set and Dictionary, leaving Cx, Dx, Ex and Fx entirely free.
TODO: Forbid empty chunks.
## Notes