MUCH simpler binary format, inspired by Syrup; alterations to text format

Tony Garnock-Jones 2020-12-28 23:25:02 +01:00
parent ccf4f97ed8
commit 5d719c2c6f
6 changed files with 399 additions and 596 deletions

NOTICE

@ -1,2 +1,2 @@
Preserves: an Expressive Data Language
Copyright 2018-2019 Tony Garnock-Jones
Copyright 2018-2020 Tony Garnock-Jones


@ -38,7 +38,7 @@ For that, see the [Preserves specification](preserves.html).
If you're familiar with JSON, Preserves looks fairly similar:
``` javascript
```
{"name": "Missy Rose",
"species": "Felis Catus",
"age": 13,
@ -49,35 +49,35 @@ Preserves also has something we can use for debugging/development
information called "annotations"; they aren't actually read in as data
but we can use them for comments.
(They can also be used for other development tools and are not
restricted to strings; more on this later, but for now interpret them
as comments.)
restricted to strings; more on this later, but for now, we will stick
to the special comment annotation syntax.)
``` javascript
@"I'm an annotation... basically a comment. Ignore me!"
"I'm data! Don't ignore me!"
```
;I'm an annotation... basically a comment. Ignore me!
"I'm data! Don't ignore me!"
```
Preserves supports some data types you're probably already familiar
with from JSON, and which look fairly similar in the textual format:
``` javascript
@"booleans"
#true
#false
@"various kinds of numbers:"
```
;booleans
#t
#f
;various kinds of numbers:
42
123556789012345678901234567890
-10
13.5
@"strings"
;strings
"I'm feeling stringy!"
@"sequences (lists)"
;sequences (lists)
["cat", "dog", "mouse", "goldfish"]
@"dictionaries (hashmaps)"
;dictionaries (hashmaps)
{"cat": "meow",
"dog": "woof",
"goldfish": "glub glub",
@ -90,16 +90,16 @@ with from JSON, and which look fairly similar in the textual format:
## Going beyond JSON
We can observe a few differences from JSON already; it's possible to
express numbers of arbitrary length in Preserves, and booleans look a little
*reliably* express integers of arbitrary length in Preserves, and booleans look a little
bit different.
A few more interesting differences:
``` javascript
@"Preserves treats commas as whitespace, so these are the same"
```
;Preserves treats commas as whitespace, so these are the same
["cat", "dog", "mouse", "goldfish"]
["cat" "dog" "mouse" "goldfish"]
@"We can use anything as keys in dictionaries, not just strings"
;We can use anything as keys in dictionaries, not just strings
{1: "the loneliest number",
["why", "was", 6, "afraid", "of", 7]: "because 7 8 9",
{"dictionaries": "as keys???"}: "well, why not?"}
@ -107,17 +107,17 @@ A few more interesting differences:
Preserves technically provides a few types of numbers:
``` javascript
@"Signed Integers"
```
;Signed Integers
42
-42
5907212309572059846509324862304968273468909473609826340
-5907212309572059846509324862304968273468909473609826340
@"Floats (Single-precision IEEE floats) (notice the trailing f)"
;Floats (Single-precision IEEE floats) (notice the trailing f)
3.1415927f
@"Doubles (Double-precision IEEE floats)"
;Doubles (Double-precision IEEE floats)
3.141592653589793
```
@ -129,33 +129,33 @@ Often they're meant to be used for something that has symbolic importance
to the program, but not textual importance (other than to guide the
programmer… not unlike variable names).
``` javascript
@"A symbol (NOT a string!)"
```
;A symbol (NOT a string!)
JustASymbol
@"You can do mixedCase or CamelCase too of course, pick your poison"
@"(but be consistent, for the sake of your collaborators!"
;You can do mixedCase or CamelCase too of course, pick your poison
;(but be consistent, for the sake of your collaborators!)
iAmASymbol
i-am-a-symbol
@"A list of symbols"
;A list of symbols
[GET, PUT, POST, DELETE]
@"A symbol with spaces in it"
;A symbol with spaces in it
|this is just one symbol believe it or not|
```
We can also add binary data, aka ByteStrings:
``` javascript
@"Some binary data, base64 encoded"
#base64{cGljdHVyZSBvZiBhIGNhdA==}
@"Some other binary data, hexadecimal encoded"
#hex{616263}
@"Same binary data as above, base64 encoded"
#base64{YWJj}
```
;Some binary data, base64 encoded
#[cGljdHVyZSBvZiBhIGNhdA==]
;Some other binary data, hexadecimal encoded
#x"616263"
;Same binary data as above, base64 encoded
#[YWJj]
```
What's neat about this is that we don't have to "pay the cost" of
@ -165,48 +165,41 @@ the length of the binary data is the length of the binary data.
Conveniently, Preserves also includes Sets, which are collections of
unique elements where ordering of items is unimportant.
``` javascript
#set{flour, salt, water}
```
#{flour, salt, water}
```
<a id="orgefafe56"></a>
## Total ordering and canonicalization
## Canonicalization
This is a good time to mention that even though from a semantic
perspective sets and dictionaries do not carry information about the
ordering of their elements (and Preserves doesn't care what order we
enter them in for our hand-written-as-text Preserves documents),
Preserves has a well-defined "total ordering".
[Preserves provides support for canonical ordering](canonical-binary.html)
when serializing.
Based on this total ordering, Preserves provides support for canonical
ordering when serializing; in this mode, Preserves will always write
out the elements in the same order, every time.
When combined with binary serialization, this is Preserves' "canonical
form".
This is important and useful for many contexts, but especially for
cryptographic signatures and hashing.
In canonicalizing output mode, Preserves will always write out a given
value using exactly the same bytes, every time. This is important and
useful for many contexts, but especially for cryptographic signatures
and hashing.
``` javascript
@"This hand-typed Preserves document..."
```
;This hand-typed Preserves document...
{monkey: {"noise": "ooh-ooh",
"eats": #set{"bananas", "berries"}}
"eats": #{"bananas", "berries"}}
cat: {"noise": "meow",
"eats": #set{"kibble", "cat treats", "tinned meat"}}}
@"Will always, always be written out in this order when canonicalized:"
{cat: {"eats": #set{"cat treats", "kibble", "tinned meat"},
"eats": #{"kibble", "cat treats", "tinned meat"}}}
;Will always, always be written out in this order (except in
;binary, of course) when canonicalized:
{cat: {"eats": #{"cat treats", "kibble", "tinned meat"},
"noise": "meow"}
monkey: {"eats": #set{"bananas", "berries"},
monkey: {"eats": #{"bananas", "berries"},
"noise": "ooh-ooh"}}
```
Clever implementations can get canonicalized output for free by
carefully ordering set elements and dictionary entries at construction
time, but even in simple implementations, canonical serialization is
almost as cheap as normal serialization.
<a id="org0366627"></a>
## Defining our own types using Records
@ -216,7 +209,7 @@ sense, it's a meta-type.
`Record` objects have a label and a series of arguments (or "fields").
For example, we can make a `Date` record:
``` javascript
```
<Date 2019 8 15>
```
@ -228,7 +221,7 @@ We could instead just decide to encode our date data in a string,
like "2019-08-15".
A document using such a date structure might look like so:
``` javascript
```
{"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body",
"born": "1915-10-04"}
@ -243,13 +236,13 @@ know the date exactly.
This causes a problem.
Now we might have two kinds of entries:
``` javascript
@"Exact date known"
```
;Exact date known
{"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body",
"born": "1915-10-04"}
@"Not sure about exact date..."
;Not sure about exact date...
{"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body",
"born": "Sometime in October 1915? Or was that when he became an insect?"}
@ -261,13 +254,13 @@ like a date", but doing this kind of thing is prone to errors and weird
edge cases.
No, it's better to be able to have a separate type:
``` javascript
@"Exact date known"
```
;Exact date known
{"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body",
"born": <Date 1915 10 04>}
@"Not sure about exact date..."
;Not sure about exact date...
{"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body",
"born": <Unknown "Sometime in October 1915? Or was that when he became an insect?">}
@ -285,7 +278,7 @@ the meaning the label signifies for it to be of use.
Still, there are plenty of interesting labels we can define.
Here is one for an "iri", a hyperlink:
``` javascript
```
<iri "https://dustycloud.org/blog/">
```
@ -294,11 +287,11 @@ Records are usually symbols but aren't necessarily so.
They can also be strings or numbers or even dictionaries.
And very interestingly, they can also be other records:
``` javascript
<<iri "https://www.w3.org/ns/activitystreams#Note">
{"to": [<iri "https://chatty.example/ben/">],
"attributedTo": <iri "https://social.example/alyssa/">,
"content": "Say, did you finish reading that book I lent you?"}>
```
< <iri "https://www.w3.org/ns/activitystreams#Note">
{"to": [<iri "https://chatty.example/ben/">],
"attributedTo": <iri "https://social.example/alyssa/">,
"content": "Say, did you finish reading that book I lent you?"} >
```
Do you see it? This Record's label is… an `iri` Record!
@ -327,16 +320,18 @@ Annotations are not strictly a necessary feature, but they are useful
in some circumstances.
We have previously shown them used as comments:
``` javascript
@"I'm a comment!"
```
;I'm a comment!
"I am not a comment, I am data!"
```
Annotations annotate the values they precede.
It is possible to have multiple annotations on a value.
The `;`-based comment syntax is syntactic sugar for the general
`@`-prefixed string annotation syntax.
``` javascript
@"I am annotating this number"
```
;I am annotating this number
@"And so am I!"
42
```
@ -349,7 +344,7 @@ Many implementations will, in the same mode, also supply line number
and column information attached to each read value.
So what's the point of them then?
If annotations were just for comments, there would be indeed hardly
If annotations were just for comments, there would be indeed hardly any
point at all… it would be simpler to just provide a comment syntax.
However, annotations can be used for more than just comments.
@ -360,13 +355,17 @@ For instance, here's a reply from an HTTP API service running in
"debug" mode annotated with the time it took to produce the reply and
the internal name of the server that produced the response:
``` javascript
```
@<ResponseTime <Milliseconds 64.4>>
@<BackendServer "humpty-dumpty.example.com">
<Success
<Employees [
<Employee "Alyssa P. Hacker" #set{<Role Programmer>, <Role Manager>}, <Date 2018, 1, 24>>
<Employee "Ben Bitdiddle" #set{<Role Programmer>}, <Date 2019, 2, 13>> ]>>
<Employee "Alyssa P. Hacker"
#{<Role Programmer>, <Role Manager>}
<Date 2018, 1, 24>>
<Employee "Ben Bitdiddle"
#{<Role Programmer>}
<Date 2019, 2, 13>> ]>>
```
The annotations aren't related to the data requested, which is all


@ -20,22 +20,17 @@ are equal.
This document specifies canonical form for the Preserves compact
binary syntax.
**General rules.**
Streaming formats ("format C") *MUST NOT* be used.
**Annotations.**
Annotations *MUST NOT* be present.
Whenever there is a choice between fixed-length ("format A") or
variable-length ("format B") formats, the fixed-length format *MUST* be
used.
**Sets.**
The elements of a `Set` *MUST* be serialized sorted in ascending order
following the total order relation defined in the
[Preserves specification][spec].
by comparing their canonical encoded binary representations.
**Dictionaries.**
The key-value pairs in a `Dictionary` *MUST* be serialized sorted in
ascending order by key, following the total order relation defined in
the [Preserves specification][spec].[^no-need-for-by-value]
ascending order by comparing the canonical encoded binary
representations of their keys.[^no-need-for-by-value]
[^no-need-for-by-value]: There is no need to order by (key, value)
pair, since a `Dictionary` has no duplicate keys.
@ -43,7 +38,9 @@ the [Preserves specification][spec].[^no-need-for-by-value]
**Other kinds of `Value`.**
There are no special canonicalization restrictions on
`SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s,
`Float`s, `Double`s, `Record`s, or `Sequence`s.
`Float`s, `Double`s, `Record`s, or `Sequence`s. The constraints given
for these `Value`s in the [specification][spec] suffice to ensure
canonicity.
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes


@ -65,28 +65,29 @@ interior portions of a tree.
## Comments.
`String` values used as annotations are conventionally interpreted as
comments.
comments. Special syntax exists for such string annotations, though
the usual `@`-prefixed annotation notation can also be used.
@"I am a comment for the Dictionary"
;I am a comment for the Dictionary
{
@"I am a comment for the key"
key: @"I am a comment for the value"
;I am a comment for the key
key: ;I am a comment for the value
value
}
@"I am a comment for this entire IOList"
;I am a comment for this entire IOList
[
#hex{00010203}
@"I am a comment for the middle half of the IOList"
@"A second comment for the same portion of the IOList"
@ @"I am the first and only comment for the following comment"
#x"00010203"
;I am a comment for the middle half of the IOList
;A second comment for the same portion of the IOList
@ ;I am the first and only comment for the following comment
"A third (itself commented!) comment for the same part of the IOList"
[
@"I am a comment for the following ByteString"
#hex{04050607}
#hex{08090A0B}
;"I am a comment for the following ByteString"
#x"04050607"
#x"08090A0B"
]
#hex{0C0D0E0F}
#x"0C0D0E0F"
]
## MIME-type tagged binary data.
@ -105,12 +106,17 @@ such media types following the general rules for ordering of
**Examples.**
| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `<mime text/plain #"ABC">` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `<mime application/xml #"<xhtml/>">` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `<mime text/csv #"123,234,345">` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
«<mime application/octet-stream #"abcde">»
= B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde" 84
«<mime text/plain #"ABC">»
= B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84
«<mime application/xml #"<xhtml/>">»
= B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84
«<mime text/csv #"123,234,345">»
= B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84
## Unicode normalization forms.


@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language"
---
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
May 2020. Version 0.0.8.
Jan 2021. Version 0.4.0.
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html
@ -12,6 +12,7 @@ May 2020. Version 0.0.8.
[LEB128]: https://en.wikipedia.org/wiki/LEB128
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
[abnf]: https://tools.ietf.org/html/rfc7405
[canonical]: canonical-binary.html
This document proposes a data model and serialization format called
*Preserves*.
@ -42,20 +43,20 @@ Our `Value`s fall into two broad categories: *atomic* and *compound*
data. Every `Value` is finite and non-cyclic.
Value = Atom
| Compound
| Compound
Atom = Boolean
| Float
| Double
| SignedInteger
| String
| ByteString
| Symbol
| Float
| Double
| SignedInteger
| String
| ByteString
| Symbol
Compound = Record
| Sequence
| Set
| Dictionary
| Sequence
| Set
| Dictionary
**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
@ -215,14 +216,13 @@ label-`Value` followed by its field-`Value`s.
`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as one or more values enclosed in curly braces, or zero
or more values enclosed by the tokens `#set{` and
written as values enclosed by the tokens `#{` and
`}`.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
Set = "#{" *Value ws "}"
[^printing-collections]: **Implementation note.** When implementing
printing of `Value`s using the textual syntax, consider supporting
@ -232,9 +232,10 @@ duplicate elements or for a dictionary to contain duplicate keys.
commas separating, and commas terminating elements or key/value
pairs within a collection.
`Boolean`s are the simple literal strings `#true` and `#false`.
`Boolean`s are the simple literal strings `#t` and `#f` for true and
false, respectively.
Boolean = %s"#true" / %s"#false"
Boolean = %s"#t" / %s"#f"
Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
@ -310,9 +311,10 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
the use of surrogate pairs for code points not in the Basic
Multilingual Plane. We encourage implementations to avoid escaping
such characters when producing output, and instead to rely on the
UTF-8 encoding of the entire document to handle them correctly.
Multilingual Plane. We encourage implementations to avoid using
`\u` escapes when producing output, and instead to rely on the
UTF-8 encoding of the entire document to handle non-ASCII
codepoints correctly.
A `ByteString` may be written in any of three different forms.
@ -327,16 +329,16 @@ value with `\x`.
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`.
with whitespace and surrounded by `#x"` and `"`.
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and
URL-safe Base64 characters are allowed.
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}" /
ByteString =/ "#[" *(ws / base64char) ws "]" /
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
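As a rough illustration of the two non-literal `ByteString` notations above, the following Python sketch renders raw bytes in the `#x"…"` and `#[…]` forms. The function names are hypothetical and not part of any Preserves implementation; only the surrounding delimiters come from the grammar.

```python
import base64


def hex_syntax(data: bytes) -> str:
    """Render bytes as pairs of hex digits surrounded by #x" and "."""
    return '#x"' + data.hex().upper() + '"'


def base64_syntax(data: bytes) -> str:
    """Render bytes as plain (non-URL-safe) Base64 surrounded by #[ and ]."""
    return "#[" + base64.b64encode(data).decode("ascii") + "]"


print(hex_syntax(b"abc"))      # #x"616263"
print(base64_syntax(b"abc"))   # #[YWJj]
```

Both calls describe the same three bytes, so a reader should produce equal `ByteString` values from either output.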
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
@ -365,10 +367,10 @@ double quote mark.
Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the
`Value` with `#value`.[^rationale-switch-to-binary]
`Value` with `#`.[^rationale-switch-to-binary]
[^no-literal-binary-in-text] [^compact-value-annotations]
Compact = %s"#value" ws ByteString
Compact = "#" ws ByteString
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
cannot express every `Value`: specifically, it cannot express the
@ -387,8 +389,8 @@ prefixing a `ByteString` containing the binary representation of the
access the representation of the text from within the text itself.
[^compact-value-annotations]: Any text-syntax annotations preceding
the `#value` are prepended to any binary-syntax annotations
yielded by decoding the `ByteString`.
the `#` are prepended to any binary-syntax annotations yielded by
decoding the `ByteString`.
### Annotations.
@ -403,6 +405,17 @@ Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s.
**Comments.** Strings annotating a `Value` are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.
Value =/ ws
";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
Value
When written this way, everything between the `;` and the newline is
included in the string annotating the `Value`.
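To make the relationship between the two notations concrete, here is a small Python sketch that rewrites one `;` comment line as the equivalent `@`-prefixed string annotation. The helper name is hypothetical, and it assumes that JSON-style escaping (via `json.dumps`) yields an acceptable Preserves string literal, on the basis that the string escaping rules are described as the same as for JSON.

```python
import json


def desugar_comment(line: str) -> str:
    """Turn ';text' into '@"text"', keeping everything after the ';'."""
    stripped = line.lstrip()
    if not stripped.startswith(";"):
        raise ValueError("not a comment line")
    # Everything between the ';' and the newline becomes the string.
    text = stripped[1:].rstrip("\r\n")
    # Assumption: JSON escaping is acceptable for the string literal.
    return "@" + json.dumps(text, ensure_ascii=False)


print(desugar_comment(";I'm a comment!\n"))   # @"I'm a comment!"
```

Note that no whitespace is trimmed after the `;`, so `; spaced` and `;spaced` produce different annotation strings.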
**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
@ -421,86 +434,25 @@ different.
## Compact Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of either a
`Value` or an annotation on a `Repr`.
Each `Repr` comprises one or more bytes describing the kind of
represented information and the length of the representation, followed
by the encoded details.
For a value `v`, we write `[[v]]` for the `Repr` of v.
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of v.
### Type and Length representation.
Each `Repr` takes one of three possible forms:
Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.
- (A) type-specific form, used for simple values such as `Boolean`s
or `Float`s as well as for introducing annotations.
tag (simple atomic data and small integers)
tag ++ binarydata (most integers)
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data)
- (B) a variable-length form with length specified up-front, used for
compound and variable-length atomic data structures when their
sizes are known at the time serialization begins.
The unique end tag is byte value `0x84`.
- (C) a variable-length streaming form with unknown or unpredictable
length, used in cases when serialization begins before the number
of elements or bytes in the corresponding `Value` is known.
Applications may choose between formats B and C depending on their
needs at serialization time.
#### The lead byte.
Every `Repr` starts with a *lead byte*, constructed by
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:
leadbyte(t,n,m) = [t*64 + n*16 + m]
The arguments `t`, `n` and `m` describe the rest of the
representation.[^some-encodings-unused]
[^some-encodings-unused]: Some encodings are unused. All such
encodings are reserved for future versions of this specification.
| `t` | `n` | `m` | Meaning |
| --- | --- | --- | ------- |
| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation |
| 0 | 0 | 4 | (format C) Stream end |
| 0 | 0 | 5 | (format A) Annotation |
| 0 | 2 | | (format C) Stream start |
| 0 | 3 | | (format A) Certain small `SignedInteger`s |
| 1 | | | (format B) An `Atom` with variable-length binary representation |
| 2 | | | (format B) A `Compound` with variable-length representation |
| 3 | 3 | 15 | (format A) 0xFF byte; no-op |
#### Encoding data of type-specific length (format A).
Each type of data defines its own rules for this format.
Of particular note is lead byte `0xFF`, which is a no-op byte acting
as a kind of pseudo-whitespace in a binary-syntax encoding.
#### Encoding data of known length (format B).
Format B is used where the length `l` of the `Value` to be encoded is
known when serialization begins. Format B `Repr`s use `m` in
`leadbyte` to encode `l`. The length counts *bytes* for atomic
`Value`s, but counts *contained values* for compound `Value`s.
- A length `l` between 0 and 14 is represented using `leadbyte` with
`m=l`.
- A length of 15 or greater is represented by `m=15` and additional
bytes describing the length following the lead byte.
The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:
header(t,n,m) = leadbyte(t,n,m) when m < 15
or leadbyte(t,n,15) ++ varint(m) otherwise
The additional length bytes are formatted as
[base 128 varints][varint].[^see-also-leb128] We write `varint(m)` for
the varint-encoding of `m`. Quoting the
If present after a tag, the length of a following piece of binary data
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
write `varint(m)` for the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition,
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
@ -515,174 +467,114 @@ the varint-encoding of `m`. Quoting the
The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. However,
the `varint(m)` encoding of a length *MUST NOT* be used when `m`<15,
meaning that a `Repr` *MUST NOT* contain any varint-encoding with
final byte `0`.
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
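The varint rules can be sketched in a few lines of Python. This is only an illustration of the base-128 scheme described here, not code from any Preserves implementation; the function name simply mirrors the `varint(m)` notation. Because the loop stops as soon as no bits remain, it always produces the unique shortest encoding.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (unsigned LEB128)."""
    if m < 0:
        raise ValueError("varint is only defined for non-negative integers")
    out = bytearray()
    while True:
        low7 = m & 0x7F          # least-significant 7-bit group goes first
        m >>= 7
        if m:
            out.append(low7 | 0x80)  # more groups follow: set continuation bit
        else:
            out.append(low7)         # final group: continuation bit clear
            return bytes(out)


for m in (15, 300, 1000000000):
    print(m, list(varint(m)))
# 15 [15]
# 300 [172, 2]
# 1000000000 [128, 148, 235, 220, 3]
```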
#### Streaming data of unknown length (format C).
### Records, Sequences, Sets and Dictionaries.
A `Repr` where the length of the `Value` to be encoded is variable and
not known at the time serialization of the `Value` starts is encoded
by a single Stream Start (“open”) byte, followed by zero or more
*chunks*, followed by a matching Stream End (“close”) byte:
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
close() = leadbyte(0,0, 4) = [0x04]
For a format C `Repr` of an atomic `Value`, each chunk is to be a
format B `Repr` of a `ByteString`, no matter the type of the overall
`Value`. Annotations are not allowed on these individual chunks.
For a format C `Repr` of a compound `Value`, each chunk is to be a
single `Repr`, which may itself be annotated.
Each chunk within a format C `Repr` *MUST* have non-zero length.
Software that decodes `Repr`s *MUST* reject `Repr`s that include
zero-length chunks.
### Records.
Format B (known length):
[[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.
Format C (streaming):
[[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
Applications *SHOULD* prefer the known-length format for encoding
`Record`s.
### Sequences, Sets and Dictionaries.
Format B (known length):
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.
Format C (streaming):
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close()
Applications may use whichever format suits their needs on a
case-by-case basis.
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct.
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.
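The new tag-and-end-tag layout for compound data can be sketched as follows. This Python fragment assumes its inputs are already-encoded `Repr`s given as `bytes`; the helper names are hypothetical. Sets and dictionaries are written in the recommended default order, sorted lexicographically by `Repr`, which Python's ordinary `bytes` comparison provides.

```python
from typing import Mapping

END = b"\x84"  # the unique end tag


def record(label: bytes, *fields: bytes) -> bytes:
    return b"\xB4" + label + b"".join(fields) + END


def sequence(*items: bytes) -> bytes:
    return b"\xB5" + b"".join(items) + END


def set_of(*elements: bytes) -> bytes:
    if len(set(elements)) != len(elements):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xB6" + b"".join(sorted(elements)) + END


def dictionary(pairs: Mapping[bytes, bytes]) -> bytes:
    # Keys of a mapping are already distinct; sort entries by key Repr.
    body = b"".join(key + pairs[key] for key in sorted(pairs))
    return b"\xB7" + body + END


# <mime text/plain #"ABC">, built from pre-encoded symbol and bytestring Reprs
mime = record(b"\xB3\x04mime", b"\xB3\x0Atext/plain", b"\xB2\x03ABC")
print(mime.hex())
```

Sorting the encoded bytes rather than the decoded values keeps the sketch independent of any semantic ordering on `Value`s.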
[^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s, because (a)
where canonicalization is used for cryptographic signatures, it is
more reliable to simply retain the exact binary form of the signed
document than to depend on canonical de- and re-serialization, and
(b) sorting keys or elements makes no sense in streaming
serialization formats.
elements) be in sorted order for serialized `Value`s; however, a
[canonical form][canonical] for `Repr`s does exist where a sorted
ordering is required.
However, a quality implementation may wish to offer the programmer
the option of serializing with set elements and dictionary keys in
sorted order.
[^not-sorted-semantically]: It's important to note that the sort
ordering for writing out set elements and dictionary key/value
pairs is *not* the same as the sort ordering implied by the
semantic ordering of those elements or keys. For example, the
`Repr` of a negative number very far from zero will start with a
byte that is *greater* than the byte which starts the `Repr` of
zero, making it sort lexicographically later by `Repr`, despite
being semantically *less than* zero.
**Rationale**. This is for ease-of-implementation reasons: not all
languages can easily represent sorted sets or sorted dictionaries,
but encoding and then sorting byte strings is much more likely to
be within easy reach.
### SignedIntegers.
Format B/A (known length/fixed-size):
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
([0xA0] + x) if (-3≤x≤-1)
([0x90] + x) if ( 0≤x≤12)
where m = |intbytes(x)|
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x
header(0,3,x+16) if -3≤x<0
header(0,3,x) if 0≤x<13
Integers in the range [-3,12] are compactly represented using format A
because they are so frequently used. Other integers are represented
using format B.
Format C *MUST NOT* be used for `SignedInteger`s. Format A *MUST* be
used for integers in the range -3 to 12, inclusive.
Integers in the range [-3,12] are compactly represented with tags
between `0x90` and `0x9F` because they are so frequently used.
Integers up to 16 bytes long are represented with a single-byte tag
encoding the length of the integer. Larger integers are represented
with an explicit varint length. Every `SignedInteger` *MUST* be
represented with its shortest possible encoding.
The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,
«87112285931760246646623899502532662132736»
= B0 12 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
[^zero-intbytes]: The value 0 needs zero bytes to identify the
value, so `intbytes(0)` is the empty byte string. Non-zero values
need at least one byte.
For example,
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
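The tag-based `«x»` rules for `SignedInteger` can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, and `varint` is assumed to be the usual base-128 encoding (low-order 7-bit group first, high bit set on every byte except the last), since its definition lies outside this section.

```python
def varint(n: int) -> bytes:
    """Assumed base-128 varint: low 7 bits first, continuation bit 0x80."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude, excluding the sign bit.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    # One extra bit for the sign, rounded up to whole bytes.
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    if 0 <= x <= 12:
        return bytes([0x90 + x])          # tags 0x90..0x9C
    if -3 <= x <= -1:
        return bytes([0xA0 + x])          # tags 0x9D..0x9F
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return bytes([0xB0]) + varint(m) + body
```

With these definitions, `encode_signed_integer(-257)` yields the bytes `A1 FE FF` and `encode_signed_integer(13)` yields `A0 0D`, matching the examples above.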
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the value of `n` supplied
to `header` and `open`. In each case, the payload following the header
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
encoding of the `Value`'s code points, while for `ByteString` it is
the raw data contained within the `Value` unmodified.
Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.
Format B (known length):
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
[[ S ]] = header(1,n,m) ++ encode(S)
where m = |encode(S)|
and (n,encode(S)) = (1,utf8(S)) if S ∈ String
(2,S) if S ∈ ByteString
(3,utf8(S)) if S ∈ Symbol
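As an illustration of the tag-based rules for `String`, `ByteString` and `Symbol`, here is a minimal Python sketch. The function names are hypothetical, and `varint` is again assumed to be a base-128 encoding with the low-order group first, which this section does not itself define. Note that the length prefix counts UTF-8 bytes, not code points.

```python
def varint(n: int) -> bytes:
    """Assumed base-128 varint: low 7 bits first, continuation bit 0x80."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def _tagged(tag: int, payload: bytes) -> bytes:
    # tag byte, then payload length in bytes, then the payload itself
    return bytes([tag]) + varint(len(payload)) + payload


def encode_string(s: str) -> bytes:
    return _tagged(0xB1, s.encode("utf-8"))


def encode_bytestring(b: bytes) -> bytes:
    return _tagged(0xB2, b)


def encode_symbol(name: str) -> bytes:
    return _tagged(0xB3, name.encode("utf-8"))
```

For instance, `encode_string("hello")` produces `B1 05` followed by the five ASCII bytes of `hello`, while `encode_string("é")` produces `B1 02 C3 A9` because the single code point occupies two UTF-8 bytes.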
### Booleans.
To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
then a sequence of zero or more format B chunks, followed by
`close()`. Every chunk must be a `ByteString`, and no chunk may be
annotated.
«#f» = [0x80]
«#t» = [0x81]
While the overall content of a streamed `String` or `Symbol` must be
valid UTF-8, individual chunks do not have to conform to UTF-8.
### Floats and Doubles.
### Fixed-length Atoms.
Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
for any `n`.
#### Booleans.
[[ #false ]] = header(0,0,0) = [0x00]
[[ #true ]] = header(0,0,1) = [0x01]
#### Floats and Doubles.
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
«F» when F ∈ Float = [0x82] ++ binary32(F)
«D» when D ∈ Double = [0x83] ++ binary64(D)
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
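A short Python sketch of the tag-based boolean, `Float` and `Double` encodings, using the standard `struct` module for the big-endian IEEE 754 layouts. The function names are hypothetical and chosen only for illustration.

```python
import struct


def encode_boolean(b: bool) -> bytes:
    return b"\x81" if b else b"\x80"


def encode_float(f: float) -> bytes:
    # ">f" packs a big-endian 4-byte IEEE 754 binary32 value
    return b"\x82" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    # ">d" packs a big-endian 8-byte IEEE 754 binary64 value
    return b"\x83" + struct.pack(">d", d)
```

Here `encode_float(1.0)` gives `82 3F 80 00 00` and `encode_double(1.0)` gives `83 3F F0 00 00 00 00 00 00`.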
@ -690,40 +582,43 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
### Annotations.
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x05] ++ [[v]]`.
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is
For example, the `Repr` corresponding to textual syntax `@a@b[]`,
i.e. an empty sequence annotated with two symbols, `a` and `b`, is
[[ @a @b [] ]]
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
«@a @b []»
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
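The `0x85` prefixing rule can be sketched in Python as below. The helper name `annotate` is hypothetical; it takes already-encoded `Repr`s as byte strings and prepends one `0x85`-tagged annotation per argument, in order.

```python
ANNOTATION_TAG = b"\x85"


def annotate(repr_bytes: bytes, *annotation_reprs: bytes) -> bytes:
    """Prefix repr_bytes with each encoded annotation, outermost first."""
    prefix = b"".join(ANNOTATION_TAG + a for a in annotation_reprs)
    return prefix + repr_bytes


symbol_a = bytes.fromhex("B30161")    # «a»
symbol_b = bytes.fromhex("B30162")    # «b»
empty_sequence = bytes.fromhex("B584")  # «[]»
annotated = annotate(empty_sequence, symbol_a, symbol_b)
```

The value of `annotated` is the ten-byte sequence `85 B3 01 61 85 B3 01 62 B5 84` given for `@a @b []`.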
## Examples
### Ordering.
The total ordering specified [above](#total-order) means that the following statements are true:
"bzz" < "c" < "caa"
#t < 3.0f < 3.0 < 3 < "3" < |3| < []
### Simple examples.
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->
| Value | Encoded byte sequence |
|---------------------------------------------------|-------------------------------------------------------------------------------------|
| `<capture <discard>>` | 82 77 'c' 'a' 'p' 't' 'u' 'r' 'e' 81 77 'd' 'i' 's' 'c' 'a' 'r' 'd' |
| `[1 2 3 4]` (format B) | 94 31 32 33 34 |
| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 |
| `[-2 -1 0 1]` | 94 3E 3F 30 31 |
| `"hello"` (format B) | 55 'h' 'e' 'l' 'l' 'o' |
| `"hello"` (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 35 |
| `"hello"` (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 |
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
| `-257` | 42 FE FF |
| `-1` | 3F |
| `0` | 30 |
| `1` | 31 |
| `255` | 42 00 FF |
| `1.0f` | 02 3F 80 00 00 |
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
| Value | Encoded byte sequence |
|-----------------------------|---------------------------------------------------------------------------------|
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
| `[1 2 3 4]` | B5 91 92 93 94 84 |
| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' |
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
| `-257` | A1 FE FF |
| `-1` | 9F |
| `0` | 90 |
| `1` | 91 |
| `255` | A1 00 FF |
| `1.0f` | 82 3F 80 00 00 |
| `1.0` | 83 3F F0 00 00 00 00 00 00 |
| `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 |
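Several rows of the tag-based table can be reproduced with a small recursive encoder. The following Python sketch is deliberately partial: the `Symbol` and `Record` classes are hypothetical stand-ins for the corresponding `Value` types, only integers in the range -3 to 12 are handled, payloads are assumed to be shorter than 128 bytes so that the length fits in one byte, and sets, dictionaries, floats and annotations are omitted.

```python
from dataclasses import dataclass


class Symbol(str):
    """Hypothetical marker type distinguishing Symbols from Strings."""


@dataclass(frozen=True)
class Record:
    """Hypothetical Record: a label followed by zero or more fields."""
    label: object
    fields: tuple = ()


END = b"\x84"


def _short(tag: int, payload: bytes) -> bytes:
    if len(payload) >= 128:
        raise NotImplementedError("multi-byte lengths omitted from this sketch")
    return bytes([tag, len(payload)]) + payload


def encode(v) -> bytes:
    if isinstance(v, bool):               # must precede the int check
        return b"\x81" if v else b"\x80"
    if isinstance(v, int):
        if 0 <= v <= 12:
            return bytes([0x90 + v])
        if -3 <= v <= -1:
            return bytes([0xA0 + v])
        raise NotImplementedError("larger integers omitted from this sketch")
    if isinstance(v, Symbol):             # must precede the str check
        return _short(0xB3, v.encode("utf-8"))
    if isinstance(v, str):
        return _short(0xB1, v.encode("utf-8"))
    if isinstance(v, bytes):
        return _short(0xB2, v)
    if isinstance(v, Record):
        body = encode(v.label) + b"".join(encode(f) for f in v.fields)
        return b"\xB4" + body + END
    if isinstance(v, (list, tuple)):
        return b"\xB5" + b"".join(encode(item) for item in v) + END
    raise TypeError(f"unsupported value: {v!r}")
```

Under these assumptions, `encode([1, 2, 3, 4])` gives `B5 91 92 93 94 84`, and `encode(Record(Symbol("capture"), (Record(Symbol("discard")),)))` gives the bytes listed for `<capture <discard>>`.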
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
@ -731,21 +626,24 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
encodes to
85 ;; Record, generic, 4+1
95 ;; Sequence, 5
76 74 69 74 6C 65 64 ;; Symbol, "titled"
76 70 65 72 73 6F 6E ;; Symbol, "person"
32 ;; SignedInteger, "2"
75 74 68 69 6E 67 ;; Symbol, "thing"
31 ;; SignedInteger, "1"
41 65 ;; SignedInteger, "101"
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
84 ;; Record, generic, 3+1
74 64 61 74 65 ;; Symbol, "date"
42 07 1D ;; SignedInteger, "1821"
32 ;; SignedInteger, "2"
33 ;; SignedInteger, "3"
52 44 72 ;; String, "Dr"
B4 ;; Record
B5 ;; Sequence
B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
92 ;; SignedInteger, "2"
B3 05 74 68 69 6E 67 ;; Symbol, "thing"
91 ;; SignedInteger, "1"
84 ;; End (sequence)
A0 65 ;; SignedInteger, "101"
B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
B4 ;; Record
B3 04 64 61 74 65 ;; Symbol, "date"
A1 07 1D ;; SignedInteger, "1821"
92 ;; SignedInteger, "2"
93 ;; SignedInteger, "3"
84 ;; End (record)
B1 02 44 72 ;; String, "Dr"
84 ;; End (record)
[^extensibility2]: It happens to line up with Racket's
representation of a record label for an inheritance hierarchy
@ -785,23 +683,27 @@ read as `Symbol`s. The first example:
encodes to binary as follows:
B2
55 "Image"
BC
55 "Width" 42 03 20
55 "Title" 5F 14 "View from 15th Floor"
58 "Animated" 75 "false"
56 "Height" 42 02 58
59 "Thumbnail"
B6
55 "Width" 41 64
53 "Url" 5F 26 "http://www.example.com/image/481989943"
56 "Height" 41 7D
53 "IDs" 94
41 74
42 03 AF
42 00 EA
43 00 97 89
B7
B1 05 "Image"
B7
B1 05 "Title" B1 14 "View from 15th Floor"
B1 05 "Width" A1 03 20
B1 06 "Height" A1 02 58
B1 08 "Animated" B3 05 "false"
B1 09 "Thumbnail"
B7
B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
B1 03 "IDs" B5
A0 74
A1 03 AF
A1 00 EA
A2 00 97 89
84
B1 05 "Width" A0 64
B1 06 "Height" A0 7D
84
84
84
and the second example:
@ -830,55 +732,51 @@ and the second example:
encodes to binary as follows:
92
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
57 "Address" 50
54 "City" 5D "SAN FRANCISCO"
55 "State" 52 "CA"
53 "Zip" 55 "94107"
57 "Country" 52 "US"
BF 10
59 "precision" 53 "zip"
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
57 "Address" 50
54 "City" 59 "SUNNYVALE"
55 "State" 52 "CA"
53 "Zip" 55 "94085"
57 "Country" 52 "US"
B5
B7
B1 03 "Zip" B1 05 "94107"
B1 04 "City" B1 0D "SAN FRANCISCO"
B1 05 "State" B1 02 "CA"
B1 07 "Address" B1 00
B1 07 "Country" B1 02 "US"
B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
B1 09 "precision" B1 03 "zip"
84
B7
B1 03 "Zip" B1 05 "94085"
B1 04 "City" B1 09 "SUNNYVALE"
B1 05 "State" B1 02 "CA"
B1 07 "Address" B1 00
B1 07 "Country" B1 02 "US"
B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
B1 09 "precision" B1 03 "zip"
84
84
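The `B7 … 84` dictionary layout used in this example can be sketched in Python as follows. The function names are hypothetical. Entries are emitted in ascending order of their encoded key bytes, which is an assumption made to match the key order visible in this example rather than a rule stated in this section. Only `String`, `Double` and nested dictionary values are supported, and strings are assumed to be shorter than 128 bytes.

```python
import struct


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    if len(data) >= 128:
        raise NotImplementedError("multi-byte lengths omitted from this sketch")
    return bytes([0xB1, len(data)]) + data


def encode_value(v) -> bytes:
    if isinstance(v, str):
        return encode_string(v)
    if isinstance(v, float):
        return b"\x83" + struct.pack(">d", v)
    if isinstance(v, dict):
        return encode_dictionary(v)
    raise TypeError(f"unsupported value: {v!r}")


def encode_dictionary(d: dict) -> bytes:
    # Encode every key and value, then order entries by encoded key bytes
    # (assumed ordering; keys are unique, so values never decide the order).
    entries = sorted((encode_value(k), encode_value(v)) for k, v in d.items())
    return b"\xB7" + b"".join(k + v for k, v in entries) + b"\x84"
```

Because the length byte precedes the key text, `"Zip"` (3 bytes) sorts before `"City"` (4 bytes), which in turn sorts before `"State"` (5 bytes), as in the listing above.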
## Security Considerations
**Empty chunks.** Chunks of zero length are prohibited in streamed
(format C) `Repr`s. However, a malicious or broken encoder may include
them nonetheless. This opens up a possibility for denial-of-service:
an attacker may begin streaming a `String`, for example, sending an
endless sequence of zero length chunks, appearing to make progress but
not actually doing so. Implementations *MUST* reject zero length
chunks when decoding, and *MUST NOT* produce them when encoding.
**Whitespace.** The textual format allows arbitrary whitespace in many
positions. Consider optional restrictions on the amount of consecutive
whitespace that may appear.
**Whitespace and no-ops.** Similarly, the binary format allows `0xFF`
no-ops and the textual format allows arbitrary whitespace in many
positions. In streaming transfer situations, consider optional
restrictions on the amount of consecutive whitespace or the number of
consecutive no-ops that may appear.
**Annotations.** Similarly, in modes where a `Value` is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.
**Annotations.** Also similarly, in modes where a `Value` is being
read while annotations are skipped, an endless sequence of annotations
may give an illusion of progress.
**Canonical form for cryptographic hashing and signing.** As
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.
**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.
## Acknowledgements
The use of low-order bits of each lead byte for the length of short
values is inspired by a similar feature of [CBOR](http://cbor.io/).
The use of the low-order bits in certain SignedInteger tags for the
length of the following data is inspired by a similar feature of
[CBOR](http://cbor.io/).
The treatment of commas as whitespace in the text syntax is inspired
by the same feature of [EDN](https://github.com/edn-format/edn).
@ -889,126 +787,42 @@ syntax.
## Appendix. Autodetection of textual or binary syntax
Whitespace characters `0x09` (ASCII HT (tab)), `0x0A` (LF), `0x0D`
(CR), `0x20` (space) and `0x2C` (comma) are ignored at the start of a
textual-syntax Preserves `Document`, and their UTF-8 encodings are
reserved lead byte values in binary-syntax Preserves.
Every tag byte in a binary Preserves `Document` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.
The byte `0xFF`, signifying a no-op in binary-syntax Preserves, has no
meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in
a valid textual-syntax Preserves `Document`.
Conversely, a UTF-8 document must start with a valid codepoint,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.
If applications prefix their textual-syntax documents with e.g. a
space or newline character, and their binary-syntax documents with a
`0xFF` byte, consumers of these documents may reliably autodetect the
syntax being used. In a network protocol supporting this kind of
autodetection, clients may transmit LF or `0xFF` to select text or
binary syntax, respectively.
Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are `10`, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.
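The top-two-bits check can be sketched in Python as below. The function name `detect_syntax` and its return values are hypothetical; the sketch only inspects the first byte and does not validate the rest of the document.

```python
def detect_syntax(document: bytes) -> str:
    """Return "binary" if the first byte has top bits 10, else "text"."""
    if not document:
        raise ValueError("cannot detect the syntax of an empty document")
    # 0xC0 masks the top two bits; 0x80 is the bit pattern 10xxxxxx.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"
```

For example, `detect_syntax(bytes.fromhex("B584"))` returns `"binary"`, while `detect_syntax(b"[1 2 3]")` returns `"text"`. A text document beginning with a non-ASCII character such as `é` is also classified as text, because its first UTF-8 byte (`0xC3`) has top bits `11`.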
Furthermore, if an application consistently uses `Record`s for its
top-level messages,[^records-and-nonatoms] eschewing `Atom`s in
particular, then autodetection of the encoding used for a given input
can be done as follows:
## Appendix. Table of tag values
| First byte of encoded input | Encoding | Other conclusions |
| --- | --- | --- |
| `0x80`--`0x8F` | binary | `Record` (format B) |
| `0x28` | binary | `Record` (format C) |
| `0x05` | binary | annotated value (presumably a `Record`) |
| `0xFF` | binary | no-op; value will follow |
| --- | --- | --- |
| `0x3C` ("<") | text | `Record` |
| `0x40` ("@") | text | annotated value (presumably a `Record`) |
| `0x09`, `0x0A`, `0x0D`, `0x20` or `0x2C` | text | whitespace; value will follow |
80 - False
81 - True
82 - Float
83 - Double
84 - End marker
85 - Annotation
(8x) RESERVED 86-8F
[^records-and-nonatoms]: Similar reasoning can be used to permit
unambiguous detection of encoding when `Collection`s are allowed
as top-level messages as well as `Record`s.
9x - Small integers 0..12,-3..-1
An - Small integers, (n+1) bytes long
B0 - Large integers, variable length
B1 - String
B2 - ByteString
B3 - Symbol
## Appendix. Table of lead byte values
00 - False
01 - True
02 - Float
03 - Double
04 - End stream
05 - Annotation
(0x) RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
(1x) RESERVED
2x - Start Stream (NB. 20, 2C specially reserved)
3x - Small integers 0..12,-3..-1
4x - SignedInteger
5x - String
6x - ByteString
7x - Symbol
8x - Record
9x - Sequence
Ax - Set
Bx - Dictionary
(Cx) RESERVED C0-CF
(Dx) RESERVED D0-DF
(Ex) RESERVED E0-EF
(Fx) RESERVED F0-FE
FF No-op
## Appendix. Bit fields within lead byte values
tt nn mmmm contents
---------- ---------
00 00 0000 False
00 00 0001 True
00 00 0010 Float, 32 bits big-endian binary
00 00 0011 Double, 64 bits big-endian binary
00 00 0100 End Stream (to match a previous Start Stream)
00 00 0101 Annotation; two more Reprs follow
00 00 1001 (ASCII HT (tab)) \
00 00 1010 (ASCII LF) |- Reserved: may be used to indicate
00 00 1101 (ASCII CR) / use of text encoding
00 01 xxxx error, RESERVED
00 10 ttnn Start Stream <tt,nn>
When tt = 00 --> error
When nn = 00 --> (ASCII space)
Reserved: may be used to indicate
use of text encoding
otherwise --> error
01 --> each chunk is a ByteString
10 --> each chunk is a single encoded Value
11 --> error (RESERVED)
When nn = 00 --> (ASCII comma)
Reserved: may be used to indicate
use of text encoding
otherwise --> error
00 11 xxxx Small integers 0..12,-3..-1
01 00 mmmm SignedInteger, big-endian binary
01 01 mmmm String, UTF-8 binary
01 10 mmmm ByteString
01 11 mmmm Symbol, UTF-8 binary
10 00 mmmm Record
10 01 mmmm Sequence
10 10 mmmm Set
10 11 mmmm Dictionary
11 00 xxxx error, RESERVED
11 01 xxxx error, RESERVED
11 10 xxxx error, RESERVED
11 11 1111 no-op; unambiguous indication of binary Preserves format
Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
decoding the varint that follows.
Then, `l` is the length of the body that follows, counted in bytes for
`tt`=`01` and in `Repr`s for `tt`=`10`.
B4 - Record
B5 - Sequence
B6 - Set
B7 - Dictionary
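As a reading aid for the `80`–`B7` tag table, the following Python sketch maps a tag byte to a description. The function name and the description strings are hypothetical. Bytes that the table does not list (including `B8`–`BF`) are rejected rather than guessed at.

```python
FIXED_TAGS = {
    0x80: "False", 0x81: "True", 0x82: "Float", 0x83: "Double",
    0x84: "End marker", 0x85: "Annotation",
    0xB0: "SignedInteger, varint length",
    0xB1: "String", 0xB2: "ByteString", 0xB3: "Symbol",
    0xB4: "Record", 0xB5: "Sequence", 0xB6: "Set", 0xB7: "Dictionary",
}


def describe_tag(tag: int) -> str:
    if tag in FIXED_TAGS:
        return FIXED_TAGS[tag]
    if 0x86 <= tag <= 0x8F:
        return "RESERVED"
    if 0x90 <= tag <= 0x9C:
        return f"Small integer {tag - 0x90}"
    if 0x9D <= tag <= 0x9F:
        return f"Small integer {tag - 0xA0}"
    if 0xA0 <= tag <= 0xAF:
        return f"SignedInteger, {tag - 0xA0 + 1} bytes"
    raise ValueError(f"0x{tag:02X} is not listed in the tag table")
```

For example, `describe_tag(0x9D)` returns `"Small integer -3"` and `describe_tag(0xA1)` returns `"SignedInteger, 2 bytes"`.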
## Appendix. Binary SignedInteger representation
@ -1016,17 +830,17 @@ Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values.
| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| -3 ≤ n < 13 (numbers -3..12 encoded specially) | 1 | `3X` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `41` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `42` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `43` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `44` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `45` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `46` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `47` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `48` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- |
| -3 ≤ n ≤ 12 | 1 | `9X` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
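Going the other way, the `9X` and `A0`–`AF` rows can be decoded with a few lines of Python. This is a sketch with a hypothetical function name; it returns the decoded integer together with the number of bytes consumed, and it does not handle the `B0` varint-length form or check that the encoding is the shortest possible.

```python
from typing import Tuple


def decode_signed_integer(data: bytes) -> Tuple[int, int]:
    """Decode one SignedInteger from the start of data: (value, bytes used)."""
    if not data:
        raise ValueError("empty input")
    tag = data[0]
    if 0x90 <= tag <= 0x9C:
        return tag - 0x90, 1              # 0..12
    if 0x9D <= tag <= 0x9F:
        return tag - 0xA0, 1              # -3..-1
    if 0xA0 <= tag <= 0xAF:
        m = tag - 0xA0 + 1                # body length in bytes
        body = data[1:1 + m]
        if len(body) != m:
            raise ValueError("truncated SignedInteger")
        return int.from_bytes(body, "big", signed=True), 1 + m
    raise ValueError(f"tag 0x{tag:02X} is not handled by this sketch")
```

For instance, `decode_signed_integer(bytes.fromhex("A1FEFF"))` returns `(-257, 3)`, and a tag of `A7` followed by eight bytes covers the full i64 range.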
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes


@ -29,16 +29,3 @@ not. There's only one (?) at the moment, the `%i"f"` in `Float`;
should it be changed to case-sensitive?
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))