--- no_site_title: true title: "Preserves: an Expressive Data Language" --- Tony Garnock-Jones August 2019. Version 0.0.6. [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt [spki]: http://world.std.com/~cme/html/spki.html [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map [abnf]: https://tools.ietf.org/html/rfc7405 This document proposes a data model and serialization format called *Preserves*. Preserves supports *records* with user-defined *labels*. This relieves the confusion caused by encoding records as dictionaries, seen in most data languages in use on the web. It also allows Preserves to easily represent the *labelled sums of products* as seen in many functional programming languages. Preserves also supports the usual suite of atomic and compound data types, in particular including *binary* data as a distinct type from text strings. Its *annotations* allow separation of data from metadata such as [comments](conventions.html#comments), trace information, and provenance information. Finally, Preserves defines precisely how to *compare* two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language. ## Starting with Semantics Taking inspiration from functional programming, we start with a definition of the *values* that we want to work with and give them meaning independent of their syntax. Our `Value`s fall into two broad categories: *atomic* and *compound* data. Value = Atom | Compound Atom = Boolean | Float | Double | SignedInteger | String | ByteString | Symbol Compound = Record | Sequence | Set | Dictionary **Total order.** As we go, we will incrementally specify a total order over `Value`s. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:[^ordering-by-syntax] (Values) Atom < Compound (Compounds) Record < Sequence < Set < Dictionary (Atoms) Boolean < Float < Double < SignedInteger < String < ByteString < Symbol [^ordering-by-syntax]: The observant reader may note that the ordering here is the same as that implied by the tagging scheme used in the concrete binary syntax for `Value`s. **Equivalence.** Two `Value`s are equal if neither is less than the other according to the total order. ### Signed integers. A `SignedInteger` is a signed integer of arbitrary width. `SignedInteger`s are compared as mathematical integers. ### Unicode strings. A `String` is a sequence of Unicode [code-point](http://www.unicode.org/glossary/#code_point)s. `String`s are compared lexicographically, code-point by code-point.[^utf8-is-awesome] [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! ### Binary data. A `ByteString` is a sequence of octets. `ByteString`s are compared lexicographically. ### Symbols. Programming languages like Lisp and Prolog frequently use string-like values called *symbols*. Here, a `Symbol` is, like a `String`, a sequence of Unicode code-points representing an identifier of some kind. `Symbol`s are also compared lexicographically by code-point. ### Booleans. There are two `Boolean`s, “false” and “true”. The “false” value is less-than the “true” value. ### IEEE floating-point values. `Float`s and `Double`s are single- and double-precision IEEE 754 floating-point values, respectively. `Float`s, `Double`s and `SignedInteger`s are disjoint; by the rules [above](#total-order), every `Float` is less than every `Double`, and every `SignedInteger` is greater than both. Two `Float`s or two `Double`s are to be ordered by the `totalOrder` predicate defined in section 5.10 of [IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). ### Records. A `Record` is a *labelled* tuple of `Value`s, the record's *fields*. A label can be any `Value`, but is usually a `Symbol`.[^extensibility] [^iri-labels] `Record`s are compared lexicographically: first by label, then by field sequence. [^extensibility]: The [Racket](https://racket-lang.org/) programming language defines “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))” structure types, which map well to our `Record`s. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. [^iri-labels]: It is occasionally (but seldom) necessary to interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself—for its own `Value`. ### Sequences. A `Sequence` is a sequence of `Value`s. `Sequence`s are compared lexicographically. ### Sets. A `Set` is an unordered finite set of `Value`s. It contains no duplicate values, following the [equivalence relation](#equivalence) induced by the total order on `Value`s. Two `Set`s are compared by sorting their elements ascending using the [total order](#total-order) and comparing the resulting `Sequence`s. ### Dictionaries. A `Dictionary` is an unordered finite collection of pairs of `Value`s. Each pair comprises a *key* and a *value*. Keys in a `Dictionary` are pairwise distinct. Instances of `Dictionary` are compared by lexicographic comparison of the sequences resulting from ordering each `Dictionary`'s pairs in ascending order by key. ## Textual Syntax Now we have discussed `Value`s and their meanings, we may turn to techniques for *representing* `Value`s for communication or storage. In this section, we use [case-sensitive ABNF][abnf] to define a textual syntax that is easy for people to read and write.[^json-superset] Most of the examples in this document are written using this syntax. In the following section, we will define an equivalent compact machine-readable syntax. [^json-superset]: The grammar of the textual syntax is a superset of JSON, with the slightly unusual feature that `true`, `false`, and `null` are all read as `Symbol`s, and that `SignedInteger`s are never read as `Double`s. ### Character set. [ABNF][abnf] allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points. Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where possible. ### Whitespace. Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas. ws = *(%x20 / %x09 / newline / ",") newline = CR / LF ### Grammar. Standalone documents may have trailing whitespace. Document = Value ws Any `Value` may be preceded by whitespace. Value = ws (Record / Collection / Atom / Compact) Collection = Sequence / Dictionary / Set Atom = Boolean / Float / Double / SignedInteger / String / ByteString / Symbol Each `Record` is an angle-bracket enclosed grouping of its label-`Value` followed by its field-`Value`s. Record = "<" Value *Value ws ">" `Sequence`s are enclosed in square brackets. `Dictionary` values are curly-brace-enclosed colon-separated pairs of values. `Set`s are written either as one or more values enclosed in curly braces, or zero or more values enclosed by the tokens `#set{` and `}`.[^printing-collections] It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys. Sequence = "[" *Value ws "]" Dictionary = "{" *(Value ws ":" Value) ws "}" Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}" [^printing-collections]: **Implementation note.** When implementing printing of `Value`s using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of `Value` that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection. `Boolean`s are the simple literal strings `#true` and `#false`. Boolean = %s"#true" / %s"#false" Numeric data follow the [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with the addition of a trailing “f” distinguishing `Float` from `Double` values. `Float`s and `Double`s always have either a fractional part or an exponent part, where `SignedInteger`s never have either.[^reading-and-writing-floats-accurately] [^arbitrary-precision-signedinteger] Float = flt %i"f" Double = flt SignedInteger = int digit1-9 = %x31-39 nat = %x30 / ( digit1-9 *DIGIT ) int = ["-"] nat frac = "." 1*DIGIT exp = %i"e" ["-"/"+"] 1*DIGIT flt = int (frac exp / frac / exp) [^reading-and-writing-floats-accurately]: **Implementation note.** Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer: Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. . Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. . Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. . [^arbitrary-precision-signedinteger]: **Implementation note.** Be aware when implementing reading and writing of `SignedInteger`s that the data model *requires* arbitrary-precision integers. Your I/O routines must not truncate precision either when reading or writing a `SignedInteger`. `String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs] String = %x22 *char %x22 char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG) unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF escape = %x5C ; \ escaped = ( %x5C / ; \ reverse solidus U+005C %x2F / ; / solidus U+002F %x62 / ; b backspace U+0008 %x66 / ; f form feed U+000C %x6E / ; n line feed U+000A %x72 / ; r carriage return U+000D %x74 ) ; t tab U+0009 [^string-json-correspondence]: The grammar for `String` has the same effect as the [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for `string`. Some auxiliary definitions (e.g. `escaped`) are lifted largely unmodified from the text of RFC 8259. [^escaping-surrogate-pairs]: In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid escaping such characters when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle them correctly. A `ByteString` may be written in any of three different forms. The first is similar to a `String`, but prepended with a hash sign `#`. In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a `ByteString`; other byte values must be escaped by prepending a two-digit hexadecimal value with `\x`. ByteString = "#" %x22 *binchar %x22 binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG) binunescaped = %x20-21 / %x23-5B / %x5D-7E The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by `#hex{` and `}`. ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}" The third is as a sequence of [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved with whitespace and surrounded by `#base64{` and `}`. Plain and URL-safe Base64 characters are allowed. ByteString =/ %s"#base64{" *(ws / base64char) ws "}" / base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as it conforms to certain restrictions on the characters appearing in the symbol. Alternatively, it may be written in a quoted form. The quoted form is much the same as the syntax for `String`s, including embedded escape syntax, except using a bar or pipe character (`|`) instead of a double quote mark. Symbol = symstart *symcont / "|" *symchar "|" symstart = ALPHA / sympunct / symustart symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-" sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" / "?" / "_" / "=" / "+" / "/" / "." symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG) symustart = symucont = [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt] definition of “token representation”, and with the [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4). Finally, any `Value` may be represented by escaping from the textual syntax to the [compact binary syntax](#compact-binary-syntax) by prefixing a `ByteString` containing the binary representation of the `Value` with `#value`.[^rationale-switch-to-binary] [^no-literal-binary-in-text] Compact = %s"#value" ws ByteString [^rationale-switch-to-binary]: **Rationale.** The textual syntax cannot express every `Value`: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the compact binary format for `Value`s expresses each `Value` with precision, embedding binary `Value`s solves the problem. [^no-literal-binary-in-text]: Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw binary form of compact binary encoding from within a pieces of textual syntax. However, while bytes must be involved in any *representation* of text, the text *itself* is logically a sequence of *code points* and is not *intrinsically* a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself. ### Annotations. **Syntax.** When written down, a `Value` may have an associated sequence of *annotations* carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a `Value`, and may itself have annotations. Value =/ ws "@" Value Value Each annotation is preceded by `@`; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “`Value`” without altering the semantic class of `Value`s. **Equivalence.** Annotations appear within syntax denoting a `Value`; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of `Value`s. Reflective tools such as debuggers, user interfaces, and message routers and relays---tools which process `Value`s generically---may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program. Annotations are data *describing* `Value`s, and are not in the domain of any specific application of `Value`s. That is, an annotation will almost never cause a non-reflective program to do anything observably different. ## Compact Binary Syntax A `Repr` is a binary-syntax encoding, or representation, of either - a `Value`, - a “placeholder” for a `Value`, or - an annotation on a `Repr`. Each `Repr` comprises one or more bytes describing the kind of represented information and the length of the representation, followed by the encoded details. For a value `v`, we write `[[v]]` for the `Repr` of v. ### Type and Length representation. Each `Repr` takes one of three possible forms: - (A) type-specific form, used for simple values such as `Boolean`s or `Float`s, for placeholders, and for introducing annotations. - (B) a variable-length form with length specified up-front, used for compound and variable-length atomic data structures when their sizes are known at the time serialization begins. - (C) a variable-length streaming form with unknown or unpredictable length, used in cases when serialization begins before the number of elements or bytes in the corresponding `Value` is known. Applications may choose between formats B and C depending on their needs at serialization time. #### The lead byte. Every `Repr` starts with a *lead byte*, constructed by `leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16: leadbyte(t,n,m) = [t*64 + n*16 + m] The arguments `t`, `n` and `m` describe the rest of the representation.[^some-encodings-unused] [^some-encodings-unused]: Some encodings are unused. All such encodings are reserved for future versions of this specification. | `t` | `n` | `m` | Meaning | | --- | --- | --- | ------- | | 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation | | 0 | 0 | 4 | (format C) Stream end | | 0 | 0 | 5 | (format A) Annotation | | 0 | 1 | | (format A) Placeholder for an application-specific `Value` | | 0 | 2 | | (format C) Stream start | | 0 | 3 | | (format A) Certain small `SignedInteger`s | | 1 | | | (format B) An `Atom` with variable-length binary representation | | 2 | | | (format B) A `Compound` with variable-length representation | #### Encoding data of type-specific length (format A). Each type of data defines its own rules for this format. #### Encoding data of known length (format B). Format B is used where the length `l` of the `Value` to be encoded is known when serialization begins. Format B `Repr`s use `m` in `leadbyte` to encode `l`. The length counts *bytes* for atomic `Value`s, but counts *contained values* for compound `Value`s. - A length `l` between 0 and 14 is represented using `leadbyte` with `m=l`. - A length of 15 or greater is represented by `m=15` and additional bytes describing the length following the lead byte. The function `header(t,n,m)` yields an appropriate sequence of bytes describing a `Repr`'s type and length when `t`, `n` and `m` are appropriate non-negative integers: header(t,n,m) = leadbyte(t,n,m) when m < 15 or leadbyte(t,n,15) ++ varint(m) otherwise The additional length bytes are formatted as [base 128 varints][varint]. We write `varint(m)` for the varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint] definition, > Each byte in a varint, except the last byte, has the most > significant bit (msb) set – this indicates that there are further > bytes to come. The lower 7 bits of each byte are used to store the > two's complement representation of the number in groups of 7 bits, > least significant group first. The following table illustrates varint-encoding. | Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | | ------ | ------------------- | ------------ | | 15 | `0001111` | 15 | | 300 | `0000010 0101100` | 172 2 | | 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | #### Streaming data of unknown length (format C). A `Repr` where the length of the `Value` to be encoded is variable and not known at the time serialization of the `Value` starts is encoded by a single Stream Start (“open”) byte, followed by zero or more *chunks*, followed by a matching Stream End (“close”) byte: open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n] close() = leadbyte(0,0, 4) = [0x04] For a format C `Repr` of an atomic `Value`, each chunk is to be a format B `Repr` of a `ByteString`, no matter the type of the overall `Value`. Annotations are not allowed on these individual chunks. For a format C `Repr` of a compound `Value`, each chunk is to be a single `Repr`, which may itself be annotated. Each chunk within a format C `Repr` *MUST* have non-zero length. Software that decodes `Repr`s *MUST* reject `Repr`s that include zero-length chunks. ### Records. Format B (known length): [[ ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] For `m` fields, `m+1` is supplied to `header`, to account for the encoding of the record label. Format C (streaming): [[ ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close() Applications *SHOULD* prefer the known-length format for encoding `Record`s. ### Placeholders. Applications may define an interpretation for numbered *placeholders* in the binary syntax, mapping each *placeholder number* `n` to a specific `Value`. For example, a placeholder number may be assigned for a frequently-used `Record` label. A `Value` `v` for which placeholder number `n` has been assigned may be tersely encoded as [[v]] = header(0,1,n) when n is a placeholder number for v **Examples.** For example, a protocol may choose to assign placeholder number 4 to the symbol `void`, making [[void]] = header(0,1,4) = [0x14] [[]] = header(2,0,1) ++ [[void]] = [0x81, 0x14] or it may map symbol `person` to placeholder number 102, making [[person]] = header(0,1,102) = [0x1F, 0x66] and so [[]] = header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] = [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] for format B, or open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close() = [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04] for format C. ### Sequences, Sets and Dictionaries. Format B (known length): [[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]] [[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]] [[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++... ++ [[K_m]] ++ [[V_m]] Note that `m*2` is given to `header` for a `Dictionary`, since there are two `Value`s in each key-value pair. Format C (streaming): [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close() [[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close() [[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++... ++ [[K_m]] ++ [[V_m]] ++ close() Applications may use whichever format suits their needs on a case-by-case basis. There is *no* ordering requirement on the `E_i` elements or `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any order. However, the `E_i` and `K_i` must be pairwise distinct. [^no-sorting-rationale]: In the BitTorrent encoding format, [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of `Value`s is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized `Value`s, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats. However, a quality implementation may wish to offer the programmer the option of serializing with set elements and dictionary keys in sorted order. ### SignedIntegers. Format B/A (known length/fixed-size): [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x header(0,3,x+16) if -3≤x<0 header(0,3,x) if 0≤x<13 Integers in the range [-3,12] are compactly represented using format A because they are so frequently used. Other integers are represented using format B. Format C *MUST NOT* be used for `SignedInteger`s. The function `intbytes(x)` gives the big-endian two's-complement binary representation of `x`, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign, and `m = |intbytes(x)|`. The most-significant bit in the first byte in `intbytes(x)` is the sign bit.[^zero-intbytes] [^zero-intbytes]: The value 0 needs zero bytes to identify the value, so `intbytes(0)` is the empty byte string. Non-zero values need at least one byte. For example, [[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80 [[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF [[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00 [[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF [[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00 [[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF [[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00 [[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00 ### Strings, ByteStrings and Symbols. Syntax for these three types varies only in the value of `n` supplied to `header` and `open`. In each case, the payload following the header is a binary sequence; for `String` and `Symbol`, it is a UTF-8 encoding of the `Value`'s code points, while for `ByteString` it is the raw data contained within the `Value` unmodified. Format B (known length): [[ S ]] = header(1,n,m) ++ encode(S) where m = |encode(S)| and (n,encode(S)) = (1,utf8(S)) if S ∈ String (2,S) if S ∈ ByteString (3,utf8(S)) if S ∈ Symbol To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and then a sequence of zero or more format B chunks, followed by `close()`. Every chunk must be a `ByteString`, and no chunk may be annotated. While the overall content of a streamed `String` or `Symbol` must be valid UTF-8, individual chunks do not have to conform to UTF-8. ### Fixed-length Atoms. Fixed-length atoms all use format A, and do not have a length representation. They repurpose the bits that format B `Repr`s use to specify lengths. Applications *MUST NOT* use format C with `open(0,n)` for any `n`. #### Booleans. [[ #false ]] = header(0,0,0) = [0x00] [[ #true ]] = header(0,0,1) = [0x01] #### Floats and Doubles. [[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F) [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D) The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary representations of `F` and `D`, respectively. ### Annotations. To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with `[0x05] ++ [[v]]`. For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols, `a` and `b`, is [[ @a @b [] ]] = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]] = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90] ## Examples ### Simple examples. For the following examples, imagine an application that maps placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to `observe`. | Value | Encoded byte sequence | |---------------------------------------------------|-------------------------------------------------------------------------------------| | `>` | 82 11 81 10 | | ` >>>` | 82 12 83 75 's' 'p' 'e' 'a' 'k' 81 10 82 11 81 11 | | `[1 2 3 4]` (format B) | 94 31 32 33 34 | | `[1 2 3 4]` (format C) | 29 31 32 33 34 04 | | `[-2 -1 0 1]` | 94 3E 3F 30 31 | | `"hello"` (format B) | 55 'h' 'e' 'l' 'l' 'o' | | `"hello"` (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 35 | | `"hello"` (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 | | `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 | | `-257` | 42 FE FF | | `-1` | 3F | | `0` | 30 | | `1` | 31 | | `255` | 42 00 FF | | `1.0f` | 02 3F 80 00 00 | | `1.0` | 03 3F F0 00 00 00 00 00 00 | | `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 | The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record` <[titled person 2 thing 1] 101 "Blackwell" "Dr"> encodes to 85 ;; Record, generic, 4+1 95 ;; Sequence, 5 76 74 69 74 6C 65 64 ;; Symbol, "titled" 76 70 65 72 73 6F 6E ;; Symbol, "person" 32 ;; SignedInteger, "2" 75 74 68 69 6E 67 ;; Symbol, "thing" 31 ;; SignedInteger, "1" 41 65 ;; SignedInteger, "101" 59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell" 84 ;; Record, generic, 3+1 74 64 61 74 65 ;; Symbol, "date" 42 07 1D ;; SignedInteger, "1821" 32 ;; SignedInteger, "2" 33 ;; SignedInteger, "3" 52 44 72 ;; String, "Dr" [^extensibility2]: It happens to line up with Racket's representation of a record label for an inheritance hierarchy where `titled` extends `person` extends `thing`: (struct date (year month day) #:prefab) (struct thing (id) #:prefab) (struct person thing (name date-of-birth) #:prefab) (struct titled person (title) #:prefab) For more detail on Racket's representations of record labels, see [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29). --- ### JSON examples. The examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid Preserves, though the JSON literals `true`, `false` and `null` read as `Symbol`s. The first example: { "Image": { "Width": 800, "Height": 600, "Title": "View from 15th Floor", "Thumbnail": { "Url": "http://www.example.com/image/481989943", "Height": 125, "Width": 100 }, "Animated" : false, "IDs": [116, 943, 234, 38793] } } encodes to binary as follows: B2 55 "Image" BC 55 "Width" 42 03 20 55 "Title" 5F 14 "View from 15th Floor" 58 "Animated" 75 "false" 56 "Height" 42 02 58 59 "Thumbnail" B6 55 "Width" 41 64 53 "Url" 5F 26 "http://www.example.com/image/481989943" 56 "Height" 41 7D 53 "IDs" 94 41 74 42 03 AF 42 00 EA 43 00 97 89 and the second example: [ { "precision": "zip", "Latitude": 37.7668, "Longitude": -122.3959, "Address": "", "City": "SAN FRANCISCO", "State": "CA", "Zip": "94107", "Country": "US" }, { "precision": "zip", "Latitude": 37.371991, "Longitude": -122.026020, "Address": "", "City": "SUNNYVALE", "State": "CA", "Zip": "94085", "Country": "US" } ] encodes to binary as follows: 92 BF 10 59 "precision" 53 "zip" 58 "Latitude" 03 40 42 E2 26 80 9D 49 52 59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21 57 "Address" 50 54 "City" 5D "SAN FRANCISCO" 55 "State" 52 "CA" 53 "Zip" 55 "94107" 57 "Country" 52 "US" BF 10 59 "precision" 53 "zip" 58 "Latitude" 03 40 42 AF 9D 66 AD B4 03 59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF 57 "Address" 50 54 "City" 59 "SUNNYVALE" 55 "State" 52 "CA" 53 "Zip" 55 "94085" 57 "Country" 52 "US" ## Security Considerations **Empty chunks.** Chunks of zero length are prohibited in streamed (format C) `Repr`s. However, a malicious or broken encoder may include them nonetheless. This opens up a possibility for denial-of-service: an attacker may begin streaming a `String`, for example, sending an endless sequence of zero length chunks, appearing to make progress but not actually doing so. Implementations *MUST* reject zero length chunks when decoding, and *MUST NOT* produce them when encoding. **Whitespace.** Similarly, the textual format for `Value`s allows arbitrary whitespace in many positions. In streaming transfer situations, consider optional restrictions on the amount of consecutive whitespace that may appear in a serialized `Value`. **Annotations.** Also similarly, in modes where a `Value` is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress. **Canonical form for cryptographic hashing and signing.** As specified, neither the textual nor the compact binary encoding rules for `Value`s force canonical serializations. Two serializations of the same `Value` may yield different binary `Repr`s. ## Appendix. Table of lead byte values 00 - False 01 - True 02 - Float 03 - Double 04 - End stream 05 - Annotation (0x) RESERVED 06-0F 1x - Placeholder 2x - Start Stream 3x - Small integers 0..12,-3..-1 4x - SignedInteger 5x - String 6x - ByteString 7x - Symbol 8x - Record 9x - Sequence Ax - Set Bx - Dictionary (Cx) RESERVED C0-CF (Dx) RESERVED D0-DF (Ex) RESERVED E0-EF (Fx) RESERVED F0-FF ## Appendix. Bit fields within lead byte values tt nn mmmm contents ---------- --------- 00 00 0000 False 00 00 0001 True 00 00 0010 Float, 32 bits big-endian binary 00 00 0011 Double, 64 bits big-endian binary 00 00 0100 End Stream (to match a previous Start Stream) 00 00 0101 Annotation; two more Reprs follow 00 01 mmmm Placeholder; m is the placeholder number 00 10 ttnn Start Stream When tt = 00 --> error 01 --> each chunk is a ByteString 10 --> each chunk is a single encoded Value 11 --> error (RESERVED) 00 11 xxxx Small integers 0..12,-3..-1 01 00 mmmm SignedInteger, big-endian binary 01 01 mmmm String, UTF-8 binary 01 10 mmmm ByteString 01 11 mmmm Symbol, UTF-8 binary 10 00 mmmm Record 10 01 mmmm Sequence 10 10 mmmm Set 10 11 mmmm Dictionary 11 nn mmmm error, RESERVED Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If `m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of decoding the varint that follows. Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l` is the length of the body that follows, counted in bytes for `tt`=`01` and in `Repr`s for `tt`=`10`. ## Notes