preserves/preserves.md

44 KiB
Raw Permalink Blame History

no_site_title title
true Preserves: an Expressive Data Language

Tony Garnock-Jones tonyg@leastfixedpoint.com
May 2020. Version 0.0.8.

This document proposes a data model and serialization format called Preserves.

Preserves supports records with user-defined labels. This relieves the confusion caused by encoding records as dictionaries, seen in most data languages in use on the web. It also allows Preserves to easily represent the labelled sums of products as seen in many functional programming languages.

Preserves also supports the usual suite of atomic and compound data types, in particular including binary data as a distinct type from text strings. Its annotations allow separation of data from metadata such as comments, trace information, and provenance information.

Finally, Preserves defines precisely how to compare two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language.

Starting with Semantics

Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax.

Our Values fall into two broad categories: atomic and compound data.

                      Value = Atom
                            | Compound

                       Atom = Boolean
                            | Float
                            | Double
                            | SignedInteger
                            | String
                            | ByteString
                            | Symbol

                   Compound = Record
                            | Sequence
                            | Set
                            | Dictionary

Total order. As we go, we will incrementally specify a total order over Values. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:

        (Values)        Atom < Compound

        (Compounds)     Record < Sequence < Set < Dictionary

        (Atoms)         Boolean < Float < Double < SignedInteger
                          < String < ByteString < Symbol

Equivalence. Two Values are equal if neither is less than the other according to the total order.

Signed integers.

A SignedInteger is a signed integer of arbitrary width. SignedIntegers are compared as mathematical integers.

Unicode strings.

A String is a sequence of Unicode code-points. Strings are compared lexicographically, code-point by code-point.1

Binary data.

A ByteString is a sequence of octets. ByteStrings are compared lexicographically.

Symbols.

Programming languages like Lisp and Prolog frequently use string-like values called symbols. Here, a Symbol is, like a String, a sequence of Unicode code-points representing an identifier of some kind. Symbols are also compared lexicographically by code-point.

Booleans.

There are two Booleans, “false” and “true”. The “false” value is less-than the “true” value.

IEEE floating-point values.

Floats and Doubles are single- and double-precision IEEE 754 floating-point values, respectively. Floats, Doubles and SignedIntegers are disjoint; by the rules above, every Float is less than every Double, and every SignedInteger is greater than both. Two Floats or two Doubles are to be ordered by the totalOrder predicate defined in section 5.10 of IEEE Std 754-2008.

Records.

A Record is a labelled tuple of Values, the record's fields. A label can be any Value, but is usually a Symbol.2 3 Records are compared lexicographically: first by label, then by field sequence.

Sequences.

A Sequence is a sequence of Values. Sequences are compared lexicographically.

Sets.

A Set is an unordered finite set of Values. It contains no duplicate values, following the equivalence relation induced by the total order on Values. Two Sets are compared by sorting their elements ascending using the total order and comparing the resulting Sequences.

Dictionaries.

A Dictionary is an unordered finite collection of pairs of Values. Each pair comprises a key and a value. Keys in a Dictionary are pairwise distinct. Instances of Dictionary are compared by lexicographic comparison of the sequences resulting from ordering each Dictionary's pairs in ascending order by key.

Textual Syntax

Now we have discussed Values and their meanings, we may turn to techniques for representing Values for communication or storage.

In this section, we use case-sensitive ABNF to define a textual syntax that is easy for people to read and write.4 Most of the examples in this document are written using this syntax. In the following section, we will define an equivalent compact machine-readable syntax.

Character set.

ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.

Textual syntax for a Value SHOULD be encoded using UTF-8 where possible.

Whitespace.

Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.

            ws = *(%x20 / %x09 / newline / ",")
       newline = CR / LF

Grammar.

Standalone documents may have trailing whitespace.

      Document = Value ws

Any Value may be preceded by whitespace.

         Value = ws (Record / Collection / Atom / Compact)
    Collection = Sequence / Dictionary / Set
          Atom = Boolean / Float / Double / SignedInteger /
                 String / ByteString / Symbol

Each Record is an angle-bracket enclosed grouping of its label-Value followed by its field-Values.

        Record = "<" Value *Value ws ">"

Sequences are enclosed in square brackets. Dictionary values are curly-brace-enclosed colon-separated pairs of values. Sets are written either as one or more values enclosed in curly braces, or zero or more values enclosed by the tokens #set{ and }.5 It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys.

      Sequence = "[" *Value ws "]"
    Dictionary = "{" *(Value ws ":" Value) ws "}"
           Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"

Booleans are the simple literal strings #true and #false.

       Boolean = %s"#true" / %s"#false"

Numeric data follow the JSON grammar, with the addition of a trailing “f” distinguishing Float from Double values. Floats and Doubles always have either a fractional part or an exponent part, where SignedIntegers never have either.6 7

         Float = flt %i"f"
        Double = flt
 SignedInteger = int

      digit1-9 = %x31-39
           nat = %x30 / ( digit1-9 *DIGIT )
           int = ["-"] nat
          frac = "." 1*DIGIT
           exp = %i"e" ["-"/"+"] 1*DIGIT
           flt = int (frac exp / frac / exp)

Strings are, as in JSON, possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.8 9

        String = %x22 *char %x22
          char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
     unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
        escape = %x5C              ; \
       escaped = ( %x5C /          ; \    reverse solidus U+005C
                   %x2F /          ; /    solidus         U+002F
                   %x62 /          ; b    backspace       U+0008
                   %x66 /          ; f    form feed       U+000C
                   %x6E /          ; n    line feed       U+000A
                   %x72 /          ; r    carriage return U+000D
                   %x74 )          ; t    tab             U+0009

A ByteString may be written in any of three different forms.

The first is similar to a String, but prepended with a hash sign #. In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a ByteString; other byte values must be escaped by prepending a two-digit hexadecimal value with \x.

    ByteString = "#" %x22 *binchar %x22
       binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
  binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by #hex{ and }.

   ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"

The third is as a sequence of Base64 characters, interleaved with whitespace and surrounded by #base64{ and }. Plain and URL-safe Base64 characters are allowed.

   ByteString =/ %s"#base64{" *(ws / base64char) ws "}" /
    base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="

A Symbol may be written in a “bare” form10 so long as it conforms to certain restrictions on the characters appearing in the symbol. Alternatively, it may be written in a quoted form. The quoted form is much the same as the syntax for Strings, including embedded escape syntax, except using a bar or pipe character (|) instead of a double quote mark.

        Symbol = symstart *symcont / "|" *symchar "|"
      symstart = ALPHA / sympunct / symustart
       symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
      sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                 "?" / "_" / "=" / "+" / "/" / "."
       symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
     symustart = <any code point greater than 127 whose Unicode
                  category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                  Pc, Po, Sc, Sm, Sk, So, or Co>
      symucont = <any code point greater than 127 whose Unicode
                  category is Nd, Nl, No, or Pd>

Finally, any Value may be represented by escaping from the textual syntax to the compact binary syntax by prefixing a ByteString containing the binary representation of the Value with #value.11 12 13

       Compact = %s"#value" ws ByteString

Annotations.

Syntax. When written down, a Value may have an associated sequence of annotations carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a Value, and may itself have annotations.

        Value =/ ws "@" Value Value

Each annotation is preceded by @; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “Value” without altering the semantic class of Values.

Equivalence. Annotations appear within syntax denoting a Value; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of Values.

Reflective tools such as debuggers, user interfaces, and message routers and relays---tools which process Values generically---may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program. Annotations are data describing Values, and are not in the domain of any specific application of Values. That is, an annotation will almost never cause a non-reflective program to do anything observably different.

Compact Binary Syntax

A Repr is a binary-syntax encoding, or representation, of either a Value or an annotation on a Repr.

Each Repr comprises one or more bytes describing the kind of represented information and the length of the representation, followed by the encoded details.

For a value v, we write [[v]] for the Repr of v.

Type and Length representation.

Each Repr takes one of three possible forms:

  • (A) type-specific form, used for simple values such as Booleans or Floats as well as for introducing annotations.

  • (B) a variable-length form with length specified up-front, used for compound and variable-length atomic data structures when their sizes are known at the time serialization begins.

  • (C) a variable-length streaming form with unknown or unpredictable length, used in cases when serialization begins before the number of elements or bytes in the corresponding Value is known.

Applications may choose between formats B and C depending on their needs at serialization time.

The lead byte.

Every Repr starts with a lead byte, constructed by leadbyte(t,n,m), where t,n∈{0,1,2,3} and 0≤m<16:

leadbyte(t,n,m) = [t*64 + n*16 + m]

The arguments t, n and m describe the rest of the representation.14

t n m Meaning
0 0 03 (format A) An Atom with fixed-length binary representation
0 0 4 (format C) Stream end
0 0 5 (format A) Annotation
0 2 (format C) Stream start
0 3 (format A) Certain small SignedIntegers
1 (format B) An Atom with variable-length binary representation
2 (format B) A Compound with variable-length representation
3 3 15 (format A) 0xFF byte; no-op

Encoding data of type-specific length (format A).

Each type of data defines its own rules for this format.

Of particular note is lead byte 0xFF, which is a no-op byte acting as a kind of pseudo-whitespace in a binary-syntax encoding.

Encoding data of known length (format B).

Format B is used where the length l of the Value to be encoded is known when serialization begins. Format B Reprs use m in leadbyte to encode l. The length counts bytes for atomic Values, but counts contained values for compound Values.

  • A length l between 0 and 14 is represented using leadbyte with m=l.
  • A length of 15 or greater is represented by m=15 and additional bytes describing the length following the lead byte.

The function header(t,n,m) yields an appropriate sequence of bytes describing a Repr's type and length when t, n and m are appropriate non-negative integers:

header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
                or leadbyte(t,n,15) ++ varint(m)   otherwise

The additional length bytes are formatted as base 128 varints.15 We write varint(m) for the varint-encoding of m. Quoting the Google Protocol Buffers definition,

Each byte in a varint, except the last byte, has the most significant bit (msb) set this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

The following table illustrates varint-encoding.

Number, m m in binary, grouped into 7-bit chunks varint(m) bytes
15 0001111 15
300 0000010 0101100 172 2
1000000000 0000011 1011100 1101011 0010100 0000000 128 148 235 220 3

It is an error for a varint-encoded m in a Repr to be anything other than the unique shortest encoding for that m. That is, a varint-encoding of m MUST NOT end in 0 unless m=0. However, the varint(m) encoding of a length MUST NOT be used when m<15, meaning that a Repr MUST NOT contain any varint-encoding with final byte 0.

Streaming data of unknown length (format C).

A Repr where the length of the Value to be encoded is variable and not known at the time serialization of the Value starts is encoded by a single Stream Start (“open”) byte, followed by zero or more chunks, followed by a matching Stream End (“close”) byte:

 open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
   close() = leadbyte(0,0, 4)       = [0x04]

For a format C Repr of an atomic Value, each chunk is to be a format B Repr of a ByteString, no matter the type of the overall Value. Annotations are not allowed on these individual chunks.

For a format C Repr of a compound Value, each chunk is to be a single Repr, which may itself be annotated.

Each chunk within a format C Repr MUST have non-zero length. Software that decodes Reprs MUST reject Reprs that include zero-length chunks.

Records.

Format B (known length):

[[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]

For m fields, m+1 is supplied to header, to account for the encoding of the record label.

Format C (streaming):

[[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()

Applications SHOULD prefer the known-length format for encoding Records.

Sequences, Sets and Dictionaries.

Format B (known length):

        [[ [X_1...X_m] ]] = header(2,1,m)   ++ [[X_1]] ++...++ [[X_m]]
    [[ #set{X_1...X_m} ]] = header(2,2,m)   ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
                                            ++ [[K_m]] ++ [[V_m]]

Note that m*2 is given to header for a Dictionary, since there are two Values in each key-value pair.

Format C (streaming):

        [[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
    [[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
                                      ++ [[K_m]] ++ [[V_m]] ++ close()

Applications may use whichever format suits their needs on a case-by-case basis.

There is no ordering requirement on the E_i elements or K_i/V_i pairs.16 They may appear in any order. However, the E_i and K_i MUST be pairwise distinct.

SignedIntegers.

Format B/A (known length/fixed-size):

[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3  13≤x
                                 header(0,3,x+16)              if -3≤x<0
                                 header(0,3,x)                 if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A because they are so frequently used. Other integers are represented using format B.

Format C MUST NOT be used for SignedIntegers. Format A MUST be used for integers in the range -3 to 12, inclusive.

The function intbytes(x) gives the big-endian two's-complement binary representation of x, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign, and m = |intbytes(x)|. The most-significant bit in the first byte in intbytes(x) is the sign bit.17

For example,

[[   -257 ]] = 42 FE FF    [[     -3 ]] = 3D       [[    128 ]] = 42 00 80
[[   -256 ]] = 42 FF 00    [[     -2 ]] = 3E       [[    255 ]] = 42 00 FF
[[   -255 ]] = 42 FF 01    [[     -1 ]] = 3F       [[    256 ]] = 42 01 00
[[   -254 ]] = 42 FF 02    [[      0 ]] = 30       [[  32767 ]] = 42 7F FF
[[   -129 ]] = 42 FF 7F    [[      1 ]] = 31       [[  32768 ]] = 43 00 80 00
[[   -128 ]] = 41 80       [[     12 ]] = 3C       [[  65535 ]] = 43 00 FF FF
[[   -127 ]] = 41 81       [[     13 ]] = 41 0D    [[  65536 ]] = 43 01 00 00
[[     -4 ]] = 41 FC       [[    127 ]] = 41 7F    [[ 131072 ]] = 43 02 00 00

Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the value of n supplied to header and open. In each case, the payload following the header is a binary sequence; for String and Symbol, it is a UTF-8 encoding of the Value's code points, while for ByteString it is the raw data contained within the Value unmodified.

Format B (known length):

          [[ S ]] = header(1,n,m) ++ encode(S)
          where m = |encode(S)|
and (n,encode(S)) = (1,utf8(S))  if S ∈ String
                    (2,S)        if S ∈ ByteString
                    (3,utf8(S))  if S ∈ Symbol

To stream a String, ByteString or Symbol, emit open(1,n) and then a sequence of zero or more format B chunks, followed by close(). Every chunk must be a ByteString, and no chunk may be annotated.

While the overall content of a streamed String or Symbol must be valid UTF-8, individual chunks do not have to conform to UTF-8.

Fixed-length Atoms.

Fixed-length atoms all use format A, and do not have a length representation. They repurpose the bits that format B Reprs use to specify lengths. Applications MUST NOT use format C with open(0,n) for any n.

Booleans.

[[ #false ]] = header(0,0,0) = [0x00]
[[  #true ]] = header(0,0,1) = [0x01]

Floats and Doubles.

[[ F ]] when F ∈ Float  = header(0,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)

The functions binary32(F) and binary64(D) yield big-endian 4- and 8-byte IEEE 754 binary representations of F and D, respectively.

Annotations.

To annotate a Repr r with some Value v, prepend r with [0x05] ++ [[v]].

For example, the Repr corresponding to textual syntax @a@b[], i.e. an empty sequence annotated with two symbols, a and b, is

[[ @a @b [] ]]
  = [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
  = [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]

Examples

Simple examples.

Value Encoded byte sequence
<capture <discard>> 82 77 'c' 'a' 'p' 't' 'u' 'r' 'e' 81 77 'd' 'i' 's' 'c' 'a' 'r' 'd'
[1 2 3 4] (format B) 94 31 32 33 34
[1 2 3 4] (format C) 29 31 32 33 34 04
[-2 -1 0 1] 94 3E 3F 30 31
"hello" (format B) 55 'h' 'e' 'l' 'l' 'o'
"hello" (format C, 2 chunks) 25 62 'h' 'e' 63 'l' 'l' 'o' 35
"hello" (format C, 5 chunks) 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35
["hello" there #"world" [] #set{} #true #false] 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00
-257 42 FE FF
-1 3F
0 30
1 31
255 42 00 FF
1.0f 02 3F 80 00 00
1.0 03 3F F0 00 00 00 00 00 00
-1.202e300 03 FE 3C B7 B7 59 BF 04 26

The next example uses a non-Symbol label for a record.18 The Record

<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

85                              ;; Record, generic, 4+1
  95                              ;; Sequence, 5
    76 74 69 74 6C 65 64            ;; Symbol, "titled"
    76 70 65 72 73 6F 6E            ;; Symbol, "person"
    32                              ;; SignedInteger, "2"
    75 74 68 69 6E 67               ;; Symbol, "thing"
    31                              ;; SignedInteger, "1"
  41 65                           ;; SignedInteger, "101"
  59 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
  84                              ;; Record, generic, 3+1
    74 64 61 74 65                  ;; Symbol, "date"
    42 07 1D                        ;; SignedInteger, "1821"
    32                              ;; SignedInteger, "2"
    33                              ;; SignedInteger, "3"
  52 44 72                        ;; String, "Dr"

JSON examples.

The examples from RFC 8259 read as valid Preserves, though the JSON literals true, false and null read as Symbols. The first example:

{
  "Image": {
      "Width":  800,
      "Height": 600,
      "Title":  "View from 15th Floor",
      "Thumbnail": {
          "Url":    "http://www.example.com/image/481989943",
          "Height": 125,
          "Width":  100
      },
      "Animated" : false,
      "IDs": [116, 943, 234, 38793]
    }
}

encodes to binary as follows:

B2
  55 "Image"
  BC
    55 "Width"    42 03 20
    55 "Title"    5F 14 "View from 15th Floor"
    58 "Animated" 75 "false"
    56 "Height"   42 02 58
    59 "Thumbnail"
      B6
        55 "Width"  41 64
        53 "Url"    5F 26 "http://www.example.com/image/481989943"
        56 "Height" 41 7D
        53 "IDs"    94
                      41 74
                      42 03 AF
                      42 00 EA
                      43 00 97 89

and the second example:

[
  {
     "precision": "zip",
     "Latitude":  37.7668,
     "Longitude": -122.3959,
     "Address":   "",
     "City":      "SAN FRANCISCO",
     "State":     "CA",
     "Zip":       "94107",
     "Country":   "US"
  },
  {
     "precision": "zip",
     "Latitude":  37.371991,
     "Longitude": -122.026020,
     "Address":   "",
     "City":      "SUNNYVALE",
     "State":     "CA",
     "Zip":       "94085",
     "Country":   "US"
  }
]

encodes to binary as follows:

92
  BF 10
    59 "precision"  53 "zip"
    58 "Latitude"   03 40 42 E2 26 80 9D 49 52
    59 "Longitude"  03 C0 5E 99 56 6C F4 1F 21
    57 "Address"    50
    54 "City"       5D "SAN FRANCISCO"
    55 "State"      52 "CA"
    53 "Zip"        55 "94107"
    57 "Country"    52 "US"
  BF 10
    59 "precision"  53 "zip"
    58 "Latitude"   03 40 42 AF 9D 66 AD B4 03
    59 "Longitude"  03 C0 5E 81 AA 4F CA 42 AF
    57 "Address"    50
    54 "City"       59 "SUNNYVALE"
    55 "State"      52 "CA"
    53 "Zip"        55 "94085"
    57 "Country"    52 "US"

Security Considerations

Empty chunks. Chunks of zero length are prohibited in streamed (format C) Reprs. However, a malicious or broken encoder may include them nonetheless. This opens up a possibility for denial-of-service: an attacker may begin streaming a String, for example, sending an endless sequence of zero length chunks, appearing to make progress but not actually doing so. Implementations MUST reject zero length chunks when decoding, and MUST NOT produce them when encoding.

Whitespace and no-ops. Similarly, the binary format allows 0xFF no-ops and the textual format allows arbitrary whitespace in many positions. In streaming transfer situations, consider optional restrictions on the amount of consecutive whitespace or the number of consecutive no-ops that may appear.

Annotations. Also similarly, in modes where a Value is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.

Canonical form for cryptographic hashing and signing. As specified, neither the textual nor the compact binary encoding rules for Values force canonical serializations. Two serializations of the same Value may yield different binary Reprs.

Acknowledgements

The use of low-order bits of each lead byte for the length of short values is inspired by a similar feature of CBOR.

The treatment of commas as whitespace in the text syntax is inspired by the same feature of EDN.

The text syntax for Booleans, Symbols, and ByteStrings is directly inspired by Racket's lexical syntax.

Appendix. Autodetection of textual or binary syntax

Whitespace characters 0x09 (ASCII HT (tab)), 0x0A (LF), 0x0D (CR), 0x20 (space) and 0x2C (comma) are ignored at the start of a textual-syntax Preserves Document, and their UTF-8 encodings are reserved lead byte values in binary-syntax Preserves.

The byte 0xFF, signifying a no-op in binary-syntax Preserves, has no meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in a valid textual-syntax Preserves Document.

If applications prefix their textual-syntax documents with e.g. a space or newline character, and their binary-syntax documents with a 0xFF byte, consumers of these documents may reliably autodetect the syntax being used. In a network protocol supporting this kind of autodetection, clients may transmit LF or 0xFF to select text or binary syntax, respectively.

Furthermore, if an application consistently uses Records for its top-level messages,19 eschewing Atoms in particular, then autodetection of the encoding used for a given input can be done as follows:

First byte of encoded input Encoding Other conclusions
0x80--0x8F binary Record (format B)
0x28 binary Record (format C)
0x05 binary annotated value (presumably a Record)
0xFF binary no-op; value will follow
--- --- ---
0x7B ("<") text Record
0x40 ("@") text annotated value (presumably a Record)
0x09, 0x0A, 0x0D, 0x20 or 0x2C text whitespace; value will follow

Appendix. Table of lead byte values

 00 - False
 01 - True
 02 - Float
 03 - Double
 04 - End stream
 05 - Annotation
(0x)  RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
(1x)  RESERVED
 2x - Start Stream (NB. 20, 2C specially reserved)
 3x - Small integers 0..12,-3..-1

 4x - SignedInteger
 5x - String
 6x - ByteString
 7x - Symbol

 8x - Record
 9x - Sequence
 Ax - Set
 Bx - Dictionary

(Cx)  RESERVED C0-CF
(Dx)  RESERVED D0-DF
(Ex)  RESERVED E0-EF
(Fx)  RESERVED F0-FE
 FF   No-op

Appendix. Bit fields within lead byte values

 tt nn mmmm  contents
 ---------- ---------

 00 00 0000  False
 00 00 0001  True
 00 00 0010  Float, 32 bits big-endian binary
 00 00 0011  Double, 64 bits big-endian binary
 00 00 0100  End Stream (to match a previous Start Stream)
 00 00 0101  Annotation; two more Reprs follow

 00 00 1001  (ASCII HT (tab))  \
 00 00 1010  (ASCII LF)        |- Reserved: may be used to indicate
 00 00 1101  (ASCII CR)        /    use of text encoding

 00 01 xxxx  error, RESERVED

 00 10 ttnn  Start Stream <tt,nn>
               When tt = 00 --> error
                           When nn = 00 --> (ASCII space)
                                       Reserved: may be used to indicate
                                         use of text encoding
                                     otherwise --> error
                         01 --> each chunk is a ByteString
                         10 --> each chunk is a single encoded Value
                         11 --> error (RESERVED)
                           When nn = 00 --> (ASCII comma)
                                       Reserved: may be used to indicate
                                         use of text encoding
                                     otherwise --> error

 00 11 xxxx  Small integers 0..12,-3..-1

 01 00 mmmm  SignedInteger, big-endian binary
 01 01 mmmm  String, UTF-8 binary
 01 10 mmmm  ByteString
 01 11 mmmm  Symbol, UTF-8 binary

 10 00 mmmm  Record
 10 01 mmmm  Sequence
 10 10 mmmm  Set
 10 11 mmmm  Dictionary

 11 00 xxxx  error, RESERVED
 11 01 xxxx  error, RESERVED
 11 10 xxxx  error, RESERVED
 11 11 1111  no-op; unambiguous indication of binary Preserves format

Where mmmm appears, interpret it as an unsigned 4-bit number m. If m<15, let l=m. Otherwise, m=15; let l be the result of decoding the varint that follows.

Then, l is the length of the body that follows, counted in bytes for tt=01 and in Reprs for tt=10.

Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the following table useful in encoding and decoding binary SignedInteger values.

Integer range Bytes required Encoding (hex)
-3 ≤ n < 13 (numbers -3..12 encoded specially) 1 3X
-27 ≤ n < 27 (i8) 2 41 XX
-215 ≤ n < 215 (i16) 3 42 XX XX
-223 ≤ n < 223 (i24) 4 43 XX XX XX
-231 ≤ n < 231 (i32) 5 44 XX XX XX XX
-239 ≤ n < 239 (i40) 6 45 XX XX XX XX XX
-247 ≤ n < 247 (i48) 7 46 XX XX XX XX XX XX
-255 ≤ n < 255 (i56) 8 47 XX XX XX XX XX XX XX
-263 ≤ n < 263 (i64) 9 48 XX XX XX XX XX XX XX XX

Notes


  1. Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! ↩︎

  2. The Racket programming language defines “prefab” structure types, which map well to our Records. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. ↩︎

  3. It is occasionally (but seldom) necessary to interpret such Symbol labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself—for its own Value. ↩︎

  4. The grammar of the textual syntax is a superset of JSON, with the slightly unusual feature that true, false, and null are all read as Symbols, and that SignedIntegers are never read as Doubles. ↩︎

  5. Implementation note. When implementing printing of Values using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of Value that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection. ↩︎

  6. Implementation note. Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:

    Clinger, William D. How to Read Floating Point Numbers Accurately. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.

    Steele, Guy L., Jr., and Jon L. White. How to Print Floating-Point Numbers Accurately. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.

    Jaffer, Aubrey. Easy Accurate Reading and Writing of Floating-Point Numbers. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121. ↩︎

  7. Implementation note. Be aware when implementing reading and writing of SignedIntegers that the data model requires arbitrary-precision integers. Your implementation may (but, ideally, should not) truncate precision when reading or writing a SignedInteger; however, if it does so, it should (a) signal its client that truncation has occurred, and (b) make it clear to the client that comparing such truncated values for equality or ordering will not yield results that match the expected semantics of the data model. ↩︎

  8. The grammar for String has the same effect as the JSON grammar for string. Some auxiliary definitions (e.g. escaped) are lifted largely unmodified from the text of RFC 8259. ↩︎

  9. In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid escaping such characters when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle them correctly. ↩︎

  10. Compare with the SPKI S-expression definition of “token representation”, and with the R6RS definition of identifiers. ↩︎

  11. Rationale. The textual syntax cannot express every Value: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the compact binary format for Values expresses each Value with precision, embedding binary Values solves the problem. ↩︎

  12. Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw binary form of compact binary encoding from within a pieces of textual syntax. However, while bytes must be involved in any representation of text, the text itself is logically a sequence of code points and is not intrinsically a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself. ↩︎

  13. Any text-syntax annotations preceding the #value are prepended to any binary-syntax annotations yielded by decoding the ByteString. ↩︎

  14. Some encodings are unused. All such encodings are reserved for future versions of this specification. ↩︎

  15. Also known as LEB128 encoding, for unsigned integers. Varints and LEB128-encoded integers differ only for signed integers, which are not used in Preserves. ↩︎

  16. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats.

    However, a quality implementation may wish to offer the programmer the option of serializing with set elements and dictionary keys in sorted order. ↩︎

  17. The value 0 needs zero bytes to identify the value, so intbytes(0) is the empty byte string. Non-zero values need at least one byte. ↩︎

  18. It happens to line up with Racket's representation of a record label for an inheritance hierarchy where titled extends person extends thing:

    (struct date (year month day) #:prefab)
    (struct thing (id) #:prefab)
    (struct person thing (name date-of-birth) #:prefab)
    (struct titled person (title) #:prefab)
    

    For more detail on Racket's representations of record labels, see the Racket documentation for make-prefab-struct. ↩︎

  19. Similar reasoning can be used to permit unambiguous detection of encoding when Collections are allowed as top-level messages as well as Records. ↩︎