---
no_site_title: true
title: "Preserves: Text Syntax"
---

Tony Garnock-Jones
{{ site.version_date }}. Version {{ site.version }}.

[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[abnf]: https://tools.ietf.org/html/rfc7405

*Preserves* is a data model, with associated serialization formats. This document defines one of those formats: a textual syntax for `Value`s from the [Preserves data model](preserves.html) that is easy for people to read and write. An [equivalent machine-oriented binary syntax](preserves-binary.html) also exists.

## Preliminaries

The definition uses [case-sensitive ABNF][abnf].

ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.

**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where possible.

**Whitespace.** Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.

    ws      = *(%x20 / %x09 / newline / ",")
    newline = CR / LF

## Grammar

Standalone documents may have trailing whitespace.

    Document = Value ws

Any `Value` may be preceded by whitespace.

    Value      = ws (Record / Collection / Atom / Embedded / Machine)
    Collection = Sequence / Dictionary / Set
    Atom       = Boolean / Float / Double / SignedInteger /
                 String / ByteString / Symbol

Each `Record` is an angle-bracket enclosed grouping of its label-`Value` followed by its field-`Value`s.

    Record = "<" Value *Value ws ">"

`Sequence`s are enclosed in square brackets. `Dictionary` values are curly-brace-enclosed colon-separated pairs of values. `Set`s are written as values enclosed by the tokens `#{` and `}`.[^printing-collections] It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys.

    Sequence   = "[" *Value ws "]"
    Dictionary = "{" *(Value ws ":" Value) ws "}"
    Set        = "#{" *Value ws "}"

[^printing-collections]: **Implementation note.** When implementing printing of `Value`s using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of `Value` that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection.

`Boolean`s are the simple literal strings `#t` and `#f` for true and false, respectively.

    Boolean = %s"#t" / %s"#f"

Numeric data follow the [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with the addition of a trailing “f” distinguishing `Float` from `Double` values. `Float`s and `Double`s always have either a fractional part or an exponent part, where `SignedInteger`s never have either.[^reading-and-writing-floats-accurately] [^arbitrary-precision-signedinteger]

    Float         = flt %i"f"
    Double        = flt
    SignedInteger = int

    digit1-9      = %x31-39
    nat           = %x30 / ( digit1-9 *DIGIT )
    int           = ["-"] nat
    frac          = "." 1*DIGIT
    exp           = %i"e" ["-"/"+"] 1*DIGIT
    flt           = int (frac exp / frac / exp)

[^reading-and-writing-floats-accurately]: **Implementation note.** Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.

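**Example.** As a rough illustration of how the numeric rules (`nat`, `int`, `frac`, `exp`, and `flt`) partition tokens, the following Python sketch transcribes them into regular expressions and classifies a complete token as a `Float`, a `Double`, or a `SignedInteger`. The function name `classify_number` and the pattern constants are hypothetical, chosen only for this illustration; they are not drawn from any Preserves implementation, and a real reader would also need to handle the surrounding token boundaries.

```python
import re
from typing import Optional

# Regular-expression transcriptions of the ABNF rules `nat`, `int`,
# `frac`, `exp` and `flt`. All names here are illustrative assumptions.
NAT = r"(?:0|[1-9][0-9]*)"                   # nat  = %x30 / ( digit1-9 *DIGIT )
INT = rf"-?{NAT}"                            # int  = ["-"] nat
FRAC = r"\.[0-9]+"                           # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"                     # exp  = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"  # flt  = int (frac exp / frac / exp)

FLOAT_RE = re.compile(rf"{FLT}[fF]")         # Float         = flt %i"f"
DOUBLE_RE = re.compile(FLT)                  # Double        = flt
INTEGER_RE = re.compile(INT)                 # SignedInteger = int


def classify_number(token: str) -> Optional[str]:
    """Return the name of the numeric rule matching the whole token, or None."""
    if FLOAT_RE.fullmatch(token):
        return "Float"
    if DOUBLE_RE.fullmatch(token):
        return "Double"
    if INTEGER_RE.fullmatch(token):
        return "SignedInteger"
    return None
```

Under these assumptions, `classify_number("1.5f")` yields `"Float"`, `classify_number("1e10")` yields `"Double"`, and `classify_number("-42")` yields `"SignedInteger"`. Tokens such as `01` and `1.` match none of the three rules and yield `None`.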
[^arbitrary-precision-signedinteger]: **Implementation note.** Be aware when implementing reading and writing of `SignedInteger`s that the data model *requires* arbitrary-precision integers. Your implementation may (but, ideally, should not) truncate precision when reading or writing a `SignedInteger`; however, if it does so, it should (a) signal its client that truncation has occurred, and (b) make it clear to the client that comparing such truncated values for equality or ordering will not yield results that match the expected semantics of the data model.

`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

    String    = %x22 *char %x22
    char      = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
    unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
    escape    = %x5C   ; \
    escaped   = ( %x5C /   ; \  reverse solidus  U+005C
                  %x2F /   ; /  solidus          U+002F
                  %x62 /   ; b  backspace        U+0008
                  %x66 /   ; f  form feed        U+000C
                  %x6E /   ; n  line feed        U+000A
                  %x72 /   ; r  carriage return  U+000D
                  %x74 )   ; t  tab              U+0009

[^string-json-correspondence]: The grammar for `String` has the same effect as the [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for `string`. Some auxiliary definitions (e.g. `escaped`) are lifted largely unmodified from the text of RFC 8259.

[^escaping-surrogate-pairs]: In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid using `\u` escapes when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle non-ASCII codepoints correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign `#`.
In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a `ByteString`; other byte values must be escaped by prepending a two-digit hexadecimal value with `\x`.

    ByteString   = "#" %x22 *binchar %x22
    binchar      = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
    binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by `#x"` and `"`.

    ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22

The third is as a sequence of [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved with whitespace and surrounded by `#[` and `]`. Plain and URL-safe Base64 characters are allowed.

    ByteString =/ "#[" *(ws / base64char) ws "]"
    base64char =  %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="

A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as it conforms to certain restrictions on the characters appearing in the symbol. Alternatively, it may be written in a quoted form. The quoted form is much the same as the syntax for `String`s, including embedded escape syntax, except using a bar or pipe character (`|`) instead of a double quote mark.

    Symbol    = symstart *symcont / "|" *symchar "|"
    symstart  = ALPHA / sympunct / symustart
    symcont   = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
    sympunct  = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                "?" / "_" / "=" / "+" / "/" / "."
    symchar   = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
    symustart =
    symucont  =

[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt] definition of “token representation”, and with the [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

An `Embedded` is written as a `Value` chosen to represent the denoted object, prefixed with `#!`.

    Embedded = "#!" Value

Finally, any `Value` may be represented by escaping from the textual syntax to the [machine-oriented binary syntax](preserves-binary.html) by prefixing a `ByteString` containing the binary representation of the `Value` with `#=`.[^rationale-switch-to-binary] [^no-literal-binary-in-text] [^machine-value-annotations]

    Machine = "#=" ws ByteString

[^rationale-switch-to-binary]: **Rationale.** The textual syntax cannot express every `Value`: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the machine-oriented binary format for `Value`s expresses each `Value` with precision, embedding binary `Value`s solves the problem.

[^no-literal-binary-in-text]: Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw form of binary encoding from within a piece of textual syntax. However, while bytes must be involved in any *representation* of text, the text *itself* is logically a sequence of *code points* and is not *intrinsically* a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself.

[^machine-value-annotations]: Any text-syntax annotations preceding the `#` are prepended to any binary-syntax annotations yielded by decoding the `ByteString`.

## Annotations

When written down, a `Value` may have an associated sequence of *annotations* carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a `Value`, and may itself have annotations. The ordering of annotations attached to a `Value` is significant.

    Value =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “`Value`” without altering the semantic class of `Value`s.

**Comments.** Strings annotating a `Value` are conventionally interpreted as comments associated with that value.
Comments are sufficiently common that special syntax exists for them.

    Value =/ ws ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline Value

When written this way, everything between the `;` and the newline is included in the string annotating the `Value`.

**Equivalence.** Annotations appear within syntax denoting a `Value`; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message routers and relays---tools which process `Value`s generically---may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program. Annotations are data *describing* `Value`s, and are not in the domain of any specific application of `Value`s. That is, an annotation will almost never cause a non-reflective program to do anything observably different.

## Security Considerations

**Whitespace.** The textual format allows arbitrary whitespace in many positions. Consider optional restrictions on the amount of consecutive whitespace that may appear.

**Annotations.** Similarly, in modes where a `Value` is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.

## Acknowledgements

The treatment of commas as whitespace in the text syntax is inspired by the same feature of [EDN](https://github.com/edn-format/edn).

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is directly inspired by [Racket](https://racket-lang.org/)'s lexical syntax.

## Notes