Preserves: Text Syntax

Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.

Preserves is a data model, with associated serialization formats. This document defines one of those formats: a textual syntax for Values from the Preserves data model that is easy for people to read and write. An equivalent machine-oriented binary syntax also exists.

Preliminaries

The definition uses case-sensitive ABNF.

ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.

Encoding. Textual syntax for a Value SHOULD be encoded using UTF-8 where possible.

Whitespace. Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.

            ws = *(%x20 / %x09 / newline / ",")
       newline = CR / LF
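
As an illustration of the `ws` rule, here is a minimal Python sketch of a whitespace skipper that treats commas exactly like spaces. The names `WS_CHARS` and `skip_ws` are hypothetical and not part of any Preserves implementation.

```python
# Code points matched by the `ws` rule: space, tab, CR, LF, and comma.
WS_CHARS = frozenset(" \t\r\n,")


def skip_ws(text: str, pos: int = 0) -> int:
    """Return the index of the first non-whitespace code point at or after pos."""
    while pos < len(text) and text[pos] in WS_CHARS:
        pos += 1
    return pos
```

Because commas are whitespace, a reader built on such a helper would treat `[1, 2, 3]` and `[1 2 3]` as the same text.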

Grammar

Standalone documents may have trailing whitespace.

      Document = Value ws

Any Value may be preceded by whitespace.

         Value = ws (Record / Collection / Atom / Embedded)
    Collection = Sequence / Dictionary / Set
          Atom = Boolean / String / ByteString /
                 QuotedSymbol / SymbolOrNumber

Each Record is an angle-bracket enclosed grouping of its label-Value followed by its field-Values.

        Record = "<" Value *Value ws ">"

Sequences are enclosed in square brackets. Dictionaries are written as curly-brace-enclosed, colon-separated pairs of values. Sets are written as values enclosed by the tokens #{ and }.[1] It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys.

      Sequence = "[" *Value ws "]"
    Dictionary = "{" *(Value ws ":" Value) ws "}"
           Set = "#{" *Value ws "}"
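
The duplicate rule can be enforced while a reader assembles a Set or the keys of a Dictionary. The Python sketch below assumes the elements have already been parsed into hashable Python objects. The helper name `check_no_duplicates` is hypothetical, and the pairing of each item with its Python type is only a rough stand-in for Preserves equivalence, used so that Python's `1 == 1.0 == True` does not produce false duplicates.

```python
def check_no_duplicates(items, what="set element"):
    """Raise ValueError if the same item occurs twice; otherwise return a list."""
    seen = set()
    result = []
    for item in items:
        # Tag each item with its type: in Python, 1, 1.0 and True compare
        # equal, but they stand for distinct values here (an assumption of
        # this sketch, not a full model of Preserves equivalence).
        key = (type(item), item)
        if key in seen:
            raise ValueError(f"duplicate {what}: {item!r}")
        seen.add(key)
        result.append(item)
    return result
```

A reader could call it with `what="dictionary key"` on the keys of a Dictionary before building the mapping.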

Booleans are the simple literal strings #t and #f for true and false, respectively.

       Boolean = %s"#t" / %s"#f"

Strings are, as in JSON, possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.[2][3]

        String = %x22 *char %x22
          char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
     unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
        escape = %x5C              ; \
       escaped = ( %x5C /          ; \    reverse solidus U+005C
                   %x2F /          ; /    solidus         U+002F
                   %x62 /          ; b    backspace       U+0008
                   %x66 /          ; f    form feed       U+000C
                   %x6E /          ; n    line feed       U+000A
                   %x72 /          ; r    carriage return U+000D
                   %x74 )          ; t    tab             U+0009
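
Because the String grammar has the same effect as the JSON grammar for strings, one plausible way to decode a complete String token is to hand it to an existing JSON decoder. The sketch below uses Python's standard `json` module for this. The function name `decode_string_token` is hypothetical, and the token is assumed to have been isolated by a tokenizer beforehand.

```python
import json


def decode_string_token(token: str) -> str:
    """Decode one complete String token, including its surrounding quotes."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("not a String token")
    # json.loads is strict by default: raw control characters are rejected,
    # and \uXXXX surrogate pairs are combined into a single code point.
    value = json.loads(token)
    if not isinstance(value, str):
        raise ValueError("not a String token")
    return value
```

Note that the pipe character needs no escaping inside a String, so `"a|b"` decodes to `a|b`.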

A ByteString may be written in any of three different forms.

The first is similar to a String, but prepended with a hash sign #. In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a ByteString; other byte values must be escaped by prepending a two-digit hexadecimal value with \x.

    ByteString = "#" %x22 *binchar %x22
       binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
  binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by #x" and ".

   ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22

The third is as a sequence of Base64 characters, interleaved with whitespace and surrounded by #[ and ]. Plain and URL-safe Base64 characters are allowed.

   ByteString =/ "#[" *(ws / base64char) ws "]"
    base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
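
The three ByteString forms can be decoded along the following lines. This is a Python sketch, not a reference implementation: the name `decode_bytestring` is hypothetical, the input is assumed to be one complete token, and Base64 padding is treated as optional, which is an assumption of the sketch.

```python
import base64
import re

WS_CHARS = " \t\r\n,"
# Single-character escapes shared with String, plus the double quote.
ESCAPES = {"\\": 0x5C, "/": 0x2F, "b": 0x08, "f": 0x0C,
           "n": 0x0A, "r": 0x0D, "t": 0x09, '"': 0x22}


def decode_bytestring(token: str) -> bytes:
    """Decode a complete ByteString token in any of its three forms."""
    if token.startswith('#x"') and token.endswith('"') and len(token) >= 4:
        body = token[3:-1]
        # Whitespace may separate pairs of hex digits but not split a pair.
        if not re.fullmatch(r"(?:[ \t\r\n,]|[0-9A-Fa-f]{2})*", body):
            raise ValueError("malformed hexadecimal ByteString")
        return bytes.fromhex("".join(c for c in body if c not in WS_CHARS))
    if token.startswith("#[") and token.endswith("]"):
        body = "".join(c for c in token[2:-1] if c not in WS_CHARS)
        if not re.fullmatch(r"[A-Za-z0-9+/_=-]*", body):
            raise ValueError("malformed Base64 ByteString")
        # Map the URL-safe alphabet onto the plain one, then re-pad.
        body = body.replace("-", "+").replace("_", "/").rstrip("=")
        if "=" in body:
            raise ValueError("padding in the middle of Base64 data")
        body += "=" * (-len(body) % 4)
        return base64.b64decode(body, validate=True)
    if token.startswith('#"') and token.endswith('"') and len(token) >= 3:
        body, out, i = token[2:-1], bytearray(), 0
        while i < len(body):
            ch = body[i]
            if ch == "\\":
                esc = body[i + 1:i + 2]
                if esc == "x":
                    pair = body[i + 2:i + 4]
                    if not re.fullmatch(r"[0-9A-Fa-f]{2}", pair):
                        raise ValueError("malformed \\x escape")
                    out.append(int(pair, 16))
                    i += 4
                elif esc in ESCAPES:
                    out.append(ESCAPES[esc])
                    i += 2
                else:
                    raise ValueError("unknown or dangling escape")
            elif 0x20 <= ord(ch) <= 0x7E and ch != '"':
                out.append(ord(ch))  # printable ASCII, taken literally
                i += 1
            else:
                raise ValueError("code point must be escaped in a ByteString")
        return bytes(out)
    raise ValueError("not a ByteString token")
```

For example, `#"hello"`, `#x"68 65 6c 6c 6f"` and `#[aGVsbG8=]` would all decode to the same five bytes.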

A Symbol may be written in either of two forms.

The first is a quoted form, much the same as the syntax for Strings, including embedded escape syntax, except using a bar or pipe character (|) instead of a double quote mark.

  QuotedSymbol = "|" *symchar "|"
       symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)

Alternatively, a Symbol may be written in a “bare” form.[4] The grammar for numeric data is a subset of the grammar for bare Symbols, so if a SymbolOrNumber also matches the grammar for Float, Double or SignedInteger, then it must be interpreted as one of those; otherwise it must be interpreted as a bare Symbol.

SymbolOrNumber = 1*baresymchar
   baresymchar = ALPHA / DIGIT / sympunct / symuchar
      sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                 "?" / "_" / "=" / "+" / "-" / "/" / "."
      symuchar = <any code point greater than 127 whose Unicode
                  category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
                  Nl, No, Pc, Pd, Po, Sc, Sm, Sk, So, or Co>

Numeric data follow the JSON grammar except that leading zeros are permitted and an optional leading + sign is allowed. A trailing “f” (or “F”) distinguishes a Float from a Double value. Floats and Doubles always have either a fractional part or an exponent part, whereas SignedIntegers never have either.[5][6]

         Float = flt %i"f"
        Double = flt
 SignedInteger = int

           nat = 1*DIGIT
           int = ["-"/"+"] nat
          frac = "." 1*DIGIT
           exp = %i"e" ["-"/"+"] 1*DIGIT
           flt = int (frac exp / frac / exp)

Some valid IEEE 754 Floats and Doubles are not covered by the grammar above, namely, the several million NaNs and the two infinities. These are represented as raw hexadecimal strings similar to hexadecimal ByteStrings. Implementations are free to use hexadecimal floating-point syntax wherever convenient, even for values representable using the grammar above.[7]

        Value =/ HexFloat / HexDouble
      HexFloat = "#xf" %x22 4(ws 2HEXDIG) ws %x22
     HexDouble = "#xd" %x22 8(ws 2HEXDIG) ws %x22
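
A HexFloat or HexDouble token might be decoded as in the Python sketch below. Two points are assumptions rather than statements of this section: the bytes are taken to be in big-endian (network) order, and the function name `decode_hex_float` is hypothetical.

```python
import re
import struct

WS_CHARS = " \t\r\n,"
# struct format and byte count per prefix letter.
# ASSUMPTION: big-endian byte order (">"); this section does not state it.
FORMATS = {"f": (">f", 4), "d": (">d", 8)}


def decode_hex_float(token: str) -> float:
    """Decode a complete HexFloat (#xf"...") or HexDouble (#xd"...") token."""
    m = re.fullmatch(
        r'#x([fd])"((?:[ \t\r\n,]*[0-9A-Fa-f]{2})*[ \t\r\n,]*)"', token)
    if m is None:
        raise ValueError("not a HexFloat or HexDouble token")
    kind, body = m.groups()
    raw = bytes.fromhex("".join(c for c in body if c not in WS_CHARS))
    fmt, size = FORMATS[kind]
    if len(raw) != size:
        raise ValueError(f"expected exactly {size} bytes, got {len(raw)}")
    return struct.unpack(fmt, raw)[0]
```

Under the big-endian assumption, `#xd"7ff0000000000000"` decodes to positive infinity and `#xf"ff80 0000"` to negative infinity.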

Finally, an Embedded is written as a Value chosen to represent the denoted object, prefixed with #!.

       Embedded = "#!" Value

Annotations

When written down, a Value may have an associated sequence of annotations carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a Value, and may itself have annotations. The ordering of annotations attached to a Value is significant.

        Value =/ ws "@" Value Value

Each annotation is preceded by @; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “Value” without altering the semantic class of Values.

Comments. Strings annotating a Value are conventionally interpreted as comments associated with that value. Comments are sufficiently common that special syntax exists for them.

        Value =/ ws
                 ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
                 Value

When written this way, everything between the ; and the newline is included in the string annotating the Value.
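
To illustrate, the Python sketch below collects the comment annotations that precede a Value and reports where the Value itself begins. The function name `read_leading_comments` is hypothetical, and `@` annotations are deliberately left out of the sketch.

```python
WS_CHARS = " \t\r\n,"


def read_leading_comments(text: str, pos: int = 0):
    """Return (comments, pos), where pos is the index at which the Value starts."""
    comments = []
    while True:
        while pos < len(text) and text[pos] in WS_CHARS:
            pos += 1
        if pos >= len(text) or text[pos] != ";":
            return comments, pos
        pos += 1  # step over the ';'
        start = pos
        while pos < len(text) and text[pos] not in "\r\n":
            pos += 1
        if pos >= len(text):
            raise ValueError("comment is not terminated by a newline")
        # Everything between ';' and the newline, leading space included.
        comments.append(text[start:pos])
        pos += 1  # consume the CR or LF that ends the comment
```

Each collected string would then be attached to the following Value as a String annotation, in the order read.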

Equivalence. Annotations appear within syntax denoting a Value; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of Values.

Reflective tools such as debuggers, user interfaces, and message routers and relays (tools which process Values generically) may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program. Annotations are data describing Values, and are not in the domain of any specific application of Values. That is, an annotation will almost never cause a non-reflective program to do anything observably different.

Security Considerations

Whitespace. The textual format allows arbitrary whitespace in many positions. Consider optional restrictions on the amount of consecutive whitespace that may appear.

Annotations. Similarly, in modes where a Value is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.
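
One way a reader might act on both considerations is to carry a small set of configurable limits. The Python sketch below is illustrative only: the class names `ReaderLimits` and `LimitExceeded`, the method names, and the default limits are all hypothetical choices, not values taken from this specification.

```python
WS_CHARS = frozenset(" \t\r\n,")


class LimitExceeded(ValueError):
    """Raised when input exceeds a configured reader limit."""


class ReaderLimits:
    def __init__(self, max_ws_run: int = 65536, max_annotations: int = 1024):
        self.max_ws_run = max_ws_run            # longest whitespace run accepted
        self.max_annotations = max_annotations  # annotations allowed per Value
        self.annotations_seen = 0

    def skip_ws(self, text: str, pos: int = 0) -> int:
        """Skip whitespace, failing if the run is longer than max_ws_run."""
        start = pos
        while pos < len(text) and text[pos] in WS_CHARS:
            pos += 1
            if pos - start > self.max_ws_run:
                raise LimitExceeded("too much consecutive whitespace")
        return pos

    def count_annotation(self) -> None:
        """Call once per annotation read or skipped before a Value."""
        self.annotations_seen += 1
        if self.annotations_seen > self.max_annotations:
            raise LimitExceeded("too many annotations on one Value")

    def value_read(self) -> None:
        """Call when the underlying Value has been read; resets the count."""
        self.annotations_seen = 0
```

Counting annotations even when they are skipped means that an endless run of annotations fails early instead of appearing to make progress.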

Acknowledgements

The treatment of commas as whitespace in the text syntax is inspired by the same feature of EDN.

The text syntax for Booleans, Symbols, and ByteStrings is directly inspired by Racket's lexical syntax.

Appendix. Regular expressions for bare symbols and numbers

When parsing, if a token matches both SymbolOrNumber and Number, it's a number; use Float, Double and SignedInteger to disambiguate. If it matches SymbolOrNumber but not Number, it's a "bare" Symbol.

SymbolOrNumber: ^[-a-zA-Z0-9~!$%^&*?_=+/.]+$
        Number: ^([-+]?\d+)(((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))([fF]?))?$
         Float: ^([-+]?\d+)(((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))[fF])$
        Double: ^([-+]?\d+)(((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+)))$
 SignedInteger: ^([-+]?\d+)$
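
The regular expressions above translate to Python roughly as follows. In this sketch the `^` and `$` anchors are replaced by `fullmatch`, and `re.ASCII` keeps `\d` restricted to the digits 0 to 9. The function name `classify` is hypothetical. As with the SymbolOrNumber expression above, only ASCII bare symbols are recognised, and a Float is returned as an ordinary Python float without narrowing to single precision.

```python
import re

SYMBOL_OR_NUMBER = re.compile(r"[-a-zA-Z0-9~!$%^&*?_=+/.]+")
FLOAT = re.compile(
    r"([-+]?\d+)(((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))[fF])", re.ASCII)
DOUBLE = re.compile(
    r"([-+]?\d+)(((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+)))", re.ASCII)
SIGNED_INTEGER = re.compile(r"([-+]?\d+)", re.ASCII)


def classify(token: str):
    """Return (kind, value) for a token matching SymbolOrNumber."""
    if FLOAT.fullmatch(token):
        return ("Float", float(token[:-1]))  # drop the trailing f or F
    if DOUBLE.fullmatch(token):
        return ("Double", float(token))
    if SIGNED_INTEGER.fullmatch(token):
        return ("SignedInteger", int(token))  # Python ints are arbitrary precision
    if SYMBOL_OR_NUMBER.fullmatch(token):
        return ("Symbol", token)
    raise ValueError("token matches neither SymbolOrNumber nor Number")
```

Tokens such as `1f` or `1.` match SymbolOrNumber but none of the numeric expressions, so they come out as bare Symbols.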

When printing, a symbol whose text matches Number, or does not match SymbolOrNumber at all, must be quoted (|...|). If it matches SymbolOrNumber but not Number, it may be printed as a "bare" Symbol.
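
The printing rule can be sketched in Python as follows. The function name `write_symbol` is hypothetical. The escapes used in the quoted form follow the QuotedSymbol grammar: backslash and bar are escaped, and control characters are written with the short escapes or `\u`.

```python
import re

SYMBOL_OR_NUMBER = re.compile(r"[-a-zA-Z0-9~!$%^&*?_=+/.]+")
NUMBER = re.compile(
    r"([-+]?\d+)(((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))([fF]?))?", re.ASCII)

# Characters that have a short escape inside a quoted symbol.
SHORT_ESCAPES = {"\\": "\\\\", "|": "\\|", "\b": "\\b", "\f": "\\f",
                 "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def write_symbol(name: str) -> str:
    """Print a symbol bare when that is unambiguous, otherwise quoted."""
    if SYMBOL_OR_NUMBER.fullmatch(name) and not NUMBER.fullmatch(name):
        return name
    parts = []
    for ch in name:
        if ch in SHORT_ESCAPES:
            parts.append(SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            parts.append("\\u%04x" % ord(ch))  # remaining control characters
        else:
            parts.append(ch)
    return "|" + "".join(parts) + "|"
```

A symbol named `123` is therefore written `|123|`, so that reading it back yields a Symbol and not a SignedInteger.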

Notes


  1. Implementation note. When implementing printing of Values using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of Value that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection.

  2. The grammar for String has the same effect as the JSON grammar for string. Some auxiliary definitions (e.g. escaped) are lifted largely unmodified from the text of RFC 8259.

  3. In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid using \u escapes when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle non-ASCII codepoints correctly.

  4. Compare with the SPKI S-expression definition of “token representation”, and with the R6RS definition of identifiers.

  5. Implementation note. Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:

    Clinger, William D. How to Read Floating Point Numbers Accurately. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.

    Steele, Guy L., Jr., and Jon L. White. How to Print Floating-Point Numbers Accurately. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.

    Jaffer, Aubrey. Easy Accurate Reading and Writing of Floating-Point Numbers. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121.

  6. Implementation note. Be aware when implementing reading and writing of SignedIntegers that the data model requires arbitrary-precision integers. Your implementation may (but, ideally, should not) truncate precision when reading or writing a SignedInteger; however, if it does so, it should (a) signal its client that truncation has occurred, and (b) make it clear to the client that comparing such truncated values for equality or ordering will not yield results that match the expected semantics of the data model.

  7. Rationale. Previous versions of this specification included an escape to the machine-oriented binary syntax by prefixing a ByteString containing the binary representation of a Value with #=. The only true need for this feature was to represent otherwise-unrepresentable floating-point values. Instead, this specification allows such floating-point values to be written directly. Removing the #= syntax simplifies implementations (there is no longer any need to support the machine-oriented syntax) and avoids complications around treatment of annotations potentially contained within machine-encoded values.