treetrie-2015/README.md

103 lines
3.7 KiB
Markdown

# TreeTrie
An implementation of a trie map that can hold (patterns over) ordered
trees in its keys.
Uses djb's "critbit" structure for each branch node.
There are two types to represent:
Trie = Ok
| Tl Trie
| Br Trie Branch
The first `Trie` argument to `Br` is the wildcard case.
Branch = Mt
| Lf Trie Atom
| Nd Int Branch Branch
The `Branch` type is the "critbit" type, extended with a value slot to
make it into a map. We put the `Atom` last in a `Lf`, because we might
want to use 32-bit pointers internal to our structures, but the hosted
atoms might need 64-bit pointers.
The empty `Trie`, which we'll write `empty` is a cyclic structure (!):
empty = Br empty Mt
It's the only such structure, so we might want to actually represent
it as a distinct constant instead.
Because `Ok` and `Mt` are nullary constants, we can represent them
with special bit-patterns. Let's choose `NULL` for this. They don't
need to be distinct, since they inhabit different types.
This leaves us with four cases to represent, two in each type. While
we could represent these using a single bit, we will instead use two
bits, for debuggability.
33222222222211111111110000000000
10987654321098765432109876543210
|--------------------------------|
|--------------------------------|
| Trie pointer 00| Br case
|--------------------------------|
| Branch pointer 00|
|--------------------------------|
|--------------------------------|
| Trie pointer 01| Tl case
|--------------------------------|
|--------------------------------|
| Trie pointer 10| Lf case
|--------------------------------|
| Atom pointer |
| (may be 64 bits long) |
|--------------------------------|
|--------------------------------|
| Int 11| Nd case
|--------------------------------|
| Branch pointer 00|
|--------------------------------|
| Branch pointer 00|
|--------------------------------|
We use a weak hash table to index all our objects, because we need to
hash-cons them.
Perhaps it could be a Robin-Hood hashtable with backward shift
deletion,
<http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/>.
Hmm. For critbit to work, we need to be able to examine the
bitpatterns of atoms. However, if those bitpatterns represent pointers
to Racket-level objects, and Racket uses a moving collector, then not
only will our atom pointers become out of date after a GC, even if we
could update them we'd have to reindex each critbit tree.
So we probably want `Atom` to instead be some index into a *different*
table. (Another level of indirection!) While any (referenced) `Trie`
holds an indirect reference to a given `Atom`, the underlying Racket
object should be preserved across collections. We will need to find
the `Atom` for a given Racket object (based on `equal?` rather than
`eq?`), and the Racket object for a given `Atom` (an easy table
lookup). We'll want to not hold an `Atom`'s Racket object longer than
necessary.
It might be better to have the tag bits in each pointer to an object,
rather than in the object header: the tag bits would then identify
which of four separate heaps (each with its own object size) is being
referred to.
Our data structures are never cyclic.
Supporting direct atoms is probably a sensible thing to do, so that
e.g. fixnums map to `Atom` without having to take up space in the
table. In fact, the host language should probably allocate and manage
the `Atom`-to-host-object table itself! That way, our code can be
completely ignorant of that kind of detail.