# Indexing and Sorting

This is perhaps the most important implementation problem that SQL databases
must address.

## Simple and ignorant

All sorting is done with simple `memcmp()`-style byte comparisons.
This requires that every key's byte representation sorts the same way as the
key does semantically.
The B+Tree traversal algorithm is kept simpler this way.

The algorithm doesn't need to be aware of the types contained in the keys, so
there's no need for specialized comparators.
To the traversal algorithm, all keys are simple byte collections that are always
ordered the same way.
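
As a rough illustration of what this buys (a sketch, not llamadb's actual traversal code), the lookup below uses Python's `bisect` over a sorted list as a stand-in for a single B+Tree node. The helper name `find` and the sample node contents are made up for the example. The same function serves an integer index and a string index without knowing which is which.

```python
import bisect

def find(sorted_keys, key):
    """Return the position of `key` in `sorted_keys`, or None if absent.

    Only plain byte comparison is used; the function has no idea what
    type the keys encode.
    """
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return None

# Hypothetical node contents: already-encoded keys in sorted order.
int_node = [bytes.fromhex(h) for h in ("0000", "7FFF", "8000", "FFFF")]
str_node = [b"Apple\x00", b"Apples\x00", b"Banana\x00"]

print(find(int_node, bytes.fromhex("8000")))
print(find(str_node, b"Apples\x00"))
print(find(str_node, b"Pear\x00"))
```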


## Byte sorting

All keys are stored and sorted as a collection of bytes.

Here's a sorted byte list:
```
00
00 00
00 00 FF
00 01
01
02 00
...
FE FF FF FF FF FF FF
FF
FF 00
FF FF
FF FF FF
FF FF FF FF
```

Keys are compared byte by byte, with each byte treated as an unsigned value.
A key that has another key as a prefix sorts after the shorter key.
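
As a quick check of this ordering, Python's `bytes` type happens to compare the same way, so sorting a shuffled copy of some of the keys above should reproduce the listed order. This is only an illustration; the variable names are arbitrary.

```python
# Illustration only: Python compares bytes objects lexicographically by
# unsigned byte value, and a strict prefix sorts before any longer key.
shuffled = [bytes.fromhex(h) for h in (
    "FF 00", "00 01", "00", "FF FF FF", "01",
    "00 00 FF", "FF", "02 00", "00 00", "FF FF",
)]

ordered = sorted(shuffled)

for key in ordered:
    print(" ".join(f"{b:02X}" for b in key))
```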


## Integers

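All integer keys are stored as big-endian.
If the integer is signed, first add half of the unsigned range, i.e. 2^(n-1) for
an n-bit integer (8-bit => 128). This is the same as flipping the sign bit of
the two's complement representation.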

* 255 unsigned 4-byte => `00 00 00 FF`
* -32768 signed 2-byte => `00 00`
* -1 signed 2-byte => `7F FF`
* 0 signed 2-byte => `80 00`
* 32767 signed 2-byte => `FF FF`
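
The rule above can be sketched in a few lines of Python. The function names `encode_int` and `decode_int` are illustrative assumptions, not llamadb's API. `int.to_bytes` raises `OverflowError` when the value does not fit the requested width, which doubles as a range check here.

```python
def encode_int(value, size, signed):
    """Encode an integer as a `size`-byte big-endian key.

    Signed values are biased by 2**(bits - 1) so that the most negative
    value maps to all zero bytes.
    """
    bias = 1 << (8 * size - 1) if signed else 0
    return (value + bias).to_bytes(size, "big")

def decode_int(data, signed):
    """Inverse of encode_int; the width is taken from len(data)."""
    bias = 1 << (8 * len(data) - 1) if signed else 0
    return int.from_bytes(data, "big") - bias

samples = [32767, -1, 0, -32768]
encoded = sorted(encode_int(v, 2, signed=True) for v in samples)

# Byte order and numeric order agree.
print([decode_int(k, signed=True) for k in encoded])
```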


## Strings

All string keys are stored as UTF-8 and are null-terminated.
A length is not prefixed because this would effectively make the strings sorted
by length instead of lexicographically.

UTF-8 preserves code point order under bytewise comparison. Even with
multi-byte sequences (lead and continuation bytes), the strings sort in
ascending order of their code points.

The null terminator marks the end of the string, as an optimization that
avoids reading the last page(s) of the key to find the length.
Because the string is backed by a `byte[]`, the string length + 1 is also
stored at the end of the key. This is ignored when searching lexicographically.
The terminator also serves as a separator from other column values in a
multi-column key.

Longer strings that share the same beginning as another string are sorted after.

```
41 70 70 6C 65 00               // Apple
41 70 70 6C 65 73 00            // Apples
41 CC 88 70 66 65 6C 00         // Äpfel (NFD)
42 61 6E 61 6E 61 00            // Banana
42 61 6E 61 6E 61 73 00         // Bananas
42 61 6E 64 00                  // Band
42 65 65 68 69 76 65 00         // Beehive
42 65 65 73 00                  // Bees
61 70 70 6C 65 00               // apple
C3 84 70 66 65 6C 00            // Äpfel (NFC)
```

* `WHERE x LIKE 'Apple%'` => `41 70 70 6C 65`
* `WHERE x = 'Apple'` => `41 70 70 6C 65 00`
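
The two lookups above can be sketched as range scans over a sorted list of encoded keys. Everything here is an assumption made for illustration: the names `encode_str`, `prefix_upper_bound` and `scan_prefix`, the use of `bisect` in place of a real B+Tree, and the choice to reject strings containing U+0000 (which would otherwise be indistinguishable from the terminator).

```python
import bisect

def encode_str(text):
    """Encode a string key: UTF-8 bytes followed by a 00 terminator."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        # Assumption: embedded NULs are rejected, since they would be
        # indistinguishable from the terminator.
        raise ValueError("string contains U+0000")
    return data + b"\x00"

def prefix_upper_bound(prefix):
    """Smallest byte string that sorts after every key starting with `prefix`.

    Returns None when no such bound exists (empty or all-FF prefix).
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def scan_prefix(sorted_keys, prefix):
    """Return every key in `sorted_keys` that starts with `prefix`."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    upper = prefix_upper_bound(prefix)
    hi = len(sorted_keys) if upper is None else bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]

words = ["apple", "Banana", "Apples", "A\u0308pfel", "Apple", "\u00c4pfel"]
index = sorted(encode_str(w) for w in words)

like_apple = scan_prefix(index, "Apple".encode("utf-8"))  # LIKE 'Apple%'
equals_apple = scan_prefix(index, encode_str("Apple"))    # = 'Apple'
print(like_apple)
print(equals_apple)
```

Leaving the terminator off the search key turns the lookup into a prefix match; including it pins the match to the exact string.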

Strings are sorted by their UTF-8 representation, and not with a collation
algorithm.
It's theoretically possible to index strings using a collation algorithm if
the algorithm can return a byte representation that sorts the same way.
However, this is not yet supported.


## Floating point numbers

This encoding is mostly compatible with the number ranges from IEEE 754.
The only exception is NaN, which this encoding does not support.

NaN is unsortable/incomparable, and therefore cannot be encoded.

This encoding is basically the same as binary32 IEEE 754, but with flipped bits.
Like the integer types, the encoding is in big-endian
(the byte with the sign bit comes first).

To convert IEEE 754 to or from this encoding:

* If the number is negative, flip all the bits.
* If the number is positive, flip the sign bit.

This way, an encoding of `00 7F FF FF` is a negative number with the highest
exponent and a zero mantissa (negative infinity), which is the smallest value
that can be encoded.
Similarly, an encoding of `FF 80 00 00` is a positive number with the highest
exponent and a zero mantissa (positive infinity), which is the largest value
that can be encoded.

* -inf => `00 7F FF FF`
* -1 => `40 7F FF FF`
* -0 => `7F FF FF FF`
* +0 => `80 00 00 00`
* +1 => `BF 80 00 00`
* +inf => `FF 80 00 00`
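
A minimal sketch of the conversion for binary32, using Python's `struct` module to get at the IEEE 754 bits. The names `encode_f32` and `decode_f32` are assumptions for the example, as is the choice to raise `ValueError` on NaN.

```python
import math
import struct

SIGN = 0x80000000
MASK = 0xFFFFFFFF

def encode_f32(value):
    """Encode a float as a 4-byte key that sorts in numeric order."""
    if math.isnan(value):
        raise ValueError("NaN has no sortable encoding")
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    if bits & SIGN:
        bits ^= MASK   # negative: flip all the bits
    else:
        bits ^= SIGN   # positive: flip only the sign bit
    return bits.to_bytes(4, "big")

def decode_f32(data):
    """Inverse of encode_f32; rejects keys in the excluded NaN ranges."""
    bits = int.from_bytes(data, "big")
    if bits & SIGN:
        bits ^= SIGN   # was positive
    else:
        bits ^= MASK   # was negative
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    if math.isnan(value):
        raise ValueError("key falls in an excluded NaN range")
    return value

values = [1.0, -math.inf, 0.0, math.inf, -1.0, -0.0]
keys = sorted(encode_f32(v) for v in values)
for key in keys:
    print(key.hex().upper(), decode_f32(key))
```

Note that -0 and +0 get distinct keys, with -0 sorting first, as in the list above.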

Excluding NaN leaves 16,777,214 of the 4-byte patterns unused.
The unused ranges (inclusive) are:

* `00 00 00 00` to `00 7F FF FE`
* `FF 80 00 01` to `FF FF FF FF`
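
The count can be checked with a little arithmetic on the two ranges. The helper `is_excluded` is a hypothetical name, included only to show how a key could be tested against them.

```python
LOW_RANGE = (0x00000000, 0x007FFFFE)   # encodings of negative NaNs
HIGH_RANGE = (0xFF800001, 0xFFFFFFFF)  # encodings of positive NaNs

def is_excluded(data):
    """True if a 4-byte key falls in one of the two unused ranges."""
    if len(data) != 4:
        raise ValueError("expected a 4-byte key")
    n = int.from_bytes(data, "big")
    return LOW_RANGE[0] <= n <= LOW_RANGE[1] or HIGH_RANGE[0] <= n <= HIGH_RANGE[1]

# Both ranges are inclusive, hence the + 1.
excluded = sum(hi - lo + 1 for lo, hi in (LOW_RANGE, HIGH_RANGE))
print(f"{excluded:,}")
```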