Guacamole: Bytes of random data
By Robert EscrivaImagine you needed to generate 10TB of random data and needed to be able to efficiently convey to another individual what you expect this data to contain. How would you do it?
This particular scenario comes up quite a bit in storage system benchmarking and validation, where it is necessary to be able to quickly describe generate large quantities of data and later recall exactly which bytes comprise that data.
This post introduces guacamole, a pseudo-random number generator (PRNG) that provides a seekable random stream. We will see how guacamole helps us answer the above challenge.
Guacamole by Example
Guacamole is provided as part of the
hack.systems/random
package. To bring it
into your environment, you can fetch it via standard Go tooling:
go get hack.systems/random
If we were to go back to the prompt at the start of this post, we could use guacamole to generate 10TB of random data like so:
package main import ( "hack.systems/random/guacamole" ) const ( TenTerabyte uint64 = 1e13 BlockSize uint64 = 64 ) func main() { g := guacamole.New() buf := make([]byte, BlockSize) for i := uint64(0); i < TenTerabyte; i += BlockSize { g.Fill(buf) // do something with buf } }
So far, so good, but our example program looks just like it would for most every
other random number generator. In particular, no matter how optimized the
implementation is, it will be bound by the rate at which a single CPU can
generate the data and the speed with which we can do something with
buf
.
Guacamole helps us work around the single-CPU bottleneck through a
sequential
seed property
. Where a typical PRNG provides a
seed
function that takes some
user-provided data and unpredictably transforms it into a position within the
random stream, guacamole's
seed
function guarantees that adjacent seeds
produce adjacent outputs.
Guacamole's sequential seed property allows us to break up the work while
producing the same sequence of random bytes. We can get to the same state that
g
is in after hours of executing the above code with a single, constant time
seed operation:
g.Seed(TenTerabye / BlockSize)
Here, we've seeded the generator with the index corresponding to the tenth terabyte. Guacamole generates data in 64 byte blocks, so we've taken that into account here.
We can exploit the speed and predictability of the seed operation to parallelize our original task. If we break the data into 1TB chunks, we can use a full ten CPU cores to generate the same data we generated above:
package main import ( "hack.systems/random/guacamole" ) const ( Terabyte uint64 = 1e9 BlockSize uint64 = 64 ) func genTerabyte(i int) { g := guacamole.New() buf := make([]byte, BlockSize) g.Seed(uint64(i) * Terabyte) for i := uint64(0); i < Terabyte; i += BlockSize { g.Fill(buf) // do something with buf } } func main() { for i := 0; i < 10; i++ { genTerabyte(i) } }
At this point, it is worth examining why we would care to preserve the random number stream between these two different use cases. Why would care about the actual content of the random data? After all, it is random and has no meaning to us, right?
Not quite, because we actually know and can say quite a bit about this random data. For instance, we can directly compute the value of any 1MB segment within the data. If the random data is being used to exercise a storage system, we don't need to know anything about the actual bytes themselves to know that the bytes stored in the system better match they bytes we expect.
Salsa and Guacamole
Guacamole is a chopped-up version of the "salsa" cipher from DJB. Salsa was a great choice for this because it did not require keeping any state for the stream other than an index into the stream. Guacamole is a transformed version of salsa that makes this index an explicit input, makes the cipher operate on explicit 64B boundaries, and inlines constant values for the key and plaintext that allow for better performance.
The name is largely a play on words that comes from a misunderstanding I had. When I first learned of the salsa cipher, my mental association was with the dip for chips and not with the variety of dance. Later when it came time to assign a name to my fork of salsa that was not as the original author intended, I thought it'd be a humorous pun to play off the salsa homonym so that I was using both the code and the name in ways that weren't originally intended.
Validation
Random number generators have varying degrees of quality. Many simpler random
number generators have a tendency to repeat or have predictable patterns in the
lower order bits of the numbers that they generate. I've heard horror stories
of people using the standard
drand48
family of random number generators and
finding that it introduces significant artifacts into their experimental
results.
To help quantify the extent to which a random number generator is "good", researchers have assembled a collection of tests of randomness that can unveil weaknesses in random number generators. While what these tests actually do is outside the scope of this post, an intuition for what they do is quite helpful.
Imagine if we were to use a PRNG to flip a coin one million times and see that it comes up almost evenly split between heads and tails. This sounds like a good property for the generator to have, so it would pass this test. Another test could be to see how many time heads turns up in a row. A generator that passed the first test could fail this second test if, e.g., it swapped between heads and tails every 997 coin tosses. To have that many heads in a row once is a one in one novemdecillion (thats one with sixty zeroes). To have that happen once is practically impossible; if it happens twice, the generator is broken, and we can declare it as such.
The
Dieharder random number test
suite
is a collection
of tests much more sophisticated than our coin-flipping test that can show how
good (or bad) a random number generator is. Running it on a stream of data from
guacamole shows that
111/114 tests
pass
, with the remaining three showing
signs of weakness. For a baseline, a run with
/dev/urandom
on Linux passed
only 110/114 tests
.
Damned Lies
Guacamole is a Go facade around C and amd64 assembly implementations of the algorithm. The assembly version gives significantly better performance than the portable C code, producing up to five times more bytes per second than the C code can produce.
On a 2014-era Intel Core i7-4600 (Haswell) CPU at 2.1Ghz (in a laptop), guacamole can pump out more than one gigabyte per second per core:
Block Size | Portable | SSE4.1 |
---|---|---|
64B | 89.8 MB/s | 218.7 MB/s |
1KB | 134.4 MB/s | 876.6 MB/s |
4KB | 136.5 MB/s | 1054.0 MB/s |
64KB | 142.5 MB/s | 1134.2 MB/s |
1MB | 142.2 MB/s | 1056.2 MB/s |
These benchmarks ship alongside guacamole and you can see how guacamole performs on your own hardware with:
GOMAXPROCS=1 go test hack.systems/random/guacamole \ -bench=Guac -benchtime=10s
Summary
Guacamole is a low-level pseudo-random number generator with a unique sequential
seed property. This allows tests and benchmarks to model rather large data sets
without having to pay a storage cost to materialize the data set. In future
posts, we'll explore some other pieces of the
hack.systems/random
package that
build on top of guacamole and demonstrate how the low-level random bytes it
produces can be manipulated to build higher-level primitives.