Benchmarks

We currently use pytest-benchmark to write tests that measure the time and resources taken by various tasks.

To run benchmark tests, once inside a cloned Soledad repository, do the following:

tox -e benchmark

Results of automated benchmarking for each commit in the repository can be seen at https://benchmarks.leap.se/.

Benchmark tests also depend on tox and CouchDB. See the Tests page for more information on how to set up the test environment.

Test repetition

pytest-benchmark runs each test multiple times so it can provide meaningful statistics for the time taken by a typical run of a test function. The number of times that a test is run can be configured manually or automatically.

When configured automatically, the number of runs is decided by taking into account multiple pytest-benchmark configuration parameters. See the corresponding documentation for more details on how automatic calibration works.

To keep both the number of repetitions and the total running time reasonable, we let pytest-benchmark choose the number of repetitions for faster tests, and manually limit the number of repetitions for slower tests.

Currently, tests for synchronization and sqlcipher asynchronous document creation are fixed to run 4 times each. For all other tests, pytest-benchmark decides how many times to run each one. With this setup, the benchmark suite takes approximately 7 minutes to run on our CI server. As the benchmark suite is run twice (once for time and CPU stats and a second time for memory stats), the whole benchmark run takes around 15 minutes.

When pytest-benchmark calibrates automatically, the actual number of times a test is run depends on several parameters: the time taken for a sample run, the configured minimum number of rounds, and the maximum time allowed for a benchmark. For a snapshot of the number of rounds for each test function, see the soledad benchmarks wiki page.

Sync size statistics

Currently, the main use of Soledad is to synchronize client-encrypted email data. Because of that, it makes sense to measure the time and resources taken to synchronize an amount of data that is realistically comparable to a user's mailbox.

To determine a good example data set for synchronization tests, we used the message sizes from one week of incoming and outgoing email flow of a friendly provider. The resulting statistics are (all sizes are in KB):

          outgoing    incoming
min          0.675       0.461
max      25531.361   25571.748
mean       252.411     110.626
median       5.320      14.974
mode         1.404       1.411
stddev    1376.930     732.933
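For illustration, statistics of this kind could be computed from a list of message sizes with Python's standard `statistics` module. This is only a sketch: the input sizes are made up, and both the unit (1 KB = 1000 bytes) and the use of the sample standard deviation are assumptions, since the source does not say how the figures above were produced.

```python
import statistics


def summarize_sizes_kb(sizes_bytes):
    """Return summary statistics, in KB, for a list of sizes in bytes.

    Assumptions: 1 KB = 1000 bytes, and stddev is the sample
    standard deviation.
    """
    sizes_kb = [size / 1000 for size in sizes_bytes]
    return {
        "min": min(sizes_kb),
        "max": max(sizes_kb),
        "mean": statistics.mean(sizes_kb),
        "median": statistics.median(sizes_kb),
        "mode": statistics.mode(sizes_kb),
        "stddev": statistics.stdev(sizes_kb),
    }


# Made-up sizes in bytes, for illustration only.
example = [1000, 1000, 5000, 15000, 250000]
stats = summarize_sizes_kb(example)
print(stats)
```

A mean far above the median, as in the table above, indicates that a small number of large messages account for much of the total volume.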

Sync test scenarios

Ideally, we would want to run tests for a big data set (i.e. a high number of documents and a big payload size), but that may be infeasible given time and resource limitations. Because of that, we choose a smaller data set and assume that the behaviour scales roughly linearly, which gives an idea of what to expect for larger sets.
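The linearity assumption amounts to simple proportional scaling, which can be sketched as follows. The function name and the measurement used here are hypothetical and are not taken from actual benchmark results.

```python
def extrapolate_linear(measured_seconds, measured_mb, target_mb):
    """Estimate the time for target_mb, assuming time grows linearly with size."""
    seconds_per_mb = measured_seconds / measured_mb
    return seconds_per_mb * target_mb


# Made-up measurement: a 10 MB sync that took 12 seconds,
# extrapolated to a 100 MB data set.
estimate = extrapolate_linear(measured_seconds=12.0, measured_mb=10, target_mb=100)
print(round(estimate, 1))
```

Such an estimate is only as good as the linearity assumption itself, so it should be read as a rough indication rather than a prediction.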

Supposing a total data set size of 10MB, some possibilities for the number of documents and the document sizes for testing download and upload can be seen below. Scenarios marked in bold are the ones that are actually run in the current sync benchmark tests, and you can see the current graphs for each one by following the corresponding links:

In each of the above scenarios all the documents are of the same size. If we want to account for some variability in document sizes, it is sufficient to come up with a simple scenario where the average, minimum and maximum sizes are roughly consistent with the above statistics, like the following one:

  • 60 x 15KB + 1 x 1MB
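As a rough sketch, the mixed scenario above can be expanded into a list of payload sizes and summarized so it can be compared with the email statistics. The helper name is hypothetical, and binary units (1 KB = 1024 bytes) are an assumption, since the source does not specify them.

```python
import statistics

KB = 1024  # assumption: binary units
MB = 1024 * KB


def build_payload_sizes(scenario):
    """Expand [(count, size_in_bytes), ...] into a flat list of sizes."""
    sizes = []
    for count, size in scenario:
        sizes.extend([size] * count)
    return sizes


# The scenario from the text: 60 x 15KB + 1 x 1MB.
sizes = build_payload_sizes([(60, 15 * KB), (1, 1 * MB)])

print("documents:", len(sizes))
print("total KB:", sum(sizes) / KB)
print("mean KB:", round(statistics.mean(sizes) / KB, 2))
print("median KB:", statistics.median(sizes) / KB)
```

Here the median stays at the size of the small documents while the single large document pulls the mean upwards, which mirrors the shape of the statistics above.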