We currently use pytest-benchmark to write tests to assess the time and resources taken by various tasks.
To run benchmark tests, once inside a cloned Soledad repository, do the following:
tox -e benchmark
Results of automated benchmarking for each commit in the repository can be seen at https://benchmarks.leap.se/.
Benchmark tests also depend on tox and CouchDB. See the Tests page for more information on how to set up the test environment.
pytest-benchmark runs tests multiple times so it can provide meaningful
statistics for the time taken for a typical run of a test function. The number
of times that the test is run can be manually or automatically configured.
When automatically configured, the number of runs is decided by taking into
account pytest-benchmark configuration parameters. See the
corresponding documentation for more
details on how automatic calibration works.
To get a reasonable number of repetitions within a reasonable total run
time, we let pytest-benchmark choose the number of repetitions
for faster tests, and manually limit the number of repetitions for slower tests.
Currently, tests for synchronization and sqlcipher asynchronous document
creation are fixed to run 4 times each. All the other tests are left for
pytest-benchmark to decide how many times to run each one. With this setup,
the benchmark suite takes approximately 7 minutes to run on our CI server.
As the benchmark suite is run twice (once for time and CPU stats and a second
time for memory stats), the whole benchmarks run takes around 15 minutes.
The actual number of times a test is run when calibration is done automatically
by pytest-benchmark depends on several parameters: the time taken for a sample
run and the configuration of the minimum number of rounds and maximum time
allowed for a benchmark. For a snapshot of the number of rounds for each test
function see the soledad benchmarks wiki page.
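To get a feel for how those parameters interact, here is a simplified model, not pytest-benchmark's actual calibration code. The function name `estimate_rounds` is hypothetical, and the default values are assumptions meant to mirror the `--benchmark-min-rounds` and `--benchmark-max-time` options; check the pytest-benchmark documentation for the real defaults and algorithm.

```python
import math


def estimate_rounds(sample_seconds, min_rounds=5, max_time=1.0):
    """Rough model: fill max_time with runs, never going below min_rounds."""
    if sample_seconds <= 0:
        raise ValueError("sample_seconds must be positive")
    # How many runs of this duration fit in the time budget.
    fit_in_budget = math.ceil(max_time / sample_seconds)
    return max(min_rounds, fit_in_budget)


print(estimate_rounds(0.125))  # fast test: 8 rounds fit in the budget
print(estimate_rounds(2.0))    # slow test: falls back to min_rounds, 5
```

Under this model a slow test still runs `min_rounds` times regardless of the time budget, which is why slow tests are better limited manually.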
Sync size statistics
Currently, the main use of Soledad is to synchronize client-encrypted email data. Because of that, it makes sense to measure the time and resources taken to synchronize an amount of data that is realistically comparable to a user’s mailbox.
To determine a good example data set for synchronization tests, we used the message sizes from one week of incoming and outgoing email flow of a friendly provider. The resulting statistics are (all sizes are in KB):
Sync test scenarios
Ideally, we would want to run tests for a big data set (i.e. a high number of documents and a big payload size), but that may be infeasible given time and resource limitations. Because of that, we choose a smaller data set and assume that the behaviour scales roughly linearly, to get an idea of how larger sets would perform.
Assuming a total data set size of 10MB, some possible combinations of number of documents and document size for testing download and upload are listed below. Scenarios followed by (upload, download) are the ones that are actually run in the current sync benchmark tests, and the upload and download entries link to the current graphs for each one:
- 10 x 1M
- 20 x 500K (upload, download)
- 100 x 100K (upload, download)
- 200 x 50K
- 1000 x 10K (upload, download)
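The list above can be reproduced by dividing the total size by each document size. This is only a sketch of the arithmetic: the names are hypothetical, and it assumes 1MB = 1000KB, which is the convention the list implies (1000 x 10K = 10MB).

```python
TOTAL_KB = 10_000  # 10MB, assuming 1MB = 1000KB

DOC_SIZES_KB = [1000, 500, 100, 50, 10]


def scenarios(total_kb, sizes_kb):
    """Return (document count, document size in KB) pairs for a fixed total."""
    return [(total_kb // size, size) for size in sizes_kb]


for count, size in scenarios(TOTAL_KB, DOC_SIZES_KB):
    print(f"{count} x {size}K")
```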
In each of the above scenarios all the documents are of the same size. If we want to account for some variability in document sizes, it is sufficient to come up with a simple scenario where the average, minimum and maximum sizes are roughly consistent with the above statistics, like the following one:
- 60 x 15KB + 1 x 1MB
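The summary figures for this mixed scenario can be checked with a few lines. As before, this is an illustrative sketch that assumes 1MB = 1000KB; the variable names are hypothetical.

```python
# 60 documents of 15KB plus a single 1MB document (1MB taken as 1000KB).
sizes_kb = [15] * 60 + [1000]

count = len(sizes_kb)
total_kb = sum(sizes_kb)
average_kb = total_kb / count

print(f"documents: {count}")
print(f"total:     {total_kb} KB")
print(f"min / max: {min(sizes_kb)} KB / {max(sizes_kb)} KB")
print(f"average:   {average_kb:.1f} KB")
```

The single large document roughly doubles the average size relative to the 15KB documents, which is the kind of skew a real mailbox tends to show.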