Borg, Kopia, Restic: backup and resource utilization

[This is the English translation by DeepL]

In this article, I take a closer look at backup-related metrics, in particular those that are easy to measure: backup execution time and network transfer volumes. CPU consumption is not easy to measure on the test platform and, in my context, is of little importance. Measuring storage I/O could have been interesting, but since the backup destination disk is shared with other uses, I couldn't isolate that metric during my tests.

Without going into too much detail on the methodology, here's how I proceeded. On the client side, the backup script writes time-stamped lines to a log file (start of backup for Borg, end of backup for Borg, start of purge for Borg, etc.). On the server side, I used tcpdump to record network traffic. The pcap files were processed with tshark to obtain the list of network conversations. This time-stamped list and the client log file were then fed into Splunk. I thus obtain time-stamped traces marking the start and end of each backup task, with the number of packets and the data volumes exchanged between client and server sandwiched in between.
At each iteration of the backup script, I also collect the volume of the backup repositories.
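
As a rough illustration of the correlation step described above, the sketch below pairs start/end log lines into task windows and sums the traffic that falls inside each window. The real correlation was done in Splunk; this Python version, the tuple layouts, the function names and the sample values are all assumptions made up for the example, not the actual data or tooling.

```python
from datetime import datetime

def task_windows(log_events):
    """Pair 'start'/'end' events into (tool, start, end) windows.

    log_events: iterable of (timestamp, tool, event) tuples (hypothetical layout).
    """
    open_tasks = {}
    windows = []
    for ts, tool, event in sorted(log_events):
        if event == "start":
            open_tasks[tool] = ts
        elif event == "end" and tool in open_tasks:
            windows.append((tool, open_tasks.pop(tool), ts))
    return windows

def volume_per_task(windows, conversations):
    """Sum the bytes of every conversation that starts inside a task window.

    conversations: iterable of (timestamp, bytes_sent, bytes_received)
    tuples (hypothetical layout).
    """
    totals = []
    for tool, start, end in windows:
        inside = [c for c in conversations if start <= c[0] <= end]
        sent = sum(c[1] for c in inside)
        received = sum(c[2] for c in inside)
        totals.append((tool, (end - start).total_seconds(), sent, received))
    return totals

# Made-up sample data, for illustration only.
log_events = [
    (datetime(2024, 2, 19, 10, 0, 0), "borg", "start"),
    (datetime(2024, 2, 19, 10, 0, 30), "borg", "end"),
    (datetime(2024, 2, 19, 10, 0, 31), "kopia", "start"),
    (datetime(2024, 2, 19, 10, 0, 39), "kopia", "end"),
]
conversations = [
    (datetime(2024, 2, 19, 10, 0, 5), 1000, 200),
    (datetime(2024, 2, 19, 10, 0, 20), 500, 100),
    (datetime(2024, 2, 19, 10, 0, 33), 300, 50),
]

totals = volume_per_task(task_windows(log_events), conversations)
for row in totals:
    print(row)
```

Keeping the tasks strictly sequential, as in the backup script, is what makes this simple time-window attribution workable: no two programs talk to the server at the same time.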

The first statistics I'd like to detail here relate to the creation of the first backup. This already reveals a major difference between the three solutions:

For an initial volume of 145 GB

	duration	vol in	vol out	storage
Borg	1118 s	1.2 GB	52 GB	62 GB
Kopia	716 s	1 GB	39 GB	75 GB
Restic	583 s	0.7 GB	32.2 GB	58 GB

In terms of creation time, Borg takes nearly twice as long as Restic (1118 s versus 583 s). This is quite significant. Restic is configured to open four concurrent read streams, which obviously helps it make the most of all available resources (network, I/O). Restic also uses its own HTTP server for transport. This lets it send fewer packets than Borg or Kopia (33.1 million for Restic versus 54.6 million for Kopia). The result is less data in transit than with the other two (fewer packets = less overhead).
Note that at this stage of my experiment I hadn't enabled Kopia's compression (off by default), although its built-in SSH client presumably compresses the streams. Compression is also disabled on the destination ZFS storage.

I have a few theories but no real explanation for the differences in stored volume, either between the programs themselves or between what is stored and the volume of data in transit.
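
To make these comparisons easier to follow, here is a small sketch that recomputes a few ratios from the first-backup table. The figures are the ones given above; the dictionary layout and key names are my own choice for the example, and "vol out" is read here as the data sent to the server.

```python
SOURCE_GB = 145  # initial volume to back up

# Figures copied from the first-backup table.
first_backup = {
    "Borg":   {"duration_s": 1118, "net_out_gb": 52.0, "stored_gb": 62},
    "Kopia":  {"duration_s": 716,  "net_out_gb": 39.0, "stored_gb": 75},
    "Restic": {"duration_s": 583,  "net_out_gb": 32.2, "stored_gb": 58},
}

summary = {}
for tool, s in first_backup.items():
    summary[tool] = {
        # Source data processed per second of backup time.
        "source_mb_per_s": SOURCE_GB * 1000 / s["duration_s"],
        # Share of the source volume that actually crossed the network.
        "transit_vs_source": s["net_out_gb"] / SOURCE_GB,
        # Repository size compared with what was sent over the wire.
        "stored_vs_transit": s["stored_gb"] / s["net_out_gb"],
    }

duration_ratio = first_backup["Borg"]["duration_s"] / first_backup["Restic"]["duration_s"]
packet_ratio = 54.6 / 33.1  # Kopia packets / Restic packets, in millions

for tool, r in summary.items():
    print(tool, {k: round(v, 2) for k, v in r.items()})
print("Borg/Restic duration:", round(duration_ratio, 2))
print("Kopia/Restic packets:", round(packet_ratio, 2))
```

For all three programs the repository ends up larger than the volume seen in transit, which is the unexplained gap mentioned above.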

Subsequent backups are much shorter, since only data modified since the previous backup is sent to the server. In the long run, Restic loses its advantage over Kopia and comes close to Borg's times.

	Avg	Median	perc98
Borg	30.54 s	30.00 s	38.86 s
Kopia	8.53 s	8.00 s	13.86 s
Restic	24.96 s	25.00 s	30.00 s

Kopia's advantage is truly impressive.
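
For readers who want to compute the same three indicators on their own logs, here is a minimal sketch using Python's standard library. The duration list is made up for the example, and the 98th percentile uses the "inclusive" interpolation method, which may differ slightly from what Splunk's perc98 returns.

```python
import statistics

def summarize(durations):
    """Return average, median and 98th percentile of a list of durations (seconds)."""
    return {
        "avg": statistics.fmean(durations),
        "median": statistics.median(durations),
        # quantiles(n=100) returns 99 cut points; index 97 is the 98th percentile.
        "perc98": statistics.quantiles(durations, n=100, method="inclusive")[97],
    }

# Hypothetical backup durations in seconds, for illustration only.
sample_durations = [8, 8, 9, 7, 8, 14, 8, 9, 8, 7]
stats = summarize(sample_durations)
print({k: round(v, 2) for k, v in stats.items()})
```

The 98th percentile is a handy complement to the average here: it shows how long the slowest runs take without being dominated by a single outlier.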

Over a month or so of measurements, the average duration of Borg and Restic backups even trended upward, while Kopia's trended downward (average duration in seconds per period: four periods of 7 days and one of 4 days, 295 backups for each program):

	Borg	Kopia	Restic
2024-02-19	29.07	8.76	23.72
2024-02-26	31.28	9.08	25.57
2024-03-04	31.59	8.26	25.29
2024-03-11	31.36	7.77	26.34
2024-03-18	32.38	8.38	26.09
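
One simple way to put a number on these trends is a least-squares slope over the five period averages. The sketch below is only an illustration: it gives every period the same weight, although the last one covers 4 days instead of 7.

```python
# Period averages (seconds), copied from the table above.
weekly_avg = {
    "Borg":   [29.07, 31.28, 31.59, 31.36, 32.38],
    "Kopia":  [8.76, 9.08, 8.26, 7.77, 8.38],
    "Restic": [23.72, 25.57, 25.29, 26.34, 26.09],
}

def slope(values):
    """Ordinary least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Seconds gained (+) or lost (-) per period, on average.
slopes = {tool: slope(values) for tool, values in weekly_avg.items()}
for tool, s in slopes.items():
    print(f"{tool}: {s:+.2f} s per period")
```

With so few points this is an indication rather than a statistically solid result, but the signs match the reading of the table: upward for Borg and Restic, downward for Kopia.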

In terms of backup times, Kopia is far superior to its competitors. In terms of network exchange volumes, Kopia is also the best, but by a much smaller margin. It ranks just ahead of Borg and far ahead of Restic. I haven't aggregated any statistics because my tcpdump capture wasn't practical to keep running over a long period. I do, however, have a few graphs covering a 4-day period, which I have checked manually.

The evolution of backup storage volumes is interesting for several reasons. Of course, it helps with capacity planning. It also shows which software is the most efficient at packing the maximum number of backups into the minimum amount of space. Finally, it highlights an absolutely fundamental difference in retention management between Kopia and its two competitors, a difference that really took me by surprise.

The following graph shows the evolution of backup volumes from February 17 to March 21. The y-axis is in bytes, and the abbreviation B stands for "billion": 60B is therefore equivalent to 60 GB.

From the outset, Borg has been top of the class, with a very efficient approach. Kopia and Restic, on the other hand, saw their volumes soar. Restic finally bends its trajectory with the first purging tasks aimed at complying with the retention policy.
Marker 1 marks the moment when I activated compression for Kopia, without which I feared my test would be cut short by saturation of the destination storage.
Marker 2 marks the first purge of obsolete backups for Kopia. The descending sawtooth pattern is due to the fact that Kopia purges uncompressed backups and replaces them with compressed ones.
Marker 3 marks the evening of March 5. At this point, Borg has 68 archives, Restic has 52 snapshots, and Kopia 24. Yet the retention policies were aligned:

Borg purge options:

--keep-last 10 --keep-hourly 48 --keep-daily 7 --keep-weekly 4 --keep-monthly 24 --keep-yearly 3

Restic purge options:

--keep-last 10 --keep-hourly 48 --keep-daily 7 --keep-weekly 4 --keep-monthly 24 --keep-yearly 3

Kopia retention options:

	 Annual snapshots: 3 
	 Monthly snapshots: 24 
	 Weekly snapshots: 4 
	 Daily snapshots: 7
	 Hourly snapshots: 48
	 Latest snapshots: 10

The difference between Borg and Restic is very simple. Borg keeps an archive on the basis of a single retention criterion: an archive may be kept because it matches an hourly retention or a daily retention, but never both. Restic, on the other hand, labels snapshots with several retentions: a snapshot can be hourly, daily and weekly at the same time, and so on. It also assigns the long-retention labels (yearly, monthly…) in advance. As of March 5, Restic was storing the equivalent of 74 snapshots in terms of labels, even though in reality it only had 52 restore points.
All in all, this is consistent.

Borg:

(rule: daily #5):        20240222.1708642488
(rule: daily #6):        20240221.1708553824
(rule: daily #7):        20240220.1708469744
(rule: weekly #1):       20240218.1708295990
(rule: weekly[oldest] #2): 20240217.1708164925

Restic:

2024-02-17 11:35:05                monthly snapshot  /Users/patpro
                                   yearly snapshot
2024-02-18 23:41:20                weekly snapshot   /Users/patpro
2024-02-25 23:40:46                weekly snapshot   /Users/patpro
2024-02-28 19:45:29                daily snapshot    /Users/patpro
2024-02-29 12:46:31                hourly snapshot   /Users/patpro
../..
2024-02-29 19:23:42                hourly snapshot   /Users/patpro
                                   daily snapshot
                                   monthly snapshot
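
The difference between the two labelling approaches can be reproduced with a toy model. The sketch below is a deliberately simplified illustration, not the actual code of Borg or Restic: the `select` function, its `single_label` switch, the two rules and the snapshot schedule are all assumptions made up for the example.

```python
from datetime import datetime

def hour_key(ts):
    return ts.strftime("%Y-%m-%d %H")

def day_key(ts):
    return ts.strftime("%Y-%m-%d")

def select(snapshots, rules, single_label):
    """Return {timestamp: [labels]} for the snapshots kept by the rules.

    single_label=True  : a snapshot kept by an earlier rule is not counted
                         by later rules (one label per archive).
    single_label=False : every rule is evaluated independently, so one
                         snapshot can carry several labels.
    """
    kept = {}
    for name, period_of, count in rules:
        last_period, selected = None, 0
        for ts in sorted(snapshots, reverse=True):  # newest first
            if selected == count:
                break
            period = period_of(ts)
            if period == last_period:
                continue  # only the newest snapshot of a period is a candidate
            last_period = period
            if single_label and ts in kept:
                continue  # already kept by an earlier rule
            kept.setdefault(ts, []).append(name)
            selected += 1
    return kept

def label_count(kept):
    return sum(len(labels) for labels in kept.values())

# Toy policy and schedule: 9 snapshots per day (08:00 to 16:00) over 3 days.
RULES = [("hourly", hour_key, 4), ("daily", day_key, 2)]
snapshots = [datetime(2024, 3, day, hour) for day in (1, 2, 3) for hour in range(8, 17)]

single = select(snapshots, RULES, single_label=True)
multi = select(snapshots, RULES, single_label=False)
print("single label:", len(single), "kept,", label_count(single), "labels")
print("multi label: ", len(multi), "kept,", label_count(multi), "labels")
```

In this toy case the single-label model keeps 6 archives, while the multi-label model hands out the same 6 labels to only 5 snapshots. The real gap (68 archives for Borg, 52 restore points carrying the equivalent of 74 snapshots for Restic) goes in the same direction for the first two figures.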

Kopia is a very different story. It simply doesn't calculate retention like the other two. For Borg and Restic, --keep-hourly 48 means "keep the last 48 hourly archives/snapshots". For Kopia, Hourly snapshots: 48 means "keep hourly snapshots for a maximum of 48 hours". This is extremely different. Remember, in the first article of the series I pointed out that, due to the use of launchd with StartInterval, "the script is actually launched a little less than once per hour. For example, if each execution lasts 5 minutes, then it takes 26h and not 24h for it to be launched 24 times". I also pointed out that the client sleeps at night, which interrupts the backups. Furthermore, from February 27 onwards, I narrowed the backup window, going from 17 or 18 backups per day down to 8 or 9.
Within this reduced window, Kopia can only make 8 to 9 backups per day. After 48 hours, it will have taken 16 to 18 hourly snapshots; an hour later, the oldest will have exceeded its lifetime and will be purged. In the same situation, Borg will keep 48 "hourly" archives spanning some 6 days.
Add to this the fact that Kopia manages its snapshots in the same way as Restic: each snapshot can carry several retention labels.

Kopia:

  2024-02-18 23:38:41 CET ../.. (weekly-4)
  2024-02-25 23:43:54 CET ../.. (weekly-3)
  2024-02-28 19:46:56 CET ../.. (daily-7)
  2024-02-29 19:20:29 CET ../.. (daily-6,monthly-2)
  2024-03-01 19:02:51 CET ../.. (daily-5)
  2024-03-02 19:55:36 CET ../.. (daily-4)
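
The arithmetic behind the hourly retention comparison above can be written down in a few lines. This is only a sketch of the reasoning as described in this article (count-based retention for Borg, a 48-hour window for Kopia); the function names are made up for the example and nothing here calls the real tools.

```python
def launch_span_hours(runs, interval_min=60, run_min=5):
    """Hours needed for `runs` launches when the interval restarts after each run."""
    return runs * (interval_min + run_min) / 60

def kept_by_time_window(backups_per_day, window_hours=48):
    """Hourly snapshots alive when retention is a time window (Kopia, as described above)."""
    return backups_per_day * window_hours // 24

def span_days_by_count(keep_count, backups_per_day):
    """Days covered when retention is a number of archives (Borg, Restic)."""
    return keep_count / backups_per_day

print(launch_span_hours(24))                           # 24 launches, 5 min each
print(kept_by_time_window(8), kept_by_time_window(9))  # Kopia, 8 or 9 backups/day
print(span_days_by_count(48, 8))                       # Borg, --keep-hourly 48
```

With 8 backups per day, the time-window model holds 16 hourly snapshots covering 2 days, while the count-based model holds 48 archives covering 6 days.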

Finally, Kopia doesn't hesitate to delete the oldest snapshots, whereas Borg and Restic treasure them. By March 5, Kopia had already deleted the very first snapshot. As of March 20, the oldest snapshot available is from February 29.

Here's the situation on March 20:

Number of restore points by date and program

	Borg	Kopia	Restic
2024-02-17	1		1
2024-02-18	1
2024-02-25	1
2024-02-29	1	1	1
2024-03-03	1	1	1
2024-03-06	1
2024-03-07	1
2024-03-08	1
2024-03-09	1
2024-03-10	1	1	1
2024-03-11	1
2024-03-12	1
2024-03-13	4
2024-03-14	8	1	1
2024-03-15	8	1	8
2024-03-16	8	1	8
2024-03-17	8	1	8
2024-03-18	8	1	8
2024-03-19	8	8	8
2024-03-20	7	8	7
total	71	24	52
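
As a quick cross-check of this table, the sketch below recomputes the totals and the number of days between the oldest and the newest restore point for each program. It assumes that a row with a single value belongs to the Borg column and that empty cells mean zero, which is consistent with the totals shown.

```python
from datetime import date

# (Borg, Kopia, Restic) restore points per date, copied from the table above.
# Assumption: empty cells are read as 0.
restore_points = {
    "2024-02-17": (1, 0, 1), "2024-02-18": (1, 0, 0), "2024-02-25": (1, 0, 0),
    "2024-02-29": (1, 1, 1), "2024-03-03": (1, 1, 1), "2024-03-06": (1, 0, 0),
    "2024-03-07": (1, 0, 0), "2024-03-08": (1, 0, 0), "2024-03-09": (1, 0, 0),
    "2024-03-10": (1, 1, 1), "2024-03-11": (1, 0, 0), "2024-03-12": (1, 0, 0),
    "2024-03-13": (4, 0, 0), "2024-03-14": (8, 1, 1), "2024-03-15": (8, 1, 8),
    "2024-03-16": (8, 1, 8), "2024-03-17": (8, 1, 8), "2024-03-18": (8, 1, 8),
    "2024-03-19": (8, 8, 8), "2024-03-20": (7, 8, 7),
}
TOOLS = ("Borg", "Kopia", "Restic")

totals = {}
span_days = {}
for i, tool in enumerate(TOOLS):
    # Dates on which this program still has at least one restore point.
    days = [date.fromisoformat(d) for d, counts in restore_points.items() if counts[i] > 0]
    totals[tool] = sum(counts[i] for counts in restore_points.values())
    span_days[tool] = (max(days) - min(days)).days

for tool in TOOLS:
    print(f"{tool}: {totals[tool]} restore points over {span_days[tool]} days")
```

Borg and Restic both reach back 32 days to the very first backup, whereas Kopia's oldest restore point is 20 days old.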

In conclusion, if you're looking for very fast, bandwidth-saving backup tasks at the expense of guaranteed, readable retention, Kopia is the best candidate. If, on the other hand, you want to maximize the number of restore points at the expense of backup time and bandwidth usage, Borg is the clear winner. Kopia users can nevertheless rely on the Latest snapshots setting to guarantee a minimum retention for a client that is not always switched on when its periodic backup is due.

	Borg	Kopia	Restic
backup duration	-	+	-
network volume	+	+	-
retention management	+	-	+
