Convergent Encryption and Why No One Uses It
by Steve Marx on
“Convergent encryption” allows for end-to-end encryption with server-side deduplication across all users’ data.
The first time I heard about this, it sounded impossible. End-to-end encryption means I encrypt the file before sending it anywhere. The idea of deduplication is to only store a single copy of each file, but encrypted files from different users should be unique!
Convergent encryption solves this seemingly impossible problem, but it has some major drawbacks that prevent adoption.
How to “converge” on a single encrypted file
If you already know how convergent encryption works, you can skip this section. But for everyone else, it’s fun to reinvent it. Here are the high-level goals:
- We want end-to-end encryption, which means each user encrypts their files before uploading them to a shared storage service. Only the user (or users!) who encrypted the file should be able to decrypt it.
- We want server-side deduplication. If two users encrypt the same file, the result of their encryption needs to be byte-for-byte identical. The server can then safely store only one copy of the file.
For multiple users to encrypt a file and arrive at an identical result, they must use the same encryption key. That key, then, has to be derived from something only those users know. So what do users who possess a given file know that other users don’t?
The answer is the content of the file! The trick is to encrypt the file with the hash of its content:
enc(m) = E(H(m), m)
E(k, m) is a symmetric encryption algorithm such as AES that encrypts message m with key k. H is a cryptographic hash function like SHA-1.[1]
Any user who has the message/file m will end up with the same encrypted version. Those users will have no trouble decrypting the file later (as long as they hold on to the key H(m)), but no one else will be able to learn the file’s content.
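Here's a minimal sketch of enc(m) = E(H(m), m) in Python. To keep it runnable with only the standard library, I'm assuming SHA-256 for H and a toy XOR keystream cipher (`toy_cipher`, a made-up name) in place of AES. Treat it as an illustration of the idea, not something to deploy.

```python
import hashlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a keystream derived from key.

    ASSUMPTION: a toy stand-in for E(k, m). A real system would use a
    vetted cipher such as AES. Because it is a plain XOR, applying it
    twice with the same key returns the original data.
    """
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(d ^ s for d, s in zip(data, keystream))


def convergent_encrypt(message: bytes) -> tuple[bytes, bytes]:
    """Return (key, ciphertext), where key = H(m) and ciphertext = E(H(m), m)."""
    key = hashlib.sha256(message).digest()  # H(m); SHA-256 is an assumption
    return key, toy_cipher(key, message)


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Recover the message using the key the user kept from encryption."""
    return toy_cipher(key, ciphertext)


# Two users who hold the same file arrive at the same ciphertext,
# so the server only needs to store one copy.
alice_key, alice_ct = convergent_encrypt(b"the same file")
bob_key, bob_ct = convergent_encrypt(b"the same file")
print(alice_ct == bob_ct)
print(convergent_decrypt(bob_key, alice_ct))
```

Note that each user has to keep the key around: without the file itself, there is no way to recompute H(m).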
The security of a system can only be analyzed in the context of well-defined goals. I’d say that for end-to-end file encryption, I want to make sure that no one, including the storage service I’m using, knows the content of the files I’ve stored. I’ll accept some metadata leakage. Specifically, it’s okay with me if file sizes are leaked.
With that in mind, there are two attacks on convergent encryption: the “confirmation of file” attack and the “learn the remaining information” attack.
Confirmation of file
When I upload a file to a storage provider, they know the content of the file I’ve uploaded. If it’s encrypted, they know the ciphertext, but they shouldn’t know the plaintext.
With convergent encryption, the ciphertext for a given plaintext is always the same. Suppose I’ve obtained a bootleg copy of the upcoming film The Croods: A New Age. Let’s also suppose that the storage service I’m using has the same file. (Perhaps the MPAA asked them to be on the lookout for it.)
If the storage service and I both encrypt that file with convergent encryption, we’ll end up with the same result. When I upload my copy of the encrypted file, the storage provider will know that I’ve gotten an illegal sneak peek at the most anticipated film of 2020. This is called the “confirmation of file” attack.
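From the provider's side, the attack is just a comparison. The sketch below assumes the same toy convergent scheme (SHA-256 for the key, an XOR keystream instead of AES); `find_known_uploads` and the watch list are hypothetical names for illustration.

```python
import hashlib


def convergent_encrypt(plaintext: bytes) -> bytes:
    """Toy convergent encryption: key = SHA-256(plaintext), XOR keystream cipher.

    ASSUMPTION: the keystream cipher stands in for AES.
    """
    key = hashlib.sha256(plaintext).digest()
    keystream = hashlib.shake_256(key).digest(len(plaintext))
    return bytes(p ^ s for p, s in zip(plaintext, keystream))


def find_known_uploads(uploaded: bytes, watch_list: dict[str, bytes]) -> list[str]:
    """Return the names of watched plaintexts whose ciphertext matches the upload."""
    return [
        name
        for name, content in watch_list.items()
        if convergent_encrypt(content) == uploaded
    ]


# The provider holds a copy of the plaintext it is watching for.
watch_list = {"bootleg-film.mkv": b"...film bytes..."}

# My upload is encrypted, but it matches what the provider computes itself.
my_upload = convergent_encrypt(b"...film bytes...")
print(find_known_uploads(my_upload, watch_list))
```

The provider never decrypts anything. It only needs its own copy of the plaintext.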
Learn the remaining information
The “confirmation of file” attack was well understood in the 1990s when convergent encryption first emerged, but it wasn’t until 2008 that the “learn the remaining information” attack was described.[2]
Suppose I’ve rolled my own password manager, and I store my unencrypted passwords in a local Redis database.[3] I know that a malicious web page could potentially exfiltrate my local Redis database, so I’ve configured Redis to require a password. That’s the only change I made to the standard Redis configuration.
If I encrypt that Redis config file with convergent encryption and someone (e.g. my storage provider) obtains a copy of that encrypted file, they can perform an offline attack to determine my password. All they need to do is grab the original Redis config file and keep plugging in different passwords until they get a match.
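That guessing loop might look something like the sketch below. The `TEMPLATE` here is a made-up, shortened stand-in for the stock Redis config (only the `requirepass` directive is real), and the cipher is again a toy XOR keystream in place of AES.

```python
import hashlib
from typing import Iterable, Optional

# ASSUMPTION: a tiny hypothetical stand-in for the standard Redis config.
# The only part the attacker does not know is the password.
TEMPLATE = b"port 6379\nrequirepass {password}\n"


def convergent_encrypt(plaintext: bytes) -> bytes:
    """Toy convergent encryption: key = SHA-256(plaintext), XOR keystream cipher."""
    key = hashlib.sha256(plaintext).digest()
    keystream = hashlib.shake_256(key).digest(len(plaintext))
    return bytes(p ^ s for p, s in zip(plaintext, keystream))


def recover_password(target: bytes, guesses: Iterable[str]) -> Optional[str]:
    """Fill each guess into the template and compare ciphertexts offline."""
    for guess in guesses:
        candidate = TEMPLATE.replace(b"{password}", guess.encode())
        if convergent_encrypt(candidate) == target:
            return guess
    return None


# The victim's encrypted config, as seen by the storage provider.
stolen = convergent_encrypt(TEMPLATE.replace(b"{password}", b"hunter2"))

print(recover_password(stolen, ["123456", "password", "hunter2"]))
```

Every guess costs one hash and one encryption, with no server in the loop to rate-limit the attacker.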
This is much like an offline attack on a hashed password but way worse. Passwords are hashed with key derivation functions like scrypt or Argon2, which are intentionally resistant to offline attacks.[4] Passwords are also salted. A salt is a bit of random information that’s added to the password before hashing. If you and I both use the password “12345”,[5] it will hash to two different values after salting, rendering a lookup table useless.
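For comparison, here's roughly what salted password hashing looks like with `hashlib.scrypt` from Python's standard library. The cost parameters (`n=2**14, r=8, p=1`) and the 16-byte salt are my assumptions for the sketch, not a tuning recommendation, and `hash_password` is a hypothetical helper name.

```python
import hashlib
import os
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). A fresh random salt is generated if none is given."""
    if salt is None:
        salt = os.urandom(16)  # random per-password salt
    digest = hashlib.scrypt(
        password.encode(),
        salt=salt,
        n=2**14,  # CPU/memory cost (assumed value for illustration)
        r=8,
        p=1,
        dklen=32,
    )
    return salt, digest


# Same password, different salts: the digests don't match,
# so a precomputed lookup table is useless.
my_salt, my_digest = hash_password("12345")
your_salt, your_digest = hash_password("12345")
print(my_digest == your_digest)

# Verifying a login means re-hashing with the stored salt.
print(hash_password("12345", my_salt)[1] == my_digest)
```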
Convergent encryption, on the other hand, is typically designed to be fast and efficient, which makes an offline attack feasible. Further, convergent encryption would no longer be convergent if it used a salt, so attackers can use pregenerated lookup tables.
How to use convergent encryption securely
Sadly, the weaknesses of convergent encryption are baked in. If you want deduplication of encrypted data, you have to live with attackers being able to confirm the existence of a file and to learn a small amount of missing data with offline dictionary and brute force attacks.
However, if you only care about deduplicating a single user’s data, then each user can pick their own unique key. Similarly, if you care about deduplicating across a group of users, like a company, you can pick a group-wide key and do an extra layer of encryption before sending the file to a shared storage provider.
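One way to sketch the group-scoped variant is below. Note that this is a variation rather than the extra encryption layer described above: instead of encrypting twice, it mixes a group secret into the key derivation with HMAC, which similarly keeps outsiders from reproducing the ciphertext. The names are hypothetical and the cipher is a toy XOR keystream in place of AES.

```python
import hashlib
import hmac


def group_convergent_encrypt(group_secret: bytes, plaintext: bytes) -> tuple[bytes, bytes]:
    """Return (key, ciphertext), with the key bound to both the content and the group.

    ASSUMPTION: key = HMAC-SHA256(group_secret, SHA-256(plaintext)); the
    XOR keystream cipher is a toy stand-in for AES.
    """
    content_hash = hashlib.sha256(plaintext).digest()
    key = hmac.new(group_secret, content_hash, hashlib.sha256).digest()
    keystream = hashlib.shake_256(key).digest(len(plaintext))
    return key, bytes(p ^ s for p, s in zip(plaintext, keystream))


company_secret = b"shared only inside the company"
outsider_secret = b"some other secret"
document = b"quarterly report"

# Two employees converge on the same ciphertext, so dedup still works...
_, alice_ct = group_convergent_encrypt(company_secret, document)
_, bob_ct = group_convergent_encrypt(company_secret, document)
print(alice_ct == bob_ct)

# ...but someone without the group secret can't reproduce it to confirm the file.
_, outsider_ct = group_convergent_encrypt(outsider_secret, document)
print(outsider_ct == alice_ct)
```

Members of the group can still run both attacks against each other. The secret only shrinks the set of potential attackers. A single user deduplicating their own data is the same thing with a group of one.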
If you’re interested in a more thorough definition and analysis of convergent encryption, I recommend Message-Locked Encryption and Secure Deduplication by Bellare et al.
[1] To get a key of the right size and to avoid pitfalls around key reuse, it’s probably a good idea to use a key derivation function such as HKDF instead of just using the content hash directly.
[3] Don’t do any of this.
[4] This is achieved by using a lot of CPU and RAM and being hard to parallelize.