So what are my options then? I could put out an initial version that shows what I want to accomplish, then either have it reviewed heavily, or see if I can get someone who's more of an expert to give it a go based on those specifications. Here's the problems I have so far, which I know I'm out of my depth:
1) In order to keep the backup software's deduplication functionality (and to minimize the server trusting the client), I need to have the IV (initialization vector) be a computed value based on the file's contents (so that multiple copies of the same file on the clients will encrypt to the same contents going to the backend server). I know that _This will leak some data_ (i.e., you may not want an attacker to know that a given set of files are the same contents). So I plan on making this optional -- either the user gets to take advantage of dedup, or better encryption.
2) To do number 1 above, I was planning on taking a hash of the file, encrypting that with the users key, then using that as a predictable IV. You will still have a unique IV per unique file, but I don't know enough to see if this can leak any other data. It looks like RFC 5297 describes an approach similar to this, but I think it is for a different use case.
3) I need the backup server to know which version of an encryption key the client used. That is, if the client changes keys between backups, I need the server to instruct the client to do a full backup, not incremental. So I can either have the client provide a version number for the key (or if using a keyfile, use the datestamp of the file as a version string), or I can encrypt a given string using the key, and use a hash of the encrypted string as the key version number. (Note, in no case will I ever be storing a hash of the plaintext of the client's files on the backup server, as that too can leak information)
My apologies if the above makes any experts here cringe, but as I mentioned my constraint is to have same-content files encrypted to the same target contents (for dedup purposes), although I will give the user to turn that off and use a random IV for better security (and give up dedup).
1) In order to keep the backup software's deduplication functionality (and to minimize the server trusting the client), I need to have the IV (initialization vector) be a computed value based on the file's contents (so that multiple copies of the same file on the clients will encrypt to the same contents going to the backend server). I know that _This will leak some data_ (i.e., you may not want an attacker to know that a given set of files are the same contents). So I plan on making this optional -- either the user gets to take advantage of dedup, or better encryption.
2) To do number 1 above, I was planning on taking a hash of the file, encrypting that with the users key, then using that as a predictable IV. You will still have a unique IV per unique file, but I don't know enough to see if this can leak any other data. It looks like RFC 5297 describes an approach similar to this, but I think it is for a different use case.
3) I need the backup server to know which version of an encryption key the client used. That is, if the client changes keys between backups, I need the server to instruct the client to do a full backup, not incremental. So I can either have the client provide a version number for the key (or if using a keyfile, use the datestamp of the file as a version string), or I can encrypt a given string using the key, and use a hash of the encrypted string as the key version number. (Note, in no case will I ever be storing a hash of the plaintext of the client's files on the backup server, as that too can leak information)
My apologies if the above makes any experts here cringe, but as I mentioned my constraint is to have same-content files encrypted to the same target contents (for dedup purposes), although I will give the user to turn that off and use a random IV for better security (and give up dedup).