SpamAssassin now a collaborator?

8 posts Page 1 of 1
by Guest » Thu Dec 05, 2013 9:06 am
Digging through graymail this morning, just for grins I checked out a couple of message headers and in the "Content analysis details" list ran across an item I don't remember seeing before:

-0.2 SONIC_KNOWN_SENDER Message was sent from someone you likey have sent mail to

It doesn't seem at all likey (or likely either) that I have ever sent email to somebody who would respond with messages like "Free Trial to Stronger sexual life!" and "BulkMailing services with Ease". While I understand that I can probably fiddle with the value assigned by this filter, I don't understand where/how SpamAssassin would decide that I might ever have corresponded with these guys, unless it was having previously received spam from them.
by kgc » Thu Dec 05, 2013 9:52 am
Didn't take people very long to notice this. This is something that I've wanted to work on for a long time and only recently was able to put something together along with the recent upgrades to the outbound mail cluster. All recipient email addresses on messages sent by a user are fed through a one-way hash and then "fuzzed" before being placed in persistent storage. This new SA rule checks the sender of a message to see if it is from someone that you've corresponded with. I think there is a bug in the SA plugin I wrote and I'll work on tracking that down today. In the meantime, I've dropped the score to -0.1, which really should keep it from causing false negatives. If it works as expected and reduces false positives, we'll also extend it to the MX servers so messages from these known senders will be less likely to be rejected there as well.

We're confident that the hashing scheme we've come up with, along with the fuzzing before it is stored, is sufficient to prevent this data from being used to provide anyone with a list of all of the email addresses a user has corresponded with.
Kelsey Cummings
System Architect, Sonic.net, Inc.
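For readers wondering what "hash, then fuzz" might look like in practice, here is a minimal sketch. It is not Sonic's implementation: the thread does not say which hash is used or how the fuzzing works. The sketch assumes SHA-256 reduced to a 64-bit integer, a random offset as the "fuzz", and a made-up tolerance `FUZZ_DISTANCE`; all function names are hypothetical, and lowercasing the address is also an assumption.

```python
import hashlib
import secrets

# Hypothetical tolerance; the real fuzzing distance is not given in the thread.
FUZZ_DISTANCE = 1 << 16


def address_hash(address: str) -> int:
    """One-way hash of a normalized address, reduced to a 64-bit integer."""
    normalized = address.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    return int.from_bytes(digest[:8], "big")


def fuzz(value: int) -> int:
    """Shift the hash by a random offset in [-FUZZ_DISTANCE, FUZZ_DISTANCE]."""
    offset = secrets.randbelow(2 * FUZZ_DISTANCE + 1) - FUZZ_DISTANCE
    return value + offset


def remember_recipient(store: set, address: str) -> None:
    """Called on outbound mail: keep only the fuzzed hash, never the address."""
    store.add(fuzz(address_hash(address)))


def is_known_sender(store: set, sender: str) -> bool:
    """Called on inbound mail: is any stored value within the fuzz distance?"""
    target = address_hash(sender)
    return any(abs(stored - target) <= FUZZ_DISTANCE for stored in store)


sent_to = set()
remember_recipient(sent_to, "Friend@Example.com")
print(is_known_sender(sent_to, "friend@example.com"))
print(is_known_sender(sent_to, "stranger@example.net"))
```

Under these assumptions the stored number is never the exact hash, so the table cannot be matched by equality against a precomputed list; a lookup only succeeds for someone who already has a candidate address to test.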
by Guest » Thu Dec 05, 2013 11:09 am
kgc wrote:We're confident that the hashing scheme we've come up with, along with the fuzzing before it is stored, is sufficient to prevent this data from being used to provide anyone with a list of all of the email addresses a user has corresponded with.
Couldn't that data be used to create a list of all Sonic customers who have corresponded with a given address? That would still have privacy implications.
by kgc » Thu Dec 05, 2013 12:11 pm
Guest wrote:Couldn't that data be used to create a list of all Sonic customers who have corresponded with a given address? That would still have privacy implications.
Someone with the ability to compel us to surrender the entire database as it stands today could conceivably use it to find groups of users that were possibly sending email to the same address. It may be possible to mitigate this by seeding the hash on a per-user basis. The most that anyone with access to both the stored hashes for an individual user and the hash function would be able to do is take an address that they already knew and see if a stored hash was within the same fuzzing distance.

I'm trying to come up with a politic way of saying that I think the likelihood of anyone with the ability to compel us to surrender this database in its entirety not already knowing nearly everything in it is small.
Kelsey Cummings
System Architect, Sonic.net, Inc.
by kgc » Thu Dec 05, 2013 2:44 pm
Also, the bug was pretty easy to fix.
Kelsey Cummings
System Architect, Sonic.net, Inc.
by kgc » Fri Dec 06, 2013 11:46 am
As was seeding the hashes to make the hash for the same recipient address unique between users. This should prevent the database from being used to derive who had contacts in common.
Kelsey Cummings
System Architect, Sonic.net, Inc.
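A rough illustration of per-user seeding, again as a sketch rather than the actual Sonic code: the thread does not say how the seed is generated or mixed in. This version assumes a random 16-byte seed per user and uses HMAC-SHA256 as the keyed hash; `new_user_seed` and `seeded_address_hash` are hypothetical names.

```python
import hashlib
import hmac
import secrets


def new_user_seed() -> bytes:
    """Hypothetical: one random seed generated and stored per user."""
    return secrets.token_bytes(16)


def seeded_address_hash(user_seed: bytes, address: str) -> int:
    """Keyed one-way hash, so the same address hashes differently per user."""
    normalized = address.strip().lower().encode("utf-8")
    digest = hmac.new(user_seed, normalized, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


alice_seed = new_user_seed()
bob_seed = new_user_seed()
shared_contact = "friend@example.com"

alice_value = seeded_address_hash(alice_seed, shared_contact)
bob_value = seeded_address_hash(bob_seed, shared_contact)
print(alice_value == bob_value)
```

With independent seeds the two values differ except with negligible probability, so rows belonging to different users cannot be joined on the hash to find contacts in common.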
by lr » Thu Dec 12, 2013 10:28 pm
This is a really cool idea, and if carefully implemented (as Kelsey is obviously doing) should not cause any privacy problem.

Are you worried about the size of the table? I just checked my current mail spool (the one still stored on Sonic's imap server, not the one I sucked into my own server a year or so ago), and it has a little over 5000 sent mail messages. Multiply some number like that by the number of Sonic users, times a reasonably long hash code (64 bits? 160 bits?), and it's quite a bit of memory. But on a modern large machine, that's probably not a huge problem.

I'll try it in a few days.
Linda and Ralph and John
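The back-of-envelope estimate above can be written out as follows. The user count is purely hypothetical (the thread gives no figure), and the result counts only raw hash bytes, ignoring row overhead and indexes, and treating every sent message as a distinct recipient, so it is an upper-bound style guess.

```python
def table_bytes(users: int, records_per_user: int, hash_bits: int) -> int:
    """Raw hash storage only: users x records per user x bytes per hash."""
    return users * records_per_user * (hash_bits // 8)


# Hypothetical user count, chosen only to make the arithmetic concrete.
HYPOTHETICAL_USERS = 100_000
RECORDS_PER_USER = 5000  # "a little over 5000 sent mail messages"

for bits in (64, 160):
    size = table_bytes(HYPOTHETICAL_USERS, RECORDS_PER_USER, bits)
    print(f"{bits}-bit hashes: {size / 1e9:.1f} GB")
# prints:
# 64-bit hashes: 4.0 GB
# 160-bit hashes: 10.0 GB
```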
by kgc » Fri Dec 13, 2013 10:41 am
Ralph, I'm not really worried about the size of the table at this time. After running for a week it's only 129MB - if you send to the same address again the record is just touched to update a timestamp, and my assumption is that most users are likely to send mail to a relatively small set of users - after the top 600 users the number of records per user is down to less than 100.

And, honestly, it'd be fun to have to scale this out as a big-data problem, but for now it is perfectly happy in innodb on mysql.
Kelsey Cummings
System Architect, Sonic.net, Inc.
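The "touch the record instead of adding a new one" behavior could look roughly like this. The table and column names are invented for illustration, and the sketch uses Python's built-in `sqlite3` so it runs anywhere; on MySQL/InnoDB the same effect is usually a single `INSERT ... ON DUPLICATE KEY UPDATE` statement. Nothing here is taken from the real schema.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE known_recipient (      -- hypothetical table name
        user_id   INTEGER NOT NULL,
        addr_hash INTEGER NOT NULL,     -- fuzzed hash, never the address
        last_seen INTEGER NOT NULL,     -- Unix timestamp of the latest send
        PRIMARY KEY (user_id, addr_hash)
    )
    """
)


def record_send(conn, user_id, addr_hash, now=None):
    """Update the timestamp if the row exists, otherwise insert a new row."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "UPDATE known_recipient SET last_seen = ? "
        "WHERE user_id = ? AND addr_hash = ?",
        (now, user_id, addr_hash),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO known_recipient (user_id, addr_hash, last_seen) "
            "VALUES (?, ?, ?)",
            (user_id, addr_hash, now),
        )
    conn.commit()


record_send(conn, 1, 12345, now=100)
record_send(conn, 1, 12345, now=200)  # same recipient again: row is touched
print(conn.execute("SELECT COUNT(*), MAX(last_seen) FROM known_recipient").fetchone())
# prints: (1, 200)
```

Because repeat sends only update `last_seen`, the table grows with the number of distinct correspondents per user, not with the number of messages sent.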