In my previous two posts about tracking on the internet I explained how I wanted to remove my own ability to use someone's email to track them. I finally did the work to switch away from email based authentication and move back to usernames+passwords. At the same time I also figured out how to use bcrypt(email) to hash people's email addresses, but still allow them to do password resets. So far everything works, but by doing this work I've uncovered some contradictions that I have to address.
I have zero desire to store information about people or track them. For a small business like mine there isn't much upside to having a lot of tracking information about someone, and a lot of downside if I ever get sued after a breach. I also feel that people are genuinely scared of giving out email addresses and exposing their internet behavior, so I decided to see if I could just not even have that information at all.
My first step toward doing this was switching from an email based login (called a "magic link") to a regular username+password system. I then used bcrypt to store the user's email just like I store their password. The rationale is that if this is considered secure enough to store a password, then it should be secure enough to store an email. By hashing their email I can keep it for later uses, but I basically need their permission each time I want to use it, because they have to give me the email again before I can email them.
Why not just get rid of emails entirely though? One thing I learned from running a magic link based system for years is that people suck at passwords and email. You send them an email and they can't find it, it goes to spam, it gets blocked, anti-virus clicks on everything, or they don't even know their own email addresses. Passwords are just as hard for people with the main problem being they constantly forget them (because they don't use password managers). I eliminate the problem of email complexity by simply hashing the emails and not using them, but what if someone forgets their password? They will, and when they do I'd need some kind of proof that the person asking for a reset over email is the actual owner of the account.
The best compromise turns out to be storing the email hashed, and then when they forget their password they can go to a password reset form. At that point they can temporarily give me their username and email, I look their account up, use the bcrypt compare function to confirm the email is the right one for the account, and then send a reset code to that email address. As long as I don't try to store this email address then it's only stored hashed, but they can still reset their password.
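A minimal sketch of that reset flow, under some loud assumptions: it uses PBKDF2 from Python's standard library as a stand-in for bcrypt so it runs without extra packages, and the names `hash_email`, `email_matches`, `request_reset`, and the `accounts` dictionary are all hypothetical. Lowercasing the email before hashing is also an assumption, not something the real system is known to do.

```python
import hashlib
import hmac
import os
import secrets

ITERATIONS = 100_000  # assumed work factor for the stand-in hash


def _digest(email: str, salt: bytes) -> bytes:
    # Normalizing the email is an assumption made for this sketch.
    return hashlib.pbkdf2_hmac("sha256", email.strip().lower().encode(), salt, ITERATIONS)


def hash_email(email: str) -> str:
    """Hash an email with a fresh random salt and store the salt with the hash."""
    salt = os.urandom(16)
    return salt.hex() + "$" + _digest(email, salt).hex()


def email_matches(email: str, stored: str) -> bool:
    """Recover the salt from the stored value and re-hash the candidate email."""
    salt_hex, digest_hex = stored.split("$")
    candidate = _digest(email, bytes.fromhex(salt_hex)).hex()
    return hmac.compare_digest(candidate, digest_hex)


def request_reset(accounts: dict, username: str, email: str):
    """Return a one-time reset code if the email matches the account, else None."""
    stored = accounts.get(username)
    if stored is None or not email_matches(email, stored):
        return None
    # The plain email is only needed here to send the code; it is never saved.
    return secrets.token_urlsafe(16)


# Hypothetical account table: username -> hashed email.
accounts = {"alice": hash_email("alice@example.com")}
```

Calling `request_reset(accounts, "alice", "alice@example.com")` returns a code to email out, while a wrong email or an unknown username returns `None`.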
This all works, but there's some complexity with doing this.
I originally made the mistake of thinking it would be possible to search for people's emails by hashing with bcrypt(email) and then searching for that hash. My mistake, which I knew but somehow forgot, was that the salt for the bcrypt hash is created randomly for each stored hash. The bcrypt system stores the original salt with the hash, so when you give bcrypt a password to compare, it recovers the salt, hashes the given password with that salt, and compares the result to the stored hash. This means that when I get an email and hash it, I get a completely different hash unless I use the same salt for every email in the system.
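The effect of the random salt can be shown in a few lines. This is only an illustration: it uses PBKDF2 from the standard library in place of bcrypt, and `salted_hash` is a hypothetical helper, but the salt behaves the same way.

```python
import hashlib
import os
from typing import Optional, Tuple


def salted_hash(value: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a new random salt is drawn unless one is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", value.encode(), salt, 100_000)
    return salt, digest


email = "alice@example.com"
salt_a, hash_a = salted_hash(email)
salt_b, hash_b = salted_hash(email)

# Same email, two random salts: the digests differ, so searching by hash fails.
hashes_differ = hash_a != hash_b

# Re-using the stored salt reproduces the stored digest, which is why compare works.
_, recomputed = salted_hash(email, salt_a)
compare_succeeds = recomputed == hash_a
```

So a compare against one known account works, but there is nothing stable to put in a search query.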
I was about to do this as my original plan, but then I backed off because I don't know the implications of using the same salt for every email hash. I would need to store this salt secretly and treat it like a private key, and then I'd have to never lose that salt or else everyone's accounts are broken. What's the motivation for being able to find people by email hashes? If I use traditional bcrypt with a random salt then the only way to find someone's account is with their username. People frequently also forget their username, so those people would be out of luck unless I can take their email and search for their account as well.
But, if I'm storing a single salt for everyone, then I might as well just use a plain SHA-256, HMAC, or simpler scheme. At least I think that's the implication of using one salt. Either way, I decided to keep it simple and just do plain old normal bcrypt and see if it's really a problem for people who lose their usernames. I also make sure to send them an initial email telling them their username so they have a reference just in case they forget. They can at least search through their inbox for that first email and find their username.
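For reference, the single-secret alternative that I decided against would look roughly like the sketch below. It uses HMAC-SHA-256 with one site-wide key so the same email always produces the same lookup value. `SITE_KEY`, `email_lookup_key`, `find_username`, and the `index` dictionary are hypothetical names, and the lowercasing is an assumption.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical site-wide secret. It would have to be loaded from protected
# configuration, and losing it would make every stored lookup key useless.
SITE_KEY = b"replace-with-a-long-random-secret"


def email_lookup_key(email: str, key: bytes = SITE_KEY) -> str:
    """Deterministic keyed hash: the same email always maps to the same value."""
    normalized = email.strip().lower().encode()
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Index from lookup key to username, so a forgotten username can be found.
index = {email_lookup_key("alice@example.com"): "alice"}


def find_username(email: str) -> Optional[str]:
    """Look up a username from an email without storing the email itself."""
    return index.get(email_lookup_key(email))
```

The trade-off is the one described above: the key has to be guarded like a private key, and anyone who gets both the key and the database can test guessed emails very quickly.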
With email addresses gone from the system I had to look at other places I collected that information. I simply got rid of the mailing list system I was using, as it wasn't very good anyway. Then I looked hard at the comments system I'm using. I use commento for comments and questions at the bottom of an exercise or blog post, and that system collects emails as the logins. Now I have to replace commento with my own commenting system, which isn't really a bad thing, but it got me thinking.
People are rather confused about how they want to live their online lives. It seems they want to be publicly involved in a massive online community like Facebook and Twitter, but also be completely anonymous, and all for free. In my case, people would want to have me store zero cookies and emails, but also have an email notification when someone replies to their questions. I personally want to know when there's been an update to a thread in a forum, or my questions are answered, or a new person messages me on a website I'm using. But, you can't get notifications off a website without giving that website some way to contact you outside of it.
The solution is again one of compromises and consent. On my main website I'll have a replacement for the commento system that uses my bcrypt(email) scheme and does not email anyone when something's been answered. Instead they'll only be notified when they're actively on the site and that's it. This is so people who want extensive privacy can still ask questions and get help with the courses.
For anyone who wants to interact with a community and get notifications to their emails there will be a forum using NodeBB that will ask for emails, but gives you control over how you are contacted. NodeBB needs your email to tell you when you have new messages, but if you want privacy then you simply don't have to join the forum and talk to other people. You can hang out only on the main site and talk to people there in the smaller comments system.
Lastly, the NodeBB system will be separate from the main website so that if it's compromised then only that system's stored emails are ever leaked. As soon as I can I'll try to go in and add the same kind of bcrypt(email) obfuscation to NodeBB and see if that works. My initial analysis says doing that would break a lot of the functionality in NodeBB but we'll see.
The other problem comes from Invoices and Receipts, which I will call Invoices because nobody can seem to agree on which one they actually want. (Hint: An invoice comes before the purchase to demand payment. A receipt comes after a purchase to show the past payment.) With an Invoice you have a plethora of incredibly invasive information that I do not want to store. To make a proper invoice you need:
This is for both the customer and the seller. Every time someone buys my course they get to know my business address, and I get to know where they live, and that's scary. I definitely do not want to store any of that information, but people also need invoices about 10% of the time. In fact, I might be so bold as to claim that invoice requirements are the number one way that people's privacy is violated.
This problem I'm not too sure how to solve. I believe I just have to email them an Invoice right when they buy the course and that's it. Just attach a PDF and blast it out. However, to generate that invoice I now have to ask them for their private information, which is a big problem. I could only send invoices on demand and ask them for that information only when I generate the invoice, but then I'm not sure if that's some kind of legal violation. Like, do I have to collect all their information at the point of sale before I send it?
Finally, I have no idea why I'm required to ask them for all their information for an invoice, but when they go to Best Buy to get a video game for their Nintendo Switch they don't have to give any. This is the difference between an invoice and a receipt. I believe for an invoice I only need your information so I know where to send it for you to pay it...because it's before you pay for something. For a receipt I don't need to know anything about you, only that you bought a certain thing for a certain price and that's good enough. If you can buy food at a restaurant and get reimbursed by your boss then why does an online course suddenly require you to blast out all your personal information?
I ran a quick poll on Twitter and the main reason for the PDF attachment is so people can then upload it to their company's ancient accounting system. In that case what I'll probably do is:
That should cover most situations, and I'll have to be careful to not store anything if I can. The PDF will be tough because I'm pretty sure I have to write it to disk as I send it, but I'll see if I can work it so that's only in RAM and then dies after send.
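Keeping the PDF in RAM looks doable at least on the sending side. Here is a rough sketch using Python's standard `email` package, where the attachment is built from a `bytes` value and never touches a file. The function name `build_invoice_email`, the addresses, and the placeholder PDF bytes are all made up for illustration, and generating the real PDF is out of scope here.

```python
from email.message import EmailMessage


def build_invoice_email(pdf_bytes: bytes, to_addr: str, from_addr: str) -> EmailMessage:
    """Build an email with the PDF attached directly from memory."""
    msg = EmailMessage()
    msg["Subject"] = "Your invoice"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content("Your invoice is attached as a PDF.")
    # The attachment is created from the in-memory bytes; no temp file is written.
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename="invoice.pdf"
    )
    return msg


# Placeholder bytes standing in for a real generated PDF.
fake_pdf = b"%PDF-1.4\n% placeholder\n"
message = build_invoice_email(fake_pdf, "customer@example.com", "shop@example.com")
```

The message could then be handed to `smtplib.SMTP.send_message`. Once `fake_pdf` and `message` go out of scope nothing is left in my code, although the mail server relaying the message may still queue it on its own disk, which this sketch doesn't address.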
We also have the problem that people want their crazy fast WebTorrent downloads, but don't want to share their IP addresses with other people. Again we have a contradiction because WebTorrent does speed up downloads considerably, especially on a LAN. But, to get that fast download you have to connect with a peer computer, thus giving them your IP Address. The WebTorrent downloads really are way better for usability, so I'd like to keep them for people.
Currently you either share with others or you don't get the video. That's mostly just to get the feature working without having the backup system that alternatively downloads it from my closed servers. Probably I'll go with a simple HTTP only download for people who don't want to share, and WebTorrent for people who don't care.
One massive hole in avoiding storing information about people is logs. I have to log debugging information to figure out if the application is working correctly, and that almost always has information submitted from the user. There's also the problem of security, and without solid logs it's difficult to secure the site.
I already have good discipline about when I use debug/info/warn/error logging, so the main thing I need to do is go through the code to find places I'm leaking customer information. I'll leave that information in the debug logging since I need that during development, and then on deployment disable debug logging so that there's only the essentials in the logs.
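As a small illustration of that plan, here is a sketch with Python's standard `logging` module where customer details only ever appear at debug level and the deployed configuration drops debug messages. The logger name `shop`, the `configure_logging` function, and the in-memory `ListHandler` are hypothetical and exist only to make the behavior easy to see.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def configure_logging(production: bool, handler: logging.Handler) -> logging.Logger:
    logger = logging.getLogger("shop")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    # In production only INFO and above get through; DEBUG is dropped.
    logger.setLevel(logging.INFO if production else logging.DEBUG)
    return logger


handler = ListHandler()
log = configure_logging(production=True, handler=handler)
log.debug("reset requested for email=%s", "alice@example.com")  # customer data
log.info("reset requested")  # essentials only
```

With `production=True` the handler only ever sees the second message. This depends on the discipline of never putting customer data in info, warn, or error calls, so the code audit is still needed.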
That's the progress of the goal of removing as much identifying information as I possibly can. I think that with obfuscated emails, bitcoin payments, and optional WebTorrent/HTTP downloads, it'd be pretty safe for someone to feed their paranoia on my site and never give me any information about them. We'll see how this works out in practice once I finally get this all online. Soon. I promise.