'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved. Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.
You would lose the bet. The product supported IPv6 from day one.
That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.
> 'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved.
What exactly do you mean by "cannot be resolved"?
> Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.
But why check for it? Is that actually a common mistake people make? An attacker could change the address after you checked it, so it's not going to help against attackers, is it?
> You would lose the bet. The product supported IPv6 from day one.
Good for you! :-)
> That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.
Complex? Certainly less so than implementing validation yourself.
As for scalability: well, yeah, as described that's more the setup for a company operating various different services, none of which has a high volume of outbound email. That describes most companies; even the best startups don't get ten million signups per day and don't send much other email, and even that volume should still be manageable with a single server.
But that's trivial to adapt without changing the general approach. Most obviously, you could just add more relay servers and have client machines pick one at random; that scales linearly.

But if you really need to move massive amounts of email for one service, so that the extra relay hop per message actually adds up to a noticeable cost, you can still use the same approach: put the MTA onto the same machine(s) the service is running on, inside its own network namespace (assuming Linux; analogous technology exists on other platforms), and firewall it off there so it cannot connect to your internal network. Potentially you can even just add blackhole routes for your internal networks/RFC1918 ranges, so you wouldn't even need a stateful packet filter (though currently you might still need one due to the IPv4 address shortage).
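A minimal sketch of that isolation on Linux (the namespace name, interface names, and addresses are placeholders; real provisioning would live in your deployment tooling, and the host-side NAT/routing is omitted):

```shell
# Create a namespace for the MTA and give it its own interface pair.
ip netns add mta
ip link add veth-mta type veth peer name veth-host
ip link set veth-mta netns mta
ip netns exec mta ip addr add 203.0.113.2/30 dev veth-mta   # placeholder address
ip netns exec mta ip link set veth-mta up

# Blackhole the RFC1918 ranges inside the namespace, so the MTA cannot
# even attempt to connect to the internal network -- no packet filter needed.
ip netns exec mta ip route add blackhole 10.0.0.0/8
ip netns exec mta ip route add blackhole 172.16.0.0/12
ip netns exec mta ip route add blackhole 192.168.0.0/16

# Everything else goes out via the host side.
ip netns exec mta ip route add default via 203.0.113.1
```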
Why assume email addresses only get checked in one place, and not all?
Ten million a day was a milestone. I left that company over a year ago; it would astonish me to find that figure now exceeded by less than a factor of twenty. Granted these are mostly not signups. They are outgoing emails nonetheless, which makes the case germane despite that superficial distinction.
Your proposed solution sounds pretty expensive in ops resource, to no obviously greater benefit than the rather simple (well under one dev-day) option we chose. You seem to feel yours is strongly preferable, but I still don't understand why.
NXDOMAIN can be a temporary error. SMTP's store-and-forward queuing is designed to be resilient against DNS failures, internet outages, routing problems, and temporary mail delivery issues.
Unless some DNS server is broken, it actually cannot. NXDOMAIN is an authoritative answer telling you that the domain positively does not exist. It's not to be confused with SERVFAIL, which you should get if the DNS resolver ran into a timeout or got an unintelligible response or whatever; NXDOMAIN should only occur if the authoritative nameserver of a parent zone explicitly says "I neither know this zone locally nor have a delegation for it".
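The distinction can be sketched as a mapping from DNS response codes (numeric values per RFC 1035) to what an address validator should do; the action names here are just illustrative:

```python
# RCODE values from RFC 1035, section 4.1.1.
NOERROR, SERVFAIL, NXDOMAIN = 0, 2, 3

def action_for_rcode(rcode: int) -> str:
    """Decide how an address validator should treat a DNS reply."""
    if rcode == NXDOMAIN:
        return "reject"   # authoritative: the domain does not exist
    if rcode == SERVFAIL:
        return "retry"    # resolver/upstream trouble: treat as temporary
    if rcode == NOERROR:
        return "accept"
    return "retry"        # anything else: err on the side of retrying
```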
OK, that at least shouldn't reject any valid addresses, so maybe ...
> Why assume email addresses only get checked in one place, and not all?
It's not an assumption, it's just a matter of simplicity and reliability.
> Ten million a day was a milestone. I left that company over a year ago; it would astonish me to find that figure now exceeded by less than a factor of twenty. Granted these are mostly not signups. They are outgoing emails nonetheless, which makes the case germane despite that superficial distinction.
Well, no clue how well common MTAs would cope with that, but 2,500 mails per second could still be within the power of a single machine if it's designed for high performance. Regardless, I don't think that really matters: if you have a relatively low volume of email, it's probably most efficient and secure to handle it all with one central outbound relay; if you need to send lots of email, it obviously makes sense to distribute the load, but that shouldn't otherwise change the strategy.
> Your proposed solution sounds pretty expensive in ops resource, to no obviously greater benefit than the rather simple (well under one dev-day) option we chose. You seem to feel yours is strongly preferable, but I still don't understand why.
What sounds expensive about it?
Whether there is any benefit to it: well, that depends on your goals and on what exactly your solution actually does. I still don't understand why (or even how exactly) you do those RFC1918 checks, for example. It seems to be mostly a security measure? But then it's not actually secure; it's essentially a race condition, a classic TOCTTOU problem. Plus, it might even break valid email addresses.
Essentially, there are four reasons why I think just delegating the validation of email addresses to the mail server is the best strategy:
1. Implementing your own checks risks introducing additional mistakes (which might lead to the rejection of valid addresses).
2. Implementing your own checks is additional work when you could just use the MTA which already knows how to do this (and which you have to install/configure/use anyway as soon as you want to actually use the email address), both for the initial implementation, and possibly for subsequent maintenance (if you just let your MTA do the work, only the MTA needs to be adapted to any changes in how emails get delivered, abstracting away the problem for any software that's supposed to be sending emails and isolating it from the lower layers).
3. My approach actually gives you the perfect result, in that it does not reject any valid addresses (assuming your MTA implements the RFCs correctly) while at the same time being perfectly secure against all possible abuses with weird addresses, which is impossible to achieve when you separate the check from the actual abuse scenario. And it's kind of trivial to see that this is the case.
4. You have to have the infrastructure to deal with bounces anyhow, both because you ultimately cannot be sure an address actually exists until you have successfully delivered an email to it, and because addresses that once existed might stop existing later, so it's not like better validation up front lets you avoid that.
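That delegation can be sketched by handing the recipient to your outbound relay at submission time and treating the MTA's synchronous reply as the verdict (the relay host, the probe sender address, and the helper names are assumptions for illustration, not anyone's production code):

```python
import smtplib

def classify_reply(code: int) -> str:
    """Map an SMTP reply code to a validation outcome."""
    if 200 <= code < 300:
        return "accept"
    if 400 <= code < 500:
        return "retry"    # temporary failure: queue and try again later
    return "reject"       # permanent failure, e.g. 550 user unknown

def relay_verdict(address: str, relay: str = "localhost") -> str:
    """Ask the local outbound relay whether it will take the recipient."""
    with smtplib.SMTP(relay) as smtp:
        smtp.ehlo()
        smtp.mail("probe@example.com")   # placeholder sender
        code, _ = smtp.rcpt(address)
        smtp.rset()                       # abandon the transaction
        return classify_reply(code)
```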
As for "ops resources": Assuming that any service that needs to send > 10 million mails per day will be deploying machines automatically anyway, what's so much more expensive about deploying the configuration of an additional network namespace? Writing that script certainly shouldn't be more than a day of work either, should it?
I may have erred in giving the impression that the application-level checks are the only line of defense here. They're not. The (bespoke) MTA underlying this product performs most if not all of these checks as well. I didn't really spend any time on that side of the business, so I might be wrong about that, but it would be something of a surprise. I do know our analytics needed to be able to cope usefully with an astonishing panoply of bogosity warnings that came back from the MTA, but I no longer recall exactly what they covered. And, in any case, it's nice when you can tell the user "hey, this isn't deliverable" before it gets to the point of a bounce.
Checking whether an email address's domain-part is an RFC1918 IP is actually pretty easy. Split the address by '@'. The last piece is the domain part. Split it by '.'. If there are four pieces, all of which meaningfully cast to integers, treat it as an IP address. (Otherwise it's a domain name, which is fine as long as it has more than one part and isn't NXDOMAIN when the backend tries to resolve it.) If any part is negative or greater than 255, it's invalid. If it starts with [10] or [192, 168], or if it starts with [172] and the second part is between 16 and 31 inclusive, it's an RFC1918 address. Otherwise, it's fine.
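A sketch of that check in Python (the function and label names are mine, not from the codebase being described):

```python
def classify_domain_part(address: str) -> str:
    """Classify an address's domain part per the rules described above."""
    domain = address.rsplit("@", 1)[-1]
    parts = domain.split(".")
    if len(parts) != 4:
        return "hostname"          # not a dotted quad; resolve it instead
    try:
        octets = [int(p) for p in parts]
    except ValueError:
        return "hostname"          # e.g. "1.2.3.example"
    if any(o < 0 or o > 255 for o in octets):
        return "invalid"
    a, b = octets[0], octets[1]
    if a == 10 or (a == 192 and b == 168) or (a == 172 and 16 <= b <= 31):
        return "rfc1918"
    return "ok"
```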
Even with unit tests, that takes almost no time to write, and when your frontend and backend share a language as ours did, you can use the same logic both places. Node gives you a name resolver binding for free. I really can't imagine it being as quick to write, test, validate, and roll out a change to the MTA node manifest.
We were pretty sure it would already be impossible, or nearly so, for a malicious user to probe our infrastructure this way, but when it's so simple to be even more sure, why not?
Similarly, we'd already observed a low but nonzero rate of users inadvertently providing such addresses - not during signup or onboarding so much, but in recipient lists they submitted. Since we used the same recipient checking code everywhere, why not cut that back to zero, too?