“The mail servers are working fine.” – Me, to myself.
The mail servers were not working fine.
As I am not a normal person, I run my own mail servers. These servers handle email for a couple of domains and are used by myself and a couple of family members.
I’ve been running a mail server setup of my own since 2006, and I also have responsibility for various mail servers at my job, so I like to at least pretend that I have a pretty reasonable grasp on how to run a proper mail server setup.
As an aside, I do know that there is a common piece of advice out there to anyone thinking of running their own mail server; that advice is “don’t”. But to the people who share that advice I say “boooo, you’re no fun”.
I got a report from one of the aforementioned users of my server, that they weren’t receiving any emails from some friends of theirs who have a hotmail.com email address.
As anyone who is brave enough to run their own mail server will tell you, the usual problem you face when it comes to mail delivery is convincing the big mail providers that the mails you are sending are not spam. Mitigating that usually involves making sure your reverse DNS records are correct, setting up SPF and DKIM signing, and making sure that the IP address you got assigned for your server is not on any of the many block lists out there.
Difficulties with receiving mail are a lot rarer.
I checked that the servers were running correctly and that mail was flowing through them as it should. Everything checked out. So then it was time to look into the logs and try to figure out what was actually going on. Were the mails being blocked by a spam filter rule? Was there some misconfiguration on the sending server that was causing it to run afoul of some of the other checks that postfix was making?
Looking into the logs I found… nothing.
There was no evidence in the logs that the servers had received any email from that address, and even more than that, there was no evidence that Microsoft’s email servers had even attempted to contact my mail servers.
A little bit of backstory
A couple of months ago I migrated to a new mail server setup. This setup moved from the single server that I was using before to a shiny new dual server setup. Yay redundancy!
Both servers are running as MX servers, feeding mail into a Dovecot master/master dsync replication setup. The authentication backend uses a replicated PostgreSQL database. It now sports full IPv4/IPv6 dual stack networking and the appropriate AAAA DNS records have been added. DNS is now being served through PowerDNS (also replicated across both servers through the PostgreSQL database), using my very own DNS UI for management. Let’s Encrypt SSL certificates used by the servers are now being obtained through a PowerDNS certbot plugin. And all the server configuration is managed through my Salt policy server.
Like I said, shiny.
After the migration, everything seemed to be working smoothly for the most part, but it looked like it was since that move that we had no longer been able to receive mails from hotmail.com/outlook.com.
Begrudgingly, I signed up for a hotmail account so that I could investigate further. I sent an email to myself while watching the server logs, and sure enough, there was nothing at all – no sign that a TCP connection had even been established with the SMTP server.
There was no immediate sign of the failed delivery on the hotmail side either. I would have to wait a day or so before that situation would change, so I looked into other tests in the meantime.
I sent a mail in the other direction (from my server to hotmail) and that arrived immediately, though it did get sent into the spam folder. After dutifully clicking “this is not spam”, I had a look at the headers of the received message. I didn’t find much, but I did find this:
Authentication-Results: spf=temperror (sender IP is x.x.x.x)
 smtp.mailfrom=example.com; hotmail.com; dkim=timeout (key query timeout)
 header.d=example.com;hotmail.com; dmarc=temperror action=none
 header.from=example.com;compauth=pass reason=105
Received-SPF: TempError (protection.outlook.com: error in processing during
 lookup of example.com: DNS Timeout)
That at least seems to imply that the Outlook SMTP servers are having trouble talking to my DNS servers and thus SPF and DKIM checks are failing, so I start looking deeper into that (maybe it could also explain the incoming email issues…), checking that both IPv4 and IPv6 connections to DNS are working (they are), making sure there are no firewall rules that could be blocking it (there aren’t), and checking that they’re responding in a timely manner (they are).
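When comparing headers like this across several test mails, it helps to pull out just the verdicts. Here is a minimal sketch of that in Python; the function name parse_auth_results is my own invention for illustration, and the regex only looks for the three methods discussed here rather than implementing the full Authentication-Results grammar.

```python
import re

def parse_auth_results(header: str) -> dict:
    """Return a mapping such as {"spf": "temperror", "dkim": "timeout"}."""
    results = {}
    # Each verdict appears as "<method>=<result>". Only spf, dkim and dmarc
    # are picked up; everything else in the header is ignored.
    for method, result in re.findall(r"\b(spf|dkim|dmarc)=(\w+)", header):
        results[method] = result.lower()
    return results

# The header value from the first test mail, as shown in the post.
header = (
    "spf=temperror (sender IP is x.x.x.x) smtp.mailfrom=example.com; "
    "hotmail.com; dkim=timeout (key query timeout) header.d=example.com;"
    "hotmail.com; dmarc=temperror action=none header.from=example.com;"
    "compauth=pass reason=105"
)

verdicts = parse_auth_results(header)
print(verdicts)
```

Anything reporting temperror or timeout points at a lookup problem on the receiving side rather than at a record that is simply wrong.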
Some web searching regarding this error leads me to a site for testing DKIM validation, and running that shows that my servers are in fact failing SPF HELO checks, because I have no SPF records specifically set on the individual server subdomain names (only on the apex domain). I fix that, and next time I send a test mail it actually works a bit better:
Authentication-Results: spf=temperror (sender IP is x.x.x.x)
 smtp.mailfrom=example.com; hotmail.com; dkim=pass (signature was verified)
 header.d=example.com;hotmail.com; dmarc=pass action=none
 header.from=example.com;compauth=pass reason=100
Received-SPF: TempError (protection.outlook.com: error in processing during
 lookup of example.com: DNS Timeout)
This time the DKIM check succeeded, but the SPF check still failed. To be honest, I don’t think the change I made actually made any difference at all in this case (but good to have that fixed at least), and the Outlook mail servers just seem to have some DNS issues.
Another thing I tried (also discovered through some careful web searching) was the Microsoft Remote Connectivity Analyzer Inbound SMTP Email test. It said that there was nothing wrong with the connection to my mail server, although the tool’s attempt to send a test email did fail because their test tool doesn’t have a reverse IP configured…
With no other leads to follow, I left it for the night and waited to see if I would eventually get a delivery failure message.
Error messages are your friends
A little over 24 hours since I sent the first test message from hotmail I finally got some feedback in the form of a delivery status message:
Delivery has failed to these recipients or groups:
email@example.com (firstname.lastname@example.org)
Your message wasn't delivered. Despite repeated attempts to deliver your message, a connection to the remote server couldn't be made.
[...]
Diagnostic information for administrators:
Generating server: AM7EUR06HT200.mail.protection.outlook.com
Receiving server: AM7EUR06HT200.eop-eur06.prod.protection.outlook.com
email@example.com
11/20/2020 10:49:11 PM - Server at AM7EUR06HT200.eop-eur06.prod.protection.outlook.com returned '550 5.4.317 Message expired, cannot connect to remote server(451 4.4.8 MX hosts of 'example.com' failed MTA-STS validation.)'
11/20/2020 10:39:11 PM - Server at example.com (0.0.0.0) returned '450 4.4.317 Cannot connect to remote server [Message=451 4.4.8 MX hosts of 'example.com' failed MTA-STS validation.] [LastAttemptedServerName=example.com] [AM7EUR06FT049.eop-eur06.prod.protection.outlook.com](451 4.4.8 MX hosts of 'example.com' failed MTA-STS validation.)'
This was the information I needed. According to the Outlook mail servers, MTA-STS validation was failing for my MX servers.
MTA-STS is “a mechanism enabling mail service providers (SPs) to declare their ability to receive Transport Layer Security (TLS) secure SMTP connections and to specify whether sending SMTP servers should refuse to deliver to MX hosts that do not offer TLS with a trusted server certificate”.
In other words it’s a way to say to a connecting mail server “if I don’t identify myself correctly with an SSL certificate matching the MX records in my MTA-STS policy, abort your delivery attempt”.
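To make that a bit more concrete, here is a rough sketch of the hostname side of the check a sending server performs: is the MX host it is about to talk to covered by one of the mx entries in the policy? The function name mx_allowed and the hostnames are made up for illustration. The wildcard handling follows my reading of the MTA-STS spec (RFC 8461), where a pattern like *.example.net covers exactly one extra label; real senders also validate the certificate, which this leaves out.

```python
def mx_allowed(mx_host: str, patterns: list) -> bool:
    """Check an MX hostname against the mx patterns of an MTA-STS policy."""
    host = mx_host.rstrip(".").lower()
    for pattern in patterns:
        pattern = pattern.rstrip(".").lower()
        if pattern.startswith("*."):
            # A wildcard stands in for exactly one leftmost label.
            parent = host.partition(".")[2]
            if parent and parent == pattern[2:]:
                return True
        elif host == pattern:
            return True
    return False

# Hypothetical policy entries and hostnames, not taken from my setup.
policy_mx = ["mail1.example.net", "*.backup.example.net"]

print(mx_allowed("mail1.example.net.", policy_mx))       # exact match
print(mx_allowed("a.backup.example.net", policy_mx))     # wildcard match
print(mx_allowed("other.example.net", policy_mx))        # not listed
```

The important part for the rest of this story is the last case: a host that is perfectly reachable but not named in the policy gets rejected when the policy mode is enforce.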
The fact that this check was failing seemed odd as the hardenize.com test for MTA-STS was indicating everything to be ok, and no other mail services seemed to be having the same problem connecting to my servers. However, I had a hunch about what could be the problem, inspired by having looked into the SPF problem.
Since the migration from the old single server setup, the MX record for my domain points to 2 addresses, mxi1.example.com and mxi2.example.com. However, for various reasons, the records of mxi1 and mxi2 are actually CNAMEs pointing at 2 other hostnames: smol.example.com and tol.example.com. This is also what the reverse DNS records identified the IP addresses as.
In the previous setup (where I was also using an MTA-STS policy without issue), the MX record matched the reverse DNS record.
So, theorizing that Microsoft might be applying the MTA-STS policy based on the reverse DNS lookup of the MX servers, I updated the MTA-STS policy to the following:
version: STSv1
mode: enforce
mx: mxi1.example.com
mx: mxi2.example.com
mx: smol.example.com
mx: tol.example.com
max_age: 86401
(adding the 2 additional mx entries)
Of course, since the max_age in the policy was already set to 1 day, I would have to wait another day before testing this out.
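The policy file is just a handful of "key: value" lines, so it is easy to read programmatically if you want to sanity-check what you are publishing. A minimal sketch follows, assuming nothing more than that line format; parse_sts_policy is a name I made up, and this does no validation beyond converting max_age (which is given in seconds) to an integer.

```python
def parse_sts_policy(text: str) -> dict:
    """Parse MTA-STS policy text into a dict; mx entries become a list."""
    policy = {"mx": []}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "mx":
            policy["mx"].append(value)  # mx may appear multiple times
        elif key == "max_age":
            policy["max_age"] = int(value)
        else:
            policy[key] = value
    return policy

# The updated policy from this post.
policy_text = """\
version: STSv1
mode: enforce
mx: mxi1.example.com
mx: mxi2.example.com
mx: smol.example.com
mx: tol.example.com
max_age: 86401
"""

policy = parse_sts_policy(policy_text)
print(policy["mode"], policy["mx"])
print(f"cached for roughly {policy['max_age'] / 3600:.1f} hours")
```

With a max_age of 86401 seconds, that last line works out to about 24 hours.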
The test the next day also failed, and so I had to wait another day to get the resulting error message. When it did arrive, it confirmed I was on the right track, and also that the Outlook SMTP servers actually establish a TLS connection to the hostname of the reverse IP lookup of the MX record (or at the very least, they resolve the CNAME and use the hostname returned by that), because now I was seeing that they were getting an SSL hostname mismatch error:
Diagnostic information for administrators:
Generating server: AM5EUR03HT048.mail.protection.outlook.com
Receiving server: AM5EUR03HT048.eop-EUR03.prod.protection.outlook.com
firstname.lastname@example.org
11/22/2020 2:47:23 AM - Server at AM5EUR03HT048.eop-EUR03.prod.protection.outlook.com returned '550 5.4.317 Message expired, cannot connect to remote server(451 4.7.5 Remote certificate MUST have a subject alternative name matching the hostname (MTA-STS))'
11/22/2020 2:37:21 AM - Server at example.com (x.x.x.x) returned '450 4.4.317 Cannot connect to remote server [Message=451 4.7.5 Remote certificate MUST have a subject alternative name matching the hostname (MTA-STS)] [LastAttemptedServerName=example.com] [LastAttemptedIP=x.x.x.x:25] [DB5EUR03FT039.eop-EUR03.prod.protection.outlook.com](451 4.7.5 Remote certificate MUST have a subject alternative name matching the hostname (MTA-STS))'
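That error boils down to a simple question: does the certificate list the hostname the sender thinks it is talking to among its subject alternative names? Here is a small sketch of that check. The certificate dicts are hand-written stand-ins in the shape that Python's ssl.SSLSocket.getpeercert() returns, not my real certificates, and cert_covers is a hypothetical helper rather than a full implementation of certificate hostname matching.

```python
def dns_sans(cert: dict) -> list:
    """Collect the DNS-type subject alternative names from a cert dict."""
    return [value.lower()
            for kind, value in cert.get("subjectAltName", ())
            if kind == "DNS"]

def cert_covers(cert: dict, hostname: str) -> bool:
    """True if one of the certificate's DNS SANs matches the hostname."""
    host = hostname.rstrip(".").lower()
    for san in dns_sans(cert):
        if san == host:
            return True
        # A wildcard SAN such as *.example.com covers one extra label only.
        if san.startswith("*.") and host.partition(".")[2] == san[2:]:
            return True
    return False

# Illustrative certificates: one naming only the MX alias, one naming both
# the alias and the hostname it resolves to.
alias_only = {"subjectAltName": (("DNS", "mxi1.example.com"),)}
both_names = {"subjectAltName": (("DNS", "mxi1.example.com"),
                                 ("DNS", "smol.example.com"))}

print(cert_covers(alias_only, "smol.example.com"))
print(cert_covers(both_names, "smol.example.com"))
```

If the sender connects expecting the resolved name, a certificate that only names the alias fails the check, while one that names both passes.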
So I updated the SSL certificates with new subject alternative names (adding smol.example.com and tol.example.com respectively) and tried again.
This test finally worked. The mail was delivered almost immediately with no problems.
So it was confirmed that the Outlook mail servers were refusing to connect because rather than simply using the forward DNS record for the server to determine where to connect to (and checking that address against the MTA-STS policy), they were looking deeper and using either the reverse DNS or resolving the CNAME and connecting to the resolved name. I don’t know which it is, and I don’t really feel like testing much more now.
Regarding MTA-STS, it seems that very few mail servers actually implement MTA-STS checks, and frankly Microsoft deserve some respect for actually doing so in their mail servers (perhaps to be expected, as they did contribute to the spec). But I don’t know if this behaviour of how their mail servers choose where to connect to is actually following the email RFCs or not. I certainly couldn’t find any mention of it in the MTA-STS spec, but I am curious if anyone else has any insight.
I’d recommend for simplicity’s sake that you make your reverse DNS match your forward DNS for your mail servers in all cases if you possibly can and avoid using CNAMEs for them too. If not possible, then at least be aware that it can cause problems with certain mail service providers, particularly if you’re using a strict MTA-STS policy on your domain.
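If you want to check your own servers for this kind of mismatch, something like the following sketch could be a starting point. It resolves an MX name to its addresses, does a reverse lookup on each one, and reports whether the PTR name matches the MX name. The function names are mine, and the demo at the bottom uses made-up lookup tables (with a documentation IP address) so that it runs without touching the network; call check_mx with just the hostname to use real DNS through the standard socket module. Note that getaddrinfo follows CNAMEs silently, so this only catches the reverse DNS side of the problem.

```python
import socket

def _resolve(name):
    """Forward lookup: all IPv4/IPv6 addresses for a hostname."""
    infos = socket.getaddrinfo(name, 25, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def _reverse(ip):
    """Reverse lookup: primary PTR name (raises socket.herror if none)."""
    return socket.gethostbyaddr(ip)[0]

def _norm(name):
    return name.rstrip(".").lower()

def check_mx(mx_name, resolve=_resolve, reverse=_reverse):
    """Return (ip, ptr_name, matches_mx_name) for each address of mx_name."""
    report = []
    for ip in resolve(mx_name):
        ptr = reverse(ip)
        report.append((ip, ptr, _norm(ptr) == _norm(mx_name)))
    return report

# Made-up lookup tables standing in for DNS, mirroring the setup in this post.
fake_forward = {"mxi1.example.com": ["192.0.2.10"]}
fake_reverse = {"192.0.2.10": "smol.example.com"}

report = check_mx("mxi1.example.com",
                  resolve=fake_forward.__getitem__,
                  reverse=fake_reverse.__getitem__)
for ip, ptr, ok in report:
    print(ip, ptr, "matches" if ok else "MISMATCH")
```

Any line marked MISMATCH is a hostname that a sender like Outlook might end up validating against your MTA-STS policy and certificate instead of the name in your MX record.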
Also, don’t assume that just because your MTA-STS policy was working fine on your old setup, it’ll still be working on the new one.
Main photo by onlineprinters