Fun fact: since Linux has no built-in DNS caching, most of the DNS queries are looking for…itself. Oh wait, that’s not a fun fact — it’s actually a pain in the ass.
Surely that should just be a very fast lookup in /etc/hosts?
The problem here is that these services move, so if the name is in /etc/hosts, our failover mechanisms (to a DR data center which has a replica server) are severely hindered. We're adding some local caching, but there are some nasty gotchas with subnet-local ordering on resolution. By this I mean New York should resolve its local /16 first and Denver should resolve its local /16. Instead, BIND doesn't care (by default) and likes to auth against, let's say, the London office. Good times!
We had n datacenters, each named after its city: ldn.$company.com, ny.$company.com, etc. In DHCP we pushed out the search order so that a host would try to resolve locally and, if that failed, try a level up until something worked.
This meant that when you bound to a service, it would first look up service.$location.$company.com; if that wasn't there, it'd try service.$company.com.
This cut down the need for nasty split-horizon DNS, and moving VMs/services/machines between datacenters was simple and zero config.
If you were taking a service out of commission in one datacenter, you'd CNAME service.$location.$company.com to a different datacenter, do a staged kick of the machines, and BOOM, failed over with only one config change.
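For anyone who hasn't used search lists this way, here is a rough Python sketch of the lookup order described above. It only models the usual stub-resolver expansion (search suffixes first for short names, controlled by an `ndots`-style threshold); the function names, the `company.example` domains and the record table are all made up for illustration and are not anyone's real setup.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Names with enough dots are tried as-is before the search list.
        return [absolute] + expanded
    return expanded + [absolute]


def resolve(name, search_domains, records):
    """Return (fqdn, address) for the first candidate found in `records`."""
    for fqdn in candidate_names(name, search_domains):
        if fqdn in records:
            return fqdn, records[fqdn]
    return None


# Hypothetical search list pushed by DHCP to hosts in the New York datacenter.
ny_search = ["ny.company.example", "company.example"]

# Hypothetical zone data: the service exists locally in NY and company-wide.
records = {
    "service.ny.company.example.": "10.10.0.5",
    "service.company.example.": "10.20.0.5",
}

print(candidate_names("service", ny_search))
print(resolve("service", ny_search, records))
```

If the NY record is removed (or CNAMEd elsewhere on the DNS side), the same lookup simply lands on the next candidate, which is the "one config change" failover.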
On a side note, you can use SSSD or (shudder) NSLCD to cache DNS.
We do, but in the specific case of Active Directory, we want to fail over and auth against another data center if the primary is offline. This means for our domain, the local (to the /16) domain controllers are returned first and then the others. The problem is BIND locally doesn't preserve this order and applications are suddenly authenticating across the planet.
DNS devolution isn't a good idea here, since the external domain is a wildcard. We'll be paying for that mistake from long ago until (if ever) we change the internal domain name.
This is a pretty recent problem we're just now getting to because the DNS volume has been a back-burner issue. We'll look into permanent solutions for all Linux services after the CDN testing completes. Recommendations on Linux DNS caching are much appreciated, and we'll review each one. It just hasn't been an issue in the past, so we're not experts in that particular area. I am surprised caching hasn't landed natively in most of the major distros yet, though.
Aha, gotcha. I was under the impression that SSSD chose the fastest AD server it could find (either via the SRV records, or via a pre-determined list)? I've not had too much trouble with it stubbornly binding to the furthest-away server. (That's with AD doing the DNS and delegating to BIND.)
> The problem is BIND locally doesn't preserve this order
Nor need any other DNS server software do so. The actual DNS protocol has no notion of an ordering within a resource record set in an answer.
I suspect, from your brief description here, that what you'll end up doing is using the "sortlist" option in the BIND DNS client library's configuration file, /etc/resolv.conf. SRV RRSets will introduce some interesting complexities, though.
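To illustrate the idea behind "sortlist" (a resolv.conf line such as `sortlist 10.10.0.0/255.255.0.0`, with a made-up network here), the following is a minimal Python sketch of client-side reordering: addresses in preferred networks are moved to the front, and everything else keeps the order the server happened to return. This is not the resolver library's actual implementation; `sort_addresses` and all the addresses are assumptions for the example.

```python
import ipaddress


def sort_addresses(addresses, preferred_networks):
    """Reorder addresses so those in earlier preferred networks come first."""
    networks = [ipaddress.ip_network(n) for n in preferred_networks]

    def rank(address):
        ip = ipaddress.ip_address(address)
        for index, network in enumerate(networks):
            if ip in network:
                return index
        # Addresses outside every preferred network sort after all of them.
        return len(networks)

    # sorted() is stable, so ties keep the order the server returned.
    return sorted(addresses, key=rank)


# Hypothetical answer for a domain controller name, in arbitrary server order.
answer = ["10.30.0.5", "10.10.0.5", "10.20.0.5"]

# A New York client that prefers its own /16.
print(sort_addresses(answer, ["10.10.0.0/16"]))
```

Note that this only reorders address records. For SRV lookups the client picks targets by priority and weight before it ever resolves an address, which is where the complexities mentioned above come in.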
I'm confused. The lookup is for the localhost, so how would this alter failover mechanisms? You don't want a lookup for the localhost being responded to with an address of a different data centre surely?
It's not for localhost, it's for the server name. While GitLab and TeamCity are normally on the same box, they can operate on different boxes or in different data centers. It's looking up a DNS name which happens to point at the same box... does that explain it more clearly?
Also, Linux has "built in" (whatever that means) DNS caching. It's called nscd. It's just usually not enabled by default (which is sensible, since it's better off shared).