Geolocating IP addresses with DNS queries

Varun Patil
5 min read · Nov 7, 2019

I recently faced the (rather common) problem of geolocating IP addresses under the following constraints:

  1. There should be a single server (possibly behind an HA proxy) that can be queried for the location of an IP address.
  2. It should work with both MaxMind’s GeoIP2 database and an internal database for locating IP addresses in intranet subnets.
  3. It should be a (very) fast and maintainable solution.
  4. The machine running the geolocation server sits behind a firewall.
  5. It should work with both IPv4 and IPv6.

Most of the traditional approaches suggest connecting directly to popular paid services and retrieving IP information with an HTTP request, but this was ruled out by constraint #4. I therefore explored the following alternatives.

Redis #1

My first attempt focused only on intranet IP subnets (this being the primary purpose). A nice article on redislabs’ blog suggests storing the start and end IP values of each subnet as scores in a sorted set, with the location name stored in the member, and looking addresses up with ZRANGEBYSCORE. While this approach worked for IPv4, it could not be used for IPv6 ranges: Redis scores are 64-bit floating-point numbers, which cannot hold a 128-bit address exactly.
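To see why scores break down for IPv6, here is a small Python sketch (not taken from the redislabs article or from GeoIPNS). It assumes only that a Redis score behaves like a 64-bit IEEE 754 double, which is also what Python's float is; `as_score` is a hypothetical helper name.

```python
import ipaddress


def as_score(ip: str) -> float:
    """Convert an address to the float that would be stored as a sorted-set score."""
    return float(int(ipaddress.ip_address(ip)))


# IPv4 addresses are below 2**32, far inside the range (up to 2**53)
# where a double represents every integer exactly.
print(as_score("10.0.0.1") == as_score("10.0.0.2"))

# Neighbouring IPv6 addresses are around 2**125, where adjacent doubles
# are about 2**73 apart, so both round to the same score.
print(as_score("2001:db8::1") == as_score("2001:db8::2"))
```

Two distinct IPv6 addresses collapsing into one score means range boundaries can no longer be told apart, which is why this approach stops at IPv4.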

Redis #2

I quickly discovered another property of Redis sorted sets: members with equal scores are ordered lexicographically (sorted sets are kept in a skip list internally), which allows for a quick lexicographical range search. This worked for both IPv6 and IPv4 addresses, since v4 addresses could simply be converted to v6 by prefixing. Since the members are strings, it was also possible to simply store each address as a hexadecimal string, at the cost of some memory. To summarize, the start and end of each range were stored something like this:

20010db885a3000000008a2e03700000:0:location1
20010db885a3000000008a2e0370ffff:1:location1
20010db885a3000000008a2e03750000:0:location2
20010db885a3000000008a2e0375ffff:1:location2

Note that colons act as delimiters: 0 marks a start IP, 1 marks an end IP, and the third column is the location.
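A minimal sketch of how such members could be generated from a CIDR block, using only Python's standard library. The helper names are hypothetical, and the choice of the IPv4-mapped prefix `::ffff:0:0/96` for v4 addresses is an assumption; the article only says that v4 addresses are prefixed.

```python
import ipaddress

# Assumed prefix for IPv4 addresses (the standard IPv4-mapped IPv6 block).
V4_PREFIX = int(ipaddress.IPv6Address("::ffff:0:0"))


def hex_key(addr) -> str:
    """Render an IPv4Address or IPv6Address as 32 lowercase hex digits."""
    value = int(addr)
    if addr.version == 4:
        value += V4_PREFIX
    return format(value, "032x")


def range_members(cidr: str, location: str) -> list:
    """Return the start (0) and end (1) members for one subnet."""
    net = ipaddress.ip_network(cidr)
    return [
        f"{hex_key(net.network_address)}:0:{location}",
        f"{hex_key(net.broadcast_address)}:1:{location}",
    ]


for member in range_members("2001:db8:85a3::8a2e:370:0/112", "location1"):
    print(member)
```

Because every key has the same fixed width, comparing the strings byte by byte gives the same order as comparing the addresses numerically.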

To look up an IP address, convert it to the v6 hexadecimal string and query Redis with ZRANGEBYLEX, limiting the result to one entry. The next lexicographical entry is returned, with two possibilities:

  1. The entry is a 1, in which case the given IP is present in the location corresponding to that entry.
  2. The entry is a 0, in which case the given IP is not present in any range.
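The decision above can be sketched without a Redis server by emulating the query on a sorted Python list. Against Redis itself the equivalent command would be along the lines of `ZRANGEBYLEX <key> [<hex-ip> + LIMIT 0 1`. The function names are hypothetical, and the extra check for an IP that is exactly the first address of a range is my own addition, since in that case the returned entry is a 0 that still belongs to the matching range.

```python
import bisect

# The four example members from above, kept in lexicographical order.
MEMBERS = sorted([
    "20010db885a3000000008a2e03700000:0:location1",
    "20010db885a3000000008a2e0370ffff:1:location1",
    "20010db885a3000000008a2e03750000:0:location2",
    "20010db885a3000000008a2e0375ffff:1:location2",
])


def next_member(members, hex_ip):
    """Return the first member that sorts at or after hex_ip, or None."""
    i = bisect.bisect_left(members, hex_ip)
    return members[i] if i < len(members) else None


def locate(members, hex_ip):
    """Return the location containing hex_ip, or None if no range matches."""
    entry = next_member(members, hex_ip)
    if entry is None:
        return None
    address, marker, location = entry.split(":", 2)
    # An end marker means we are inside that range; a start marker only
    # matches when the IP is exactly the first address of the range.
    if marker == "1" or address == hex_ip:
        return location
    return None


print(locate(MEMBERS, "20010db885a3000000008a2e03701234"))
print(locate(MEMBERS, "20010db885a3000000008a2e03710000"))
```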

This approach, while fast, had two limitations:

  1. Nested subnets cannot be handled. For example, if one subnet belongs to a building and a nested subnet to a floor, an address in the building's range that lies below the floor's start is followed by the floor's 0 entry and is wrongly reported as belonging to no range.
  2. Every client needs to connect to the Redis database, i.e. each client (which may be written in any of a variety of languages) needs a Redis library and active TCP connections.

Finding a solution

I could consider a few solutions to the problem now. One was to home-brew my own HTTP service to handle connections and return JSON. Since MaxMind’s binary data format did not support nested subnets, I would also have to write my own data structure. However, this still had the limitation of keeping connections open. A viable alternative was to use UDP instead, but that meant writing raw UDP code on each client. Hence the solution: create a custom DNS server and make TXT queries to get information about an IP. Since IP location data is usually around a hundred bytes at most, everything fits into one packet, allowing a quick information exchange between the two microservices without a three-way handshake. The result is GeoIPNS.

GeoIPNS

Although C++ is usually my first choice, I chose Go for the project since it already had nice libraries for building a DNS server and besides, just because it’s awesome! What happens inside is very simple:

  1. The whole database (as a CSV initially) is parsed and stored in-memory in a sorted data structure (explained later).
  2. When a TXT query is received for *.geoipns, a callback performs a binary search on the data and returns matching data as the response (which happens to be a single packet).

Very simple, (generally) needs no client-side libraries, and uses a protocol that is used by (literally) everyone on the internet!

Data Structure

Each subnet is stored as an IP range with 128-bit start and end addresses. All start and end points are stored together in a single array sorted by IP address, each flagged as low (start) or high (end). A lookup binary-searches this array for the position of the queried IP address. If the point preceding that position is a low point, the IP is contained in the corresponding range; if it is a high point, the IP is contained in that range's parent (if one exists). Each data point stores a reference to its complementary point and to the parent range. You can view an example of the data structure in the project readme.
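The following Python sketch illustrates the idea; it is not the GeoIPNS implementation (which is written in Go), and all names in it are hypothetical. It assumes that ranges are either disjoint or properly nested, that IPv4 addresses are mapped under `::ffff:0:0/96`, and it simplifies the "complementary point" link to a shared range index.

```python
import bisect
import ipaddress
from dataclasses import dataclass
from typing import Optional

V4_PREFIX = int(ipaddress.IPv6Address("::ffff:0:0"))  # assumed IPv4 mapping


def to_int(addr) -> int:
    """Map an IPv4Address/IPv6Address onto one 128-bit integer space."""
    return int(addr) + (V4_PREFIX if addr.version == 4 else 0)


@dataclass
class Range:
    start: int
    end: int
    location: str
    parent: Optional[int] = None  # index of the enclosing range, if any


class RangeIndex:
    def __init__(self, subnets):
        self.ranges = []
        for cidr, location in subnets:
            net = ipaddress.ip_network(cidr)
            self.ranges.append(Range(to_int(net.network_address),
                                     to_int(net.broadcast_address), location))
        # Outer ranges first, so the stack always holds the enclosing chain.
        self.ranges.sort(key=lambda r: (r.start, -r.end))
        stack = []
        for i, r in enumerate(self.ranges):
            while stack and self.ranges[stack[-1]].end < r.start:
                stack.pop()
            r.parent = stack[-1] if stack else None
            stack.append(i)
        # One low (flag 0) and one high (flag 2) point per range. A query uses
        # flag 1, so it sorts after low points and before high points that
        # share its address. The third field orders nested ranges on ties.
        points = []
        for i, r in enumerate(self.ranges):
            points.append(((r.start, 0, -r.end), i))
            points.append(((r.end, 2, -r.start), i))
        points.sort()
        self.keys = [key for key, _ in points]
        self.owners = [owner for _, owner in points]

    def lookup(self, ip: str) -> Optional[str]:
        query = (to_int(ipaddress.ip_address(ip)), 1, 0)
        pos = bisect.bisect_left(self.keys, query) - 1  # preceding point
        if pos < 0:
            return None
        owner = self.owners[pos]
        if self.keys[pos][1] == 2:  # a high point precedes: use the parent
            owner = self.ranges[owner].parent
        return None if owner is None else self.ranges[owner].location


index = RangeIndex([
    ("10.1.0.0/16", "Building A"),
    ("10.1.5.0/24", "Building A, Floor 5"),
    ("2001:db8::/32", "Lab"),
])
print(index.lookup("10.1.5.7"))  # inside the nested floor subnet
print(index.lookup("10.1.9.9"))  # past the floor, still inside the building
```

Each lookup is a single binary search over twice as many points as there are ranges, and the parent link is what lets nested subnets resolve correctly where the Redis approach could not.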

Querying

Getting the location of IP addresses is now as simple as

dig @localhost -p5312 -tTXT 3.105.177.255.geoipns

which gives you

; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> @localhost -p5312 -tTXT 3.105.177.255.geoipns
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40194
;; flags: qr rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;3.105.177.255.geoipns. IN TXT

;; ANSWER SECTION:
3.105.177.255.geoipns. 3600 IN TXT "location=Sydney, NSW, AU"
3.105.177.255.geoipns. 3600 IN TXT "asn=Amazon.com, Inc."

;; Query time: 0 msec
;; SERVER: 127.0.0.1#5312(127.0.0.1)
;; WHEN: Mon Oct 21 19:13:20 IST 2019
;; MSG SIZE rcvd: 151
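For clients without a DNS library at hand, the same query can be sent over a plain UDP socket. The sketch below uses only Python's standard library and the standard DNS wire format; the function names are hypothetical, the server address and port are taken from the dig example above, and it deliberately skips retries, response ID checks and error handling.

```python
import socket
import struct
from typing import List


def build_txt_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query for the TXT record of `name`."""
    # Header: id, flags (only RD set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 16, 1)  # QTYPE=TXT, QCLASS=IN


def skip_name(packet: bytes, offset: int) -> int:
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = packet[offset]
        if length & 0xC0 == 0xC0:  # compression pointer: two bytes, then done
            return offset + 2
        if length == 0:
            return offset + 1
        offset += 1 + length


def parse_txt_answers(packet: bytes) -> List[str]:
    """Extract the text of every TXT record in the answer section."""
    qdcount, ancount = struct.unpack("!HH", packet[4:8])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # skip QTYPE and QCLASS
    texts = []
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10])
        offset += 10
        rdata = packet[offset:offset + rdlength]
        offset += rdlength
        if rtype == 16:
            # TXT RDATA is one or more length-prefixed character strings.
            pos, parts = 0, []
            while pos < len(rdata):
                n = rdata[pos]
                parts.append(rdata[pos + 1:pos + 1 + n].decode("utf-8", "replace"))
                pos += 1 + n
            texts.append("".join(parts))
    return texts


def geoip_lookup(ip: str, server=("127.0.0.1", 5312), timeout=1.0) -> List[str]:
    """Send one UDP datagram and parse the single datagram that comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_txt_query(f"{ip}.geoipns"), server)
        response, _ = sock.recvfrom(512)  # classic DNS-over-UDP size limit
        return parse_txt_answers(response)
```

With a GeoIPNS server listening locally, `geoip_lookup("3.105.177.255")` would return the same TXT strings that dig prints in its answer section.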

Benchmarks

Running the server with GeoIP2 lite, with both IPv4 and IPv6 addresses loaded, uses around 1.5GB of memory with minimal vCPU load. The server ran in an LXC container limited to 4 vCPUs and 4GB RAM on a 24 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz machine, with the PHP 7.3 client in a different container on the same machine. 10000 randomized IPv4 requests gave the following results:

Average 229μs per GeoIPNS request (phpdns)
Average 433μs per GeoIPNS request (native)

Thus, GeoIPNS satisfies my constraints:

  1. It is stateless, so multiple servers can be run. DNS inherently supports falling back to other servers, so clients usually would not need libraries for configuration.
  2. It works with arbitrary data, as CSVs are read directly.
  3. At <250μs per request (considering server processing time), it was fast enough at least for my application even on a single threaded benchmark.
  4. It needs internet connectivity only for updating the database once in a while; the database itself is cached locally.
  5. It works with both IPv4 and IPv6 by prefixing v4 addresses internally before storing.

Conclusion

While not deployed in production yet, abusing DNS queries for IP information seems to have been a good idea for my application, at least so far. If you want to check it out or contribute, you can find the open source (MIT licensed) GeoIPNS repo on GitHub.

Cheers!
