
It’s tricky to build a correct mental model of how HTTP request timeouts behave in most HTTP libraries. You may specify a “timeout” of, say, 5 seconds on your HTTP GET request and expect that this guarantees the request will succeed or fail within 5 seconds. That model is wrong: in actuality, your total time can stretch into hours with a “5 second” timeout in the Python requests library, as in many other HTTP client libraries. The one timeout passed to requests is only one piece of what is needed to constrain response time on a noisy network. I want to share a more comprehensive model of all the other timeouts in play in the network stack, how they interact with each other, and how to control them, so that you can build an accurate mental model of how long differently configured requests can take, and can wrest full control over your run time.

Scope

I am going to focus specifically on routine HTTPS GET requests made with the Python requests library on Linux. Here is our client code:

requests.get('http://localhost:5678', timeout=5)

If not 5 seconds, what is the longest this can take to fail? What will actually time out first? Let’s find out!

DNS Timeout

DNS requests can time out under heavy local network loss. requests will spend an unlimited amount of time on DNS resolution, regardless of the timeout value you specify; the DNS timeout is instead controlled at the operating-system level. In my testing on Linux 5.11 with systemd-resolved, it took exactly 20 seconds until the system gave up.

Timeout worst case running total: 20 seconds.

Note on latency: DNS lookups are cached locally on Linux (disclaimer: technically not cached by the kernel itself, but by a userspace resolver service), so if you are making many requests to the same host in a row, this latency will only impact the first one.
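Since the timeout you pass to requests does not cover DNS, one way to cap lookup time from the application side is to run the resolution in a worker thread with your own wall-clock bound. This is a sketch of that idea (not a requests feature); the function name is invented for illustration:

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_with_timeout(host, timeout):
    """Run getaddrinfo in a worker thread so its wall-clock time is bounded.

    getaddrinfo itself has no portable timeout parameter; if the lookup
    stalls past `timeout`, future.result raises concurrent.futures.TimeoutError
    (though the worker thread keeps running in the background).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(socket.getaddrinfo, host, 443, type=socket.SOCK_STREAM)
        return future.result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)


addrs = resolve_with_timeout('localhost', 2.0)
print(addrs[0][4][0])  # the resolved address, e.g. '127.0.0.1' or '::1'
```

You can then connect to the resolved IP directly, so the rest of the request is covered by the timeouts discussed below.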

TCP Connect Timeout

This timeout, alongside the TCP read timeout below, is what is actually being configured when you pass the “timeout” arg to requests.

It is the maximum time the kernel will wait for a SYN-ACK after sending the first SYN packet when establishing a new TCP connection. If the SYN or SYN-ACK is lost entirely, the kernel keeps retrying, on Linux with exponential backoff between successive retries, starting at 1 second and doubling each time. After enough failures, the timeout is hit and the request is aborted.
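Assuming the common Linux default of net.ipv4.tcp_syn_retries = 6, a back-of-envelope sketch of how long an unanswered connect takes to fail when no application-level timeout is set:

```python
def syn_backoff_total(retries=6, initial=1.0):
    """Total seconds before the kernel gives up on an unanswered connect:
    the wait after the initial SYN plus one doubling wait per retry."""
    waits = [initial * 2 ** i for i in range(retries + 1)]
    return sum(waits)  # 1 + 2 + 4 + 8 + 16 + 32 + 64


print(syn_backoff_total())  # 127.0
```

So without the 5-second timeout from requests, a dropped handshake alone would cost roughly two minutes.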

Timeout worst case running total: 25 seconds (+5 seconds from connect timeout)

Note on latency: If you are making multiple requests to the same host in a row, normally TCP connections get re-used for multiple requests [1], so you do not have to wait for TCP handshakes again on the subsequent requests.

TCP Read Timeout

This is the maximum time to wait for new data after each TCP read (or after the first write). It is also configured by the “timeout” arg in requests: the same value is reused for both the connect timeout and the read timeout. You can also pass a tuple if you want to set the connect timeout to a different value than the read timeout.

The read timeout is the most misunderstood of these timeouts. It would be intuitive to expect it to be the time we wait to read a full response from the server. However, what it actually is is the time to wait for each successive fragment of data from the server. A malicious server could send a 10 KiB response as 10240 individual packets, each containing a single byte and spaced 4 seconds apart. Receiving such a response takes approximately 11 hours, yet a read timeout of 5 seconds is never violated, because each individual read operation completed within 5 seconds.

To really understand this timeout, we have to dive one level deeper and look at the interactions between Python and the kernel under the hood. Each time you call requests.get, here are the syscalls that are made:

requests.get('http://localhost:5678', timeout=5)
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 3
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
ioctl(3, FIONBIO, [1]) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(5678), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT|POLLERR}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
sendto(3, "GET / HTTP/1.1\r\nHost: localhost:"..., 145, 0, NULL, 0) = 145
ioctl(3, FIONBIO, [1]) = 0
poll([{fd=3, events=POLLIN}], 1, 5000) = 0 (Timeout)

The line to focus on here is the poll call on the last line. Notice the “5000” that is passed to it: this is the 5-second timeout, in milliseconds. poll will block until that timeout is hit, or until new data is available.

Next, let’s have a server send a few fragments of the response and see how requests reacts:

poll([{fd=3, events=POLLIN}], 1, 5000)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "z", 8192, 0, NULL, NULL) = 1
poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "z", 8192, 0, NULL, NULL) = 1
poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "z", 8192, 0, NULL, NULL) = 1
poll([{fd=3, events=POLLIN}], 1, 5000) =

Yikes. It keeps calling the same poll function again and again, with a fresh 5-second timeout each time. This type of latency can go on indefinitely. Rarely will your remote server be malicious like this, but either a large response or a slow network connection will necessarily arrive as many separate fragments, and can have the same effect of a request far exceeding the specified timeout.
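You can reproduce this pathology with a toy drip-feed server. The sketch below uses the stdlib’s urllib.request rather than requests, since its timeout has the same per-read socket semantics; the server, port choice, and delays are invented for illustration:

```python
import socket
import threading
import time
import urllib.request


def slow_server(srv):
    """Accept one connection, then drip a 5-byte body out one byte per 0.3 s."""
    conn, _ = srv.accept()
    conn.recv(4096)  # consume the HTTP request
    conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n')
    for byte in b'hello':
        time.sleep(0.3)
        conn.sendall(bytes([byte]))
    conn.close()


srv = socket.socket()
srv.bind(('127.0.0.1', 0))  # pick a free port
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=slow_server, args=(srv,), daemon=True).start()

start = time.time()
# timeout=0.5 is a per-read socket timeout, like requests' read timeout
body = urllib.request.urlopen(f'http://127.0.0.1:{port}/', timeout=0.5).read()
elapsed = time.time() - start
print(body, elapsed)
```

Each individual read completes within 0.5 seconds, so no timeout fires, yet the whole request takes roughly 1.5 seconds: triple the configured “timeout”.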

Timeout worst case running total: 25+inf seconds

TCP Keepalive Timeout

TCP also has a “keepalive” feature, where a heartbeat packet can be sent periodically on idle TCP connections to ensure that the remote server is still reachable. In the default Linux configuration, these heartbeats only begin once the TCP connection has been idle for 2 hours. If the remote host is unreachable at that point, the connection times out. This is indeed how it works with Python requests.

However, TCP keepalive feature becomes very interesting if you use a different client or otherwise configure the keepalives to engage earlier than after 2 hours. In fact, some http clients will configure the kernel to start sending keepalive heartbeats immediately. As an example of a library which does this, we can take a short detour away from Python Requests to instead look at Go net/http.

http.Get("http://localhost:5678")

With Go net/http, Linux TCP keepalive is activated immediately and configured to send a keepalive packet every 30 seconds (TCP_KEEPIDLE & TCP_KEEPINTVL). If the remote server stops responding to these keepalive probes, the kernel retries a number of times: after the 9th probe goes unanswered, it kills the connection [2].

Where does the 9 come from? It appears to be the default on Linux 5.11, found in cat /proc/sys/net/ipv4/tcp_keepalive_probes. The 30 seconds, however, is not explained by the default configuration: /proc/sys/net/ipv4/tcp_keepalive_intvl is set to 75 seconds, so the interval is being configured specially by Go net/http. Let’s look at that at the syscall level, in comparison to the syscalls we observed Python requests making earlier:

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(5678), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getpeername(3, {sa_family=AF_INET, sin_port=htons(5678), sin_addr=inet_addr("127.0.0.1")}, [112->16]) = 0
getsockname(3, {sa_family=AF_INET, sin_port=htons(33218), sin_addr=inet_addr("127.0.0.1")}, [112->16]) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPINTVL, [30], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
write(3, "GET / HTTP/1.1\r\nHost: localhost:"..., 95) = 95
read(3, 0xc000120000, 4096) = -1 EAGAIN (Resource temporarily unavailable)

What is happening is that net/http sets socket-specific TCP keepalive options on every request using setsockopt, and it defaults to a 30-second keepalive interval for all HTTP requests. The key line of source code is at https://github.com/golang/go/blob/ecf6b52b7f4ba6e8c98f25adf9e83773fe908829/src/net/http/transport.go#L47. It is quite interesting to me that they decided to enable keepalive on all connections with a custom interval, but did not also configure the number of probes to wait for (TCP_KEEPCNT), instead relying on the kernel default of 9. In fact, Go’s underlying “Dialer” code does not even expose an option to configure the probe count. Does this mean your Go net/http code will time out at different times on different Linux distributions in practice, or is it set to 9 universally? I have no idea; if your Go net/http code is deployed across Linux + Mac + Windows, I am certain there will be some differences like this on each platform.
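For comparison, here is roughly the same keepalive configuration applied to a raw Python socket. This is a sketch: the TCP_KEEP* constants are Linux option names, hence the hasattr guards for other platforms:

```python
import socket

# Mirror Go net/http's keepalive setup on a raw socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
if hasattr(socket, 'TCP_KEEPIDLE'):   # Linux-only constant
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle seconds before first probe
if hasattr(socket, 'TCP_KEEPINTVL'):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
if hasattr(socket, 'TCP_KEEPCNT'):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)     # unanswered probes before the kernel gives up
```

If you wanted this with requests, I believe urllib3 (which requests is built on) lets you pass socket options down to its connections, though requests does not expose that directly.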

Okay, so that’s TCP keepalive. Not normally used with Python requests, but on by default in some other libraries. Let’s get back to looking at the final type of timeout.

Timeout worst case running total: 25+inf seconds (TCP keepalive has no impact in our example)

“Total” Timeout

Lastly, I want to recommend disregarding all this complexity and getting a consistent timeout, regardless of what happens at any level, by managing a top-level timeout timer yourself: start the timer just before making the request at all. I definitely think this is the simplest option, when available. When configuring timeouts in software outside your control (e.g. nginx reverse-proxy upstream request timeout configuration), there may be no option to set a timeout like this, but if you have full control over the software, you can implement one.

In Python + requests, the most direct way to do this is to run the requests code in a dedicated thread and manage a timer from your main thread (perhaps through the asyncio loop.run_in_executor interface).
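A minimal sketch of that thread-plus-timer idea using concurrent.futures, with a sleep standing in for a stalled requests.get (the fetch function and timings are invented for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as TotalTimeout


def fetch(url):
    # stand-in for requests.get(url); here we just simulate a stalled request
    time.sleep(1.0)
    return 'response'


pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(fetch, 'http://localhost:5678')
try:
    result = future.result(timeout=0.2)  # the total wall-clock budget
except TotalTimeout:
    result = None  # give up, regardless of which lower-level timeout stalled
pool.shutdown(wait=False)
print(result)  # None: the total timeout fired first
```

One caveat: the worker thread keeps running after the timeout fires, since requests cannot be cancelled mid-flight; you are abandoning the request, not aborting it.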

This has overhead though and is not very elegant. A better solution in Python is to switch to a more powerful library with true nonblocking (“nonblocking” as in O_NONBLOCK) support; the two most popular such libraries are httpx and aiohttp [3]. aiohttp goes even further and has total timeouts built right in, which is fantastic and makes this solution very easy. However, instead of showing that built-in total timeout support, to better communicate the underlying idea I will instead show a custom implementation of it, here using httpx.

import asyncio
import time

import httpx


async def test(url):
    start_time = time.time()
    try:
        async with httpx.AsyncClient() as client:
            await client.get(url)
    finally:
        print(f'Request completed after {time.time() - start_time} seconds.')


asyncio.run(asyncio.wait_for(test('https://example.com'), timeout=5))

Additional Timeouts

There are still more failure modes I have not analyzed but am aware of, and certainly even more beyond those that I am not aware of. Here are a few of the additional ones I know about:

  • OCSP request timeout. OCSP (Online Certificate Status Protocol) is an extension to TLS that enables certificate authorities to dynamically revoke otherwise-valid TLS certificates before they have expired. It works by requiring clients to check each certificate they receive with an OCSP server run by the CA. The OCSP server will disclose any revoked certificate fingerprints to the client, allowing it to reject them. This involves a new server and a new connection, which takes time and adds a new risk of connection latency or interruption, resulting in a timeout or increased response time. Today, requests doesn’t yet have OCSP support built in, but other libraries do.
  • Extraordinary CPU scheduler delay. If e.g. the system you are running on hibernates while waiting for the response, it could take indefinite time for the timeout to hit, regardless of everything else I have said in this post. Alternatively, other processes may simply overload the scheduler, giving you a lesser version of the same effect.
  • HTTP Keep-Alive Idle Timeout. This is totally distinct from the TCP-level “keepalive” feature discussed above, and in terms of analyzing the max time that your code can take, it can be effectively modeled as having no impact. However, you have probably heard of this before, so I have briefly explained what it is in footnote [1:1].
  • TCP Write Timeout. I didn’t mention this explicitly, but the write timeout is the same idea as the read timeout, except for writes (send()). For HTTP that means this only applies if you get backpressure from the remote server (from poor network connection or a struggling server) or failure to ACK writes while sending the initial HTTP request. For a large request (e.g. uploading a file), or a bad network connection, the same pathological latency scenario can happen with write timeouts as the one described above with read timeouts.
  • Server-side timeouts. The server can also kill your connection at any time, for any reason, including when any of its own opaque timeout logic is triggered.

Takeaway

To wrap up what I shared above:

  • There is potentially slow network activity that happens even before requests’ first timeout applies: the DNS request.
  • TCP connect timeouts apply to the TCP handshake.
  • TCP read timeouts apply to each fragment of data received and can last indefinitely.
  • TCP keepalive effectively doesn’t apply with requests; don’t worry about it unless you are using other libraries.
  • I highly recommend using total timeouts to be able to constrain the worst case behavior and give yourself a clear upper bound on the impact that network-borne problems can have on your software.

Footnotes


  1. HTTP “keep-alive” refers to keeping persistent TCP connections open across more than one request. When you are making many requests in a row to the same server, have configured your server to support persistent connections, and have configured your client to reuse the connection (use requests.Session for Python requests), HTTP keep-alive’s timeout comes into play. Both the client and the server can set a keep-alive timeout; if the series of requests stops, i.e. the persistent connection sits idle for longer than this timeout, the connection is closed and a new connection is required for subsequent requests. Again, be sure not to confuse this concept with TCP keepalive. ↩︎ ↩︎

  2. Using errno ETIMEDOUT, which canonically shows up as “connection timed out” in most software thanks to strerror. ↩︎

  3. Just for completeness: if you’re not in a Python environment, or you have thousands of concurrent requests and need the best performance, there is a high-performance non-asyncio approach with a divergent architecture. You would manage the sockets as one set of file descriptors in a poll (note: I highly suggest wrapping the poll with libev or libuv), and within that same set of file descriptors you can also insert timerfds that fire when your timeout value is hit. I think that theoretically asyncio.sleep timers with uvloop could use this implementation internally, but I have not verified whether they actually do. poll/epoll/select also accept a timeout value each time they are called, and I have seen software that internally picks the timer with the lowest remaining time, uses that remaining time as the poll timeout, and then checks whether any timer has expired every time the poll returns. I have seen this poll-timeout approach far more often than timerfds, and I have not really analyzed the pros and cons of which you should prefer in what situations. ↩︎
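The poll-timeout variant described in this footnote can be sketched with Python’s selectors module (the function name and the deadline bookkeeping are invented for illustration):

```python
import selectors
import socket
import time


def poll_once(sel, deadlines):
    """One event-loop turn: use the soonest timer's remaining time as the poll
    timeout, then report which fds are ready and which timers have expired."""
    now = time.monotonic()
    remaining = [d - now for d in deadlines.values()]
    timeout = max(min(remaining), 0) if remaining else None
    ready = sel.select(timeout=timeout)
    now = time.monotonic()
    expired = [name for name, d in deadlines.items() if d <= now]
    return ready, expired


# usage: register one end of a socket pair, write nothing, set a 0.1 s timer
a, b = socket.socketpair()
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)
ready, expired = poll_once(sel, {'request-1': time.monotonic() + 0.1})
print(ready, expired)  # no data arrived, so only the timer fired
```

A real loop would repeat this, dispatching ready fds and timing out the expired requests on each turn.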