Docker Hub — Too Many Requests (HAP429)

This is my first blog, so please excuse any brevity. I had been thinking of writing blogs for a while but kept procrastinating due to time constraints and personal commitments. Inspiration and motivation from my colleagues finally made me write my very first one. It's quite elaborate and detailed, but I wanted to share as many details as possible so that others who run into a similar issue can count on this post for support.

Very recently we faced a critical issue while pulling/pushing images to and from Docker Hub on our production servers, which affected our clients severely.

This is the error:

“Error response from daemon: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: "Too Many Requests (HAP429)."”

Initial findings

  1. As a first step, we did a Google search for the error. We found this link — https://docs.docker.com/docker-hub/download-rate-limit/
  2. Based on the official Docker documentation, the image pull rate limit mainly targets anonymous users (users pulling public images without a docker login), with a higher threshold for free authenticated users; paid accounts are exempt from it.
  3. Rate limit — what is it? A cap on the number of image pulls: at the time, anonymous users were limited to 100 pulls per 6 hours per IP (free authenticated users got 200 per 6 hours).
  4. Why a rate limit? Docker Hub introduced this limit (in late 2020) to protect its infrastructure from abusive request patterns such as DDoS-style traffic.
  5. How? Docker Hub tracks the IP address of anonymous users pulling images. Once the threshold is reached, it returns a "429 Too Many Requests" error for further image pulls.
  6. Layers — When we build a Docker image from a Dockerfile, each instruction (COPY, RUN, CMD, etc.) is stored in a separate layer, so a single image can have multiple layers. When we docker pull an image, the Docker daemon makes API calls to Docker Hub and downloads the image layer by layer (the fetch of the image manifest is what counts toward the rate limit). Once the limit for an IP is reached, further pulls fail with the "429 Too Many Requests" error.
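The docs page linked above also describes how to check your current rate-limit status from the command line. A minimal sketch using curl (ratelimitpreview/test is Docker's documented probe repository; HEAD requests to it do not count toward your limit):

```shell
#!/bin/sh
# Check the Docker Hub rate-limit status for this machine's public IP
# (method from the rate-limit docs linked above; requires curl).

# 1) Get an anonymous pull token for Docker's rate-limit probe repo:
TOKEN=$(curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | grep -o '"token":"[^"]*"' | cut -d'"' -f4)

# 2) HEAD the manifest; the response headers carry the current limit
#    and the pulls remaining in the window (a HEAD does not count):
curl -fsSI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i '^ratelimit' || echo "no ratelimit headers (request failed or account is unlimited)"
```

A rate-limited IP shows headers like ratelimit-limit: 100;w=21600 and ratelimit-remaining: 98;w=21600 (the window w is in seconds). Running this from the affected server, or from behind the suspect NAT gateway, tells you how that IP is being counted.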

Since we have a paid private account with Docker Hub, we thought this shouldn't apply to us. Docker Hub had recently changed its pricing strategy (initially it was repository-based; now it is user-based).
We suspected our account had been missed in that migration and was still on the repository-based plan, which was why rate limits were being applied to us.

Here is what we did:

I tried reaching Docker using the toll-free number mentioned on their website (https://www.docker.com/company/contact). The toll-free number, by default, connects you to voicemail, and you are supposed to receive a callback. I left voicemails four or five times but never received a call back. Frustrated, I tried the last option: sending an email to support@docker.com.

Docker support is provided by a third party called Mirantis. We dropped an email to support and received an acknowledgement. After 3–4 hours, they sent a reply asking us to reach out to Docker to sort out the issue. We immediately called Mirantis (luckily we found a working contact number in their email response). They provide only Level 1 support and have no deep technical knowledge. They kept redirecting us to other folks, and we finally ended up with their manager, who told us that since we weren't using Docker Enterprise Edition, they could not help us further.

We had hit a roadblock with Docker support. Now we needed to shift our thinking in a different direction.

Again we searched Google, this time going through the links more carefully. Here is the useful one we found — https://github.com/docker/hub-feedback/issues/1907. It gave us some hunches.

We have 20+ customer servers (each in a separate VPC), but we were facing the issue only on a few servers, all in the same VPC. So we figured that all their requests were going out through a single NAT gateway (and that Docker Hub had probably blocked its IP). We replaced the NAT gateway for the affected VPC, but after a while the same error popped up again.

We thought setting up a proxy server and configuring Docker's proxy settings would resolve the issue. In our case, the NAT gateway acted as a kind of proxy, so we tried to configure Docker's proxy to point at the NAT gateway, but it was unsuccessful.
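For completeness, the standard way (from Docker's documentation) to put the daemon behind a real HTTP(S) proxy is a systemd drop-in file; the proxy address below is a placeholder:

```
# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1"
```

After creating the file, run systemctl daemon-reload and restart the docker service. In hindsight, a NAT gateway is not an HTTP proxy — it only translates addresses — which is why pointing Docker's proxy settings at it could not work.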

While we were in a critical state, with around 100 users affected and the issue still unresolved, our Information Security Head cum Program Manager had a very smart idea: email Docker Hub support again, CC'ing the Docker CEO and CTO. Within a few minutes we got a response from the Docker Infrastructure (technical) team, and they started investigating the issue. They found 20,000+ requests per minute coming from the affected VPC's NAT gateway IP address. Normally, Docker Hub allows private accounts around 2,000–2,500 requests per minute.

We immediately realized that we use docker stack deploy to deploy our services to the production servers, and that the --with-registry-auth option is passed as part of this command.

What this flag means is that registry authentication details are sent to the swarm agents, so every time a service tries to run containers it logs in and pulls the latest image:tag from the registry. In our case, there were 30+ dynamic Docker services failing due to a mount issue, and Docker keeps retrying a failing service's containers until they run successfully (retrying roughly every second). Each retry hit Docker Hub. We finally resolved the issue by removing those dynamic Docker services and removing the --with-registry-auth option from docker stack deploy.
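The fix itself boils down to a one-flag change in the deploy command. A sketch, with mystack and docker-compose.yml as placeholder names:

```shell
# Previously: registry credentials were forwarded to the swarm agents,
# so every restart attempt of a failing service could reach Docker Hub:
#
#   docker stack deploy --with-registry-auth -c docker-compose.yml mystack
#
# After the fix: no registry auth is forwarded, so restart loops no
# longer generate authenticated requests against the registry:
docker stack deploy -c docker-compose.yml mystack
```

Capping a service's restart attempts (docker service update --restart-max-attempts) is another way to stop such retry storms at the source.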

The Docker daemon's logs are recorded in /var/log/messages (on CentOS), including every service retry, but we did not take those logs seriously because we assumed the issue was Docker Hub's rate limit.
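In hindsight, the crash loop was visible in those logs all along. A quick sketch of how to look (run as root; "myservice" is a placeholder name, and on Debian/Ubuntu the file is /var/log/syslog instead):

```shell
# CentOS writes the Docker daemon's messages, including every failed
# service retry, to /var/log/messages:
grep -i 'dockerd' /var/log/messages 2>/dev/null | tail -n 20

# On systemd hosts, journalctl shows the same stream; a large hit
# count for one service is the crash-loop signal we missed:
journalctl -u docker --since "1 hour ago" 2>/dev/null | grep -ci 'myservice' || true
```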

Trust me, it would have taken us 3–4 days to find the above root cause if we had not gotten those details from the Docker Hub infra team.

Other Possible Ways

  • We can use tcpdump/lsof/Wireshark to monitor outgoing HTTP(S) calls.
  • Check the Docker daemon logs, including warnings.
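For example (run as root; the trailing || true just keeps each snippet from aborting when a tool is missing or you are unprivileged):

```shell
# Which remote endpoints does the Docker daemon hold connections to
# right now? (lsof -n/-P skip DNS and port-name lookups)
lsof -nP -iTCP -sTCP:ESTABLISHED 2>/dev/null | grep dockerd || true

# Capture a short sample of traffic toward Docker Hub's registry
# endpoint; tcpdump resolves the hostname once, at capture start:
tcpdump -i any -c 50 "host registry-1.docker.io and tcp port 443" || true
```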

Lessons Learned

  • Make sure to do enough (and more) investigation on your side before escalating.
  • Never ignore small warnings, and always look at the system logs.

Associate Architect ★ Developer ★ Troubleshooter