LOG PROBLEMS WITH DOCKER AND DATADOG

I manage several servers, and to simplify administration and detect incidents I am using Docker and Datadog (in its Docker version).
For those who don't know, Datadog is a server monitoring platform. It offers many features, but I am only using two of them:

  1. Infrastructure statistics (CPU usage, RAM, disk space, read/write, ...).
  2. Centralized logs (instead of keeping them on each site's server separately, they are sent to Datadog).

The problem I have found is with CPU consumption: from time to time the Datadog agent reaches 100% CPU. This leaves no CPU available for the other containers, so the websites stop responding.
I have found little information on the subject, and some of it was several years out of date.
After researching for a while, I realized that the problem is not Datadog itself. The problem appears specifically when Datadog tries to collect the Docker logs.

datadog_agent:
    image: datadog/agent:7
    container_name: "${PROJECT_NAME}_datadog_agent"
    restart: always
    environment:
      DD_API_KEY: "131ef96a2e4593a88051ceaff208b26a"
      DD_SITE: "datadoghq.eu"
#      DD_LOGS_ENABLED: "true"
#      DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL: "true"
      DD_AC_EXCLUDE: "name:${PROJECT_NAME}_datadog_agent"
      DD_HOSTNAME: "${PROJECT_BASE_URL}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup:/host/sys/fs/cgroup:ro
      - /opt/datadog-agent/run:/opt/datadog-agent/run:rw

Comment out DD_LOGS_ENABLED and DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL, as in the snippet above, and recreate the container.
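Assuming the agent is defined in a Compose file as in the snippet above, recreating just that service might look like this (the service name `datadog_agent` comes from the snippet; adjust it to your setup):

```shell
# Recreate only the Datadog agent container so it picks up the
# commented-out environment variables (Docker Compose v2 syntax).
docker compose up -d --force-recreate datadog_agent
```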

Depending on the case, you may see the CPU drop within a few minutes, or you may not notice any improvement and will have to restart the server or the dockerd daemon.
At least with this workaround you keep the CPU, RAM, disk, etc. statistics while you look for a way to solve the log problem.

Digging deeper into the subject, I discovered that Datadog reads the JSON log files that Docker keeps for each container. In my case, the problem was that some of those files weighed several GB.
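To check whether this is your case, you can list the size of each container's JSON log file. This is a quick check, not part of the original article: `/var/lib/docker/containers` is Docker's default data root on Linux, so adjust the path if yours differs.

```shell
# Show each container's JSON log file, sorted by size (largest last).
# Requires root, since /var/lib/docker is not world-readable.
sudo du -h /var/lib/docker/containers/*/*-json.log | sort -h
```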

Having many multi-GB files makes Datadog take a long time to process them. And it makes sense: every time the Datadog agent restarts, it goes back over those files, the workload piles up, and CPU and RAM consumption skyrockets.

How to limit the size of these files, so they stop eating disk space and Datadog no longer has so many logs to process, is covered in this other article: How to reduce the size of the logs of docker containers.
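As a rough sketch of that approach: Docker's default json-file logging driver supports rotation through `/etc/docker/daemon.json`. The `max-size` and `max-file` options are real options of the json-file driver; the limits below are just example values.

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

After editing daemon.json the Docker daemon has to be restarted, and the new limits only apply to containers created afterwards; existing containers keep their old logging configuration until they are recreated.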
