Debugging Job Resources

In the Distributed Runtime Environment, we learned the basic concepts of task slots and resources, and in TaskManager Resource we covered TaskManager resource configuration and how to specify resources for each operator. From all of this we can conclude that quantitative resource management helps a lot in avoiding OutOfMemoryErrors. At the same time, however, it raises the bar for debugging job scheduling problems. So we have enhanced several pages of the Flink dashboard to make debugging easier.

Resource Overview

As shown in the figure above, the resource section of the Flink cluster overview contains total/available slots and total/available resources (CPU cores, heap memory, direct memory, native memory, managed memory, and network memory).

  • In session mode, the total resources are calculated from the configuration. The available resources are the total resources minus the resources already used by running jobs. We strongly encourage that, within the same session, either all jobs specify resources or none of them do.
  • In per-job mode, if no operator has set its resources, the total and available resources are both zero. Otherwise, the total resources are the sum of all operator resources, and the available resources are zero or close to zero. This is because one or more slot requests are combined into a YARN container or Kubernetes pod and allocated from the resource management framework on demand.
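The session-mode bookkeeping described above is simple per-dimension arithmetic: for each resource type, available = configured total minus what running jobs already use. A minimal sketch in plain Python (not Flink code; the dictionary keys are hypothetical names for the resource dimensions the dashboard shows):

```python
# Illustrative sketch of session-mode resource accounting (not Flink code).
# Available resources = configured totals minus resources used by running
# jobs, computed independently for each resource dimension.

TOTAL = {"cpu_cores": 4.0, "heap_mb": 4096, "direct_mb": 1024,
         "native_mb": 512, "managed_mb": 2048, "network_mb": 512}

def available(total, used_by_jobs):
    """Subtract every running job's usage from the configured totals."""
    avail = dict(total)
    for job in used_by_jobs:
        for dim, amount in job.items():
            avail[dim] -= amount
    return avail

running_jobs = [
    {"cpu_cores": 1.0, "heap_mb": 1024, "direct_mb": 128,
     "native_mb": 128, "managed_mb": 512, "network_mb": 128},
]

print(available(TOTAL, running_jobs))
# e.g. {'cpu_cores': 3.0, 'heap_mb': 3072, ...}
```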

Pending Slots

Each SlotRequest has a corresponding ResourceProfile which describes the resource requirements of its tasks. As shown in the figure above, all requested but unfulfilled slots are shown in a list of pending slots, along with the request time and resources of each pending slot.
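Conceptually, each pending-slot entry pairs a request time with a ResourceProfile-like description of the task's requirements. A minimal sketch in plain Python (not Flink's actual ResourceProfile class; the field names are hypothetical, mirroring the dimensions shown in the dashboard):

```python
# Illustrative model of a pending-slot entry (not Flink code).
from dataclasses import dataclass, field
import time

@dataclass
class ResourceProfile:
    # Hypothetical field names mirroring the dashboard's resource dimensions.
    cpu_cores: float
    heap_mb: int
    direct_mb: int
    native_mb: int

@dataclass
class PendingSlot:
    profile: ResourceProfile
    request_time: float = field(default_factory=time.time)

pending = [PendingSlot(ResourceProfile(cpu_cores=0.1, heap_mb=256,
                                       direct_mb=32, native_mb=32))]
# The longest-waiting request is the one with the earliest request time.
oldest = min(pending, key=lambda s: s.request_time)
print(oldest.profile)
```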

TaskManagers

Each TaskManager represents a subset of the resources of the Flink cluster. A slot can only be assigned to a single TaskManager; allocation across multiple TaskManagers is not possible. When we submit a job to an existing session, even though the Flink cluster has enough resources in total to fulfill the pending SlotRequest, it may still not be allocated due to resource fragmentation. So we need the TaskManager resource list (including total/available amounts of every kind of resource) to help diagnose this situation.
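Resource fragmentation means the cluster-wide totals are sufficient, yet no single TaskManager can fit the request. The check can be sketched as follows (illustrative Python, not Flink code; the dimension names are hypothetical):

```python
# Illustrative fragmentation check (not Flink code): the cluster as a whole
# has enough of each resource, yet no single TaskManager can host the slot.

request = {"cpu_cores": 1.0, "heap_mb": 1024}

task_managers = [  # available resources per TaskManager
    {"cpu_cores": 0.5, "heap_mb": 768},
    {"cpu_cores": 0.5, "heap_mb": 768},
]

def fits(available, req):
    """True if every requested dimension fits into the available amounts."""
    return all(available[d] >= amount for d, amount in req.items())

cluster_total = {d: sum(tm[d] for tm in task_managers) for d in request}
cluster_enough = fits(cluster_total, request)   # 1.0 core and 1536 MB suffice
fragmented = cluster_enough and not any(fits(tm, request)
                                        for tm in task_managers)
print(fragmented)  # True: the request is stuck despite sufficient totals
```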

Procedure of diagnosis

  1. Visit the Running Jobs Page of the Flink dashboard to check whether all tasks are running. If not, go to the next step.
  2. Click the job name to visit the Job Details Page and switch to the Exceptions tab. Check whether Root Exception and Exception History are empty. If they are, go to the next step. Otherwise, try to fix all exceptions other than TimeoutException before continuing.
  3. Switch to the Pending Slots tab and check whether the pending list is empty. If it is not, the tasks that are not running may be waiting for insufficient resources.
  4. Return to the Overview Page and check whether every kind of available resource is enough to fulfill the pending slots. As shown in figure 2, each pending slot request needs <0.1 Core, 32MB Direct Memory, 256MB Heap Memory, 32MB Native Memory>. The available UserNative memory in figure 1 is zero, so the two SlotRequests cannot be fulfilled and remain pending.
  5. In another scenario, every kind of resource is sufficient in aggregate, yet the slot still cannot be allocated. This is probably due to resource fragmentation. Visit the Task Managers Page; in that case we should find that no single TaskManager has enough available resources to fulfill the pending slot requests.
  6. In both cases above, we need to increase the cluster resources (scale up/out TaskManagers) or reduce the operator resource requests.
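Steps 3 to 6 above can be condensed into a single decision sketch: given the pending slot requests and each TaskManager's available resources, decide whether the problem is insufficient aggregate resources, fragmentation, or neither. This is an illustrative Python sketch of that reasoning, not Flink code; all names are hypothetical:

```python
# Illustrative sketch of diagnosis steps 3-6 (not Flink code).

def fits(available, req):
    return all(available.get(d, 0) >= amount for d, amount in req.items())

def diagnose(pending_requests, task_managers):
    if not pending_requests:
        return "no pending slots: look for causes other than resources"
    dims = {d for req in pending_requests for d in req}
    totals = {d: sum(tm.get(d, 0) for tm in task_managers) for d in dims}
    needed = {d: sum(req.get(d, 0) for req in pending_requests) for d in dims}
    # Step 4: are the aggregate resources enough for all pending requests?
    if any(totals[d] < needed[d] for d in dims):
        return "insufficient cluster resources: scale up/out or reduce requests"
    # Step 5: totals suffice, but does any single TaskManager fit a request?
    if all(not fits(tm, req)
           for req in pending_requests for tm in task_managers):
        return "resource fragmentation: no single TaskManager fits a request"
    return "resources look sufficient: requests may be fulfilled shortly"

# The pending request from figure 2, against a TaskManager whose available
# native memory is zero (as in figure 1).
pending = [{"cpu_cores": 0.1, "heap_mb": 256, "direct_mb": 32, "native_mb": 32}]
tms = [{"cpu_cores": 1.0, "heap_mb": 1024, "direct_mb": 128, "native_mb": 0}]
print(diagnose(pending, tms))  # insufficient cluster resources: ...
```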
