In Distributed Runtime Environment, we learned some basic concepts about task slots and resources. TaskManager Resource gives more information about TaskManager resource configuration and explains how to specify resources for each operator. From these we conclude that quantitative resource management helps a lot in avoiding OutOfMemory errors. At the same time, however, it raises the bar for debugging job scheduling problems, so we enhanced some pages of the Flink dashboard to make debugging easier.
As shown in the figure above, the resource section of the Flink cluster overview shows total/available slots and total/available resources (CPU cores, heap memory, direct memory, native memory, managed memory and network memory).
Each SlotRequest has a corresponding ResourceProfile, which describes the resource requirements of its tasks. As shown in the figure above, all requested but unfulfilled slots appear in a list of pending slots, together with the request time and the requested resources of each pending slot.
Each TaskManager represents a subset of the resources of the Flink cluster. A slot can only be assigned to one TaskManager; it cannot be allocated across multiple TaskManagers. When we submit a job to an existing session, a pending SlotRequest may therefore stay unallocated because of resource fragmentation, even though the Flink cluster as a whole has enough resources to fulfill it. The TaskManagers resources list (including total/available amounts of all kinds of resources) helps diagnose this situation.
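The fragmentation case described above can be illustrated with a small sketch. It is not Flink code: the `Resources` class, its field names, the `diagnose` function and all numbers are hypothetical, chosen only to show why a cluster-wide sum can look sufficient while no single TaskManager can host the slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector; field names are illustrative, not Flink's."""
    cpu_cores: float
    heap_mb: int
    direct_mb: int
    native_mb: int

    def covers(self, request: "Resources") -> bool:
        # Every dimension must be large enough on its own.
        return (self.cpu_cores >= request.cpu_cores
                and self.heap_mb >= request.heap_mb
                and self.direct_mb >= request.direct_mb
                and self.native_mb >= request.native_mb)

    def plus(self, other: "Resources") -> "Resources":
        # Dimension-wise sum, used to build the cluster-wide view.
        return Resources(self.cpu_cores + other.cpu_cores,
                         self.heap_mb + other.heap_mb,
                         self.direct_mb + other.direct_mb,
                         self.native_mb + other.native_mb)


def diagnose(available_per_tm: dict, request: Resources) -> str:
    """Classify one pending request against per-TaskManager availability."""
    # A slot lives on exactly one TaskManager, so one of them must cover it alone.
    if any(avail.covers(request) for avail in available_per_tm.values()):
        return "allocatable"
    total = Resources(0.0, 0, 0, 0)
    for avail in available_per_tm.values():
        total = total.plus(avail)
    # The sum covers the request but no single TaskManager does: fragmentation.
    return "fragmented" if total.covers(request) else "insufficient"


# Made-up availability for two TaskManagers and one pending request.
available = {
    "tm-1": Resources(0.5, 128, 32, 32),
    "tm-2": Resources(0.5, 128, 32, 32),
}
request = Resources(0.5, 256, 32, 32)
result = diagnose(available, request)
print(result)  # fragmented
```

Here the cluster has 256 MB of free heap in total, but only 128 MB on each TaskManager, so the request stays pending. This is the situation that only the per-TaskManager list reveals.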
1. Go to the Running Jobs Page of the Flink dashboard to check whether all tasks are running. If not, go to the next step.
2. Go to the Job Details Page and switch to the Exceptions tab. Check whether the Exception History is empty. If it is empty, go to the next step. Otherwise, try to fix all of these exceptions besides
3. Switch to the Pending Slots tab and check whether the pending list is empty. If not, the tasks that are not running may be waiting because of insufficient resources.
4. Go to the Overview Page and check whether all kinds of available resources are enough to fulfill the pending slots. As shown in figure 2, each pending slot request needs the resources <0.1 Core, 32MB Direct Memory, 256MB Heap Memory, 32MB Native Memory>. The available UserNative memory is zero in figure 1, so the two SlotRequests cannot be fulfilled and remain pending.
5. Go to the Task Managers Page, where we should find that no TaskManager has enough available resources to fulfill the pending slot requests.
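The comparison made on the Overview Page can be sketched as a per-dimension check. This is only an illustration under stated assumptions: the function `lacking_dimensions` and the dictionary keys are hypothetical, the request values are the ones quoted above, and of the available amounts only the zero native memory comes from the text (the other values are made up).

```python
def lacking_dimensions(available: dict, request: dict) -> list:
    """Return the resource kinds whose available amount is below the request."""
    return [kind for kind, needed in request.items()
            if available.get(kind, 0) < needed]


# Per-slot request quoted in the steps above.
request = {"cpu_cores": 0.1, "direct_mb": 32, "heap_mb": 256, "native_mb": 32}

# Only native_mb = 0 is taken from the text; the other amounts are assumed.
cluster_available = {"cpu_cores": 2.0, "direct_mb": 128,
                     "heap_mb": 1024, "native_mb": 0}

lacking = lacking_dimensions(cluster_available, request)
print(lacking)  # ['native_mb']
```

A single lacking dimension is enough to keep a request pending, however plentiful the other resource kinds are.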