Flink’s web interface provides a tab to monitor the back pressure behaviour of running jobs. If you see a back pressure warning (e.g. High) for a task, this means that it is producing data faster than the downstream operators can consume. Records in your job flow downstream (e.g. from sources to sinks), and back pressure is propagated in the opposite direction, up the stream.
Take a simple Source -> Sink job as an example. If you see a warning for Source, this means that Sink is consuming data slower than Source is producing it. Sink is back pressuring the upstream operator Source.
Back pressure monitoring works by repeatedly taking stack trace samples of your running tasks. The JobManager triggers repeated calls to Thread.getStackTrace() for the tasks of your job.
If the samples show that a task Thread is stuck in a certain internal method call (requesting buffers from the network stack), this indicates that there is back pressure for the task.
By default, the JobManager triggers 100 stack traces every 50 ms for each task in order to determine back pressure. The ratio you see in the web interface tells you how many of these stack traces were stuck in the internal method call, e.g. 0.01 indicates that only 1 in 100 was stuck in that method.
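The sampling mechanism can be sketched as follows. This is a simplified illustration, not Flink's actual implementation: the method-name check (`requestBuffer`) stands in for Flink's real internal buffer-request call, and the dummy busy-looping task thread is made up for the demo.

```java
import java.util.Arrays;

// Sketch of back pressure sampling: repeatedly take a task thread's stack
// trace and count how often it is stuck requesting network buffers. The
// method name "requestBuffer" is an illustrative stand-in for Flink's
// actual internal method.
public class BackPressureSampler {

    static boolean isStuckRequestingBuffer(StackTraceElement[] trace) {
        // Back pressure is assumed when the sampled thread sits in the
        // (hypothetical) buffer-request call of the network stack.
        return Arrays.stream(trace)
                .anyMatch(f -> f.getMethodName().contains("requestBuffer"));
    }

    static double sampleBackPressureRatio(Thread task, int numSamples,
                                          long delayMillis) throws InterruptedException {
        int stuck = 0;
        for (int i = 0; i < numSamples; i++) {
            if (isStuckRequestingBuffer(task.getStackTrace())) {
                stuck++;
            }
            Thread.sleep(delayMillis);
        }
        // e.g. 0.01 means 1 of 100 samples was stuck in the buffer request
        return (double) stuck / numSamples;
    }

    public static void main(String[] args) throws InterruptedException {
        // A dummy "task" thread that never blocks on the network stack,
        // so the measured ratio is 0.0.
        Thread task = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) { }
        });
        task.setDaemon(true);
        task.start();
        double ratio = sampleBackPressureRatio(task, 10, 5);
        System.out.println("back pressure ratio: " + ratio);
        task.interrupt();
    }
}
```

With the documented defaults (100 samples, 50 ms apart), one measurement takes roughly 5 seconds, which matches the completion time mentioned below.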
To avoid overloading the task managers with stack trace samples, the web interface refreshes samples only after 60 seconds.
You can configure the sampling behaviour of the job manager with the following configuration keys:

- jobmanager.web.backpressure.refresh-interval: Time after which available stats are deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
- jobmanager.web.backpressure.num-samples: Number of stack trace samples to take to determine back pressure (DEFAULT: 100).
- jobmanager.web.backpressure.delay-between-samples: Delay between stack trace samples to determine back pressure (DEFAULT: 50, 50 ms).
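For reference, the defaults above correspond to the following entries in flink-conf.yaml (the values shown are simply the documented defaults):

```yaml
# Stats are deprecated and refreshed after this interval (ms).
jobmanager.web.backpressure.refresh-interval: 60000
# Number of stack trace samples per back pressure measurement.
jobmanager.web.backpressure.num-samples: 100
# Delay between consecutive stack trace samples (ms).
jobmanager.web.backpressure.delay-between-samples: 50
```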
You can find the Back Pressure tab next to the job overview. Opening it makes the JobManager trigger a stack trace sample of the running tasks; with the default configuration, this takes about 5 seconds to complete. Note that by clicking a row, you trigger the sample for all subtasks of this operator.
If you see status OK for the tasks, there is no indication of back pressure. HIGH on the other hand means that the tasks are back pressured.
Stack trace samples provide a relatively accurate measure of back pressure. However, they require users to view the back pressure status of tasks belonging to different job vertices separately. For large jobs with many tasks, it may not be easy to identify the cause of back pressure at first glance.
To help identify the tasks causing back pressure more easily, we also show the inPoolUsage and outPoolUsage in the web UI. They reflect the estimated usage of the input and output buffers (Metrics). If a task is back pressured, buffers accumulate in its output queue and in the downstream tasks’ input queues, so it is very likely that its outPoolUsage and the downstream tasks’ inPoolUsage become 100%. Generalized to a chain of tasks, the first task whose inPoolUsage is 100% and whose outPoolUsage is not 100% usually causes the back pressure.
However, sometimes there may be no back pressure even when outPoolUsage reaches 100%, because buffers may still be being written by the serializer or may be cached in the input channels. Therefore, inPoolUsage and outPoolUsage are better suited to narrowing down the suspicious tasks first; you can then pinpoint the specific tasks causing back pressure with stack trace sampling.
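The chain heuristic above can be sketched as follows. The task names and usage values are made up for illustration; in practice you would read inPoolUsage and outPoolUsage from the metrics shown in the web UI.

```java
import java.util.List;

// Sketch of the heuristic: in a chain of tasks, the first task whose
// inPoolUsage is at 100% but whose outPoolUsage is not usually causes
// the back pressure. Task names and usage values are illustrative.
public class BackPressureLocator {

    record TaskUsage(String name, double inPoolUsage, double outPoolUsage) { }

    static String findSuspect(List<TaskUsage> chain) {
        for (TaskUsage t : chain) {
            if (t.inPoolUsage() >= 1.0 && t.outPoolUsage() < 1.0) {
                return t.name();
            }
        }
        return null; // no clear culprit; fall back to stack trace sampling
    }

    public static void main(String[] args) {
        List<TaskUsage> chain = List.of(
                new TaskUsage("Source", 0.0, 1.0), // output full: back pressured
                new TaskUsage("Map",    1.0, 1.0), // input and output full
                new TaskUsage("Window", 1.0, 0.2), // input full, output not: suspect
                new TaskUsage("Sink",   0.3, 0.0));
        System.out.println("suspect: " + findSuspect(chain));
    }
}
```

Here the Window task is flagged: its input buffers are exhausted while its output buffers are not, so it is consuming slower than its upstream produces without itself being back pressured from downstream.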