Blink的一个bug

tammypi tammypi
2019-01-30 17:59
468
2

目前在使用过程中,发现在application在yarn上运行时,如果table转为stream,且并行度大于1时。代码如下:

就会出问题,报出的错误如下:

org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: f5e4f7243d06035202e8fa250c364304)
        at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:276)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:482)
        at org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeInternal(StreamContextEnvironment.java:85)
        at org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeInternal(StreamContextEnvironment.java:37)
        at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1893)
        at com.ngengine.main.KafkaMergeMain.startApp(KafkaMergeMain.java:352)
        at com.ngengine.main.KafkaMergeMain.main(KafkaMergeMain.java:94)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:561)
        at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:445)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:419)
        at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:786)
        at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
        at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1029)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1105)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1105)
Caused by: java.util.concurrent.TimeoutException
        at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

 

注意,上面的代码在flink-1.7.0上跑是ok的,完全没有问题的。

但是github上阿里没有留下自己的联系方式,也没有issue提报的地方。ORZ。

 

附上情况的更新:

背景:

我提交的命令是:

./flink run -m yarn-cluster -ynm FLINK_NG_ENGINE_1 -ys 4 -yn 10 -ytm 5120 -p 40 -c XXMain ~/xx.jar

报错是所有的task都处于一段时间的create状态,然后再报出Concurrent.Timeout的错误。如下图所示:

 

1.Flink1.5版本是存在过一个资源不够却报出Concurrent.Timeout的错误的bug的。估计在Blink里也是存在的。

https://issues.apache.org/jira/browse/FLINK-9082

 

2.根据阿里伍 翀邮件的回复“blink 是不会共享 slot的, flink 是会共享slot的。 所以如果他的job 会有 > 2的非chain节点,就会用超过 40 个slot。看你的截图,是有5个非chain节点,所以实际申请slot数应该超过了40,所以资源不够。”

所以这个报错是因为资源不够导致的。

目前猜测可能5个非chain task需要5*40(并行度)=200个slot才够(此条尚未得到答复)。

但是测试发现如果是100个slots也是可以运行的,这个需要的slots数目真的很迷。

相关issue,链接:https://issues.apache.org/jira/browse/FLINK-11484。里面blink开发的回复与伍 翀的回复也存在不一致的地方。

 

 

发表评论

验证码:
2019-01-31 10:06
您可以尝试联系阿里的伍 翀(flink commiter),邮箱:imjark#gmail.com
tammypi
tammypi
2019-01-31 10:43
回复 @ :感谢。
2019-01-31 10:06
您可以尝试联系阿里的伍 翀(flink commiter),邮箱:imjark#gmail.com