Blog of the JavaHotel Open Source project

Wednesday, 6 November 2019

Hortonworks, Yarn Timeline server v.2

Problem
I spent several sleepless nights trying to resolve an issue with MapReduce jobs in HDP 3.1. While running the MapReduce Service Check job, an error was reported in the log file although the job itself passed green. The same message came up in other MapReduce jobs.
ERROR [pool-10-thread-1] org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl: Response from the timeline server is not successful, HTTP error code: 500, Server response:
{"exception":"WebApplicationException","message":"org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 252 actions: IOException: 252 times, servers with issues: null","javaClassName":"javax.ws.rs.WebApplicationException"}
[Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1572998114693_0003 org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.hand
A more detailed message could be found in the Yarn log files (/var/log/hadoop-yarn/yarn):
Caused by: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /atsv2-hbase-secure/meta-region-server
at org.apache.hadoop.hbase.client.ConnectionImplementation.get(ConnectionImplementation.java:2002)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateMeta(ConnectionImplementation.java:762)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:729)
at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:707)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:911)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:325)
... 17 more
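The missing path already tells which parent znode the timeline client expects: by default an HBase client finds the hbase:meta region through a child znode named meta-region-server under the configured zookeeper.znode.parent. A minimal Python sketch, with hypothetical helper names that are not part of any Hadoop tool, that extracts the missing path from a log excerpt and derives that parent:

```python
import re

# Matches the path in ZooKeeper's "KeeperErrorCode = NoNode for <path>" message.
NO_NODE_RE = re.compile(r"KeeperErrorCode = NoNode for (\S+)")


def missing_znodes(log_text: str) -> list[str]:
    """Return every znode path reported as missing (NoNode) in the log text."""
    return NO_NODE_RE.findall(log_text)


def parent_znode(path: str) -> str:
    """Strip the last path component, e.g. the meta-region-server child."""
    return path.rsplit("/", 1)[0] or "/"


if __name__ == "__main__":
    excerpt = ("org.apache.zookeeper.KeeperException$NoNodeException: "
               "KeeperErrorCode = NoNode for /atsv2-hbase-secure/meta-region-server")
    for path in missing_znodes(excerpt):
        # Prints the missing znode and the parent znode the client expects.
        print(path, "-> parent:", parent_znode(path))
```

Applied to the message above, the expected parent is /atsv2-hbase-secure.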

The problem is related to the Yarn Timeline Service v.2. The service uses HBase as its backend storage, and it looked as if this embedded HBase server had not started or was not working properly. HBase keeps its runtime data, such as the location of the meta region and the active Region Servers, in ZooKeeper znodes.
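Also, the znode the client was looking for, /atsv2-hbase-secure/meta-region-server, did not exist, although /atsv2-hbase-secure itself shows up in the root listing: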
zkCli.sh
ls /

[hive, cluster, brokers, hbase-unsecure, kafka-acl, kafka-acl-changes, admin, isr_change_notification, atsv2-hbase-secure, log_dir_event_notification, kafka-acl-extended, rmstore, hbase-secure, kafka-acl-extended-changes, consumers, latest_producer_id_block, registry, controller, zookeeper, delegation_token, hiveserver2, controller_epoch, hiveserver2-leader, atsv2-hbase-unsecure, ambari-metrics-cluster, config, ams-hbase-unsecure, ams-hbase-secure]
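The root listing alone does not show which parent holds live HBase data. One way to check is to look for the meta-region-server child under each candidate. A sketch in Python using the third-party kazoo ZooKeeper client; the host name is a placeholder, the script name check_meta.py is hypothetical, and it assumes the znodes can be read without SASL authentication:

```python
import sys
from typing import Callable, Dict, Iterable, Optional

META_CHILD = "meta-region-server"  # HBase's default name for the meta location znode
CANDIDATES = ["/atsv2-hbase-secure", "/atsv2-hbase-unsecure"]


def meta_znode_status(exists: Callable[[str], Optional[object]],
                      parents: Iterable[str]) -> Dict[str, bool]:
    """For each candidate parent znode, report whether HBase has published
    the meta region location under it."""
    return {p: exists(f"{p}/{META_CHILD}") is not None for p in parents}


def check_cluster(hosts: str) -> Dict[str, bool]:
    """Query a live ZooKeeper ensemble (requires: pip install kazoo)."""
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        # KazooClient.exists returns a ZnodeStat, or None if the znode is missing.
        return meta_znode_status(zk.exists, CANDIDATES)
    finally:
        zk.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_meta.py zk-host.example.com:2181
    for parent, present in check_cluster(sys.argv[1]).items():
        print(f"{parent}/{META_CHILD} exists: {present}")
```

meta_znode_status takes the lookup function as a parameter, so the logic can be exercised without a running cluster.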

The embedded HBase runs as a Yarn service named ats-hbase. I started to examine the local container directory ats-hbase is running in. It is a directory of the form:
/data/hadoop/yarn/local/usercache/yarn-ats/appcache/application_1573000895404_0001/container_e82_1573000895404_0001_01_000003
It turned out that the local hbase-site.xml configuration file defines the znode as /atsv2-hbase-unsecure:
   <property>
      <name>zookeeper.znode.parent</name>
      <value>/atsv2-hbase-unsecure</value>
   </property>

So the embedded HBase was storing its runtime data in the wrong ZooKeeper znode. But how could this happen, when the Yarn configuration defines the znode as /atsv2-hbase-secure, which is where the other services expect to find the embedded HBase runtime data? Moreover, the HDFS file /user/yarn-ats/3.1.0.0-78/hbase-site.xml, which is the source file the Application Master uses to populate the local container directory, contains the correct znode value.
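Comparing the container copy of hbase-site.xml with the HDFS source file makes such a mismatch visible at a glance. A minimal Python sketch using only the standard library; the script name compare_conf.py is hypothetical, and it only handles plain property entries (no XInclude):

```python
import sys
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple, Union


def read_hadoop_conf(xml_data: Union[str, bytes]) -> Dict[str, str]:
    """Parse a Hadoop-style *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_data)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name.strip()] = (prop.findtext("value") or "").strip()
    return conf


def diff_conf(a: Dict[str, str],
              b: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (value in a, value in b)} for every property that differs."""
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


def main(argv: list) -> None:
    if len(argv) != 3:
        print("usage: compare_conf.py <container hbase-site.xml> <HDFS hbase-site.xml copy>")
        return
    # Read as bytes so the parser honors the encoding in the XML declaration.
    with open(argv[1], "rb") as local, open(argv[2], "rb") as source:
        differences = diff_conf(read_hadoop_conf(local.read()),
                                read_hadoop_conf(source.read()))
    for name, (container_value, hdfs_value) in differences.items():
        print(f"{name}: container={container_value!r} hdfs={hdfs_value!r}")


if __name__ == "__main__":
    main(sys.argv)
```

The HDFS file can first be copied locally with hdfs dfs -get /user/yarn-ats/3.1.0.0-78/hbase-site.xml hdfs-hbase-site.xml. In the situation described here, zookeeper.znode.parent would be expected to show up in the output, with /atsv2-hbase-unsecure on the container side.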
Solution
After a closer examination of the container log files, I discovered the following entries:
INFO provider.ProviderUtils: Added file for localization: conf/hadoop-metrics2-hbase.properties -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hadoop-metrics2-hbase.properties
INFO provider.ProviderUtils: Added file for localization: conf/hbase-policy.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-policy.xml
INFO provider.ProviderUtils: Added file for localization: conf/core-site.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/core-site.xml
INFO provider.ProviderUtils: Added file for localization: conf/hbase-site.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-site.xml
INFO provider.ProviderUtils: Added file for localization: conf/log4j.properties -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/log4j.properties
So it turned out that the Application/Service Master does not use the /user/yarn-ats/3.1.0.0-78/ HDFS path but another, hidden /user/yarn-ats/.yarn/ directory to populate the local container, so the hbase-site.xml kept there was the copy with the /atsv2-hbase-unsecure value.
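Listing the source of every localized file is a quick way to spot this. A minimal Python sketch that extracts the ProviderUtils "Added file for localization" entries from a container log; the regular expression assumes the message format shown above, and the script name localized_sources.py is hypothetical:

```python
import re
import sys
from typing import Dict

# "Added file for localization: <container path> -> <HDFS source path>"
LOCALIZATION_RE = re.compile(r"Added file for localization: (\S+) -> (\S+)")


def localized_sources(log_text: str) -> Dict[str, str]:
    """Map each localized container file (e.g. conf/hbase-site.xml) to its HDFS source."""
    return dict(LOCALIZATION_RE.findall(log_text))


def main(argv: list) -> None:
    if len(argv) != 2:
        print("usage: localized_sources.py <container log file>")
        return
    with open(argv[1], encoding="utf-8", errors="replace") as log:
        sources = localized_sources(log.read())
    for local_path, hdfs_path in sorted(sources.items()):
        # Flag files that come from the hidden .yarn service directory.
        marker = "  (hidden .yarn directory)" if "/.yarn/" in hdfs_path else ""
        print(f"{local_path} <- {hdfs_path}{marker}")


if __name__ == "__main__":
    main(sys.argv)
```

Because the pattern is applied with findall, it also copes with several entries fused onto one log line.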
Finally, the solution was quite simple:
  • kill the ats-hbase Yarn application
  • stop the Yarn service
  • remove the /user/yarn-ats/.yarn HDFS directory
  • start the Yarn service
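The four steps above can be scripted roughly as follows. This is a hedged Python sketch, not the exact commands used in this post: it assumes that stopping the service with yarn app -stop ats-hbase is an acceptable way to kill it (yarn application -kill with the application id is the lower-level alternative), that stopping and starting Yarn is done by hand in Ambari, and that the commands run as a user allowed to manage the ats-hbase service and delete the directory (on a Kerberized cluster, for example after kinit as the yarn-ats principal):

```python
import subprocess
import sys
from typing import List, Optional, Tuple

# Each step: (description, command), where None marks a manual Ambari action.
Step = Tuple[str, Optional[List[str]]]

STEPS: List[Step] = [
    ("Stop the ats-hbase YARN service", ["yarn", "app", "-stop", "ats-hbase"]),
    ("Stop the YARN service (e.g. in Ambari)", None),
    ("Remove the hidden service directory",
     ["hdfs", "dfs", "-rm", "-r", "/user/yarn-ats/.yarn"]),
    ("Start the YARN service (e.g. in Ambari)", None),
]


def run(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Print the steps and, unless dry_run is set, execute them in order.
    Returns the printed lines."""
    lines = []
    for number, (description, command) in enumerate(steps, start=1):
        shown = " ".join(command) if command else "manual step"
        line = f"{number}. {description}: {shown}"
        print(line)
        lines.append(line)
        if dry_run:
            continue
        if command:
            subprocess.run(command, check=True)  # stop on the first failing command
        else:
            input("   Press Enter when done... ")
    return lines


if __name__ == "__main__":
    # Dry run by default; pass --run to execute the commands.
    run(STEPS, dry_run="--run" not in sys.argv[1:])
```

Without arguments the script only prints the plan; with --run it executes the two shell commands and waits for confirmation at each manual Ambari step.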