I spent several sleepless nights trying to resolve an issue with MapReduce jobs in HDP 3.1. While running the MapReduce Service Check job, an error was reported in the log file although the job passed in green. The same message came up in other MapReduce jobs.
{"exception":"WebApplicationException","message":"org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 252 actions: IOException: 252 times, servers with issues: null","javaClassName":"javax.ws.rs.WebApplicationException"}
[Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1572998114693_0003 org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.hand
at org.apache.hadoop.hbase.client.ConnectionImplementation.get(ConnectionImplementation.java:2002)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateMeta(ConnectionImplementation.java:762)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:729)
at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:707)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:911)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:325)
... 17 more
The problem is related to Yarn Timeline Server v2. The service uses HBase as its backing storage, and it looked as if this embedded HBase server had not started or was not working properly. HBase keeps its runtime configuration data, such as the list of active HBase Region Servers, in Zookeeper znodes.
Also, the znode /atsv2-hbase-secure did not exist.
ls /
[hive, cluster, brokers, hbase-unsecure, kafka-acl, kafka-acl-changes, admin, isr_change_notification, atsv2-hbase-secure, log_dir_event_notification, kafka-acl-extended, rmstore, hbase-secure, kafka-acl-extended-changes, consumers, latest_producer_id_block, registry, controller, zookeeper, delegation_token, hiveserver2, controller_epoch, hiveserver2-leader, atsv2-hbase-unsecure, ambari-metrics-cluster, config, ams-hbase-unsecure, ams-hbase-secure]
The embedded HBase service runs as a Yarn service named ats-hbase. I started to examine the local container directory that ats-hbase was running in. The directory has the form:
/data/hadoop/yarn/local/usercache/yarn-ats/appcache/application_1573000895404_0001/container_e82_1573000895404_0001_01_000003
It turned out that the local hbase-site.xml configuration file there defines the znode as atsv2-hbase-unsecure:
<property>
<name>zookeeper.znode.parent</name>
<value>/atsv2-hbase-unsecure</value>
</property>
So the embedded HBase was storing its runtime configuration data in the wrong Zookeeper znode. But how could this happen, when the Yarn configuration parameter defines the znode as atsv2-hbase-secure, which is where the other services expect to find the embedded HBase runtime data? Moreover, the HDFS file /user/yarn-ats/3.1.0.0-78/hbase-site.xml, the source file used by the Application Master to create the local container directory, contains the valid znode value.
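The mismatch described above can be checked mechanically. The following is only a minimal Python sketch: the function name znode_parent and the sample XML string are made up for illustration, and the sample is modeled on the property found in the container's local hbase-site.xml. In practice you would read the real file from the container directory instead.

```python
import xml.etree.ElementTree as ET


def znode_parent(hbase_site_xml: str):
    """Return the value of zookeeper.znode.parent, or None if it is not set."""
    root = ET.fromstring(hbase_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "zookeeper.znode.parent":
            return prop.findtext("value", default="").strip()
    return None


# Illustrative sample, modeled on the local container copy of hbase-site.xml.
LOCAL_COPY = """<configuration>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/atsv2-hbase-unsecure</value>
  </property>
</configuration>"""

# The znode that the Yarn configuration defines in this cluster.
EXPECTED = "/atsv2-hbase-secure"

actual = znode_parent(LOCAL_COPY)
if actual != EXPECTED:
    print(f"znode mismatch: container uses {actual}, Yarn expects {EXPECTED}")
```

Running the same check against both the container copy and the HDFS source file makes it obvious which of the two carries the wrong value.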
Solution
After closer examination of the container log files, I discovered the following entries:
INFO provider.ProviderUtils: Added file for localization: conf/hbase-policy.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-policy.xml
INFO provider.ProviderUtils: Added file for localization: conf/core-site.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/core-site.xml
INFO provider.ProviderUtils: Added file for localization: conf/hbase-site.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-site.xml
INFO provider.ProviderUtils: Added file for localization: conf/log4j.properties -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/log4j.properties
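When the container log is long, the source paths can be pulled out of such lines with a short script. This is a rough Python sketch, not a tool that ships with Hadoop: the regular expression is derived only from the log lines quoted above, and the names LOCALIZATION, localization_sources and SAMPLE_LOG are hypothetical.

```python
import re

# Matches the ProviderUtils lines quoted above:
# "Added file for localization: <local name> -> <HDFS source path>"
LOCALIZATION = re.compile(r"Added file for localization: (\S+) -> (\S+)")


def localization_sources(log_text: str) -> dict:
    """Map each localized file name to the HDFS path it was copied from."""
    return dict(LOCALIZATION.findall(log_text))


# Sample built from two of the log entries quoted above.
SAMPLE_LOG = (
    "INFO provider.ProviderUtils: Added file for localization: conf/core-site.xml -> "
    "/user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/core-site.xml\n"
    "INFO provider.ProviderUtils: Added file for localization: conf/hbase-site.xml -> "
    "/user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-site.xml\n"
)

sources = localization_sources(SAMPLE_LOG)
print(sources["conf/hbase-site.xml"])
```

The printed path shows which HDFS copy of hbase-site.xml actually ends up in the container.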
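The configuration files were localized from the /user/yarn-ats/.yarn HDFS directory, not from /user/yarn-ats/3.1.0.0-78, so the container was apparently picking up a stale copy of hbase-site.xml. In the end, the solution was quite simple: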
- kill the ats-hbase Yarn application
- stop the Yarn service
- remove /user/yarn-ats/.yarn HDFS directory
- start the Yarn service
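For reference, the steps above can be written down as a small dry-run plan. This Python sketch only prints the plan and runs nothing. Mapping the steps to `yarn application -kill` and `hdfs dfs -rm -r` is my assumption about how they could be performed from the command line; the function name cleanup_plan is hypothetical, the application id is the one visible in the container path earlier, and stopping and starting the Yarn service is left as a manual step.

```python
def cleanup_plan(app_id: str, stale_dir: str = "/user/yarn-ats/.yarn"):
    """Return the steps in order; manual steps carry no command (None)."""
    return [
        ("kill the ats-hbase Yarn application",
         ["yarn", "application", "-kill", app_id]),
        ("stop the Yarn service", None),   # manual step
        ("remove the stale HDFS directory",
         ["hdfs", "dfs", "-rm", "-r", stale_dir]),
        ("start the Yarn service", None),  # manual step
    ]


# Dry run: print each step with its command, nothing is executed.
for description, command in cleanup_plan("application_1573000895404_0001"):
    print(description, "->", " ".join(command) if command else "(manual)")
```

The HDFS removal has to be run by a user with write access to /user/yarn-ats, and the order matters: the directory is removed while Yarn is stopped.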