HiBench is regarded as a leading benchmark in the BigData world according to this webpage. But while trying to run it in my test HDP 2.6.5 or 3.1 environment, I found it very frustrating, particularly when the cluster is secured by Kerberos.
Because the project seems to have been abandoned two years ago, I started patching the code manually, as described here. But at some point I came to a dead end. The Kafka 0.8 client does not support Kerberos. Kerberos support was added in the Kafka 0.10 client, but Kafka 0.10 is not backward compatible, and moving to it required redeveloping a substantial part of the Kafka-related code in Java and Scala.
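To give an idea of what Kerberos support means on the client side, here is a minimal sketch of the settings a newer Kafka client takes for a SASL/GSSAPI-secured broker. The class name `KerberosKafkaProps`, the broker address and the service name `kafka` are my assumptions for illustration only. This is not code from HiBench or from the fork, and the exact values depend on your cluster.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of HiBench or the fork.
final class KerberosKafkaProps {

    // Builds client settings for a broker secured with SASL/GSSAPI (Kerberos).
    static Properties build(String bootstrapServers, String serviceName) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // SASL_PLAINTEXT: Kerberos authentication without TLS on the wire.
        props.setProperty("security.protocol", "SASL_PLAINTEXT");
        // GSSAPI is the SASL mechanism used for Kerberos.
        props.setProperty("sasl.mechanism", "GSSAPI");
        // Must match the primary of the broker's Kerberos principal.
        props.setProperty("sasl.kerberos.service.name", serviceName);
        return props;
    }

    public static void main(String[] args) {
        // Assumed host and port; replace with your own broker list.
        Properties props = build("broker1.example.com:6667", "kafka");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would be passed to a producer or consumer constructor. The client also needs a JAAS login configuration (keytab or ticket cache), typically supplied through the `java.security.auth.login.config` system property.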
Solution
So finally I decided to make my own fork of the HiBench project and split completely from the trunk. The result is a new GitHub project here.
Main changes include:
- Kerberos support
- Kafka 2.0 implemented
- HDP 3.1/Apache Hadoop 3.1 support
- Hive 3.1 support. The standard HiBench was developed for Hive 0.14 and does not talk to 3.1
Features lost compared to standard HiBench
- Only HDP is supported and tested. Standalone Spark, Apache Hadoop, Kafka, etc. are not supported.
- Only HDP 2.6.5 and HDP 3.1 are supported. Support for older versions of HDP, Spark and Scala has been dropped.
- Support for Apache Flink and Gearpump is removed.
Tests
The project was tested on HDP 3.1 and HDP 2.6.5 in two environments: the tiny profile in a local KVM cluster, and the tiny/small/large profiles in a larger multi-host cluster.
Future plans
- Review the Kafka streaming benchmark. While analyzing the code I found some mysteries there
- The same for the DFSIO benchmark test
- Test in a cluster secured with wire encryption