Blog of the Open Source JavaHotel project

Tuesday, December 31, 2019

MyTPC-DS

An enhancement to my implementation of the TPC-DS test. The BigSQL test can now be conducted with the jsqsh utility, so the dependency on the DB2 client software is removed. Jsqsh is part of the BigSQL package and no additional software is required. How to use jsqsh for MyTPC-DS is described here.

I also added the results of the TPC-DS test conducted on HDP 3.1. The results are presented as bar plots to make them easier to understand. The graphics are created with the help of matplotlib. The Python source code is uploaded here. The input tables are taken directly from the GitHub Wiki page. To pick up the correct table, a simple HTML tag is added to the page before every table.

The tests were executed using a 100 GB data set and cannot be the basis of any far-reaching conclusions. But one apparent difference between HDP 2.6 and 3.1 is the significant performance improvement of Hive, from Hive 2.1 to Hive 3.1. Hive is now almost as fast as BigSQL and ahead of SparkSQL. Also, the query coverage of Hive is much more comprehensive, rising from 50% (Hive 2.1) to almost 100% (Hive 3.1).

Useful links
  • MyTPC-DS execution framework, link
  • TPC-DS results using Bar Plot graphics, link
  • TPC-DS results analytics tables, link 
  • Python source code to compile the test result and prepare GitHub Wiki table, link
  • Python source code to prepare graphics using matplotlib package, link
  • Run BigSQL TPC-DS test using jsqsh tool, link


Wednesday, November 6, 2019

Hortonworks, Yarn Timeline server v.2

Problem
I spent several sleepless nights trying to resolve an issue with MapReduce jobs in HDP 3.1. While the MapReduce Service Check job was running, an error was reported in the log file although the job passed in green. The same message came up in other MapReduce jobs.
ERROR [pool-10-thread-1] org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl: Response from the timeline server is not successful, HTTP error code: 500, Server response:
{"exception":"WebApplicationException","message":"org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 252 actions: IOException: 252 times, servers with issues: null","javaClassName":"javax.ws.rs.WebApplicationException"}
[Job ATS Event Dispatcher] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Failed to process Event JOB_FINISHED for the job : job_1572998114693_0003 org.apache.hadoop.yarn.exceptions.YarnException: Failed while publishing entity
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:548)
at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForNewTimelineService(JobHistoryEventHandler.java:1405)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleTimelineEvent(JobHistoryEventHandler.java:742)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.access$1200(JobHistoryEventHandler.java:93)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$ForwardingEventHandler.hand
In the Yarn log file (/var/log/hadoop-yarn/yarn) a more detailed message could be found.
Caused by: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /atsv2-hbase-secure/meta-region-server
at org.apache.hadoop.hbase.client.ConnectionImplementation.get(ConnectionImplementation.java:2002)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateMeta(ConnectionImplementation.java:762)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:729)
at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:707)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:911)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:325)
... 17 more

The problem is related to the Yarn Timeline server v.2. The service uses HBase as its backing storage, and it looked as if this embedded HBase server was not started or not working properly. HBase keeps its runtime configuration data, such as the active HBase Region Servers, in Zookeeper znodes.
Also, the znode /atsv2-hbase-secure did not exist.
zkCli.sh
ls /

[hive, cluster, brokers, hbase-unsecure, kafka-acl, kafka-acl-changes, admin, isr_change_notification, atsv2-hbase-secure, log_dir_event_notification, kafka-acl-extended, rmstore, hbase-secure, kafka-acl-extended-changes, consumers, latest_producer_id_block, registry, controller, zookeeper, delegation_token, hiveserver2, controller_epoch, hiveserver2-leader, atsv2-hbase-unsecure, ambari-metrics-cluster, config, ams-hbase-unsecure, ams-hbase-secure]

The embedded HBase service runs as a Yarn service named ats-hbase. I started to examine the local container the ats-hbase is running in. It is a directory like:
/data/hadoop/yarn/local/usercache/yarn-ats/appcache/application_1573000895404_0001/container_e82_1573000895404_0001_01_000003. It turned out that the local hbase-site.xml configuration file defines the znode as atsv2-hbase-unsecure:
   <property>
      <name>zookeeper.znode.parent</name>
      <value>/atsv2-hbase-unsecure</value>
   </property>

So the embedded HBase was storing its configuration under the wrong Zookeeper znode. But how could it happen when the Yarn configuration parameter defines the znode as atsv2-hbase-secure, which is where the other services expect to find the embedded HBase runtime data? Also, the HDFS /user/yarn-ats/3.1.0.0-78/hbase-site.xml file, the source file used by the Application Master to create the local container directory, contains a valid znode value.
Solution
After a closer examination of the container log files, I discovered the following entries:
INFO provider.ProviderUtils: Added file for localization: conf/hadoop-metrics2-hbase.properties -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hadoop-metrics2-hbase.properties
INFO provider.ProviderUtils: Added file for localization: conf/hbase-policy.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-policy.xml
INFO provider.ProviderUtils: Added file for localization: conf/core-site.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/core-site.xml
INFO provider.ProviderUtils: Added file for localization: conf/hbase-site.xml -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/hbase-site.xml
INFO provider.ProviderUtils: Added file for localization: conf/log4j.properties -> /user/yarn-ats/.yarn/services/ats-hbase/components/1.0.0/master/master-0/log4j.properties
So it turned out that the Application/Service Master uses not the /user/yarn-ats/3.1.0.0-78/ HDFS path but another, hidden /user/yarn-ats/.yarn/ directory to create the local container.
Finally, the solution was quite simple (a hedged command sketch follows the list):
  • kill the ats-hbase Yarn application
  • stop the Yarn service
  • remove  /user/yarn-ats/.yarn HDFS directory
  • start the Yarn service
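
A minimal command sketch of these steps, assuming an Ambari-managed cluster and the default yarn-ats service user; Yarn itself is stopped and started from Ambari:
# destroy the ats-hbase Yarn service definition (run as the yarn-ats user; kinit first on a Kerberized cluster)
sudo -u yarn-ats yarn app -destroy ats-hbase
# with the Yarn service stopped from Ambari, remove the stale hidden service directory
sudo -u hdfs hdfs dfs -rm -r -skipTrash /user/yarn-ats/.yarn
# start the Yarn service from Ambari again; ats-hbase is recreated from the current configuration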

Wednesday, October 30, 2019

HDP, Kafka, LEADER_NOT_AVAILABLE

Problem
After HDP cluster kerberization, Kafka does not work even though the Kafka Healthcheck passes green. Any execution of kafka-console-producer.sh ends up with an error message:
/usr/hdp/3.1.0.0-78/kafka/bin/kafka-console-producer.sh --broker-list kafka-host:6667 --producer-property security.protocol=SASL_PLAINTEXT --topic xxx

WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 1 : {xxxx=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 2 : {xxxx=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 3 : {xxxx=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
kafka-topics.sh reports that the leader is not assigned to the partition.
/usr/hdp/3.1.0.0-78/kafka/bin/kafka-topics.sh -zookeeper zookeeper-host:2181 --describe xxx --unavailable-partitions

Topic: ambari_kafka_service_check Partition: 0 Leader: 1001 Replicas: 1001 Isr: 1001
Topic: identity Partition: 0 Leader: none Replicas: 1002 Isr:
Topic: xxx Partition: 0 Leader: none Replicas: 1002 Isr:
Topic: xxxx Partition: 0 Leader: none Replicas: 1002 Isr:

There is nothing special in the Kafka /var/log/kafka/server.log file. Only /var/log/kafka/controller.log suggests that something went wrong:
cat controller.log

INFO [ControllerEventThread controllerId=1002] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
ERROR [ControllerEventThread controllerId=1002] Error processing event Startup (kafka.controller.ControllerEventManager$ControllerEventThread)
java.lang.NullPointerException
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2571)
at kafka.utils.Json$.parseBytes(Json.scala:62)
at kafka.zk.ControllerZNode$.decode(ZkData.scala:56)
at kafka.zk.KafkaZkClient.getControllerId(KafkaZkClient.scala:902)
at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1199)

Solution
Uncle Google brings back many entries related to the LEADER_NOT_AVAILABLE error, but none of them led to the solution. Finally, I found this entry.
So the healing is very simple.
  • Stop Kafka
  • Run zkCli.sh Zookeeper command line
  • Remove the /controller znode: rmr /controller
  • Start Kafka again
The evil spell is defeated.

Tuesday, October 29, 2019

RedHat, Steam and Crusader Kings 2

I have been playing Crusader Kings II on my RedHat desktop for some time. Although Steam officially supports only the Ubuntu distribution, the games, particularly the older ones developed for Ubuntu 16.04, also work fine on RedHat/CentOS. Unfortunately, after the latest background update of Crusader Kings, the game stubbornly refused to run. After a closer examination, I discovered that the updated game was probably compiled with a newer version of the GNU GCC compiler and the required level of the libstdc++.so.6 library is not available on my platform.
./ck2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ck2)
./ck2: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./ck2)
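
To confirm which GLIBCXX symbol versions the installed runtime actually provides, a quick check with standard tooling can be used (a sketch; the path matches the error message above):
# list the GLIBCXX versions exported by the system libstdc++
strings /lib64/libstdc++.so.6 | grep GLIBCXX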

So that was the bad news. The good news is that after an epic battle I straightened it out. It was as simple as installing a newer version of GNU GCC to get more modern libraries.
The solution is described here.

Wednesday, October 23, 2019

BigSQL 6.0 and HDP 3.1.4

Problem
There is a problem with BigSQL 6.0 installed on top of HDP 3.1.4 or after an upgrade from HDP 3.1. Look at this product support web page. The installation is successful but there are plenty of entries like the following in the BigSQL diagnostic log:
java.lang.NoSuchMethodError: com/google/common/base/Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V (loaded from file:/usr/ibmpacks/bigsql/6.0.0.0/bigsql/lib/java/guava-14.0.1.jar by sun.misc.Launcher$AppClassLoader@28528006) called from class org.apache.hadoop.conf.Configuration (loaded from file:/usr/hdp/3.1.4.0-315/hadoop/hadoop-common-3.1.1.3.1.4.0-315.jar by sun.misc.Launcher$AppClassLoader@28528006).
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1358)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1339)
        at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
        at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)

Solution
The cause of the problem is the old guava jar file in the /usr/ibmpacks/bigsql/6.0.0.0/bigsql/lib/java directory. Replace the old guava with any guava version greater than 20, or simply make a link to an existing guava at the proper level. The fix should be applied on all BigSQL nodes, both Head and Worker. BigSQL should be restarted for the change to take effect.
rm /usr/ibmpacks/bigsql/6.0.0.0/bigsql/lib/java/guava-14.0.1.jar
cd /usr/ibmpacks/bigsql/6.0.0.0/bigsql/lib/java/
ln -s /usr/hdp/3.1.4.0-315/hadoop/lib/guava-28.0-jre.jar

Tuesday, October 22, 2019

BigSQL and HDP upgrade

Problem
I spent several sleepless nights trying to resolve a really nasty problem. It happened after an upgrade from HDP 2.6.4 and BigSQL 5.0 to HDP 3.1 and BigSQL 6.0. Everything was running smoothly, even the BigSQL Healthcheck was smiling. The only exception was the "LOAD HADOOP" command, which failed. BigSQL runs on top of Hive tables but it is an alternative SQL engine; it uses the HCatalog service to get access to Hive tables. In order to ingest data into Hive tables, it launches a separate MapReduce job.
An example command:
db2 "begin execute immediate 'load hadoop using file url ''/tmp/data_1211057166.txt'' with source properties (''field.delimiter''=''|'', ''ignore.extra.fields''=''true'') into table testuser.smoke_hadoop2_2248299375'; end" Closer examination of MapReduce logs brought up a more detailed error message.
Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NumberFormatException: For input string: "30s"
Good uncle Google suggests that it could be caused by an old MapReduce engine running against new configuration files. But how could it happen, since all the stuff related to HDP 2.6.4 and BigSQL 5.0.0 had been meticulously annihilated?
What is more, another HDP 3.1/BigSQL 6.0 installation executes the LOAD HADOOP command without any problem. Comparing all configuration data between both environments did not reveal any difference.
After an even closer examination, I discovered that the MapReduce job related to LOAD HADOOP is powered by the HDP 2.6.4 environment, including legacy jar files; pay attention to the 2.6.4.0-91 parameter.
exec /bin/bash -c "$JAVA_HOME/bin/java -server -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true -Dhdp.version=2.6.4.0-91 -Xmx545m 
Also, the corresponding local cache seems to be populated with the old jar files.
ll /data/hadoop/yarn/local/filecache/14/mapreduce.tar.gz/hadoop/
drwxr-xr-x. 2 yarn hadoop 4096 Jan 4 2018 bin
drwxr-xr-x. 3 yarn hadoop 4096 Jan 4 2018 etc
drwxr-xr-x. 2 yarn hadoop 4096 Jan 4 2018 include
drwxr-xr-x. 3 yarn hadoop 4096 Jan 4 2018 lib
drwxr-xr-x. 2 yarn hadoop 4096 Jan 4 2018 libexec

-r-xr-xr-x. 1 yarn hadoop 87303 Jan 4 2018 LICENSE.txt
-r-xr-xr-x. 1 yarn hadoop 15753 Jan 4 2018 NOTICE.txt
-r-xr-xr-x. 1 yarn hadoop 1366 Jan 4 2018 README.txt
drwxr-xr-x. 2 yarn hadoop 4096 Jan 4 2018 sbin
drwxr-xr-x. 4 yarn hadoop 4096 Jan 4 2018 share

But how could it be possible when all remnants of the old HDP were wiped out and there was no sign of any reference to 2.6.4, even after running the grep command against every directory suspected of retaining this nefarious mark?
grep 2\.6\.4 /etc/hadoop -R
Solution
The nutcracker turned out to be the BigSQL/DB2 db2set command.
db2set
DB2_BIGSQL_JVM_STARTARGS=-Dhdp.version=3.1.0.0-78 -Dlog4j.configuration=file:///usr/ibmpacks/bigsql/6.0.0.0/bigsql/conf/log4j.properties -Dbigsql.logid.prefix=BSL-${DB2NODE}
DB2_DEFERRED_PREPARE_SEMANTICS=YES
DB2_ATS_ENABLE=YES
DB2_COMPATIBILITY_VECTOR=40B
DB2RSHTIMEOUT=60
DB2RSHCMD=/usr/bin/ssh
DB2FODC=CORESHM=OFF
DB2_JVM_STARTARGS=-Xnocompressedrefs -Dhdp.version=2.6.4.0-91 -Dlog4j.configuration=file:///usr/ibmpacks/bigsql/5.0.4.0/bigsql/conf/log4j.properties -Dbigsql.logid.prefix=BSL-${DB2NODE}
DB2_EXTENDED_OPTIMIZATION=BI_INFER_CC ON
DB2COMM=TCPIP
DB2AUTOSTART=NO

Obviously, DB2_JVM_STARTARGS took precedence over DB2_BIGSQL_JVM_STARTARGS, and this was the reason why the old MapReduce framework was resurrected. The legacy jar files were downloaded from the HDFS /hdp/apps directory.
hdfs dfs -ls /hdp/apps
Found 2 items
drwxr-xr-x   - hdfs hdfs          0 2019-10-07 22:09 /hdp/apps/2.6.4.0-91
drwxr-xr-x   - hdfs hdfs          0 2019-10-12 00:16 /hdp/apps/3.1.0.0-78

The problem was sorted out by a single command unsetting the malicious DB2_JVM_STARTARGS variable and restarting BigSQL for it to take effect.
db2set DB2_JVM_STARTARGS=
I also removed the /hdp/apps/2.6.4.0-91 HDFS directory to make sure that the vampire was ultimately killed.
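A sketch of that cleanup, assuming it is run as the hdfs superuser (skipping the trash is optional):
sudo -u hdfs hdfs dfs -rm -r -skipTrash /hdp/apps/2.6.4.0-91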

Monday, September 23, 2019

How to obtain an active NameNode remotely

Problem
While using the WebHDFS REST API, the client deals directly with the NameNode. In an HA (high availability) environment, only one NameNode is active and the second is standby. If the standby NameNode is addressed, the request is denied. So the client should be aware of which NameNode is active and construct a valid URL. But how can the active NameNode be discovered remotely and automatically, to avoid redirecting the client manually in case of failover?
It sounds strange, but there is no Ambari REST API call to detect the active NameNode.
One obvious solution is to use the WebHDFS Knox Gateway which, assuming it is configured properly, propagates the query to the valid NameNode.
Solution
There are two convenient methods to discover the active NameNode outside the Knox Gateway. One is to use a JMX query and the second is to use hdfs haadmin.
The solution is described here in more detail. I also added a convenient bash script to extract the active NameNode using both methods: JMX query and hdfs haadmin. The script can be easily customized. If the hdfs haadmin method is used, the script can be executed only inside the cluster, so a remote shell call should be implemented.
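
A minimal sketch of both checks (not the script itself); the host name, the HTTP port (50070 here, 9870 on some Hadoop 3 setups) and the nn1/nn2 service ids are assumptions to adjust:
# JMX query: the active NameNode reports "State" : "active", the standby reports "standby"
curl -s 'http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus' | grep '"State"'
# hdfs haadmin: works only inside the cluster; nn1/nn2 are the dfs.ha.namenodes.<nameservice> ids
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2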


Friday, August 30, 2019

The 'krb5-conf' configuration is not available

HDP 3.1
A nasty message, as visible above, suddenly popped up out of the blue. Every configuration change and every stop or start of a service was blocked because of it. The message was related to Kerberization, but the "Disable Kerberos" option was also under the spell. It seemed that the only option was to plough everything under and build up the cluster from the bare ground.
The problem is described here but no solution is proposed.
Solution
The solution was quite simple: remove the "Kerberos" marker from the cluster by modifying the Ambari database. In the case of a PostgreSQL database, execute the command:
update clusters set security_type='NONE'
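
For example, a sketch against a default Ambari PostgreSQL setup (the ambari database name and the postgres user are assumptions; adjust to the local installation):
# run on the Ambari server host
sudo -u postgres psql ambari -c "update clusters set security_type='NONE';"
# restart the Ambari server so the change is picked up (assumption: Ambari caches the cluster state)
ambari-server restart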

After that magic, the "Enable Kerberos" button is active again, and after performing the Kerberization the cluster is happy and healthy once more.

Sunday, August 4, 2019

HDP 3.1, HBase REST API, security gap

Problem
I found a nasty problem with HDP 3.1 which cost me several sleepless nights. There is a security gap in the HBase REST API. The HBase REST API service does not impersonate users and all HBase commands are executed as the hbase user. The same behaviour is passed on to Knox HBase. It means that any user having access to the HBase REST API or the Knox Gateway HBase service is authorized to do any action, bypassing any security settings defined in Ranger or in the HBase service directly.
Solution
The only solution I found was to compile the current version of HBase downloaded from GitHub and replace the legacy hbase-rest jar with the new one.

Clone GitHub repository and build the packages
git clone https://github.com/apache/hbase.git -b branch-2.0
cd hbase
mvn package -DskipTests

As root user
cd /usr/hdp/3.1.0.0-78/hbase/lib

Archive existing jar
mkdir arch
mv hbase-rest-2.0.2.3.1.0.0-78.jar arch/
unlink hbase-rest.jar

Replace with the new one
ln -s /home/hbase/hbase/hbase-rest/target/hbase-rest-2.0.6-SNAPSHOT.jar hbase-rest.jar

Restart HBase REST API server.

Wednesday, July 31, 2019

HDP 3.1, Wired Encryption

Introduction
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/configuring-wire-encryption/content/wire_encryption.html
Wired Encryption not only adds the next layer of security, but also the next layer of complexity. Although everything is described in the HortonWorks documentation, it is not easy to extract the practical steps on how to set up the encryption, and it took me some time to accomplish it. So I decided to recap my experience and publish several useful scripts and a practical procedure.
https://github.com/stanislawbartkowski/hdpwiredencryption
What is described
  • Self-signed certificates
  • Enable SSL for WebHDFS, MapReduce, Tez, and YARN
Future plans
  • Certificates signed by CA
  • Other HDP services
  • Application connection
Problem to solve
After enabling the encryption, the BigSQL LOAD HADOOP command refused to work. It is related to the Timeline service secure connection. I will try to sort it out.

Sunday, June 30, 2019

My Wiki

Introduction
I started creating my GitHub Wiki containing some practical guidelines and steps related to HDP and BigSQL. I do not want to duplicate or retell the official HortonWorks or IBM documents and tutorials. Just my success stories and some practical advice to remember, so as not to roll the boulder up that Kerberos hill again and again, like Sisyphus.
The list and the content are not closed or static; I'm constantly updating and enhancing them.
https://github.com/stanislawbartkowski/wikis
Below is the current content.
LDAP and Kerberos authentication for CentOS
https://github.com/stanislawbartkowski/wikis/wiki/Centos---Kerberos---LDAP
Guidelines on how to set up Kerberos and LDAP authentication for CentOS. It is good practice to have the host systems secured before installing HDP. Although described in many places, I had a hard time putting it together and ending up in the loving arms of Kerberos. In particular, a secure LDAP connection and two-way LDAP security required a lot of patience and gritted teeth.
Dockerized Kerberos
https://github.com/stanislawbartkowski/docker-kerberos
Docker is the buzzword these days. There are several Kerberos-on-Docker implementations circulating around, but I decided to create my own that I'm confident with.
HDP and GPFS
https://github.com/stanislawbartkowski/javahotel/tree/hdpgpf
GPFS is an alternative to HDFS. HDP running on top of GPFS has some advantages and disadvantages. Before going live, it is a good idea to practice on a local cluster. Here I'm presenting practical steps on how to set up HDP using GPFS as the data storage. Important: you need a valid IBM GPFS license to do that.
DB2 11 and aggregate UDF
https://github.com/stanislawbartkowski/javahotel/tree/db2aggr
It seems strange, but until DB2 11 it was not possible to implement a custom aggregate UDF in DB2. This feature was finally added in DB2 11 but I found it not well documented. So it took me some time to create even a simple aggregate UDF, but in the end I made it.
IBM BigSQL, monitoring and maintenance queries
https://github.com/stanislawbartkowski/wikis/wiki/BigSQL,-useful-commands
For some time, I was supporting an IBM client on issues related to IBM BigSQL. During my engagement, I created a notebook with a number of useful SQL queries related to different aspects of maintenance, performance, security etc., ready for copying and pasting.
Dockerized IBM DB2
https://github.com/stanislawbartkowski/docker-db2
Yet another DB2 on Docker. Download the free release of DB2 or, if you are lucky enough, your licensed edition, and be happy to bring up DB2 with a tap of your finger.
Enable CentOS for Active Directory
https://github.com/stanislawbartkowski/wikis/wiki/CentOS---Active-Directory
Enabling CentOS or RedHat for Active Directory authentication is easy compared to MIT Kerberos/OpenLDAP, but there are still some hooks and nooks to know about.
Monitoring tool for IBM BigSQL
https://github.com/stanislawbartkowski/bigsqlmoni
IBM BigSQL/DB2 contains a huge variety of monitoring queries, but in most cases what is important is not the value but the delta. So I created a simple tool to store and provide deltas for some performance and monitoring indicators. I'm still looking for a way to make practical use of it.
Enable HDP for Active Directory/Kerberos/LDAP
https://github.com/stanislawbartkowski/wikis/wiki/HDP-2.6.5-3.1-and-Active-Directory
Practical steps on implementing Kerberos security in HDP. A simple test to make sure that security is in place.
Enable HDP services for Active Directory/Kerberos/LDAP
https://github.com/stanislawbartkowski/hdpactivedirectory
Wiki: https://github.com/stanislawbartkowski/hdpactivedirectory/wiki
Basic Kerberization enables Kerberos authentication for Hadoop services and makes HDFS secure. The next step is to enable a particular service for Kerberos authentication and LDAP authorization. It is highly recommended to activate Ranger security. The attached GitHub Wiki contains guidelines and practical steps to enable Hadoop services for AD/Kerberos. Every chapter also contains a basic test to make sure that security is enforced and has teeth.
HiBench for HDP
https://github.com/stanislawbartkowski/MyHiBench
HiBench is widely recognized as a leading benchmark tool for Hadoop services. Unfortunately, the development seems to have stopped two years ago and I found it difficult to run it on HDP 3.1. Also, the tool seems not to be enabled for Kerberos security. After spending some time trying to find a workaround, I decided to create my own fork of HiBench.
This fork is dedicated to HDP only; the original HiBench can also be used against standalone installations of some Hadoop services. It also required redeveloping some Java and Scala code, particularly related to Kafka and Spark Streaming.
Several Java/Scala tools to test Kerberos security
HDFS Java client https://github.com/stanislawbartkowski/KafkaSample
Kafka Java client https://github.com/stanislawbartkowski/KafkaSample
Scala Spark Streaming against Kafka https://github.com/stanislawbartkowski/SampleSparkStreaming
Several simple Java/Scala tools to test Kerberos connectivity. The tools come with source code to review. They are used as an additional test, next to batch or command-line tools, for testing Kerberos security.

Sunday, May 5, 2019

HDP, Ranger and plugins

Problem
I spent several sleepless nights trying to resolve a strange problem related to HDP (HortonWorks Data Platform), Ranger service and plugins.
After installing Ranger and enabling any plugin, an appropriate service entry should be created and visible in Ranger Admin UI. More details are here. In the beginning, a default policy is created which can be customized later according to needs.
But in my environment, the service entry was not created, thus blocking any attempt to implement an authorization policy. What is more, even disabling/enabling the plugin or stopping/restarting the cluster did not make any change; I was unable to conjure up the service entry. At some point, I even removed Ranger, recreated the Ranger database and reinstalled the service from scratch, but it did not help.
Solution
Finally, after carefully browsing through the log files, I found the solution. The culprit is the local directory /etc/ranger. It contains a subdirectory reflecting the service entry in the Ranger Admin UI.

ls /etc/ranger/MyCluster_hadoop/
cred.jceks
policycache

This directory contains a copy of the Ranger service policy and is used as a recovery point in case of a database failure. It seems that, after enabling the plugin, if the service discovers this cache then the Ranger service policy is recreated, but in this scenario the Ranger Admin UI service entry is not restored. This cache is not removed after disabling the plugin, or even after removing the whole Ranger service.
Unfortunately, this is not documented and is badly implemented.
The solution is to switch off the plugin, manually remove the /etc/ranger/{service name} directory and switch on the plugin again. The service entry and default policy are recreated.
Keep in mind that the directory /etc/ranger/{service name} is located on the host where the appropriate service is installed, not the Ranger service host.
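
A minimal sketch of the procedure, using the MyCluster_hadoop directory shown above as an example (adjust the path to the cluster and service in question):
# 1. disable the Ranger plugin for the service in Ambari and restart the service
# 2. on the host where the service runs, remove the stale policy cache
rm -rf /etc/ranger/MyCluster_hadoop
# 3. re-enable the plugin in Ambari and restart the service; the service entry and the default policy are recreated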

Monday, April 29, 2019

My HiBench

Problem
HiBench is regarded as a leading benchmark in the BigData world according to this webpage. But while trying to run it in my test HDP environment, 2.6.5 or 3.1, I found it very frustrating, particularly when the cluster is secured by Kerberos.
Because the project seems to have been abandoned two years ago, I started implementing manual patches to the code, as described here. But at some point, I came to a dead end. The Kafka 0.8 client does not support Kerberos; Kerberos support was added in the Kafka 0.10 client, but Kafka 0.10 is not backward compatible and moving there required redeveloping a substantial part of the Kafka-related code in Java and Scala.
Solution
So finally I decided to make my own fork of the HiBench project and split completely from the trunk. The result is a new GitHub project here.

Main changes include:

  • Kerberos support
  • Kafka 2.0 implemented
  • HDP 3.1/Apache Hadoop 3.1 support
  • Hive 3.1 support, the standard HiBench was developed for Hive 0.17 and does not talk to 3.1
Features lost compared to the standard HiBench
  • Only HDP is supported and tested; standalone Spark, Apache Hadoop, Kafka etc. are not supported.
  • Only support for HDP 2.6.5 and HDP 3.1 is implemented. Support for older versions of HDP, Spark and Scala is abandoned.
  • Support for Apache Flink and Gearpump is removed.
Tests
The project was tested in two environments, HDP 3.1 and HDP 2.6.5: the tiny profile in a local KVM cluster and the tiny/small/large profiles in a larger multi-host cluster.
Future plans
  • Review the Kafka streaming benchmark; while analyzing the code I found some mysteries out there
  • The same for the Dfsio benchmark test
  • Test in wire encryption secured cluster

Saturday, March 30, 2019

TPC-DS benchmark and HDP

I conducted the TPC-DS benchmark on an HDP 2.6.5 cluster and compared three SQL engines: Hive, SparkSQL and IBM BigSQL.
The results are published here.
The data size scaling factor is 100, meaning 100 GB of data. It is not a qualification database in terms of the TPC-DS specification and the results should not be used for any formal comparison. I'm only going to run the test in different HDP environments and collect the results.
The queries were executed against ORC and Parquet files.
The ad-hoc impressions:

  • Hive falls behind as expected because of the M/R paradigm.
  • Parquet format: IBM BigSQL and SparkSQL go head to head. Some queries execute faster in BigSQL, some faster in SparkSQL.
  • SparkSQL behaves worse while running against ORC, even compared to Hive.
The benchmark test was executed using the TPC-DS framework available here. The Throughput Test requires running four query streams in parallel. It was achieved by launching the ptest.sh script four times with a parameter describing the stream number.
  • nohup ./ptest.sh 0 &
  • nohup ./ptest.sh 1 &
  • nohup ./ptest.sh 2 &
  • nohup ./ptest.sh 3 &
In a tiny environment, the cluster was adjusted according to the demands of the particular SQL engine. Hive and SparkSQL were using four dedicated Yarn queues to achieve a better parallelism level. BigSQL was configured to remove the memory restraint, otherwise BigSQL would fail during the Throughput Test because of a lack of available memory. After the test, the specific configuration should be reverted.
The tables were created using the Hive external table feature to deploy the text files, and then the target tables were loaded by the sequence below (a hedged sketch follows the list):
  • LOAD DATA INPATH hdfs_path OVERWRITE INTO TABLE external_table
  • CREATE TABLE .... STORED AS PARQUET/ORC AS SELECT * FROM external_table
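
A hedged sketch of that sequence for a single table; the store_sales/ext_store_sales names, the HDFS path and the JDBC URL are assumptions used only for illustration:
# load the raw text file into the external staging table, then materialize it as ORC (or PARQUET)
beeline -u "jdbc:hive2://hive-host:10000/default" \
  -e "LOAD DATA INPATH '/tmp/tpcds/store_sales' OVERWRITE INTO TABLE ext_store_sales" \
  -e "CREATE TABLE store_sales STORED AS ORC AS SELECT * FROM ext_store_sales"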
Next steps
  • The TPC-DS tables are not partitioned. Try to partition them and take advantage of the "partition elimination" feature in BigSQL. This feature optimizes join queries by ignoring partitions not covered by the join predicates. It should boost BigSQL performance.
  • Migrate to HDP 3.1 and compare the results in the same hardware framework.

Thursday, February 28, 2019

Docker DB2

Docker and IBM DB2
There is an official IBM DB2 Docker container. But it is quite easy to create your own Docker image using the freely available IBM DB2 Express-C, to keep up with the latest version and have more flexibility.
The Dockerfile and description are available in the GitHub project. The Dockerfile also allows consuming the commercial DB2 AESE version and applying the current FixPack.
Preparing an image takes some time, but when it is ready you rule them all. Run, start, stop and destroy a DB2 instance with a tap of your finger.

Friday, January 18, 2019

Civilization The Board Game, next version

Introduction
I deployed a new version of my computer implementation of  Civilization The Board Game. The implementation consists of three parts:
New features
A journal was added. There is a window panel where the activities of the player and the opponent are displayed.


The journal panel can be brought to life by clicking the "Journal" button. It can be hidden afterwards by clicking the "Hide" button. The panel can be dragged across the screen but cannot be scaled. The actions are added to the panel dynamically as the game progresses.
There were several problems to be resolved. Some messages are "public", meaning the message is visible to both players. For instance, if the player buys a building, the information is visible to both of them. There are also private messages, visible only to the player and not replicated to the opponent. Some messages are modified according to the recipient. For instance, if the player is buying a unit, the unit type and the unit strength are reported to him. The same information is sent to the opponent, but the unit strength is removed as private.
Next steps
Some culture cards are to be implemented.

Monday, January 7, 2019

MIT Kerberos, Ubuntu and Docker

I was using this version of Dockerized Kerberos but finally I decided to prepare my own based on Ubuntu. Ubuntu has a smaller footprint than CentOS, 230 MB against 650 MB. The Dockerfile and usage description are available here, as a GitHub project.
Enjoy.