Blog of the Open Source JavaHotel project

Sunday, 28 February 2021

Maven and GitHub

I spent half a day trying to upload my package to GitHub Maven, following the GitHub Maven guidelines. As usual, immediate success would have been too good to be true. Most of the time, I was haunted by a mysterious message that made the mvn deploy command fail.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy (default-deploy) on project RestService: Failed to deploy artifacts: Could not transfer artifact com.restservice:RestService:jar:1.0-20210228.194713-1 from/to github ( Failed to transfer file: Return code is: 422, ReasonPhrase: Unprocessable Entity. -> [Help 1]

After several hours I was on the verge of giving up. Salvation came from Stack Overflow.

Could you try lower casing your artifact ID

That was the point: "Return code is: 422, ReasonPhrase: Unprocessable Entity" translates as "use a lower-case artifact ID".
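A minimal sketch of a pre-deploy check, assuming the only problem is the case of the artifact ID. The class and method names are hypothetical and are not part of Maven or of any GitHub tooling.

```java
import java.util.Locale;

// Hypothetical helper, not part of Maven or GitHub tooling.
class ArtifactIdCheck {

    // True when the artifact ID contains no upper-case characters.
    static boolean isLowerCase(String artifactId) {
        return artifactId.equals(artifactId.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String artifactId = "RestService"; // the ID from the failing deploy
        if (!isLowerCase(artifactId)) {
            System.out.println("Rename artifactId to: " + artifactId.toLowerCase(Locale.ROOT));
        }
    }
}
```

With the ID from the error message, the check suggests "restservice" as the name to use in pom.xml.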





Java, Parquet and JDBC

Strangely, it is almost impossible to access Parquet files outside a Hadoop/Spark context. I tried to move data imprisoned in Parquet into a relational database accessed through JDBC using a standalone Java application, and failed.

So I ended up with a Spark/Java application to address the issue.

Source and description:

The application loads Parquet-formatted data from a single file or a directory, partitions the data into several chunks, launches executors and loads the data into the JDBC database in parallel. The number of partitions and the number of executors are configurable.

The application was tested in local and single-node Spark configurations. The next step is to configure and test the application in a distributed Hadoop environment.
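The partition-and-load idea can be sketched in plain Java. This is an illustration only, not the Spark application: a thread pool stands in for the Spark executors, a row count stands in for the JDBC batch insert, and all names (ChunkedLoader, partition, loadInParallel) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: plain Java threads stand in for Spark executors.
class ChunkedLoader {

    // Splits rows into at most `partitions` chunks of nearly equal size.
    // Assumes partitions > 0.
    static <T> List<List<T>> partition(List<T> rows, int partitions) {
        List<List<T>> chunks = new ArrayList<>();
        int size = (rows.size() + partitions - 1) / partitions; // ceiling division
        for (int start = 0; start < rows.size(); start += size) {
            chunks.add(rows.subList(start, Math.min(start + size, rows.size())));
        }
        return chunks;
    }

    // Loads every chunk on a pool of `executors` threads and returns the row count.
    static <T> int loadInParallel(List<T> rows, int partitions, int executors) {
        ExecutorService pool = Executors.newFixedThreadPool(executors);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<T> chunk : partition(rows, partitions)) {
                // A real loader would open a JDBC connection here and run a batch insert.
                Callable<Integer> task = () -> chunk.size();
                results.add(pool.submit(task));
            }
            int loaded = 0;
            for (Future<Integer> result : results) {
                loaded += result.get(); // waits for the chunk to finish
            }
            return loaded;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Loading failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(loadInParallel(rows, 3, 2)); // prints 10
    }
}
```

Keeping the number of partitions separate from the number of executors mirrors the two configurable parameters: more partitions than executors means each executor works through several chunks in turn.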

Sunday, 31 January 2021

DB2 audit

DB2 audit is a powerful tool for supervising the usage of a DB2 instance and its databases. Some practical advice on how to set up and use DB2 audit is described here.

But DB2 audit by itself only collects data. The next step is to make practical use of it: there is no advantage in collecting data without analyzing it.

So I developed a simple solution to discover and escalate any suspicious behaviour. The solution and description are available here.

The solution consists of several bash scripts and does not require any additional dependencies.

Two tasks are implemented:

  • Collecting audit records and moving them to an additional DB2 database, ready for further analysis. This part can be executed as a crontab job.
  • Running investigative SQL queries on the audit database to discover suspicious and unexpected behaviour. This part can be executed on demand or as a crontab job. Examples:
    • An unauthorized user connected to a DB2 database.
    • A read-only user ran an update SQL statement.
    • A failed command reported as "not authorized", suggesting a user trying to exceed their authority.
Some example investigative queries are already implemented. Any new query can be added.
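The first two example rules can be expressed as a small check. This is an illustration in Java only: the actual solution runs SQL queries from bash scripts, and the user lists, method name and matching on statement prefixes below are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: the described solution uses SQL queries run from bash scripts.
// The user lists are hypothetical sample data.
class AuditRules {

    static final Set<String> AUTHORIZED = new HashSet<>(Arrays.asList("appuser", "report"));
    static final Set<String> READ_ONLY = new HashSet<>(Arrays.asList("report"));

    // Returns a violation description, or null when the record looks fine.
    static String check(String user, String statement) {
        if (!AUTHORIZED.contains(user)) {
            return "unauthorized user connected: " + user;
        }
        String sql = statement.trim().toUpperCase(Locale.ROOT);
        boolean modifies = sql.startsWith("UPDATE")
                || sql.startsWith("INSERT")
                || sql.startsWith("DELETE");
        if (READ_ONLY.contains(user) && modifies) {
            return "read-only user ran a modifying statement: " + user;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(check("report", "update t set c = 1"));
        System.out.println(check("intruder", "select * from t"));
    }
}
```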

The solution runs at the instance level, but investigative queries can be customized at the database level. When there are several databases in a single instance, every database can come with its own security rules.

Every violation can be escalated using a customizable script. An example script, which reports violations in a dedicated text file, is available here.

Monday, 28 December 2020

Mail server in a single docker

I created a simple mail server, SMTP and IMAPS, in a single docker/podman container. The mail service engines are Postfix (SMTP) and Dovecot (IMAPS). The solution is described here. I also added guidelines on how to test and configure two mail clients: Evolution and mutt.

The storage is ephemeral, not recommended for any production environment but ideal for testing, easy to create and easy to dismantle. 

I also added a sample yaml configuration file and remarks on how to deploy the container to an OpenShift/Kubernetes cluster.

Sunday, 29 November 2020

My own Hadoop/HDP benchmark


I spent some time trying to come to terms with HiBench but finally gave up. The project seemed to be abandoned, and there were a lot of problems running it in a secure (kerberized) environment and adjusting it to new versions of Hive and Spark. Also, it is a project with a long history and there are many layers that are not consistent with each other.

So I ended up creating my own version of the HiBench benchmark. The only code migrated from HiBench is the Spark/Scala and Java source code, upgraded to new versions of dependencies and with more consistent parameter handling.


  • Dedicated to HDP 3.1.5. Standalone services are not supported.  
  • Enabled for Kerberos
  • Hive 3.0
  • Spark 2.x
  • Simple to run and expand, minimal configuration
Features not supported
  • Streaming, pending
  • Nutch indexing, not under development any longer
  • Flink, Gearpump, not part of HDP stack

    All details are here.

Saturday, 31 October 2020

SSL for masses


I expanded my tool for enabling wire encryption in an HDP cluster.

Previously, only self-signed certificates were supported. I added automation for CA-signed certificates. Important: it works only if the CA-signed certificate package follows the supported format.

There are two paths possible: self-signed certificates and CA-signed certificates.

Self-signed certificates

  1. ./ 0 Creates a self-signed certificate and truststore for every node.
  2. ./ Creates and distributes all-client truststore.
  3. ./ 2 Secures keystores and truststores. Applies owner and Linux permissions.
CA-signed certificates
  1. ./ 3 Creates self-signed certificates and a CSR (Certificate Signing Request) for every node.
  2. Manual step. Send all CSRs to the CA for signing. The CA-signed certificates should be stored in the designated format.
  3. ./ 4 CA-signed certificates are imported into the corresponding keystores, replacing the self-signed certificates. Truststores are created.
  4. ./ 1 Creates and distributes all-client truststore.
  5. ./ 2 Secures keystores and truststores.


There are a number of pages containing practical steps on how to enable SSL for HDP components. They are based on the documentation but are more practical, drawing on experience.

For instance:

HDFS Ranger Plugin for SSL

NiFi service for SSL

Thursday, 13 August 2020

HBase, Phoenix and CsvBulkLoadTool

I'm running MyBench in a new environment and it fails while loading data into a Phoenix table using the CsvBulkLoadTool utility.

WARN tool.LoadIncrementalHFiles: Attempt to bulk load region containing  into table BENCH.USERVISITS with files [family:0 path:hdfs://bidev/tmp/386640ec-d49e-4760-8257-05858a409321/BENCH.USERVISITS/0/b467b5560eee4d61a42d4c9e6a78eb7e] failed.  This is recoverable and they will be retried.

INFO tool.LoadIncrementalHFiles: Split occurred while grouping HFiles, retry attempt 100 with 1 files remaining to group or split

ERROR tool.LoadIncrementalHFiles: -------------------------------------------------

Bulk load aborted with some files not yet loaded:

After closer examination, I discovered that the error occurs while moving/renaming the input file into the HBase staging directory /apps/hbase/data/staging. In this cluster, the HBase data is encrypted, and moving data between an encrypted zone and a normal zone is not possible.

Failed to move HFile: hdfs://bidev/apps/hbase/data/staging/ambari-qa__BENCH.USERVISITS__dbb5qdfppq1diggr0dmdbcb1ji74ol4b9jn9ee2dgp1ttn9n5i6llfih7101fi1d/0/3a7f2d612c034253ad375ae002cc6ade to hdfs://bidev/tmp/fc43e454-00b3-4db0-8bdd-8b475885ab49/BENCH.USERVISITS/0/3a7f2d612c034253ad375ae002cc6ade

at org.apache.hadoop.hbase.regionserver.SecureBulkLoadManager$SecureBulkLoadListener.failedBulkLoad(

at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(

at org.apache.hadoop.hbase.regionserver.SecureBulkLoadManager$

The source code can be found here.

    if (!FSUtils.isSameHdfs(conf, srcFs, fs)) {
      LOG.debug("Bulk-load file " + srcPath + " is on different filesystem than " +
          "the destination filesystem. Copying file over to destination staging dir.");
      FileUtil.copy(srcFs, p, fs, stageP, false, conf);
    } else if (copyFile) {
      LOG.debug("Bulk-load file " + srcPath + " is copied to destination staging dir.");
      FileUtil.copy(srcFs, p, fs, stageP, false, conf);
    } else {
      LOG.debug("Moving " + p + " to " + stageP);
      FileStatus origFileStatus = fs.getFileStatus(p);
      origPermissions.put(srcPath, origFileStatus.getPermission());
      if (!fs.rename(p, stageP)) {
        throw new IOException("Failed to move HFile: " + p + " to " + stageP);
      }
    }

When data is moved between different file systems, copying is enforced. Unfortunately, data movement between an encrypted zone and an unencrypted zone is not covered here.

Another option is to make use of the "copyFile" parameter, which enforces copying. After analyzing the control flow, I discovered that there is an "hbase-site.xml" parameter, always.copy.files, which seemed to be the solution to the problem. But after applying this parameter, nothing changed.

Further examination, with a little help from remote debugging, unearthed a sad truth. CsvBulkLoadTool passes control to the "doBulkLoad" function.

    public Map<LoadQueueItem, ByteBuffer> doBulkLoad(Path hfofDir, final Admin admin, Table table,
        RegionLocator regionLocator) throws TableNotFoundException, IOException {
      return doBulkLoad(hfofDir, admin, table, regionLocator, false, false);
    }

Unfortunately, the "copyFiles" parameter is hardcoded as "false", although there is a sound, ready-to-use "isAlwaysCopyFiles()" function that reads the "hbase-site.xml" config file.
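The effect of the hardcoded flag can be shown with a simplified model. This is a sketch, not the HBase source: a Map stands in for the Hadoop configuration, the same-file-system check is reduced to a boolean, and the "fixed" variant is only a guess at what a manual patch might look like.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the control flow, not the HBase source.
// A Map stands in for the Hadoop Configuration object.
class BulkLoadFlagSketch {

    static final String ALWAYS_COPY_FILES = "always.copy.files";

    enum Action { COPY, RENAME }

    // Mirrors the staging branch: copy across file systems or on request, otherwise rename.
    static Action stage(boolean sameFileSystem, boolean copyFile) {
        return (!sameFileSystem || copyFile) ? Action.COPY : Action.RENAME;
    }

    // Stand-in for isAlwaysCopyFiles(): reads the flag, defaulting to false.
    static boolean isAlwaysCopyFiles(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(ALWAYS_COPY_FILES, "false"));
    }

    // Observed behaviour: the flag is hardcoded, so the configured value is ignored.
    static Action doBulkLoadHardcoded(Map<String, String> conf) {
        return stage(true, false);
    }

    // Hypothetical fix: pass the configured value instead of the literal false.
    static Action doBulkLoadFixed(Map<String, String> conf) {
        return stage(true, isAlwaysCopyFiles(conf));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(ALWAYS_COPY_FILES, "true");
        System.out.println(doBulkLoadHardcoded(conf)); // prints RENAME
        System.out.println(doBulkLoadFixed(conf));     // prints COPY
    }
}
```

In the model, setting always.copy.files to true changes nothing on the hardcoded path, which matches what happened on the cluster: the rename is still attempted and fails across the encryption zone boundary.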
The only solution is a manual fix and rebuilding the package from the source files. But that is not easy, because one has to juggle different, outdated versions of HBase and Phoenix to create a "Phoenix client" package matching HDP 3.1.
So two days were spent without a solution.