Blog of the Open Source JavaHotel project

Sunday, 26 September 2021

Recursive HDFS download - continued

I extended my WebHDFS downloader. For HDFS directories containing an enormous number of files, scanning the directory through the WebHDFS REST API fails with a time-out.

The solution is to obtain the HDFS directory tree using the standard hdfs dfs -ls -R command, ship the result to the node where hdfsdownloader is executed, and use that file as the input for the tool.
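
The listing produced by hdfs dfs -ls -R is plain text, one entry per line, and the first character of the permission string tells directories from regular files apart. Below is a minimal sketch of how such a listing could be parsed; the function and the named tuple are illustrative only, not part of webhdfsdirectory.

from collections import namedtuple

# Illustrative structure, not part of the tool itself
HDFSEntry = namedtuple("HDFSEntry", ["path", "isdir", "size"])

def parse_ls_r(listing_file):
    """Parse the output of 'hdfs dfs -ls -R <path>' into HDFSEntry records."""
    entries = []
    with open(listing_file) as f:
        for line in f:
            line = line.rstrip("\n")
            # skip empty lines and a possible 'Found N items' header
            if not line or line.startswith("Found "):
                continue
            # permissions, replication, owner, group, size, date, time, path
            parts = line.split(None, 7)
            if len(parts) < 8:
                continue
            perms, _, _, _, size, _, _, path = parts
            entries.append(HDFSEntry(path=path, isdir=perms.startswith("d"), size=int(size)))
    return entries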

More detailed description: https://github.com/stanislawbartkowski/webhdfsdirectory

To extend the tool, I used Python inheritance. The tool now runs in two modes: it gets the list of HDFS files either through the WebHDFS REST API or from an input text file. The only difference is the way the list of files is obtained; the main flow of the tool is the same regardless of the method.

Source code: https://github.com/stanislawbartkowski/webhdfsdirectory/blob/main/src/proc/hdfs.py

The CLASSHDFS class drives the application flow. It calls the getdir method to receive the list of file and directory names in the HDFS path provided as a parameter and iterates over the list. A regular file is downloaded. For a directory entry, the class makes a recursive call and goes one level down the HDFS directory tree.

The getdir method is supplied by the appropriate parent class. In WebHDFS REST API mode, the method is inherited from the TRAVERSEHDFS class; in input text file mode, it is implemented in the TRAVERSEFILE class.

# WebHDFS REST API mode: the file list is obtained directly from the namenode
class DIRHDFS(CLASSHDFS, TRAVERSEHDFS):

    def __init__(self, WEBHDFSHOST, WEBHDFSPORT, WEBHDFSUSER, DIRREG=None, dryrun=False):
        TRAVERSEHDFS.__init__(self, WEBHDFSHOST, WEBHDFSPORT, WEBHDFSUSER)
        CLASSHDFS.__init__(self, DIRREG, dryrun)

# Input text file mode: the file list comes from the output of hdfs dfs -ls -R
class FILEHDFS(CLASSHDFS, TRAVERSEFILE):

    def __init__(self, txtfile, WEBHDFSHOST, WEBHDFSPORT, WEBHDFSUSER, DIRREG=None, dryrun=False):
        self.WEBHDFSHOST = WEBHDFSHOST
        self.WEBHDFSPORT = WEBHDFSPORT
        self.WEBHDFSUSER = WEBHDFSUSER
        TRAVERSEFILE.__init__(self, txtfile)
        CLASSHDFS.__init__(self, DIRREG, dryrun)
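
The effect of this layout can be shown with a stripped-down sketch. Everything except the getdir contract is a simplified placeholder here, not the actual implementation: one class drives the recursion and relies only on getdir, which each traversal class supplies in its own way.

# Stripped-down illustration of the pattern; names other than getdir are placeholders.

class WalkerSketch:
    """Simplified counterpart of CLASSHDFS: the recursion relies only on getdir."""

    def walk(self, path):
        for name, isdir in self.getdir(path):
            child = path.rstrip("/") + "/" + name
            if isdir:
                self.walk(child)           # directory: go one level down the HDFS tree
            else:
                self.download(child)       # regular file: download it

    def download(self, hdfspath):
        print("downloading", hdfspath)     # placeholder for the real transfer


class MemoryTraverse(WalkerSketch):
    """Stand-in for TRAVERSEHDFS/TRAVERSEFILE: only getdir differs between modes."""

    TREE = {"/data": [("sub", True), ("a.txt", False)],
            "/data/sub": [("b.txt", False)]}

    def getdir(self, path):
        return self.TREE.get(path, [])


MemoryTraverse().walk("/data")             # prints the two regular files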

Wednesday, 30 June 2021

Recursive HDFS

It is strange, but I was unable to find any ready-made solution for downloading an HDFS directory tree. The project I developed covers the gap.

https://github.com/stanislawbartkowski/webhdfsdirectory

The only dependencies are Python 3 and the requests package. The HDFS tree is downloaded using the WebHDFS REST API. It does not support Kerberos authentication and was not tested over a secured/encrypted connection.

The solution contains a simple hdfs package doing the main work plus Python and bash wrappers. It also allows applying a regular-expression directory selector and a "dry-run" option to test it without downloading anything. It was tested in a real production environment.
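
For reference, the two WebHDFS operations the tool relies on are standard: LISTSTATUS to enumerate a directory and OPEN to read a file. Below is a minimal sketch using the requests package; the host, port, user and paths are placeholders (9870 is the usual Hadoop 3 namenode HTTP port).

import requests

# Placeholder connection details - replace with your own
HOST, PORT, USER = "namenode.example.com", 9870, "hdfs"

def webhdfs_url(path, op):
    return f"http://{HOST}:{PORT}/webhdfs/v1{path}?op={op}&user.name={USER}"

def list_status(path):
    """Return the FileStatus entries of an HDFS directory (LISTSTATUS operation)."""
    r = requests.get(webhdfs_url(path, "LISTSTATUS"))
    r.raise_for_status()
    return r.json()["FileStatuses"]["FileStatus"]

def download_file(path, localname):
    """Download a single HDFS file (OPEN operation; the redirect to a datanode is followed)."""
    with requests.get(webhdfs_url(path, "OPEN"), stream=True) as r:
        r.raise_for_status()
        with open(localname, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

# Example: print the immediate children of /tmp
for status in list_status("/tmp"):
    print(status["pathSuffix"], status["type"])   # type is FILE or DIRECTORY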


Tuesday, 27 April 2021

DB2 Warehouse and encryption

I spent some time trying to move a database from one DB2 instance to another, just a trivial backup-and-restore scenario. Unfortunately, the RESTORE command failed with a nasty message.

db2 restore db bludb from /tmp taken at 20210425102241 
SQL1743N  The RESTORE command failed because the backup image is of an
encrypted database but the target database is not encrypted.

I was trying to restore an encrypted database into a plain, unencrypted one. To be sure:
db2ckbkp -H /tmp/BLUDB.0.db2inst1.DBPART000.20210425102241.001 
Encrypt Info Flags             -- 0x1
	                                  Source DB was encrypted
Yes, the backup is encrypted, but how to restore it?

To decrypt the backup, we need the key used for encryption. The source keystore and the label of the secret key can be extracted from the backup.

db2 "restore db bludb from /mnt/backup taken at 20210425102241   encropts 'show master key details'"
The result is tricky: in the diagnostic directory, look for a specific file containing information about the source keystore and the key label.
cd /database/config/db2inst1/sqllib/db2dump/DIAG0000
cat BLUDB.0.db2inst1.DBPART000.20210425102241.masterKeyDetails 

            KeyStore Type: PKCS12
        KeyStore Location: /mnt/blumeta0/db2/keystore/keystore.p12
       KeyStore Host Name: thinkde
      KeyStore IP Address: 127.0.1.1
 KeyStore IP Address Type: IPV4
     Encryption Algorithm: AES
Encryption Algorithm Mode: CBC
    Encryption Key Length: 256
         Master Key Label: DB2_SYSGEN_db2inst1_BLUDB_2021-03-19-13.57.02_A8CF4EED
The next step is to go to the source keystore and extract the secret key. If the source database is lost and there is no copy of the keystore, the backup is also lost because there is no way to decipher it.

Display the content of the keystore.

gsk8capicmd_64 -cert -list -db /mnt/blumeta0/db2/keystore/keystore.p12 -stashed
Certificates found
* default, - personal, ! trusted, # secret key
#	DB2_SYSGEN_db2inst1_BLUDB_2021-03-19-13.57.02_A8CF4EED
Yes, the secret key we need still exists in the keystore. Now export it and transport it to the target host.
gsk8capicmd_64 -cert -export -db /tmp/keystore/keystore.p12 -stashed -label DB2_SYSGEN_db2inst1_BLUDB_2021-03-19-13.57.02_A8CF4EED -target secretkey.p12
Target database password :
The target database password is used to encrypt the secretkey.p12 file. On the target host, create an empty keystore and reconfigure DB2.
gsk8capicmd_64 -keydb -create -db /mnt/blumeta0/db2/keystore/keystore.p12   -pw "secret" -type pkcs12 -stash
db2 update dbm cfg using keystore_location  /mnt/blumeta0/db2/keystore/keystore.p12  keystore_type pkcs12
Now import the secret key into the newly created keystore.
gsk8capicmd_64 -cert -import  -db  /mnt/backup/secretkey.p12 -pw "secret"  -target   /mnt/blumeta0/db2/keystore/keystore.p12 -stashed -label DB2_SYSGEN_db2inst1_BLUDB_2021-03-19-13.57.02_A8CF4EED  
And now restore the database again.
db2 restore db bludb from /mnt/backup taken at 20210425102241 no encrypt
Success.
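
The keystore part of this procedure also lends itself to scripting. The sketch below merely wraps the gsk8capicmd_64 commands shown above with subprocess; the paths, password and label are the ones from this example, and the db2 update dbm cfg step still has to be executed as shown earlier.

import subprocess

# Values taken from the example above - replace them with your own.
LABEL = "DB2_SYSGEN_db2inst1_BLUDB_2021-03-19-13.57.02_A8CF4EED"
SRC_KEYSTORE = "/tmp/keystore/keystore.p12"                 # keystore on the source host
TGT_KEYSTORE = "/mnt/blumeta0/db2/keystore/keystore.p12"    # keystore on the target host
SECRETKEY = "/mnt/backup/secretkey.p12"                     # exported secret key

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def export_on_source_host():
    # Export the secret key; gsk8capicmd_64 prompts for the target keystore password.
    run(["gsk8capicmd_64", "-cert", "-export", "-db", SRC_KEYSTORE, "-stashed",
         "-label", LABEL, "-target", SECRETKEY])

def import_on_target_host():
    # Create an empty keystore, then import the secret key into it.
    run(["gsk8capicmd_64", "-keydb", "-create", "-db", TGT_KEYSTORE,
         "-pw", "secret", "-type", "pkcs12", "-stash"])
    run(["gsk8capicmd_64", "-cert", "-import", "-db", SECRETKEY, "-pw", "secret",
         "-target", TGT_KEYSTORE, "-stashed", "-label", LABEL])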

Wednesday, 31 March 2021

Java, Parquet and DB2 Warehouse in Cloud

I expanded my tool for moving data from a Parquet file into a JDBC database (blog). During my test, loading 6 million rows from a 200 MB Parquet file into DB2 Warehouse took around 30-40 minutes. Not so bad for a small data volume, but not feasible when dealing with terabytes of data.

In DB2 Warehouse on Cloud, there is a very effective method of loading data from a local file (local to the DB2 Warehouse host) or an Amazon S3 bucket. It takes around 30-40 seconds to load a 1.2 GB text file containing 6 million lines.

This load option is available from the DB2 Warehouse Console or through the DB2 Warehouse on Cloud REST API.

So the idea to boost loading performance is to use an AWS S3 bucket mounted in the local file system, deposit the delimited text files extracted from the Parquet file there, and call the DB2 Warehouse REST API to load the data from the AWS S3 bucket.

In a nutshell, it includes the following steps.

  • Mount the AWS S3 bucket in the local file system. "Local" means the host where the application is running (RedHat/CentOS).
  • The application reads the input Parquet file and produces several delimited text files in the mounted AWS S3 directory. For instance: /mnt/s3/sbtest/out1/export-0, /mnt/s3/sbtest/out1/export-1, /mnt/s3/sbtest/out1/export-2, /mnt/s3/sbtest/out1/export-3
  • When the delimited text files are ready, the application calls the DB2 Warehouse on Cloud REST API and loads the data from the AWS S3 bucket using the file names: /sbtest/out1/export-0, /sbtest/out1/export-1, /sbtest/out1/export-2, /sbtest/out1/export-3
  • After the data is loaded, the application can produce the next chunk of data and continue.
The loading time was reduced to 1 minute.

The current version of the application is available here.

Several remarks:
  • The LOAD commands run in parallel. In my test environment, the optimal level of parallelism was 4. Beyond this number, DB2 Warehouse reports loading errors; the problem requires further investigation.
  • Only a single LOAD command can be executed against a given table at a time, so every loading task loads data into a private staging table (CREATE TABLE TABLE-1 LIKE TABLE) and later inserts the data into the main table (INSERT INTO TABLE SELECT * FROM TABLE-1). The INSERT command runs very fast; it takes around 10 seconds to insert 100 MB of data. A rough sketch of this flow follows below.
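
A rough, hypothetical sketch of the per-chunk flow described above. The submit_s3_load and execute_sql helpers are placeholders (the real application uses the DB2 Warehouse on Cloud REST API and a JDBC connection respectively), and SALES is an illustrative table name; only the SQL pattern mirrors the remarks.

from concurrent.futures import ThreadPoolExecutor

def submit_s3_load(stage_table, s3_path):
    # Placeholder: in the real application this call goes to the
    # DB2 Warehouse on Cloud REST API and loads s3_path into stage_table.
    print(f"LOAD {s3_path} -> {stage_table}")

def execute_sql(sql):
    # Placeholder for running a statement over a JDBC/DB-API connection.
    print(sql)

def load_chunk(chunk_no, s3_path, table="SALES"):               # SALES is illustrative only
    stage = f"{table}_{chunk_no}"                               # private staging table for this task
    execute_sql(f"CREATE TABLE {stage} LIKE {table}")
    submit_s3_load(stage, s3_path)
    execute_sql(f"INSERT INTO {table} SELECT * FROM {stage}")   # fast: around 10 s per 100 MB
    execute_sql(f"DROP TABLE {stage}")

# Run at most 4 loads in parallel - the optimal level found in the tests.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, path in enumerate(["/sbtest/out1/export-0", "/sbtest/out1/export-1",
                              "/sbtest/out1/export-2", "/sbtest/out1/export-3"]):
        pool.submit(load_chunk, i, path)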




Sunday, 28 February 2021

Maven and GitHub

I spent half a day trying to upload my package to the GitHub Maven registry following the GitHub Maven guidelines. As usual, it was too good to be true to succeed immediately. Most of the time, I was haunted by a mysterious message that failed the mvn deploy command.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy (default-deploy) on project RestService: Failed to deploy artifacts: Could not transfer artifact com.restservice:RestService:jar:1.0-20210228.194713-1 from/to github (https://maven.pkg.github.com/stanislawbartkowski/RestService): Failed to transfer file: https://maven.pkg.github.com/stanislawbartkowski/RestService/com/restservice/RestService/1.0-SNAPSHOT/RestService-1.0-20210228.194713-1.jar. Return code is: 422, ReasonPhrase: Unprocessable Entity. -> [Help 1]

After several hours, I was at the point of giving up. Salvation came from Stack Overflow.

Could you try lower casing your artifact ID

That was the point: "Return code is: 422, ReasonPhrase: Unprocessable Entity" translates as "use lower case".

Wrong:

<artifactId>RestService</artifactId>

Correct:

 <artifactId>restservice</artifactId>


Java, Parquet and JDBC

It is strange, but it is almost impossible to access Parquet files outside a Hadoop/Spark context. I was trying to move data imprisoned in Parquet into a JDBC-accessed relational database using a standalone Java application and failed.

So I ended up with a Spark/Java application to address the issue.

Source and description: https://github.com/stanislawbartkowski/ParquetJDBC

The application loads Parquet-formatted data (a single file or a directory), partitions the data into several chunks, launches executors, and loads the data into the JDBC database in parallel. The number of partitions and executors is configurable.
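
The project itself is written in Java, but the core of the flow can be illustrated with an equivalent PySpark sketch; the path, JDBC URL, credentials and table below are placeholders, and the repartition count controls how many writers run in parallel.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-jdbc").getOrCreate()

# Placeholder values - replace with the real path, JDBC URL and credentials
df = spark.read.parquet("/data/input.parquet")      # a single file or a directory

(df.repartition(8)                                  # 8 chunks written in parallel
   .write
   .format("jdbc")
   .option("url", "jdbc:db2://dbhost:50000/BLUDB")
   .option("dbtable", "MYSCHEMA.MYTABLE")
   .option("user", "db2inst1")
   .option("password", "secret")
   .option("batchsize", 10000)                      # rows per JDBC batch insert
   .mode("append")
   .save())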

The application was tested in local and single-node Spark configurations. The next step is to configure and test the application in a distributed Hadoop environment.

Sunday, 31 January 2021

DB2 audit

DB2 audit is a powerful tool for supervising the usage of a DB2 instance and its databases. Some practical advice on how to set up and use DB2 audit is described here.

But DB2 audit by itself only collects data. The next step is to make practical use of it; there is no advantage in collecting data without analyzing it.

So I developed a simple solution to discover and escalate any suspicious behaviour. The solution and description are available here.

The solution consists of several bash scripts and does not require any additional dependency.

Two tasks are implemented:

  • Collecting audit records and moving them to an additional DB2 database ready for further analysis. This part can be executed as a crontab job.
  • Running investigative SQL queries on the audit database to discover suspicious and unexpected behaviour. This part can be executed on demand or as a crontab job. Examples:
    • An unauthorized user connected to the DB2 database.
    • A read-only user ran an update SQL statement.
    • A failed command reported as "not authorized", suggesting a user trying to exceed their authority.
Some example investigative queries are implemented already; any new query can be added.
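
As an illustration of what such an investigative query might look like, the sketch below flags connections made by users outside an allowed list. The audit database, the AUDIT.CONNECTIONS table and the AUTHID column are purely hypothetical names and have to be adjusted to the schema the solution actually creates.

import subprocess

# Hypothetical audit database, table and column names; adjust them to the
# schema actually created by the solution.
AUDITDB = "AUDITDB"
ALLOWED_USERS = ("APPUSER", "REPORTUSER")
SQL = ("SELECT AUTHID, COUNT(*) FROM AUDIT.CONNECTIONS "
       "WHERE AUTHID NOT IN ({}) GROUP BY AUTHID"
       .format(", ".join("'{}'".format(u) for u in ALLOWED_USERS)))

# The db2 CLP keeps its connection per shell session, so connect and query in one bash call.
cmd = 'db2 connect to {} > /dev/null && db2 -x "{}"'.format(AUDITDB, SQL)
result = subprocess.run(["bash", "-c", cmd], capture_output=True, text=True)

if result.stdout.strip():
    print("Suspicious connections detected:")
    print(result.stdout)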

The solution runs at the instance level, but investigative queries can be customized at the database level. In the case of several databases in a single instance, every database can come with its own security rules.

Every violation can be escalated using a customizable script. An example script reporting violations to a dedicated text file is available here.