Blog for the Open Source JavaHotel project

Wednesday, 31 March 2021

Java, Parquet and DB2 Warehouse in Cloud

I expanded my tool for moving data from a Parquet file into a JDBC database (blog). During my test, loading 6 million rows from a 200 MB Parquet file into DB2 Warehouse took around 30-40 minutes. Not bad for a small data volume, but not feasible when dealing with terabytes of data.
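A copier like this typically pushes rows through plain JDBC inserts, which explains the modest throughput. Below is a minimal sketch of such a batched insert path; the connection URL, table name, columns and batch size are illustrative assumptions, not the actual schema or code of the tool.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Minimal sketch of a batched JDBC insert path (illustrative table and columns).
public class JdbcBatchInsert {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://<host>:50001/BLUDB:sslConnection=true;";
        try (Connection con = DriverManager.getConnection(url, "<user>", "<password>");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO SALES (ID, AMOUNT) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            for (int i = 0; i < 6_000_000; i++) {      // rows decoded from the Parquet file
                ps.setInt(1, i);
                ps.setDouble(2, i * 0.01);
                ps.addBatch();
                if ((i + 1) % 10_000 == 0) {           // flush every 10,000 rows
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch();
            con.commit();
        }
    }
}

Even with batching, every row still travels through the JDBC layer one statement at a time, which is why the LOAD-based approach described below is so much faster.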

DB2 Warehouse on Cloud offers a very effective method of loading data from a file that is local to the DB2 Warehouse host or from an Amazon S3 bucket. It takes around 30-40 seconds to load a 1.2 GB text file containing 6 million lines.

This load option is available from the DB2 Warehouse Console or through the DB2 Warehouse on Cloud REST API.
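For illustration, here is a hedged sketch of driving such a load through the REST API from Java. The endpoint paths (/dbapi/v4/auth/tokens, /dbapi/v4/load_jobs) and the JSON payload fields are assumptions modelled on the public Db2 on Cloud REST API; check the current API reference before relying on them.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch: authenticate, then submit a LOAD job that reads a delimited
// file from an S3 bucket. Host, credentials, endpoints and payload fields are
// assumptions for illustration only.
public class Db2RestLoad {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String host = "https://<db2-warehouse-host>";

        // 1. Obtain a bearer token (assumed /dbapi/v4/auth/tokens endpoint).
        HttpRequest auth = HttpRequest.newBuilder(URI.create(host + "/dbapi/v4/auth/tokens"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"userid\":\"<user>\",\"password\":\"<password>\"}"))
                .build();
        String authJson = client.send(auth, HttpResponse.BodyHandlers.ofString()).body();
        // Crude extraction of the assumed "token" field; a real client would use a JSON parser.
        String token = authJson.replaceAll(".*\"token\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // 2. Submit the load job (assumed /dbapi/v4/load_jobs endpoint and payload shape).
        String payload = "{ \"load_source\": \"S3\", \"load_action\": \"INSERT\","
                + " \"schema\": \"<schema>\", \"table\": \"<table>\","
                + " \"file_options\": { \"code_page\": \"1208\", \"column_delimiter\": \",\" },"
                + " \"cloud_source\": { \"endpoint\": \"s3.amazonaws.com\","
                + " \"path\": \"/sbtest/out1/export-0\","
                + " \"auth_id\": \"<aws-access-key>\", \"auth_secret\": \"<aws-secret-key>\" } }";
        HttpRequest load = HttpRequest.newBuilder(URI.create(host + "/dbapi/v4/load_jobs"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + token)
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        System.out.println(client.send(load, HttpResponse.BodyHandlers.ofString()).body());
    }
}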

So the idea for boosting loading performance is to mount an AWS S3 bucket in the local file system, write the delimited text files extracted from the Parquet file there, and call the DB2 Warehouse REST API to load the data from the AWS S3 bucket.

In a nutshell, it involves the following steps.

  • Mount the AWS S3 bucket in the local file system. "Local" means the host where the application is running (RedHat/CentOS).
  • The application reads the input Parquet file and produces several delimited text files in the AWS S3 mounted directory, for instance: /mnt/s3/sbtest/out1/export-0, /mnt/s3/sbtest/out1/export-1, /mnt/s3/sbtest/out1/export-2, /mnt/s3/sbtest/out1/export-3 (see the sketch after this list).
  • When the delimited text files are ready, the application calls the DB2 Warehouse on Cloud REST API (as sketched above) and loads the data from the AWS S3 bucket using the file names: /sbtest/out1/export-0, /sbtest/out1/export-1, /sbtest/out1/export-2, /sbtest/out1/export-3.
  • After the data is loaded, the application can produce the next chunk of data and continue.
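The chunk-producing step could look roughly like the sketch below, which reads the Parquet file with parquet-avro and writes comma-delimited files into the mounted directory. The input path, chunk size and the naive CSV formatting (no quoting or null handling) are assumptions for illustration; the real tool is linked below.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: split a Parquet file into delimited text chunks written to the
// S3-mounted directory; paths and chunk size are illustrative assumptions.
public class ParquetToChunks {
    public static void main(String[] args) throws Exception {
        final long rowsPerChunk = 1_500_000;
        int chunk = 0;
        long rowsInChunk = 0;
        try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path("input.parquet")).build()) {
            BufferedWriter out = Files.newBufferedWriter(
                    Paths.get("/mnt/s3/sbtest/out1/export-" + chunk));
            GenericRecord record;
            while ((record = reader.read()) != null) {
                StringBuilder line = new StringBuilder();
                for (Schema.Field f : record.getSchema().getFields()) {
                    if (line.length() > 0) line.append(',');
                    line.append(record.get(f.name()));     // naive formatting, no quoting
                }
                out.write(line.toString());
                out.newLine();
                if (++rowsInChunk == rowsPerChunk) {        // close export-<n>, start export-<n+1>
                    out.close();
                    // here the application would submit a LOAD for /sbtest/out1/export-<chunk>
                    chunk++;
                    rowsInChunk = 0;
                    out = Files.newBufferedWriter(
                            Paths.get("/mnt/s3/sbtest/out1/export-" + chunk));
                }
            }
            out.close();
        }
    }
}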
The loading time was reduced to 1 minute.

The current version of the application is available here.

Several remarks:
  • The load commands run in parallel. In my test environment, the optimal level of parallelism was 4. Beyond this number, DB2 Warehouse reports loading errors; the problem requires further investigation.
  • Only a single LOAD command can be executed against a given table at a time. So every loading task loads data into its own private table (CREATE TABLE TABLE-1 LIKE TABLE) and later inserts the data into the main table (INSERT INTO TABLE SELECT * FROM TABLE-1). The INSERT command runs very fast; it takes around 10 seconds to insert 100 MB of data.
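A minimal JDBC sketch of that private-table pattern, as run by each loading task; the connection URL, table names and the cleanup DROP are illustrative assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the per-task private-table pattern: clone the target table,
// LOAD into the clone, then append the rows to the main table.
public class PrivateTableMerge {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://<host>:50001/BLUDB:sslConnection=true;";
        int taskId = 1;                      // e.g. one of the 4 parallel loading tasks
        try (Connection con = DriverManager.getConnection(url, "<user>", "<password>");
             Statement st = con.createStatement()) {
            st.executeUpdate("CREATE TABLE SALES_" + taskId + " LIKE SALES");
            // ... submit the REST LOAD for /sbtest/out1/export-<taskId> into SALES_<taskId> ...
            // appending the loaded rows is fast: around 10 seconds per 100 MB
            st.executeUpdate("INSERT INTO SALES SELECT * FROM SALES_" + taskId);
            st.executeUpdate("DROP TABLE SALES_" + taskId);   // cleanup (assumed)
        }
    }
}

Several such tasks could run concurrently, for example from a fixed thread pool of size 4, matching the level of parallelism reported above.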