Blog do projektu Open Source JavaHotel

sobota, 17 czerwca 2017

Dockerize DB2

Sometimes it is necessary to set up and remove DB2 instance quickly. So far, I was using KVM virtual machine with DB2 preinstalled. It works nicely, but a virtual machine, even KVM, comes with a huge and heavy overhead.
Another solution is to use docker. Docker can be used as a lightweight virtual machine, with much smaller footprint than a full-fledged virtual machine.
Here I'm describing the steps to run DB2 in a docker container. In this example, free DB2 Express-C edition is used but the pattern can be extended to any other DB2 edition.
After completing these simple steps, I have low profile DB2 instance ready to start and stop anytime if necessary.

Several pending tasks.
  • DB2 installation is performed manually. It is possible to automate this process through Dockerfile although the procedure is different depending on the DB2 edition.
  • DB2 instance should be started manually every time the container is restarted. Looking for the way to run it automatically.

wtorek, 30 maja 2017

BigInsights, docker

Problem
I've spent some time trying to dockerize BigInsights, IBM Open Platform. After resolving some issues, I was able to perform installation. Everything run smoothly except Spark installation. Although installation was reported as successful, Spark History Server did not start.

 File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 424, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 265, in action_delayed
    self._assert_valid()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 243, in _assert_valid
    raise Fail(format("Source {source} doesn't exist"))
resource_management.core.exceptions.Fail: Source /usr/iop/current/spark-historyserver/lib/spark-assembly.jar doesn't exist
It turned out that spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm did not unpack all files included. Some directories: /usr/iop/4.2.0.0/spark/lib and /usr/iop/4.2.0.0/spark/sbin were skipped. What is more interesting, while installing the package using rpm command directly, rpm -i spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm, all content of rpm was extracted correctly, while using yum command, yum install spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm, some directories were excluded without signaling any error. I spent sleepless night trying to get a clue.
Solution
I found the explanation here. There was a mistake in spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm package. Some files in the rpm were marked as 'documentation'. It was revealed by running rpm --dump command.
rpm -qp --dump spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm /usr/iop/4.2.0.0/spark/sbin/start-shuffle-service.sh 1279 1466126392 dfe89bfa493c263e4daa8217a9f22db12d6e9a9e1b161c5733acddc5d6b6498c 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/start-slave.sh 3151 1466126392 623bc623a3c92394cd4b44699ea3ab78b049149f10ee4b6f41d30ab2859f8395 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/start-slaves.sh 2061 1466126391 24f329f4cd7c48b8cbd52e87b33e1e17228b5ff97f1bcb5b403e1b538b17e32a 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/start-thriftserver.sh 1824 1466126392 fcef75ab00ef295ade0c926f584902291b3c06131dcb88786a5899e48de12bae 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-all.sh 1478 1466126392 efb2dc4fafed8d94d652c8cfd81f6ba59de6e9c6ae04da2e234e291f867f1d41 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-history-server.sh 1056 1466126393 8f74163405d9832f7f930ed00582dd89f3e6ffc1c6f3750e3a4a1639c63593ae 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-master.sh 1220 1466126391 ba5058a39699ae4d478dc1821fc999f032754b476193896991100761cd847710 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-mesos-dispatcher.sh 1112 1466126393 b30ce7366e5945f6c02494ce402bcebe5573c423d5eed646b0efc37a2dbc4a8c 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-mesos-shuffle-service.sh 1084 1466126393 6da69a8927513ed32fdb2d8088e3971596201595a84c9617aa1bdeefd0ef8de7 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-shuffle-service.sh 1067 1466126391 817ef1a4679c22a9bc3f182ee3e0282001ab23c1c533c12db3d0597abad81d58 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-slave.sh 1557 1466126392 cd0e35cd11b3452e902e117226e1ee851fc2cb7e2fcce8549c1c4f4ef591173e 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-slaves.sh 1298 1466126392 a3366c8ab6b142eb7caf46129db2e73e610a3689e3c3005023755212eb5c008c 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/sbin/stop-thriftserver.sh 1066 1466126391 53b9e9a886c03701d7b1973d2c4448c484de2b5860959f7824e83c4c2a48170b 0100755 root root 0 1 0 X /usr/iop/4.2.0.0/spark/work 19 1466127922 0000000000000000000000000000000000000000000000000000000000000000 0120777 root root 0 1 0 /var/run/spark/work /var/lib/spark 6 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X /var/log/spark 6 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X /var/run/spark 17 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X /var/run/spark/work 6 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X

The signature: root root 0 1 0 (mark 1) describes the file as "documentation". To shrink the space consumed by packages, the docker "centos" image contains "tsflags=nodocs" feature in /etc/yum.conf configuration file.
So the temporary workaround is to comment out this feature. To avoid loading unnecessary documentation, one can install Spark separately and have this patch in force only during the installation of this component.

środa, 10 maja 2017

Sqoop, Hive, load data incrementally

Introduction
Hive is a popular, SQL-like engine over HDFS data and Sqoop is a tool to transfer data from external RDBMS tables into HDFS. Sqoop simply runs SELECT query against RDBS table and the result is stored in HDFS or as a Hive table directly. After the first loading, the effective way to keep tables synchronized is to update Hive table incrementally in order to avoid moving all data again and again. Theoretically,  the task is simple. Assuming that external table has a primary key and source data are not updated or deleted, take the greatest key already inserted into Hive table and transfer only rows whose primary keys are greater than this threshold.
There is also an additional requirement. A very effective data format for Hive tables is Parquet but Sqoop can only create Hive tables in text format. There is --as-parquetfile Sqoop parameter but I failed to try to enable it for Hive tables.
Solution
The solution is uploaded here.
I decided to implement a two-hop solution. Firstly load delta rows into a staging table in text format using Sqoop and afterward insert rows into the target Parquet Hive table. The whole workflow can be described as follows:
  • Recognize if the target Hive table exists already. If yes, calculate the maximum value for the primary key.
  • Extract from external RDBMS table all rows with the primary key greater than maximum or the whole table if the Hive table does not exist yet. Store data into the staging table.
  • If the target Hive table does not exist, create the table in Parquet format. Execute Hive command "CREATE .. TABLE AS SELECT * FROM stage.table
  • If the target Hive table is created already, simply add new rows with command: INSERT INTO TABLE .. SELECT * FROM stage.table
The solution is implemented as Oozie workflow. Can be launched as a single Oozie task or as Oozie coordinator task. Sample shell scripts for both tasks are available here. common.properties file is used as a template for job.properties and coordinator.properties file.

poniedziałek, 27 lutego 2017

Data extraction tool

I refactored data extraction and loading tool. There is ant build.xml file to create distribution jar automatically and instruction how to recreate Eclipse project directly from GitHub to inspect and expand source code if necessary.
From a functional point of view, only one feature was added. hivedb key allows modifying the prefix for exporting tables in Hive format. In Hive there is no schema, so table name prefix is simply separate database name. By virtue of hivedb parameter, it is possible to change schema name taken from the source database or remove prefix at all and load Hive tables into default database directly.

poniedziałek, 30 stycznia 2017

New features implemented in Jython MVP framework, more Polymer

Introduction
I implemented next set of Polymer web components from GWT Vaadin Polymer demo. Demo version on Heroku platform is available here, full source code here. I added the rest of components for Paper elements, also Iron and Vaadin elements. Not everything is working perfectly, but the main bulk of work is done.














































Next step
I'm going to implement full support for list utilizing Vaadin grid components. I'm also inclined to rewrite all logic related to user interface. Assuming that user interface is supposed to be entirely Polymer-based, a lot of code supporting standard GWT widgets is out of use.

czwartek, 29 grudnia 2016

IBM InfoSphere Streams, BigSql and HBase

Introduction
Some time ago I created Java methods to load data into HBase using format understandable by BigSql. Now it is high time to move ahead and to create IBM InfoSphere operator making usage of this solution.
The Streams solution is available here. The short description is added here.
JConvert operator
JConvert operator does not load data into HBase, it should precede HBASEPut operator.
JConvert accepts one or more input streams and every input stream should have corresponding output stream. It simply encodes every attribute in the input stream to blob (binary) attribute in the output stream. The binary value is later loaded to HBase table by HBasePut operator.
Very important factor is to coordinate JConvert input stream with target HBase/BigSql table. Neither JConvert nor HBasePut can do that, if attributes and column types do not match then BigSql will not read the table properly. Conversion rules are explained here.
TestHBaseN
This operator is used for testing, it also contains a lot of usage examples.
Simple BigSql/HBase loading scenario.



On the left there is the producer, then JConvert translates all input attributes into binary format and HBasePut operator load binaries to HBase table.
More details about TestHBaseN.

poniedziałek, 26 grudnia 2016

Oracle -> DB2 migration, useful Java program

Introduction
Some time ago I created a set of useful awk scripts to ease Oracle to DB2 migration. But facing the same task, I decided to rewrite it as Java program. The main disadvantage of the previous solution is that awk analyzes input file line after line. So, for an example, if CREATE OR REPLACE clause is split into two or more lines, the problem starts to be very complicated.
Java program
Java program is available as ready to use Java package or as an Eclipse project to be cloned and updated.
Java solution and a brief description are available as an open source project here. The "two lines problem" is resolved by means of simple Tokenizer. It decomposes input file to separate entities across lines boundaries: words, special characters like (, ),; etc. Then Extractor can pick up words one after one and recognize the beginning and end of particular SQL objects. Tokenizer is also reused during source file fixing.
Fixing
Oracle Compatibility Mode in DB2 allows executing PL/SQL almost out of the box. Nevertheless, some common adjustment should be applied. A typical example is VARCHAR2(32767) or VARCHAR2(MAX). In DB2 the MAX is not supported and boundary for VARCHAR2 is 32672. If there are several occurrences of it no problem to fix it manually, but it could be the challenge in the case of hundreds or thousands of them.
So after splitting the input Oracle source file into separate SQL object, a set of adjustment is executed over them. Every adjustment is a single Java class implementing IFix interface. Current adjustments are stored in org.migration.fix.impl package. Every class should be registered in MainExtract.
This way, assuming that project is cloned as Eclipse project, adjustments can be easily extended.
DB2 comparison
As a part of the solution not covered by previous awk implementation is DB2 comparison tool. It compares the list of Oracle objects against the list of objects migrated and deployed into DB2 database. In the case of hundreds or thousands of objects, it is easy to miss something.
Conclusion
I was using this solution during real migration and found it very useful. It was worth spending several days to implement it. The most important part is the feature allowing easy extension of the existing set of source file adjustment. I was able to quickly implement next adjustment (together with Junit test) immediately after finding next problem in source Oracle code.