Blog for the Open Source JavaHotel project

Wednesday, December 27, 2017

Next version of Civilization The Board Game

Introduction
I deployed a new version of my computer implementation of  Civilization The Board Game. The implementation consists of three parts:
New features
  • Players can explore huts
  • Players can attack and take over villages
  • Fights between players in a two-player game
  • Improvements
Exploring huts
To explore a hut, move a figure to a square adjacent to the hut and click the "Explore" button.

The hut then appears in the player's resource panel on the left. Click the resource panel to discover the resource guarded by the hut.

Attacking villages
In order to take over a village, the player has to position a figure next to the village and click the "Attack" button.

The next step is to start the battle.
The last step, last but not least, is to win the battle!
In the training, single-player game, the player has to play the roles of both attacker and defender. In a two-player game, the opponent has to fight as the barbarian.
To fight, just drag a unit from the waiting list to the battle zone. The game engine resolves the result of the battle.
This time you win. But be careful, next time you may lose as well.
Two player battle
In a two-player game, one player can attack the other.


The only difference compared to a village battle is that the victor can take loot from the loser.
Improvements
I removed the confirmation for all movements ("Are you sure?"). The game runs more smoothly, but there is a greater risk of mistakes.
Problems
The game runs slowly and awkwardly on Heroku in two-player mode. In a two-player game, the client constantly polls the server at 0.5-second intervals to discover changes caused by the other player's actions. The server reacts slowly, causing noticeable latency. Some improvements are necessary.
Next steps
  • Research implementation
  • Improvements


Thursday, November 30, 2017

HDP 2.6.3, Ambari, Python

I was hit by a very nasty error while installing the latest HDP 2.6.3 on RedHat 7.3. The installation failed with the following error message:
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/check_host.py", line 170, in actionexecute
    installed_packages, repos = self.execute_existing_repos_and_installed_packages_check(config)
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/check_host.py", line 233, in execute_existing_repos_and_installed_packages_check
    installedPackages = self.pkg_provider.all_installed_packages()
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/yumrpm.py", line 222, in all_installed_packages
    return self._get_installed_packages(None)
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/yumrpm.py", line 157, in _get_installed_packages
    packages = self._lookup_packages([AMBARI_SUDO_BINARY, "yum", "list", "installed"], "Installed Packages")
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/yumrpm.py", line 191, in _lookup_packages
    if items[i + 2].find('@') == 0:
IndexError: list index out of range
No handlers could be found for logger "ambari_agent.HostCheckReportFileHandler"
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostCheckReportFileHandler.py", line 74, in writeHostChecksCustomActionsFile
    items.append(itemDetail['name'])
TypeError: string indices must be integers, not str
It took me some time to get to the bottom of it. It turned out that the culprit was the additional lines attached to the end of the "yum list installed" output.
yum-metadata-parser.x86_64        1.1.4-10.el7               @anaconda/7.3
yum-rhn-plugin.noarch             2.0.1-6.1.el7_3            @rhel-7-server-rpms
zip.x86_64                        3.0-11.el7                 @anaconda/7.3
zlib.x86_64                       1.2.7-17.el7               @anaconda/7.3
Uploading Enabled Repositories Report
Loaded plugins: product-id
Judging from the Python source code:
    for i in range(0, len(items), 3):

        if '.' in items[i]:
          items[i] = items[i][:items[i].rindex('.')]
        if items[i + 2].find('@') == 0:
          items[i + 2] = items[i + 2][1:]
        packages.append(items[i:i + 3])

    return packages
the installer expects the number of items returned by "yum list" to be divisible by 3, and the two extra lines at the end obviously break this assumption. So the (working) solution was a manual fix in /usr/lib/ambari-agent/lib/resource_management/core/providers/package/yumrpm.py. Important: the fix goes into this file, not into the file named in the stack trace (/usr/lib/python2.6/site-packages/resource_management/core/providers/package/yumrpm.py), because the installer overwrites the script in the /usr/lib/python2.6/site-packages/resource_management/core/providers/package directory every time. After the fix, the aforementioned piece of code looks like this:
    for i in range(0, len(items), 3):

        # stop when the trailing non-package lines leave an incomplete triple
        if i + 2 >= len(items):
          break

        if '.' in items[i]:
          items[i] = items[i][:items[i].rindex('.')]
        if items[i + 2].find('@') == 0:
          items[i + 2] = items[i + 2][1:]
        packages.append(items[i:i + 3])

    return packages

Saturday, October 28, 2017

Next version of Civilization The Board Game

Introduction
I deployed the next version of my computer implementation of Civilization The Board Game. The implementation consists of three parts:
New features
  • spend trade for production and undo spend
  • scout can send production to the city and undo sending
  • buying units
Spend trade for production

Just specify the amount of production you want to get, and the trade is reduced accordingly. So you can do your shopping as you wish. The player can also undo the last spending unless it has already been used.
Send production from square to the city

Click the city and choose the scout you want to send production from.
The player can also undo this action unless the production is spent.
Click the Undo button to the right of the grayed-out "Send Production" button.
Buying units

The player can buy units, although battles are not implemented yet. The unit panel shows only the number of units of a particular type. After clicking the panel, a detailed list of units, including their strength, is revealed. Of course, this option is not available to the opposite player. Also, in the market panel, any player can see the detailed list of killed units, but not the units still waiting to be taken.
Next steps
  • Resource harvesting including friendly villages
  • Battle

Thursday, October 26, 2017

HDP, BigInsights, Kafka, Kerberos

I spent several hours resolving a nasty problem which came up after enabling Kerberos security. Suddenly, the command-line kafka-topic utility refused to cooperate:
[2017-10-26 23:31:17,424] WARN Could not login: the client is being asked for a password, but the Zookeeper client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this Zookeeper client using the command 'kinit ' (where  is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t  ' (where  is the name of the Kerberos principal, and  is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock. (org.apache.zookeeper.client.ZooKeeperSaslClient)
[2017-10-26 23:31:17,426] WARN SASL configuration failed: javax.security.auth.login.LoginException: No password provided Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. (org.apache.zookeeper.ClientCnxn)
Exception in thread "main" org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
 at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:946)
 at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:923)
 at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1230)
 at org.I0Itec.zkclient.ZkClient.(ZkClient.java:156)
 at org.I0Itec.zkclient.ZkClient.(ZkClient.java:130)
 at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:76)
 at kafka.utils.ZkUtils$.apply(ZkUtils.scala:58)
 at kafka.admin.TopicCommand$.main(TopicCommand.scala:53)
 at kafka.admin.TopicCommand.main(TopicCommand.scala)

The reason is quite simple. To communicate with the underlying ZooKeeper, Kafka uses /etc/security/keytabs/kafka.service.keytab. By default, this file has permission 400, so only the kafka user can access it.
The solution is to change the permission to 440; the security is softened a little bit, but the file is still protected. A user wanting to create a Kafka topic should belong to the hadoop group.

Monday, October 9, 2017

Hadoop and SQL engines

Every Hadoop distribution comes with several SQL engines, so I decided to create a simple test to compare them: run the same queries against the same data set. So far I have been working with two Hadoop distributions, BigInsights 4.x and HDP 2.6.2 with Big SQL 5.0.1, the latter being the successor of BigInsights.
I was comparing the following SQL engines:

  • MySQL, embedded
  • Hive against data in different formats: text files, Parquet and ORC
  • Big SQL on Hive tables, Parquet and ORC
  • Spark SQL
  • Phoenix, a SQL engine for HBase.
It is not any kind of benchmarking; the purpose is not to prove the superiority of one SQL engine over another. I also haven't done any tuning or reconfiguration to speed things up. The goal was just to conduct a simple check after installation and have several numbers at hand.
The test description and several results are here.
Although I do not claim any ultimate authority here, I can provide several conclusions.
  • Big SQL is the winner, particularly compared to Hive. Very important: Big SQL runs on the same physical data; the only difference is the computational model. It even beats MySQL. But, of course, MySQL will get the upper hand for OLTP requests.
  • Hive behaves much better when paired with Tez. On the other hand, the execution time is very volatile and can change drastically from one run to another.
  • Spark SQL is in a league of its own; it is hard to outmatch in-memory execution.
  • Phoenix SQL is at the end of the race, but its execution time is very stable.


Saturday, September 30, 2017

Visualize car data with Brunel and Scala

There is a sample notebook in IBM Data Science Experience, "Visualize car data with Brunel", but it is written in Python (PySpark). So I transformed it into Scala, obtaining the same result but using Scala syntax. I added some comments to explain the code.
The result is published here.
To run it:

  • Download the Cars+.ipynb notebook
  • Upload it to Jupyter with the Apache Toree - Scala (Spark) kernel enabled
  • Enjoy

Tuesday, September 26, 2017

Next version of Civilization The Board Game

Introduction
I deployed the next version of my computer implementation of Civilization The Board Game. The implementation consists of three parts:
New features
I implemented a new, more user-friendly interface.
Single user, training game.

Two-player game, a real melee battle.
The opponent's deck is inactive, informative only.
More user-friendly features
If a figure (scout or army) is at the corner between two hidden tiles, a dialog to select the one to be revealed comes up.
Figures can be stacked. If a player decides to move stacked figures, a dialog to select the figures to move pops up.
After selecting an action in the left panel, the squares where the action can be conducted are highlighted.
Next steps
Implement
  • Spend trade to rush the production
  • Send production from scout to city
  • Buying units
  • Battles (?)

Saturday, September 23, 2017

New version of Civilization the Board Game

Introduction
I deployed the next version of my computer implementation of Civilization The Board Game. The implementation consists of three parts:
New features
  • Game progress saved
  • Game resume
  • Two players game
Game progress saved
The game is saved constantly to a Redis key/value database. This is done automatically in the background, in a transparent way; there is no "Save" button. The server side of the game is stateless, it is only a computational engine, and the Redis datastore works as a memory cache. Redis is accessed through an interface, and I'm going to prepare an HBase version as a warm-up. Statelessness is a very important feature: in the imaginative future when thousands of players are swarming, load balancing and traffic redirection can be applied.
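Since the datastore is hidden behind an interface, the HBase version should only require another implementation of the same contract. Below is a minimal sketch of what such an interface could look like; the trait and class names are made up for illustration, they are not the actual names used in the project.

trait GameStore {
  // store the serialized game (JSON) under the game token
  def save(token: String, stateJson: String): Unit
  // restore the serialized game, if it exists
  def restore(token: String): Option[String]
  // drop a finished or abandoned game
  def remove(token: String): Unit
}

// In-memory stand-in, useful for tests; the real implementation wraps a Redis client,
// and an HBase-backed one can be added later without touching the engine.
class InMemoryGameStore extends GameStore {
  private val games = scala.collection.mutable.Map[String, String]()
  def save(token: String, stateJson: String): Unit = games(token) = stateJson
  def restore(token: String): Option[String] = games.get(token)
  def remove(token: String): Unit = games -= token
}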

Game resume
A game can be left and resumed at any time. Just select a game and the player is pushed immediately into the middle of the battle.

A two-player game can also be resumed, but the second player has to join to continue the game.
Two players game
Two options for creating a new game are available: "Training" and "Two players game".

"Training" is a single-player game, just to get one's hand in. "Two players game" is a more serious matter: test yourself in a hand-to-hand battle. Select the opponent civilization you want to fight against and wait for a contender.

Joining the game
Select a game from the waiting list, and the contender, together with the one who threw down the gauntlet, is moved into the middle of the duel.

Leaving the game
The player can leave the game by clicking the close icon in the upper right corner of the board.
Leaving the game legally unlocks it and removes it from the waiting list. But because the game is saved automatically while playing, even after abruptly closing the browser the game can be resumed at exactly the same stage. The game is removed from the waiting list after 24 hours of inactivity.
Next steps
Make the map and the player deck more user-friendly.

Saturday, September 9, 2017

DB2, UTL_ENCODE, BASE64_DECODE, BASE64_ENCODE

I created an implementation of two methods from the Oracle UTL_ENCODE package. It is implemented as a Java UDF function and a DB2 module. In Java, it is simply a matter of using the JVM Base64 class. More time consuming was preparing the DB2 signatures and tests according to this article.
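The JVM side boils down to java.util.Base64. A minimal sketch of the calls the UDF delegates to (shown in Scala for brevity; the actual UDF is plain Java, and the real DB2 signatures operate on database types, not plain strings):

import java.nio.charset.StandardCharsets
import java.util.Base64

// encode a value the way BASE64_ENCODE would, then decode it back
val encoded: String = Base64.getEncoder.encodeToString("Hello DB2".getBytes(StandardCharsets.UTF_8))
val decoded: String = new String(Base64.getDecoder.decode(encoded), StandardCharsets.UTF_8)
// decoded is again "Hello DB2"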
Full source code is here.

Sunday, August 27, 2017

Civilization The Board Game

Introduction
Some time ago, I became a fan of Civilization The Board Game. I find it more engaging, dynamic and enthralling than the computer game. It is like comparing real melee combat with a bureaucratic war waged behind an office desk.
An idea stirred me up: move the game to the computer screen. Avoid the stuff piling up on the table, train and test ideas without spreading out the board game, and allow players in remote locations to fight each other.
For the time being, my idea ended up in two projects.
CivilizationEngine here
Civilization UI here
Demo version on Heroku: https://civilizationboardgame.herokuapp.com/  (wait a moment until dyno is activated, it is a free quota).
Each project comes with its own build.xml file allowing the creation of the target artifact.
General design principles
The solution consists of two separate projects: Civilization Engine and Civilization UI. I decided that all game logic and state are managed by the back-end engine. The UI, as the name suggests, is focused only on displaying the board and allowing the user to execute a command. The command is sent to the server, the server changes the game state, and the UI receives the current game state and updates the screen.
The game is nothing more than moving from one game state to another. Every change is triggered by a command. At any moment, it is possible to restore the current game state by setting up the initial board and replaying all commands up to that point.
Data is transmitted between engine and UI in JSON format.
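In other words, the current game state is just a fold over the command log. A rough sketch of the idea; the types and method bodies below are made up, the real engine keeps a full game board rather than a bare command list.

case class Command(name: String, param: String)
case class GameState(log: List[Command])

// applying a command yields a new state; the engine never mutates the old one
def execute(state: GameState, cmd: Command): GameState =
  state.copy(log = state.log :+ cmd)

// replaying all commands from the initial board reconstructs the state at any point
def replay(initial: GameState, commands: Seq[Command]): GameState =
  commands.foldLeft(initial)(execute)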
Civilization Engine
Civilization Engine is created as an IntelliJ IDEA Scala project and can be imported directly from GitHub.
Why Scala? I found it very appropriate here. Most of the operations involve walking through lists, looking things up, filtering and mapping, and Scala is an excellent tool for that. If I had decided to use Java, the code would probably have doubled in size, even with Java 8 streaming features.
I'm very fond of this method (full source):
  def itemizeForSetSity(b: GameBoard, civ: Civilization.T): Seq[P] =
    getFigures(b, civ).filter(_.s.figures.numberofScouts > 0).map(_.p).filter(p => SetCityAction.verifySetCity(b, civ, p, Command.SETCITY).isEmpty)
It yields all points where a new city can be set.
  • Find all figures on the board belonging to a civilization
  • Single out squares with at least one scout
  • Map squares to points
  • Verify if the point is eligible for city setting using SetCityAction.verifySetCity
All stuff in a single line.
A general outline of the project
  • resources, game objects (JSON format) used in the game: tiles, squares, objects (now TECHNOLOGIES only)
  • gameboard , class definitions
  • objects , enumerations and classes related to game artifacts
  • helper, game logic; I found it more convenient to put it in a helper object than as methods in the Gameboard class
  • io, methods for reading and writing data in JSON format; I'm using the PlayJSON package as a dependency
  • I, external interface
Brief interface description
  • getData(LISTOFCIV), lists the civilizations available
  • getData(REGISTEOWNER), generates a new game and returns a unique token to be used in further communication
  • getData(GETBOARDGAME), returns the current game state
  • executeCommand, executes the next command
  • itemizeCommand, provides all possible parameters for a particular command; for instance, for StartOfMove it yields all points where a figure movement is allowed to commence (a usage sketch follows below)
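A rough sketch of one client round trip through this interface. The stubs below only stand in for the real methods; the actual signatures and JSON payloads in the engine differ.

// hypothetical stand-ins for the engine's external interface
def getData(what: String, token: String = ""): String = "{}"
def itemizeCommand(token: String, command: String): Seq[String] = Seq("""{"row":1,"col":2}""")
def executeCommand(token: String, command: String, param: String): Unit = ()

val token  = getData("REGISTEOWNER")           // register a new game, receive the unique token
val board  = getData("GETBOARDGAME", token)    // current game state as JSON
val places = itemizeCommand(token, "SETCITY")  // all legal parameters for the command
executeCommand(token, "SETCITY", places.head)  // execute the chosen command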
So far, only a few commands are implemented
  • SetCapital
  • SetArmy
  • SetScout
  • EnfOfPhase
  • BuyScout
  • BytArmy
  • MoveFigure
  • RevealTile
  • SetCity
User Interface
For the time being, it is so ugly that only a mother or father could love it. More details: look here.

Next steps
Implementation of game persistence. Because of a Heroku limitation, I cannot use the disk file system as a means of storage. I'm planning to use Redis; there is a free quota for this service on Heroku. Redis will be used to store the games and also as a cache. This way, the server part will be completely stateless: every step will consist of restoring the game from Redis, executing a command and storing the updated game back to Redis.
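Every request then boils down to the same three steps. A minimal sketch of that cycle, with the Redis access and the engine passed in as plain functions (hypothetical, just to show the shape of a stateless request):

def handleRequest(token: String, commandJson: String,
                  restore: String => String,            // read the game JSON from Redis by token
                  execute: (String, String) => String,  // apply the command to the game state
                  store: (String, String) => Unit       // write the updated JSON back to Redis
                 ): Unit = {
  val game    = restore(token)
  val updated = execute(game, commandJson)
  store(token, updated)
}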

Monday, July 31, 2017

Hosts, simple bash tool to run commands on several hosts

BigInsights (IBM Hadoop) requires a number of prerequisites to keep the cluster consistent. For instance, the /etc/hosts file should be the same on all hosts. There are plenty of tools available, but finally I decided to create a small tool of my own.
The tool is available here (branch hosts).
It is a set of simple bash procedures that allow copying files and executing a single command on all hosts in the cluster; for instance, installing a required package with yum on all hosts.
Basically, two main tasks are implemented:

  • Share file across hosts (for instance /etc/hosts)
  • Run a single command on all hosts (for instance yum install package)
These two simple tools cover almost all tasks necessary to prepare and run a multi-host installation of BigInsights and IBM Streams (a sketch of the idea follows below).
More details and description here.
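The actual tool is a set of bash procedures, but the idea behind the second task is trivial to sketch; the host list and the ssh invocation below are only illustrative and assume passwordless ssh between the hosts.

import scala.sys.process._

val hosts = Seq("node1", "node2", "node3")   // hypothetical cluster host list

// run the same command on every host over ssh, e.g. runOnAllHosts("yum install -y wget")
def runOnAllHosts(command: String): Unit =
  hosts.foreach(host => s"ssh $host $command".!)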

Saturday, July 22, 2017

Dockerize IBM Streams

It is very convenient to run IBM Streams in a Docker container to avoid the huge VM overhead. There is one project available already, but my plan is not so ambitious.
The solution is described here; the Dockerfile is also available there. It is not full automation, rather several pieces of advice on how to set up a Docker container with an IBM Streams domain and instance running inside: just a lightweight virtual machine that is easy to set up and tear down.
But it comes with one serious limitation: only a single-host, standalone installation is possible. A multi-host installation requires resolving IP-DNS mapping, and I failed to overcome this problem.
Still, a standalone installation is enough for development, testing and evaluation. I will keep trying to support a multi-host setup as well.

Saturday, June 17, 2017

Dockerize DB2

Sometimes it is necessary to set up and remove a DB2 instance quickly. So far, I have been using a KVM virtual machine with DB2 preinstalled. It works nicely, but a virtual machine, even KVM, comes with a huge and heavy overhead.
Another solution is to use Docker. A Docker container can serve as a lightweight virtual machine, with a much smaller footprint than a full-fledged one.
Here I'm describing the steps to run DB2 in a Docker container. In this example, the free DB2 Express-C edition is used, but the pattern can be extended to any other DB2 edition.
After completing these simple steps, I have a low-profile DB2 instance ready to start and stop any time it is necessary.

Several pending tasks.
  • The DB2 installation is performed manually. It is possible to automate this process through a Dockerfile, although the procedure differs depending on the DB2 edition.
  • The DB2 instance has to be started manually every time the container is restarted. I'm looking for a way to run it automatically.

Tuesday, May 30, 2017

BigInsights, docker

Problem
I've spent some time trying to dockerize BigInsights, IBM Open Platform. After resolving some issues, I was able to perform the installation. Everything ran smoothly except the Spark installation. Although the installation was reported as successful, the Spark History Server did not start.

 File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 424, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 265, in action_delayed
    self._assert_valid()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 243, in _assert_valid
    raise Fail(format("Source {source} doesn't exist"))
resource_management.core.exceptions.Fail: Source /usr/iop/current/spark-historyserver/lib/spark-assembly.jar doesn't exist
It turned out that spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm did not unpack all the files it contained; some directories, /usr/iop/4.2.0.0/spark/lib and /usr/iop/4.2.0.0/spark/sbin, were skipped. What is more interesting, when installing the package with the rpm command directly (rpm -i spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm), all the content of the rpm was extracted correctly, while with the yum command (yum install spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm) some directories were excluded without any error being signaled. I spent a sleepless night trying to get a clue.
Solution
I found the explanation here. There was a mistake in the spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm package: some files in the rpm were marked as 'documentation'. This was revealed by running the rpm --dump command.
rpm -qp --dump spark-core_4_2_0_0-1.6.1_IBM-000000.el7.noarch.rpm
/usr/iop/4.2.0.0/spark/sbin/start-shuffle-service.sh 1279 1466126392 dfe89bfa493c263e4daa8217a9f22db12d6e9a9e1b161c5733acddc5d6b6498c 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/start-slave.sh 3151 1466126392 623bc623a3c92394cd4b44699ea3ab78b049149f10ee4b6f41d30ab2859f8395 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/start-slaves.sh 2061 1466126391 24f329f4cd7c48b8cbd52e87b33e1e17228b5ff97f1bcb5b403e1b538b17e32a 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/start-thriftserver.sh 1824 1466126392 fcef75ab00ef295ade0c926f584902291b3c06131dcb88786a5899e48de12bae 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-all.sh 1478 1466126392 efb2dc4fafed8d94d652c8cfd81f6ba59de6e9c6ae04da2e234e291f867f1d41 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-history-server.sh 1056 1466126393 8f74163405d9832f7f930ed00582dd89f3e6ffc1c6f3750e3a4a1639c63593ae 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-master.sh 1220 1466126391 ba5058a39699ae4d478dc1821fc999f032754b476193896991100761cd847710 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-mesos-dispatcher.sh 1112 1466126393 b30ce7366e5945f6c02494ce402bcebe5573c423d5eed646b0efc37a2dbc4a8c 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-mesos-shuffle-service.sh 1084 1466126393 6da69a8927513ed32fdb2d8088e3971596201595a84c9617aa1bdeefd0ef8de7 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-shuffle-service.sh 1067 1466126391 817ef1a4679c22a9bc3f182ee3e0282001ab23c1c533c12db3d0597abad81d58 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-slave.sh 1557 1466126392 cd0e35cd11b3452e902e117226e1ee851fc2cb7e2fcce8549c1c4f4ef591173e 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-slaves.sh 1298 1466126392 a3366c8ab6b142eb7caf46129db2e73e610a3689e3c3005023755212eb5c008c 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/sbin/stop-thriftserver.sh 1066 1466126391 53b9e9a886c03701d7b1973d2c4448c484de2b5860959f7824e83c4c2a48170b 0100755 root root 0 1 0 X
/usr/iop/4.2.0.0/spark/work 19 1466127922 0000000000000000000000000000000000000000000000000000000000000000 0120777 root root 0 1 0 /var/run/spark/work
/var/lib/spark 6 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X
/var/log/spark 6 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X
/var/run/spark 17 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X
/var/run/spark/work 6 1466127905 0000000000000000000000000000000000000000000000000000000000000000 040755 spark spark 0 0 0 X

The signature "root root 0 1 0" (the 1 flag) marks the file as "documentation". To shrink the space consumed by packages, the Docker "centos" image has the "tsflags=nodocs" option in the /etc/yum.conf configuration file.
So the temporary workaround is to comment out this option. To avoid loading unnecessary documentation, one can install Spark separately and keep this patch in force only during the installation of this component.

Wednesday, May 10, 2017

Sqoop, Hive, load data incrementally

Introduction
Hive is a popular SQL-like engine over HDFS data, and Sqoop is a tool to transfer data from external RDBMS tables into HDFS. Sqoop simply runs a SELECT query against the RDBMS table, and the result is stored in HDFS or directly as a Hive table. After the first load, the effective way to keep tables synchronized is to update the Hive table incrementally, in order to avoid moving all the data again and again. Theoretically, the task is simple: assuming that the external table has a primary key and source data are not updated or deleted, take the greatest key already inserted into the Hive table and transfer only the rows whose primary keys are greater than this threshold.
There is also an additional requirement: a very effective data format for Hive tables is Parquet, but Sqoop can only create Hive tables in text format. There is the --as-parquetfile Sqoop parameter, but I failed to make it work for Hive tables.
Solution
The solution is uploaded here.
I decided to implement a two-hop solution: first load the delta rows into a staging table in text format using Sqoop, and afterward insert the rows into the target Parquet Hive table. The whole workflow can be described as follows:
  • Recognize whether the target Hive table already exists. If yes, calculate the maximum value of the primary key.
  • Extract from the external RDBMS table all rows with a primary key greater than this maximum, or the whole table if the Hive table does not exist yet. Store the data in the staging table.
  • If the target Hive table does not exist, create it in Parquet format with the Hive command: CREATE .. TABLE AS SELECT * FROM stage.table
  • If the target Hive table already exists, simply add the new rows with: INSERT INTO TABLE .. SELECT * FROM stage.table
The solution is implemented as an Oozie workflow. It can be launched as a single Oozie task or as an Oozie coordinator task. Sample shell scripts for both tasks are available here. The common.properties file is used as a template for the job.properties and coordinator.properties files. A sketch of the underlying logic follows below.
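Stripped of the Oozie, Sqoop and Hive plumbing, the decision logic of the workflow is roughly the following; the helper functions here are hypothetical placeholders for the real Sqoop and Hive actions.

// placeholders for the real Sqoop/Hive actions of the workflow
def hiveTableExists(table: String): Boolean = false
def maxPrimaryKey(table: String): Long = 0L
def sqoopImportToStage(stageTable: String, minKeyExclusive: Option[Long]): Unit = ()
def createParquetTableFromStage(target: String, stage: String): Unit = ()  // CREATE .. TABLE AS SELECT * FROM stage.table
def insertFromStage(target: String, stage: String): Unit = ()              // INSERT INTO TABLE .. SELECT * FROM stage.table

def incrementalLoad(target: String, stage: String): Unit =
  if (hiveTableExists(target)) {
    sqoopImportToStage(stage, Some(maxPrimaryKey(target)))  // only rows above the current maximum key
    insertFromStage(target, stage)
  } else {
    sqoopImportToStage(stage, None)                         // first run: extract the whole table
    createParquetTableFromStage(target, stage)
  }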

Monday, February 27, 2017

Data extraction tool

I refactored the data extraction and loading tool. There is an ant build.xml file to create the distribution jar automatically, and an instruction on how to recreate the Eclipse project directly from GitHub in order to inspect and extend the source code if necessary.
From a functional point of view, only one feature was added: the hivedb key allows modifying the prefix used when exporting tables in Hive format. In Hive there is no schema, so the table name prefix is simply a separate database name. By virtue of the hivedb parameter, it is possible to change the schema name taken from the source database, or to remove the prefix altogether and load Hive tables directly into the default database. The sketch below illustrates this mapping.
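As an illustration, a source table such as SALES.CUSTOMERS can end up as sales.customers, as some_other_db.customers, or simply as customers in the Hive default database, depending on the hivedb value. A toy sketch of that mapping (the parameter handling here is made up; only the three cases reflect the description above):

// illustrative only: derive the Hive table name from the source "schema.table" name
def hiveName(sourceTable: String, hivedb: Option[String]): String = {
  val table = sourceTable.split('.').last.toLowerCase
  hivedb match {
    case Some("") => table                     // empty value: load into the default database
    case Some(db) => s"$db.$table"             // explicit value: replace the source schema
    case None     => sourceTable.toLowerCase   // not set: keep the source schema as the database name
  }
}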

Monday, January 30, 2017

New features implemented in Jython MVP framework, more Polymer

Introduction
I implemented the next set of Polymer web components from the GWT Vaadin Polymer demo. A demo version on the Heroku platform is available here, and the full source code here. I added the rest of the components for Paper elements, as well as Iron and Vaadin elements. Not everything works perfectly, but the main bulk of the work is done.

Next step
I'm going to implement full support for lists utilizing the Vaadin grid component. I'm also inclined to rewrite all the logic related to the user interface. Assuming that the user interface is supposed to be entirely Polymer-based, a lot of code supporting standard GWT widgets is no longer needed.