The results are published here.
The data size scaling factor is 100, meaning 100 GB of data. It is not a qualification database in terms of the TPC-DS specification, and the results should not be used for any formal comparison. The goal is only to run the test in different HDP environments and collect the results.
The queries were executed against ORC and Parquet files.
Ad-hoc impressions:
- Hive falls behind, as expected, because of the M/R execution paradigm.
- Parquet format: IBM BigSQL and SparkSQL go head to head; some queries run faster on BigSQL, some on SparkSQL.
- SparkSQL performs worse against ORC, even compared to Hive.
The benchmark was executed using the TPC-DS framework available here. The Throughput Test requires running four query streams in parallel; this was achieved by launching the ptest.sh script four times, with the parameter specifying the stream number:
- nohup ./ptest.sh 0 &
- nohup ./ptest.sh 1 &
- nohup ./ptest.sh 2 &
- nohup ./ptest.sh 3 &
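A minimal wrapper sketch of the same idea, which waits for all four streams and reports the total elapsed time (the log file names and the timing are my additions, not part of the TPC-DS framework):

    #!/bin/bash
    # Launch the four TPC-DS query streams in the background.
    start=$(date +%s)
    for stream in 0 1 2 3; do
      nohup ./ptest.sh "$stream" > "stream_${stream}.log" 2>&1 &
    done
    # Wait until every stream finishes, then report the total elapsed time.
    wait
    end=$(date +%s)
    echo "Throughput Test elapsed time: $((end - start)) s"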
In a tiny environment, the cluster had to be adjusted to the demands of the particular SQL engine. Hive and SparkSQL used four dedicated YARN queues to achieve a better level of parallelism. BigSQL was configured with its memory restraint removed; otherwise, the Throughput Test would fail for BigSQL because of a lack of available memory. After the test, the engine-specific configuration should be reverted.
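As a sketch of how a stream can be pinned to its own queue (the queue names are hypothetical and assume four queues already exist in the capacity scheduler), each stream's query file can start with the corresponding setting:

    -- hypothetical queue, one per stream: tpcds_q0 .. tpcds_q3
    SET mapreduce.job.queuename=tpcds_q0;
    -- for Hive on Tez the equivalent property is tez.queue.name;
    -- SparkSQL jobs can be directed with spark.yarn.queue instead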
The tables were created using the Hive external table feature to stage the text files; each target table was then loaded by the following sequence (a fuller sketch follows the list):
- LOAD DATA INPATH hdfs_path OVERWRITE INTO TABLE external_table
- CREATE TABLE ... STORED AS PARQUET/ORC AS SELECT * FROM external_table
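A minimal sketch of this sequence for a single table (the column list is truncated, and the HDFS path and the '|' delimiter are assumptions about the generated flat files, not the exact DDL used in the test):

    -- staging table over the generated text files
    CREATE EXTERNAL TABLE store_sales_ext (
      ss_sold_date_sk BIGINT,
      ss_item_sk      BIGINT,
      ss_net_paid     DECIMAL(7,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE;

    -- move the flat files under the external table
    LOAD DATA INPATH '/tmp/tpcds/store_sales' OVERWRITE INTO TABLE store_sales_ext;

    -- materialize the target table in the columnar format under test
    CREATE TABLE store_sales STORED AS ORC AS SELECT * FROM store_sales_ext;
    -- (STORED AS PARQUET for the Parquet run)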
Next steps
- The TPC-DS tables are not partitioned. Try partitioning them to take advantage of the "partition elimination" feature in BigSQL, which optimizes join queries by ignoring partitions not covered by the join predicates. This should boost BigSQL performance (a sketch follows this list).
- Migrate to HDP 3.1 and compare the results on the same hardware.
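A minimal sketch of what a partitioned variant of a fact table could look like (the choice of store_sales and of the date key as the partition column is an assumption; dynamic partitioning has to be enabled for the insert):

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    CREATE TABLE store_sales_part (
      ss_item_sk  BIGINT,
      ss_net_paid DECIMAL(7,2)
    )
    PARTITIONED BY (ss_sold_date_sk BIGINT)
    STORED AS PARQUET;

    -- the dynamic partition column must come last in the SELECT list
    INSERT OVERWRITE TABLE store_sales_part PARTITION (ss_sold_date_sk)
    SELECT ss_item_sk, ss_net_paid, ss_sold_date_sk FROM store_sales;

With such a layout, a join that restricts ss_sold_date_sk (for example through date_dim) lets the engine skip partitions that cannot match, which is what partition elimination exploits.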