Blog for the Open Source JavaHotel project

Sunday, August 30, 2015

AQL, BigInsights Text Analytics and Java, MapReduce

BigInsights Text Analytics is an IBM Hadoop add-on, a tool for extracting information from text data by applying a set of rules. The core of BigInsights Text Analytics is AQL, the Annotation Query Language. AQL (similar to SQL) allows you to develop text analysis bottom-up, starting with basic elements like dictionaries and tokens, and then building up more complicated statements. The final program is called an "extractor"; it runs over a set of input documents and produces a collection of "views" containing the desired information in a structured form. A "view" is like a table (relation) in SQL: a sequence of rows composed of a list of columns.
An extractor is an AQL program consisting of one or more AQL modules. Every module contains one or more source files with AQL statements.
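As an illustration of these building blocks (the module, dictionary and view names below are invented for this sketch, not taken from the project), a minimal AQL module could define an inline dictionary and build a view on top of it:

```
module sample;

-- An inline dictionary of terms to look for.
create dictionary MetricDict as
  ('revenue', 'gross margin', 'net income');

-- A view: one row per dictionary match found in the input document.
create view Metric as
  extract dictionary 'MetricDict'
    on D.text as match
  from Document D;

-- Only views marked for output appear in the extractor result.
output view Metric;
```

The `output view` statement is what makes a view part of the extractor's result; intermediate views can stay internal.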
BigInsights 3 provided an Eclipse component for developing, testing and publishing AQL extractors. This component was discontinued in BigInsights 4 and replaced by a Web Tool. Unfortunately, this left a design gap in BigInsights 4: the capabilities of the Web Tool are very limited, and a developer who wants to unleash the full power of Text Analytics AQL beyond the Web Tool has no simple way to do so. So I decided to fill this gap.
Java API
The Text Analytics AQL engine can be easily leveraged through its Java API and thus integrated with any other Big Data solution. So I created a simple project that allows compiling and executing an AQL program.
Install the BigInsights 4 Quick Start Edition. Then find systemT.jar (by executing the command: locate systemT.jar).
Test data, an analysis of IBM quarterly business reports, can be created by running the Text Analytics tutorial.
Solution description
The solution can be executed in two different ways: as a standalone program or as a MapReduce task over a Hadoop cluster.
Depending on the needs, it is recommended to prepare two runnable jars: RunAQL.jar (main class com.systemt.runaql.RunAQL) and ExtractMapReduce.jar (main class com.systemt.mapreduce.MapReduceMain).
Both methods require three parameters: input directory, output directory and configuration file.

java -cp RunAQL.jar:RunAQL_lib/*:/opt/ibm/bignsight/hadbi4/conf com.systemt.runaql.RunAQL hdfs://$INPUTDIR hdfs://$OUTDIR /home/sbartkowski/work/bigi/ma/
/opt/ibm/bignsight/hadbi4/conf points to the Hadoop configuration directory.

  • input directory Input directory with the text data to be annotated by the Text Analytics extractor. For the standalone program, it can be a local directory or an HDFS directory (starting with hdfs:)
  • output directory Output directory where the Text Analytics result will be stored. In the case of the MapReduce task, the directory must be removed beforehand. For the standalone program, it can be a local directory or an HDFS directory (starting with hdfs:)
  • configuration file Configuration file containing information about the AQL modules to be executed.
Configuration file
  • out.tam.dir Empty for the standalone program. For the MapReduce program, the HDFS directory where the compiled AQL program (.tam) files are saved and reused by the map tasks.
  • in.module.propertyN There can be more than one. A directory with the source code of an AQL module. It is always a local directory.
  • list.modules Should correspond to the in.module.propertyN values. A list of module names separated by commas.
  • ex.dictionaryN A pair separated by a comma: the external dictionary name and the external dictionary file. For the MapReduce program it should be an HDFS address.
  • input.multi Y if the Multilingual tokenizer should be used, N (or omit the property) otherwise. More details below.
  • input.lang The language of the input documents (list of accepted values). Can be omitted if the default, English, is used.
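Putting these properties together, a configuration file for the MapReduce variant might look like the sketch below (all paths, module names and dictionary names here are hypothetical, chosen only to illustrate the format):

```
# HDFS directory for compiled .tam files (leave empty for the standalone program)
out.tam.dir=hdfs://namenode/user/sb/tam
# local directories with AQL module source code
in.module.property1=/home/sb/aql/main
in.module.property2=/home/sb/aql/dict
# module names, comma-separated, corresponding to the entries above
list.modules=main,dict
# external dictionary name and its file (an HDFS address for MapReduce)
ex.dictionary1=names,hdfs://namenode/user/sb/dict/names.dict
# Standard tokenizer, English input
input.multi=N
input.lang=en
```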
Standard or Multilingual tokenizer
Although the Multilingual tokenizer supersedes the Standard one, the Standard tokenizer should be used as often as possible: there is a performance penalty attached to the Multilingual tokenizer. The Multilingual tokenizer should be used with non-European languages like Hebrew, Chinese or Japanese, and it requires an additional dependency. More details (point 3).
An example launching sequence for the Multilingual tokenizer:
hadoop fs -rm -r -skipTrash $OUTDIR

hadoop jar ExtractMapReduce.jar -libjars $ADDJAR Macabbi $OUTDIR
An example launching sequence for the Text Analytics tutorial:
hadoop fs -rm -r -skipTrash $OUTDIR
hadoop jar ExtractMapReduce.jar -libjars ExtractMapReduce_lib/systemT-3.3.0.jar inputdata $OUTDIR
Running Text Analytics as MapReduce
Annotating text data with Text Analytics AQL is perfectly suited to the MapReduce paradigm. The input data is divided into a collection of text files, and a map task annotates one document. In most cases, Text Analytics runs over a large number of relatively small input documents (for instance Twitter tweets, medical reports, etc.). The map output is a pair: view name (key) and view content (value). The reduce task consolidates the content of each view from all documents. Finally, the OutputFormat creates the result (currently only CSV format is supported).
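The key/value flow described above can be sketched without any Hadoop dependency. The snippet below is a minimal simulation of the shuffle step that sits between map and reduce: pairs of (view name, annotated row) emitted by the map tasks are grouped by view name, so each reduce call sees all rows of one view across all documents. The class and the sample row values are illustrative, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the MapReduce shuffle for Text Analytics output:
// map emits (viewName, row) pairs; the framework groups them by view name.
public class ViewShuffleSketch {

    // Group (viewName, row) pairs by view name, as the MapReduce
    // framework does between the map and reduce phases.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> mapOutput) {
        Map<String, List<String>> byView = new TreeMap<>();
        for (Map.Entry<String, String> kv : mapOutput) {
            byView.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return byView;
    }

    public static void main(String[] args) {
        // Two documents, each emitting rows for two views (sample data).
        List<Map.Entry<String, String>> mapOutput = List.of(
                Map.entry("Revenue", "doc1,2.5B"),
                Map.entry("Metric", "doc1,gross margin"),
                Map.entry("Revenue", "doc2,3.1B"),
                Map.entry("Metric", "doc2,net income"));

        Map<String, List<String>> views = shuffle(mapOutput);
        // Each view now holds rows from all documents,
        // ready for the OutputFormat to write one CSV file per view.
        for (Map.Entry<String, List<String>> view : views.entrySet()) {
            System.out.println(view.getKey() + ": " + view.getValue());
        }
    }
}
```

In the real job the grouping is done by Hadoop itself; the reduce task only concatenates the rows of one view and hands them to the OutputFormat.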

Future extensions, known limitations
  • The current version does not support external tables; only external dictionaries are implemented.
  • Only the CSV output format is supported. As a first feature, a header line would be desirable. Additional output formats should be added: TSV (tab-separated values), HBase, Hive or BigSQL (an IBM BigInsights add-on). The main challenge to overcome: how to pass the output view column description to the OutputFormat task.
  • It is not necessary to compile the AQL source files every time. A compiled .tam file can be created once in a separate process, then stored and reused later.
