Scala Code Generation
Warning: This BETA API is not final, and subject to change before release.
1. Quickstart Guide
1.1. Prerequisites
- DBToaster Beta1
- Scala 2.9.2
- JVM (preferably a 64-bit version)
Note: The following steps have been tested on Fedora 14 (64-bit) and Ubuntu 12.04 (32-bit); the commands may differ slightly on other operating systems.
1.2. Compiling and running your first query
We start with a simple query that looks like this:
CREATE TABLE R(A int, B int)
FROM FILE '../../experiments/data/tiny_r.dat' LINE DELIMITED
CSV (fields := ',');
CREATE STREAM S(B int, C int)
FROM FILE '../../experiments/data/tiny_s.dat' LINE DELIMITED
CSV (fields := ',');
SELECT SUM(r.A*s.C) as RESULT FROM R r, S s WHERE r.B = s.B;
This query should be saved to a file named
rs_example.sql.
To compile the query to Scala code, we invoke the DBToaster compiler with the following command:
$> bin/dbtoaster -l scala -o rs_example.scala rs_example.sql
This command will produce the file
rs_example.scala (or any other filename specified by the
-o [filename] switch) which contains the Scala code representing the query.
To compile the query to an executable JAR file, we invoke the DBToaster compiler with the -c [JARname] switch:
$> bin/dbtoaster -l scala -c rs_example rs_example.sql
Note: The extension
.jar is automatically appended to the name of the JAR.
The resulting JAR contains a main function that can be used to test the query. It runs the query until there are no more
events to be processed and prints the result. It can be run using the following command assuming that the
Scala DBToaster library can be found in the subdirectory lib/dbt_scala:
$> scala -classpath "rs_example.jar:lib/dbt_scala/dbtlib.jar" \
org.dbtoaster.RunQuery
After all tuples in the data files have been processed, the result of the query is printed:
Run time: 0.042 ms
<RESULT>156</RESULT>
2. Scala API Guide
In the previous example, we used the standard main function to test the query. However, to make use of the query
in a real application, it has to be run from the application itself.
The following example shows how a query can be run from your own Scala code. Suppose we have the following
source code in
main_example.scala:
import org.dbtoaster.Query

package org.example {
  object MainExample {
    def main(args: Array[String]) {
      Query.run()
      Query.printResults()
    }
  }
}
The code representing the query is in the
org.dbtoaster.Query object.
This program starts the query using the
Query.run() method and outputs its
result after it has finished using the
Query.printResults() method.
To retrieve results, the getRESULTNAME() functions of the Query object can be used.
Note: The getRESULTNAME() functions are not thread-safe, meaning that results can be
inconsistent if they are called from a thread other than the query thread. A thread-safe alternative for retrieving
the results is planned for future versions of DBToaster.
The program can be compiled to main_example.jar using the following command (assuming that the query was compiled to a file named rs_example.jar):
$> scalac -classpath "rs_example.jar" -d main_example.jar main_example.scala
The resulting program can now be launched with:
$> scala -classpath "main_example.jar:rs_example.jar:lib/dbt_scala/dbtlib.jar" org.example.MainExample
The
Query.run() method takes a function of type
Unit => Unit as an optional argument; it is called every time an event has been processed.
This function can be used to retrieve results while the query is still running.
Note: The function is executed on the same thread on which the query processing takes place, blocking
further query processing while the function runs.
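The callback mechanism can be sketched in a few lines of self-contained Scala. QueryStub below is a hypothetical stand-in for the generated org.dbtoaster.Query object (its run(), getRESULT() and printResults() are simplified imitations, not the real generated code), used only to show how a handler passed to run() can observe intermediate results:

```scala
// Hypothetical stand-in for the generated org.dbtoaster.Query object.
object QueryStub {
  private var result = 0L

  // Like the generated run(): processes all pending events, calling the
  // handler after each one, and returns when no more events are available.
  def run(onEventProcessedHandler: Unit => Unit = (_ => ())): Unit = {
    for (event <- List(1L, 2L, 3L)) {
      result += event                 // apply the event to the result
      onEventProcessedHandler(())     // notify the caller, on this same thread
    }
  }

  def getRESULT(): Long = result
  def printResults(): Unit = println("<RESULT>" + result + "</RESULT>")
}

object HandlerExample {
  def observedResults(): List[Long] = {
    var seen = List.empty[Long]
    // Reading getRESULT() inside the handler is consistent: the handler runs
    // on the query thread, so no concurrent update is in flight.
    QueryStub.run { _ => seen = seen :+ QueryStub.getRESULT() }
    seen
  }
}
```

Running HandlerExample.observedResults() yields one result snapshot per processed event; with the real Query object, the handler body would typically sample or forward results at some interval instead of on every event.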
3. Generated Code Reference
The DBToaster Scala code generator generates a single file containing an object
Query in the package
org.dbtoaster.
For the previous example the generated code looks like this:
// Imports
import java.io.FileInputStream;
...

package org.dbtoaster {
  // The generated object
  object Query {
    // Declaration of sources
    val s1 = createInputStreamSource(
      new FileInputStream("../../experiments/data/simple/tiny/r.dat"), ...
    );
    ...

    // Data structures holding the intermediate result
    var RESULT = SimpleVal[Long](0);
    ...

    // Functions to retrieve the result
    def getRESULT(): Long = {
      RESULT.get()
    };

    // Trigger functions
    def onInsertR(var_R_A: Long, var_R_B: Long) = ...
    ...
    def onDeleteS(var_S_B: Long, var_S_C: Long) = ...

    // Functions that handle static tables and system initialization
    def onSystemInitialized() = ...
    def fillTables(): Unit = ...

    // Function that dispatches events to the appropriate trigger functions
    def dispatcher(event: DBTEvent,
                   onEventProcessedHandler: Unit => Unit): Unit = ...

    // (Blocking) function to start the execution of the query
    def run(onEventProcessedHandler: Unit => Unit = (_ => ())): Unit = ...

    // Prints the query results in some XML-like form (for debugging)
    def printResults(): Unit = ...
  }
}
When the run() method is called, the static tables are loaded and the processing
of events from the declared sources starts. The function returns when the sources provide no
more events.
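The trigger functions are where incremental view maintenance happens: each insertion or deletion updates the materialized result directly instead of re-evaluating the join. As a self-contained illustration (a simplified sketch, not the generated code), the triggers for the quickstart query SELECT SUM(r.A*s.C) FROM R r, S s WHERE r.B = s.B can be written as:

```scala
import scala.collection.mutable

// Simplified sketch of incremental maintenance for
// SELECT SUM(r.A*s.C) FROM R r, S s WHERE r.B = s.B.
object TriggerSketch {
  // Auxiliary views: SUM(A) per B over R, and SUM(C) per B over S.
  val sumA = mutable.Map[Long, Long]().withDefaultValue(0L)
  val sumC = mutable.Map[Long, Long]().withDefaultValue(0L)
  var RESULT = 0L

  // A new R tuple joins with all S tuples already seen for the same B.
  def onInsertR(a: Long, b: Long): Unit = { RESULT += a * sumC(b); sumA(b) += a }
  def onInsertS(b: Long, c: Long): Unit = { RESULT += sumA(b) * c; sumC(b) += c }

  // Deletions undo the contribution of the removed tuple.
  def onDeleteR(a: Long, b: Long): Unit = { RESULT -= a * sumC(b); sumA(b) -= a }
  def onDeleteS(b: Long, c: Long): Unit = { RESULT -= sumA(b) * c; sumC(b) -= c }
}
```

Each trigger does constant work per matching key, which is why the generated code can keep the result continuously up to date as events stream in.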
3.1. Retrieving results
To retrieve the result, the getRESULTNAME() functions are used. In the example above,
the generated getRESULT() method is simple, but more complex methods may be generated,
and the return value may be a collection instead of a single value.
3.1.1. Queries computing collections
Consider the following query:
CREATE STREAM R(A int, B int)
FROM FILE '../../experiments/data/tiny/r.dat' LINE DELIMITED
CSV (fields := ',');
CREATE STREAM S(B int, C int)
FROM FILE '../../experiments/data/tiny/s.dat' LINE DELIMITED
CSV (fields := ',');
SELECT r.B, SUM(r.A*s.C) as RESULT_1, SUM(r.A+s.C) as RESULT_2 FROM R r, S s WHERE r.B = s.B GROUP BY r.B;
In this case, two functions are generated to retrieve the result, each of them representing
one of the result columns:
def getRESULT_1():K3PersistentCollection[(Long), Long] = ...
def getRESULT_2():K3PersistentCollection[(Long), Long] = ...
In this case, the functions return a collection containing the result. For further processing, the results can be converted
to lists of key-value pairs using the
toList() method of the collection class. The key in each pair corresponds
to the columns in the
GROUP BY clause, in our case
r.B; the value is the aggregate
value for that key.
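The access pattern can be sketched as follows. GroupedResult is a hypothetical stand-in for K3PersistentCollection[(Long), Long] (the real class ships in lib/dbt_scala), and the contents of the collection are made-up example values, not output of the query above:

```scala
// Hypothetical stand-in exposing only the toList() access pattern of
// K3PersistentCollection[(Long), Long].
class GroupedResult(entries: Map[Long, Long]) {
  def toList(): List[(Long, Long)] = entries.toList
}

// Made-up snapshot of what getRESULT_1() might hold: one entry per
// distinct r.B, mapping the group key to its aggregate value.
val result1 = new GroupedResult(Map(1L -> 10L, 2L -> 42L))

// Iterate over (GROUP BY key, aggregate value) pairs.
val formatted = result1.toList().sortBy(_._1)
  .map { case (b, sum) => "B=" + b + " -> " + sum }
```

Sorting by the key is optional; the collection itself makes no ordering guarantee, so sort explicitly if a stable order is needed.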
3.1.2. Partial Materialization
Some of the work involved in maintaining the results of a query can be saved by performing partial materialization
and computing the final results only when the get_TLQ_NAME() retrieval functions are invoked. This
behaviour is especially desirable when results are queried less often than updates arrive, and
can be enabled through the -F EXPRESSIVE-TLQS command line flag.
Below is an example of a query where partial materialization is indeed beneficial.
CREATE STREAM R(A int, B int)
FROM FILE '../../experiments/data/tiny/r.dat' LINE DELIMITED
csv ();
SELECT r2.C FROM (
SELECT r1.A, COUNT(*) AS C FROM R r1 GROUP BY r1.A
) r2;
When this query is compiled with the
-F EXPRESSIVE-TLQS command line flag, the generated function that retrieves
the results is considerably more complex than the functions we have seen before. It uses the partial materialization
COUNT_1_E1_1 to compute the result:
$> bin/dbtoaster -l scala -F EXPRESSIVE-TLQS test/queries/simple/r_lift_of_count.sql
def getCOUNT(): K3IntermediateCollection[(Long), Long] = {
  (COUNT_1_E1_1.map((y: Tuple2[(Long), Long]) =>
    ...
  )
};
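The division of labour can be sketched in plain Scala (hypothetical simplification, not the generated K3 code; only the name COUNT_1_E1_1 is taken from the output above): the triggers keep only the inner aggregate up to date, and the outer query is evaluated lazily when the result is requested.

```scala
import scala.collection.mutable

// Sketch of -F EXPRESSIVE-TLQS evaluation for
// SELECT r2.C FROM (SELECT r1.A, COUNT(*) AS C FROM R r1 GROUP BY r1.A) r2.
object PartialTlq {
  // Partial materialization: COUNT(*) per r1.A, updated on every event.
  val COUNT_1_E1_1 = mutable.Map[Long, Long]().withDefaultValue(0L)

  def onInsertR(a: Long, b: Long): Unit = COUNT_1_E1_1(a) += 1

  // Final result computed on demand from the partial materialization,
  // rather than being maintained on every update.
  def getCOUNT(): List[(Long, Long)] = COUNT_1_E1_1.toList.sortBy(_._1)
}
```

Under this strategy each update costs a single map increment, and the (relatively expensive) mapping over the materialized counts is paid only at retrieval time, which is exactly the trade-off described above.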