C++ Code Generation
Warning: This BETA API is not final, and subject to change before release.
1. Quickstart Guide
DBToaster generates C++ code for incrementally maintaining the results of a given set of queries
when C++ is selected as the output language (-l cpp command line option). In this case DBToaster
produces a C++ header file containing a set of data structures (tlq_t, data_t and
Program) required for executing the SQL program.
Let's consider the following SQL query:
$> cat test/queries/simple/rs_example1.sql
CREATE TABLE R(A int, B int)
FROM FILE '../../experiments/data/tiny/r.dat' LINE DELIMITED
CSV (fields := ',');
CREATE STREAM S(B int, C int)
FROM FILE '../../experiments/data/tiny/s.dat' LINE DELIMITED
CSV (fields := ',');
SELECT SUM(r.A*s.C) as RESULT FROM R r, S s WHERE r.B = s.B;
The corresponding C++ header file can be obtained by running:
$> bin/dbtoaster test/queries/simple/rs_example1.sql -l cpp -o rs_example1.hpp
Alternatively, DBToaster can build a standalone binary (if the -c [binary name] flag is present) by compiling
the generated header file against lib/dbt_c++/main.cpp, which provides code for executing the
SQL program and printing the results.
Requirements: The generated code depends on Boost, so the Boost header files and the following library binaries
must be present on the system: boost_program_options, boost_serialization, boost_system, boost_filesystem,
boost_chrono and boost_thread.
If g++ cannot find them in its default search paths, their location has to be explicitly
provided to DBToaster, either through the following environment variables:
- DBT_HDR which should contain the path to Boost's include folder;
- DBT_LIB which should contain the path to Boost's lib folder.
$> export DBT_HDR=path-to-boost-include-dir
$> export DBT_LIB=path-to-boost-lib-dir
$> bin/dbtoaster test/queries/simple/rs_example1.sql -l cpp -c rs_example1
or through the -I and -L command line flags:
$> bin/dbtoaster test/queries/simple/rs_example1.sql -l cpp -c rs_example1 -I path-to-boost-include-dir -L path-to-boost-lib-dir
Running the compiled binary will result in the following output:
$> ./rs_example1
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE boost_serialization>
<boost_serialization signature="serialization::archive" version="9">
Initializing program:
Running program:
Printing final result:
<snap class_id="0" tracking_level="0" version="0">
<RESULT>156</RESULT>
</snap>
</boost_serialization>
If the generated binary is run with the --async flag, it will also print intermediate results as frequently
as possible while the SQL program is running in a separate thread.
$> ./rs_example1 --async
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE boost_serialization>
<boost_serialization signature="serialization::archive" version="9">
Initializing program:
Running program:
<snap class_id="0" tracking_level="0" version="0">
<RESULT>0</RESULT>
</snap>
<snap>
<RESULT>0</RESULT>
</snap>
<snap>
<RESULT>0</RESULT>
</snap>
<snap>
<RESULT>0</RESULT>
</snap>
<snap>
<RESULT>9</RESULT>
</snap>
<snap>
<RESULT>74</RESULT>
</snap>
<snap>
<RESULT>141</RESULT>
</snap>
Printing final result:
<snap>
<RESULT>156</RESULT>
</snap>
</boost_serialization>
2. C++ API Guide
The DBToaster C++ code generator produces a header file containing three main type definitions in the dbtoaster namespace:
tlq_t, data_t and Program. Additionally, snapshot_t is predefined as a garbage-collected
pointer to tlq_t. What follows is a brief description of these types; a more detailed presentation can be found
in the Reference section.
tlq_t encapsulates the materialized views directly needed for computing the results and offers functions for retrieving
them.
data_t extends tlq_t with auxiliary materialized views needed for maintaining the results and offers trigger
functions for incrementally updating them.
Program represents the execution engine of the SQL program. It encapsulates a data_t object and provides
implementations of a set of abstract functions of the IProgram class used for running the program.
Default implementations for some of these functions are inherited from the ProgramBase class, while others
are generated depending on the previously defined tlq_t and data_t types.
2.1. Executing the Program
The execution of a program can be controlled through the functions: IProgram::init(),
IProgram::run(), IProgram::is_finished(), IProgram::process_streams()
and IProgram::process_stream_event().
- virtual void IProgram::init()
  Loads the tuples of static tables and performs initialization
  of materialized views based on that data. The definition of this function is generated as part of the
  Program class.
- void IProgram::run( bool async = false )
  Executes the program by invoking the
  Program::process_streams() function. If the parameter async is set to true,
  the execution takes place in a separate thread. This is a standard function defined by the IProgram
  class.
- bool IProgram::is_finished()
  Tests whether the program has finished or not. This is especially
  relevant when the program is run in asynchronous mode. This is a standard function defined by the IProgram
  class.
- virtual void IProgram::process_streams()
  Reads stream events from various sources and invokes
  IProgram::process_stream_event() on each event. The default implementation of this function
  (ProgramBase::process_streams()) reads events from the sources specified in the SQL program.
- virtual void IProgram::process_stream_event(event_t& ev)
  Processes each stream event passing
  through the system. The default implementation of this function (ProgramBase::process_stream_event()) performs
  incremental maintenance work by invoking the trigger function corresponding to the event type ev.type
  for stream ev.id with the arguments contained in ev.data.
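The control flow described in this list can be sketched without the DBToaster headers. The mock-up below is only an illustration under stated assumptions: MiniProgram, Event, EventType and the trigger bodies are hypothetical stand-ins, not the library or generated code. It shows run() delegating to process_streams(), which hands every event to process_stream_event(), which in turn dispatches on the event type.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the library's event types (illustrative names).
enum class EventType { Insert, Delete };

struct Event {
    EventType type;          // plays the role of ev.type
    std::string relation;    // plays the role of ev.id
    std::vector<long> data;  // plays the role of ev.data
};

class MiniProgram {
public:
    explicit MiniProgram(std::vector<Event> source) : source_(std::move(source)) {}
    virtual ~MiniProgram() = default;

    // Mirrors run(); this sketch only covers the synchronous case.
    void run() { process_streams(); finished_ = true; }
    bool is_finished() const { return finished_; }
    long tuple_count() const { return count_; }

    // Mirrors process_streams(): read every event and forward it.
    virtual void process_streams() {
        for (Event& ev : source_) process_stream_event(ev);
    }

    // Mirrors process_stream_event(): dispatch on the event type.
    virtual void process_stream_event(Event& ev) {
        if (ev.type == EventType::Insert) on_insert(ev.data);
        else on_delete(ev.data);
    }

protected:
    // Toy triggers: they only track how many tuples are currently live.
    void on_insert(const std::vector<long>&) { ++count_; }
    void on_delete(const std::vector<long>&) { --count_; }

private:
    std::vector<Event> source_;
    long count_ = 0;
    bool finished_ = false;
};

// Feeds two insertions and one deletion; returns the number of live tuples.
inline long demo_tuple_count() {
    std::vector<Event> events = {
        {EventType::Insert, "S", {1, 2}},
        {EventType::Insert, "S", {3, 4}},
        {EventType::Delete, "S", {1, 2}},
    };
    MiniProgram p(events);
    p.run();
    return p.is_finished() ? p.tuple_count() : -1;
}
```

Because both processing functions are virtual in the sketch, a subclass can replace either the event source or the per-event handling independently, which is the same extension point the real interface exposes.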
2.2. Retrieving the Results
The snapshot_t IProgram::get_snapshot() function returns a snapshot of the results of the program.
The query results can then be obtained by calling the appropriate get_TLQ_NAME() function on the
snapshot object as described in the reference of tlq_t. If the program is
running in asynchronous mode it is guaranteed that the taken snapshot is consistent.
Currently, the mechanism for taking snapshots is trivial: a snapshot consists of a full copy of the
tlq_t object associated with the program. Consequently, the time required to obtain such a snapshot
is linear in the size of the result set.
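The full-copy behaviour can be illustrated with a small stand-alone sketch. MiniTlq and MiniSnapshot are hypothetical names, and std::shared_ptr is assumed here as the reference-counted pointer; the pointer type used by the actual library may differ.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical stand-in for a generated tlq_t holding one collection result.
struct MiniTlq {
    std::map<long, long> result;  // key -> aggregated value
};

// Assumption: a reference-counted pointer plays the role of snapshot_t.
using MiniSnapshot = std::shared_ptr<MiniTlq>;

// Full-copy snapshot: the cost grows linearly with the number of entries.
inline MiniSnapshot take_snapshot(const MiniTlq& live) {
    return std::make_shared<MiniTlq>(live);
}

// Shows that a snapshot is unaffected by updates applied after it was taken.
inline long snapshot_value_after_update() {
    MiniTlq live;
    live.result[1] = 10;
    MiniSnapshot snap = take_snapshot(live);
    live.result[1] = 99;        // update after the snapshot was taken
    return snap->result.at(1);  // still the value at snapshot time
}
```

Copying trades snapshot cost for simplicity: readers never observe a half-applied update, but every snapshot pays for the whole result.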
2.3. Basic Example
We will use as an example the C++ code generated for the SQL program rs_example1.sql introduced above. In the interest
of clarity, some implementation details are omitted.
$> bin/dbtoaster test/queries/simple/rs_example1.sql -l cpp -o rs_example1.hpp
#include <lib/dbt_c++/program_base.hpp>
namespace dbtoaster {
    /* Definitions of auxiliary maps for storing materialized views. */
    ...
    ...
    ...
    /* Type definition providing a way to access the results of the sql */
    /* program */
    struct tlq_t{
        tlq_t()
        {}
        ...
        /* Functions returning / computing the results of top level */
        /* queries */
        long get_RESULT(){
            ...
        }
    protected:
        /* Data structures used for storing/computing top level queries */
        ...
    };
    /* Type definition providing a way to incrementally maintain the */
    /* results of the sql program */
    struct data_t : tlq_t{
        data_t()
        {}
        /* Registering relations and trigger functions */
        void register_data(ProgramBase<tlq_t>& pb) {
            ...
        }
        /* Trigger functions for table relations */
        void on_insert_R(long R_A, long R_B) {
            ...
        }
        /* Trigger functions for stream relations */
        void on_insert_S(long S_B, long S_C) {
            ...
        }
        void on_delete_S(long S_B, long S_C) {
            ...
        }
        void on_system_ready_event() {
            ...
        }
    private:
        /* Data structures used for storing materialized views */
        ...
    };
    /* Type definition providing a way to execute the sql program */
    class Program : public ProgramBase<tlq_t>
    {
    public:
        Program(int argc = 0, char* argv[] = 0) :
            ProgramBase<tlq_t>(argc,argv)
        {
            data.register_data(*this);
            /* Specifying data sources */
            ...
        }
        /* Imports data for static tables and performs view */
        /* initialization based on it. */
        void init() {
            process_tables();
            data.on_system_ready_event();
        }
        /* Saves a snapshot of the data required to obtain the results */
        /* of top level queries. */
        snapshot_t take_snapshot(){
            return snapshot_t( new tlq_t((tlq_t&)data) );
        }
    private:
        data_t data;
    };
}
Below is an example of how the API can be used to execute the SQL program and
print its results:
#include <cstring>
#include <iostream>
#include "rs_example1.hpp"
int main(int argc, char* argv[]) {
    /* Run asynchronously only if --async is the first argument */
    bool async = argc > 1 && !std::strcmp(argv[1],"--async");
    dbtoaster::Program p;
    dbtoaster::Program::snapshot_t snap;
    std::cout << "Initializing program:" << std::endl;
    p.init();
    std::cout << "Running program:" << std::endl;
    p.run( async );
    /* In asynchronous mode, poll intermediate results until done */
    while( !p.is_finished() )
    {
        snap = p.get_snapshot();
        std::cout << "RESULT: " << snap->get_RESULT() << std::endl;
    }
    std::cout << "Printing final result:" << std::endl;
    snap = p.get_snapshot();
    std::cout << "RESULT: " << snap->get_RESULT() << std::endl;
    return 0;
}
2.4. Custom Execution
Custom event processing can be performed on each stream event by overriding the virtual function
void IProgram::process_stream_event(event_t& ev) while still delegating
the basic processing of an event to Program::process_stream_event().
Example: Custom event processing.
namespace dbtoaster{
    class CustomProgram_1 : public Program
    {
    public:
        void process_stream_event(event_t& ev) {
            cout << "on_" << event_name[ev.type] << "_";
            cout << get_relation_name(ev.id) << "(" << ev.data << ")" << endl;
            Program::process_stream_event(ev);
        }
    };
}
Stream events can be manually read from custom sources and fed into the system by overriding the virtual function
void IProgram::process_streams() and calling process_stream_event() for each event read.
Example: Custom event sourcing.
namespace dbtoaster{
    class CustomProgram_2 : public Program
    {
    public:
        void process_streams() {
            for( long i = 1; i <= 10; i++ ) {
                event_args_t ev_args;
                ev_args.push_back(i);
                ev_args.push_back(i+10);
                event_t ev( insert_tuple, get_relation_id("S"), ev_args);
                process_stream_event(ev);
            }
        }
    };
}
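When events come from an external text source rather than a loop, the tuples first have to be parsed. The following is a minimal sketch of that step only, assuming comma-separated integer fields as in the example data files; parse_csv_longs and read_tuples are hypothetical helper names and are not part of the DBToaster API.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Parses one comma-separated line such as "3,13" into a list of long values.
// std::stol throws if a field does not start with a number.
inline std::vector<long> parse_csv_longs(const std::string& line) {
    std::vector<long> values;
    std::stringstream fields(line);
    std::string field;
    while (std::getline(fields, field, ',')) {
        values.push_back(std::stol(field));
    }
    return values;
}

// Reads every non-empty line of `in` and returns one argument list per tuple.
inline std::vector<std::vector<long>> read_tuples(std::istream& in) {
    std::vector<std::vector<long>> tuples;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) tuples.push_back(parse_csv_longs(line));
    }
    return tuples;
}
```

Inside an overridden process_streams(), each returned argument list could then be pushed field by field into an event_args_t and wrapped in an event_t, in the same way as the loop in CustomProgram_2.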
3. C++ Generated Code Reference
3.1. struct tlq_t
The tlq_t type contains all the relevant data structures for computing the results of the SQL program, also called
the top level queries. It provides a set of functions named get_TLQ_NAME that return the top level query
result labeled TLQ_NAME. For our example, the generated tlq_t has a function named get_RESULT
that returns the query result corresponding to SELECT SUM(r.A*s.C) as RESULT ... in rs_example1.sql.
3.1.1. Queries computing collections
In the example above the result consisted of a single value.
If however our query has a GROUP BY clause its result is a collection and
the corresponding get_RESULT function will return either a boost::multi_index_container or a std::map.
Let's consider the following example:
$> cat test/queries/simple/rs_example2.sql
CREATE STREAM R(A int, B int)
FROM FILE '../../experiments/data/tiny/r.dat' LINE DELIMITED
CSV (fields := ',');
CREATE STREAM S(B int, C int)
FROM FILE '../../experiments/data/tiny/s.dat' LINE DELIMITED
CSV (fields := ',');
SELECT r.B, SUM(r.A*s.C) as RESULT_1, SUM(r.A+s.C) as RESULT_2 FROM R r, S s WHERE r.B = s.B GROUP BY r.B;
The generated code defines two collection types, RESULT_1_map and RESULT_2_map, and two corresponding
entry types, RESULT_1_entry and RESULT_2_entry. These entry structures have a set of key fields
corresponding to the GROUP BY clause, in our case R_B, and an additional value field, __av,
storing the aggregated value of the top level query for each key in the collection. Finally, tlq_t contains
two functions, get_RESULT_1 and get_RESULT_2, returning the top level query results as RESULT_1_map
and RESULT_2_map objects.
/* Definitions of auxiliary maps for storing materialized views. */
struct RESULT_1_entry {
    long R_B; long __av;
    ...
};
typedef multi_index_container<RESULT_1_entry, ... > RESULT_1_map;
...
struct RESULT_2_entry {
    long R_B; long __av;
    ...
};
typedef multi_index_container<RESULT_2_entry, ... > RESULT_2_map;
...
/* Type definition providing a way to access the results of the sql program */
struct tlq_t{
    tlq_t()
    {}
    /* Serialization Code */
    ...
    /* Functions returning / computing the results of top level queries */
    RESULT_1_map& get_RESULT_1(){
        ...
    }
    RESULT_2_map& get_RESULT_2(){
        ...
    }
protected:
    /* Data structures used for storing / computing top level queries */
    RESULT_1_map RESULT_1;
    RESULT_2_map RESULT_2;
};
If the given query has no aggregates, the COUNT(*) aggregate will be computed by default, and
consequently the resulting collections are guaranteed not to have any duplicate keys.
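Consuming a collection result amounts to walking its entries and reading the key and value fields. The sketch below assumes the returned container can be traversed with a range-based for loop; ResultEntry is a hypothetical stand-in for a generated entry type, and its av field corresponds to the generated __av field (renamed here because identifiers containing a double underscore are reserved in hand-written C++).

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <vector>

// Stand-in for a generated entry type: one GROUP BY key plus the aggregate.
struct ResultEntry {
    long R_B;  // key field taken from the GROUP BY clause
    long av;   // aggregated value for that key (__av in the generated code)
};

// Prints every (key, value) pair of a collection result and returns the
// sum of all aggregated values.
template <typename Collection>
long print_and_sum(const Collection& result, std::ostream& out) {
    long total = 0;
    for (const auto& entry : result) {
        out << "R_B=" << entry.R_B << " -> " << entry.av << '\n';
        total += entry.av;
    }
    return total;
}
```

The function is a template so that the same loop applies to any container of entries; a std::map result would instead yield key/value pairs and need first and second in place of the field names.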
3.1.2. Partial Materialization
Some of the work involved in maintaining the results of a query can be saved by performing partial materialization
and only computing the final results when invoking tlq_t's get_TLQ_NAME functions. This
behaviour is especially desirable when the rate of querying the results is lower than the rate of updates, and
can be enabled through the -F EXPRESSIVE-TLQS command line flag.
Below is an example of a query where partial materialization is indeed beneficial.
$> cat test/queries/simple/r_lift_of_count.sql
CREATE STREAM R(A int, B int)
FROM FILE '../../experiments/data/tiny/r.dat' LINE DELIMITED
csv ();
SELECT r2.C FROM (
SELECT r1.A, COUNT(*) AS C FROM R r1 GROUP BY r1.A
) r2;
Generated tlq_t without -F EXPRESSIVE-TLQS: We can see that get_COUNT()
simply returns the materialized view of the results.
$> bin/dbtoaster test/queries/simple/r_lift_of_count.sql -l cpp
...
/* Type definition providing a way to access the results of the sql program */
struct tlq_t{
    tlq_t()
    {}
    ...
    /* Functions returning / computing the results of top level queries */
    COUNT_map& get_COUNT(){
        COUNT_map& __v_1 = COUNT;
        return __v_1;
    }
protected:
    /* Data structures used for storing / computing top level queries */
    COUNT_map COUNT;
};
...
Generated tlq_t with -F EXPRESSIVE-TLQS: We can see that get_COUNT()
performs some final computation for constructing the end result in a temporary
std::map before returning it.
Note that tlq_t no longer contains the full materialized view of the results,
COUNT_map COUNT;, but a partial materialization, COUNT_1_E1_1_map COUNT_1_E1_1;,
used by get_COUNT() in computing the final query result.
$> bin/dbtoaster test/queries/simple/r_lift_of_count.sql -l cpp -F EXPRESSIVE-TLQS
...
/* Type definition providing a way to access the results of the sql program */
struct tlq_t{
    tlq_t()
    {}
    ...
    /* Functions returning / computing the results of top level queries */
    map<long,long> get_COUNT(){
        map<long,long> __v_41;
        /* Result computation based on COUNT_1_E1_1 */
        return __v_41;
    }
protected:
    /* Data structures used for storing / computing top level queries */
    COUNT_1_E1_1_map COUNT_1_E1_1;
};
...
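The trade-off behind partial materialization can be sketched by hand for this query. The code below is an assumption-laden illustration, not the generated code: it supposes that the partial view stores the number of R tuples per value of A, and that the final result maps each count C to the number of groups having that count. CountPerA and the simplified one-argument trigger are hypothetical.

```cpp
#include <cassert>
#include <map>

// Hypothetical partial view: for each value of A, the number of R tuples.
using CountPerA = std::map<long, long>;

// Update path: cheap, touches a single entry of the partial view.
inline void on_insert_R(CountPerA& partial, long a) { ++partial[a]; }

// Query path: the final result is derived only when it is requested.
// Maps each count C to the number of groups that have exactly that count.
inline std::map<long, long> get_COUNT(const CountPerA& partial) {
    std::map<long, long> result;
    for (const auto& kv : partial) {
        if (kv.second != 0) ++result[kv.second];
    }
    return result;
}
```

Each update costs one map access, while the regrouping work is paid only when the result is read, which is why this layout pays off when reads are rarer than updates.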
3.2. struct data_t
The data_t type contains all the relevant data structures and trigger functions for incrementally maintaining the results
of the SQL program.
For each stream-based relation STREAM_X present in the SQL program, it provides a pair of trigger functions named
on_insert_STREAM_X() and on_delete_STREAM_X() that incrementally maintain the query results when a tuple is
inserted into or deleted from STREAM_X. For the query presented above (rs_example1.sql),
the generated data_t has the trigger functions void on_insert_S(long S_B, long S_C) and void on_delete_S(long S_B, long S_C).
For static table-based relations only the insertion trigger is required; it is called when processing the static tables
in the initialization phase of the program.
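To make the idea of trigger-based maintenance concrete, here is a hand-written sketch for the rs_example1.sql query. It is not the code DBToaster generates: MiniData and its two auxiliary maps are assumptions chosen for illustration. Each trigger adjusts RESULT by the contribution of the changed tuple instead of recomputing the join.

```cpp
#include <cassert>
#include <map>

// Hand-written sketch of incremental maintenance for
//   SELECT SUM(r.A*s.C) AS RESULT FROM R r, S s WHERE r.B = s.B;
struct MiniData {
    long RESULT = 0;                  // top level query result
    std::map<long, long> sum_A_by_B;  // auxiliary view: B -> SUM(A) over R
    std::map<long, long> sum_C_by_B;  // auxiliary view: B -> SUM(C) over S

    // Static table R: only an insertion trigger is needed.
    void on_insert_R(long R_A, long R_B) {
        RESULT += R_A * sum_C_by_B[R_B];
        sum_A_by_B[R_B] += R_A;
    }
    // Stream S: one trigger per event type.
    void on_insert_S(long S_B, long S_C) {
        RESULT += sum_A_by_B[S_B] * S_C;
        sum_C_by_B[S_B] += S_C;
    }
    void on_delete_S(long S_B, long S_C) {
        RESULT -= sum_A_by_B[S_B] * S_C;
        sum_C_by_B[S_B] -= S_C;
    }
};

// Replays a small event sequence and returns the maintained result.
inline long demo_result() {
    MiniData d;
    d.on_insert_R(2, 1);  // R(A=2, B=1)
    d.on_insert_R(3, 1);  // R(A=3, B=1)
    d.on_insert_S(1, 4);  // joins both R tuples: (2+3)*4 = 20
    d.on_insert_S(1, 6);  // adds (2+3)*6 = 30, total 50
    d.on_delete_S(1, 4);  // removes the first contribution, total 30
    return d.RESULT;
}
```

Every trigger runs in time independent of the number of stored tuples (apart from the map lookup), which is the benefit of keeping auxiliary views next to the result.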
3.3. class Program
Finally, Program is a class that implements the IProgram interface and provides the basic functionality
for reading static table tuples and stream events from their sources, initializing the relevant data structures, running the SQL
program and retrieving its results.