There are three ways to load data into SDB:
the sdbload command-line utility
the model.read operations
the model.add operations
The last one of these requires the application to signal the beginning and end of batches.
Model.read
A Jena Model obtained from SDB via:
SDBFactory.connectModel(store)
will automatically bulk load data for each call of one of the Model.read operations.
Model.add
The Model.add operations, in any form or combination of forms (a single statement, a list of statements, or another model), invoke the bulk loader only if the model was notified of the start of a batch before the add operations begin.
You can also explicitly delimit bulk operations:
    model.notifyEvent(GraphEvents.startRead);
    ... do add/remove operations ...
    model.notifyEvent(GraphEvents.finishRead);
Failing to notify the end of the operations will result in data loss.
A try/finally block can ensure that the finish is notified.
    model.notifyEvent(GraphEvents.startRead);
    try {
        ... do add/remove operations ...
    } finally {
        model.notifyEvent(GraphEvents.finishRead);
    }
The model.read operations do this automatically.
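As an alternative to writing try/finally by hand, the start and finish notifications can be wrapped in an AutoCloseable so that try-with-resources sends the finish event. The following is a minimal sketch under stated assumptions: EventSink and BulkScope are hypothetical names that are not part of Jena or SDB, and plain strings stand in for GraphEvents.startRead and GraphEvents.finishRead so the example runs without Jena on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the one Model method used here; not a Jena type.
interface EventSink {
    void notifyEvent(Object event);
}

// Sends the start event on creation and the finish event on close(),
// so try-with-resources guarantees the finish notification.
final class BulkScope implements AutoCloseable {
    private final EventSink sink;
    private final Object finishEvent;

    BulkScope(EventSink sink, Object startEvent, Object finishEvent) {
        this.sink = sink;
        this.finishEvent = finishEvent;
        sink.notifyEvent(startEvent);
    }

    @Override
    public void close() {
        sink.notifyEvent(finishEvent);
    }
}

class BulkScopeDemo {
    // Runs a batch that fails part-way and returns the events the sink received.
    static List<Object> runFailingBatch() {
        List<Object> seen = new ArrayList<>();
        EventSink sink = seen::add; // records events instead of talking to a store
        try (BulkScope scope = new BulkScope(sink, "startRead", "finishRead")) {
            throw new IllegalStateException("add failed part-way");
        } catch (IllegalStateException e) {
            // close() has already run, so the finish event was sent.
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runFailingBatch());
    }
}
```

With a real model, a method reference such as model::notifyEvent could presumably serve as the sink, with the GraphEvents constants passed as the two events.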
The bulk loader will automatically chunk large sequences of additions to sizes appropriate to the underlying database. The bulk loader is threaded and double-buffered; loading to the database happens in parallel to the application thread and any RDF parsing.
Loading consists of two phases: in the Java VM, and on the database itself. The SDB loader takes incoming triples and breaks them down into components ready for the database. These prepared triples are added to a queue for the database phase, which (by default) takes place on a separate thread. When the number of queued triples reaches a limit (default 20,000), or the end of the update is signalled, the triples are passed to the database.
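The two-phase flow described above can be illustrated with a small stand-alone sketch. This is not SDB's loader code: ChunkingLoader and its methods are hypothetical names, the "database phase" is just a callback, and the queue between the threads is unbounded here, which the real loader may well handle differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative only: buffers prepared items on the caller's thread and hands
// full chunks to a single background thread; finish() flushes the remainder.
final class ChunkingLoader<T> {
    private final int chunkSize;
    private final Consumer<List<T>> databasePhase;
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private List<T> buffer = new ArrayList<>();

    ChunkingLoader(int chunkSize, Consumer<List<T>> databasePhase) {
        this.chunkSize = chunkSize;
        this.databasePhase = databasePhase;
    }

    // Called on the application thread for each prepared item.
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Signals the end of the batch: flush what is left, then wait for the worker.
    void finish() {
        flush();
        worker.shutdown();
        try {
            if (!worker.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("database phase did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the database phase", e);
        }
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> chunk = buffer;     // hand the filled buffer to the worker
        buffer = new ArrayList<>(); // keep filling a fresh one in the meantime
        worker.execute(() -> databasePhase.accept(chunk));
    }
}

class ChunkingLoaderDemo {
    // Loads `total` dummy items and returns the size of each chunk that was written.
    static List<Integer> chunkSizes(int total, int chunkSize) {
        List<Integer> sizes = new ArrayList<>(); // written only by the single worker thread
        ChunkingLoader<Integer> loader =
                new ChunkingLoader<>(chunkSize, chunk -> sizes.add(chunk.size()));
        for (int i = 0; i < total; i++) {
            loader.add(i);
        }
        loader.finish();
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(chunkSizes(45, 20));
    }
}
```

The sketch also shows why the finish signal matters: the last, partly filled chunk is only written when finish() is called.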
You can configure whether to use threading, and the 'chunk size' (the number of triples per load event), via StoreLoader.
    Store store; // SDB Store
    ...
    store.getLoader().setChunkSize(5000);
    store.getLoader().setUseThreading(false); // Don't thread
You should set these before the loader has been used.
Each loader sets up two temporary tables (NNode and NTrip) that mirror the Nodes and Triples tables. These tables are virtually identical, except that a) they are not indexed and b) for the index variant there is no index column for nodes.
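When loading, prepared triples (triples that have been broken down ready for the database) are passed to the loader core, which normally runs on a different thread. When the chunk size is reached, or we are out of triples, the following happens:
Prepared nodes (if there are any) are added in one go to NNode. Duplicate nodes within a chunk are suppressed on the Java side (this is worth doing since they are quite common, e.g. properties).
Prepared triples are added in one go to NTrip.
New nodes are added to the Nodes table, with duplicates suppressed.
New triples are added to the Triples table, again with duplicates suppressed; for the index variant this involves a join on the node table to do the hash to index lookup.
The chunk is committed.
Thus there are five calls to the database for every chunk. The database handles almost all of the work uninterrupted (duplicate suppression, hash to index lookup), which makes loading reasonably quick.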
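The Java-side suppression of duplicate nodes within a chunk could look roughly like the sketch below. It is an illustration only: the class and method names are hypothetical, and a triple is modelled as a plain String[3] of subject, predicate and object rather than SDB's prepared form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ChunkNodes {
    // Collects each distinct node of a chunk exactly once, in first-seen order.
    // Each triple is a String[3] of {subject, predicate, object}.
    static List<String> distinctNodes(List<String[]> chunk) {
        Set<String> seen = new LinkedHashSet<>();
        for (String[] triple : chunk) {
            seen.addAll(Arrays.asList(triple));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String[]> chunk = new ArrayList<>();
        chunk.add(new String[] {":a", "rdf:type", ":C"});
        chunk.add(new String[] {":b", "rdf:type", ":C"});
        // Six node occurrences, but only four distinct nodes go to the staging table.
        System.out.println(distinctNodes(chunk));
    }
}
```

In this example the property rdf:type and the class :C each occur twice but would be sent to the staging table once, which is the saving the text refers to.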
MySQL has a very useful INSERT IGNORE, which will keep going, skipping an offending row if a uniqueness constraint is violated. For other databases we need something else.
Having tried a number of options, the best seems to be to INSERT new items by LEFT JOINing new items to existing items, then filtering WHERE (existing item feature) IS NULL. Specifically, for the triple hash case (where no id lookups are needed):
    INSERT INTO Triples
    SELECT DISTINCT NTrip.s, NTrip.p, NTrip.o -- DISTINCT because new triples may contain duplicates (not so for nodes)
    FROM NTrip LEFT JOIN Triples ON (NTrip.s=Triples.s AND NTrip.p=Triples.p AND NTrip.o=Triples.o)
    WHERE Triples.s IS NULL OR Triples.p IS NULL OR Triples.o IS NULL
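To make the effect of that statement concrete, here is an in-memory analogue in Java. It is only a sketch of the semantics, not how SDB executes the load: the class and method names are hypothetical, a row is a List of three strings, and the set `triples` plays the role of the Triples table while the list `nTrip` plays the role of NTrip.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class AntiJoinInsert {
    // Adds to `triples` every row of `nTrip` that is not already present
    // and returns the rows that were actually inserted.
    static List<List<String>> insertNew(Set<List<String>> triples, List<List<String>> nTrip) {
        // SELECT DISTINCT: collapse duplicates inside the staging rows.
        Set<List<String>> distinct = new LinkedHashSet<>(nTrip);
        List<List<String>> inserted = new ArrayList<>();
        for (List<String> row : distinct) {
            // LEFT JOIN ... WHERE ... IS NULL: keep rows with no match in the target.
            if (!triples.contains(row)) {
                inserted.add(row);
            }
        }
        triples.addAll(inserted);
        return inserted;
    }

    public static void main(String[] args) {
        Set<List<String>> triples = new HashSet<>();
        triples.add(List.of("s1", "p", "o1"));
        List<List<String>> nTrip = List.of(
                List.of("s1", "p", "o1"),  // already stored, so skipped
                List.of("s2", "p", "o2"),
                List.of("s2", "p", "o2")); // duplicate within the chunk
        System.out.println(insertNew(triples, nTrip));
    }
}
```

Unlike INSERT IGNORE, this approach never attempts to insert a conflicting row in the first place, so it does not depend on how a given database reports uniqueness violations.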