Text searches with SPARQL

This module was first released with Jena 2.10.2.

This extension to ARQ combines SPARQL and text search.

It gives applications the ability to perform free text searches within SPARQL queries. A text index provides additional information for accessing the RDF graph.

The text index can be either Apache Lucene for a same-machine text index, or Apache Solr for a large scale enterprise search application.

Some example code is available here.

This module is not compatible with the much older LARQ module.

Illustration

This query makes a text query for 'word' on a specific property (the index needs to be correctly configured) and limits the output to 10 matches; it then looks in the RDF data for the actual label. More details are given below.

PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ; 
     rdfs:label ?label 
}

Architecture

The text index is used to provide a reverse index mapping query strings to URIs. The text indexed can be part of the RDF data, or the text index can be used to index external content with only additional RDF in the RDF store.

The text index uses the native query language of the index: Lucene query format or Solr query format.

A text-supporting dataset is configured with a description of which properties to index. When data is added, any property matching the description causes an entry to be added, built from the analysed text of the triple object and mapping to the subject.
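The mapping described above can be pictured with a small toy model. This is a minimal sketch using only the Java standard library, not the Lucene or jena-text implementation: the class name ToyTextIndex, the simplistic analyser, and the use of plain strings for properties and URIs are all assumptions made for illustration. Results come back in insertion order because the sketch does no scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of a reverse index: analysed terms map to subject URIs.
// Illustration only; this is not how Lucene or jena-text is implemented.
class ToyTextIndex {
    // Hypothetical index description: the single property to index.
    private final String indexedProperty;
    // term -> subject URIs whose indexed literal contained that term
    private final Map<String, Set<String>> postings = new LinkedHashMap<>();

    ToyTextIndex(String indexedProperty) {
        this.indexedProperty = indexedProperty;
    }

    // Stand-in for a text analyser: lower-case, split on non-letters/digits.
    static List<String> analyse(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Called for every triple added; only the described property is indexed.
    void add(String subject, String property, String literal) {
        if (!indexedProperty.equals(property)) {
            return;
        }
        for (String term : analyse(literal)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(subject);
        }
    }

    // Single-term lookup returning at most 'limit' subject URIs (no scoring).
    List<String> query(String term, int limit) {
        Set<String> hits = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
        List<String> out = new ArrayList<>();
        for (String subject : hits) {
            if (out.size() >= limit) {
                break;
            }
            out.add(subject);
        }
        return out;
    }
}
```

In this sketch, adding a triple whose property is not the indexed one leaves the index unchanged, which mirrors the role of the index description.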

Pattern A – RDF data

In this pattern, the data in the text index is indexing literals in the RDF data.
Additions to the RDF data are reflected in additions to the index.

(Deletes do not remove text index entries - see below)

Pattern B – External content

There is no requirement that the text data indexed is present in the RDF data. As long as the index contains text documents that match the index description, text search can be performed.

For example, if the content of a collection of documents is indexed and the URI naming the document is the result of the text search, then an RDF dataset with the document metadata can be combined with accessing the content by URI.

The maintenance of the index is external to the RDF data store.

External applications

By using Solr, in either pattern A (RDF data indexed) or pattern B (external content indexed), other applications can share the text index with SPARQL search.

Query with SPARQL

The property function is http://jena.apache.org/text#query, more conveniently written:

PREFIX text: <http://jena.apache.org/text#>

...   text:query ...

This is different to LARQ v1.

The following forms are all legal:

?s text:query 'word'              # query
?s text:query (rdfs:label 'word') # query specific property if multiple
?s text:query ('word' 10)         # with limit on results

The most general form is:

?s text:query (property 'query string' limit)

Only the query string is required, and if it is the only argument the surrounding ( ) can be omitted.

The property URI is only necessary if multiple properties have been indexed.

Argument      Definition
property      The URI (inc prefix name form)
query string  The native query string
limit         The limit on the results
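As an illustration of how the optional arguments combine, the following sketch renders each legal form from its parts. TextQueryPattern and render are hypothetical names, not part of Jena; the helper only builds the pattern text using the Java standard library, and it assumes the property is already given as a prefixed name or a <...> URI.

```java
// Hypothetical helper (not part of Jena): renders the argument forms of text:query.
class TextQueryPattern {
    // property is optional (pass null); limit is optional (pass a negative value).
    static String render(String subjectVar, String property, String queryString, int limit) {
        // Escape backslashes and single quotes for a single-quoted SPARQL string.
        String quoted = "'" + queryString.replace("\\", "\\\\").replace("'", "\\'") + "'";
        if (property == null && limit < 0) {
            // Only the query string: the surrounding ( ) can be omitted.
            return "?" + subjectVar + " text:query " + quoted;
        }
        StringBuilder sb = new StringBuilder("?" + subjectVar + " text:query (");
        if (property != null) {
            sb.append(property).append(' ');
        }
        sb.append(quoted);
        if (limit >= 0) {
            // The limit is written as a plain integer, not a quoted string.
            sb.append(' ').append(limit);
        }
        return sb.append(')').toString();
    }
}
```

For example, render("s", "rdfs:label", "word", 10) yields the most general form, while render("s", null, "word", -1) yields the bare query string form.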

Good practice

The query execution does not know the selectivity of the text index. It is better to use one of two styles.

Query pattern 1 – Find in the text index and enhance results

Access to the index is first in the query and used to find a number of items of interest; further information is obtained about these items from the RDF data.

SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ; 
     rdfs:label ?label ;
     rdf:type   ?type 
}

Limit is useful here when working with large indexes, to restrict results to the higher-scoring matches.

Query pattern 2 – Filter

By finding items of interest first in the RDF data, the text search can be used to restrict the items found still further.

SELECT ?s
{ ?s rdf:type     :book ;
     dc:creator   "John" .
  ?s text:query   (dc:title 'word') .
}

Configuration

The usual way to describe an index is with a Jena assembler description. Configurations can also be built with code. The assembler describes a 'text dataset' which has an underlying RDF dataset and a text index. The text index describes the text index technology (Lucene or Solr) and the details needed for each.

A text index has an "entity map" which defines the properties to index, the name of the Lucene/Solr field, and the field used for storing the URI itself.

For common RDF use, there will be one field, mapping a property to a text index field. More complex setups, with multiple properties per entity (URI), are possible.

Once set up this way, any data added to the text dataset is automatically indexed as well.

Text Dataset Assembler

The following is an example of a TDB dataset with a text index.

@prefix :        <http://localhost/jena_example/#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
# Solr index
text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true ; # Optional
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .

then use code such as:

Dataset ds = DatasetFactory.assemble(
    "text-config.ttl", 
    "http://localhost/jena_example/#text_dataset") ;

Key here is that the assembler contains two dataset definitions, one for the text dataset, one for the base data. Therefore, the application needs to identify the text dataset by its URI http://localhost/jena_example/#text_dataset.

Configuration by Code

A text dataset can also be constructed in code as might be done for a purely in-memory setup:

    // Example of building a text dataset with code.
    // Example is in-memory.
    // Base dataset
    Dataset ds1 = DatasetFactory.createMem() ;

    EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label) ;

    // Lucene, in memory.
    Directory dir =  new RAMDirectory();

    // Join together into a dataset
    Dataset ds = TextDatasetFactory.createLucene(ds1, dir, entDef) ;

Working with Fuseki

The Fuseki configuration simply points to the text dataset as the fuseki:dataset of the service.

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  :text_dataset ;
    .

Building a Text Index

When working at scale, or when preparing a published, read-only, SPARQL service, creating the index by loading the text dataset is impractical.
The index and the dataset can be built using command line tools in two steps: first load the RDF data, second create an index from the existing RDF dataset.

Step 1 - Building a TDB dataset

Note: If you have an existing TDB dataset then you can skip this step.

Build the TDB dataset:

java -cp $FUSEKI_HOME/fuseki-server.jar tdb.tdbloader --tdb=assembler_file data_file

using the copy of TDB included with Fuseki.

Alternatively, use one of the TDB utilities tdbloader or tdbloader2 which are better for bulk loading:

$JENA_HOME/bin/tdbloader --loc=directory  data_file

Step 2 - Build the Text Index

You can then build the text index with the jena.textindexer tool:

java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --desc=assembler_file

Because a Fuseki assembler description can have several dataset descriptions, and several text indexes, it may be necessary to extract a single dataset and index description into a separate assembler file for use in loading.

Deletion of Indexed Entities

If the text index is being maintained by changes to the RDF, then deletion of RDF triples or quads does not cause entries in the index to be removed. The index does not store the literal indexed, nor does it store a reference count of how many triples refer to the index entry, so the information needed to delete entries is not available.

In situations where this matters, the SPARQL query should look up in the text index, then check in the RDF data. Indeed, this may be necessary anyway because a text search does not necessarily give only exact matches.

In the initial example:

SELECT ?s ?label
{ ?s text:query (rdfs:label 'word' 10) ; 
     rdfs:label ?label 
}

the SPARQL query is checking that the rdfs:label triple exists, and if it does, returning the whole label.

By only indexing but not storing the literals themselves, the index is kept smaller. It may be necessary to periodically rebuild the index if a large proportion of the RDF data changes.
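The effect of stale index entries, and why checking the RDF data hides them, can be sketched with a toy model. This uses only the Java standard library; StaleIndexDemo and its methods are hypothetical names for illustration and do not correspond to the jena-text API or to how Lucene stores its data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of a text dataset: stored labels plus a term -> subject index.
// Illustration only; names and structure are hypothetical.
class StaleIndexDemo {
    // Stand-in for the RDF data: subject -> label literals.
    private final Map<String, Set<String>> labels = new HashMap<>();
    // The index keeps only term -> subjects: no literal, no reference count.
    private final Map<String, Set<String>> index = new HashMap<>();

    void addLabel(String subject, String literal) {
        labels.computeIfAbsent(subject, k -> new LinkedHashSet<>()).add(literal);
        for (String term : literal.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(subject);
            }
        }
    }

    // Deleting the triple leaves the index untouched: from term -> subject alone
    // it cannot be known whether another triple still justifies the entry.
    void deleteLabel(String subject, String literal) {
        Set<String> literals = labels.get(subject);
        if (literals != null) {
            literals.remove(literal);
            if (literals.isEmpty()) {
                labels.remove(subject);
            }
        }
    }

    // Index lookup only: may return subjects whose label has been deleted.
    List<String> searchIndexOnly(String term) {
        Set<String> hits = index.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<String>() : new ArrayList<>(hits);
    }

    // Index lookup followed by a check that a label still exists in the data,
    // in the spirit of "?s text:query ... ; rdfs:label ?label".
    List<String> searchAndCheck(String term) {
        List<String> out = new ArrayList<>();
        for (String subject : searchIndexOnly(term)) {
            if (labels.containsKey(subject)) {
                out.add(subject);
            }
        }
        return out;
    }
}
```

In this model, a subject whose only label has been deleted is still returned by searchIndexOnly but is dropped by searchAndCheck.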

Maven Dependency

The jena-text module is included in Fuseki. To use it within application code, use the following Maven dependency:

<dependency>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-text</artifactId>
  <version>X.Y.Z</version>
</dependency>

adjusting the version X.Y.Z as necessary. This will automatically include a compatible version of Lucene and the Solr Java client, but not the Solr server.