This module was first released with Jena 2.10.2.
This extension to ARQ combines SPARQL and text search.
It gives applications the ability to perform free text searches within SPARQL queries. Text indexes are additional information for accessing the RDF graph.
The text index can be either Apache Lucene for a same-machine text index, or Apache Solr for a large scale enterprise search application.
Some example code is available here.
This module is not compatible with the much older LARQ module.
Illustration
This query makes a text query for 'word' on a specific property (the index needs to correctly configured) and limits the output to 10 matches; it then looks in the RDF data for the actual label. More details are given below.
PREFIX text: <http://jena.apache.org/text#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?s { ?s text:query (rdfs:label 'word' 10) ; rdfs:label ?label }
The text index is used provide a reverse index mapping query strings to URIs. The text indexed can be part of the RDF data or the text index can be used to index external content with only additional RDF in the RDF store.
The text index uses the native query language of the index: Lucene query format or Solr query format.
A text-supporting dataset is configured with a description of which properties work with. When data is added, any properties matching the description caus an entry to be added from analysed text from the triple object and mapping to the subject.
In this pattern, the data in the text index is indexing literals in the RDF data.
Additions to the RDF data are reflected in additions to the index.
(Deletes do not remove text index entries - see below)
There is no requirement that the text data indexed is present in the RDF data. As long as the index contains the index text documents to match the index description, then text search can be performed.
For example, if the content of a collection of documents is indexed and the URI naming the document is the result of the text search, then an RDF dataset with the document metadata can be combined with accessing the content by URI.
The maintence of the index is external to the RDF data store.
By using Solr, in either pattern A (RDF data indexed) or pattern B (external content indexed), other applications can share the text index with SPARQL search.
The property function is http://jena.apache.org/text#query
more
conveniently writtern:
PREFIX text: <http://jena.apache.org/text#> ... text:query ...
This is different to LARQ v1.
The following forms are all legal:
?s text:query 'word' # query ?s text:query (rdfs:label 'word') # query specific property if multiple ?s text:query ('word' 10) # with limit on results
The most general form is:
?s text:query (property 'query string' 'limit')
Only the query string is required, and if it is the only argument the
surrounding ( )
can be omitted.
The property URI is only necessary if multiple properties have been indexed.
Argument | Definition |
---|---|
property | The URI (inc prefix name form) |
query string | The native query string |
limit | The limit on the results |
The query execution does not know the selectivity of the text index. It is better to use one of two styles.
Access to the index is first in the query and used to find a number of items of interest; further information is obtained about these items from the RDF data.
SELECT ?s { ?s text:query (rdfs:label 'word' 10) ; rdfs:label ?label ; rdf:type ?type }
Limit is useful here when working with large indexes to limit results to the more higher scoring results.
By finding items of interest first in the RDF data, the text search can be used to restrict the items found still further.
SELECT ?s { ?s rdf:type :book ; dc:createor "John" . ?s text:query (dc:title 'word') ; }
The usual way to describe an index is with a Jena assembler description. Configurations can also be built with code. The assembler describes a 'text dataset' which has an underlying RDF dataset and a text index. The text index describes the text index technology (Lucene or Solr) and the details needed for for each.
A text index has an "entity map" which defines the properties to index, the name of the lucene/solr field and field used for storing the URI itself.
For common RDF use, there will be one field, mapping a property to a text index field. More complex setups, with multiple properties per enitity (URI) are possible.
Once setup this way, any data added to the text dataset is automatically indexed as well.
The following is an example of a TDB dataset with a text index.
@prefix : <http://localhost/jena_example/#> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> . @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> . @prefix text: <http://jena.apache.org/text#> . ## Example of a TDB dataset and text index ## Initialize TDB [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" . tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset . tdb:GraphTDB rdfs:subClassOf ja:Model . ## Initialize text query [] ja:loadClass "org.apache.jena.query.text.TextQuery" . # A TextDataset is a regular dataset with a text index. text:TextDataset rdfs:subClassOf ja:RDFDataset . # Lucene index text:TextIndexLucene rdfs:subClassOf text:TextIndex . # Solr index text:TextIndexSolrne rdfs:subClassOf text:TextIndex . ## --------------------------------------------------------------- ## This URI must be fixed - it's used to assemble the text dataset. :text_dataset rdf:type text:TextDataset ; text:dataset <#dataset> ; text:index <#indexLucene> ; . # A TDB datset used for RDF storage <#dataset> rdf:type tdb:DatasetTDB ; tdb:location "DB" ; tdb:unionDefaultGraph true ; # Optional . # Text index description <#indexLucene> a text:TextIndexLucene ; text:directory <file:Lucene> ; ##text:directory "mem" ; text:entityMap <#entMap> ; . # Mapping in the index # URI stored in field "uri" # rdfs:label is mapped to field "text" <#entMap> a text:EntityMap ; text:entityField "uri" ; text:defaultField "text" ; text:map ( [ text:field "text" ; text:predicate rdfs:label ] ) .
then use code such as:
Dataset ds = DatasetFactory.assemble( "text-config.ttl", "http://localhost/jena_example/#text_dataset") ;
Key here is that the assembler contains two dataset definitions, one for
the text dataset, one for the base data. Therefore, the application
needs to identify the text dataset by it's URI
http://localhost/jena_example/#text_dataset
.
A text dataset can also be constructed in code as might be done for a purely in-memory setup:
// Example of building a text dataset with code. // Example is in-memory. // Base dataset Dataset ds1 = DatasetFactory.createMem() ; EntityDefinition entDef = new EntityDefinition("uri", "text", RDFS.label) ; // Lucene, in memory. Directory dir = new RAMDirectory(); // Join together into a dataset Dataset ds = TextDatasetFactory.createLucene(ds1, dir, entDef) ;
The Fuseki configuration simply points to the text dataset as the
fuseki:dataset
of the service.
<#service_text_tdb> rdf:type fuseki:Service ; rdfs:label "TDB/text service" ; fuseki:name "ds" ; fuseki:serviceQuery "query" ; fuseki:serviceQuery "sparql" ; fuseki:serviceUpdate "update" ; fuseki:serviceUpload "upload" ; fuseki:serviceReadGraphStore "get" ; fuseki:serviceReadWriteGraphStore "data" ; fuseki:dataset :text_dataset ; .
When working at scale, or when preparing a published, read-only, SPARQL
service, creating the index by loading the text dataset is impractical.
The index and the dataset can be built using command line tools in two
steps: first load the RDF data, second create an index from the existing
RDF dataset.
Note: If you have an existing TDB dataset then you can skip this step
Build the TDB dataset:
java -cp $FUSEKI_HOME/fuseki-server.jar tdb.tdbloader --tdb=assembler_file data_file
using the copy of TDB included with Fuseki.
Alternatively, use one of the
TDB utilities tdbloader
or tdbloader2
which are better for bulk loading:
$JENA_HOME/bin/tdbloader --loc=directory data_file
You can then build the text index with the jena.textindexer
tool:
java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --desc=assembler_file
Because a Fuseki assembler description can have several datasets descriptions, and several text indexes, it may be necessary to extract a single dataset and index description into a separate assembler file for use in loading.
If the text index is being maintain by changed to the RDF, then deletion of RDF triple or quads does not cause entries in the index to be removed. The index does not store the literal indexed, nor does it store a reference count of how many triples refer to the index so the information to delete entries is not available.
In situations where this matters, the SPARQL query should look up in the text index, then check in the RDF data. Indeed, this may be necessary anyway because a text search does not necessarily give only exact matches.
In the initial example:
SELECT ?s ?label { ?s text:query (rdfs:label 'word' 10) ; rdfs:label ?label }
the SPARQL query is checking that the rdfs:label
triple exists, and if it
does, returning the whole label.
By only indexing but not storing the literals themselves, the index is kept smaller. It may be necessary to periodically rebuild the index if a large proportion of the RDF data changes.
The jena-text
module is included in Fuseki. To use it within application code,
then use the following maven dependency:
<dependency> <groupId>org.apache.jena</groupId> <artifactId>jena-text</artifactId> <version>X.Y.Z</version> </dependency>
adjusting the version X.Y.Z
as necessary. This will automatically
include a compatible version of Lucene and the Solr java client, but not Solr server.