In the original RDF specifications there were two types of literal values defined - plain literals (which are basically strings with an optional language tag) and XML literals (which are more or less plain literals plus a "well-formed-xml" flag).
Part of the remit for the 2001 RDF Core working group was to add to RDF support for typed values, i.e. things like numbers. These notes describe the support for typed literals in Jena2.
Before going into the Jena details here are some informal reminders of how typed literals work in RDF. We refer readers to the RDF core semantics, syntax and concepts documents for more precise details.
In RDF, typed literal values comprise a string (the lexical form of
the literal) and a datatype (identified by a URI). The datatype is
supposed to denote a mapping from lexical forms to some space of
values. The pair comprising the literal then denotes an element of
the value space of the datatype. For example, a typed literal
comprising ("true", xsd:boolean)
would denote the abstract true
value T
.
In the RDF/XML syntax typed literals are notated with syntax such as:
<age rdf:datatype="http://www.w3.org/2001/XMLSchema#int">13</age>
In NTriple syntax the notation is:
"13"^^<http://www.w3.org/2001/XMLSchema#int>
In Turtle, it can be abbreviated:
"13"^^xsd:int
This ^^
notation will appear in literals printed by Jena.
Note that a literal is either typed or plain (an old style literal) and which it is can be determined statically. There is no way to define a literal as having a lexical value of, say "13" but leave its datatype open and then infer the datatype from some schema or ontology definition.
In the new scheme of things well-formed XML literals are treated as
typed literals whose datatype is the special type
rdf:XMLLiteral
.
Jena will correctly parse typed literals within RDF/XML, NTriple
and Turtle source files. The same Java object,
Literal
will represent "plain" and "typed" literals. Literal now supports
some new methods:
getDatatype()
Returns null for a plain literal or a Java object which represents
the datatype of a typed Literal.
getDatatypeURI()
Returns null for a plain literal or the URI of the datatype of a
typed Literal.
getValue()
Returns a Java object representing the value of the literal, for
example for an xsd:int this will be a java.lang.Integer, for plain
literals it will be a String.
The converse operation of creating a Java object to represent a
typed literal in a model can be achieved using:
model.createTypedLiteral(value, datatype)
This allows the value
to be specified by a lexical form (i.e. a
String) or by a Java object representing the typed value; the
datatype
can be specified by a URI string or a Java object
representing the datatype.
In addition there is a built in mapping from standard Java wrapper objects to XSD datatypes (see later) so that the simpler call:
model.createTypedLiteral(Object)
will create a typed literal with the datatype appropriate for representing that java object. For example,
Literal l = model.createTypedLiteral(new Integer(25));
will create a typed literal with the lexical value "25", of type
xsd:int
.
Note that there are also functions which look similar but do not use typed literals. For example::
Literal l = model.createLiteral(25); int age = l.getInt();
These worked by converting the primitive to a string and storing
the resulting string as a plain literal. The inverse operation then
attempts to parse the string of the plain literal (as an int in
this example). These are for backward compatibility with earlier
versions of Jena and older datasets. In normal circumstances
createTypedLiteral
is preferable.
There is a well defined notion of when two typed literals should be
equal, based on the equality defined for the datatype in question.
Jena2 implements this equality function by using the method
sameValueAs
. Thus two literals ("13", xsd:int) and ("13",
xsd:decimal) will test as sameValueAs each other but neither will
test sameValueAs ("13", xsd:string).
Note that this is a different function from the Java equals
method. Had we changed the equals method to test for semantic
equality problems would have arisen because the two objects are not
substitutable in the Java sense (for example they return different
values from a getDatatype() call). This would, for example, have
made it impossible to cache literals in a hash table.
Datatypes for typed literals are represented by instances of the
interface
com.hp.hpl.jena.datatypes.RDFDatatype
.
Instances of this interface can be used to parse and serialized
typed data, test for equality and test if a typed or lexical value
is a legal value for this datatype.
Prebuilt instances of this interface are included for all the main XSD datatypes (see below).
In addition, it is possible for an application to define new datatypes and register them against some URI (see below).
When Jena parses a datatype whose lexical value is not legal for
the declared datatype is does not immediately throw an error. This
is because the RDFCore working group has defined that illegal
datatype values are errors but are not syntactic errors so we try
to avoid have parsers break at this point. Instead a literal is
created which is marked internally as ill-formed and the first time
an application attempts to access its value (with getValue()
) an
error will be thrown.
When Jena is reading a file there is also the issue of what to do when it encounters a typed value whose datatype URI is not one that is knows about. The default behaviour is to create a new datatype object (whose value space is the same as its lexical space). Again this behaviour seems in keeping with the working group preference that illegal datatypes are semantic but not syntactic errors.
However, both of these behaviours can mean that simple common errors (such as mis-spelling the xsd namespace) may go unnoticed untill very late on. To overcome this we have hidden some global switches that allow you to force Jena to report such syntactic errors earlier. These are static Boolean parameters:
com.hp.hpl.jena.shared.impl.JenaParameters.enableEagerLiteralValidation com.hp.hpl.jena.shared.impl.JenaParameters.enableSilentAcceptanceOfUnknownDatatypes
They are placed here in an impl package (and thus only visible in the full javadoc, not the API javadoc) because they should not be regarded as stable. We plan to develop a cleaner way of setting mode switches for Jena and these switches will migrate there in due course, if they prove to be useful.
Jena includes prebuilt, and pre-registered, instances of
RDFDatatype
for all of the relevant XSD types:
float double int long short byte unsignedByte unsignedShort unsignedInt unsignedLong decimal integer nonPositiveInteger nonNegativeInteger positiveInteger negativeInteger Boolean string normalizedString anyURI token Name QName language NMTOKEN ENTITIES NMTOKENS ENTITY ID NCName IDREF IDREFS NOTATION hexBinary base64Binary date time dateTime duration gDay gMonth gYear gYearMonth gMonthDay
These are all available as static member variables from
com.hp.hpl.jena.datatypes.xsd.XSDDatatype
.
Of these types, the following are registered as the default type to use to represent certain Java classes:
Java class | xsd type |
---|---|
Float | float |
Double | double |
Integer | int |
Long | long |
Short | short |
Byte | byte |
BigInteger | integer |
BigDecimal | decimal |
Boolean | Boolean |
String | string |
Thus when creating a typed literal from a Java BigInteger
then
xsd:integer
will be used. The converse mapping is more adaptive.
When parsing an xsd:integer the Java value object used will be an
Integer, Long or BigInteger depending on the size of the specific
value being represented.
XML schema allows derived types to be defined in which a base type is modified through some facet restriction such as limiting the min/max of an integer or restricting a string to a regular expression. It also allows new types to be created by unioning other types or by constructing lists of other types.
Jena provides support for derived and union types but not for list types.
These are supported through the XSDDatatype.loadUserDefined
method which allows an XML schema datatype file to be loaded. This
registers a new RDFDatatype
that can be used to create, parse,
serialize, test instances of that datatype.
There is one difficult issue in here, what URI to give to the user defined datatype? This is not defined by XML Schema, nor RDF nor OWL. Jena2 adopts the position that the defined datatype should have the base URI of the schema file with a fragment identifier given by the datatype name.
To illustrate working with the defined types, the following code then tries to create and use two instances of the over 12 type:
Model m = ModelFactory.createDefaultModel(); RDFDatatype over12Type = tm.getSafeTypeByName(uri + "#over12"); Object value = null; try { value = "15"; m.createTypedLiteral((String)value, over12Type).getValue(); System.out.println("Over 12 value of " + value + " is ok"); value = "12"; m.createTypedLiteral((String)value, over12Type).getValue(); System.out.println("Over 12 value of " + value + " is OK"); } catch (DatatypeFormatException e) { System.out.println("Over 12 value of " + value + " is illegal"); }
which products the output:
Over 12 value of 15 is OK Over 12 value of 12 is illegal
RDF allows any URI to be used as a datatype but provides no standard for how to map the datatype URI to a datatype definition.
Within Jena2 we allow new datatypes to be created and registered by
using the
TypeMapper
class.
The easiest way to define a new RDFDatatype is to subclass BaseDatatype and define implementations for parse, unparse and isEqual.
For example here is the outline of a type used to represent rational numbers:
class RationalType extends BaseDatatype { public static final String theTypeURI = "urn:x-hp-dt:rational"; public static final RDFDatatype theRationalType = new RationalType(); /** private constructor - single global instance */ private RationalType() { super(theTypeURI); } /** * Convert a value of this datatype out * to lexical form. */ public String unparse(Object value) { Rational r = (Rational) value; return Integer.toString(r.getNumerator()) + "/" + r.getDenominator(); } /** * Parse a lexical form of this datatype to a value * @throws DatatypeFormatException if the lexical form is not legal */ public Object parse(String lexicalForm) throws DatatypeFormatException { int index = lexicalForm.indexOf("/"); if (index == -1) { throw new DatatypeFormatException(lexicalForm, theRationalType, ""); } try { int numerator = Integer.parseInt(lexicalForm.substring(0, index)); int denominator = Integer.parseInt(lexicalForm.substring(index+1)); return new Rational(numerator, denominator); } catch (NumberFormatException e) { throw new DatatypeFormatException(lexicalForm, theRationalType, ""); } } /** * Compares two instances of values of the given datatype. * This does not allow rationals to be compared to other number * formats, Lang tag is not significant. */ Public Boolean isEqual(LiteralLabel value1, LiteralLabel value2) { return value1.getDatatype() == value2.getDatatype() && value1.getValue().equals(value2.getValue()); } }
To register and and use this type you simply need the call:
RDFDatatype rtype = RationalType.theRationalType; TypeMapper.getInstance().registerDatatype(rtype); ... // Create a rational literal Literal l1 = m.createTypedLiteral("3/5", rtype);
Note that whilst any serialization of RDF containing such user defined literals will be perfectly legal a client application has no standard way of looking up the datatype URI you have chosen. This has to be done "out of band" as they say.
Plain literals have an xml:Lang tag as well as a string value. Two plain literals with the same string but different Lang tags are not equal.
XML Schema states that xml:Lang is not meaningful on xsd datatypes.
Thus for almost all typed literals there is no xml:Lang tag.
At the time of last call the RDF specifications allowed the special
case that rdf:XMLLiteral
s could have a Lang tag that would be
significant in equality testing. Thus in preview releases of Jena2
the createTypedLiterals calls took an extra Lang tag argument.
However, at the time of writing that specification has been changed so that Lang tags will never be significant on typed literals (whether this means that xml:Lang is not significant on XMLLiterals or means that XMLLiteral will cease to be a typed literal is not completely certain).
For this reason we have removed the Lang tag from the createTypedLiterals calls and deprecated the createLiteral call which allowed both wellFormedXML and Lang tag to be specified.
We do not expect to need to change the API even if the working group decision changes again, the most we might expect to do would be to undeprecate the 3-argument version of createLiteral.