SpotterBase Annotation Format
In SpotterBase, annotations are represented as sets of RDF triples. In particular, the representation is based on the recommendations of the W3C Web Annotation Working Group.
SpotterBase organizes information into self-contained chunks, which we call records (an annotation, for example, is a record). A record can be represented in different formats:
A JSON format based on JSON-LD. This hides the RDF nature of the annotations, but with the right JSON-LD context it can be directly imported into a triple store. Each JSON object corresponds to a record.
A set of RDF triples.
A Python object.
SpotterBase can convert between these formats.
To an extent, you can use SpotterBase without knowing RDF (e.g. if you only use the JSON format), but it is nevertheless useful to understand the underlying RDF model. A Brief Introduction to RDF might be a good place to start learning about RDF.
Annotations
View JSON
{
  "type": "Annotation",
  "id": "http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno",
  "target": "http://sigmathling.kwarc.info/arxmliv/2020/math/0511246",
  "body": {
    "type": "SimpleTagBody",
    "val": "http://sigmathling.kwarc.info/arxmliv/severity/error"
  },
  "creator": "http://sigmathling.kwarc.info/spotterbase/spotter/arxmlivmetadata"
}
View Python (auto-generated)
from spotterbase.model_core import Annotation, SimpleTagBody
from spotterbase.rdf import Uri

annotation = Annotation(
    target_uri=Uri('http://sigmathling.kwarc.info/arxmliv/2020/math/0511246'),
    body=SimpleTagBody(
        tag=Uri('http://sigmathling.kwarc.info/arxmliv/severity/error'),
    ),
    creator_uri=Uri('http://sigmathling.kwarc.info/spotterbase/spotter/arxmlivmetadata'),
)
View Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sb: <https://ns.mathhub.info/project/sb/> .
_:591667b9398241135a9e8a55107aa1bf a sb:SimpleTagBody ;
rdf:value <http://sigmathling.kwarc.info/arxmliv/severity/error> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno> a oa:Annotation ;
oa:hasTarget <http://sigmathling.kwarc.info/arxmliv/2020/math/0511246> ;
oa:hasBody _:591667b9398241135a9e8a55107aa1bf ;
dcterms:creator <http://sigmathling.kwarc.info/spotterbase/spotter/arxmlivmetadata> .
View N-Triples
<http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/oa#Annotation> .
<http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno> <http://www.w3.org/ns/oa#hasTarget> <http://sigmathling.kwarc.info/arxmliv/2020/math/0511246> .
_:591667b9398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/SimpleTagBody> .
_:591667b9398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> <http://sigmathling.kwarc.info/arxmliv/severity/error> .
<http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno> <http://www.w3.org/ns/oa#hasBody> _:591667b9398241135a9e8a55107aa1bf .
<http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno> <http://purl.org/dc/terms/creator> <http://sigmathling.kwarc.info/spotterbase/spotter/arxmlivmetadata> .
Following the recommendations of the W3C Web Annotation Working Group,
an annotation has two main components: a target and a body.
The target indicates what gets annotated, and the body contains information that should be associated with the target.
The annotation above, for example, indicates that the document
http://sigmathling.kwarc.info/arxmliv/2020/math/0511246 (the target)
was created by arXMLiv with the severity error (the body).
In this case, the body is a simple tag, but we can also have more complex bodies.
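The correspondence between the JSON view and the RDF views of this annotation can be made concrete with a small sketch. The key-to-predicate mapping below is read off the Turtle and N-Triples views of the example; it is only an illustration, not SpotterBase's actual JSON-LD context, and `to_ntriples` is a hypothetical helper that handles nothing beyond URI-valued fields and nested records.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OA = "http://www.w3.org/ns/oa#"
SB = "https://ns.mathhub.info/project/sb/"

# Mapping taken from the example views (an assumption, not the real JSON-LD context).
PREDICATES = {
    "target": OA + "hasTarget",
    "body": OA + "hasBody",
    "creator": "http://purl.org/dc/terms/creator",
    "val": RDF + "value",
}
TYPES = {"Annotation": OA + "Annotation", "SimpleTagBody": SB + "SimpleTagBody"}


def to_ntriples(record, counter=None):
    """Return (subject, lines); nested records without an "id" become blank nodes."""
    counter = counter if counter is not None else [0]
    if "id" in record:
        subject = f"<{record['id']}>"
    else:
        counter[0] += 1
        subject = f"_:b{counter[0]}"
    lines = [f"{subject} <{RDF}type> <{TYPES[record['type']]}> ."]
    for key, value in record.items():
        if key in ("id", "type"):
            continue
        if isinstance(value, dict):
            obj, nested = to_ntriples(value, counter)
            lines.extend(nested)
        else:
            obj = f"<{value}>"  # every plain value in this example is a URI
        lines.append(f"{subject} <{PREDICATES[key]}> {obj} .")
    return subject, lines


annotation = {
    "type": "Annotation",
    "id": "http://sigmathling.kwarc.info/arxmliv/2020/math/0511246#meta.severity.anno",
    "target": "http://sigmathling.kwarc.info/arxmliv/2020/math/0511246",
    "body": {
        "type": "SimpleTagBody",
        "val": "http://sigmathling.kwarc.info/arxmliv/severity/error",
    },
    "creator": "http://sigmathling.kwarc.info/spotterbase/spotter/arxmlivmetadata",
}

_, triples = to_ntriples(annotation)
print("\n".join(triples))
```

The output contains the same six triples as the N-Triples view, except that the blank node for the body gets a simple counter-based label.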
Every annotation must have a unique identifier (the "id" field in the JSON format).
This could be anything, but it can be helpful to use an identifier based on the document that is being annotated.
Annotations can also have a creator (in this case the annotation was created by a script). A separate record can provide more information about the creator.
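As a minimal sketch of the points above, the following plain-Python helper builds an annotation record in the JSON format and derives the identifier from the annotated document. The function name `make_tag_annotation` and its `local_name` parameter are hypothetical and not part of SpotterBase; the scheme of appending a fragment to the document URI simply mirrors the example.

```python
def make_tag_annotation(document_uri, local_name, tag_uri, creator_uri=None):
    """Build an annotation record (JSON format) whose body is a simple tag."""
    record = {
        "type": "Annotation",
        # Identifier based on the annotated document, as suggested above.
        "id": f"{document_uri}#{local_name}",
        "target": document_uri,
        "body": {"type": "SimpleTagBody", "val": tag_uri},
    }
    if creator_uri is not None:  # the creator is optional in this sketch
        record["creator"] = creator_uri
    return record


record = make_tag_annotation(
    "http://sigmathling.kwarc.info/arxmliv/2020/math/0511246",
    "meta.severity.anno",
    "http://sigmathling.kwarc.info/arxmliv/severity/error",
    creator_uri="http://sigmathling.kwarc.info/spotterbase/spotter/arxmlivmetadata",
)
```

With these arguments, `record` has the same content as the JSON view of the example annotation.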
Targets
The example annotation in the Annotations section targets an entire document. In practice, however, we usually want to annotate only a part (a fragment) of a document like a word or a formula.
We can do this by creating a FragmentTarget record, like the following one:
View JSON
{
  "type": "FragmentTarget",
  "id": "http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41",
  "source": "http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA",
  "selector": [
    {
      "type": "PathSelector",
      "startPath": "char(/html/body/div/div/article/section/div[2]/div/p/span,82)",
      "endPath": "char(/html/body/div/div/article/section/div[2]/div/p/span,84)"
    },
    {
      "type": "OffsetSelector",
      "start": 411,
      "end": 413
    }
  ]
}
View Python (auto-generated)
from spotterbase.model_core import FragmentTarget, PathSelector, OffsetSelector
from spotterbase.rdf import Uri

fragment_target = FragmentTarget(
    source=Uri('http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA'),
    selectors=[
        PathSelector(
            start='char(/html/body/div/div/article/section/div[2]/div/p/span,82)',
            end='char(/html/body/div/div/article/section/div[2]/div/p/span,84)',
        ),
        OffsetSelector(
            start=411,
            end=413,
        ),
    ],
)
View Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sb: <https://ns.mathhub.info/project/sb/> .
_:59e6489b398241135a9e8a55107aa1bf a sb:PathSelector ;
sb:startPath "char(/html/body/div/div/article/section/div[2]/div/p/span,82)" ;
sb:endPath "char(/html/body/div/div/article/section/div[2]/div/p/span,84)" .
@prefix oa: <http://www.w3.org/ns/oa#> .
_:59e64a69398241135a9e8a55107aa1bf a sb:OffsetSelector ;
oa:start "411"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
oa:end "413"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
<http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41> a sb:FragmentTarget ;
oa:hasSource <http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA> ;
oa:hasSelector _:59e6489b398241135a9e8a55107aa1bf,
_:59e64a69398241135a9e8a55107aa1bf .
View N-Triples
<http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/FragmentTarget> .
<http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41> <http://www.w3.org/ns/oa#hasSource> <http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA> .
_:59e6489b398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/PathSelector> .
_:59e6489b398241135a9e8a55107aa1bf <https://ns.mathhub.info/project/sb/startPath> "char(/html/body/div/div/article/section/div[2]/div/p/span,82)" .
_:59e6489b398241135a9e8a55107aa1bf <https://ns.mathhub.info/project/sb/endPath> "char(/html/body/div/div/article/section/div[2]/div/p/span,84)" .
_:59e64a69398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/OffsetSelector> .
_:59e64a69398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#start> "411"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e64a69398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#end> "413"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
<http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41> <http://www.w3.org/ns/oa#hasSelector> _:59e6489b398241135a9e8a55107aa1bf .
<http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41> <http://www.w3.org/ns/oa#hasSelector> _:59e64a69398241135a9e8a55107aa1bf .
A FragmentTarget record has the following fields:
"id": A unique identifier for the target. This identifier can be used in annotations to refer to the target.
"source": The document that contains the fragment.
"selector": A list of selectors that specify the fragment. Each selector should specify the same fragment.
The selectors suggested by the Web Annotation Working Group were insufficient (or at least inconvenient) for our purposes, so SpotterBase supports two custom selectors: PathSelector and OffsetSelector. While both selectors specify the same fragment, they have different advantages and disadvantages depending on the application.
Tip
SpotterBase can convert between the two selectors. If you write a spotter, you only need to create one of them and SpotterBase can create the other one for you.
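To illustrate how the "id" of a FragmentTarget is used, here is a sketch in the JSON format, written as plain Python dictionaries. The target carries only one selector, in line with the tip above. The annotation identifier, the tag URI and the helper `resolve_target` are hypothetical and chosen only for this example.

```python
fragment_target = {
    "type": "FragmentTarget",
    "id": "http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.target.41",
    "source": "http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA",
    "selector": [{"type": "OffsetSelector", "start": 411, "end": 413}],
}

annotation = {
    "type": "Annotation",
    # Hypothetical identifier, following the "based on the document" convention.
    "id": "http://sigmathling.kwarc.info/spotterbase/test-corpus/paperA#spostag.anno.41",
    # The annotation refers to the fragment through the target's identifier.
    "target": fragment_target["id"],
    # Hypothetical tag URI.
    "body": {"type": "SimpleTagBody", "val": "http://example.org/pos-tags/NN"},
}


def resolve_target(annotation, records):
    """Look up the record an annotation targets, or None if the target has no record."""
    by_id = {record["id"]: record for record in records if "id" in record}
    return by_id.get(annotation["target"])


target = resolve_target(annotation, [fragment_target, annotation])
print(target["source"])
```

An annotation that targets a whole document, like the one in the Annotations section, would resolve to `None` here, because its target is the document URI itself rather than a FragmentTarget record.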
The PathSelector
The PathSelector selects a document range
by specifying the start and the end of the fragment.
Following the Web Annotation recommendations,
the end is exclusive, i.e. the specified end is not part of the fragment.
The specification is based on XPath. Concretely, three types of expression are supported:
char(xpath, n): Selects the n-th character of text inside the tag specified by the XPath expression. The XPath should not select a text node, but a tag that contains text. Text in nested tags is also counted.
node(xpath): Selects the node specified by the XPath expression.
after-node(xpath): Selects the point right after the node specified by the XPath expression. This is useful for the end of the fragment (the end is excluded, so after-node lets you effectively include the node).
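The three expression forms are simple enough to take apart with a regular expression. The following is a minimal sketch of such a parser, not SpotterBase's own implementation; `PathPoint` and `parse_path_point` are hypothetical names. It splits `char` expressions at the last comma, since an XPath expression may itself contain commas.

```python
import re
from typing import NamedTuple, Optional


class PathPoint(NamedTuple):
    kind: str               # "char", "node" or "after-node"
    xpath: str
    index: Optional[int]    # character index, only set for "char"


_EXPRESSION = re.compile(r"(char|node|after-node)\((.*)\)")


def parse_path_point(expression):
    """Split a PathSelector start/end expression into its components."""
    match = _EXPRESSION.fullmatch(expression.strip())
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    kind, inner = match.groups()
    if kind != "char":
        return PathPoint(kind, inner.strip(), None)
    # The character index follows the last comma of the expression.
    xpath, _, number = inner.rpartition(",")
    number = number.strip()
    if not xpath or not (number.isascii() and number.isdigit()):
        raise ValueError(f"malformed char expression: {expression!r}")
    return PathPoint(kind, xpath.strip(), int(number))


start = parse_path_point("char(/html/body/div/div/article/section/div[2]/div/p/span,82)")
end = parse_path_point("after-node(/html/body/div/div/article/section/div[2])")
```

Here `start.index` is 82 and `end.index` is `None`; the XPath part can then be handed to any XPath-capable library.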
Important
SpotterBase assumes that documents are static. It is designed to deal with “frozen” corpora, not with documents that change over time. That makes the XPath expressions much less brittle. Nevertheless, there are some things to keep in mind:
Some HTML parsers (including browsers) insert additional tags into the document. The main example we are aware of is the insertion of a <tbody> tag into tables.
char expressions should count characters. For example, in JavaScript, the length property of a string counts UTF-16 code units, not characters.
Note
While in principle arbitrary XPath expressions are supported, simple absolute paths are preferred as they can be processed efficiently and are supported by a wide range of tools.
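The character-counting caveat from the Important box can be illustrated in Python, whose `len` counts Unicode code points. This sketch assumes that "characters" in char expressions means code points (the text above only rules out UTF-16 code units). `utf16_to_char_offset` is a hypothetical helper for converting an offset obtained from a tool that counts UTF-16 code units.

```python
# "x ∈ 𝔸": the last character (U+1D538) lies outside the Basic Multilingual Plane.
text = "x \u2208 \U0001D538"

code_points = len(text)                           # Python counts code points
utf16_units = len(text.encode("utf-16-le")) // 2  # what JavaScript's length reports


def utf16_to_char_offset(text, utf16_offset):
    """Convert an offset in UTF-16 code units to an offset in code points.

    An offset that falls inside a surrogate pair is rounded up.
    """
    units = 0
    for index, character in enumerate(text):
        if units >= utf16_offset:
            return index
        units += 2 if ord(character) > 0xFFFF else 1
    return len(text)


print(code_points, utf16_units)  # 5 6
```

For text without such characters, which covers most but not all mathematical documents, the two counts coincide.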
The OffsetSelector
The OffsetSelector can select document ranges with the same granularity as the PathSelector,
but it uses offsets instead of XPath expressions.
The offsets essentially count every opening tag, closing tag and character in text nodes.
While this is more difficult to emulate in other tools,
it has two key advantages:
Offsets can be represented much more compactly.
Offsets can be compared easily, e.g. to check if one target is contained in another one.
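The second advantage boils down to integer comparisons. A minimal sketch, using selectors in the JSON format as plain dictionaries and assuming that both selectors refer to the same document and that the end offset is exclusive, as for the PathSelector. The function names and the paragraph offsets are hypothetical.

```python
def contains(outer, inner):
    """True if the range of `inner` lies completely within the range of `outer`."""
    return outer["start"] <= inner["start"] and inner["end"] <= outer["end"]


def overlaps(a, b):
    """True if the two ranges share at least one position (ends are exclusive)."""
    return a["start"] < b["end"] and b["start"] < a["end"]


paragraph = {"type": "OffsetSelector", "start": 400, "end": 500}  # hypothetical offsets
word = {"type": "OffsetSelector", "start": 411, "end": 413}

print(contains(paragraph, word))  # True
print(contains(word, paragraph))  # False
```

The equivalent check on two PathSelectors would require resolving both XPath expressions against the document first.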
Aside from SpotterBase-internal uses, you might encounter the OffsetSelector in two other places:
When you write a spotter using pre-processed files, they typically only contain references based on the OffsetSelector, as these are much more compact. SpotterBase can then convert them to a PathSelector for you.
In SPARQL queries, you can use the OffsetSelector to compare targets (e.g. to check if a word annotation is contained in a paragraph annotation).
Discontinuous fragments
The PathSelector and OffsetSelector can only select continuous fragments.
To select discontinuous fragments, the selectors can be refined with a ListSelector,
which lists selectors for the ranges that make up the discontinuous fragment.
The ListSelector should only be used as a refinement of a selector that
selects the complete range of the fragment.
That way, tools that do not support discontinuous fragments can still process the annotation.
The selectors in a ListSelector should have the same type as the selector that is refined.
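These constraints can be checked mechanically. The sketch below works on selectors in the JSON format, using the "refinedBy" and "vals" keys from the example in this section. It interprets "the complete range" as the smallest range that covers all listed parts, which is an assumption; `check_refinement` is a hypothetical helper and the range check is only implemented for the OffsetSelector.

```python
def check_refinement(selector):
    """Return a list of problems found in a selector that is refined by a ListSelector."""
    refinement = selector.get("refinedBy")
    if refinement is None:
        return []  # continuous fragment, nothing to check
    if refinement.get("type") != "ListSelector":
        return ["refinement is not a ListSelector"]
    parts = refinement.get("vals", [])
    if not parts:
        return ["ListSelector is empty"]

    problems = []
    for part in parts:
        if part.get("type") != selector["type"]:
            problems.append(f"part has type {part.get('type')}, expected {selector['type']}")
    if not problems and selector["type"] == "OffsetSelector":
        covered = (min(p["start"] for p in parts), max(p["end"] for p in parts))
        if covered != (selector["start"], selector["end"]):
            problems.append("refined selector does not span the complete range")
    return problems


selector = {
    "type": "OffsetSelector",
    "start": 752,
    "end": 773,
    "refinedBy": {
        "type": "ListSelector",
        "vals": [
            {"type": "OffsetSelector", "start": 752, "end": 761},
            {"type": "OffsetSelector", "start": 767, "end": 773},
        ],
    },
}
print(check_refinement(selector))  # []
```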
Here is an example of a discontinuous fragment using the OffsetSelector:
View JSON
{
  "type": "OffsetSelector",
  "start": 752,
  "end": 773,
  "refinedBy": {
    "type": "ListSelector",
    "vals": [
      {
        "type": "OffsetSelector",
        "start": 752,
        "end": 761
      },
      {
        "type": "OffsetSelector",
        "start": 767,
        "end": 773
      }
    ]
  }
}
View Python (auto-generated)
from spotterbase.model_core import OffsetSelector, ListSelector

offset_selector = OffsetSelector(
    start=752,
    end=773,
    refinement=ListSelector(
        selectors=[
            OffsetSelector(
                start=752,
                end=761,
            ),
            OffsetSelector(
                start=767,
                end=773,
            ),
        ],
    ),
)
View Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sb: <https://ns.mathhub.info/project/sb/> .
@prefix oa: <http://www.w3.org/ns/oa#> .
_:59e290d7398241135a9e8a55107aa1bf a sb:OffsetSelector ;
oa:start "752"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
oa:end "761"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29119398241135a9e8a55107aa1bf a sb:OffsetSelector ;
oa:start "767"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
oa:end "773"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29f81398241135a9e8a55107aa1bf a sb:ListSelector ;
rdf:value _:59e2925b398241135a9e8a55107aa1bf .
_:59e2925b398241135a9e8a55107aa1bf rdf:first _:59e290d7398241135a9e8a55107aa1bf ;
rdf:rest _:59e292d9398241135a9e8a55107aa1bf .
_:59e292d9398241135a9e8a55107aa1bf rdf:first _:59e29119398241135a9e8a55107aa1bf ;
rdf:rest rdf:nil .
_:59e29e07398241135a9e8a55107aa1bf a sb:OffsetSelector ;
oa:start "752"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
oa:end "773"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
oa:refinedBy _:59e29f81398241135a9e8a55107aa1bf .
View N-Triples
_:59e29e07398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/OffsetSelector> .
_:59e29e07398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#start> "752"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29e07398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#end> "773"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29f81398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/ListSelector> .
_:59e290d7398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/OffsetSelector> .
_:59e290d7398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#start> "752"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e290d7398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#end> "761"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29119398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/OffsetSelector> .
_:59e29119398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#start> "767"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29119398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#end> "773"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
_:59e29f81398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> _:59e2925b398241135a9e8a55107aa1bf .
_:59e2925b398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:59e290d7398241135a9e8a55107aa1bf .
_:59e2925b398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> _:59e292d9398241135a9e8a55107aa1bf .
_:59e292d9398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:59e29119398241135a9e8a55107aa1bf .
_:59e292d9398241135a9e8a55107aa1bf <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:59e29e07398241135a9e8a55107aa1bf <http://www.w3.org/ns/oa#refinedBy> _:59e29f81398241135a9e8a55107aa1bf .
Note
In the RDF representation, the selectors in the ListSelector
are represented as a linked list (using rdf:first, rdf:rest and rdf:nil).
This makes it possible to represent a closed list despite the open-world assumption of RDF.
Despite the list-like nature of the representation, the order of the selectors is not significant.
For SPARQL queries, you can use the rdf:rest*/rdf:first property path to get all selectors in the list.
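Outside of SPARQL, the same traversal can be done by following rdf:rest until rdf:nil is reached. The following sketch does this in plain Python over triples given as tuples of strings; it is not tied to any RDF library. The short blank node labels are hypothetical stand-ins for the generated labels in the example, and `list_members` is a hypothetical helper.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def list_members(triples, head):
    """Collect the members of an RDF linked list that starts at `head`."""
    lookup = {(s, p): o for s, p, o in triples}
    members, seen, node = [], set(), head
    while node != RDF + "nil":
        if node in seen:
            raise ValueError("malformed list: cycle detected")
        seen.add(node)
        members.append(lookup[(node, RDF + "first")])
        node = lookup[(node, RDF + "rest")]
    return members


# Same shape as the linked list in the example, with shortened labels.
triples = [
    ("_:list_selector", RDF + "value", "_:cell1"),
    ("_:cell1", RDF + "first", "_:part1"),
    ("_:cell1", RDF + "rest", "_:cell2"),
    ("_:cell2", RDF + "first", "_:part2"),
    ("_:cell2", RDF + "rest", RDF + "nil"),
]

head = next(o for s, p, o in triples if s == "_:list_selector" and p == RDF + "value")
print(list_members(triples, head))  # ['_:part1', '_:part2']
```

The result comes back in list order, but as noted above, code should not attach any meaning to that order.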
Bodies
In principle, the body of an annotation can be any RDF resource. This is key to the flexibility of the annotation model and was a primary reason for using RDF in the first place.
For consistency, however, SpotterBase supports a few standard types of bodies.
Creators
Annotations should have a creator. That is especially important if we want to compare annotations from different sources and have them in the same database. For example, we might have two people annotating the same document and then want to compute the inter-annotator agreement. Similarly, we might want to evaluate a spotter by comparing its annotations to a gold standard.
The creator of an annotation (creator field) can be any RDF resource.
SpotterBase also offers records to provide more information about creators.
For example, the SpotterRun record can be used to provide information about a spotter run.
Each run of a spotter should have a unique identifier.
That makes it possible to compare annotations from different runs of the same spotter.
Here is an example of a SpotterRun record:
View JSON
{
  "type": "SpotterRun",
  "id": "urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496",
  "withSpotter": "http://sigmathling.kwarc.info/spotterbase/ext/spotters#spostag",
  "spotterVersion": "0.0.2",
  "label": "Simple Part-Of-Speech Tagger based on NLTK",
  "created": "2024-01-04T12:46:12.475054"
}
View Python (auto-generated)
from spotterbase.model_core import SpotterRun
from spotterbase.rdf import Uri
from datetime import datetime

spotter_run = SpotterRun(
    spotter_uri=Uri('http://sigmathling.kwarc.info/spotterbase/ext/spotters#spostag'),
    spotter_version='0.0.2',
    label='Simple Part-Of-Speech Tagger based on NLTK',
    date=datetime(2024, 1, 4, 12, 46, 12, 475054),
)
View Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sb: <https://ns.mathhub.info/project/sb/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
<urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496> a sb:SpotterRun ;
sb:withSpotter <http://sigmathling.kwarc.info/spotterbase/ext/spotters#spostag> ;
sb:spotterVersion "0.0.2" ;
rdfs:label "Simple Part-Of-Speech Tagger based on NLTK" ;
dcterms:created "2024-01-04T12:46:12.475054"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
View N-Triples
<urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://ns.mathhub.info/project/sb/SpotterRun> .
<urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496> <https://ns.mathhub.info/project/sb/withSpotter> <http://sigmathling.kwarc.info/spotterbase/ext/spotters#spostag> .
<urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496> <https://ns.mathhub.info/project/sb/spotterVersion> "0.0.2" .
<urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496> <http://www.w3.org/2000/01/rdf-schema#label> "Simple Part-Of-Speech Tagger based on NLTK" .
<urn:uuid:96233573-e637-4c88-aa2b-24cfcd627496> <http://purl.org/dc/terms/created> "2024-01-04T12:46:12.475054"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
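One way to give every run its own identifier is a random UUID, as in the example record. The following sketch builds a SpotterRun record in the JSON format using only the Python standard library. `new_spotter_run` is a hypothetical helper, and using a UUID URN is just one possible scheme for unique identifiers.

```python
import uuid
from datetime import datetime


def new_spotter_run(spotter_uri, version, label):
    """Build a SpotterRun record (JSON format) with a fresh identifier."""
    return {
        "type": "SpotterRun",
        "id": uuid.uuid4().urn,  # a random "urn:uuid:..." identifier, unique per call
        "withSpotter": spotter_uri,
        "spotterVersion": version,
        "label": label,
        "created": datetime.now().isoformat(),
    }


run_a = new_spotter_run(
    "http://sigmathling.kwarc.info/spotterbase/ext/spotters#spostag",
    "0.0.2",
    "Simple Part-Of-Speech Tagger based on NLTK",
)
run_b = new_spotter_run(run_a["withSpotter"], run_a["spotterVersion"], run_a["label"])
```

Both records describe the same spotter and version, but their identifiers differ, so annotations that use `run_a["id"]` or `run_b["id"]` as creator can be told apart.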