Developing a spotter

In this tutorial, we will discuss four different approaches to developing a spotter, each of which relies more heavily on the SpotterBase codebase than the previous one.

Along the way, we will develop multiple spotters that build on each other’s results.

Formula Spotter (does not use SpotterBase)

You do not have to use SpotterBase at all and can simply implement code that creates annotations in the right format (either in the JSON format or as RDF triples). Often, the SpotterBase command line tools might be helpful anyway, but for some applications, their usefulness might be very limited.

As an example, we will create a spotter that annotates each document with whether or not it contains a formula.

Here is a simple implementation of such a spotter:

# We use SpotterBase only to get the directory of the test corpus
from spotterbase.corpora.test_corpus import TEST_CORPUS_DIR

with open("mathcheck.ttl", "w") as fp:
    fp.write("@prefix oa: <http://www.w3.org/ns/oa#> .\n")
    fp.write("@prefix sb: <https://ns.mathhub.info/project/sb/> .\n")
    fp.write("@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n")
    fp.write("@prefix dcterms: <http://purl.org/dc/terms/> .\n\n")

    for path in sorted(TEST_CORPUS_DIR.glob('*.html')):
        doc_uri = "https://ns.mathhub.info/project/sb/data/test-corpus/" + path.name[:-5]
        if "</math>" in path.read_text():       # very crude check
            tag = "http://example.org/mathspotter-result#contains-math"
        else:
            tag = "http://example.org/mathspotter-result#no-math"

        anno_uri = doc_uri + "#mathcheckanno"   # can be anything, but must be unique
        fp.write(f"\n<{anno_uri}> a oa:Annotation ;\n")
        fp.write(f"    oa:hasTarget <{doc_uri}> ;\n")
        fp.write(f"    oa:hasBody [\n")
        fp.write(f"        a sb:SimpleTagBody ;\n")
        fp.write(f"        rdf:value <{tag}> ;\n")
        fp.write(f"    ] ;\n")
        fp.write(f"    dcterms:creator <http://example.org/mathspotter> .\n")

In this example, we use Python, but you can use any programming language. The example creates a Turtle file (mathcheck.ttl) that contains the annotations. Another option would have been to create a JSON file.
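For comparison, the JSON route could look roughly like the sketch below. It assumes that a document-level annotation can use the same record shape as the JSON annotations produced later in this tutorial (`type`, `id`, `target`, `body`), with the document URI itself as the target. The helper name `make_tag_annotation` and the document name `paperB` are invented for this illustration, and the creator information from the Turtle version is left out because its JSON key is not shown in this tutorial.

```python
import json


def make_tag_annotation(doc_uri: str, contains_math: bool) -> dict:
    """Build one annotation record for a document (hypothetical helper)."""
    tag = 'contains-math' if contains_math else 'no-math'
    return {
        'type': 'Annotation',
        'id': doc_uri + '#mathcheckanno',   # can be anything, but must be unique
        'target': doc_uri,                  # assumption: the document URI can be used directly
        'body': {
            'type': 'SimpleTagBody',
            'value': 'http://example.org/mathspotter-result#' + tag,
        },
    }


base = 'https://ns.mathhub.info/project/sb/data/test-corpus/'
# stand-in for the outcome of the crude "</math>" check on each file
check_results = {'paperA': True, 'paperB': False}
records = [make_tag_annotation(base + name, has_math) for name, has_math in check_results.items()]
print(json.dumps(records, indent=4))
```

Whether the record shape is accepted as-is should be checked against the annotation format documentation.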

We can now upload the annotations into a triple store and query them. For example, we might want to find all documents that contain a formula so that we can process them with the next spotters (processing documents without formulae would be a waste of resources). Here is a SPARQL query that does this:

PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?document WHERE {
    ?annotation a oa:Annotation .
    ?annotation oa:hasTarget ?document .
    ?annotation oa:hasBody/rdf:value <http://example.org/mathspotter-result#contains-math> .
}
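How the query is sent depends on the triple store. As a rough sketch, assuming the store exposes a standard SPARQL 1.1 Protocol endpoint, the query can be posted with the Python standard library and the JSON results unpacked as follows. The endpoint URL `http://localhost:7200/sparql` is a placeholder, and the function names are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

QUERY = """
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?document WHERE {
    ?annotation a oa:Annotation .
    ?annotation oa:hasTarget ?document .
    ?annotation oa:hasBody/rdf:value <http://example.org/mathspotter-result#contains-math> .
}
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL Protocol POST request (form-encoded 'query' parameter)."""
    data = urllib.parse.urlencode({'query': query}).encode('utf-8')
    return urllib.request.Request(
        endpoint,
        data=data,   # passing data makes this a POST request
        headers={
            'Content-Type': 'application/x-www-form-urlencoded',
            'Accept': 'application/sparql-results+json',
        },
    )


def documents_from_results(results: dict) -> list[str]:
    """Extract the ?document bindings from a SPARQL JSON result."""
    return [binding['document']['value'] for binding in results['results']['bindings']]


def find_documents_with_math(endpoint: str) -> list[str]:
    """Send the query to the endpoint and return the matching document URIs."""
    with urllib.request.urlopen(build_request(endpoint, QUERY)) as response:
        return documents_from_results(json.load(response))
```

With a running store, `find_documents_with_math('http://localhost:7200/sparql')` would return the URIs of the documents that should be passed on to the next spotters.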

Declaration spotter (only uses SpotterBase command line tools)

You can use the SpotterBase command line tools (documented here) for pre-processing documents.

As an example, we will create a spotter that looks for phrases like “let …” or “for all …” and annotates both the phrase and the declared identifier in the formula. For example, for “let x ∈ ℝ”, we would annotate “let x ∈ ℝ” as a declaration phrase and “x” as the declared identifier. In the next approach, we will use these annotations to build a spotter that extracts the identifiers that were introduced.

Step 1: Pre-process the document

Working with HTML documents is difficult (especially as we also want to get the annotation targets right). Therefore, we will convert the HTML document to a JSON file using the JSON pre-processing tool that comes with SpotterBase. There are different variants of the pre-processing tool (see the documentation for more details).

Let us pre-process a document from the example corpus:

python3 -m spotterbase.convert.document_to_json \
   --include-replaced-nodes \
   --document=https://ns.mathhub.info/project/sb/data/test-corpus/paperA \
   --output=preprocessed-paper-A.json

Step 2: Write the actual spotter

In the pre-processed document, math nodes are replaced with a token. We will look for patterns like “for all @MathNode:3@”. The document also contains the MathML of the node, which we can use to extract the identifier. The spotter (for didactic reasons) creates two annotations for each declaration: one for the entire phrase and one for the declared identifier. Here is a simple implementation of such a spotter:

import json
import re
from io import StringIO

from lxml import etree

decl_regex = re.compile(r'(for all|for every|for any|where|let) (?P<formula>@MathNode:(\d+)@)')

with open('preprocessed-paper-A.json') as fp:
    document = json.load(fp)

doc_uri = document['document']   # document URI
# formulae are replaced by tokens, but the document contains the original formulae as annotations
token_to_html = {
    annotation['string']: annotation['annotation']['body']['html-value']
    for annotation in document['annotations'] if 'html-value' in annotation['annotation']['body']
}

records = []
for counter, match in enumerate(decl_regex.finditer(document['plaintext'])):
    # Target for the entire phrase
    target_uri = f'{doc_uri}#declphrase.{counter}.target'
    records.append({
        'type': 'FragmentTarget',
        'id': target_uri,
        'source': doc_uri,
        'selector': [
            {
                'type': 'OffsetSelector',
                'start': document['start-refs'][match.start()],
                'end': document['end-refs'][match.end()],
            }
        ]
    })
    # Annotation for the entire phrase
    records.append({
        'type': 'Annotation',
        'id': f'{doc_uri}#declphrase.{counter}.anno',
        'target': target_uri,
        'body': {
            'type': 'SimpleTagBody',
            'value': 'http://example.org/DeclarationPhrase',
        }
    })

    # we assume that the first <mi> element in the formula is the declared identifier
    nodes = etree.parse(StringIO(token_to_html[match.group('formula')])).xpath('//mi')
    if not nodes:   # no <mi> element found
        continue

    # target for the identifier
    target_uri = f'{doc_uri}#declvar.{counter}.target'
    records.append({
        'type': 'FragmentTarget',
        'id': target_uri,
        'source': doc_uri,
        'selector': [
            {
                'type': 'PathSelector',
                'startPath': f'node(//mi[@id="{nodes[0].get("id")}"])',
                'endPath': f'after-node(//mi[@id="{nodes[0].get("id")}"])',
            }
        ]
    })

    # annotation for the identifier
    records.append({
        'type': 'Annotation',
        'id': f'{doc_uri}#declvar.{counter}.anno',
        'target': target_uri,
        'body': {
            'type': 'SimpleTagBody',
            'value': 'http://example.org/DeclaredVariable',
        }
    })

with open('paper-A-annotations.json', 'w') as fp:
    json.dump(records, fp, indent=4)

In this case, the spotter creates a JSON file, following the SpotterBase JSON format (the annotation format page has more details).
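To see the two core steps of the spotter in isolation, the sketch below runs the same regular expression on a hand-written plaintext string and picks the first `<mi>` element from a hand-written MathML snippet. Both inputs are invented for illustration and are not real pre-processor output, and the standard-library `xml.etree.ElementTree` stands in for `lxml` to keep the sketch free of dependencies.

```python
import re
import xml.etree.ElementTree as ET

decl_regex = re.compile(r'(for all|for every|for any|where|let) (?P<formula>@MathNode:(\d+)@)')

# Invented inputs: a plaintext with math tokens and the MathML behind the tokens we need
plaintext = 'Now let @MathNode:0@ and assume that @MathNode:1@ holds for all @MathNode:2@.'
token_to_html = {
    '@MathNode:0@': '<math><mi id="m0.1">x</mi><mo>∈</mo><mi id="m0.2">ℝ</mi></math>',
    '@MathNode:2@': '<math><mn id="m2.1">1</mn></math>',
}


def find_declarations(text: str, formulas: dict) -> list:
    """Return (phrase, identifier text or None) for every declaration phrase."""
    found = []
    for match in decl_regex.finditer(text):
        root = ET.fromstring(formulas[match.group('formula')])
        first_mi = root.find('.//mi')   # first <mi> in document order, None if there is none
        found.append((match.group(0), first_mi.text if first_mi is not None else None))
    return found


declarations = find_declarations(plaintext, token_to_html)
print(declarations)
```

The token `@MathNode:1@` is not matched because it is not preceded by one of the trigger phrases, and the last formula yields no identifier because it contains no `<mi>` element, which corresponds to the `continue` branch in the spotter.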

Step 3: Selector normalization

The annotations created by the spotter are not exactly in the right format: some only use an OffsetSelector, while others only use a PathSelector (and do not use the recommended simple absolute XPaths). SpotterBase provides a tool that normalizes the selectors:

python3 -m spotterbase.convert.normalize_selectors \
    --input paper-A-annotations.json \
    --output paper-A-annotations-normalized.json

Another declaration spotter (uses SpotterBase as a library)

In this approach, we will create the same declaration spotter as in the previous approach, but we will use SpotterBase as a library. Instead of running the pre-processing to obtain a plaintext document, we can use the DNM library to obtain a plaintext representation.

Here is an implementation of the spotter:

import re

from spotterbase.corpora.resolver import Resolver
from spotterbase.dnm.defaults import ARXMLIV_STANDARD_DNM_FACTORY
from spotterbase.model_core import SimpleTagBody, Annotation
from spotterbase.rdf import FileSerializer
from spotterbase.rdf.namespace_collection import EXAMPLE
from spotterbase.selectors.dom_range import DomRange

decl_regex = re.compile(r'(for all|for every|for any|where|let) (?P<formula>@MathNode:(\d+)@)')

document = Resolver.get_document('https://ns.mathhub.info/project/sb/data/test-corpus/paperA')
dnm = ARXMLIV_STANDARD_DNM_FACTORY.dnm_from_document(document)

records = []
for counter, match in enumerate(decl_regex.finditer(str(dnm))):
    # annotate phrase
    target_uri = f'{document.get_uri()}#declphrase.{counter}.target'
    records.append(dnm[match].to_fragment_target(target_uri))
    records.append(
        Annotation(
            f'{document.get_uri()}#declphrase.{counter}.anno',
            target_uri=target_uri,
            body=SimpleTagBody(EXAMPLE['DeclarationPhrase']),
        )
    )

    math_node = dnm[match.start('formula'):match.end('formula')].to_dom().get_containing_node()
    # './/mi' restricts the search to the math node ('//mi' would search the whole document)
    identifier = math_node.xpath('.//mi')
    if not identifier:
        continue

    # annotate identifier
    target_uri = f'{document.get_uri()}#declvar.{counter}.target'
    records.append(document.get_selector_converter().dom_to_fragment_target(
        target_uri, DomRange.from_node(identifier[0])
    ))
    records.append(
        Annotation(
            f'{document.get_uri()}#declvar.{counter}.anno',
            target_uri=target_uri,
            body=SimpleTagBody(EXAMPLE['DeclaredVariable']),
        )
    )

# write records to RDF file
with FileSerializer('paper-a-decl.ttl') as serializer:
    for record in records:
        serializer.add_from_iterable(record.to_triples())

In this case, the script generates RDF triples in the Turtle format, instead of a JSON file as in the previous approach. We could have generated a JSON file as well, but we can also convert the Turtle file to JSON using the SpotterBase command line tools:

python3 -m spotterbase.records.rdf_to_jsonld \
  --file=paper-a-decl.ttl \
  --output=annotations.jsonld

Yet another declaration spotter (uses SpotterBase as a framework)

When running a spotter over a large corpus, we typically need some more infrastructure work. Spotters should run in parallel, we want to have an idea of the progress, we might want to interrupt the processing and resume it later, we might want to run multiple spotters without re-parsing the HTML documents for each spotter, etc. SpotterBase provides a framework that takes care of these issues.

Here is a simple example of how that works (we only annotate the declared identifiers in this case):

import re
from datetime import datetime

from spotterbase.dnm.defaults import ARXMLIV_STANDARD_DNM_FACTORY
from spotterbase.model_core import SpotterRun, Annotation, SimpleTagBody
from spotterbase.rdf import TripleI, Uri
from spotterbase.rdf.namespace_collection import EXAMPLE
from spotterbase.selectors.dom_range import DomRange
from spotterbase.spotters.spotter import Spotter, SpotterContext


class ExampleDeclSpotter(Spotter):
    spotter_short_id = 'example-decl-spotter'

    @classmethod
    def setup_run(cls, **kwargs) -> tuple[SpotterContext, TripleI]:
        run = SpotterRun(
            uri=Uri.uuid(),
            spotter_uri=EXAMPLE['exampledeclarationspotter'],
            spotter_version='0.0.1',
            date=datetime.now(),
            label='Example declaration spotter'
        )
        return SpotterContext(run_uri=run.uri), run.to_triples()

    def process_document(self, document) -> TripleI:
        dnm = ARXMLIV_STANDARD_DNM_FACTORY.dnm_from_document(document)
        decl_regex = re.compile(r'(for all|for every|for any|where|let) (?P<formula>@MathNode:(\d+)@)')
        for counter, match in enumerate(decl_regex.finditer(str(dnm))):
            math_node = dnm[match.start('formula'):match.end('formula')].to_dom().get_containing_node()
            # './/mi' restricts the search to the math node ('//mi' would search the whole document)
            identifier = math_node.xpath('.//mi')
            if not identifier:
                continue

            target_uri = f'{document.get_uri()}#declvar.{counter}.target'
            yield from document.get_selector_converter().dom_to_fragment_target(
                target_uri, DomRange.from_node(identifier[0])
            ).to_triples()
            yield from Annotation(
                f'{document.get_uri()}#declvar.{counter}.anno',
                target_uri=target_uri,
                body=SimpleTagBody(EXAMPLE['DeclaredVariable']),
                creator_uri=self.ctx.run_uri,
            ).to_triples()


if __name__ == '__main__':
    from spotterbase.spotters import spotter_runner
    spotter_runner.auto_run_spotter(ExampleDeclSpotter)

We can run the spotter over the example document with the following command:

python3 spotter.py \
    --document=https://ns.mathhub.info/project/sb/data/test-corpus/paperA \
    --dir=spotterresults
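To make concrete what kind of bookkeeping the framework takes off your hands, here is a plain-Python sketch of resumable, parallel processing with progress output. It only illustrates the idea and says nothing about how SpotterBase implements it; `run_over_corpus`, the log file of finished documents, and the document identifiers are all invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def run_over_corpus(doc_ids, process, done_log: Path, workers: int = 4) -> list:
    """Process documents in parallel, skipping those already recorded in done_log."""
    already_done = set(done_log.read_text().splitlines()) if done_log.exists() else set()
    todo = [doc_id for doc_id in doc_ids if doc_id not in already_done]
    newly_done = []
    with ThreadPoolExecutor(max_workers=workers) as pool, done_log.open('a') as log:
        futures = {pool.submit(process, doc_id): doc_id for doc_id in todo}
        for number, future in enumerate(as_completed(futures), start=1):
            future.result()             # re-raises any error from the worker
            doc_id = futures[future]
            log.write(doc_id + '\n')    # record progress so that a later run can resume
            log.flush()
            newly_done.append(doc_id)
            print(f'{number}/{len(todo)} documents processed')
    return newly_done
```

Calling the function a second time with the same log file only processes the documents that were not finished in the first run, which is the kind of interrupt-and-resume behaviour described above.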