Command Line Tools

convert can be used to turn a document into an easier-to-process format that keeps the information needed to link annotations back to the original document. Currently, the following formats are supported:

a JSON-based format where each word is represented as a JSON object,
an HTML-based format where each word is wrapped in a <span>.

The conversion code can be used from the command line or from Python code.

convert can also be used to recover and normalize annotation targets.

Preprocessing to JSON

Converting to JSON results in an easy-to-use JSON document. Every word is represented as a JSON object with its offsets. Example word:

{
 "token": "triangle",
 "start-ref": 302,
 "end-ref": 310
}
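
As a rough illustration of how such word objects might be consumed, the sketch below looks up the offsets of every occurrence of a given word. It assumes that the tokens are available as a flat list of objects shaped like the example above and that "start-ref" and "end-ref" are integer offsets. The layout of the full output document is not shown here, so the list and the helper find_token_spans are hypothetical and not part of spotterbase.

```python
from typing import Dict, List, Tuple, Union

Token = Dict[str, Union[str, int]]

# Hypothetical token list (assumption: the real output may nest tokens differently).
tokens: List[Token] = [
    {"token": "a", "start-ref": 300, "end-ref": 301},
    {"token": "triangle", "start-ref": 302, "end-ref": 310},
]


def find_token_spans(tokens: List[Token], word: str) -> List[Tuple[int, int]]:
    """Return the (start-ref, end-ref) pair of every token equal to `word`."""
    return [
        (int(t["start-ref"]), int(t["end-ref"]))
        for t in tokens
        if t["token"] == word
    ]


print(find_token_spans(tokens, "triangle"))  # prints [(302, 310)]
```

The returned pairs are the values an annotation would need in order to point back into the original document.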

Example call:

python3 -m spotterbase.convert.document_to_json \
   --include-replaced-nodes \
   --document=https://ns.kwarc.info/project/sb/data/test-corpus/paperA \
   --output=tokenized.json

With the --include-replaced-nodes option, the output will contain the HTML nodes for tokens that were created by replacing a node (e.g. a <math> node for "MathNode" tokens).

If you want to use the preprocessor from Python code, you have to use the Doc2JsonConverter, e.g.:

>>> from spotterbase.convert.document_to_json import Doc2JsonConverter
>>> from spotterbase.corpora.resolver import Resolver
>>> document = Resolver.get_document('https://ns.mathhub.info/project/sb/data/test-corpus/paperA')
>>> converter = Doc2JsonConverter(include_replaced_nodes=True, skip_titles=True)
>>> json_document = converter.process(document)
