Command Line Tools
The spotterbase.convert package can be used to
convert a document into an easier-to-process format
that keeps the information needed to link annotations back to the original document.
Currently, the following formats are supported:
a JSON-based format where each word is represented as a JSON object,
an HTML-based format where each word is wrapped in a
<span>.
The conversion code can be used from the command line or from Python code.
The convert package can also be used to recover and normalize
annotation targets.
Preprocessing to JSON
Converting to JSON results in an easy-to-use JSON document. Every word is represented as a JSON object with its offsets. Example word:
{
"token": "triangle",
"start-ref": 302,
"end-ref": 310
}
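The offsets are what make it possible to link an annotation on a run of tokens back to the original document. The following is a minimal sketch, not the library's own API: it assumes the tokens are available as a flat list of objects shaped like the example word, and that start-ref and end-ref are comparable positions. The helper span_offsets and the first sample token are hypothetical and only serve as illustration.

```python
import json

# Hypothetical sample data: a flat list of token objects shaped like the
# example word above. The real converter output may be structured differently.
tokens = json.loads("""
[
  {"token": "right", "start-ref": 296, "end-ref": 301},
  {"token": "triangle", "start-ref": 302, "end-ref": 310}
]
""")


def span_offsets(tokens, first, last):
    """Return (start-ref, end-ref) covering tokens[first] through tokens[last]."""
    if not 0 <= first <= last < len(tokens):
        raise IndexError("token span out of range")
    # The span starts where the first token starts and ends where the last ends.
    return tokens[first]["start-ref"], tokens[last]["end-ref"]


print(span_offsets(tokens, 0, 1))  # (296, 310)
```

An annotation tool could store the returned pair as the target of an annotation instead of the token indices, so the annotation stays valid independently of the tokenization.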
Example call:
python3 -m spotterbase.convert.document_to_json \
--include-replaced-nodes \
--document=https://ns.mathhub.info/project/sb/data/test-corpus/paperA \
--output=tokenized.json
With the --include-replaced-nodes option, the output will contain the
HTML nodes for tokens that were created by replacing a node
(e.g. a <math> node for "MathNode" tokens).
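When the command-line call has to be issued for many documents, it can be assembled in a script. The following is a minimal sketch that uses only the module name and the flags shown in the example call; build_convert_command and run_convert are hypothetical helper names and not part of SpotterBase.

```python
import subprocess
import sys


def build_convert_command(document, output, include_replaced_nodes=False):
    """Assemble the argument list for the document_to_json call shown above."""
    command = [sys.executable, "-m", "spotterbase.convert.document_to_json"]
    if include_replaced_nodes:
        command.append("--include-replaced-nodes")
    command += [f"--document={document}", f"--output={output}"]
    return command


def run_convert(document, output, include_replaced_nodes=False):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(
        build_convert_command(document, output, include_replaced_nodes),
        check=True,
    )


# Usage (requires an installed spotterbase and a resolvable document URI):
# run_convert(document_uri, "tokenized.json", include_replaced_nodes=True)
```

Passing the arguments as a list avoids shell quoting problems with the document URI.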
To use the preprocessor from Python code, use the
Doc2JsonConverter class, e.g.:
>>> from spotterbase.convert.document_to_json import Doc2JsonConverter
>>> from spotterbase.corpora.resolver import Resolver
>>> document = Resolver.get_document('https://ns.mathhub.info/project/sb/data/test-corpus/paperA')
>>> converter = Doc2JsonConverter(include_replaced_nodes=True, skip_titles=True)
>>> json_document = converter.process(document)