.. _cmdtools: Command Line Tools ================== The :mod:`~spotterbase.convert` can be used to convert a document into an easier-to-process format with information for linking annotations back to the original document. Currently, the following formats are supported: * a JSON-based format where each word is represented as a JSON object, * an HTML-based format where each word is wrapped in a ````. The conversion code can be used from the command line or from Python code. :mod:`~spotterbase.convert` also can be used to recover and normalize annotation targets. .. _cmdtools-preprocess-json: Preprocessing to JSON --------------------- The convert to JSON results in an easy-to-use JSON document. Every word is represented as a JSON object with the its offsets. Example word: .. code-block:: json { "token": "triangle", "start-ref": 302, "end-ref": 310 } Example call: .. code-block:: bash python3 -m spotterbase.convert.document_to_json \ --include-replaced-nodes \ --document=https://ns.kwarc.info/project/sb/data/test-corpus/paperA \ --output=tokenized.json With the ``--include-replaced-nodes`` option, the will contain the HTML nodes for tokens that were created by replacing a node (e.g. a ```` node for ``"MathNode"`` tokens). If you want to use the preprocessor from Python code, you have to use the :class:`~spotterbase.convert.document_to_json.Doc2JsonConverter`, e.g.: >>> from spotterbase.convert.document_to_json import Doc2JsonConverter >>> from spotterbase.corpora.resolver import Resolver >>> document = Resolver.get_document('https://ns.mathhub.info/project/sb/data/test-corpus/paperA') >>> converter = Doc2JsonConverter(include_replaced_nodes=True, skip_titles=True) >>> json_document = converter.process(document) Preprocessing to HTML ---------------------