:mod:`~spotterbase.corpora` package
===================================

The :mod:`~spotterbase.corpora` package provides functionality for working with corpora
(in particular the arXMLiv corpus).


Using the :class:`~spotterbase.corpora.resolver.Resolver`
---------------------------------------------------------

The ``Resolver`` loads a document from its URI.
This usually requires that you have downloaded the corpus and that SpotterBase can find it.
SpotterBase comes with a test corpus, which we will use for the examples:

>>> from spotterbase.corpora.resolver import Resolver
>>> from spotterbase.rdf import Uri
>>> uri = Uri('https://ns.mathhub.info/project/sb/data/test-corpus/')
>>> corpus = Resolver.get_corpus(uri)
>>> for document in corpus:
...     print(document.get_uri())
https://ns.mathhub.info/project/sb/data/test-corpus/paperA
https://ns.mathhub.info/project/sb/data/test-corpus/paperB
>>> document = corpus.get_document(uri / 'paperB')
>>> # alternatively, we can get the document directly from the Resolver:
>>> document = Resolver.get_document(uri / 'paperB')
>>> document.get_uri()
Uri('https://ns.mathhub.info/project/sb/data/test-corpus/paperB')
>>> with document.open_text() as fp:
...     print(fp.read(21))


Corpus metadata
---------------

:mod:`~spotterbase.corpora` also contains scripts that create the metadata associated with a corpus.
This metadata should link every document to its corpus.
It can also include annotations such as the classification of a document.


Adding another corpus
---------------------

The :mod:`spotterbase.corpora.test_corpus` module contains an implementation of a simple corpus,
which you should be able to adapt to your own needs.
If you want your corpus to be supported by the :class:`~spotterbase.corpora.resolver.Resolver`,
you will have to register it.
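
As a rough orientation for what a custom corpus has to provide, the following self-contained sketch mirrors the operations used in the ``Resolver`` examples: iterating over documents, ``get_document``, ``get_uri`` and ``open_text``.
It is only an illustration under stated assumptions. ``SimpleCorpus`` and ``SimpleDocument`` are hypothetical names, they do not derive from any SpotterBase base class, URIs are plain strings rather than ``Uri`` objects, and the ``example.org`` address is a placeholder.
Registration with the ``Resolver`` is not shown; the implementation in :mod:`spotterbase.corpora.test_corpus` remains the reference to adapt.

```python
from __future__ import annotations

import io
from typing import Iterator, TextIO


class SimpleDocument:
    """Hypothetical document that keeps its text in memory."""

    def __init__(self, uri: str, text: str) -> None:
        self._uri = uri
        self._text = text

    def get_uri(self) -> str:
        return self._uri

    def open_text(self) -> TextIO:
        # StringIO can be used in a ``with`` statement, like a real file.
        return io.StringIO(self._text)


class SimpleCorpus:
    """Hypothetical corpus backed by a dict from document name to text."""

    def __init__(self, uri: str, texts: dict[str, str]) -> None:
        self._uri = uri
        # Document URIs are formed by appending the name to the corpus URI.
        self._documents = {
            uri + name: SimpleDocument(uri + name, text)
            for name, text in texts.items()
        }

    def get_uri(self) -> str:
        return self._uri

    def get_document(self, uri: str) -> SimpleDocument:
        # Raises KeyError if the URI does not belong to this corpus.
        return self._documents[uri]

    def __iter__(self) -> Iterator[SimpleDocument]:
        return iter(self._documents.values())


# Placeholder corpus URI and contents, for illustration only.
corpus = SimpleCorpus(
    'https://example.org/my-corpus/',
    {'paperA': '<html>A</html>', 'paperB': '<html>B</html>'},
)
for document in corpus:
    print(document.get_uri())

with corpus.get_document('https://example.org/my-corpus/paperB').open_text() as fp:
    first_chars = fp.read(6)
```

A real corpus would typically read its documents from disk or from an archive instead of a dictionary, but the same small set of methods would still be the part that client code relies on.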