corpora package

The corpora package provides functionality for working with corpora (in particular the arXMLiv corpus).

Using the Resolver

The Resolver can be used to load a document if you have its URI. Usually, it is required that you have downloaded the corpus and SpotterBase is able to find it. SpotterBase comes with a test corpus, which we will use for the examples:

>>> from spotterbase.corpora.resolver import Resolver
>>> from spotterbase.rdf import Uri
>>> uri = Uri('https://ns.mathhub.info/project/sb/data/test-corpus/')
>>> corpus = Resolver.get_corpus(uri)
>>> for document in corpus:
...     print(document.get_uri())
https://ns.mathhub.info/project/sb/data/test-corpus/paperA
https://ns.mathhub.info/project/sb/data/test-corpus/paperB
>>> document = corpus.get_document(uri / 'paperB')
>>> # alternatively, we can get the document directly from the Resolver:
>>> document = Resolver.get_document(uri / 'paperB')
>>> document.get_uri()
Uri('https://ns.mathhub.info/project/sb/data/test-corpus/paperB')
>>> with document.open_text() as fp:
...     print(fp.read(21))
<!DOCTYPE html><html>

Corpus metadata

corpora also contains scripts to create the metadata associated with a corpus. This should include linking all documents to their corpus. It can also include annotations such as the classification of the document.

Adding another corpus

The spotterbase.corpora.test_corpus module has an implementation of a simple corpus. You should be able to adapt it to your own needs. If you want it to be supported by the :class:~spotterbase.corpora.resolver.Resolver, you will have to register it.