spotterbase.selectors package

Submodules

spotterbase.selectors.anno_by_offset module

class spotterbase.selectors.anno_by_offset.AnnoByOffset(annos: Iterable[tuple[DomOffsetRange, Annotation]] = ())

Bases: object

add_annotation(range_: DomOffsetRange, annotation: Annotation)

get_annotations_from_point(point: int) → list[Annotation]

get_annotations_from_range(range_: DomOffsetRange) → list[Annotation]

spotterbase.selectors.dom_range module

class spotterbase.selectors.dom_range.DomPoint(node: _Element, *, text_offset: int | None = None, tail_offset: int | None = None, after: bool = False)

Bases: object

References a point in the DOM.

Attributes:

node: The referenced node.
text_offset: If not None, the character at that offset in the node's text content is referred to.
tail_offset: If not None, the character at that offset in the node's tail content is referred to.
after: If True, the point actually refers to whatever comes after the position described above.

Why do we need after? To be honest, I keep going back and forth between having it and not having it. The background is that ranges are right-exclusive because that is the convention in the Web Annotation standard. after makes things easier in a way. In particular, it lets us include the end of the DOM in a range. It also allows us to keep DomPoint a simple data structure without, e.g., complex processing code for finding whatever comes after a given point.
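The difference between text content and tail content follows the lxml element model. The following is a minimal sketch of that model, using the standard library's xml.etree.ElementTree as a stand-in for lxml (both expose .text and .tail in the same way). The helper char_at is hypothetical and not part of spotterbase; it only shows which character a (node, text_offset) or (node, tail_offset) pair would pick out.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>big</b> world</p>")
b = root[0]

# Characters directly inside <b> belong to its text ...
assert b.text == "big"
# ... while characters after </b>, up to the next tag, are its tail.
assert b.tail == " world"


def char_at(node, *, text_offset=None, tail_offset=None):
    """Hypothetical helper: return the character a point refers to.

    Returns None if neither offset is given, i.e. the point refers
    to the element itself rather than to a character.
    """
    if text_offset is not None:
        return node.text[text_offset]
    if tail_offset is not None:
        return node.tail[tail_offset]
    return None


print(char_at(b, text_offset=0))     # b
print(char_at(b, tail_offset=1))     # w
print(char_at(root, text_offset=4))  # o
```

Note that the tail belongs to the element that precedes it, so " world" is addressed through the b node even though it is visually part of the paragraph.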

get_after() → DomPoint

is_element() → bool

class spotterbase.selectors.dom_range.DomRange(start: DomPoint | DomRange, end: DomPoint | DomRange)

Bases: object

classmethod from_node(node: _Element) → DomRange

get_containing_node() → _Element

Note: This does not cover edge cases yet…

spotterbase.selectors.dom_range.get_parent_asserted(node: _Element) → _Element

spotterbase.selectors.offset_converter module

class spotterbase.selectors.offset_converter.DomOffsetRange(start: int, end: int, converter: OffsetConverter)

Bases: object

A DomRange, except that it uses node offsets as created by the OffsetConverter.

converter: OffsetConverter

end: int

start: int

to_dom_range() → DomRange

class spotterbase.selectors.offset_converter.NodeOffsetData(text_offset_before: int, node_text_offset_before: int, text_offset_after: int, node_text_offset_after: int)

Bases: object

Two types of offsets are recorded: text offsets and node text offsets. These offsets are recorded both for the node itself and for the first element after the node.

The text offsets correspond to the number of characters in text nodes until then. The node text offsets additionally count all nodes (opening tags). The node text offsets allow targeting nodes directly. That way, it’s possible to target e.g. an <img …/> node or to distinguish whether the <mrow>, the <mi> or the n is targeted in <mrow><mi>n</mi>….

Details:

* text_offset is the offset of the last character before the node
* node_text_offset is the offset of the node itself
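The two kinds of offsets can be illustrated with a small sketch. It is not the OffsetConverter implementation: it uses xml.etree.ElementTree instead of lxml, the function record_offsets is hypothetical, and the counting convention (both counters start at 0 and give the number of units seen before the opening tag) is an assumption that may differ from the actual one by a constant.

```python
import xml.etree.ElementTree as ET


def record_offsets(root):
    """Map each element to (text_offset, node_text_offset) at its opening tag.

    Assumed convention: both counters start at 0 and count the units
    (characters, or characters plus opening tags) seen before the element.
    """
    offsets = {}
    text = 0       # characters seen so far
    node_text = 0  # characters plus opening tags seen so far

    def visit(node):
        nonlocal text, node_text
        offsets[node] = (text, node_text)
        node_text += 1  # the opening tag counts as one unit
        n = len(node.text or "")
        text += n
        node_text += n
        for child in node:
            visit(child)
            n = len(child.tail or "")
            text += n
            node_text += n

    visit(root)
    return offsets


root = ET.fromstring("<math><mrow><mi>n</mi><mo>+</mo></mrow></math>")
offsets = record_offsets(root)
for node in root.iter():
    print(node.tag, offsets[node])
# math (0, 0)
# mrow (0, 1)
# mi (0, 2)
# mo (1, 4)
```

Under this convention <mrow>, <mi> and the character n get the distinct node text offsets 1, 2 and 3, while all three share the text offset 0. That is why only the node text offset can say which of them is targeted.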

get_offsets_of_type(offset_type: OffsetType) → tuple[int, int]

node_text_offset_after: int

node_text_offset_before: int

text_offset_after: int

text_offset_before: int

class spotterbase.selectors.offset_converter.OffsetConverter(root: _Element)

Bases: object

Records offsets in the DOM.

Notes on efficiency:

Recurses through the entire DOM at initialization, which takes time (approximately 1/6th of the parsing time).
If only a single offset is of interest, using an HTML tree (lxml.html.parse) and .text_content() with a custom implementation is very efficient (10x faster). However, I expect that there will often be more than 10 offsets to convert.
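The trade-off described above (one full traversal up front, cheap lookups afterwards) can be sketched as follows. This is an illustration under assumptions, not the OffsetConverter code: it uses xml.etree.ElementTree rather than lxml, handles text offsets only, and the names index_text_chunks and locate are hypothetical.

```python
import bisect
import xml.etree.ElementTree as ET


def index_text_chunks(root):
    """Traverse the tree once and record where each non-empty text chunk starts.

    Returns two parallel lists: the start offset of every chunk, and the
    (node, kind) pair it belongs to, where kind is "text" or "tail".
    """
    starts, chunks = [], []
    pos = 0

    def add(node, kind, content):
        nonlocal pos
        if content:
            starts.append(pos)
            chunks.append((node, kind))
            pos += len(content)

    def visit(node):
        add(node, "text", node.text)
        for child in node:
            visit(child)
            add(child, "tail", child.tail)

    visit(root)
    return starts, chunks


def locate(starts, chunks, offset):
    """Binary search for the chunk containing offset (no bounds checking)."""
    i = bisect.bisect_right(starts, offset) - 1
    node, kind = chunks[i]
    return node.tag, kind, offset - starts[i]


root = ET.fromstring("<p>Hello <b>big</b> world</p>")
starts, chunks = index_text_chunks(root)  # paid once per document
print(locate(starts, chunks, 0))   # ('p', 'text', 0)
print(locate(starts, chunks, 7))   # ('b', 'text', 1)
print(locate(starts, chunks, 10))  # ('b', 'tail', 1)
```

Each lookup after the initial pass is a binary search, so the traversal cost is amortized once more than a handful of offsets are converted for the same document.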

convert_dom_range(dom_range: DomRange) → DomOffsetRange

get_dom_point(offset: int, offset_type: OffsetType, is_start: bool | None = None) → DomPoint

get_offset(point: _Element | DomPoint, offset_type: OffsetType) → int

get_offset_data(node: _Element) → NodeOffsetData

root: _Element

class spotterbase.selectors.offset_converter.OffsetType(value)

Bases: Enum

We record two types of offsets:

* Text offsets increase with every character in a text node (we need this for the char because it only counts text)
* Node text offsets additionally increase with every opening tag

NodeText = 1

Text = 0

spotterbase.selectors.selector_converter module

class spotterbase.selectors.selector_converter.SelectorConverter(document_uri: Uri, dom: _Element, offset_converter: OffsetConverter)

Bases: object

dom_to_fragment_target(target_uri: str | Uri | URIRef | Path | VocabularyMeta, dom_range: DomRange, sub_ranges: list[DomRange] | None = None) → FragmentTarget

dom_to_offset_selector(dom_range: DomRange) → OffsetSelector

dom_to_path_selector(dom_range: DomRange) → PathSelector

dom_to_selectors(dom_range: DomRange, sub_ranges: list[DomRange] | None = None) → list[PathSelector | OffsetSelector]

property offset_converter: OffsetConverter

selector_to_dom(selector: OffsetSelector | PathSelector) → tuple[DomRange, list[DomRange] | None]

target_to_dom(target: FragmentTarget) → tuple[DomRange, list[DomRange] | None]

to_dom(arg: FragmentTarget | PathSelector | OffsetSelector) → tuple[DomRange, list[DomRange] | None]
