spotterbase.selectors package

Submodules

spotterbase.selectors.anno_by_offset module

class spotterbase.selectors.anno_by_offset.AnnoByOffset(annos: Iterable[tuple[DomOffsetRange, Annotation]] = ())

Bases: object

add_annotation(range_: DomOffsetRange, annotation: Annotation)
get_annotations_from_point(point: int) list[Annotation]
get_annotations_from_range(range_: DomOffsetRange) list[Annotation]

spotterbase.selectors.dom_range module

class spotterbase.selectors.dom_range.DomPoint(node: _Element, *, text_offset: int | None = None, tail_offset: int | None = None, after: bool = False)

Bases: object

References a point in the DOM.

Attributes:

  • node: The referenced node.

  • text_offset: If not None, that character in the text content is referred to.

  • tail_offset: If not None, that character in the tail content is referred to.

  • after: If True, the point references whatever comes after the referenced position.

Why do we need after? To be honest, I keep going back and forth between having it and not having it. The background is that ranges are right-exclusive, because that is the convention in the Web Annotation standard. Having after makes some things easier. In particular, it lets a range include the end of the DOM. It also allows DomPoint to remain a simple data structure, without e.g. complex processing code for finding whatever comes after.
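To make the role of after concrete, here is a minimal sketch that does not use spotterbase itself. SketchPoint is a hypothetical stand-in that only mirrors the constructor arguments listed above, its is_element logic is an assumption, and the standard library's xml.etree.ElementTree replaces lxml. It shows how a right-exclusive range can cover a whole document once the end point is allowed to mean "after the root".

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchPoint:
    """Hypothetical stand-in for DomPoint (not the spotterbase class)."""
    node: ET.Element
    text_offset: Optional[int] = None
    tail_offset: Optional[int] = None
    after: bool = False

    def is_element(self) -> bool:
        # Assumption: a point without character offsets refers to the node itself.
        return self.text_offset is None and self.tail_offset is None


root = ET.fromstring("<p>ab<b>cd</b>ef</p>")

# Right-exclusive range over the "a" in "ab":
# the end point is the first position that is NOT part of the range.
start = SketchPoint(root, text_offset=0)
end = SketchPoint(root, text_offset=1)

# Range over the whole document: nothing follows the root, so without
# `after` there would be no point that could serve as the exclusive end.
whole_start = SketchPoint(root)
whole_end = SketchPoint(root, after=True)
```

The end point of the second range is built from the root alone; no code has to search the tree for "the next thing after the root".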

get_after() DomPoint
is_element() bool
class spotterbase.selectors.dom_range.DomRange(start: DomPoint | DomRange, end: DomPoint | DomRange)

Bases: object

classmethod from_node(node: _Element) DomRange
get_containing_node() _Element

Note: This does not cover edge cases yet…

spotterbase.selectors.dom_range.get_parent_asserted(node: _Element) _Element

spotterbase.selectors.offset_converter module

class spotterbase.selectors.offset_converter.DomOffsetRange(start: int, end: int, converter: OffsetConverter)

Bases: object

A DomRange, except that it uses node offsets as created by the OffsetConverter.

converter: OffsetConverter
end: int
start: int
to_dom_range() DomRange
class spotterbase.selectors.offset_converter.NodeOffsetData(text_offset_before: int, node_text_offset_before: int, text_offset_after: int, node_text_offset_after: int)

Bases: object

Two types of offsets are recorded: text offsets and node text offsets. These offsets are recorded both for the node itself and for the first element after the node.

The text offsets correspond to the number of characters in text nodes until then. The node text offsets additionally count all nodes (opening tags). The node text offsets allow targeting nodes directly. That way, it’s possible to target e.g. an <img …/> node or to distinguish whether the <mrow>, the <mi> or the n is targeted in <mrow><mi>n</mi>….

Details:

  • text_offset is the offset of the last character before the node

  • node_text_offset is the offset of the node itself
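The difference between the two offset types can be illustrated with a small, self-contained sketch. It is not the spotterbase implementation: record_offsets is a hypothetical helper, xml.etree.ElementTree stands in for lxml, and the exact counting convention (each opening tag counts as one unit, offsets are the number of units seen so far) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def record_offsets(root):
    """Map each element to (text_before, node_text_before, text_after, node_text_after)."""
    offsets = {}
    text = 0       # characters seen so far in text nodes
    node_text = 0  # characters plus opening tags seen so far

    def visit(node):
        nonlocal text, node_text
        before = (text, node_text)
        node_text += 1  # the opening tag counts as one unit
        n = len(node.text or "")
        text += n
        node_text += n
        for child in node:
            visit(child)
            # The tail comes after the child has closed, so it is counted
            # only after the child's "after" offsets have been recorded.
            t = len(child.tail or "")
            text += t
            node_text += t
        offsets[node] = before + (text, node_text)

    visit(root)
    return offsets


root = ET.fromstring("<mrow><mi>n</mi><mo>+</mo></mrow>")
mi, mo = root[0], root[1]
offsets = record_offsets(root)
```

In this sketch, <mrow>, <mi> and the character n all share text offset 0, but they get the node text offsets 0, 1 and 2 respectively, which is what makes each of them individually addressable.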

get_offsets_of_type(offset_type: OffsetType) tuple[int, int]
node_text_offset_after: int
node_text_offset_before: int
text_offset_after: int
text_offset_before: int
class spotterbase.selectors.offset_converter.OffsetConverter(root: _Element)

Bases: object

Records offsets in the DOM.

Notes on efficiency:

  • Recurses through the entire DOM at initialization, which takes time (approximately 1/6th of parsing time).

  • If only a single offset is of interest, using an HTML tree (lxml.html.parse) and .text_content() with a custom implementation is very efficient (10x faster). However, I expect that there will often be more than 10 offsets to convert.
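The one-off alternative from the second note can be sketched as follows. This is not the custom implementation the note refers to: it uses xml.etree.ElementTree rather than lxml.html, and single_text_offset is a hypothetical helper. It walks the tree only until the target is found instead of recording offsets for every node.

```python
import xml.etree.ElementTree as ET


def single_text_offset(root, target):
    """Count the text characters that precede `target` in document order."""
    count = 0

    def walk(node):
        nonlocal count
        if node is target:
            return True  # stop: everything before the target has been counted
        count += len(node.text or "")
        for child in node:
            if walk(child):
                return True
            # The child is fully closed, so its tail precedes the target too.
            count += len(child.tail or "")
        return False

    if not walk(root):
        raise ValueError("target is not part of the tree")
    return count


root = ET.fromstring("<p>ab<b>cd</b>ef<i>g</i></p>")
offset_b = single_text_offset(root, root[0])
offset_i = single_text_offset(root, root[1])
```

Each call repeats the walk from the root, which is why recording all offsets once pays off as soon as many offsets have to be converted.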

convert_dom_range(dom_range: DomRange) DomOffsetRange
get_dom_point(offset: int, offset_type: OffsetType, is_start: bool | None = None) DomPoint
get_offset(point: _Element | DomPoint, offset_type: OffsetType) int
get_offset_data(node: _Element) NodeOffsetData
root: _Element
class spotterbase.selectors.offset_converter.OffsetType(value)

Bases: Enum

We record two types of offsets:

  • Text offsets increase with every character in a text node (we need this for the char because it only counts text)

  • Node text offsets additionally increase with every opening tag

NodeText = 1
Text = 0

spotterbase.selectors.selector_converter module

class spotterbase.selectors.selector_converter.SelectorConverter(document_uri: Uri, dom: _Element, offset_converter: OffsetConverter)

Bases: object

dom_to_fragment_target(target_uri: str | Uri | URIRef | Path | VocabularyMeta, dom_range: DomRange, sub_ranges: list[DomRange] | None = None) FragmentTarget
dom_to_offset_selector(dom_range: DomRange) OffsetSelector
dom_to_path_selector(dom_range: DomRange) PathSelector
dom_to_selectors(dom_range: DomRange, sub_ranges: list[DomRange] | None = None) list[PathSelector | OffsetSelector]
property offset_converter: OffsetConverter
selector_to_dom(selector: OffsetSelector | PathSelector) tuple[DomRange, list[DomRange] | None]
target_to_dom(target: FragmentTarget) tuple[DomRange, list[DomRange] | None]
to_dom(arg: FragmentTarget | PathSelector | OffsetSelector) tuple[DomRange, list[DomRange] | None]

Module contents