spotterbase.selectors package
Submodules
spotterbase.selectors.anno_by_offset module
- class spotterbase.selectors.anno_by_offset.AnnoByOffset(annos: Iterable[tuple[DomOffsetRange, Annotation]] = ())
Bases: object
- add_annotation(range_: DomOffsetRange, annotation: Annotation)
- get_annotations_from_point(point: int) -> list[Annotation]
- get_annotations_from_range(range_: DomOffsetRange) -> list[Annotation]
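The lookup methods above can be illustrated with a minimal sketch. It is not spotterbase's implementation: the class name OffsetIndex is hypothetical, plain integers stand in for DomOffsetRange, ranges are assumed to be right-exclusive, and a range query is assumed to return every annotation that overlaps the queried range.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OffsetIndex:
    """Hypothetical stand-in for AnnoByOffset; not the library class."""
    # Each entry is (start, end, annotation) with an exclusive end.
    _entries: list[tuple[int, int, Any]] = field(default_factory=list)

    def add_annotation(self, start: int, end: int, annotation: Any) -> None:
        if end <= start:
            raise ValueError("expected a non-empty range with start < end")
        self._entries.append((start, end, annotation))

    def get_annotations_from_point(self, point: int) -> list[Any]:
        # A point matches when start <= point < end (right-exclusive).
        return [a for s, e, a in self._entries if s <= point < e]

    def get_annotations_from_range(self, start: int, end: int) -> list[Any]:
        # Assumption: "from range" means overlap. Two half-open ranges
        # overlap when each one starts before the other one ends.
        return [a for s, e, a in self._entries if s < end and start < e]


index = OffsetIndex()
index.add_annotation(0, 5, "title")
index.add_annotation(3, 10, "formula")
```

The sketch scans all entries on every query, which is fine for small inputs; an interval tree or a sorted list would be the usual choice for many annotations.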
spotterbase.selectors.dom_range module
- class spotterbase.selectors.dom_range.DomPoint(node: _Element, *, text_offset: int | None = None, tail_offset: int | None = None, after: bool = False)
Bases: object
References a point in the DOM.
- Attributes:
  node: The referenced node.
  text_offset: if not None, that character in the text content is referred to.
  tail_offset: if not None, that character in the tail content is referred to.
  after: if True, we actually reference whatever comes after.
Why do we need after? To be honest, I keep going back and forth between having it and not having it. The background is that ranges are right-exclusive because that is the convention in the Web Annotation standard. after makes things easier in a way. In particular, it lets us include the end of the DOM. It also allows us to keep DomPoint a simple data structure without, e.g., complex processing code for finding whatever comes after.
- is_element() -> bool
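To make the role of after more concrete, here is a minimal sketch that mirrors the documented fields. It uses xml.etree.ElementTree from the standard library instead of lxml, the class name PointSketch is hypothetical, and the behaviour of is_element() as well as the "at most one offset" check are assumptions rather than the library's documented semantics.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PointSketch:
    """Hypothetical mirror of DomPoint's fields; not the library class."""
    node: ET.Element
    text_offset: Optional[int] = None
    tail_offset: Optional[int] = None
    after: bool = False

    def __post_init__(self) -> None:
        # Assumption: a point targets the node's text or its tail, not both.
        if self.text_offset is not None and self.tail_offset is not None:
            raise ValueError("set at most one of text_offset and tail_offset")

    def is_element(self) -> bool:
        # Assumption: without a character offset, the node itself is targeted.
        return self.text_offset is None and self.tail_offset is None


root = ET.fromstring("<p>Hello <b>world</b>!</p>")
bold = root[0]

# A right-exclusive range over "world": the end names the last covered
# character ('d', index 4) and sets after=True, so the exclusive end is
# whatever follows that character.
start = PointSketch(bold, text_offset=0)
end = PointSketch(bold, text_offset=4, after=True)

# A range running to the very end of the document: nothing follows the
# root, so the exclusive end can only be expressed as "after the root".
doc_end = PointSketch(root, after=True)
```

Without the flag, the end of such a range would have to be computed by searching for the next node or character, which is the processing code the note above wants to avoid.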
- class spotterbase.selectors.dom_range.DomRange(start: DomPoint | DomRange, end: DomPoint | DomRange)
Bases: object
- get_containing_node() -> _Element
Note: This does not cover edge cases yet…
- spotterbase.selectors.dom_range.get_parent_asserted(node: _Element) -> _Element
spotterbase.selectors.offset_converter module
- class spotterbase.selectors.offset_converter.DomOffsetRange(start: int, end: int, converter: OffsetConverter)
Bases: object
A DomRange, except that it uses node offsets as created by the OffsetConverter.
- converter: OffsetConverter
- end: int
- start: int
- class spotterbase.selectors.offset_converter.NodeOffsetData(text_offset_before: int, node_text_offset_before: int, text_offset_after: int, node_text_offset_after: int)
Bases: object
Two types of offsets are recorded: text offsets and node text offsets. These offsets are recorded both for the node itself and for the first element after the node.
The text offset corresponds to the number of characters in text nodes up to that point. The node text offset additionally counts all nodes (opening tags). Node text offsets allow targeting nodes directly. That way, it’s possible to target e.g. an <img …/> node or to distinguish whether the <mrow>, the <mi> or the n is targeted in <mrow><mi>n</mi>….
Details:
* text_offset is the offset of the last character before the node
* node_text_offset is the offset of the node itself
- get_offsets_of_type(offset_type: OffsetType) -> tuple[int, int]
- node_text_offset_after: int
- node_text_offset_before: int
- text_offset_after: int
- text_offset_before: int
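The difference between the two offset types described for NodeOffsetData can be sketched with a single recursive pass over a small tree. This is an illustration under assumptions, not the OffsetConverter code: it uses xml.etree.ElementTree instead of lxml, the function name record_offsets is hypothetical, and the counters simply record how many characters (and opening tags) have been seen so far, which may differ by one from the library's exact convention.

```python
import xml.etree.ElementTree as ET


def record_offsets(root: ET.Element) -> dict[ET.Element, tuple[int, int, int, int]]:
    """Map each node to (text_before, node_text_before, text_after, node_text_after)."""
    table = {}
    text = 0        # characters seen so far in text nodes
    node_text = 0   # characters plus opening tags seen so far

    def visit(node: ET.Element) -> None:
        nonlocal text, node_text
        text_before, node_text_before = text, node_text
        node_text += 1                      # the opening tag counts as one step
        n = len(node.text or "")
        text += n
        node_text += n
        for child in node:
            visit(child)
            n = len(child.tail or "")       # text after the child's closing tag
            text += n
            node_text += n
        table[node] = (text_before, node_text_before, text, node_text)

    visit(root)
    return table


root = ET.fromstring("<math><mrow><mi>n</mi><mo>+</mo></mrow></math>")
offsets = record_offsets(root)
mrow = root[0]
mi, mo = mrow[0], mrow[1]
```

In this sketch, <math>, <mrow> and <mi> all start at text offset 0, so a text offset cannot tell them apart, while their node text offsets are 0, 1 and 2 and the character n sits at node text offset 3.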
- class spotterbase.selectors.offset_converter.OffsetConverter(root: _Element)
Bases: object
Records offsets in the DOM.
Notes on efficiency:
- Recurses through the entire DOM at initialization, which takes time (approximately 1/6th of parsing time).
- If only a single offset is of interest, using an HTML tree (lxml.html.parse) and .text_content() with a custom implementation is very efficient (10x faster). However, I expect that there will often be more than 10 offsets to convert.
- convert_dom_range(dom_range: DomRange) -> DomOffsetRange
- get_dom_point(offset: int, offset_type: OffsetType, is_start: bool | None = None) -> DomPoint
- get_offset(point: _Element | DomPoint, offset_type: OffsetType) -> int
- get_offset_data(node: _Element) -> NodeOffsetData
- root: _Element
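The trade-off in the efficiency notes (one full pass at initialization, cheap conversions afterwards) can be sketched as follows. This is a hypothetical helper, not the OffsetConverter implementation: the name TextOffsetLookup and its locate method are assumptions, it handles only text offsets, and it uses xml.etree.ElementTree instead of lxml.

```python
import bisect
import xml.etree.ElementTree as ET


class TextOffsetLookup:
    """Hypothetical helper: one pass at construction, cheap lookups afterwards."""

    def __init__(self, root: ET.Element) -> None:
        self._starts: list[int] = []                     # text offset where each chunk begins
        self._chunks: list[tuple[ET.Element, str]] = []  # (node, "text" or "tail")
        offset = 0

        def add_chunk(node: ET.Element, kind: str, content: str) -> None:
            nonlocal offset
            self._starts.append(offset)
            self._chunks.append((node, kind))
            offset += len(content)

        def visit(node: ET.Element) -> None:
            if node.text:
                add_chunk(node, "text", node.text)
            for child in node:
                visit(child)
                if child.tail:
                    add_chunk(child, "tail", child.tail)

        visit(root)
        self.length = offset

    def locate(self, offset: int) -> tuple[ET.Element, str, int]:
        """Return (node, "text" or "tail", index within that string)."""
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        # Binary search for the last chunk that starts at or before the offset.
        i = bisect.bisect_right(self._starts, offset) - 1
        node, kind = self._chunks[i]
        return node, kind, offset - self._starts[i]


root = ET.fromstring("<p>Hello <b>world</b>!</p>")
lookup = TextOffsetLookup(root)
```

The returned triple corresponds loosely to a DomPoint with either text_offset or tail_offset set. Each lookup is a binary search over the precomputed chunk starts, so the cost of the initial pass is amortized once several offsets are converted.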
- class spotterbase.selectors.offset_converter.OffsetType(value)
Bases: Enum
We record two types of offsets:
* Text offsets increase with every character in a text node (we need this for the char because it only counts text)
* Node text offsets additionally increase with every opening tag
- NodeText = 1
- Text = 0
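As a sketch of how an offset type could select between the recorded values, the following mirrors the enum members above and the four fields of NodeOffsetData. The class names OffsetKind and OffsetRecord are hypothetical, and the (before, after) order of the returned pair is an assumption; the library's get_offsets_of_type may behave differently.

```python
from dataclasses import dataclass
from enum import Enum


class OffsetKind(Enum):
    """Hypothetical mirror of OffsetType, with the same member values."""
    Text = 0
    NodeText = 1


@dataclass(frozen=True)
class OffsetRecord:
    """Hypothetical mirror of NodeOffsetData's four fields."""
    text_offset_before: int
    node_text_offset_before: int
    text_offset_after: int
    node_text_offset_after: int

    def get_offsets_of_type(self, kind: OffsetKind) -> tuple[int, int]:
        # Assumption: the pair is (before, after) for the requested type.
        if kind is OffsetKind.Text:
            return self.text_offset_before, self.text_offset_after
        return self.node_text_offset_before, self.node_text_offset_after


record = OffsetRecord(
    text_offset_before=0,
    node_text_offset_before=2,
    text_offset_after=1,
    node_text_offset_after=4,
)
```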
spotterbase.selectors.selector_converter module
- class spotterbase.selectors.selector_converter.SelectorConverter(document_uri: Uri, dom: _Element, offset_converter: OffsetConverter)
Bases: object
- dom_to_fragment_target(target_uri: str | Uri | URIRef | Path | VocabularyMeta, dom_range: DomRange, sub_ranges: list[DomRange] | None = None) -> FragmentTarget
- dom_to_offset_selector(dom_range: DomRange) -> OffsetSelector
- dom_to_path_selector(dom_range: DomRange) -> PathSelector
- dom_to_selectors(dom_range: DomRange, sub_ranges: list[DomRange] | None = None) -> list[PathSelector | OffsetSelector]
- property offset_converter: OffsetConverter
- selector_to_dom(selector: OffsetSelector | PathSelector) -> tuple[DomRange, list[DomRange] | None]
- target_to_dom(target: FragmentTarget) -> tuple[DomRange, list[DomRange] | None]
- to_dom(arg: FragmentTarget | PathSelector | OffsetSelector) -> tuple[DomRange, list[DomRange] | None]