spotterbase.dnm_nlp package

Submodules

spotterbase.dnm_nlp.sentence_tokenizer module

Rule-based sentence tokenization.

Note that some aspects (e.g. recognition of display math) are arXMLiv-specific. The tokenizer could be extended to work better with other corpora as well.

spotterbase.dnm_nlp.sentence_tokenizer.get_surrounding_node(dnm: Dnm) → _Element
spotterbase.dnm_nlp.sentence_tokenizer.is_display_math(node: _Element) → bool
spotterbase.dnm_nlp.sentence_tokenizer.is_in_header(node: _Element) → bool
spotterbase.dnm_nlp.sentence_tokenizer.is_ref_node(node: _Element) → bool
spotterbase.dnm_nlp.sentence_tokenizer.normal_end_of_sentence(dnm: Dnm, i: int) → bool
spotterbase.dnm_nlp.sentence_tokenizer.sentence_tokenize(dnm: Dnm) → list[Dnm]
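To make the idea of rule-based sentence tokenization concrete, here is a minimal sketch on plain Python strings. It is not the implementation behind sentence_tokenize, which operates on a Dnm and has additional arXMLiv-specific rules. The function name split_sentences, the abbreviation list and the boundary rule (sentence punctuation followed by whitespace and an uppercase letter) are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list, chosen only for this illustration.
ABBREVIATIONS = {"e.g.", "i.e.", "cf.", "etc.", "Fig.", "Eq."}


def split_sentences(text: str) -> list[str]:
    """Split at '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        # Inspect the last whitespace-delimited token before the candidate boundary.
        tokens = text[start:end].split()
        last_token = tokens[-1] if tokens else ""
        if last_token in ABBREVIATIONS:
            continue  # an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last boundary is the final sentence.
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


example = "We use e.g. Lemma 2. The proof is easy. See Fig. 3 for details."
for sentence in split_sentences(example):
    print(sentence)
```

A rule like this is cheap and predictable, but it needs corpus-specific exceptions, which is presumably why the module above carries checks such as is_display_math and is_in_header.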

spotterbase.dnm_nlp.word_tokenizer module

A very simple word tokenizer implementation.

As it does not use any DNM-specific features (except for working on DnmDstr), it might be better to use an off-the-shelf tokenizer.

spotterbase.dnm_nlp.word_tokenizer.word_tokenize(sentence: LinkedStr_T, keep_as_words: list[tuple[int, int]] | None = None) → list[LinkedStr_T]
spotterbase.dnm_nlp.word_tokenizer.word_tokenize(sentence: str, keep_as_words: list[tuple[int, int]] | None = None) → list[str]

Tokenizes a sentence (or longer text) into words using some simple rules. keep_as_words can be used to keep certain ranges of the text as single words (e.g. annotated ranges that were replaced with a complex token). Each range is a (start, end) pair of offsets into the text and is right-exclusive.
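The effect of keep_as_words is easiest to see on a small example. The sketch below is a stand-in written for this purpose, not the library's word_tokenize: the name simple_word_tokenize and its splitting rule (runs of word characters, plus single punctuation marks) are assumptions, and the ranges are assumed not to overlap. Only the handling of right-exclusive (start, end) ranges mirrors the description above.

```python
import re
from typing import Optional

# Assumed splitting rule: a run of word characters, or one non-space symbol.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_word_tokenize(
    sentence: str,
    keep_as_words: Optional[list[tuple[int, int]]] = None,
) -> list[str]:
    """Tokenize sentence, emitting each (start, end) range as one token."""
    tokens = []
    pos = 0
    for start, end in sorted(keep_as_words or []):
        # Tokenize the text before the protected range with the normal rule.
        tokens.extend(_TOKEN_RE.findall(sentence[pos:start]))
        # The protected range becomes a single token; end is exclusive.
        tokens.append(sentence[start:end])
        pos = end
    tokens.extend(_TOKEN_RE.findall(sentence[pos:]))
    return tokens


sentence = "The value x + y is small."
start = sentence.index("x + y")
print(simple_word_tokenize(sentence))
print(simple_word_tokenize(sentence, keep_as_words=[(start, start + len("x + y"))]))
```

Without the range, the expression is split into three tokens; with it, "x + y" survives as one word.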

spotterbase.dnm_nlp.xml_match module

class spotterbase.dnm_nlp.xml_match.MatchTree(label: str, node: _Element | None, children: List[MatchTree])

Bases: object

property only_child: MatchTree

class spotterbase.dnm_nlp.xml_match.Matcher

Bases: object

class spotterbase.dnm_nlp.xml_match.MatcherAnyNode

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherLabelled(node_matcher: NodeMatcher, label: str)

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeAsSeq(node_matcher: NodeMatcher)

Bases: SeqMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeOr(node_matchers: List[NodeMatcher])

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeWithChildren(node_matcher: NodeMatcher, seq_matcher: SeqMatcher, allow_remainder: bool = False)

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeWithClass(node_matcher: NodeMatcher, acceptable_classes: Set[str])

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeWithText(node_matcher: NodeMatcher, regex: Pattern, require_full_match: bool)

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherSeqAny(node_matcher: NodeMatcher)

Bases: SeqMatcher

Matches a whole sequence if a single element matches the specified node matcher.

class spotterbase.dnm_nlp.xml_match.MatcherSeqConcat(seq_matchers: List[SeqMatcher])

Bases: SeqMatcher

Concatenation of sequence matchers.

class spotterbase.dnm_nlp.xml_match.MatcherSeqOr(seq_matchers: List[SeqMatcher])

Bases: SeqMatcher

class spotterbase.dnm_nlp.xml_match.MatcherTag(tagname: str)

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.NodeMatcher

Bases: Matcher

match(node: _Element) → Iterator[MatchTree]
with_class(*classes: str) → NodeMatcher
with_text(regex: str, require_full_match: bool = True) → NodeMatcher

class spotterbase.dnm_nlp.xml_match.SeqMatcher

Bases: Matcher

spotterbase.dnm_nlp.xml_match.maybe(matcher: NodeMatcher | SeqMatcher) → SeqMatcher
spotterbase.dnm_nlp.xml_match.seq(*matchers: NodeMatcher | SeqMatcher) → SeqMatcher
spotterbase.dnm_nlp.xml_match.tag(name: str) → NodeMatcher
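The signatures above suggest a combinator style in which a matcher created by tag is refined by chaining with_class and with_text. The following self-contained sketch imitates that pattern with the standard library's xml.etree.ElementTree so that it runs without spotterbase. MiniNodeMatcher and mini_tag are hypothetical names, and the semantics shown (any one of the given classes suffices, the regex is applied to the node's concatenated text) are assumptions rather than documented behaviour of the real classes. Sequence matchers such as seq and maybe are not modelled.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class MiniNodeMatcher:
    """Hypothetical stand-in for a node matcher: wraps a predicate on an element."""

    predicate: Callable[[ET.Element], bool]

    def match(self, node: ET.Element) -> Iterator[ET.Element]:
        # Yield the node if it satisfies the predicate, otherwise yield nothing.
        if self.predicate(node):
            yield node

    def with_class(self, *classes: str) -> "MiniNodeMatcher":
        # Assumption: one of the given classes on the node is enough.
        wanted = set(classes)
        return MiniNodeMatcher(
            lambda n: self.predicate(n) and bool(wanted & set(n.get("class", "").split()))
        )

    def with_text(self, regex: str, require_full_match: bool = True) -> "MiniNodeMatcher":
        pattern = re.compile(regex)
        check = pattern.fullmatch if require_full_match else pattern.search
        # Assumption: the regex is tested against all text inside the node.
        return MiniNodeMatcher(
            lambda n: self.predicate(n) and check("".join(n.itertext())) is not None
        )


def mini_tag(name: str) -> MiniNodeMatcher:
    """Match elements with the given tag name."""
    return MiniNodeMatcher(lambda n: n.tag == name)


doc = ET.fromstring(
    '<p><span class="ltx_ref ltx_bold">Theorem 1</span>'
    '<span class="ltx_note">see above</span></p>'
)
ref_matcher = mini_tag("span").with_class("ltx_ref").with_text(r"Theorem \d+")
# list(...) is used instead of any(...) because childless elements are falsy.
matched = [child for child in doc if list(ref_matcher.match(child))]
print([m.text for m in matched])
```

Returning an iterator from match, as the real NodeMatcher.match signature does, lets a matcher report zero, one or several ways of matching the same node, which is what makes composition of alternatives straightforward.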

Module contents