spotterbase.dnm_nlp package

Submodules

spotterbase.dnm_nlp.sentence_tokenizer module

Rule-based sentence tokenization.

Note that some aspects (e.g. recognition of display math) are arXMLiv-specific. The tokenizer could be extended to work better with other corpora as well.

spotterbase.dnm_nlp.sentence_tokenizer.get_surrounding_node(dnm: Dnm) → _Element

spotterbase.dnm_nlp.sentence_tokenizer.is_display_math(node: _Element) → bool

spotterbase.dnm_nlp.sentence_tokenizer.is_in_header(node: _Element) → bool

spotterbase.dnm_nlp.sentence_tokenizer.is_ref_node(node: _Element) → bool

spotterbase.dnm_nlp.sentence_tokenizer.normal_end_of_sentence(dnm: Dnm, i: int) → bool

spotterbase.dnm_nlp.sentence_tokenizer.sentence_tokenize(dnm: Dnm) → list[Dnm]
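To illustrate what rule-based sentence tokenization can look like, here is a minimal sketch that works on plain strings rather than on a Dnm. It is not the implementation used by this module. The names `ABBREVIATIONS`, `is_sentence_end` and `split_sentences` are hypothetical, and the rules (terminal punctuation, a small abbreviation list, a following upper-case letter) are assumptions chosen for the example. Corpus-specific handling such as display math, headers and reference nodes is left out.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would need a longer one.
ABBREVIATIONS = {"e.g.", "i.e.", "cf.", "fig.", "eq.", "etc."}


def is_sentence_end(text: str, i: int) -> bool:
    """Return True if the character at index i plausibly ends a sentence."""
    if text[i] not in ".!?":
        return False
    # Last whitespace-delimited token up to and including position i.
    last_token = text[: i + 1].split()[-1].lower()
    if last_token in ABBREVIATIONS:
        return False
    rest = text[i + 1:]
    # Accept the end of the text, or whitespace followed by an upper-case letter.
    return rest.strip() == "" or re.match(r"\s+[A-Z]", rest) is not None


def split_sentences(text: str) -> list:
    """Split text into sentences at every position accepted by is_sentence_end."""
    sentences = []
    start = 0
    for i in range(len(text)):
        if is_sentence_end(text, i):
            sentence = text[start: i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    # Keep any trailing text that has no terminal punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "We use primes, e.g. 2 and 3. They are small. See Fig. 1 for details"
for sentence in split_sentences(example):
    print(sentence)
```

The sketch decides per character position whether a sentence ends there, which is loosely the shape suggested by the `normal_end_of_sentence(dnm, i)` signature above. The actual rules in the module may differ.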

spotterbase.dnm_nlp.word_tokenizer module

A very simple word tokenizer implementation.

As it does not use any DNM-specific features (except for working on DnmDstr), it might be better to use an off-the-shelf tokenizer.

spotterbase.dnm_nlp.word_tokenizer.word_tokenize(sentence: LinkedStr_T, keep_as_words: list[tuple[int, int]] | None = None) → list[LinkedStr_T]
spotterbase.dnm_nlp.word_tokenizer.word_tokenize(sentence: str, keep_as_words: list[tuple[int, int]] | None = None) → list[str]

Tokenizes a sentence (or longer text) into words using some simple rules. keep_as_words can be used to keep certain ranges of the text as words (e.g. annotated ranges that were replaced with a complex token). Each range is right-exclusive.
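The effect of keep_as_words can be sketched with a small stand-alone function on plain strings. This is not the library's implementation. `simple_word_tokenize` and `WORD_RE` are hypothetical names, the regular expression is an assumed stand-in for the "simple rules", and the sketch assumes that the protected ranges do not overlap.

```python
from __future__ import annotations

import re

# Assumed rule for this sketch: a token is a run of word characters
# or a single non-space punctuation character.
WORD_RE = re.compile(r"\w+|[^\w\s]")


def simple_word_tokenize(
    text: str, keep_as_words: list[tuple[int, int]] | None = None
) -> list[str]:
    """Tokenize text, keeping each (start, end) range as one token.

    Ranges are right-exclusive and assumed not to overlap.
    """
    tokens = []
    pos = 0
    for start, end in sorted(keep_as_words or []):
        # Tokenize the unprotected text before the range.
        tokens.extend(WORD_RE.findall(text[pos:start]))
        # Keep the protected range [start, end) as a single token.
        tokens.append(text[start:end])
        pos = end
    # Tokenize whatever follows the last protected range.
    tokens.extend(WORD_RE.findall(text[pos:]))
    return tokens


text = "the value x + 1 is odd"
start = text.index("x + 1")
print(simple_word_tokenize(text))
print(simple_word_tokenize(text, [(start, start + len("x + 1"))]))
```

Without the range, the formula is split into `x`, `+` and `1`. With the range, `x + 1` is kept as one token.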

spotterbase.dnm_nlp.xml_match module

class spotterbase.dnm_nlp.xml_match.MatchTree(label: str, node: _Element | None, children: List[MatchTree])

Bases: object

property only_child: MatchTree

class spotterbase.dnm_nlp.xml_match.Matcher: Bases: object

class spotterbase.dnm_nlp.xml_match.MatcherAnyNode: Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherLabelled(node_matcher: NodeMatcher, label: str): Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeAsSeq(node_matcher: NodeMatcher): Bases: SeqMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeOr(node_matchers: List[NodeMatcher]): Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeWithChildren(node_matcher: NodeMatcher, seq_matcher: SeqMatcher, allow_remainder: bool = False): Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeWithClass(node_matcher: NodeMatcher, acceptable_classes: Set[str]): Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherNodeWithText(node_matcher: NodeMatcher, regex: Pattern, require_full_match: bool): Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.MatcherSeqAny(node_matcher: NodeMatcher)

Bases: SeqMatcher

Matches a whole sequence if a single element matches the specified node matcher

class spotterbase.dnm_nlp.xml_match.MatcherSeqConcat(seq_matchers: List[SeqMatcher])

Bases: SeqMatcher

Concatenation of sequence matchers

class spotterbase.dnm_nlp.xml_match.MatcherSeqOr(seq_matchers: List[SeqMatcher])

Bases: SeqMatcher

class spotterbase.dnm_nlp.xml_match.MatcherTag(tagname: str)

Bases: NodeMatcher

class spotterbase.dnm_nlp.xml_match.NodeMatcher

Bases: Matcher

match(node: _Element) → Iterator[MatchTree]

with_class(*classes: str) → NodeMatcher

with_text(regex: str, require_full_match: bool = True) → NodeMatcher

class spotterbase.dnm_nlp.xml_match.SeqMatcher

Bases: Matcher

spotterbase.dnm_nlp.xml_match.maybe(matcher: NodeMatcher | SeqMatcher) → SeqMatcher

spotterbase.dnm_nlp.xml_match.seq(*matchers: NodeMatcher | SeqMatcher) → SeqMatcher

spotterbase.dnm_nlp.xml_match.tag(name: str) → NodeMatcher
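The combinator style suggested by `tag`, `seq`, `maybe` and the node/sequence matcher classes can be illustrated with a small, self-contained reimplementation. This is a sketch under stated assumptions, not the module's code. It uses the standard library's `xml.etree.ElementTree` instead of lxml, returns a plain boolean from a hypothetical `matches` method instead of yielding `MatchTree` objects, and exposes children matching through a hypothetical helper `node_with_children` whose parameters mirror `MatcherNodeWithChildren`. The class name `ltx_tag` in the example is only illustrative.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List, Union


class NodeMatcher:
    """Sketch: wraps a predicate on a single element."""

    def __init__(self, pred: Callable[[ET.Element], bool]):
        self.pred = pred

    def matches(self, node: ET.Element) -> bool:
        return self.pred(node)

    def with_class(self, *classes: str) -> NodeMatcher:
        # Accept the node if it carries at least one of the given classes.
        wanted = set(classes)
        return NodeMatcher(
            lambda n: self.pred(n) and bool(wanted & set(n.get("class", "").split()))
        )


class SeqMatcher:
    """Sketch: given siblings and a start index, yields possible end indices."""

    def __init__(self, run: Callable[[List[ET.Element], int], Iterator[int]]):
        self.run = run


def _as_seq(m: Union[NodeMatcher, SeqMatcher]) -> SeqMatcher:
    """Treat a node matcher as a sequence matcher consuming exactly one node."""
    if isinstance(m, SeqMatcher):
        return m

    def run(nodes, i):
        if i < len(nodes) and m.matches(nodes[i]):
            yield i + 1

    return SeqMatcher(run)


def tag(name: str) -> NodeMatcher:
    return NodeMatcher(lambda n: n.tag == name)


def seq(*matchers: Union[NodeMatcher, SeqMatcher]) -> SeqMatcher:
    """Concatenation: each part must match where the previous one ended."""
    parts = [_as_seq(m) for m in matchers]

    def run(nodes, i):
        ends = [i]
        for part in parts:
            ends = [j for e in ends for j in part.run(nodes, e)]
        yield from ends

    return SeqMatcher(run)


def maybe(matcher: Union[NodeMatcher, SeqMatcher]) -> SeqMatcher:
    """Optional part: either consume nothing or whatever the inner matcher consumes."""
    inner = _as_seq(matcher)

    def run(nodes, i):
        yield i
        yield from inner.run(nodes, i)

    return SeqMatcher(run)


def node_with_children(
    node_matcher: NodeMatcher, seq_matcher: SeqMatcher, allow_remainder: bool = False
) -> NodeMatcher:
    """Hypothetical helper: match a node whose children match seq_matcher."""

    def pred(n):
        children = list(n)
        return node_matcher.matches(n) and any(
            allow_remainder or end == len(children)
            for end in seq_matcher.run(children, 0)
        )

    return NodeMatcher(pred)


# An <li> that contains an optional tagged <span> followed by a <p>.
pattern = node_with_children(
    tag("li"), seq(maybe(tag("span").with_class("ltx_tag")), tag("p"))
)
with_tag = ET.fromstring('<li><span class="ltx_tag">1.</span><p>text</p></li>')
without_tag = ET.fromstring("<li><p>text</p></li>")
extra = ET.fromstring("<li><p>text</p><p>more</p></li>")
print(pattern.matches(with_tag), pattern.matches(without_tag), pattern.matches(extra))
```

In this sketch the third element is rejected because a second `<p>` remains unmatched. Passing `allow_remainder=True` to `node_with_children` would accept it.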
