spotterbase.dnm_nlp package
Submodules
spotterbase.dnm_nlp.sentence_tokenizer module
Rule-based sentence tokenization.
Note that some aspects (e.g. recognition of display math) are arXMLiv-specific; this could be extended to work better with other corpora as well.
- spotterbase.dnm_nlp.sentence_tokenizer.is_display_math(node: _Element) -> bool
- spotterbase.dnm_nlp.sentence_tokenizer.is_in_header(node: _Element) -> bool
- spotterbase.dnm_nlp.sentence_tokenizer.is_ref_node(node: _Element) -> bool
spotterbase.dnm_nlp.word_tokenizer module
A very simple word tokenizer implementation.
As it does not use any DNM-specific features (except for working on DnmDstr), it might be better to use an off-the-shelf tokenizer.
- spotterbase.dnm_nlp.word_tokenizer.word_tokenize(sentence: LinkedStr_T, keep_as_words: list[tuple[int, int]] | None = None) -> list[LinkedStr_T]
- spotterbase.dnm_nlp.word_tokenizer.word_tokenize(sentence: str, keep_as_words: list[tuple[int, int]] | None = None) -> list[str]
Tokenizes a sentence (or longer text) into words using some simple rules. keep_as_words can be used to keep certain parts (range is right-exclusive) of the text as words (e.g. annotated ranges that were replaced with a complex token).
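To illustrate what keep_as_words does, here is a minimal stand-alone sketch. It is not spotterbase's implementation: the function name simple_word_tokenize and the regular expression are hypothetical, it only handles plain str input, and it assumes the protected ranges do not overlap.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical stand-in for illustration; NOT the spotterbase implementation.
# A token is either a run of word characters or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_word_tokenize(
    sentence: str,
    keep_as_words: Optional[List[Tuple[int, int]]] = None,
) -> List[str]:
    """Split text into tokens, keeping each protected range as one token.

    Ranges are (start, end) offsets with `end` exclusive and are assumed
    not to overlap.
    """
    tokens: List[str] = []
    pos = 0
    for start, end in sorted(keep_as_words or []):
        # Tokenize the unprotected text before the range with the regex ...
        tokens.extend(_TOKEN_RE.findall(sentence[pos:start]))
        # ... then emit the protected range [start, end) as a single token.
        tokens.append(sentence[start:end])
        pos = end
    # Tokenize whatever follows the last protected range.
    tokens.extend(_TOKEN_RE.findall(sentence[pos:]))
    return tokens


text = "We have x + y = z here."
formula = "x + y = z"
start = text.index(formula)
protected = simple_word_tokenize(text, [(start, start + len(formula))])
```

In this sketch the formula survives as one token, whereas tokenizing the same text without keep_as_words would split it at every operator.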
spotterbase.dnm_nlp.xml_match module
- class spotterbase.dnm_nlp.xml_match.MatchTree(label: str, node: _Element | None, children: List[MatchTree])
Bases:
object
- class spotterbase.dnm_nlp.xml_match.Matcher
Bases:
object
- class spotterbase.dnm_nlp.xml_match.MatcherAnyNode
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherLabelled(node_matcher: NodeMatcher, label: str)
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherNodeAsSeq(node_matcher: NodeMatcher)
Bases:
SeqMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherNodeOr(node_matchers: List[NodeMatcher])
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherNodeWithChildren(node_matcher: NodeMatcher, seq_matcher: SeqMatcher, allow_remainder: bool = False)
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherNodeWithClass(node_matcher: NodeMatcher, acceptable_classes: Set[str])
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherNodeWithText(node_matcher: NodeMatcher, regex: Pattern, require_full_match: bool)
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherSeqAny(node_matcher: NodeMatcher)
Bases:
SeqMatcher
Matches a whole sequence if a single element matches the specified node matcher.
- class spotterbase.dnm_nlp.xml_match.MatcherSeqConcat(seq_matchers: List[SeqMatcher])
Bases:
SeqMatcher
Concatenation of sequence matchers.
- class spotterbase.dnm_nlp.xml_match.MatcherSeqOr(seq_matchers: List[SeqMatcher])
Bases:
SeqMatcher
- class spotterbase.dnm_nlp.xml_match.MatcherTag(tagname: str)
Bases:
NodeMatcher
- class spotterbase.dnm_nlp.xml_match.NodeMatcher
Bases:
Matcher
- with_class(*classes: str) -> NodeMatcher
- with_text(regex: str, require_full_match: bool = True) -> NodeMatcher
- spotterbase.dnm_nlp.xml_match.maybe(matcher: NodeMatcher | SeqMatcher) -> SeqMatcher
- spotterbase.dnm_nlp.xml_match.seq(*matchers: NodeMatcher | SeqMatcher) -> SeqMatcher
- spotterbase.dnm_nlp.xml_match.tag(name: str) -> NodeMatcher
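The split between node matchers and sequence matchers can be illustrated with a small toy model. The sketch below is an assumption about the general combinator idea, not spotterbase code: all toy_* names are hypothetical, it uses the standard library's xml.etree.ElementTree rather than lxml, it returns a plain bool instead of building a MatchTree, and its reading of allow_remainder (unmatched trailing children are tolerated) is a guess.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Set

# Toy model for illustration only; none of this is spotterbase code.
# A node matcher decides whether a single element is acceptable.
NodeM = Callable[[ET.Element], bool]
# A sequence matcher maps (nodes, start index) to the set of possible end indices.
SeqM = Callable[[List[ET.Element], int], Set[int]]


def toy_tag(name: str) -> NodeM:
    """Node matcher: accept elements with the given tag name."""
    return lambda node: node.tag == name


def toy_with_class(matcher: NodeM, *classes: str) -> NodeM:
    """Node matcher: additionally require at least one of the given classes."""
    wanted = set(classes)
    return lambda node: matcher(node) and bool(
        wanted & set(node.get("class", "").split())
    )


def toy_node_as_seq(matcher: NodeM) -> SeqM:
    """Lift a node matcher to a sequence matcher consuming exactly one node."""
    def run(nodes: List[ET.Element], i: int) -> Set[int]:
        return {i + 1} if i < len(nodes) and matcher(nodes[i]) else set()
    return run


def toy_maybe(matcher: SeqM) -> SeqM:
    """Sequence matcher: consume nothing, or whatever `matcher` consumes."""
    return lambda nodes, i: {i} | matcher(nodes, i)


def toy_seq(*matchers: SeqM) -> SeqM:
    """Sequence matcher: concatenation, threading end positions through each part."""
    def run(nodes: List[ET.Element], i: int) -> Set[int]:
        positions = {i}
        for m in matchers:
            positions = {end for p in positions for end in m(nodes, p)}
        return positions
    return run


def toy_with_children(
    matcher: NodeM, children: SeqM, allow_remainder: bool = False
) -> NodeM:
    """Node matcher: the node must match and its children must match `children`."""
    def run(node: ET.Element) -> bool:
        kids = list(node)
        ends = children(kids, 0)
        # Without allow_remainder, the child sequence must be consumed completely.
        consumed = bool(ends) if allow_remainder else len(kids) in ends
        return matcher(node) and consumed
    return run


# An <mrow> containing <mi>, an optional <mo>, then <mn>.
pattern = toy_with_children(
    toy_tag("mrow"),
    toy_seq(
        toy_node_as_seq(toy_tag("mi")),
        toy_maybe(toy_node_as_seq(toy_tag("mo"))),
        toy_node_as_seq(toy_tag("mn")),
    ),
)
with_operator = pattern(ET.fromstring("<mrow><mi>x</mi><mo>+</mo><mn>2</mn></mrow>"))
without_operator = pattern(ET.fromstring("<mrow><mi>x</mi><mn>2</mn></mrow>"))
```

Tracking sets of end positions keeps the optional and concatenation combinators simple, because every alternative is carried forward without explicit backtracking.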