science_live.pipeline.entity_extractor

EntityExtractorLinker - Extract and link entities to URIs

Module Contents

Classes

EntityExtractorLinker

Extract and link entities to URIs with proper filtering and punctuation handling

API

class science_live.pipeline.entity_extractor.EntityExtractorLinker(endpoint_manager, config: Dict[str, Any] = None)

Extract and link entities to URIs with proper filtering and punctuation handling

Initialization

_initialize_function_words() → set

Initialize function words that are definitely not entities

_initialize_question_words() → set

Initialize question words and interrogatives

_initialize_boundary_words() → set

Initialize words that should not be at entity boundaries

Extract and link entities from processed question

async _extract_entities(processed_question: science_live.pipeline.common.ProcessedQuestion) → List[science_live.pipeline.common.ExtractedEntity]

Extract entities with type classification and punctuation cleaning

_extract_parenthetical_examples(text: str) → List[science_live.pipeline.common.ExtractedEntity]

Extract examples from parentheses
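The page does not show how parenthetical examples are found, so the following is only a minimal sketch of one plausible approach: match non-nested parentheses with a regular expression and split the contents on commas. The function name, the `(text, start, end)` tuple return type, and the comma-splitting rule are assumptions made for illustration; the real method returns `ExtractedEntity` objects.

```python
import re
from typing import List, Tuple


def extract_parenthetical_examples(text: str) -> List[Tuple[str, int, int]]:
    """Return (item, start, end) for each comma-separated item inside parentheses.

    Hypothetical stand-in for the documented method; offsets are character
    positions in the original text, with `end` exclusive.
    """
    results: List[Tuple[str, int, int]] = []
    # Match the innermost, non-nested parenthesised groups.
    for match in re.finditer(r"\(([^()]*)\)", text):
        inner = match.group(1)
        offset = match.start(1)
        # Split the group on commas while keeping track of positions.
        for item in re.finditer(r"[^,]+", inner):
            raw = item.group(0)
            cleaned = raw.strip()
            if not cleaned:
                continue
            leading_ws = len(raw) - len(raw.lstrip())
            start = offset + item.start() + leading_ws
            results.append((cleaned, start, start + len(cleaned)))
    return results


examples = extract_parenthetical_examples("genes (BRCA1, TP53)")
```

Keeping character offsets alongside the text is what would make a later overlap check possible.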

_extract_clean_noun_phrases(text: str) → List[science_live.pipeline.common.ExtractedEntity]

Extract clean noun phrases with proper boundaries

_extract_meaningful_words(text: str) → List[science_live.pipeline.common.ExtractedEntity]

Extract meaningful single words

_clean_entity_text(text: str) → str

Clean entity text by removing boundary words

_clean_phrase_boundaries(phrase: str) → str

Clean phrase boundaries by removing function words at edges
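As an illustration of trimming function words from the edges of a phrase, here is a hedged sketch. The `FUNCTION_WORDS` set, the punctuation characters stripped, and the function name are assumptions; the actual word lists come from the `_initialize_*` methods, and their contents are not documented on this page.

```python
# Illustrative word list only; the real set is built by the class at initialization.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "for", "to", "with"}
PUNCTUATION = ".,;:!?"


def clean_phrase_boundaries(phrase: str) -> str:
    """Drop function words from both ends of a phrase, keeping interior ones."""
    tokens = phrase.split()
    start, end = 0, len(tokens)
    # Advance past leading function words.
    while start < end and tokens[start].lower().strip(PUNCTUATION) in FUNCTION_WORDS:
        start += 1
    # Retreat past trailing function words.
    while end > start and tokens[end - 1].lower().strip(PUNCTUATION) in FUNCTION_WORDS:
        end -= 1
    # Remove punctuation left at the outer edges of the phrase.
    return " ".join(tokens[start:end]).strip(PUNCTUATION)


cleaned = clean_phrase_boundaries("the structure of DNA in")
```

Only the edges are trimmed, so an interior "of" survives and "structure of DNA" stays intact as a single phrase.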

_is_valid_acronym(text: str) → bool

Check if text is a valid acronym
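The acronym criteria are not stated on this page. One plausible rule, shown here purely as a sketch, accepts two to six uppercase letters or digits that start with a letter and rejects common all-caps words. The length bounds, the `EXCLUDED` set, and the function name are assumptions.

```python
import re

# Assumed shape: an uppercase letter followed by 1-5 uppercase letters or digits.
_ACRONYM_RE = re.compile(r"[A-Z][A-Z0-9]{1,5}")

# Illustrative stop list of capitalised ordinary words (hypothetical).
EXCLUDED = {"AND", "THE", "FOR", "WHAT"}


def is_valid_acronym(text: str) -> bool:
    """Return True if text looks like an acronym under the assumed rule."""
    return bool(_ACRONYM_RE.fullmatch(text)) and text not in EXCLUDED


checks = {word: is_valid_acronym(word) for word in ["DNA", "CO2", "dna", "A", "THE"]}
```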

_is_meaningful_phrase(phrase: str) → bool

Check if phrase is meaningful

_is_meaningful_single_word(word: str) → bool

Check if single word is meaningful

_clean_and_filter_entities(entities: List[science_live.pipeline.common.ExtractedEntity]) → List[science_live.pipeline.common.ExtractedEntity]

Remove duplicates, overlaps and filter low-quality entities
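A greedy pass is one way to combine the three steps named in the description (deduplication, overlap removal, quality filtering). The sketch below is not taken from the library: the `Candidate` dataclass, its field names, the `min_confidence` threshold, and the preference for higher-confidence then longer spans are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical stand-in for ExtractedEntity; field names are assumed."""
    text: str
    start: int       # inclusive character offset
    end: int         # exclusive character offset
    confidence: float


def clean_and_filter(candidates: List[Candidate], min_confidence: float = 0.5) -> List[Candidate]:
    """Keep the best non-overlapping, non-duplicate candidates above a threshold."""
    # Prefer higher confidence, then longer spans, then earlier position.
    ranked = sorted(candidates, key=lambda c: (-c.confidence, -(c.end - c.start), c.start))
    kept: List[Candidate] = []
    seen = set()
    for cand in ranked:
        key = cand.text.strip().lower()
        if cand.confidence < min_confidence or not key or key in seen:
            continue  # low quality, empty, or a case-insensitive duplicate
        if any(cand.start < k.end and k.start < cand.end for k in kept):
            continue  # overlaps a better candidate that was already kept
        seen.add(key)
        kept.append(cand)
    # Return survivors in reading order.
    return sorted(kept, key=lambda c: c.start)


result = clean_and_filter([
    Candidate("machine learning", 0, 16, 0.9),
    Candidate("machine", 0, 7, 0.6),
    Candidate("robotics", 21, 29, 0.8),
    Candidate("Robotics", 40, 48, 0.7),
    Candidate("stuff", 50, 55, 0.2),
])
```

Here "machine" is dropped because it overlaps the higher-ranked "machine learning", "Robotics" is a case-insensitive duplicate, and "stuff" falls below the threshold.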

_entities_overlap(entity1: science_live.pipeline.common.ExtractedEntity, entity2: science_live.pipeline.common.ExtractedEntity) → bool

Check if two entities overlap in text position
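Positional overlap between two spans is commonly tested with the half-open interval rule. In this sketch, the `Span` dataclass and its `start`/`end` field names are assumptions, since the fields of `ExtractedEntity` are not listed on this page.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for ExtractedEntity with character offsets."""
    text: str
    start: int  # inclusive
    end: int    # exclusive


def entities_overlap(a: Span, b: Span) -> bool:
    """Two half-open intervals overlap when each starts before the other ends."""
    return a.start < b.end and b.start < a.end


overlapping = entities_overlap(Span("machine learning", 0, 16), Span("learning", 8, 16))
adjacent = entities_overlap(Span("deep", 0, 4), Span("learning", 5, 13))
```

With exclusive end offsets, spans that merely touch (one ends where the other begins) are not treated as overlapping.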

Link entities to URIs

async _get_orcid_name(orcid: str) → str

Get name for ORCID

Link entity via external services

_classify_entities(entities: List[science_live.pipeline.common.ExtractedEntity], processed_question: science_live.pipeline.common.ProcessedQuestion) → Tuple[List[science_live.pipeline.common.ExtractedEntity], List[science_live.pipeline.common.ExtractedEntity]]

Classify entities as potential subjects or objects

_calculate_linking_confidence(entities: List[science_live.pipeline.common.ExtractedEntity]) → float

Calculate overall linking confidence
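The aggregation formula is not documented here. A simple possibility, shown only as a sketch, is the arithmetic mean of per-entity confidence scores, with 0.0 for an empty list. The function name and the use of a plain list of floats instead of `ExtractedEntity` objects are assumptions.

```python
from typing import Sequence


def overall_linking_confidence(confidences: Sequence[float]) -> float:
    """Average per-entity confidence; an empty input yields 0.0 (assumed convention)."""
    if not confidences:
        return 0.0
    return sum(confidences) / len(confidences)


score = overall_linking_confidence([0.9, 0.7])
```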