Dug Annotation Architecture
The Annotation phase is the semantic engine of Dug. Its purpose is to analyze the unstructured text descriptions of variables and studies and extract structured, interoperable ontology identifiers (CURIEs). These identifiers serve as the “hooks” that link disparate datasets to the broader biomedical knowledge graph.
1. Core Architecture
The annotation pipeline is built around a modular sequence of operations: Identify $\rightarrow$ Normalize $\rightarrow$ Enrich.
Key Components
1. Annotator (Annotator):
- Role: The entry point. It accepts raw text (e.g., “Patient has type 2 diabetes”) and returns a list of biological entities found in that text.
- Implementations:
  - Monarch Annotator: Uses the Monarch SciGraph API for dictionary-based named entity recognition (NER).
  - Sapbert Annotator: Uses a token classification model to find entity spans and a SapBERT model to ground them to ontology IDs, suitable for more complex or context-heavy matching.
2. Normalizer (DefaultNormalizer):
- Role: Takes a raw ID found by the annotator (which might be deprecated or from a non-standard ontology) and maps it to a “preferred” CURIE using the Translator Node Normalization service.
- Output: Updates the DugIdentifier with the preferred ID, label, and Biolink types (e.g., mapping biolink:Disease to disease).
3. Synonym Finder (DefaultSynonymFinder):
- Role: Fetches a list of synonyms for the normalized ID. This ensures that a user searching for “Heart Attack” can find a variable annotated with “Myocardial Infarction”.
4. DugIdentifier:
- Role: The data object passed through the pipeline. It starts as a raw hit and gains metadata (preferred labels, types, synonyms) as it moves through normalization and enrichment.
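The Identify, Normalize, Enrich sequence can be sketched end to end with in-memory stand-ins. Everything in this sketch is an assumption made for illustration: SketchIdentifier is a hypothetical stand-in for DugIdentifier, the lookup tables replace the external services with toy values, and the function names are not Dug's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchIdentifier:
    """Hypothetical stand-in for DugIdentifier (not the real class)."""
    id: str
    label: str
    types: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)


# Toy lookup tables standing in for the external services (assumed values,
# not real service responses).
TERM_INDEX: Dict[str, str] = {"heart attack": "HP:0001658"}
PREFERRED: Dict[str, dict] = {
    "HP:0001658": {"id": "HP:0001658", "label": "Myocardial infarction",
                   "types": ["phenotypic_feature"]},
}
SYNONYMS: Dict[str, List[str]] = {"HP:0001658": ["Heart attack"]}


def identify(text: str) -> List[SketchIdentifier]:
    """Identify: dictionary lookup of known terms in the lower-cased text."""
    lowered = text.lower()
    return [SketchIdentifier(id=curie, label=term)
            for term, curie in TERM_INDEX.items() if term in lowered]


def normalize(raw: SketchIdentifier) -> Optional[SketchIdentifier]:
    """Normalize: fill in preferred ID, label, and types; None if unknown."""
    record = PREFERRED.get(raw.id)
    if record is None:
        return None
    return SketchIdentifier(id=record["id"], label=record["label"],
                            types=list(record["types"]))


def enrich(identifier: SketchIdentifier) -> SketchIdentifier:
    """Enrich: attach synonyms for the normalized ID."""
    identifier.synonyms = list(SYNONYMS.get(identifier.id, []))
    return identifier


def annotate(text: str) -> List[SketchIdentifier]:
    """Run the three stages in order, dropping hits that fail to normalize."""
    results = []
    for raw in identify(text):
        normalized = normalize(raw)
        if normalized is None:
            continue
        results.append(enrich(normalized))
    return results
```

Calling annotate("History of heart attack") walks one hit through all three stages, so the returned object carries the label, types, and synonyms that the raw hit lacked.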
2. Data Flow
When dug crawl runs, the Crawler passes the description of every parsed element to the configured Annotator.
sequenceDiagram
    participant C as Crawler
    participant A as Annotator
    participant API as "External API (Monarch/SapBERT)"
    participant N as Normalizer
    participant S as SynonymFinder
    C->>A: annotate("history of heart attack")
    A->>A: Preprocess text (stopwords, abbrev)
    A->>API: Request entity extraction
    API-->>A: Raw ID: HP:0001658 (Myocardial Infarction)
    loop for each Raw ID
        A->>N: normalize(HP:0001658)
        N-->>A: Canonical ID + Biolink types
        A->>S: get_synonyms(HP:0001658)
        S-->>A: ["Heart attack", "Cardiac infarction", ...]
        A->>A: Create DugIdentifier
    end
    A-->>C: List of DugIdentifier
3. Existing Implementations
Monarch Annotator
- Location: src/dug/core/annotators/monarch_annotator.py
- Mechanism: Uses a “sliding window” approach to chunk long descriptions and sends each chunk to the Monarch API.
- Best For: General biomedical text where exact ontology term matching is sufficient.
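The sliding-window idea can be illustrated with a short sketch. This is not the Monarch annotator's actual code: the function name, the word-based windows, and the default window_size and overlap values are assumptions chosen for illustration.

```python
from typing import Iterator, List


def sliding_windows(text: str, window_size: int = 5, overlap: int = 2) -> Iterator[str]:
    """Yield overlapping word windows so a term that straddles a chunk
    boundary still appears whole in at least one chunk."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    words: List[str] = text.split()
    step = window_size - overlap
    start = 0
    while start < len(words):
        yield " ".join(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # this window already reached the end of the text
        start += step
```

Each yielded chunk would then be sent to the entity-extraction service on its own. The overlap trades extra requests for a lower chance of splitting a multi-word term.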
Sapbert Annotator
- Location: src/dug/core/annotators/sapbert_annotator.py
- Mechanism:
  - Token Classification: Sends text to a dense labeling API to identify spans of text that look like biomedical entities.
  - Grounding: Sends those spans to a SapBERT model to resolve them to specific ontology IDs (e.g., linking the text span “heart failure” to MONDO:0005009).
- Best For: Noisy text or when “fuzzy” semantic matching is needed.
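The two-stage shape (find spans, then ground each span) can be sketched without any model. In this toy version a fixed word list stands in for the token classifier, and difflib string similarity stands in for SapBERT's semantic matching. Both substitutions, the function names, and the cutoff value are assumptions; the two CURIEs are the ones used as examples elsewhere in this document.

```python
import difflib
import re
from typing import Dict, List, Optional, Tuple

# Toy vocabulary of known terms and their ontology IDs.
VOCABULARY: Dict[str, str] = {
    "heart failure": "MONDO:0005009",
    "myocardial infarction": "HP:0001658",
}

# Stand-in for the token classifier: any token NOT listed here is treated
# as part of an entity span.
NON_ENTITY_TOKENS = {"patient", "reports", "history", "of", "and", "with", "has", "the"}


def detect_spans(text: str) -> List[str]:
    """Stage 1: group consecutive entity-like tokens into candidate spans."""
    spans: List[str] = []
    current: List[str] = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in NON_ENTITY_TOKENS:
            if current:
                spans.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        spans.append(" ".join(current))
    return spans


def ground(span: str, cutoff: float = 0.8) -> Optional[str]:
    """Stage 2: map a span to the closest vocabulary term, or None."""
    matches = difflib.get_close_matches(span, list(VOCABULARY), n=1, cutoff=cutoff)
    return VOCABULARY[matches[0]] if matches else None


def annotate(text: str) -> List[Tuple[str, str]]:
    """Return (span, CURIE) pairs for every span that could be grounded."""
    grounded = []
    for span in detect_spans(text):
        curie = ground(span)
        if curie is not None:
            grounded.append((span, curie))
    return grounded
```

With the input "Patient reports hart failure and fatigue", the misspelled span "hart failure" is still grounded to MONDO:0005009, while "fatigue" has no close vocabulary entry and is dropped. A real embedding model would also match spans that share meaning but not spelling, which string similarity cannot do.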
4. How to Extend the Architecture
You can add new annotation methods (e.g., using a local LLM, AWS Comprehend, or a custom dictionary) by implementing the Annotator interface and registering it via pluggy.
Step 1: Create the Annotator Class
Create a new file (e.g., src/dug/core/annotators/custom_annotator.py). Your class must implement __call__.
import logging
from typing import List

from dug.config import Config
from dug.core.annotators._base import DugIdentifier

logger = logging.getLogger('dug')


class CustomAnnotator:
    def __init__(self, normalizer, synonym_finder, config: Config, **kwargs):
        self.normalizer = normalizer
        self.synonym_finder = synonym_finder
        # Annotator-specific settings arrive as keyword arguments
        self.api_key = kwargs.get('api_key', '')

    def __call__(self, text: str, http_session) -> List[DugIdentifier]:
        # 1. Preprocess text
        clean_text = text.lower()

        # 2. Call your custom logic/API to find entities
        # (Pseudo-code example)
        raw_matches = self.find_entities_in_text(clean_text)

        processed_identifiers = []
        for match in raw_matches:
            # Create a preliminary identifier
            raw_id = DugIdentifier(
                id=match['curie'],
                label=match['name'],
                types=[match['type']]  # passed as a list of Biolink types
            )

            # 3. Normalize (Crucial for Knowledge Graph integration)
            norm_id = self.normalizer(raw_id, http_session)
            if not norm_id:
                # Skip identifiers the normalizer cannot resolve
                logger.debug("Could not normalize %s; skipping", match['curie'])
                continue

            # 4. Enrich with Synonyms
            norm_id.synonyms = self.synonym_finder(norm_id.id, http_session)
            processed_identifiers.append(norm_id)

        return processed_identifiers

    def find_entities_in_text(self, text):
        # Your custom implementation here
        return [{"curie": "MONDO:0005009", "name": "Heart Failure", "type": "biolink:Disease"}]
Step 2: Register the Annotator
Modify src/dug/core/annotators/__init__.py to register your new class using the define_annotators hook.
from .custom_annotator import CustomAnnotator

@hookimpl
def define_annotators(annotator_dict: Dict[str, Annotator], config: Config):
    # ... existing annotators ...

    # Register a configured instance under the name passed to --annotator
    annotator_dict["custom"] = CustomAnnotator(
        normalizer=DefaultNormalizer(**config.normalizer),
        synonym_finder=DefaultSynonymFinder(**config.synonym_service),
        config=config,
        # Forward the "custom" settings from Step 3 (e.g., api_key) as kwargs
        **config.annotator_args.get("custom", {}),
    )
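The registration pattern itself (hook functions fill a shared dictionary, and an annotator is later looked up by name) can be shown in plain Python. This sketch deliberately does not use pluggy: the hookimpl decorator, HOOKS list, and get_annotator function below are simplified stand-ins, and the annotators are reduced to functions that return CURIE strings.

```python
from typing import Callable, Dict, List

Annotator = Callable[[str], List[str]]  # simplified: text in, CURIEs out
HOOKS: List[Callable[[Dict[str, Annotator], dict], None]] = []


def hookimpl(func):
    """Toy decorator: remember the function so it runs at collection time."""
    HOOKS.append(func)
    return func


@hookimpl
def define_builtin(annotator_dict: Dict[str, Annotator], config: dict) -> None:
    # A placeholder for the annotators that already exist
    annotator_dict["builtin"] = lambda text: []


@hookimpl
def define_custom(annotator_dict: Dict[str, Annotator], config: dict) -> None:
    # Settings for this annotator are read from its own config section
    settings = config.get("custom", {})
    curie = settings.get("default_curie", "MONDO:0005009")
    annotator_dict["custom"] = (
        lambda text: [curie] if "heart failure" in text.lower() else []
    )


def get_annotator(name: str, config: dict) -> Annotator:
    """Run every registered hook, then look up the requested annotator."""
    annotator_dict: Dict[str, Annotator] = {}
    for hook in HOOKS:
        hook(annotator_dict, config)
    if name not in annotator_dict:
        raise KeyError(f"Unknown annotator {name!r}; available: {sorted(annotator_dict)}")
    return annotator_dict[name]
```

The dictionary key is the only link between the registration code and the name a user types, so a typo in either place surfaces as an unknown-annotator error rather than a silent fallback.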
Step 3: Update Configuration
If your annotator needs specific settings (like URLs or API keys), add them to src/dug/config.py.
@dataclass
class Config:
    # ... existing config ...
    annotator_args: dict = field(
        default_factory=lambda: {
            "monarch": { ... },
            "sapbert": { ... },
            "custom": {
                "api_key": "SECRET_KEY",
                "endpoint": "http://my-custom-ner-service/"
            }
        }
    )
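A literal key in a config file is easy to commit by accident. One alternative, sketched here under assumptions, is to read the secret from the environment when the config object is created. SketchConfig is a stand-in for the real Config class, and the variable names CUSTOM_ANNOTATOR_API_KEY and CUSTOM_ANNOTATOR_ENDPOINT are hypothetical.

```python
import os
from dataclasses import dataclass, field


def _custom_annotator_defaults() -> dict:
    """Build the "custom" settings, preferring environment variables."""
    return {
        "api_key": os.environ.get("CUSTOM_ANNOTATOR_API_KEY", ""),
        "endpoint": os.environ.get(
            "CUSTOM_ANNOTATOR_ENDPOINT", "http://my-custom-ner-service/"
        ),
    }


@dataclass
class SketchConfig:
    """Hypothetical stand-in for Config, holding only annotator settings."""
    annotator_args: dict = field(
        default_factory=lambda: {"custom": _custom_annotator_defaults()}
    )


def validate_custom_settings(config: SketchConfig) -> dict:
    """Fail early with a clear message if the API key is missing."""
    settings = config.annotator_args["custom"]
    if not settings["api_key"]:
        raise ValueError("CUSTOM_ANNOTATOR_API_KEY is not set")
    return settings
```

Because the environment is read inside the default factory, each new SketchConfig instance picks up the values present at construction time.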
Step 4: Run
Use the --annotator flag in the CLI to use your new logic:
dug crawl my_data.csv --parser topmedtag --annotator custom