Relation Extraction & Knowledge Graph Construction
> NER found the entities. Entity linking anchored them. Relation extraction finds the edges between them. A knowledge graph is the sum of nodes, edges, and their provenance.
Type: Build
Languages: Python
Prerequisites: Phase 5 · 06 (NER), Phase 5 · 25 (Entity Linking)
Time: ~60 minutes
The Problem
An analyst reads: "Tim Cook became CEO of Apple in 2011." Four facts:
- (Tim Cook, role, CEO)
- (Tim Cook, employer, Apple)
- (Tim Cook, start_date, 2011)
- (Apple, type, Organization)
Relation Extraction (RE) turns free text into structured triples (subject, relation, object). Aggregate across a corpus and you have a knowledge graph. Aggregate and query and you have a reasoning substrate for RAG, analytics, or compliance audits.
The 2026 problem: LLMs extract relations enthusiastically. Too enthusiastically. They hallucinate triples that the source text does not support. Without provenance, you cannot tell real triples from plausible fiction. The 2026 answer is AEVS-style anchor-and-verify pipelines.
The Concept
Triple form. (subject_entity, relation_type, object_entity). Relations come from a closed ontology (Wikidata properties, FIBO, UMLS) or an open set (OpenIE-style, anything goes).
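A minimal sketch of a triple as a data structure, assuming you want provenance attached from the start. The class name `Triple` and the fields `doc_id` and `span` are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    doc_id: Optional[str] = None            # hypothetical document identifier
    span: Optional[Tuple[int, int]] = None  # char offsets of the supporting text

    def as_tuple(self) -> Tuple[str, str, str]:
        """Return the bare (s, r, o) form used for graph loading."""
        return (self.subject, self.relation, self.obj)


text = "Tim Cook became CEO of Apple in 2011."
triple = Triple("Tim Cook", "employer", "Apple", doc_id="doc-001", span=(0, len(text)))
print(triple.as_tuple())
```

Freezing the dataclass makes triples hashable, so they can be deduplicated in a set when merging across documents.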
Three extraction approaches.
- Rule / pattern-based. Hearst patterns: "X such as Y" → (Y, isA, X). Plus hand-crafted regex. Brittle, precise, explainable.
- Supervised classifier. Given two entity mentions in a sentence, predict the relation from a fixed set. Trained on TACRED, ACE, KBP. Standard 2015–2022.
- Generative LLM. Prompt the model to emit triples. Works out of the box. Needs provenance, or hallucinates plausible-looking junk.
AEVS (Anchor-Extraction-Verification-Supplement, 2026). The current hallucination-mitigation framework:
- Anchor. Identify every entity span and relation-phrase span with exact positions.
- Extract. Generate triples linked to anchor spans.
- Verify. Match each triple element back to the source text; reject anything unsupported.
- Supplement. A coverage pass ensures no anchored span is dropped.
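Because every triple must trace back to an anchored span, unsupported triples are rejected rather than stored, so hallucinations drop sharply. The pipeline needs more compute than single-pass prompting, but every triple is auditable.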
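The supplement step can be sketched as a coverage check: which anchored spans did no triple use? This is an illustration under assumed data shapes (the `subject_span` / `object_span` keys and the helper names are hypothetical), not the AEVS reference implementation.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]


def uncovered_anchors(anchors: List[Span], triples: List[Dict]) -> List[Span]:
    """Return anchored spans that no extracted triple refers to."""
    used: Set[Span] = set()
    for t in triples:
        used.add(tuple(t["subject_span"]))
        used.add(tuple(t["object_span"]))
    return [a for a in anchors if a not in used]


text = "Tim Cook was born in Alabama. He later became CEO of Apple."


def span_of(surface: str) -> Span:
    """Locate the first occurrence of a surface string in the text."""
    start = text.index(surface)
    return (start, start + len(surface))


anchors = [span_of("Tim Cook"), span_of("Alabama"), span_of("Apple")]
triples = [{
    "subject_span": span_of("Tim Cook"),
    "relation": "bornIn",
    "object_span": span_of("Alabama"),
}]

missing = uncovered_anchors(anchors, triples)
print([text[s:e] for s, e in missing])  # ['Apple']
```

Spans left uncovered are candidates for a second extraction pass; here "Apple" was anchored but never appeared in a triple.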
The open-vs-closed tradeoff.
- Closed ontology. Fixed property list (e.g., Wikidata's 11,000+ properties). Predictable. Queryable. Hard to invent.
- Open IE. Any verbal phrase becomes a relation. High recall. Low precision. Messy to query.
Production KGs usually mix: open IE for discovery, then canonicalize relations onto a closed ontology before merging into the main graph.
Build It
Step 1: pattern-based extraction
PATTERNS = [
(r"(?P<s>[A-Z]\w+) (?:is|was) (?:a|an|the) (?P<o>[A-Z]?\w+)", "isA"),
(r"(?P<s>[A-Z]\w+) (?:is|was) born in (?P<o>\w+)", "bornIn"),
(r"(?P<s>[A-Z]\w+) works? (?:at|for) (?P<o>[A-Z]\w+)", "worksAt"),
(r"(?P<s>[A-Z]\w+) founded (?P<o>[A-Z]\w+)", "founded"),
]
See code/main.py for the full toy extractor. Hearst patterns still ship in domain-specific pipelines because they are debuggable.
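A minimal sketch of how such a pattern list could be applied, independent of whatever `code/main.py` does. The function name `extract_patterns` is hypothetical, and the block repeats three of the patterns so it runs on its own.

```python
import re

PATTERNS = [
    (r"(?P<s>[A-Z]\w+) (?:is|was) born in (?P<o>\w+)", "bornIn"),
    (r"(?P<s>[A-Z]\w+) works? (?:at|for) (?P<o>[A-Z]\w+)", "worksAt"),
    (r"(?P<s>[A-Z]\w+) founded (?P<o>[A-Z]\w+)", "founded"),
]


def extract_patterns(text):
    """Return (subject, relation, object, match_span) for every pattern hit."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in re.finditer(pattern, text):
            # m.span() is the character range of the whole match: free provenance.
            triples.append((m.group("s"), relation, m.group("o"), m.span()))
    return triples


for triple in extract_patterns("Jobs founded Apple. Cook works at Apple."):
    print(triple)
```

Note the limit of these patterns: `[A-Z]\w+` captures a single token, so "Tim Cook works at Apple" yields the subject "Cook", not "Tim Cook". Feeding NER spans into the patterns instead of raw capitalized words is the usual fix.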
Step 2: supervised relation extraction
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# REBEL is an encoder-decoder model, so it needs the seq2seq class to call generate().
tok = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
text = "Tim Cook was born in Alabama. He later became CEO of Apple."
encoded = tok(text, return_tensors="pt", truncation=True)
output = model.generate(**encoded, max_length=200)
# Keep special tokens: the <triplet>, <subj>, <obj> markers delimit the triples.
triples = tok.batch_decode(output, skip_special_tokens=False)
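REBEL is a seq2seq relation extractor: text in, linearized triples out, with relations expressed as Wikidata property labels (for example "place of birth") that still need mapping to property ids. Fine-tuned on distant-supervision data. Standard open-weights baseline.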
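The decoded string still has to be parsed into tuples. The sketch below assumes the linearization `<triplet> subject <subj> object <obj> relation`, which is how the REBEL model card's parsing helper reads it as far as I can tell; check it against real model output before relying on it. The function name `parse_rebel` is hypothetical, and the sample string is hand-written, not actual model output.

```python
def parse_rebel(decoded: str):
    """Parse a REBEL-style linearized string into (subject, relation, object) tuples."""
    triples = []
    subject = obj = relation = ""
    state = None
    cleaned = decoded.replace("<s>", " ").replace("</s>", " ").replace("<pad>", " ")
    for token in cleaned.split():
        if token == "<triplet>":
            # A new subject starts; flush any finished triple first.
            if relation:
                triples.append((subject.strip(), relation.strip(), obj.strip()))
            subject, obj, relation, state = "", "", "", "subject"
        elif token == "<subj>":
            # Another object for the same subject; flush the previous one.
            if relation:
                triples.append((subject.strip(), relation.strip(), obj.strip()))
            obj, relation, state = "", "", "object"
        elif token == "<obj>":
            relation, state = "", "relation"
        elif state == "subject":
            subject += " " + token
        elif state == "object":
            obj += " " + token
        elif state == "relation":
            relation += " " + token
    if subject and relation and obj:
        triples.append((subject.strip(), relation.strip(), obj.strip()))
    return triples


sample = "<s><triplet> Tim Cook <subj> Alabama <obj> place of birth <subj> Apple <obj> employer</s>"
print(parse_rebel(sample))
# [('Tim Cook', 'place of birth', 'Alabama'), ('Tim Cook', 'employer', 'Apple')]
```

The marker names are counterintuitive: under this assumed format `<subj>` ends the subject and introduces the object text, and `<obj>` ends the object and introduces the relation label.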
Step 3: LLM-prompted extraction with anchoring
prompt = f"""Extract (subject, relation, object) triples from the text.
For each triple, include the exact character span in the source text.
Text: {text}
Output JSON:
[{{"subject": {{"text": "...", "span": [start, end]}},
"relation": "...",
"object": {{"text": "...", "span": [start, end]}}}}, ...]
Only include triples fully supported by the text. No inference beyond what is stated.
"""
Verify every returned span against the source. Reject anything where text[start:end] != triple_entity. This is the AEVS "verify" step in its minimal form.
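A minimal sketch of that check, assuming the model returned JSON in the shape the prompt asks for. The function name `verify_triples` and the sample response are hypothetical.

```python
import json


def verify_triples(text: str, raw_json: str):
    """Split LLM triples into (kept, rejected) by checking entity spans against the source."""
    kept, rejected = [], []
    for triple in json.loads(raw_json):
        ok = True
        for role in ("subject", "object"):
            start, end = triple[role]["span"]
            in_bounds = 0 <= start < end <= len(text)
            # The quoted text must appear at exactly the claimed offsets.
            if not in_bounds or text[start:end] != triple[role]["text"]:
                ok = False
        (kept if ok else rejected).append(triple)
    return kept, rejected


text = "Tim Cook was born in Alabama."
response = json.dumps([
    {"subject": {"text": "Tim Cook", "span": [0, 8]},
     "relation": "was born in",
     "object": {"text": "Alabama", "span": [21, 28]}},
    # Unsupported: "Apple" never appears in the text.
    {"subject": {"text": "Tim Cook", "span": [0, 8]},
     "relation": "works at",
     "object": {"text": "Apple", "span": [21, 26]}},
])

kept, rejected = verify_triples(text, response)
print(len(kept), len(rejected))  # 1 1
```

This only confirms that the entities are present at the stated offsets. It does not confirm that the relation between them is stated; checking the relation phrase needs its own anchored span.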
Step 4: canonicalize onto a closed ontology
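RELATION_MAP = {
# Keys are lowercase because canonicalize() lowercases its input before lookup.
"is the ceo of": "P169", # "chief executive officer" (inverted: the organization is the subject)
"was born in": "P19", # "place of birth"
"founded": "P112", # "founded by" (inverted subject/object)
"works at": "P108", # "employer"
}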
def canonicalize(relation):
rel_low = relation.lower().strip()
if rel_low in RELATION_MAP:
return RELATION_MAP[rel_low]
return None # drop unmapped open relations or route to manual review
Canonicalization is often 60-80% of the engineering work. Budget for it.
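Part of that work is direction: some Wikidata properties point the opposite way from the surface phrase. A sketch of one way to handle it, storing a swap flag next to each property id. The table name `CANON` and the function `canonicalize_triple` are hypothetical.

```python
# phrase -> (property id, swap subject and object?)
CANON = {
    "was born in": ("P19", False),     # place of birth: person -> place
    "works at": ("P108", False),       # employer: person -> organization
    "founded": ("P112", True),         # founded by: organization -> founder
    "is the ceo of": ("P169", True),   # chief executive officer: organization -> person
}


def canonicalize_triple(subject, relation, obj):
    """Map an open-IE triple onto a property id, flipping direction where needed."""
    entry = CANON.get(relation.lower().strip())
    if entry is None:
        return None  # unmapped: drop or route to manual review
    prop_id, swap = entry
    return (obj, prop_id, subject) if swap else (subject, prop_id, obj)


print(canonicalize_triple("Steve Jobs", "founded", "Apple"))
# ('Apple', 'P112', 'Steve Jobs')
print(canonicalize_triple("Tim Cook", "was born in", "Alabama"))
# ('Tim Cook', 'P19', 'Alabama')
```

Without the flag, "Steve Jobs founded Apple" would load as (Steve Jobs, founded by, Apple), which states the reverse of the source sentence.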
Step 5: build a small graph and query
triples = extract(text) # extract() stands for any extractor above that returns canonicalized (s, r, o) tuples
graph = {}
for s, r, o in triples:
graph.setdefault(s, []).append((r, o))
def neighbors(node, relation=None):
return [(r, o) for r, o in graph.get(node, []) if relation is None or r == relation]
print(neighbors("Tim Cook", relation="P108")) # -> [('P108', 'Apple')] if the triples include ("Tim Cook", "P108", "Apple")
This is the atom of every RAG-over-KG system. Scale it with RDF triple stores (Blazegraph, Virtuoso), property graphs (Neo4j), or vector-augmented graph stores.
Pitfalls
- Coreference before RE. "He founded Apple" — RE needs to know who "he" is. Run coref first (lesson 24).
- Entity canonicalization. "Apple Inc" and "Apple" must resolve to the same node. Entity linking first (lesson 25).
- Hallucinated triples. LLMs emit triples the text does not support. Enforce span verification.
- Relation canonicalization drift. Open IE relations are inconsistent ("was born in," "came from," "is a native of"). Collapse to canonical ids or the graph is unqueryable.
- Temporal errors. "Tim Cook is CEO of Apple": true now, false in 2005. Many relations are time-bounded. Use qualifiers (P580 start time, P582 end time in Wikidata).
- Domain mismatch. REBEL trained on Wikipedia. Legal, medical, and scientific text often need domain-fine-tuned RE models.
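The temporal pitfall can be made concrete with a sketch of a time-bounded edge. The class `TimedEdge`, its year-granularity fields, and `holds_in` are simplifying assumptions; Wikidata's P580 / P582 qualifiers carry full dates with precision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimedEdge:
    subject: str
    relation: str
    obj: str
    start_year: Optional[int] = None  # plays the role of P580 (start time)
    end_year: Optional[int] = None    # plays the role of P582 (end time)


def holds_in(edge: TimedEdge, year: int) -> bool:
    """True if the edge is valid in the given year; a missing bound means open-ended."""
    after_start = edge.start_year is None or edge.start_year <= year
    before_end = edge.end_year is None or year <= edge.end_year
    return after_start and before_end


edge = TimedEdge("Apple", "P169", "Tim Cook", start_year=2011)
print(holds_in(edge, 2005))  # False
print(holds_in(edge, 2020))  # True
```

Queries then filter on `holds_in` instead of treating every stored edge as currently true.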
Use It
The 2026 stack:
| Situation | Pick |
|---|---|
| Fast production, general domain | REBEL or LlamaPred with Wikidata canonicalization |
| Domain-specific (biomed, legal) | SciREX-style domain fine-tune + custom ontology |
| LLM-prompted, audited output | AEVS pipeline: anchor → extract → verify → supplement |
| High-volume news IE | Pattern-based + supervised hybrid |
| Building a KG from scratch | Open IE + manual canonicalization pass |
| Temporal KG | Extract with qualifiers (start/end time, point in time) |
The integration pattern: NER → coref → entity linking → relation extraction → ontology mapping → graph load. Every stage is a potential quality gate.
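One way to read "quality gate" in code: each stage is paired with a check that must pass before the next stage runs. A toy sketch with stub stages; the function `run_pipeline`, the stage names, and the stubbed outputs are all hypothetical.

```python
def run_pipeline(doc, stages):
    """Run (name, stage, gate) triples in order; stop at the first failed gate."""
    state = doc
    for name, stage, gate in stages:
        state = stage(state)
        if not gate(state):
            raise ValueError(f"quality gate failed after stage: {name}")
    return state


# Stub stages standing in for real NER and RE models.
def ner(doc):
    return {**doc, "entities": ["Tim Cook", "Apple"]}


def relation_extraction(doc):
    return {**doc, "triples": [("Tim Cook", "P108", "Apple")]}


def triples_use_known_entities(doc):
    # Gate: every triple endpoint must be an entity found upstream.
    known = set(doc["entities"])
    return all(s in known and o in known for s, _, o in doc["triples"])


stages = [
    ("ner", ner, lambda d: len(d["entities"]) >= 2),
    ("relation_extraction", relation_extraction, triples_use_known_entities),
]

result = run_pipeline({"text": "Tim Cook works at Apple."}, stages)
print(result["triples"])  # [('Tim Cook', 'P108', 'Apple')]
```

Failing loudly at the stage that broke is cheaper than discovering bad edges after graph load.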
Ship It
Save as outputs/skill-re-designer.md:
---
name: re-designer
description: Design a relation extraction pipeline with provenance and canonicalization.
version: 1.0.0
phase: 5
lesson: 26
tags: [nlp, relation-extraction, knowledge-graph]
---
Given a corpus (domain, language, volume) and downstream use (KG-RAG, analytics, compliance), output:
1. Extractor. Pattern-based / supervised / LLM / AEVS hybrid. Reason tied to precision vs recall target.
2. Ontology. Closed property list (Wikidata / domain) or open IE with canonicalization pass.
3. Provenance. Every triple carries source char-span + doc id. Non-negotiable for audit.
4. Merge strategy. Canonical entity id + relation id + temporal qualifiers; dedup policy.
5. Evaluation. Precision / recall on 200 hand-labelled triples + hallucination-rate on LLM-extracted sample.
Refuse any LLM-based RE pipeline without span verification (source provenance). Refuse open-IE output flowing into a production graph without canonicalization. Flag pipelines with no temporal qualifier on time-bounded relations (employer, spouse, position).
Exercises
- Easy. Run the pattern extractor in code/main.py on 5 news-article sentences. Hand-check precision.
- Medium. Use REBEL (or a small LLM) on the same sentences. Compare triples. Which extractor has higher precision? Higher recall?
- Hard. Build the AEVS pipeline: extract with LLM + verify spans against source. Measure hallucination rate before vs after the verify step on 50 Wikipedia-style sentences.
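For scoring the exercises, a sketch of the two measurements. Exact-match comparison of (s, r, o) tuples and defining hallucination rate as the share of triples rejected by span verification are assumptions; the function names and sample triples are illustrative.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (s, r, o) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def hallucination_rate(n_rejected: int, n_total: int) -> float:
    """Share of extracted triples that failed span verification."""
    return n_rejected / n_total if n_total else 0.0


gold = [
    ("Tim Cook", "P19", "Alabama"),
    ("Tim Cook", "P108", "Apple"),
    ("Apple", "P169", "Tim Cook"),
    ("Apple", "P112", "Steve Jobs"),
]
predicted = [
    ("Tim Cook", "P19", "Alabama"),
    ("Tim Cook", "P108", "Google"),  # wrong object
]

print(precision_recall(predicted, gold))  # (0.5, 0.25)
print(hallucination_rate(3, 50))          # 0.06
```

Exact match is strict: "Apple Inc" versus "Apple" counts as a miss, so run entity canonicalization before scoring.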
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Triple | Subject-relation-object | (s, r, o) tuple that is the atomic unit of a KG. |
| Open IE | Extract anything | Open-vocabulary relation phrases; high recall, low precision. |
| Closed ontology | Fixed schema | Bounded set of relation types (Wikidata, UMLS, FIBO). |
| Canonicalization | Normalize everything | Map surface names / relations to canonical ids. |
| AEVS | Grounded extraction | Anchor-Extraction-Verification-Supplement pipeline (2026). |
| Provenance | Source-of-truth link | Every triple carries a doc id + char-span to its source. |
| Distant supervision | Cheap labels | Align text with an existing KG to create training data. |
Further Reading
- Mintz et al. (2009). Distant supervision for relation extraction without labeled data — the distant-supervision paper.
- Huguet Cabot, Navigli (2021). REBEL: Relation Extraction By End-to-end Language generation — seq2seq RE workhorse.
- Wadden et al. (2019). Entity, Relation, and Event Extraction with Contextualized Span Representations (DyGIE++) — joint IE.
- AEVS — Anchor-Extraction-Verification-Supplement framework — 2026 hallucination-mitigation design.
- Wikidata SPARQL tutorial — canonical graph queries.