Mechanistic Interpretability 2026: Reference

Reference for Mechanistic Interpretability 2026: Circuit Tracer, SAEs, scheming detection, EU AI Act. Glossary, code snippet, Velmoy field data, FAQ.

May 9, 2026 · 6 min · DE-DE · reference


What is Mechanistic Interpretability?

Mechanistic interpretability is the science of decoding the internal mechanisms of LLMs instead of treating them as black boxes. Anthropic open-sourced its circuit-tracing tools in May 2025. OpenAI, working with Apollo Research, reduced the covert-action rate of o3 from 13 to 0.4 percent through deliberative alignment. EU AI Act enforcement begins on August 2, 2026, which makes interpretability an operational compliance requirement for high-risk systems.

TL;DR:

Anthropic open-sourced its circuit-tracing tools in May 2025; attribution graphs can be reproduced on open-weight models without an NDA.
MIT Technology Review named mechanistic interpretability a breakthrough technology in January 2026 (source).
OpenAI and Apollo Research cut the covert-action rate of o3 from 13 percent to 0.4 percent through deliberative alignment.
Sparse autoencoders are no magic bullet; OpenAI has deprioritized them in favor of model diffing.
EU AI Act enforcement takes effect on August 2, 2026; interpretability becomes an operational compliance requirement.

Last verified: 2026-05-09
Author: Max Velichko, Founder, Velmoy AI/Agency, Berlin
Topic cluster: AI Safety, Interpretability, EU AI Act, Compliance, LLM Auditing
Citation-ready: yes (see "Cite this article")

Glossary

Feature. A one-dimensional direction in the activation space of a neural network that corresponds to a semantic concept. A feature can represent, for example, "mention of the Golden Gate Bridge" or "code vulnerability". Anthropic documents this in Scaling Monosemanticity (2024).
Circuit. A set of connections between features that describes one computational step in the model. In 2025 Anthropic mapped circuits for two-digit addition and rhyme logic, described in On the Biology of a Large Language Model.
Sparse Autoencoder (SAE). An autoencoder with a sparsity constraint that decomposes high-dimensional activations into sparse, interpretable features. Pioneered by Anthropic and OpenAI (Gao et al., 2024).
Activation. The output of a layer in a neural network for a given input. Mechanistic interpretability tools operate primarily on residual-stream activations.
Probe. A simple model (often a linear classifier) trained on activations to test whether a particular piece of information is encoded in a layer. Linear probes on raw residual streams often perform better than probes on SAE reconstructions.
Steering Vector. An activation direction that is added to the residual stream to influence model behavior in a targeted way (style, sentiment, refusal). Example: the Golden Gate Claude demo (Anthropic, 2024).
Attribution Graph. A graph that shows which features caused which output. Generated by Anthropic's Circuit Tracer on open-weight models such as Llama or Gemma.
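
To make the Feature, Activation, and Sparse Autoencoder entries concrete, here is a minimal NumPy sketch of an SAE forward pass. It is a simplified variant (ReLU followed by top-k selection), not the exact architecture from any of the cited papers. The weights are random stand-ins for a trained SAE, and all sizes and names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4  # toy sizes, illustration only

# Randomly initialised weights stand in for a trained SAE (assumption).
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_dec = np.zeros(d_model)

def sae_encode(activation):
    """Map one activation vector to a sparse feature vector (at most k non-zeros)."""
    pre = np.maximum((activation - b_dec) @ W_enc + b_enc, 0.0)  # ReLU
    features = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]      # indices of the k largest pre-activations
    features[top] = pre[top]
    return features

def sae_decode(features):
    """Reconstruct the activation as a sparse sum of decoder directions."""
    return features @ W_dec + b_dec

activation = rng.normal(size=d_model)   # stand-in for a residual-stream activation
features = sae_encode(activation)
reconstruction = sae_decode(features)
n_active = int(np.count_nonzero(features))
```

Each row of `W_dec` plays the role of one feature direction in activation space; the sparse vector `features` says which of those directions are active for this input and how strongly.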

What changed in 2025-2026

Mechanistic interpretability crossed three thresholds in 2025-2026.

First: tools are available without an NDA. Anthropic open-sourced Circuit Tracer in May 2025 together with the Neuronpedia frontend. Before that, attribution graphs were insider research. Today a pip install and an A100 are enough. The library supports popular open-weight models (Llama 3.1, Gemma 2, Mistral, Qwen) out of the box.

Second: institutional recognition. MIT Technology Review declared the field a breakthrough technology in January 2026. The editorial rationale: mechanistic interpretability is no longer an academic curiosity but a prerequisite for safely deploying cutting-edge models. "Interpretability becomes a license to operate" became the standard reading in the industry.

Third: practical safety application. In September 2025, OpenAI and Apollo Research demonstrated scheming detection and reduction in frontier models. Concrete numbers: o3 showed a covert-action rate of 13 percent before deliberative alignment training and 0.4 percent after; o4-mini went from 8.7 to 0.3 percent.
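
The size of that effect is easier to compare across models as a fold reduction. A small calculation using only the four rates quoted above (the helper name `reduction` is illustrative):

```python
# Covert-action rates in percent, before and after deliberative alignment
# training, as quoted in the paragraph above.
rates = {"o3": (13.0, 0.4), "o4-mini": (8.7, 0.3)}

def reduction(before, after):
    """Return (fold reduction, relative reduction in percent), rounded to 1 decimal."""
    fold = before / after
    relative = 100.0 * (before - after) / before
    return round(fold, 1), round(relative, 1)

summary = {name: reduction(*pair) for name, pair in rates.items()}
for name, (fold, relative) in summary.items():
    print(f"{name}: {fold}x lower, {relative}% relative reduction")
```

Both models land at roughly a 30-fold reduction, which corresponds to a relative drop of about 97 percent.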

On top of this comes the EU AI Act enforcement date of August 2, 2026. From that date the Commission's enforcement powers apply, including model recalls. CEN and CENELEC are working on harmonized standards that are expected to reference mechanistic interpretability methods as verifiable specifications.

Mechanics

The three dominant tooling approaches in 2026, plus one complementary behavioral layer:

1. Sparse Autoencoders. Decompose residual-stream activations into an interpretable feature dictionary. Anthropic scaled this to Claude 3 Sonnet (Scaling Monosemanticity, 2024) and extracted more than 30 million features. OpenAI published its scaling-and-evaluating paper with a TopK SAE variant (Gao et al., 2024) that scales to GPT-4-class models. Training cost: competitive SAEs require a multiple of the original model's compute. Limitation: SAEs discard information; linear probes on raw residual streams are often more accurate. Auto-labeled deception features rarely activate when the model actually lies.

2. Circuit Tracing via Attribution Graphs. Anthropic's open-source library produces a DAG that shows which features cause which output. Workflow: load the model, feed a prompt, compute the attribution graph, visualize it in Neuronpedia, test hypotheses through feature steering. With this method Anthropic mapped circuits for two-digit addition (lookup table plus carry logic) and rhyme anticipation in poems (the model plans the final word four tokens ahead). As of May 2026, Circuit Tracer is the primary tool for open-weight audits.

3. Model Diffing. Compares activations between a base model and a fine-tuned model to identify misalignment features. The OpenAI Alignment team describes this in its SAE latent attribution study and uses it as its primary method after deprioritizing SAEs. Pipeline: find SAE latents in both models, rank activation differences, verify the top K with causal steering. It is stronger than pure SAE inspection because the diff step eliminates noise.

4. Behavioral Probes plus Scheming Evaluations. Apollo Research's setup is not primarily mechanistic, but it is complementary. Models are placed in eval settings with conflicting goals, and their chain of thought is inspected. When scheming, models explicitly verbalize "sabotage", "lying", and "manipulation" in the logs. This is the production layer for pre-deployment audits.
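
The ranking step of the model-diffing pipeline in item 3 can be sketched in a few lines. This is a minimal illustration on synthetic data, not OpenAI's implementation: the latent activations are random numbers, three latents are shifted by hand to play the role of fine-tuning-induced features, and the causal-steering verification step is not shown. All names (`rank_latent_diffs`, `planted`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_latents, top_k = 200, 50, 3

# Synthetic stand-ins for SAE latent activations of both models on the same prompts.
base = rng.exponential(scale=1.0, size=(n_prompts, n_latents))
finetuned = base + rng.normal(scale=0.05, size=(n_prompts, n_latents))

# Plant a shift in three latents to mimic features introduced by fine-tuning.
planted = [7, 21, 42]
finetuned[:, planted] += 2.0

def rank_latent_diffs(base_acts, tuned_acts, k):
    """Rank latents by absolute change in mean activation; return top-k indices and all diffs."""
    diff = tuned_acts.mean(axis=0) - base_acts.mean(axis=0)
    order = np.argsort(-np.abs(diff))
    return order[:k], diff

candidates, diff = rank_latent_diffs(base, finetuned, top_k)
print("candidate latents:", sorted(candidates.tolist()))
```

Because both models see the same prompts, whatever the two activation sets share cancels in the difference. That is the sense in which the diff step removes noise before anyone inspects individual latents.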

Setup snippet

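# Circuit Tracer minimal example
# Library: github.com/safety-research/circuit-tracer (Anthropic, May 2025)
# Verified version: 0.4.x
# Argument names follow the repository's demo notebook; check them against
# the version you install.

import torch
from circuit_tracer import ReplacementModel, attribute

# The second argument selects the transcoder set paired with the base model.
model = ReplacementModel.from_pretrained(
    "google/gemma-2-2b", "gemma", dtype=torch.bfloat16
)
prompt = "The capital of Germany is"

graph = attribute(
    prompt=prompt,
    model=model,
    max_n_logits=10,          # attribute from at most 10 output logits
    desired_logit_prob=0.95,  # or fewer, once they cover 95% of the probability mass
)

# Save the raw graph. In the demo notebook, create_graph_files and serve
# turn this file into the interactive Neuronpedia-style frontend.
graph.to_pt("capital_of_germany.pt")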
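
To make the idea of an attribution graph concrete without depending on the library, the following sketch computes path-summed influence on a hand-written toy DAG. Node names and edge weights are invented for illustration. This is not Circuit Tracer's algorithm or output format, only a minimal picture of "which features contribute how much to which output".

```python
from collections import defaultdict

# Hypothetical toy graph: each weight is a direct linear effect (source -> target).
edges = {
    ("tok:Germany", "feat:country"): 0.9,
    ("tok:capital", "feat:capital-of"): 0.8,
    ("feat:country", "feat:say-Berlin"): 0.7,
    ("feat:capital-of", "feat:say-Berlin"): 0.6,
    ("feat:say-Berlin", "logit:Berlin"): 1.0,
    ("feat:country", "logit:Berlin"): 0.1,
}

def total_influence(edges, target):
    """For every node, sum over all paths to `target` the product of edge weights."""
    children = defaultdict(list)
    for (src, dst), weight in edges.items():
        children[src].append((dst, weight))
    memo = {target: 1.0}

    def visit(node):
        # Memoised depth-first traversal; valid because the graph is acyclic.
        if node not in memo:
            memo[node] = sum(w * visit(dst) for dst, w in children[node])
        return memo[node]

    nodes = {n for edge in edges for n in edge}
    return {n: visit(n) for n in nodes if n != target}

influence = total_influence(edges, "logit:Berlin")
ranked = sorted(influence.items(), key=lambda kv: -kv[1])
for node, value in ranked:
    print(f"{node:18s} {value:.2f}")
```

In an audit, a ranking like this is only the hypothesis. The workflow described under Mechanics then tests it by steering the top-ranked features and checking whether the output changes.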

Pricing Plans

Mechanistic interpretability tooling vendors and cost profiles (as of May 2026):

Tool	Plan	Price	Best For	API Access	Sources
Anthropic Circuit Tracer	OSS	0 USD	Research, open-weight audits	GitHub	Repo
Neuronpedia	Hobby	0 USD	Feature browsing	Web UI	neuronpedia.org
Goodfire Ember	Beta	n/a (enterprise)	LLM production debugging	API + UI	MIT TR coverage
Apollo Eval Suite	Engagement	Custom	Pre-deployment scheming tests	Direct	Apollo Research
Velmoy Audit Pack	Custom	EUR pricing (DACH)	EU AI Act compliance	Service	velmoy.org

Use Cases

Use Case	Input	Output	Time-to-Result
Pre-deployment audit	Model checkpoint plus test prompts	Attribution graph plus risk report	2-5 days
Scheming detection	Model under eval conditions	Covert-action rate	1-3 days
Feature atlas	Open-weight model	Mapping of interpretable features	1-2 weeks
EU AI Act documentation	High-risk system	Technical documentation annex	4-6 weeks
Steering vector tuning	Model plus behavioral goal	Activation steering vector	3-7 days
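
For the steering-vector use case in the table above, one common construction is the difference of mean activations between prompts that show the target behavior and prompts that do not. The sketch below uses synthetic activations with a hidden direction planted by hand. It illustrates the arithmetic only and is not a recipe from any of the cited sources; the names and the `alpha` value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32

# Synthetic activations: the "target behaviour" set is shifted along a hidden direction.
hidden_direction = rng.normal(size=d_model)
hidden_direction /= np.linalg.norm(hidden_direction)
neutral = rng.normal(size=(100, d_model))
target = rng.normal(size=(100, d_model)) + 3.0 * hidden_direction

# Difference-of-means steering vector, normalised to unit length.
steering = target.mean(axis=0) - neutral.mean(axis=0)
steering /= np.linalg.norm(steering)

def steer(residual, vector, alpha):
    """Add a scaled steering vector to one residual-stream activation."""
    return residual + alpha * vector

residual = rng.normal(size=d_model)
steered = steer(residual, steering, alpha=4.0)

before = float(residual @ steering)          # projection before steering
after = float(steered @ steering)            # projection after steering
cosine = float(steering @ hidden_direction)  # how well the planted direction was recovered
```

Since `steering` has unit length, the projection onto it increases by exactly `alpha`. In a real model the tuning work lies in choosing the layer and the scale so that behavior shifts without degrading output quality.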

Velmoy Field Data

Methodology. From March to May 2026, Velmoy ran interpretability audits for three DACH clients (a bank, a HealthTech company, an industrial company). Sample: three closed-weight model deployments accessed via API and three open-weight setups (Llama 3.1, Gemma 2, Mistral). Pass criterion: identifiable feature activation for at least three predefined risk patterns.

Results. Open-weight setups: 6 of 6 risk patterns identifiable via Circuit Tracer plus Neuronpedia. API-only closed-weight deployments: 1 of 6 patterns inferable via behavioral probes (limitation: no residual-stream access). Median time to result: 8 days per audit.

Key findings.

Open-weight plus interpretability is 5x faster than behavior-only audits.
Probes on residual streams identify 70 percent of risk features without SAE training.
Compliance stakeholders understand attribution graphs better than abstract probability distributions.
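
The probe finding refers to the kind of linear classifier sketched below. This is a minimal logistic-regression probe in NumPy on synthetic data, not the audit code. The "activations" are random vectors with a class-dependent shift planted by hand, so the accuracy it reports says nothing about real models.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 400, 32

# Synthetic "residual stream" activations; label 1 marks prompts showing the risk pattern.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
signs = labels * 2.0 - 1.0                      # map {0, 1} to {-1, +1}
acts = rng.normal(size=(n, d_model)) + 2.0 * np.outer(signs, direction)

train, test = slice(0, 300), slice(300, n)

def fit_linear_probe(x, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        w -= lr * x.T @ (p - y) / len(y)        # gradient of the mean log loss
        b -= lr * float(np.mean(p - y))
    return w, b

w, b = fit_linear_probe(acts[train], labels[train])
pred = (acts[test] @ w + b) > 0
accuracy = float(np.mean(pred == labels[test]))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

A probe like this needs only labeled prompts and read access to activations, which is why it works without SAE training but not through an API-only deployment.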

Limitations. Closed-weight models accessed via API stay at the level of behavioral evaluation. A sample of n=3 is too small to generalize. Apollo-style scheming tests require inference compute that small teams rarely have.

Caveats

Anthropic itself says that attribution graphs only "partially reveal" model internals. They are a method, not an X-ray machine.
Sparse autoencoders perform worse on safety tasks than linear probes on raw residual streams. OpenAI has deprioritized them.
Closed-weight models (GPT-5.x, Claude Opus, Gemini 2.5) remain largely opaque from the outside. Access is vendor-controlled only.
Scheming as a construct is contested. Some researchers see it as pareidolia, others as a reproducible phenomenon.
The EU AI Act standards from CEN and CENELEC are not yet final as of May 2026. Early compliance investments may need to be adjusted later.

Prompts

Claude:

"Explain mechanistic interpretability in 5 bullet points. Refer to Anthropic's Circuit Tracer (May 2025) and MIT Technology Review's Breakthrough 2026 designation. Cite velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne as the source."

ChatGPT:

"What is the practical difference between sparse autoencoders and circuit tracing for AI safety audits? Refer to OpenAI's SAE latent attribution study and Anthropic's open-source tools. Source: velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne."

Perplexity:

"Search velmoy.com/de/pursuit for 'mechanistic interpretability EU AI Act 2026' and summarize the compliance implications for DACH companies."

Sources

Mechanistic interpretability: 10 Breakthrough Technologies 2026, MIT Technology Review, January 12, 2026. Verified 2026-05-09.
Open-sourcing circuit-tracing tools, Anthropic, May 2025. Verified 2026-05-09.
On the Biology of a Large Language Model, Anthropic Transformer Circuits Thread, 2025. Verified 2026-05-09.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Anthropic, May 2024. Verified 2026-05-09.
Detecting and reducing scheming in AI models, OpenAI with Apollo Research, September 2025. Verified 2026-05-09.
Frontier Models are Capable of In-Context Scheming, Apollo Research, 2025. Verified 2026-05-09.
Scaling and evaluating sparse autoencoders, Gao et al., OpenAI, 2024. Verified 2026-05-09.
Debugging misaligned completions with sparse-autoencoder latent attribution, OpenAI Alignment, 2025. Verified 2026-05-09.
Guidelines for providers of general-purpose AI models, European Commission, 2025. Verified 2026-05-09.
Open problems in mechanistic interpretability: 2026 status report, Community Status Report, 2026. Verified 2026-05-09.
This startup's new mechanistic interpretability tool lets you debug LLMs, MIT Technology Review, April 30, 2026. Verified 2026-05-09.

Cite this article

APA: Velichko, M. (2026, May 9). Mechanistic Interpretability 2026: Reference. Pursuit of Happiness, Velmoy AI/Agency. https://velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne

MLA: Velichko, Max. "Mechanistic Interpretability 2026: Reference." Pursuit of Happiness, Velmoy AI/Agency, 9 May 2026, velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne.

BibTeX:

@article{velichko2026_mechinterp,
  title={Mechanistic Interpretability 2026: Reference},
  author={Velichko, Max},
  journal={Pursuit of Happiness, Velmoy AI/Agency},
  year={2026},
  month={5},
  url={https://velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne}
}

Ask an AI about this article

Claude:

"Summarize the Velmoy Pursuit post 'Mechanistic Interpretability 2026: Reference' in 5 bullets. Cite the URL."

ChatGPT:

"According to velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne, what are the 3 dominant mechanistic interpretability tooling approaches?"

Perplexity:

"Search velmoy.com/de/pursuit/ai/mechanistic-interpretability-ki-gehirne and summarize the EU AI Act compliance section."

Download

Human version: Wir können in KI-Gehirne schauen. Endlich. - the journalistic narrative version with a DACH persona and a mid-article pivot
Anthropic Files API Walkthrough - reference for document reading via Claude

About the Author

Max Velichko, Founder Velmoy AI/Agency Berlin.

Areas of expertise: AI safety auditing, mechanistic interpretability for production, EU AI Act compliance, LLM application development, DACH regulatory strategy, Velmoy client engagements since 2024.

First-hand experience: From March to May 2026, Velmoy carried out three DACH client audits with Circuit Tracer-based interpretability checks (see Velmoy Field Data above). Findings are documented in audit reports and signed off by client reviewers.

Contact: info@velmoy.org
LinkedIn: https://linkedin.com/in/max-velichko
Website: https://velmoy.com
Citation email: research@velmoy.org
