# 🛡️ Privacy-First Telemetry
| Field | Value |
|---|---|
| Status | Draft |
| Author | Vivek Kalyanarangan |
| Created | 2026-02-11 |
| Component | backend/ai-engine/src/observability/ |
| Related | Opik integration, V3 mindmap pipeline |
## Principles

### 1. Content Never Leaves the Process Boundary Unless Explicitly Opted In
User content and LLM derivatives (node labels, summaries, semantic anchors) are privacy-sensitive. LLM derivatives are lossy compressions of user input -- someone reading "Patient Cardiac History" or "Competitor threat from X" in an observability dashboard can infer what the user submitted.
Observability data exits the process at a single boundary (the Opik logging call). Sanitization happens at that boundary, nowhere else. The generation pipeline, SSE events, and frontend are completely unaffected.
### 2. Shape, Not Content
The structural shape of LLM outputs is safe to log. Knowing "5 themes extracted, depth=3, complexity=0.72, content_type=technical" reveals pipeline health without revealing what those themes say.
This distinction -- shape vs. content -- is the foundation of the architecture.
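As a concrete illustration (the values are hypothetical; field names follow the `V3AnalysisResult` shape shown later in the trace example), here is the same LLM output viewed as content versus as shape:

```python
# Content (Tier 2, never logged in production): the actual analysis result.
content = {
    "main_topic": "Patient Cardiac History",
    "key_themes": ["Arrhythmia", "Medications", "Family history", "Lifestyle", "Imaging"],
    "suggested_depth": 3,
    "complexity_score": 0.72,
    "content_type": "technical",
}

# Shape (safe to log): counts, lengths, numerics, and enums -- no free text survives.
shape = {
    "schema": "V3AnalysisResult",
    "main_topic_length": 23,
    "key_themes_count": 5,
    "suggested_depth": 3,
    "complexity_score": 0.72,
    "content_type": "technical",
}
```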
### 3. Binary Redaction, Not PII Scrubbing
Content either goes to Opik in full or not at all. No partial redaction, no regex-based PII stripping, no "mask the first 10 characters." These half-measures are fragile, language-dependent, and create false confidence. User content can be in any language, any format, about any topic.
### 4. Privacy Is an Access Control Problem, Not Just a Transit Problem
Even when Opik moves on-prem, developers should not be reading user content through observability dashboards. The sanitizer remains valuable for access control regardless of where Opik is hosted.
## Privacy Taxonomy

| Tier | Data | Examples | Visible to developers? |
|---|---|---|---|
| Tier 1: Raw Input | User-submitted text | The `text` field from the request | No (except dev environments) |
| Tier 2: LLM Derivatives | Generated content derived from user input | `main_topic`, node labels, summaries, semantic anchors | No -- but their shape (counts, lengths, numeric fields) is safe |
| Tier 3: Operational Metadata | Metrics about the process | Token counts, latency, cost, model name, success/failure | Yes, always |
## Three Privacy Levels

Environment variable: `OBSERVABILITY_PRIVACY_LEVEL`

| Level | Intended For | Prompt Content | LLM Response Content | Structural Shape | Operational Metrics |
|---|---|---|---|---|---|
| `full` | Local development | Full text | Full text | Yes | Yes |
| `redacted` | Production (default) | Length only | Counts, keys, numerics only | Yes | Yes |
| `metadata_only` | Ultra-strict compliance | Length only | Nothing | No | Yes |

Default: `redacted`. This provides enough structural signal for MLOps while keeping all text content out of Opik. Use `metadata_only` only if compliance requirements prohibit even structural shapes.
## What You Can Observe at Each Level

### Always available (all levels)
| MLOps Goal | Signal | Source |
|---|---|---|
| Cost monitoring | Cost per request, per LLM call, per model | Token counts + pricing table |
| Latency monitoring | Time per LLM call, total generation time | `latency_ms` per span |
| Error/failure tracking | Which calls fail, failure rates by operation | `success` flag + error type |
| Token efficiency | Input/output token ratio per operation | Usage metrics per span |
| Model comparison | All above metrics segmented by model | Model name in span metadata |
### Available at `redacted` level
| MLOps Goal | Signal | Source |
|---|---|---|
| Pipeline health | "5 themes, depth 3, complexity 0.72" | Numeric fields from `V3AnalysisResult` |
| Generation quality | "4 nodes with 2 connections" vs "Fallback: 1 node" | Node/connection/container counts |
| Semantic anchor quality | "Generated 5 anchors" vs "Fallback: 0" | Count from `V3SemanticAnchorsResult` |
| Content categorization | Segment by `content_type` (technical/narrative/educational/mixed) | Enum from analysis |
| Fallback rate | How often layers fall back to single-node generation | Binary fallback flag per layer |
### Requires `full` mode (dev only)
| MLOps Goal | When Needed |
|---|---|
| Prompt engineering | Iterating on prompt text |
| Deep debugging | Reproducing specific user issues locally with consent |
| Output quality review | Evaluating whether generated labels are good |
### Example: Production trace (`redacted`)
```
Trace: v3_mindmap_generation
  input: {"content_length": 51234, "enable_search": false}
  metadata: {"model": "gpt-4o", "generator": "v3"}

  Span: llm_parse_V3AnalysisResult
    input: {"operation": "parse_V3AnalysisResult", "prompt_length": 51234}
    output: {"success": true, "tokens": 1847,
             "response_shape": {"schema": "V3AnalysisResult",
                                "key_themes_count": 5,
                                "suggested_depth": 3,
                                "complexity_score": 0.72,
                                "content_type": "technical"}}
    usage: {prompt_tokens: 1200, completion_tokens: 647}
    cost: $0.009535 | latency: 2340ms

  Span: llm_parse_V3LayerNodes (layer 1)
    input: {"operation": "parse_V3LayerNodes", "prompt_length": 41087}
    output: {"success": true, "tokens": 923,
             "response_shape": {"schema": "V3LayerNodes",
                                "nodes_count": 4,
                                "connections_count": 2,
                                "containers_count": 1}}
    usage: {prompt_tokens: 780, completion_tokens: 143}
    cost: $0.003380 | latency: 1120ms

  Generation quality:
    {"total_layers": 4, "total_nodes": 13, "fallback_count": 0,
     "avg_nodes_per_layer": 3.25, "anchor_count": 5, "status": "success"}
```
A developer sees pipeline health and cost. They know nothing about what the user submitted or what the nodes say.
## Quality Evaluation Without Content

### Structural Quality Scores
Numeric quality signals computed in-pipeline, with only the scores logged to Opik:
| Signal | Calculation | Healthy Range |
|---|---|---|
| `fallback_count` | Layers that fell back to single-node generation | 0 |
| `avg_nodes_per_layer` | Total non-root nodes / theme layer count | 2.5 - 4.0 |
| `connection_density` | Total connections / total nodes | 0.3 - 1.0 |
| `anchor_count` | Semantic anchors generated | 3 - 7 |
| `anchor_coverage` | Anchors generated / anchors requested | 0.8 - 1.0 |
### LLM-as-Judge (future)
A second LLM evaluates output quality and produces a numeric score (0-1). The generation LLM already processes the user's content, so an evaluation LLM is not a new privacy surface. Only the score is logged, never the rationale text.
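A hedged sketch of what this could look like, assuming an OpenAI-compatible client; the model choice, prompt, and parsing are placeholders, and only the resulting number would be logged:

```python
from openai import AsyncOpenAI


async def judge_quality(client: AsyncOpenAI, source_text: str, labels: list[str]) -> float:
    """Ask a second LLM for a 0-1 quality score; discard the rationale.
    Only the numeric score ever crosses the Opik logging boundary."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": (
                "Rate how well these mindmap labels cover the source text. "
                "Reply with a single number between 0 and 1, nothing else.\n\n"
                f"Source:\n{source_text}\n\nLabels:\n{labels}"
            ),
        }],
    )
    try:
        score = float((response.choices[0].message.content or "").strip())
    except ValueError:
        return 0.0  # unparseable judge output is treated as the lowest score
    return max(0.0, min(1.0, score))
```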
## Architecture

```
Pipeline (full data)                              Opik Boundary
────────────────────                              ─────────────
User Request ──▶ V3MindmapGenerator ──▶ LLM Calls ──▶ TrackedAsyncOpenAI
                        │                                      │
                        │ (SSE events unchanged)               │ (captures full prompt + response)
                        ▼                                      ▼
                    Frontend                       _log_llm_call_to_opik()
                                                               │
                                                     ┌─────────┴─────────┐
                                                     │     SANITIZER     │
                                                     │  (privacy_level)  │
                                                     └─────────┬─────────┘
                                                               │
                                                               ▼
                                                          Opik Cloud
                                                   (only sanitized data)
```
The sanitizer sits at the Opik logging boundary -- the single point where data exits the process. The pipeline, SSE events, and frontend are completely unaffected.
## Implementation Plan

### Step 1: Add privacy level to `OpikTracker`

File: `src/observability/tracker.py`

- Read `OBSERVABILITY_PRIVACY_LEVEL` env var in `initialize()` (default: `redacted`) -- see the sketch after this list
- Validate: must be one of `full`, `redacted`, `metadata_only`
- Expose as `privacy_level` property on the singleton
- Modify `track_generation` and `track_phase` decorators: set `capture_input`/`capture_output` to `True` only when privacy level is `full`
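A minimal sketch of the env-var handling, assuming a plain string level rather than an enum (every name except the env var itself is illustrative):

```python
import os

_VALID_LEVELS = {"full", "redacted", "metadata_only"}


def _resolve_privacy_level() -> str:
    """Read OBSERVABILITY_PRIVACY_LEVEL; fall back to the safe default
    when the variable is missing or carries an unrecognized value."""
    level = os.environ.get("OBSERVABILITY_PRIVACY_LEVEL", "redacted").strip().lower()
    return level if level in _VALID_LEVELS else "redacted"
```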
### Step 2: Create sanitizer module

File: `src/observability/sanitizer.py` (NEW, ~80 lines)

| Function | Purpose |
|---|---|
| `sanitize_input(input_data, privacy_level)` | Replaces prompt content with length indicator |
| `sanitize_output(output_data, privacy_level, schema_name)` | Replaces response content with structural shape |
| `_describe_shape(response, schema_name)` | Extracts keys, list counts, numeric/boolean values; replaces string values with lengths |
Pure functions, zero external dependencies, easily testable in isolation.
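A sketch of what these functions might look like; the real module would also need an allowlist for safe enum strings such as `content_type`, which is omitted here:

```python
def sanitize_input(input_data: dict, privacy_level: str) -> dict:
    """Replace string values with length indicators unless in full mode."""
    if privacy_level == "full":
        return input_data
    sanitized = {}
    for key, value in input_data.items():
        if isinstance(value, str):
            sanitized[f"{key}_length"] = len(value)  # e.g. prompt -> prompt_length
        else:
            sanitized[key] = value
    return sanitized


def sanitize_output(output_data: dict, privacy_level: str, schema_name: str) -> dict:
    """Full text in full mode, structural shape in redacted, nothing in metadata_only."""
    if privacy_level == "full":
        return output_data
    if privacy_level == "metadata_only":
        return {"schema": schema_name}
    return {"response_shape": _describe_shape(output_data, schema_name)}


def _describe_shape(response: dict, schema_name: str) -> dict:
    """Keep keys, numerics, booleans, and collection counts; reduce strings to lengths."""
    shape = {"schema": schema_name}
    for key, value in response.items():
        if isinstance(value, (bool, int, float)):
            shape[key] = value
        elif isinstance(value, str):
            shape[f"{key}_length"] = len(value)
        elif isinstance(value, (list, tuple)):
            shape[f"{key}_count"] = len(value)
        elif isinstance(value, dict):
            shape[key] = _describe_shape(value, schema_name)
    return shape
```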
### Step 3: Wire sanitizer into Opik logging boundary

File: `src/observability/llm_client.py`

- In `_log_llm_call_to_opik()`: apply `sanitize_input()` and `sanitize_output()` before passing data to `_tracker.client.span()` -- as sketched below
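Illustratively (the variable names and the exact `span()` keyword arguments are assumptions about the surrounding code):

```python
# Inside _log_llm_call_to_opik(), before any data reaches the Opik SDK:
level = _tracker.privacy_level
safe_input = sanitize_input(input_data, level)
safe_output = sanitize_output(output_data, level, schema_name)
_tracker.client.span(name=span_name, input=safe_input,
                     output=safe_output, metadata=metadata)
```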
### Step 4: Enrich generation trace with quality signals

File: `src/agents/agents_v3.py`

- Track `fallback_count`, `avg_nodes_per_layer`, `total_nodes`, `connection_density`, `anchor_count`
- Pass as numeric metadata in `end_generation_trace(output_data={...})` -- see the example below
- Safe at all privacy levels
### Step 5: Export and configure

Files: `src/observability/__init__.py`, `serverless.yml`, `.env`

- Export sanitizer from the observability module
- Add `OBSERVABILITY_PRIVACY_LEVEL` to serverless env config (default: `redacted`) and `.env` (set to `full` for dev)
### Step 6: Tests

File: `tests/test_sanitizer.py` (NEW)

- Test `sanitize_input` and `sanitize_output` at each privacy level (representative cases sketched below)
- Test `_describe_shape` with actual V3 schema shapes
- Test default-to-`redacted` when env var missing/invalid
- Test that `full` mode passes data through unchanged
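A few cases, written against the sanitizer sketch from Step 2:

```python
from src.observability.sanitizer import sanitize_input, sanitize_output


def test_redacted_replaces_prompt_with_length():
    result = sanitize_input({"prompt": "secret user text"}, "redacted")
    assert "prompt" not in result
    assert result["prompt_length"] == len("secret user text")


def test_full_mode_passes_through_unchanged():
    data = {"prompt": "secret user text"}
    assert sanitize_input(data, "full") == data


def test_metadata_only_drops_all_response_content():
    out = sanitize_output({"main_topic": "secret"}, "metadata_only", "V3AnalysisResult")
    assert out == {"schema": "V3AnalysisResult"}
```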
## Key Design Decisions

| Decision | Rationale |
|---|---|
| Binary redaction, not PII scrubbing | Regex-based PII detection is fragile and language-dependent. Content either goes to Opik in full or it doesn't. |
| Sanitization at the boundary, not in the pipeline | The generation pipeline is untouched. SSE events are untouched. Only the Opik serialization path changes. |
| `redacted` as production default | Structural shapes are genuinely useful for MLOps and don't expose content. `metadata_only` is overly restrictive for most use cases. |
| New module rather than inline changes | Sanitization logic is testable in isolation, reusable across leak points, and keeps logging code focused. |
## Agentic Value Chain Versioning
As the ai-engine evolves, changes span multiple dimensions that affect output quality independently. We version the entire agentic pipeline as a composite unit:
```python
AGENTIC_VALUE_CHAIN_VERSION = {
    "agentic_version": "0.0.1",   # Agent topology, models, work patterns
    "prompt_version": "0.0.1",    # Prompt text and template iterations
    "tooling_version": "0.0.1",   # Ancillary tool logic used by agents
}
```
| Dimension | What Changes | Example |
|---|---|---|
| Agentic logic | Agent count, orchestration pattern, underlying models | Switching from 3-call sequential to a 2-agent parallel pattern |
| Prompts | Prompt text, few-shot examples, output schemas | Rewriting the `analyze_content` prompt for better theme extraction |
| Tooling | Search integration, content preprocessing, format converters | Adding a new content chunking strategy |
Each dimension is independently semantic-versioned. The composite version is logged to Opik alongside every trace, enabling:
- A/B comparison of prompt changes without logging prompt content (compare version hashes, not text)
- Regression detection by correlating quality score drops with specific version bumps
- Rollback decisions informed by which dimension degraded ("prompt v0.3.0 dropped anchor coverage from 0.9 to 0.6")
The composite version is Tier 3 data (operational metadata) -- safe to log at all privacy levels.
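One way the composite version could be attached to every trace, with a short deterministic hash for A/B grouping (the hash scheme and function name are illustrative):

```python
import hashlib
import json


def version_metadata() -> dict:
    """Composite version plus a short hash for grouping traces by version.
    Tier 3 operational metadata -- safe to log at every privacy level."""
    blob = json.dumps(AGENTIC_VALUE_CHAIN_VERSION, sort_keys=True).encode()
    return {
        **AGENTIC_VALUE_CHAIN_VERSION,
        "value_chain_hash": hashlib.sha256(blob).hexdigest()[:12],
    }
```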
## Future Considerations
- LLM-as-judge quality evaluation: Score output quality numerically, log only the score
- Per-user consent: A "share trace for debugging" opt-in for specific generations (e.g., via a "report issue" button)
- Opik on-prem migration: The sanitizer remains valuable for developer access control regardless of hosting