# 🛡️ Privacy-First Telemetry
| Field | Value |
|---|---|
| Status | Draft |
| Author | Vivek Kalyanarangan |
| Created | 2026-02-11 |
| Component | backend/ai-engine/src/observability/ |
| Related | Opik integration, V3 mindmap pipeline |
## Principles

### 1. Content Never Leaves the Process Boundary Unless Explicitly Opted In
User content and LLM derivatives (node labels, summaries, semantic anchors) are privacy-sensitive. LLM derivatives are lossy compressions of user input -- someone reading "Patient Cardiac History" or "Competitor threat from X" in an observability dashboard can infer what the user submitted.
Observability data exits the process at a single boundary (the Opik logging call). Sanitization happens at that boundary, nowhere else. The generation pipeline, SSE events, and frontend are completely unaffected.
### 2. Shape, Not Content
The structural shape of LLM outputs is safe to log. Knowing "5 themes extracted, depth=3, complexity=0.72, content_type=technical" reveals pipeline health without revealing what those themes say.
This distinction -- shape vs. content -- is the foundation of the architecture.
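As a concrete illustration (the values are hypothetical; field names follow the `V3AnalysisResult` shape shown later in the trace example), here is the same LLM output viewed as content versus as shape:

```python
# Content (Tier 2, never logged in production): the actual analysis result.
content = {
    "main_topic": "Patient Cardiac History",
    "key_themes": ["Arrhythmia", "Medications", "Family history", "Lifestyle", "Imaging"],
    "suggested_depth": 3,
    "complexity_score": 0.72,
    "content_type": "technical",
}

# Shape (safe to log): counts, lengths, numerics, and enums -- no free text survives.
shape = {
    "schema": "V3AnalysisResult",
    "main_topic_length": 23,
    "key_themes_count": 5,
    "suggested_depth": 3,
    "complexity_score": 0.72,
    "content_type": "technical",
}
```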
### 3. Binary Redaction, Not PII Scrubbing
Content either goes to Opik in full or not at all. No partial redaction, no regex-based PII stripping, no "mask the first 10 characters." These half-measures are fragile, language-dependent, and create false confidence. User content can be in any language, any format, about any topic.
### 4. Privacy Is an Access Control Problem, Not Just a Transit Problem
Even when Opik moves on-prem, developers should not be reading user content through observability dashboards. The sanitizer remains valuable for access control regardless of where Opik is hosted.
## Privacy Taxonomy

| Tier | Data | Examples | Visible to developers? |
|---|---|---|---|
| Tier 1: Raw Input | User-submitted text | The `text` field from the request | No (except dev environments) |
| Tier 2: LLM Derivatives | Generated content derived from user input | `main_topic`, node labels, summaries, semantic anchors | No -- but their shape (counts, lengths, numeric fields) is safe |
| Tier 3: Operational Metadata | Metrics about the process | Token counts, latency, cost, model name, success/failure | Yes, always |
## Three Privacy Levels

Environment variable: `OBSERVABILITY_PRIVACY_LEVEL`

| Level | Intended For | Prompt Content | LLM Response Content | Structural Shape | Operational Metrics |
|---|---|---|---|---|---|
| `full` | Local development | Full text | Full text | Yes | Yes |
| `redacted` | Production (default) | Length only | Counts, keys, numerics only | Yes | Yes |
| `metadata_only` | Ultra-strict compliance | Length only | Nothing | No | Yes |

Default: `redacted`. This provides enough structural signal for MLOps while keeping all text content out of Opik. Use `metadata_only` only if compliance requirements prohibit even structural shapes.
## What You Can Observe at Each Level

### Always available (all levels)
| MLOps Goal | Signal | Source |
|---|---|---|
| Cost monitoring | Cost per request, per LLM call, per model | Token counts + pricing table |
| Latency monitoring | Time per LLM call, total generation time | `latency_ms` per span |
| Error/failure tracking | Which calls fail, failure rates by operation | `success` flag + error type |
| Token efficiency | Input/output token ratio per operation | Usage metrics per span |
| Model comparison | All above metrics segmented by model | Model name in span metadata |
### Available at `redacted` level
| MLOps Goal | Signal | Source |
|---|---|---|
| Pipeline health | "5 themes, depth 3, complexity 0.72" | Numeric fields from `V3AnalysisResult` |
| Generation quality | "4 nodes with 2 connections" vs "Fallback: 1 node" | Node/connection/container counts |
| Semantic anchor quality | "Generated 5 anchors" vs "Fallback: 0" | Count from `V3SemanticAnchorsResult` |
| Content categorization | Segment by `content_type` (technical/narrative/educational/mixed) | Enum from analysis |
| Fallback rate | How often layers fall back to single-node generation | Binary fallback flag per layer |
### Requires `full` mode (dev only)
| MLOps Goal | When Needed |
|---|---|
| Prompt engineering | Iterating on prompt text |
| Deep debugging | Reproducing specific user issues locally with consent |
| Output quality review | Evaluating whether generated labels are good |
### Example: Production trace (`redacted`)
```
Trace: v3_mindmap_generation
  input: {"content_length": 51234, "enable_search": false}
  metadata: {"model": "gpt-4o", "generator": "v3"}

  Span: llm_parse_V3AnalysisResult
    input: {"operation": "parse_V3AnalysisResult", "prompt_length": 51234}
    output: {"success": true, "tokens": 1847,
             "response_shape": {"schema": "V3AnalysisResult",
                                "key_themes_count": 5,
                                "suggested_depth": 3,
                                "complexity_score": 0.72,
                                "content_type": "technical"}}
    usage: {prompt_tokens: 1200, completion_tokens: 647}
    cost: $0.009535 | latency: 2340ms

  Span: llm_parse_V3LayerNodes (layer 1)
    input: {"operation": "parse_V3LayerNodes", "prompt_length": 41087}
    output: {"success": true, "tokens": 923,
             "response_shape": {"schema": "V3LayerNodes",
                                "nodes_count": 4,
                                "connections_count": 2,
                                "containers_count": 1}}
    usage: {prompt_tokens: 780, completion_tokens: 143}
    cost: $0.003380 | latency: 1120ms

  Generation quality:
    {"total_layers": 4, "total_nodes": 13, "fallback_count": 0,
     "avg_nodes_per_layer": 3.25, "anchor_count": 5, "status": "success"}
```
A developer sees pipeline health and cost. They know nothing about what the user submitted or what the nodes say.
## Quality Evaluation Without Content

### Structural Quality Scores
Numeric quality signals computed in-pipeline, with only the scores logged to Opik:
| Signal | Calculation | Healthy Range |
|---|---|---|
| `fallback_count` | Layers that fell back to single-node generation | 0 |
| `avg_nodes_per_layer` | Total non-root nodes / theme layer count | 2.5 - 4.0 |
| `connection_density` | Total connections / total nodes | 0.3 - 1.0 |
| `anchor_count` | Semantic anchors generated | 3 - 7 |
| `anchor_coverage` | Anchors generated / anchors requested | 0.8 - 1.0 |
### LLM-as-Judge (future)
A second LLM evaluates output quality and produces a numeric score (0-1). The generation LLM already processes the user's content, so an evaluation LLM is not a new privacy surface. Only the score is logged, never the rationale text.
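A hedged sketch of what this could look like, assuming an OpenAI-compatible client; the model choice, prompt, and parsing are placeholders, and only the resulting number would be logged:

```python
from openai import AsyncOpenAI


async def judge_quality(client: AsyncOpenAI, source_text: str, labels: list[str]) -> float:
    """Ask a second LLM for a 0-1 quality score; discard the rationale.
    Only the numeric score ever crosses the Opik logging boundary."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": (
                "Rate how well these mindmap labels cover the source text. "
                "Reply with a single number between 0 and 1, nothing else.\n\n"
                f"Source:\n{source_text}\n\nLabels:\n{labels}"
            ),
        }],
    )
    try:
        score = float((response.choices[0].message.content or "").strip())
    except ValueError:
        return 0.0  # unparseable judge output is treated as the lowest score
    return max(0.0, min(1.0, score))
```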
## Architecture

```
Pipeline (full data)                              Opik Boundary
────────────────────                              ─────────────
User Request ──▶ V3MindmapGenerator ──▶ LLM Calls ──▶ TrackedAsyncOpenAI
                        │                                      │
                        │ (SSE events unchanged)               │ (captures full prompt + response)
                        ▼                                      ▼
                    Frontend                       _log_llm_call_to_opik()
                                                               │
                                                     ┌─────────┴─────────┐
                                                     │     SANITIZER     │
                                                     │  (privacy_level)  │
                                                     └─────────┬─────────┘
                                                               │
                                                               ▼
                                                          Opik Cloud
                                                   (only sanitized data)
```
The sanitizer sits at the Opik logging boundary -- the single point where data exits the process. The pipeline, SSE events, and frontend are completely unaffected.
## Implementation Plan

### Step 1: Add privacy level to `OpikTracker`

File: `src/observability/tracker.py`

- Read `OBSERVABILITY_PRIVACY_LEVEL` env var in `initialize()` (default: `redacted`) -- see the sketch after this list
- Validate: must be one of `full`, `redacted`, `metadata_only`
- Expose as `privacy_level` property on the singleton
- Modify `track_generation` and `track_phase` decorators: set `capture_input`/`capture_output` to `True` only when privacy level is `full`
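A minimal sketch of the env-var handling, assuming a plain string level rather than an enum (every name except the env var itself is illustrative):

```python
import os

_VALID_LEVELS = {"full", "redacted", "metadata_only"}


def _resolve_privacy_level() -> str:
    """Read OBSERVABILITY_PRIVACY_LEVEL; fall back to the safe default
    when the variable is missing or carries an unrecognized value."""
    level = os.environ.get("OBSERVABILITY_PRIVACY_LEVEL", "redacted").strip().lower()
    return level if level in _VALID_LEVELS else "redacted"
```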
### Step 2: Create sanitizer module

File: `src/observability/sanitizer.py` (NEW, ~80 lines)

| Function | Purpose |
|---|---|
| `sanitize_input(input_data, privacy_level)` | Replaces prompt content with length indicator |
| `sanitize_output(output_data, privacy_level, schema_name)` | Replaces response content with structural shape |
| `_describe_shape(response, schema_name)` | Extracts keys, list counts, numeric/boolean values; replaces string values with lengths |
Pure functions, zero external dependencies, easily testable in isolation.
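A sketch of what these functions might look like; the real module would also need an allowlist for safe enum strings such as `content_type`, which is omitted here:

```python
def sanitize_input(input_data: dict, privacy_level: str) -> dict:
    """Replace string values with length indicators unless in full mode."""
    if privacy_level == "full":
        return input_data
    sanitized = {}
    for key, value in input_data.items():
        if isinstance(value, str):
            sanitized[f"{key}_length"] = len(value)  # e.g. prompt -> prompt_length
        else:
            sanitized[key] = value
    return sanitized


def sanitize_output(output_data: dict, privacy_level: str, schema_name: str) -> dict:
    """Full text in full mode, structural shape in redacted, nothing in metadata_only."""
    if privacy_level == "full":
        return output_data
    if privacy_level == "metadata_only":
        return {"schema": schema_name}
    return {"response_shape": _describe_shape(output_data, schema_name)}


def _describe_shape(response: dict, schema_name: str) -> dict:
    """Keep keys, numerics, booleans, and collection counts; reduce strings to lengths."""
    shape = {"schema": schema_name}
    for key, value in response.items():
        if isinstance(value, (bool, int, float)):
            shape[key] = value
        elif isinstance(value, str):
            shape[f"{key}_length"] = len(value)
        elif isinstance(value, (list, tuple)):
            shape[f"{key}_count"] = len(value)
        elif isinstance(value, dict):
            shape[key] = _describe_shape(value, schema_name)
    return shape
```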
### Step 3: Wire sanitizer into Opik logging boundary

File: `src/observability/llm_client.py`

- In `_log_llm_call_to_opik()`: apply `sanitize_input()` and `sanitize_output()` before passing data to `_tracker.client.span()` -- as sketched below
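Illustratively (the variable names and the exact `span()` keyword arguments are assumptions about the surrounding code):

```python
# Inside _log_llm_call_to_opik(), before any data reaches the Opik SDK:
level = _tracker.privacy_level
safe_input = sanitize_input(input_data, level)
safe_output = sanitize_output(output_data, level, schema_name)
_tracker.client.span(name=span_name, input=safe_input,
                     output=safe_output, metadata=metadata)
```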
### Step 4: Enrich generation trace with quality signals

File: `src/agents/agents_v3.py`

- Track `fallback_count`, `avg_nodes_per_layer`, `total_nodes`, `connection_density`, `anchor_count`
- Pass as numeric metadata in `end_generation_trace(output_data={...})` -- see the example below
- Safe at all privacy levels
### Step 5: Export and configure

Files: `src/observability/__init__.py`, `serverless.yml`, `.env`

- Export sanitizer from the observability module
- Add `OBSERVABILITY_PRIVACY_LEVEL` to serverless env config (default: `redacted`) and `.env` (set to `full` for dev)
### Step 6: Tests

File: `tests/test_sanitizer.py` (NEW)

- Test `sanitize_input` and `sanitize_output` at each privacy level (representative cases sketched below)
- Test `_describe_shape` with actual V3 schema shapes
- Test default-to-`redacted` when env var missing/invalid
- Test that `full` mode passes data through unchanged
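A few cases, written against the sanitizer sketch from Step 2:

```python
from src.observability.sanitizer import sanitize_input, sanitize_output


def test_redacted_replaces_prompt_with_length():
    result = sanitize_input({"prompt": "secret user text"}, "redacted")
    assert "prompt" not in result
    assert result["prompt_length"] == len("secret user text")


def test_full_mode_passes_through_unchanged():
    data = {"prompt": "secret user text"}
    assert sanitize_input(data, "full") == data


def test_metadata_only_drops_all_response_content():
    out = sanitize_output({"main_topic": "secret"}, "metadata_only", "V3AnalysisResult")
    assert out == {"schema": "V3AnalysisResult"}
```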
## Key Design Decisions

| Decision | Rationale |
|---|---|
| Binary redaction, not PII scrubbing | Regex-based PII detection is fragile and language-dependent. Content either goes to Opik in full or it doesn't. |
| Sanitization at the boundary, not in the pipeline | The generation pipeline is untouched. SSE events are untouched. Only the Opik serialization path changes. |
| `redacted` as production default | Structural shapes are genuinely useful for MLOps and don't expose content. `metadata_only` is overly restrictive for most use cases. |
| New module rather than inline changes | Sanitization logic is testable in isolation, reusable across leak points, and keeps logging code focused. |
## Agentic Value Chain Versioning
As the ai-engine evolves, changes span multiple dimensions that affect output quality independently. We version the entire agentic pipeline as a composite unit:
```python
AGENTIC_VALUE_CHAIN_VERSION = {
    "agentic_version": "0.0.1",   # Agent topology, models, work patterns
    "prompt_version": "0.0.1",    # Prompt text and template iterations
    "tooling_version": "0.0.1",   # Ancillary tool logic used by agents
}
```
| Dimension | What Changes | Example |
|---|---|---|
| Agentic logic | Agent count, orchestration pattern, underlying models | Switching from 3-call sequential to a 2-agent parallel pattern |
| Prompts | Prompt text, few-shot examples, output schemas | Rewriting the `analyze_content` prompt for better theme extraction |
| Tooling | Search integration, content preprocessing, format converters | Adding a new content chunking strategy |
Each dimension is independently semantic-versioned. The composite version is logged to Opik alongside every trace, enabling:
- A/B comparison of prompt changes without logging prompt content (compare version hashes, not text)
- Regression detection by correlating quality score drops with specific version bumps
- Rollback decisions informed by which dimension degraded ("prompt v0.3.0 dropped anchor coverage from 0.9 to 0.6")
The composite version is Tier 3 data (operational metadata) -- safe to log at all privacy levels.
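One way the composite version could be attached to every trace, with a short deterministic hash for A/B grouping (the hash scheme and function name are illustrative):

```python
import hashlib
import json


def version_metadata() -> dict:
    """Composite version plus a short hash for grouping traces by version.
    Tier 3 operational metadata -- safe to log at every privacy level."""
    blob = json.dumps(AGENTIC_VALUE_CHAIN_VERSION, sort_keys=True).encode()
    return {
        **AGENTIC_VALUE_CHAIN_VERSION,
        "value_chain_hash": hashlib.sha256(blob).hexdigest()[:12],
    }
```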
## Future Considerations
- LLM-as-judge quality evaluation: Score output quality numerically, log only the score
- Per-user consent: A "share trace for debugging" opt-in for specific generations (e.g., via a "report issue" button)
- Opik on-prem migration: The sanitizer remains valuable for developer access control regardless of hosting