🛡️ Privacy-First Telemetry

Field     | Value
Status    | Draft
Author    | Vivek Kalyanarangan
Created   | 2026-02-11
Component | backend/ai-engine/src/observability/
Related   | Opik integration, V3 mindmap pipeline

Principles

1. Content Never Leaves the Process Boundary Unless Explicitly Opted In

User content and LLM derivatives (node labels, summaries, semantic anchors) are privacy-sensitive. LLM derivatives are lossy compressions of user input -- someone reading "Patient Cardiac History" or "Competitor threat from X" in an observability dashboard can infer what the user submitted.

Observability data exits the process at a single boundary (the Opik logging call). Sanitization happens at that boundary, nowhere else. The generation pipeline, SSE events, and frontend are completely unaffected.

2. Shape, Not Content

The structural shape of LLM outputs is safe to log. Knowing "5 themes extracted, depth=3, complexity=0.72, content_type=technical" reveals pipeline health without revealing what those themes say.

This distinction -- shape vs. content -- is the foundation of the architecture.

3. Binary Redaction, Not PII Scrubbing

Content either goes to Opik in full or not at all. No partial redaction, no regex-based PII stripping, no "mask the first 10 characters." These half-measures are fragile, language-dependent, and create false confidence. User content can be in any language, any format, about any topic.

4. Privacy Is an Access Control Problem, Not Just a Transit Problem

Even when Opik moves on-prem, developers should not be reading user content through observability dashboards. The sanitizer remains valuable for access control regardless of where Opik is hosted.


Privacy Taxonomy

Tier | Data | Examples | Visible to developers?
Tier 1: Raw Input | User-submitted text | The text field from the request | No (except dev environments)
Tier 2: LLM Derivatives | Generated content derived from user input | main_topic, node labels, summaries, semantic anchors | No -- but their shape (counts, lengths, numeric fields) is safe
Tier 3: Operational Metadata | Metrics about the process | Token counts, latency, cost, model name, success/failure | Yes, always

Three Privacy Levels

Environment variable: OBSERVABILITY_PRIVACY_LEVEL

Level | Intended For | Prompt Content | LLM Response Content | Structural Shape | Operational Metrics
full | Local development | Full text | Full text | Yes | Yes
redacted | Production (default) | Length only | Counts, keys, numerics only | Yes | Yes
metadata_only | Ultra-strict compliance | Length only | Nothing | No | Yes

Default: redacted. This provides enough structural signal for MLOps while keeping all text content out of Opik. Use metadata_only only if compliance requirements prohibit even structural shapes.


What You Can Observe at Each Level

Always available (all levels)

MLOps Goal | Signal | Source
Cost monitoring | Cost per request, per LLM call, per model | Token counts + pricing table
Latency monitoring | Time per LLM call, total generation time | latency_ms per span
Error/failure tracking | Which calls fail, failure rates by operation | success flag + error type
Token efficiency | Input/output token ratio per operation | Usage metrics per span
Model comparison | All above metrics segmented by model | Model name in span metadata
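
For example, per-call cost can be derived purely from Tier 3 data: token counts per span plus a static pricing table. A minimal sketch; the model name and per-million-token prices below are illustrative assumptions, not actual rates:

# Illustrative pricing table; the per-million-token prices are assumed values.
PRICING_PER_MILLION = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
}

def estimate_call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Derive a span's cost from its usage metrics -- no content involved."""
    prices = PRICING_PER_MILLION[model]
    return (prompt_tokens * prices["prompt"]
            + completion_tokens * prices["completion"]) / 1_000_000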

Available at redacted level

MLOps Goal | Signal | Source
Pipeline health | "5 themes, depth 3, complexity 0.72" | Numeric fields from V3AnalysisResult
Generation quality | "4 nodes with 2 connections" vs "Fallback: 1 node" | Node/connection/container counts
Semantic anchor quality | "Generated 5 anchors" vs "Fallback: 0" | Count from V3SemanticAnchorsResult
Content categorization | Segment by content_type (technical/narrative/educational/mixed) | Enum from analysis
Fallback rate | How often layers fall back to single-node generation | Binary fallback flag per layer

Requires full level (dev only)

MLOps Goal | When Needed
Prompt engineering | Iterating on prompt text
Deep debugging | Reproducing specific user issues locally with consent
Output quality review | Evaluating whether generated labels are good

Example: Production trace (redacted)

Trace: v3_mindmap_generation
  input:  {"content_length": 51234, "enable_search": false}
  metadata: {"model": "gpt-4o", "generator": "v3"}

  Span: llm_parse_V3AnalysisResult
    input:  {"operation": "parse_V3AnalysisResult", "prompt_length": 51234}
    output: {"success": true, "tokens": 1847,
             "response_shape": {"schema": "V3AnalysisResult",
                                "key_themes_count": 5,
                                "suggested_depth": 3,
                                "complexity_score": 0.72,
                                "content_type": "technical"}}
    usage:  {prompt_tokens: 1200, completion_tokens: 647}
    cost:   $0.009535 | latency: 2340ms

  Span: llm_parse_V3LayerNodes (layer 1)
    input:  {"operation": "parse_V3LayerNodes", "prompt_length": 41087}
    output: {"success": true, "tokens": 923,
             "response_shape": {"schema": "V3LayerNodes",
                                "nodes_count": 4,
                                "connections_count": 2,
                                "containers_count": 1}}
    usage:  {prompt_tokens: 780, completion_tokens: 143}
    cost:   $0.003380 | latency: 1120ms

  Generation quality:
    {"total_layers": 4, "total_nodes": 13, "fallback_count": 0,
     "avg_nodes_per_layer": 3.25, "anchor_count": 5, "status": "success"}

A developer sees pipeline health and cost. They know nothing about what the user submitted or what the nodes say.


Quality Evaluation Without Content

Structural Quality Scores

Numeric quality signals computed in-pipeline, with only the scores logged to Opik:

Signal | Calculation | Healthy Range
fallback_count | Layers that fell back to single-node generation | 0
avg_nodes_per_layer | Total non-root nodes / theme layer count | 2.5 - 4.0
connection_density | Total connections / total nodes | 0.3 - 1.0
anchor_count | Semantic anchors generated | 3 - 7
anchor_coverage | Anchors generated / anchors requested | 0.8 - 1.0
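
A minimal sketch of computing these signals in-pipeline; the attributes on the hypothetical mindmap object (layers, nodes, connections, anchors, anchors_requested, fell_back) are assumptions for illustration:

def structural_quality_signals(mindmap) -> dict:
    """Compute content-free quality signals; only these numbers reach Opik."""
    theme_layers = max(len(mindmap.layers), 1)          # hypothetical attributes throughout
    non_root_nodes = max(len(mindmap.nodes) - 1, 0)     # exclude the root node
    total_nodes = max(len(mindmap.nodes), 1)
    anchors_requested = max(mindmap.anchors_requested, 1)
    return {
        "fallback_count": sum(1 for layer in mindmap.layers if layer.fell_back),
        "avg_nodes_per_layer": non_root_nodes / theme_layers,
        "connection_density": len(mindmap.connections) / total_nodes,
        "anchor_count": len(mindmap.anchors),
        "anchor_coverage": len(mindmap.anchors) / anchors_requested,
    }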

LLM-as-Judge (future)

A second LLM evaluates output quality and produces a numeric score (0-1). The generation LLM already processes the user's content, so an evaluation LLM is not a new privacy surface. Only the score is logged, never the rationale text.


Architecture

                    Pipeline (full data)                    Opik Boundary
                    ════════════════════                    ═════════════

User Request ──→ V3MindmapGenerator ──→ LLM Calls ──→ TrackedAsyncOpenAI
                      │                                       │
                      │ (SSE events unchanged)                │ (captures full prompt + response)
                      ▼                                       ▼
                 Frontend                              _log_llm_call_to_opik()
                                                              │
                                                     ┌────────┴────────┐
                                                     │   SANITIZER     │
                                                     │  (privacy_level)│
                                                     └────────┬────────┘
                                                              │
                                                              ▼
                                                        Opik Cloud
                                                   (only sanitized data)

The sanitizer sits at the Opik logging boundary -- the single point where data exits the process. The pipeline, SSE events, and frontend are completely unaffected.


Implementation Plan

Step 1: Add privacy level to OpikTracker

File: src/observability/tracker.py

  • Read OBSERVABILITY_PRIVACY_LEVEL env var in initialize() (default: redacted)
  • Validate: must be one of full, redacted, metadata_only
  • Expose as privacy_level property on the singleton
  • Modify track_generation and track_phase decorators: set capture_input/capture_output to True only when privacy level is full
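
A minimal sketch of the intended change, assuming the level lives on the tracker singleton; names other than OBSERVABILITY_PRIVACY_LEVEL and the decorators listed above are illustrative:

import os

VALID_PRIVACY_LEVELS = {"full", "redacted", "metadata_only"}

class OpikTracker:
    def initialize(self) -> None:
        level = os.getenv("OBSERVABILITY_PRIVACY_LEVEL", "redacted").strip().lower()
        # Invalid or missing values fall back to the safe default instead of failing startup.
        self._privacy_level = level if level in VALID_PRIVACY_LEVELS else "redacted"

    @property
    def privacy_level(self) -> str:
        return self._privacy_level

    @property
    def capture_content(self) -> bool:
        # track_generation / track_phase pass this as capture_input / capture_output.
        return self._privacy_level == "full"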

Step 2: Create sanitizer module

File: src/observability/sanitizer.py (NEW, ~80 lines)

Function | Purpose
sanitize_input(input_data, privacy_level) | Replaces prompt content with length indicator
sanitize_output(output_data, privacy_level, schema_name) | Replaces response content with structural shape
_describe_shape(response, schema_name) | Extracts keys, list counts, numeric/boolean values; replaces string values with lengths

Pure functions, zero external dependencies, easily testable in isolation.
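
A sketch of the module's core, assuming inputs and parsed responses arrive as dicts; exact signatures may differ, and the content_type allowlist is an assumption that reconciles the shape rule with the enum shown in the production trace example:

SAFE_ENUM_FIELDS = {"content_type"}  # enum values carry no user content (assumption)

def sanitize_input(input_data: dict, privacy_level: str) -> dict:
    """Replace prompt text with a length indicator outside full mode."""
    if privacy_level == "full":
        return input_data
    sanitized = {k: v for k, v in input_data.items() if k != "prompt"}
    sanitized["prompt_length"] = len(input_data.get("prompt", ""))
    return sanitized

def sanitize_output(output_data: dict, privacy_level: str, schema_name: str) -> dict:
    """Replace response content with its structural shape; drop it entirely at metadata_only."""
    if privacy_level == "full":
        return output_data
    base = {"success": output_data.get("success"), "tokens": output_data.get("tokens")}
    if privacy_level == "metadata_only":
        return base
    base["response_shape"] = _describe_shape(output_data.get("response", {}), schema_name)
    return base

def _describe_shape(response: dict, schema_name: str) -> dict:
    """Keep keys, counts, numerics and booleans; reduce free-text strings to lengths."""
    shape = {"schema": schema_name}
    for key, value in response.items():
        if isinstance(value, (bool, int, float)):
            shape[key] = value
        elif isinstance(value, str) and key in SAFE_ENUM_FIELDS:
            shape[key] = value
        elif isinstance(value, str):
            shape[f"{key}_length"] = len(value)
        elif isinstance(value, (list, dict)):
            shape[f"{key}_count"] = len(value)
        else:
            shape[f"{key}_type"] = type(value).__name__
    return shape

Applied to a V3AnalysisResult dict, this yields the key_themes_count / suggested_depth / complexity_score / content_type shape shown in the production trace example above.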

Step 3: Wire sanitizer into Opik logging boundary

File: src/observability/llm_client.py

  • In _log_llm_call_to_opik(): apply sanitize_input() and sanitize_output() before passing data to _tracker.client.span()
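
Sketched as a thin wrapper at the existing call site; the span() keyword arguments and the helper's parameters are assumptions about the current code:

from src.observability.sanitizer import sanitize_input, sanitize_output

def _log_llm_call_to_opik(input_data: dict, output_data: dict, schema_name: str, **metrics) -> None:
    # _tracker is the module-level OpikTracker singleton from tracker.py.
    level = _tracker.privacy_level
    _tracker.client.span(
        name=f"llm_parse_{schema_name}",
        input=sanitize_input(input_data, level),
        output=sanitize_output(output_data, level, schema_name),
        metadata=metrics,  # token counts, cost, latency: Tier 3, safe at every level
    )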

Step 4: Enrich generation trace with quality signals

File: src/agents/agents_v3.py

  • Track fallback_count, avg_nodes_per_layer, total_nodes, connection_density, anchor_count
  • Pass as numeric metadata in end_generation_trace(output_data={...})
  • Safe at all privacy levels
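
A sketch of the trace-closing call, reusing the signal computation from "Structural Quality Scores" above; the mindmap attribute names are assumptions:

quality = structural_quality_signals(mindmap)   # see the sketch under "Structural Quality Scores"
end_generation_trace(
    output_data={
        "total_layers": len(mindmap.layers),    # hypothetical attribute names
        "total_nodes": len(mindmap.nodes),
        "status": "success",
        **quality,                              # fallback_count, connection_density, anchor_count, ...
    }
)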

Step 5: Export and configure

Files: src/observability/__init__.py, serverless.yml, .env

  • Export sanitizer from the observability module
  • Add OBSERVABILITY_PRIVACY_LEVEL to serverless env config (default: redacted) and .env (set to full for dev)

Step 6: Tests

File: tests/test_sanitizer.py (NEW)

  • Test sanitize_input and sanitize_output at each privacy level
  • Test _describe_shape with actual V3 schema shapes
  • Test default-to-redacted when env var missing/invalid
  • Test that full level passes data through unchanged
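
Representative cases, written against the assumed signatures from the Step 2 sketch:

from src.observability.sanitizer import sanitize_input, _describe_shape

def test_redacted_input_keeps_only_length():
    out = sanitize_input({"prompt": "secret text", "operation": "parse_V3AnalysisResult"}, "redacted")
    assert "prompt" not in out
    assert out["prompt_length"] == len("secret text")

def test_shape_has_counts_not_content():
    shape = _describe_shape({"key_themes": ["a", "b", "c"], "complexity_score": 0.72}, "V3AnalysisResult")
    assert shape["key_themes_count"] == 3
    assert shape["complexity_score"] == 0.72
    assert "key_themes" not in shape

def test_full_level_passes_through():
    data = {"prompt": "secret text"}
    assert sanitize_input(data, "full") == data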

Key Design Decisions

Decision | Rationale
Binary redaction, not PII scrubbing | Regex-based PII detection is fragile and language-dependent. Content either goes to Opik in full or it doesn't.
Sanitization at the boundary, not in the pipeline | The generation pipeline is untouched. SSE events are untouched. Only the Opik serialization path changes.
redacted as production default | Structural shapes are genuinely useful for MLOps and don't expose content. metadata_only is overly restrictive for most use cases.
New module rather than inline changes | Sanitization logic is testable in isolation, reusable across leak points, and keeps logging code focused.
New module rather than inline changes Sanitization logic is testable in isolation, reusable across leak points, and keeps logging code focused.

Agentic Value Chain Versioning

As the ai-engine evolves, changes span multiple dimensions that affect output quality independently. We version the entire agentic pipeline as a composite unit:

AGENTIC_VALUE_CHAIN_VERSION = {
    "agentic_version": "0.0.1",   # Agent topology, models, work patterns
    "prompt_version": "0.0.1",    # Prompt text and template iterations
    "tooling_version": "0.0.1",   # Ancillary tool logic used by agents
}

Dimension | What Changes | Example
Agentic logic | Agent count, orchestration pattern, underlying models | Switching from 3-call sequential to a 2-agent parallel pattern
Prompts | Prompt text, few-shot examples, output schemas | Rewriting the analyze_content prompt for better theme extraction
Tooling | Search integration, content preprocessing, format converters | Adding a new content chunking strategy

Each dimension is independently semantic-versioned. The composite version is logged to Opik alongside every trace, enabling:

  • A/B comparison of prompt changes without logging prompt content (compare version hashes, not text)
  • Regression detection by correlating quality score drops with specific version bumps
  • Rollback decisions informed by which dimension degraded ("prompt v0.3.0 dropped anchor coverage from 0.9 to 0.6")

The composite version is Tier 3 data (operational metadata) -- safe to log at all privacy levels.
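
A sketch of stamping the composite version onto every trace; the import path and the trace() keyword arguments are assumptions about the tracker wiring:

from src.observability.versioning import AGENTIC_VALUE_CHAIN_VERSION  # hypothetical module path

trace = _tracker.client.trace(
    name="v3_mindmap_generation",
    metadata={
        "generator": "v3",
        **AGENTIC_VALUE_CHAIN_VERSION,  # agentic_version / prompt_version / tooling_version
    },
)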


Future Considerations

  • LLM-as-judge quality evaluation: Score output quality numerically, log only the score
  • Per-user consent: A "share trace for debugging" opt-in for specific generations (e.g., via a "report issue" button)
  • Opik on-prem migration: The sanitizer remains valuable for developer access control regardless of hosting