PII Guard Processor

Compliance-grade PII detection and masking for US, EU, UK, and India regulatory standards — built into the Praxis data pipeline.

Supported signals: Logs

Why it matters

Healthcare, financial services, and SaaS organisations operating across the US, EU, UK, and India live under a stack of overlapping data-protection regimes — HIPAA, PCI-DSS, GDPR, UK-GDPR, SPDI Rules 2011, the DPDP Act 2023, RBI master directions, SEBI/IRDAI mandates, and NPCI rails. Sensitive identifiers — SSNs, payment cards, IBANs, national IDs (Aadhaar, PAN, DNI, BSN, Codice Fiscale, Steuer-ID, NIR, RRN, NINO), medical record numbers, passport data, email addresses, IP addresses, UPI handles, IFSC codes, ABHA health IDs — routinely leak into application logs, observability data, and SIEM pipelines, creating audit findings, breach-notification risk, and regulatory exposure.

The PII Guard Processor sits inline in the Praxis collection pipeline and detects these identifiers across log bodies and attributes in real time. In observe mode it reports them; in mask mode it redacts them before the data leaves the customer's environment, so detected PII never lands in downstream stores.

Regulatory coverage out of the box

Standard	Regulator	What it protects	Detected identifiers
HIPAA Safe Harbor	HHS-OCR (US)	Protected Health Information	SSN, MRN, NPI, DEA registration, Medicare MBI, date of birth, VIN, IMEI
HIPAA (cross-emit)	HHS-OCR	Safe Harbor items D / F / N / O	Phone, email, MAC address, IPv4, IPv6
PCI-DSS	PCI-SSC	Cardholder data + auth data	Credit card (Luhn), CVV, magstripe Track 1/2, card expiry
GDPR (EU)	EDPB	Personal data, online identifiers	Email, phone, IBAN, IPv4/IPv6, MAC address, EU VAT, EU passport (MRZ)
GDPR (Spain)	AEPD	National identifier	DNI / NIE (mod-23 validated)
GDPR (Netherlands)	AP	National identifier	BSN (11-test validated)
GDPR (Italy)	Garante	National identifier	Codice Fiscale (16-char structured)
GDPR (Germany)	BfDI	National identifier	Steuer-ID (iterative mod-11 validated)
GDPR (France)	CNIL	National identifier	NIR / social security (mod-97 with Corsica substitution)
GDPR (Belgium)	APD	National identifier	Rijksregisternummer (mod-97 pre/post-2000)
GDPR (UK)	ICO	National identifier + location	UK National Insurance Number, UK postcode
SPDI Rules 2011	MEITY (IT Act)	Sensitive personal data (India)	Bank account, payment cards, PAN — cross-emitted from PCI/HIPAA/DPDP
DPDP Act 2023	MEITY	Personal data (India)	Aadhaar (Verhoeff), PAN (entity-type), Indian mobile, email
RBI-KYC	RBI	KYC officially valid documents	Aadhaar, PAN, voter EPIC, driving license, Indian passport
RBI-DPSC	RBI	Digital payment security controls	Payment card (cross-emit), card expiry, CVV/CVC2 variants, OTP
NPCI	NPCI	UPI / IMPS / NEFT rails	UPI VPA (`alice@okhdfcbank`), IFSC code (4L+0+6alnum)
SEBI	SEBI	Securities market participants	PAN, DEMAT (NSDL `IN`+14 / CDSL 16), UCC client code
IRDAI	IRDAI	Insurance + national health account	Policy number, ABHA health account ID (14 digits)

Pick standards individually with fine-grained category and pattern selection, or set standards: ["all"] to enable every preset at once. A single regex match fans out to every active regulator: one pan_card hit produces accepted counts under dpdp, spdi, rbi_kyc, and sebi simultaneously, so compliance dashboards see per-regulator attribution without running multiple processors.
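The fan-out described above can be pictured as a routing table from detector name to (standard, regulator) pairs. The following is a minimal sketch, not the processor's actual implementation: the `FAN_OUT` table and `record_hit` helper are hypothetical names, and only the pan_card row is taken from this page.

```python
from collections import Counter

# Hypothetical routing table: detector name -> (standard, regulator) pairs.
# The pan_card standards come from the text above; regulator names follow
# the coverage table.
FAN_OUT = {
    "pan_card": [
        ("dpdp", "MEITY"),
        ("spdi", "MEITY"),
        ("rbi_kyc", "RBI"),
        ("sebi", "SEBI"),
    ],
}

def record_hit(counts, pii_type, active_standards):
    """Credit one accepted match to every active standard that claims it."""
    for standard, regulator in FAN_OUT.get(pii_type, []):
        if standard in active_standards:
            counts[(pii_type, standard, regulator)] += 1

counts = Counter()
record_hit(counts, "pan_card", {"dpdp", "spdi", "rbi_kyc", "sebi"})
```

With all four standards active, a single match increments four separate label combinations, which is what lets one processor feed per-regulator dashboards.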

Granular per-standard configuration

Each standard exposes its own { enabled, categories, patterns } block. enabled overrides the legacy bulk standards list (true forces on, false forces off). categories narrows the standard to listed subcategories (empty = all). patterns is an explicit allowlist of detector names (empty = no allowlist; falls back to categories).

{
  "mode": "mask",
  "gdpr":    { "enabled": true,  "categories": ["contact_identifiers", "government_ids"] },
  "pci":     { "enabled": true,  "patterns":   ["credit_card", "cvv"] },
  "hipaa":   { "enabled": false },
  "dpdp":    { "enabled": true },
  "rbi_kyc": { "enabled": true,  "categories": ["government_ids"] }
}

The legacy standards: ["gdpr", "pci"] array and overrides: { ipv4: false } denylist remain supported for pipelines created before v0.3.
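One way to read the precedence rules above is as a single decision function. This is a sketch under stated assumptions, not the shipped resolution logic: `detector_active` is a hypothetical helper, the category names `cardholder_data` and `financial_identifiers` are invented for illustration, and the sketch assumes a non-empty patterns allowlist takes precedence over categories.

```python
CONFIG = {
    "mode": "mask",
    "gdpr": {"enabled": True, "categories": ["contact_identifiers", "government_ids"]},
    "pci": {"enabled": True, "patterns": ["credit_card", "cvv"]},
    "hipaa": {"enabled": False},
    "dpdp": {"enabled": True},
}

def detector_active(config, standard, detector, category):
    """Decide whether one detector runs under one standard."""
    block = config.get(standard, {})
    legacy = config.get("standards", [])

    # An explicit "enabled" wins over the legacy bulk standards list.
    enabled = block.get("enabled")
    if enabled is None:
        enabled = "all" in legacy or standard in legacy
    if not enabled:
        return False

    # Legacy denylist: overrides: { ipv4: false } switches a detector off.
    if config.get("overrides", {}).get(detector) is False:
        return False

    # Assumption: a non-empty patterns allowlist is checked before categories.
    patterns = block.get("patterns") or []
    if patterns:
        return detector in patterns

    # Empty categories means every subcategory of the standard is in scope.
    categories = block.get("categories") or []
    return not categories or category in categories
```

Under this reading, `detector_active(CONFIG, "pci", "credit_card", "cardholder_data")` is true because the detector is on the allowlist, while any HIPAA detector is off because `enabled` is false.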

What makes it accurate (not just regex)

Naive regex generates false positives that drown the signal. PII Guard pairs every high-risk pattern with a format-aware validator:

Luhn checksum — all payment cards, NPI provider IDs, IMEI device IDs.
IBAN mod-97 — international bank account numbers (eliminates random 15–34-char alphanumeric runs).
US SSN range check — rejects the 000 / 666 / 9xx area codes and 00-group / 0000-serial combinations the SSA never issues.
Spanish DNI / NIE mod-23 letter check — filters random 8-digit-plus-letter sequences using the canonical TRWAGMYFPDXBNJZSQVHLCKE alphabet.
Dutch BSN 11-test — weighted-sum check filters labeled-but-non-BSN numbers.
German Steuer-IdNr mod-11 — iterative product check for the 11-digit tax ID.
French NIR mod-97 — with 2A / 2B Corsica department code substitution.
Belgian RRN mod-97 — both pre-2000 and post-2000 birth-year branches.
Italian Codice Fiscale mod-26 — position-weighted alphabet validation.
Indian Aadhaar Verhoeff — UIDAI's official check-digit algorithm rejects random 12-digit runs.
Indian PAN entity-type rule — 4th character must be one of A/B/C/F/G/H/J/L/P/T; rejects random 5L+4D+1L sequences.
Indian IFSC structural — position 5 must be 0; rejects random 11-char alphanumerics.
Calendar-valid date parser — rejects impossible dates of birth like 02/30/1995.
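To make the checksum idea concrete, here is a minimal sketch of four of the validators listed above, written from the publicly documented algorithms. The function names are hypothetical and this is not the processor's own code; real validators would likely add further structural rules (for example, rejecting an all-zero BSN).

```python
def luhn_ok(digits: str) -> bool:
    """Luhn: double every second digit from the right, sum, check mod 10."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def iban_ok(iban: str) -> bool:
    """IBAN mod-97: move the first 4 chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34 and s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    number = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(number) % 97 == 1

DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

def dni_nie_ok(value: str) -> bool:
    """Spanish DNI / NIE: control letter is DNI_LETTERS[number mod 23]."""
    s = value.upper()
    if len(s) != 9:
        return False
    # NIE numbers start with X, Y or Z, which stand for 0, 1 and 2.
    head = {"X": "0", "Y": "1", "Z": "2"}.get(s[0], s[0])
    digits = head + s[1:8]
    if not (digits.isascii() and digits.isdigit()):
        return False
    return DNI_LETTERS[int(digits) % 23] == s[8]

def bsn_ok(digits: str) -> bool:
    """Dutch BSN 11-test: weights 9..2 then -1, sum divisible by 11."""
    if not (len(digits) == 9 and digits.isascii() and digits.isdigit()):
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    return sum(w * int(d) for w, d in zip(weights, digits)) % 11 == 0
```

Each function takes the already-matched candidate string, so the regex stays simple and the validator carries the format knowledge.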

Validator rejections are tracked as a separate metric (pii_validator_rejected_total) so analysts can see how much noise the validators filter out — proof of detection quality, not just detection volume.
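The pairing of a pattern with a validator, and the separate rejection counter, might look like the sketch below. It uses the SSN range check described above. The regex, the helper names, and the `pii_detected_total` metric name are assumptions for illustration; only `pii_validator_rejected_total` appears on this page.

```python
import re
from collections import Counter

# Illustrative pattern for dash-separated SSN candidates.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def ssn_plausible(area: str, group: str, serial: str) -> bool:
    """Reject area 000 / 666 / 9xx, group 00 and serial 0000."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"

def scan_ssn(text: str, metrics: Counter) -> list:
    """Return accepted SSN matches; count validator rejections separately."""
    accepted = []
    for match in SSN_RE.finditer(text):
        if ssn_plausible(*match.groups()):
            metrics["pii_detected_total"] += 1  # hypothetical metric name
            accepted.append(match.group(0))
        else:
            metrics["pii_validator_rejected_total"] += 1
    return accepted

metrics = Counter()
hits = scan_ssn("a 123-45-6789 b 000-12-3456 c 900-11-2222 d 123-00-4567", metrics)
```

Here one candidate is accepted and three are rejected, so the rejection counter shows how much of the raw regex volume was noise.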

What it scans (not just bodies)

PII commonly lives in fields a body-only scanner can't see: user.email as a record attribute, db.user as a resource attribute, http.url in span events. PII Guard's scan_targets config lets operators scan all three log surfaces (body, log record attributes, and resource attributes); span events are not walked in v1:

scan_targets: [body, log_attributes, resource_attributes]

Per-target metric labels show exactly where PII is leaking from, so dashboards can drive remediation at the right layer (instrumentation vs application vs platform).
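As a rough illustration of per-target attribution, the sketch below walks a log record represented as a plain dictionary and labels each hit with the surface it came from. The record layout, the `scan_record` helper, and the simplified email regex are all assumptions; only string-typed values are scanned, in line with the v1 limitations listed later on this page.

```python
import re
from collections import Counter

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scan_record(record: dict, scan_targets: list, metrics: Counter) -> None:
    """Count email hits per scan target for one log record."""
    surfaces = {
        "body": {"body": record.get("body", "")},
        "log_attributes": record.get("attributes", {}),
        "resource_attributes": record.get("resource", {}),
    }
    for target in scan_targets:
        for value in surfaces[target].values():
            if not isinstance(value, str):
                continue  # non-string attributes are skipped
            hits = len(EMAIL_RE.findall(value))
            if hits:
                metrics[("email", target)] += hits

record = {
    "body": "login ok",
    "attributes": {"user.email": "bob@example.com", "retries": 3},
    "resource": {"db.user": "svc@example.org"},
}
metrics = Counter()
scan_record(record, ["body", "log_attributes", "resource_attributes"], metrics)
```

In this example the body is clean, so a body-only scan would report nothing, while the attribute surfaces each report one hit.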

Two operating modes

Mode	Behaviour	Use case
Observe (default)	Emits per-type detection metrics; does not mutate logs	Discovery phase, audit readiness, "where is our PII?"
Mask	Replaces matches in log bodies / attributes using configurable templates	Production compliance, pre-storage redaction

Masking is customisable per identifier type. Examples:

credit_card → ****-****-****-{last4} (preserve last 4 for support workflows)
ssn → XXX-XX-{last4} (HIPAA-compliant partial display)
Default fallback → <REDACTED:{type}>

Tokens {type}, {first4}, {last4} let teams meet both compliance and operability requirements simultaneously.
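Template rendering with these tokens can be sketched as a simple string substitution. The `render_mask` helper is hypothetical, and the sketch assumes that {first4} and {last4} are taken from the match after separators such as dashes and spaces are stripped, which this page does not specify.

```python
DEFAULT_TEMPLATE = "<REDACTED:{type}>"

# Per-type templates taken from the examples above.
TEMPLATES = {
    "credit_card": "****-****-****-{last4}",
    "ssn": "XXX-XX-{last4}",
}

def render_mask(pii_type: str, matched: str) -> str:
    """Render the replacement text for one matched value."""
    template = TEMPLATES.get(pii_type, DEFAULT_TEMPLATE)
    # Assumption: tokens are cut from the alphanumeric characters only.
    compact = "".join(ch for ch in matched if ch.isalnum())
    return (
        template.replace("{type}", pii_type)
        .replace("{first4}", compact[:4])
        .replace("{last4}", compact[-4:])
    )
```

For example, `render_mask("ssn", "123-45-6789")` yields `XXX-XX-6789`, and a type with no template falls back to the `<REDACTED:{type}>` form.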

Operationally honest: exempt known false positives without disabling the detector

Real production traffic contains internal addresses, support mailboxes, and synthetic test data that operators want excluded from the false-positive count without turning off the whole detector.

exemptions:
  email:
    - [email protected]         # exact match
    - "*@acme.com"             # suffix glob: any internal email
  ipv4:
    - 10.0.0.1

Exempted matches are left untouched, so in mask mode their original bytes are never rewritten. They still emit telemetry through pii_exempted_total, so dashboards can see how much noise each exemption filters. This is the difference between "we suppressed it because we know it's safe" and "we missed it."
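The exemption check can be sketched with Python's standard `fnmatch` module, which handles both exact strings and `*` suffix globs. This is an illustration only: the `noreply@example.com` entry, the helper names, and the `pii_detected_total` metric are hypothetical, matching is made case-insensitive as an assumption, and `fnmatch` also treats `?` and `[...]` as wildcards, which the real matcher may not.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical exemption list in the shape of the config above.
EXEMPTIONS = {
    "email": ["noreply@example.com", "*@acme.com"],
    "ipv4": ["10.0.0.1"],
}

def is_exempt(pii_type: str, value: str) -> bool:
    """True if the value matches an exact entry or a glob for its type."""
    patterns = EXEMPTIONS.get(pii_type, [])
    return any(fnmatchcase(value.lower(), p.lower()) for p in patterns)

def handle_match(pii_type: str, value: str, metrics: Counter) -> str:
    """Return the text to emit in mask mode for one accepted match."""
    if is_exempt(pii_type, value):
        metrics["pii_exempted_total"] += 1
        return value  # exempted: original bytes are left as they are
    metrics["pii_detected_total"] += 1  # hypothetical metric name
    return f"<REDACTED:{pii_type}>"
```

An exempted value passes through unchanged but is still counted, which keeps the suppression visible on dashboards.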

Compliance dashboard friendly

Every accepted detection carries six bounded-cardinality labels:

Label	Values	Use
`pii_type`	preset name (closed enum)	What was detected
`standard`	`gdpr`, `pci`, `hipaa`, `spdi`, `dpdp`, `rbi_kyc`, `rbi_dpsc`, `npci`, `sebi`, `irdai`	Which regime applies
`regulator`	Western: `HHS-OCR`, `PCI-SSC`, `EDPB`, `ICO`, `AEPD`, `AP`, `Garante`, `BfDI`, `CNIL`, `APD` · Indian: `MEITY`, `RBI`, `NPCI`, `SEBI`, `IRDAI`	Who enforces it (per-country granularity for GDPR; per-rail for India)
`severity`	`low`, `medium`, `high`	Operational priority
`target`	`body`, `log_attributes`, `resource_attributes`	Where it leaked from
`co_located`	`true` / `false`	High-confidence PCI co-location (card + CVV / card + magstripe)

Pivot any of those for the right view: a CFO dashboard pivots on regulator, an SRE dashboard pivots on target, an audit report pivots on severity + co_located.

Performance

Built on RE2 — linear-time regex matching guarantees, no ReDoS class to defend against.
Per-standard regex unions — keeps each NFA small; ~3–5× faster than a single combined union.
Anchored prescreen — patterns that require a contextual label (CVV, MRN, NPI, BSN, VIN, IMEI, magstripe) are bypassed entirely on records without that label.
Configurable byte cap — limits.max_bytes_per_body (default 16 KB) protects throughput on pathological large bodies.
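The prescreen and byte cap can be illustrated together. The sketch below is an assumption-heavy simplification: it uses Python's backtracking `re` module rather than RE2, the CVV pattern and helper names are invented, and it assumes bodies over the cap are truncated and the first 16 KB scanned, which this page does not state.

```python
import re

MAX_BYTES_PER_BODY = 16 * 1024  # mirrors the documented 16 KB default

# Illustrative labelled pattern: a CVV/CVC label followed by 3-4 digits.
CVV_VALUE = re.compile(r"\b(?:cvv|cvc)\W{0,3}(\d{3,4})\b", re.IGNORECASE)

def cap_body(body: str, limit: int = MAX_BYTES_PER_BODY) -> str:
    """Keep at most `limit` UTF-8 bytes, dropping any split trailing char."""
    raw = body.encode("utf-8")
    if len(raw) <= limit:
        return body
    return raw[:limit].decode("utf-8", errors="ignore")

def find_cvv(body: str) -> list:
    """Run the CVV regex only when a cheap label check passes."""
    body = cap_body(body)
    lowered = body.lower()
    if "cvv" not in lowered and "cvc" not in lowered:
        return []  # prescreen: no label, so the regex never runs
    return CVV_VALUE.findall(body)
```

The substring check is far cheaper than the regex, so records without a label cost almost nothing for label-dependent detectors.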

What it doesn't do (v1 honesty)

Logs signal only. Traces and metrics are planned for v2.
Regex-based detection. No NER / contextual detection — names, free-text addresses, and unstructured PHI need a separate processor.
String-typed attributes only. Nested attribute maps, binary attributes, and span events are not yet walked.

These limitations are tracked and prioritised — the v2 roadmap is available on request.
