PII Guard Processor
Compliance-grade PII detection and masking for US, EU, UK, and India regulatory standards — built into the Praxis data pipeline.
Supported signals: Logs
Why it matters
Healthcare, financial services, and SaaS organisations operating across the US, EU, UK, and India live under a stack of overlapping data-protection regimes — HIPAA, PCI-DSS, GDPR, UK-GDPR, SPDI Rules 2011, the DPDP Act 2023, RBI master directions, SEBI/IRDAI mandates, and NPCI rails. Sensitive identifiers — SSNs, payment cards, IBANs, national IDs (Aadhaar, PAN, DNI, BSN, Codice Fiscale, Steuer-ID, NIR, RRN, NINO), medical record numbers, passport data, email addresses, IP addresses, UPI handles, IFSC codes, ABHA health IDs — routinely leak into application logs, observability data, and SIEM pipelines, creating audit findings, breach-notification risk, and regulatory exposure.
The PII Guard Processor sits inline in the Praxis collection pipeline and detects these identifiers across log bodies and attributes in real time. In observe mode it reports what it finds; in mask mode it redacts matches before they leave the customer's environment, so detected identifiers never land in downstream stores.
Regulatory coverage out of the box
| Standard | Regulator | What it protects | Detected identifiers |
|---|---|---|---|
| HIPAA Safe Harbor | HHS-OCR (US) | Protected Health Information | SSN, MRN, NPI, DEA registration, Medicare MBI, date of birth, VIN, IMEI |
| HIPAA (cross-emit) | HHS-OCR | Safe Harbor items D / F / N / O | Phone, email, MAC address, IPv4, IPv6 |
| PCI-DSS | PCI-SSC | Cardholder data + auth data | Credit card (Luhn), CVV, magstripe Track 1/2, card expiry |
| GDPR (EU) | EDPB | Personal data, online identifiers | Email, phone, IBAN, IPv4/IPv6, MAC address, EU VAT, EU passport (MRZ) |
| GDPR (Spain) | AEPD | National identifier | DNI / NIE (mod-23 validated) |
| GDPR (Netherlands) | AP | National identifier | BSN (11-test validated) |
| GDPR (Italy) | Garante | National identifier | Codice Fiscale (16-char structured) |
| GDPR (Germany) | BfDI | National identifier | Steuer-ID (iterative mod-11 validated) |
| GDPR (France) | CNIL | National identifier | NIR / social security (mod-97 with Corsica substitution) |
| GDPR (Belgium) | APD | National identifier | Rijksregisternummer (mod-97 pre/post-2000) |
| GDPR (UK) | ICO | National identifier + location | UK National Insurance Number, UK postcode |
| SPDI Rules 2011 | MEITY (IT Act) | Sensitive personal data (India) | Bank account, payment cards, PAN — cross-emitted from PCI/HIPAA/DPDP |
| DPDP Act 2023 | MEITY | Personal data (India) | Aadhaar (Verhoeff), PAN (entity-type), Indian mobile, email |
| RBI-KYC | RBI | KYC officially valid documents | Aadhaar, PAN, voter EPIC, driving license, Indian passport |
| RBI-DPSC | RBI | Digital payment security controls | Payment card (cross-emit), card expiry, CVV/CVC2 variants, OTP |
| NPCI | NPCI | UPI / IMPS / NEFT rails | UPI VPA (alice@okhdfcbank), IFSC code (4L+0+6alnum) |
| SEBI | SEBI | Securities market participants | PAN, DEMAT (NSDL IN+14 / CDSL 16), UCC client code |
| IRDAI | IRDAI | Insurance + national health account | Policy number, ABHA health account ID (14 digits) |
Pick standards individually with fine-grained category and pattern selection, or set standards: ["all"] to enable every preset at once. A single regex match fans out to every active regulator: one pan_card hit produces accepted counts under dpdp, spdi, rbi_kyc, and sebi simultaneously, so compliance dashboards see per-regulator attribution without running multiple processors.
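The fan-out described above can be pictured with a small sketch. This is an illustration under assumptions, not the processor's actual code: the FAN_OUT table, the record_detection helper, and the counter key shape are hypothetical, and only the pan_card routing (dpdp, spdi, rbi_kyc, sebi) comes from the text.

```python
from collections import Counter

# Hypothetical routing table: detector name -> (standard, regulator) pairs.
# Only the pan_card routing is taken from the text; the structure is assumed.
FAN_OUT = {
    "pan_card": [
        ("dpdp", "MEITY"),
        ("spdi", "MEITY"),
        ("rbi_kyc", "RBI"),
        ("sebi", "SEBI"),
    ],
}

def record_detection(counts: Counter, pii_type: str, active: set) -> None:
    """Increment one counter per active standard that claims this detector."""
    for standard, regulator in FAN_OUT.get(pii_type, []):
        if standard in active:
            counts[(pii_type, standard, regulator)] += 1

counts = Counter()
# One match with all four Indian standards enabled: four counters move.
record_detection(counts, "pan_card", active={"dpdp", "spdi", "rbi_kyc", "sebi"})
# A second match with only two standards enabled: two counters move.
record_detection(counts, "pan_card", active={"dpdp", "sebi"})
```

The regex runs once per match; only the bookkeeping is repeated per standard, which is why a single processor can serve several regulator views.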
Granular per-standard configuration
Each standard exposes its own { enabled, categories, patterns } block. enabled overrides the legacy bulk standards list (true forces on, false forces off). categories narrows the standard to listed subcategories (empty = all). patterns is an explicit allowlist of detector names (empty = no allowlist; falls back to categories).
{
"mode": "mask",
"gdpr": { "enabled": true, "categories": ["contact_identifiers", "government_ids"] },
"pci": { "enabled": true, "patterns": ["credit_card", "cvv"] },
"hipaa": { "enabled": false },
"dpdp": { "enabled": true },
"rbi_kyc": { "enabled": true, "categories": ["government_ids"] }
}
The legacy standards: ["gdpr", "pci"] array and overrides: { ipv4: false } denylist remain supported for pipelines created before v0.3.
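One plausible reading of these precedence rules is sketched below. The function names, the category names cardholder_data and financial, and the choice to apply the overrides denylist before the allowlist are assumptions made for illustration; the source only states that enabled overrides the legacy list, that an empty categories list means all, and that patterns is an allowlist that falls back to categories when empty.

```python
def standard_is_active(name: str, config: dict) -> bool:
    """An explicit per-standard 'enabled' flag wins over the legacy list."""
    block = config.get(name, {})
    if "enabled" in block:
        return bool(block["enabled"])
    legacy = config.get("standards", [])
    return "all" in legacy or name in legacy

def detector_is_active(detector: str, category: str, standard: str, config: dict) -> bool:
    """Decide whether one detector runs under one standard."""
    if not standard_is_active(standard, config):
        return False
    # Assumption: the legacy 'overrides' denylist is checked first.
    if config.get("overrides", {}).get(detector) is False:
        return False
    block = config.get(standard, {})
    patterns = block.get("patterns", [])
    if patterns:                      # a non-empty allowlist is authoritative
        return detector in patterns
    categories = block.get("categories", [])
    return not categories or category in categories  # empty list means all

config = {
    "standards": ["hipaa", "sebi"],   # legacy bulk list
    "overrides": {"ipv4": False},     # legacy denylist
    "gdpr": {"enabled": True, "categories": ["contact_identifiers", "government_ids"]},
    "pci": {"enabled": True, "patterns": ["credit_card", "cvv"]},
    "hipaa": {"enabled": False},      # forces off despite the legacy list
}
```

With this config, hipaa stays off even though the legacy list names it, and pci runs only the two detectors in its allowlist.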
What makes it accurate (not just regex)
Naive regex generates false positives that drown the signal. PII Guard pairs every high-risk pattern with a format-aware validator:
- Luhn checksum: all payment cards, NPI provider IDs, IMEI device IDs.
- IBAN mod-97: international bank account numbers (eliminates random 15–34-char alphanumeric runs).
- US SSN range check: rejects the 000 / 666 / 9xx area codes and 00-group / 0000-serial combinations the SSA never issues.
- Spanish DNI / NIE mod-23 letter check: filters random 8-digit-plus-letter sequences using the canonical TRWAGMYFPDXBNJZSQVHLCKE alphabet.
- Dutch BSN 11-test: weighted-sum check filters labeled-but-non-BSN numbers.
- German Steuer-IdNr mod-11: iterative product check for the 11-digit tax ID.
- French NIR mod-97: with 2A/2B Corsica department code substitution.
- Belgian RRN mod-97: both pre-2000 and post-2000 birth-year branches.
- Italian Codice Fiscale mod-26: position-weighted alphabet validation.
- Indian Aadhaar Verhoeff: UIDAI's official check-digit algorithm rejects random 12-digit runs.
- Indian PAN entity-type rule: 4th character must be one of A/B/C/F/G/H/J/L/P/T; rejects random 5L+4D+1L sequences.
- Indian IFSC structural: position 5 must be 0; rejects random 11-char alphanumerics.
- Calendar-valid date parser: rejects impossible dates of birth like 02/30/1995.
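Four of the checksum validators listed above can be sketched in a few lines each. These are textbook versions of the public algorithms (Luhn, IBAN mod-97, the Spanish mod-23 letter, Verhoeff), written for illustration; the function names are hypothetical and the processor's own implementations may differ in input normalisation and edge cases.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # alphabet quoted in the list above

def luhn_valid(digits: str) -> bool:
    """Luhn: double every second digit from the right, sum, check mod 10."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

def iban_valid(iban: str) -> bool:
    """Move the first 4 chars to the end, map A..Z to 10..35, check mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    rearranged = s[4:] + s[:4]
    return int("".join(str(int(c, 36)) for c in rearranged)) % 97 == 1

def dni_valid(value: str) -> bool:
    """DNI: 8 digits + letter. NIE: leading X/Y/Z stands for 0/1/2."""
    v = value.upper()
    if len(v) != 9:
        return False
    if v[0] in "XYZ":
        v = str("XYZ".index(v[0])) + v[1:]
    body, letter = v[:8], v[8]
    if not (body.isascii() and body.isdigit()):
        return False
    return DNI_LETTERS[int(body) % 23] == letter

# Verhoeff tables: _D is multiplication in the dihedral group D5,
# _P holds the eight position permutations (each row composes the base row).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P1 = [1, 5, 7, 6, 2, 8, 3, 0, 9, 4]
_P = [list(range(10))]
for _ in range(7):
    _P.append([_P[-1][_P1[j]] for j in range(10)])

def verhoeff_valid(digits: str) -> bool:
    """Verhoeff: fold digits right to left through _P and _D; valid if 0."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0
```

An Aadhaar check would apply verhoeff_valid to a 12-digit candidate; the short inputs used to exercise the function here are generic Verhoeff examples, not Aadhaar numbers.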
Validator rejections are tracked as a separate metric (pii_validator_rejected_total) so analysts can see how much noise the validators filter out — proof of detection quality, not just detection volume.
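The structural checks and the rejection counter might fit together as in the sketch below. The rejected counter is a stand-in for pii_validator_rejected_total, and the VALIDATORS table, the accept helper, and the exact regexes are assumptions made for illustration; the rules themselves (SSN ranges, PAN 4th character, IFSC position 5, calendar-valid dates) come from the list above.

```python
import re
from collections import Counter
from datetime import datetime

rejected = Counter()  # stand-in for pii_validator_rejected_total, keyed by type

def ssn_valid(ssn: str) -> bool:
    """Reject area 000 / 666 / 9xx, group 00, and serial 0000."""
    m = re.fullmatch(r"([0-9]{3})-([0-9]{2})-([0-9]{4})", ssn)
    if not m:
        return False
    area, group, serial = m.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"

PAN_ENTITY_TYPES = set("ABCFGHJLPT")

def pan_valid(pan: str) -> bool:
    """5 letters + 4 digits + 1 letter, with a known entity type in position 4."""
    return bool(re.fullmatch(r"[A-Z]{5}[0-9]{4}[A-Z]", pan)) and pan[3] in PAN_ENTITY_TYPES

def ifsc_valid(code: str) -> bool:
    """4 letters, a literal 0 in position 5, then 6 alphanumerics."""
    return bool(re.fullmatch(r"[A-Z]{4}0[A-Z0-9]{6}", code))

def dob_valid(text: str, fmt: str = "%m/%d/%Y") -> bool:
    """Let the calendar decide: strptime raises on dates that do not exist."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

VALIDATORS = {"ssn": ssn_valid, "pan_card": pan_valid, "ifsc": ifsc_valid, "dob": dob_valid}

def accept(pii_type: str, candidate: str) -> bool:
    """Run the validator for a regex match and count the rejection if it fails."""
    ok = VALIDATORS[pii_type](candidate)
    if not ok:
        rejected[pii_type] += 1
    return ok
```

Counting rejections separately from accepted detections is what lets a dashboard show how many regex hits were noise.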
What it scans (not just bodies)
PII commonly lives in fields a body-only scanner can't see: user.email as a log record attribute, or db.user as a resource attribute. PII Guard's scan_targets config lets operators include all three log surfaces, namely the body, record attributes, and resource attributes (span events are outside the v1 logs-only scope):
scan_targets: [body, log_attributes, resource_attributes]
Per-target metric labels show exactly where PII is leaking from, so dashboards can drive remediation at the right layer (instrumentation vs application vs platform).
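A rough sketch of how per-target scanning and labelling could work is shown below. The record shape (a dict with body, attributes, and resource keys), the scan function, and the single simplified email regex are all assumptions for illustration; the real processor operates on pipeline log records and its own detector set.

```python
import re
from collections import Counter

# Simplified email pattern for the sketch, not the processor's detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scan(record: dict, scan_targets: list) -> Counter:
    """Count email matches per scan target; non-string values are skipped."""
    surfaces = {
        "body": [record.get("body", "")],
        "log_attributes": list(record.get("attributes", {}).values()),
        "resource_attributes": list(record.get("resource", {}).values()),
    }
    hits = Counter()
    for target in scan_targets:
        for value in surfaces[target]:
            if isinstance(value, str):
                hits[target] += len(EMAIL.findall(value))
    return hits

record = {
    "body": "login ok",
    "attributes": {"user.email": "alice@example.com", "http.status": 200},
    "resource": {"db.user": "bob@example.com"},
}
all_targets = scan(record, ["body", "log_attributes", "resource_attributes"])
body_only = scan(record, ["body"])
```

In this example a body-only scan reports nothing, while the full scan attributes one hit each to log_attributes and resource_attributes.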
Two operating modes
| Mode | Behaviour | Use case |
|---|---|---|
| Observe (default) | Emits per-type detection metrics; does not mutate logs | Discovery phase, audit readiness, "where is our PII?" |
| Mask | Replaces matches in log bodies / attributes using configurable templates | Production compliance, pre-storage redaction |
Masking is customisable per identifier type. Examples:
- credit_card → ****-****-****-{last4} (preserve last 4 for support workflows)
- ssn → XXX-XX-{last4} (HIPAA-compliant partial display)
- Default fallback → <REDACTED:{type}>
Tokens {type}, {first4}, {last4} let teams meet both compliance and operability requirements simultaneously.
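Template expansion can be sketched as plain token substitution. The render_mask helper and the decision to compute {first4} and {last4} over alphanumeric characters only (ignoring separators such as dashes) are assumptions; the templates and token names are the ones shown above.

```python
MASK_TEMPLATES = {
    "credit_card": "****-****-****-{last4}",
    "ssn": "XXX-XX-{last4}",
}
DEFAULT_TEMPLATE = "<REDACTED:{type}>"

def render_mask(pii_type: str, match: str) -> str:
    """Expand {type}, {first4}, and {last4} for one matched value."""
    # Assumption: separators are ignored when taking the first/last 4 characters.
    core = "".join(c for c in match if c.isalnum())
    template = MASK_TEMPLATES.get(pii_type, DEFAULT_TEMPLATE)
    return (
        template.replace("{type}", pii_type)
        .replace("{first4}", core[:4])
        .replace("{last4}", core[-4:])
    )
```

For example, render_mask("credit_card", "4111-1111-1111-1234") keeps only the final 1234, and a type with no configured template falls back to the <REDACTED:{type}> form.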
Operationally honest: exempt known false positives without disabling the detector
Real production traffic contains internal addresses, support mailboxes, and synthetic test data that operators want excluded from the FP count without turning off the whole detector.
exemptions:
  email:
    - "*@acme.com"   # suffix glob: any internal email
  ipv4:
    - 10.0.0.1
Exempted matches are still counted in pii_exempted_total, so dashboards show how much the exemptions filter out, but in mask mode the matched text is left untouched. This is the difference between "we suppressed it because we know it's safe" and "we missed it."
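Exemption matching can be approximated with shell-style globs. The sketch below uses Python's fnmatch.fnmatchcase as a stand-in; whether the processor's glob syntax and case sensitivity match fnmatch exactly is an assumption, as are the is_exempt helper and the exempted counter standing in for pii_exempted_total.

```python
from collections import Counter
from fnmatch import fnmatchcase

EXEMPTIONS = {
    "email": ["*@acme.com"],  # suffix glob from the config above
    "ipv4": ["10.0.0.1"],     # exact value
}
exempted = Counter()  # stand-in for pii_exempted_total, keyed by type

def is_exempt(pii_type: str, match: str) -> bool:
    """Return True and count the match if any exemption pattern covers it."""
    for pattern in EXEMPTIONS.get(pii_type, []):
        if fnmatchcase(match, pattern):  # the whole string must match the glob
            exempted[pii_type] += 1
            return True
    return False
```

Because the glob must match the entire value, an address such as ops@acme.com.evil.io is not covered by *@acme.com and would still be detected.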
Compliance dashboard friendly
Every accepted detection carries six bounded-cardinality labels:
| Label | Values | Use |
|---|---|---|
| pii_type | preset name (closed enum) | What was detected |
| standard | gdpr, pci, hipaa, spdi, dpdp, rbi_kyc, rbi_dpsc, npci, sebi, irdai | Which regime applies |
| regulator | Western: HHS-OCR, PCI-SSC, EDPB, ICO, AEPD, AP, Garante, BfDI, CNIL, APD · Indian: MEITY, RBI, NPCI, SEBI, IRDAI | Who enforces it (per-country granularity for GDPR; per-rail for India) |
| severity | low, medium, high | Operational priority |
| target | body, log_attributes, resource_attributes | Where it leaked from |
| co_located | true / false | High-confidence PCI co-location (card + CVV / card + magstripe) |
Pivot any of those for the right view: a CFO dashboard pivots on regulator, an SRE dashboard pivots on target, an audit report pivots on severity + co_located.
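As a toy illustration of pivoting, the sketch below groups labelled detections by one or more labels. The sample rows are invented for the example (the label names and value sets follow the table above, but the specific severity assignments and counts are not taken from the product), and pivot is a hypothetical helper rather than a dashboard feature.

```python
from collections import Counter

# Invented sample rows; only the label names and allowed values come from the table.
detections = [
    {"pii_type": "credit_card", "standard": "pci", "regulator": "PCI-SSC",
     "severity": "high", "target": "body", "co_located": "true"},
    {"pii_type": "email", "standard": "gdpr", "regulator": "EDPB",
     "severity": "low", "target": "log_attributes", "co_located": "false"},
    {"pii_type": "email", "standard": "dpdp", "regulator": "MEITY",
     "severity": "low", "target": "log_attributes", "co_located": "false"},
    {"pii_type": "pan_card", "standard": "sebi", "regulator": "SEBI",
     "severity": "high", "target": "resource_attributes", "co_located": "false"},
]

def pivot(rows: list, *labels: str) -> Counter:
    """Count detections grouped by the chosen label tuple."""
    return Counter(tuple(row[label] for label in labels) for row in rows)

by_target = pivot(detections, "target")                   # SRE-style view
audit_view = pivot(detections, "severity", "co_located")  # audit-style view
```

Because every label has a small closed value set, any combination of labels yields a bounded number of groups.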
Performance
- Built on RE2: linear-time regex matching guarantees, no ReDoS class to defend against.
- Per-standard regex unions: keeps each NFA small; ~3–5× faster than a single combined union.
- Anchored prescreen: patterns that require a contextual label (CVV, MRN, NPI, BSN, VIN, IMEI, magstripe) are bypassed entirely on records without that label.
- Configurable byte cap: limits.max_bytes_per_body (default 16 KB) protects throughput on pathologically large bodies.
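The prescreen and byte-cap ideas can be sketched as follows. Several things here are assumptions: the cap is modelled as scanning only the first max_bytes_per_body bytes (the source does not say whether oversized bodies are truncated or skipped), the two label patterns are simplified illustrations, and Python's re module is a backtracking engine used only for demonstration, not RE2.

```python
import re

MAX_BYTES_PER_BODY = 16 * 1024  # default quoted above

# Hypothetical label-anchored detectors: (cheap label substrings, full regex).
LABELLED = {
    "cvv": (("cvv", "cvc"), re.compile(r"(?i)\b(?:cvv|cvc)\D{0,3}([0-9]{3,4})\b")),
    "mrn": (("mrn",), re.compile(r"(?i)\bmrn\D{0,3}([0-9]{6,10})\b")),
}

def cap_body(body: str, limit: int = MAX_BYTES_PER_BODY) -> str:
    """Keep at most `limit` UTF-8 bytes, dropping any split trailing character."""
    raw = body.encode("utf-8")
    if len(raw) <= limit:
        return body
    return raw[:limit].decode("utf-8", errors="ignore")

def scan_labelled(body: str) -> list:
    """Run a label-anchored regex only when its label occurs in the body."""
    text = cap_body(body)
    lowered = text.lower()
    hits = []
    for pii_type, (labels, pattern) in LABELLED.items():
        if not any(label in lowered for label in labels):
            continue  # prescreen: a substring test is cheaper than the regex
        hits.extend((pii_type, m.group(1)) for m in pattern.finditer(text))
    return hits
```

A record such as "order 123 shipped" never reaches either regex because neither label is present, which is the saving the prescreen is after.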
What it doesn't do (v1 honesty)
- Logs signal only. Traces and metrics are planned for v2.
- Regex-based detection. No NER / contextual detection — names, free-text addresses, and unstructured PHI need a separate processor.
- String-typed attributes only. Nested attribute maps, binary attributes, and span events are not yet walked.
These limitations are tracked and prioritised — the v2 roadmap is available on request.