
PII Guard Processor

Compliance-grade PII detection and masking for US, EU, UK, and India regulatory standards — built into the Praxis data pipeline.

Supported signals: Logs


Why it matters

Healthcare, financial services, and SaaS organisations operating across the US, EU, UK, and India live under a stack of overlapping data-protection regimes — HIPAA, PCI-DSS, GDPR, UK-GDPR, SPDI Rules 2011, the DPDP Act 2023, RBI master directions, SEBI/IRDAI mandates, and NPCI rails. Sensitive identifiers — SSNs, payment cards, IBANs, national IDs (Aadhaar, PAN, DNI, BSN, Codice Fiscale, Steuer-ID, NIR, RRN, NINO), medical record numbers, passport data, email addresses, IP addresses, UPI handles, IFSC codes, ABHA health IDs — routinely leak into application logs, observability data, and SIEM pipelines, creating audit findings, breach-notification risk, and regulatory exposure.

The PII Guard Processor sits inline in the Praxis collection pipeline and detects these identifiers across log bodies and attributes in real time. In observe mode it reports them; in mask mode it redacts them before they leave the customer's environment, so detected identifiers do not land in downstream stores.


Regulatory coverage out of the box

| Standard | Regulator | What it protects | Detected identifiers |
|---|---|---|---|
| HIPAA Safe Harbor | HHS-OCR (US) | Protected Health Information | SSN, MRN, NPI, DEA registration, Medicare MBI, date of birth, VIN, IMEI |
| HIPAA (cross-emit) | HHS-OCR | Safe Harbor items D / F / N / O | Phone, email, MAC address, IPv4, IPv6 |
| PCI-DSS | PCI-SSC | Cardholder data + auth data | Credit card (Luhn), CVV, magstripe Track 1/2, card expiry |
| GDPR (EU) | EDPB | Personal data, online identifiers | Email, phone, IBAN, IPv4/IPv6, MAC address, EU VAT, EU passport (MRZ) |
| GDPR (Spain) | AEPD | National identifier | DNI / NIE (mod-23 validated) |
| GDPR (Netherlands) | AP | National identifier | BSN (11-test validated) |
| GDPR (Italy) | Garante | National identifier | Codice Fiscale (16-char structured) |
| GDPR (Germany) | BfDI | National identifier | Steuer-ID (iterative mod-11 validated) |
| GDPR (France) | CNIL | National identifier | NIR / social security (mod-97 with Corsica substitution) |
| GDPR (Belgium) | APD | National identifier | Rijksregisternummer (mod-97 pre/post-2000) |
| GDPR (UK) | ICO | National identifier + location | UK National Insurance Number, UK postcode |
| SPDI Rules 2011 | MEITY (IT Act) | Sensitive personal data (India) | Bank account, payment cards, PAN (cross-emitted from PCI/HIPAA/DPDP) |
| DPDP Act 2023 | MEITY | Personal data (India) | Aadhaar (Verhoeff), PAN (entity-type), Indian mobile, email |
| RBI-KYC | RBI | KYC officially valid documents | Aadhaar, PAN, voter EPIC, driving license, Indian passport |
| RBI-DPSC | RBI | Digital payment security controls | Payment card (cross-emit), card expiry, CVV/CVC2 variants, OTP |
| NPCI | NPCI | UPI / IMPS / NEFT rails | UPI VPA (alice@okhdfcbank), IFSC code (4L+0+6alnum) |
| SEBI | SEBI | Securities market participants | PAN, DEMAT (NSDL IN+14 / CDSL 16), UCC client code |
| IRDAI | IRDAI | Insurance + national health account | Policy number, ABHA health account ID (14 digits) |

Pick standards individually with fine-grained category and pattern selection, or use standards: ["all"] to enable every preset in one click. A single regex match fans out to every active regulator — one pan_card hit produces accepted counts under dpdp, spdi, rbi_kyc, and sebi simultaneously, so compliance dashboards see per-regulator attribution without running multiple processors.
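The fan-out described above can be sketched in a few lines. This is a minimal illustration, not the processor's implementation: the `FAN_OUT` table and the `attribute` helper are hypothetical names, the `pan_card` row comes from the text above, and the other two rows are read off the coverage table and may not match the shipped mapping exactly.

```python
from collections import Counter

# Hypothetical detector -> standards mapping (assumption, see lead-in).
FAN_OUT = {
    "pan_card": ["dpdp", "spdi", "rbi_kyc", "sebi"],
    "credit_card": ["pci", "spdi", "rbi_dpsc"],
    "aadhaar": ["dpdp", "rbi_kyc"],
}


def attribute(detector, active_standards, counts):
    """Record one accepted count per active standard for a single match."""
    for standard in FAN_OUT.get(detector, []):
        if standard in active_standards:
            counts[(detector, standard)] += 1
```

Calling `attribute("pan_card", active, counts)` once with all four Indian standards active adds four entries to `counts`, one per regulator view, while the regex itself ran only once.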

Granular per-standard configuration

Each standard exposes its own { enabled, categories, patterns } block. enabled overrides the legacy bulk standards list (true forces on, false forces off). categories narrows the standard to listed subcategories (empty = all). patterns is an explicit allowlist of detector names (empty = no allowlist; falls back to categories).

{
  "mode": "mask",
  "gdpr": { "enabled": true, "categories": ["contact_identifiers", "government_ids"] },
  "pci": { "enabled": true, "patterns": ["credit_card", "cvv"] },
  "hipaa": { "enabled": false },
  "dpdp": { "enabled": true },
  "rbi_kyc": { "enabled": true, "categories": ["government_ids"] }
}

The legacy standards: ["gdpr", "pci"] array and overrides: { ipv4: false } denylist remain supported for pipelines created before v0.3.
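One way to read the precedence rules above is sketched below. It is an interpretation under stated assumptions, not the processor's resolver: `CATALOG` and `active_detectors` are hypothetical, most category names are invented for illustration (only `contact_identifiers` and `government_ids` appear in this document), and the sketch assumes that a non-empty `patterns` list wins over `categories` and that the legacy `overrides` denylist is applied last.

```python
# Hypothetical catalog: standard -> {detector name: category}.
CATALOG = {
    "gdpr": {
        "email": "contact_identifiers",
        "ipv4": "online_identifiers",
        "dni": "government_ids",
    },
    "pci": {
        "credit_card": "cardholder_data",
        "cvv": "auth_data",
        "card_expiry": "cardholder_data",
    },
}


def active_detectors(config):
    """Resolve which detectors run per standard for a given config dict."""
    legacy = set(config.get("standards", []))
    denied = {name for name, on in config.get("overrides", {}).items() if on is False}
    result = {}
    for standard, detectors in CATALOG.items():
        block = config.get(standard, {})
        enabled = block.get("enabled")
        if enabled is None:
            # No explicit flag: fall back to the legacy bulk list.
            enabled = standard in legacy or "all" in legacy
        if not enabled:
            continue
        patterns = set(block.get("patterns", []))
        categories = set(block.get("categories", []))
        if patterns:
            chosen = {name for name in detectors if name in patterns}
        elif categories:
            chosen = {name for name, cat in detectors.items() if cat in categories}
        else:
            chosen = set(detectors)
        result[standard] = sorted(chosen - denied)
    return result
```

With `{"standards": ["gdpr", "pci"], "pci": {"enabled": False}}`, only the GDPR detectors remain active, because the explicit `enabled: false` overrides the legacy list.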


What makes it accurate (not just regex)

Naive regex generates false positives that drown the signal. PII Guard pairs every high-risk pattern with a format-aware validator:

  • Luhn checksum — all payment cards, NPI provider IDs, IMEI device IDs.
  • IBAN mod-97 — international bank account numbers (eliminates random 15–34-char alphanumeric runs).
  • US SSN range check — rejects the 000 / 666 / 9xx area codes and 00-group / 0000-serial combinations the SSA never issues.
  • Spanish DNI / NIE mod-23 letter check — filters random 8-digit-plus-letter sequences using the canonical TRWAGMYFPDXBNJZSQVHLCKE alphabet.
  • Dutch BSN 11-test — weighted-sum check filters labeled-but-non-BSN numbers.
  • German Steuer-IdNr mod-11 — iterative product check for the 11-digit tax ID.
  • French NIR mod-97 — with 2A / 2B Corsica department code substitution.
  • Belgian RRN mod-97 — both pre-2000 and post-2000 birth-year branches.
  • Italian Codice Fiscale mod-26 — position-weighted alphabet validation.
  • Indian Aadhaar Verhoeff — UIDAI's official check-digit algorithm rejects random 12-digit runs.
  • Indian PAN entity-type rule — 4th character must be one of A/B/C/F/G/H/J/L/P/T; rejects random 5L+4D+1L sequences.
  • Indian IFSC structural — position 5 must be 0; rejects random 11-char alphanumerics.
  • Calendar-valid date parser — rejects impossible dates of birth like 02/30/1995.
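Four of the checksum validators listed above are sketched here using the publicly documented algorithms (Luhn, IBAN mod-97, Spanish DNI/NIE mod-23, Dutch 11-test). The function names are hypothetical and the code illustrates the technique only; it is not taken from the PII Guard source.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def luhn_ok(digits):
    """Luhn checksum: double every second digit from the right."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_ok(iban):
    """IBAN mod-97: move the first 4 chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    number = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(number) % 97 == 1


def dni_nie_ok(value):
    """Spanish DNI/NIE: the trailing letter is indexed by number mod 23."""
    v = value.strip().upper()
    if len(v) != 9:
        return False
    if v[0] in "XYZ":  # NIE prefix letters map to 0, 1, 2
        v = str("XYZ".index(v[0])) + v[1:]
    number, letter = v[:8], v[8]
    if not (number.isascii() and number.isdigit()):
        return False
    return DNI_LETTERS[int(number) % 23] == letter


def bsn_ok(value):
    """Dutch BSN 11-test: weights 9..2 and -1 for the last digit."""
    if len(value) != 9 or not (value.isascii() and value.isdigit()):
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    return sum(w * int(d) for w, d in zip(weights, value)) % 11 == 0
```

Each function takes a candidate that a regex has already matched and returns a boolean, so a regex hit only becomes an accepted detection when its checksum also passes.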

Validator rejections are tracked as a separate metric (pii_validator_rejected_total) so analysts can see how much noise the validators filter out — proof of detection quality, not just detection volume.
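The structural validators and the rejection metric can be sketched the same way. In this illustration `rejected` is an in-process stand-in for pii_validator_rejected_total, and the `accept` helper, the `VALIDATORS` table, and the `ifsc` and `date_of_birth` detector names are hypothetical. The date check assumes the US MM/DD/YYYY layout used in the example above.

```python
import re
from collections import Counter
from datetime import datetime

rejected = Counter()  # stand-in for pii_validator_rejected_total, keyed by type


def pan_ok(v):
    """Indian PAN: 5 letters, 4 digits, 1 letter; 4th char is the entity type."""
    return re.fullmatch(r"[A-Z]{5}[0-9]{4}[A-Z]", v) is not None and v[3] in "ABCFGHJLPT"


def ifsc_ok(v):
    """Indian IFSC: 4 letters, a literal 0 in position 5, then 6 alphanumerics."""
    return re.fullmatch(r"[A-Z]{4}0[A-Z0-9]{6}", v) is not None


def ssn_ok(v):
    """US SSN: reject area 000/666/9xx, group 00, and serial 0000."""
    m = re.fullmatch(r"([0-9]{3})-([0-9]{2})-([0-9]{4})", v)
    if m is None:
        return False
    area, group, serial = m.groups()
    return (area not in ("000", "666") and area[0] != "9"
            and group != "00" and serial != "0000")


def dob_ok(v):
    """Calendar-valid date check for MM/DD/YYYY strings."""
    try:
        datetime.strptime(v, "%m/%d/%Y")
    except ValueError:
        return False
    return True


VALIDATORS = {"pan_card": pan_ok, "ifsc": ifsc_ok, "ssn": ssn_ok, "date_of_birth": dob_ok}


def accept(pii_type, candidate):
    """Run the validator and count a rejection when the candidate fails."""
    ok = VALIDATORS[pii_type](candidate)
    if not ok:
        rejected[pii_type] += 1
    return ok
```

For example, `accept("date_of_birth", "02/30/1995")` returns False and increments `rejected["date_of_birth"]`, which is the kind of signal the rejection metric exposes.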


What it scans (not just bodies)

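PII commonly lives in fields a body-only scanner can't see: user.email as a log record attribute, db.user as a resource attribute. (It also turns up as http.url in span events, which this version does not yet walk.) PII Guard's scan_targets config lets operators include all three log surfaces, namely the body, log record attributes, and resource attributes: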

scan_targets: [body, log_attributes, resource_attributes]

Per-target metric labels show exactly where PII is leaking from, so dashboards can drive remediation at the right layer (instrumentation vs application vs platform).
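A rough sketch of how per-target attribution could work is shown below. The record shape (a dict with `body`, `attributes`, and `resource` keys), the `scan` function, and the simplified email regex are all assumptions made for illustration; they are not the processor's data model or its email detector.

```python
import re
from collections import Counter

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan(record, scan_targets):
    """Count email hits per scan target for one log record."""
    surfaces = {
        "body": [record.get("body", "")],
        "log_attributes": list(record.get("attributes", {}).values()),
        "resource_attributes": list(record.get("resource", {}).values()),
    }
    hits = Counter()
    for target in scan_targets:
        for value in surfaces.get(target, []):
            if not isinstance(value, str):
                continue  # only string-typed values are scanned
            found = len(EMAIL.findall(value))
            if found:
                hits[target] += found
    return hits
```

Scanning the same record with only `["body"]` and then with all three targets makes the gap visible: the second call reports hits under log_attributes and resource_attributes that the first one never sees.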


Two operating modes

| Mode | Behaviour | Use case |
|---|---|---|
| Observe (default) | Emits per-type detection metrics; does not mutate logs | Discovery phase, audit readiness, "where is our PII?" |
| Mask | Replaces matches in log bodies / attributes using configurable templates | Production compliance, pre-storage redaction |

Masking is customisable per identifier type. Examples:

  • credit_card → ****-****-****-{last4} (preserve last 4 for support workflows)
  • ssn → XXX-XX-{last4} (HIPAA-compliant partial display)
  • Default fallback → <REDACTED:{type}>

Tokens {type}, {first4}, {last4} let teams meet both compliance and operability requirements simultaneously.
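Template expansion can be sketched as plain token substitution. The `render_mask` helper below is hypothetical, and it assumes that {first4} and {last4} are taken from the match with separators stripped; the processor may define them differently.

```python
# Templates copied from the examples above; the fallback is the documented default.
TEMPLATES = {
    "credit_card": "****-****-****-{last4}",
    "ssn": "XXX-XX-{last4}",
}
DEFAULT_TEMPLATE = "<REDACTED:{type}>"


def render_mask(pii_type, match):
    """Expand {type}, {first4} and {last4} for one matched value."""
    template = TEMPLATES.get(pii_type, DEFAULT_TEMPLATE)
    # Assumption: separators such as spaces and dashes are ignored.
    core = "".join(ch for ch in match if ch.isalnum())
    return (template.replace("{type}", pii_type)
                    .replace("{first4}", core[:4])
                    .replace("{last4}", core[-4:]))
```

Here `render_mask("ssn", "123-45-6789")` yields `XXX-XX-6789`, and any type without its own template falls back to the `<REDACTED:{type}>` form.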


Operationally honest: exempt known false positives without disabling the detector

Real production traffic contains internal addresses, support mailboxes, and synthetic test data that operators want excluded from the false-positive count without turning off the whole detector.

exemptions:
  email:
    - [email protected] # exact match
    - "*@acme.com" # suffix glob: any internal email
  ipv4:
    - 10.0.0.1
Exempted matches still surface telemetry through pii_exempted_total, so dashboards see how much noise the exemption is filtering, and in mask mode the exempted bytes are left unrewritten. This is the difference between "we suppressed it because we know it's safe" and "we missed it."
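The exemption behaviour can be approximated with glob matching from Python's standard library. In this sketch the `EXEMPTIONS` table uses a hypothetical exact-match address (`support@example.org`), `exempted` is an in-process stand-in for pii_exempted_total, and case-sensitive matching via `fnmatchcase` is an assumption rather than documented behaviour.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical exemption table mirroring the config shape above.
EXEMPTIONS = {
    "email": ["support@example.org", "*@acme.com"],
    "ipv4": ["10.0.0.1"],
}
exempted = Counter()  # stand-in for pii_exempted_total, keyed by type


def is_exempt(pii_type, value):
    """Return True and count the hit when value matches an exemption."""
    for pattern in EXEMPTIONS.get(pii_type, []):
        if fnmatchcase(value, pattern):
            exempted[pii_type] += 1
            return True
    return False


def apply_mask(pii_type, value):
    """Leave exempted values untouched; redact everything else."""
    if is_exempt(pii_type, value):
        return value
    return f"<REDACTED:{pii_type}>"
```

An internal address such as `bob@acme.com` passes through unchanged and bumps the exemption counter, while `bob@other.com` is redacted as usual.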


Compliance dashboard friendly

Every accepted detection carries six bounded-cardinality labels:

| Label | Values | Use |
|---|---|---|
| pii_type | preset name (closed enum) | What was detected |
| standard | gdpr, pci, hipaa, spdi, dpdp, rbi_kyc, rbi_dpsc, npci, sebi, irdai | Which regime applies |
| regulator | Western: HHS-OCR, PCI-SSC, EDPB, ICO, AEPD, AP, Garante, BfDI, CNIL, APD · Indian: MEITY, RBI, NPCI, SEBI, IRDAI | Who enforces it (per-country granularity for GDPR; per-rail for India) |
| severity | low, medium, high | Operational priority |
| target | body, log_attributes, resource_attributes | Where it leaked from |
| co_located | true / false | High-confidence PCI co-location (card + CVV / card + magstripe) |

Pivot any of those for the right view: a CFO dashboard pivots on regulator, an SRE dashboard pivots on target, an audit report pivots on severity + co_located.
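The pivots above amount to summing detection counts over one label at a time. A minimal sketch, assuming detections are held as counts keyed by the six labels; the `Labels` tuple and `pivot` helper are hypothetical, and any severity values passed in are illustrative rather than the processor's assignments.

```python
from collections import Counter, namedtuple

# One field per documented label.
Labels = namedtuple(
    "Labels", "pii_type standard regulator severity target co_located"
)


def pivot(counts, label):
    """Sum detection counts grouped by a single label name."""
    out = Counter()
    for labels, n in counts.items():
        out[getattr(labels, label)] += n
    return out
```

Given a `Counter` keyed by `Labels`, `pivot(counts, "regulator")` produces the per-regulator totals and `pivot(counts, "target")` the per-surface totals from the same underlying data.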


Performance

  • Built on RE2 — linear-time regex matching guarantees, no ReDoS class to defend against.
  • Per-standard regex unions — keeps each NFA small; ~3–5× faster than a single combined union.
  • Anchored prescreen — patterns that require a contextual label (CVV, MRN, NPI, BSN, VIN, IMEI, magstripe) are bypassed entirely on records without that label.
  • Configurable byte cap (limits.max_bytes_per_body, default 16 KB) protects throughput on pathological large bodies.
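The prescreen and byte cap can be illustrated as follows. This sketch uses Python's built-in `re` module, which is a backtracking engine and not RE2, so it shows only the control flow and not the linear-time guarantee. The `LABELED` table, its two simplified patterns, and the choice to leave bytes past the cap unscanned are assumptions.

```python
import re

MAX_BYTES_PER_BODY = 16 * 1024  # mirrors the documented 16 KB default

# Hypothetical label-anchored detectors: (required label, pattern).
LABELED = {
    "cvv": ("cvv", re.compile(r"(?i)\bcvv\W{0,3}([0-9]{3,4})\b")),
    "mrn": ("mrn", re.compile(r"(?i)\bmrn\W{0,3}([0-9]{6,10})\b")),
}


def cap_body(body, max_bytes=MAX_BYTES_PER_BODY):
    """Truncate to max_bytes of UTF-8, dropping any split trailing character."""
    raw = body.encode("utf-8")
    if len(raw) <= max_bytes:
        return body
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def scan_labeled(body):
    """Run each labeled detector only when its label appears in the body."""
    body = cap_body(body)
    lowered = body.lower()
    hits = []
    for pii_type, (label, pattern) in LABELED.items():
        if label not in lowered:
            continue  # cheap substring prescreen: skip the regex entirely
        hits.extend((pii_type, m.group(1)) for m in pattern.finditer(body))
    return hits
```

The substring test is far cheaper than a regex pass, so records that never mention a label such as "cvv" skip that detector entirely.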

What it doesn't do (v1 honesty)

  • Logs signal only. Traces and metrics are planned for v2.
  • Regex-based detection. No NER / contextual detection — names, free-text addresses, and unstructured PHI need a separate processor.
  • String-typed attributes only. Nested attribute maps, binary attributes, and span events are not yet walked.

These limitations are tracked and prioritised — the v2 roadmap is available on request.