Databricks

Overview

The Databricks destination streams logs from a Praxis collector into a Databricks Unity Catalog table by:

  1. Uploading batches of log records as files (default: NDJSON gzip) into a Unity Catalog volume.
  2. Triggering a COPY INTO statement against a SQL warehouse to ingest those files into a target table.

The destination is logs-first; the table schema is owned by the user. The COPY INTO step is optional: set copy.mode=upload_only to keep only the file-drop step, which is useful when an external Databricks Workflow or DLT pipeline does the load.
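The two steps can be pictured with a small sketch. It assumes staged files live under the standard Unity Catalog volume path `/Volumes/<catalog>/<schema>/<volume>/` and that the load is a plain `COPY INTO ... FILEFORMAT = JSON`. The exact statement the exporter issues is not documented here, and both helper names are hypothetical.

```python
def volume_path(catalog: str, schema: str, volume: str, subpath: str = "") -> str:
    """Return the /Volumes/... directory for a Unity Catalog volume."""
    base = f"/Volumes/{catalog}/{schema}/{volume}"
    subpath = subpath.strip("/")
    return f"{base}/{subpath}/" if subpath else f"{base}/"


def copy_into_statement(catalog: str, schema: str, table: str, source_dir: str) -> str:
    """Build a COPY INTO statement that loads JSON files found under source_dir."""
    return (
        f"COPY INTO {catalog}.{schema}.{table}\n"
        f"FROM '{source_dir}'\n"
        "FILEFORMAT = JSON"
    )


if __name__ == "__main__":
    staged = volume_path("telemetry", "logs", "praxis_staging", "dt=2024-03-05/hr=14/")
    print(copy_into_statement("telemetry", "logs", "praxis_logs", staged))
```

With copy.mode=upload_only only the first helper would be relevant; the load statement would be issued by whatever external pipeline owns ingestion.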

Supported types: Logs

Authentication

| Credential type | Token kind | When to use |
|---|---|---|
| Databricks PAT (databrickspat) | Workspace user PAT, sent as Authorization: Bearer <pat> | Quick start; fine for non-prod / single-user installs. |
| OAuth2 (oauth2) | Service-principal client credentials grant against the workspace | Production; use a service principal so the token is rotatable and the SP can be granted least-privilege Unity Catalog permissions. |

The exporter rejects any other credential type at config time.
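As a rough illustration of the two credential types, the sketch below builds the PAT header described in the table and a client-credentials token request. It assumes the workspace-level `/oidc/v1/token` endpoint and the `all-apis` scope that Databricks documents for service-principal OAuth; whether the exporter uses exactly this request is an assumption, and the function names are hypothetical. Nothing is sent over the network.

```python
import base64
import urllib.parse
import urllib.request


def pat_headers(pat: str) -> dict:
    """Header used with a personal access token."""
    return {"Authorization": f"Bearer {pat}"}


def oauth_token_request(endpoint: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/oidc/v1/token",
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

The access token returned by such a request would then be sent as a Bearer token, the same way as the PAT.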

Basic Configuration

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| endpoint | string | | Yes | Workspace URL, e.g. https://adb-1234567890.0.azuredatabricks.net. |
| warehouse_id | string | | Yes (unless copy.mode=upload_only) | SQL Warehouse ID that runs the COPY INTO. Find it under SQL Warehouses → ⓘ. |
| target.catalog | string | | Yes | Target Unity Catalog catalog. |
| target.schema | string | | Yes | Target schema within the catalog. |
| target.table | string | | Yes | Target table. The exporter does not issue DDL; pre-create the table with the schema you want. |
| volume.catalog | string | | Yes | Catalog holding the staging volume. |
| volume.schema | string | | Yes | Schema holding the staging volume. |
| volume.name | string | | Yes | Unity Catalog volume name where uploaded files land. The exporter does not create the volume; pre-create it. |
| volume.subpath_template | string | | No | Subdirectory template under the volume root. Supports time placeholders (e.g. dt=%Y-%m-%d/hr=%H/) for partition-friendly layout. |
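To make the time placeholders in volume.subpath_template concrete, here is a minimal sketch that expands the template with Python's `strftime`. It assumes the placeholders follow strftime semantics and are evaluated in UTC; neither is stated above, and `expand_subpath` is a hypothetical helper.

```python
from datetime import datetime, timezone


def expand_subpath(template: str, when: datetime) -> str:
    """Expand strftime-style placeholders using the UTC form of `when`."""
    return when.astimezone(timezone.utc).strftime(template)


if __name__ == "__main__":
    batch_time = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
    print(expand_subpath("dt=%Y-%m-%d/hr=%H/", batch_time))
```

A batch uploaded at 14:30 UTC on 2024-03-05 would land under `dt=2024-03-05/hr=14/` beneath the volume root, which gives downstream readers an hourly partition layout.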

File format

| Parameter | Type | Default | Description |
|---|---|---|---|
| file.format | string | JSON (NDJSON) | Format of uploaded files. |
| file.compression | string | gzip | File compression. Use none to disable. |
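For readers unfamiliar with the default format, the sketch below shows what a gzip-compressed NDJSON file looks like: one JSON object per line, then gzip over the whole payload. The field names in the usage example mirror the pre-flight table schema, but the exact serialization the exporter produces is an assumption.

```python
import gzip
import json


def encode_ndjson_gzip(records: list) -> bytes:
    """Serialize records as newline-delimited JSON and gzip the result."""
    lines = "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
    return gzip.compress(lines.encode("utf-8"))


def decode_ndjson_gzip(blob: bytes) -> list:
    """Inverse of encode_ndjson_gzip, handy for inspecting staged files."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line]


if __name__ == "__main__":
    sample = [{"severity": "INFO", "body": "started"}, {"severity": "ERROR", "body": "boom"}]
    blob = encode_ndjson_gzip(sample)
    print(len(blob), decode_ndjson_gzip(blob))
```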

COPY INTO control

| Parameter | Type | Default | Description |
|---|---|---|---|
| copy.mode | string | auto | One of auto, external, upload_only. auto runs COPY INTO after every batch upload. external skips the COPY (some other process loads). upload_only skips both COPY and the warehouse; files just land in the volume. |
| copy.interval_seconds | int | exporter default | Minimum interval between COPY INTO triggers. Coalesces multiple uploads into one COPY. |
| copy.on_error | string | exporter default | Pass-through to COPY INTO ... ON_ERROR. Values per Databricks SQL docs (abort, continue). |
| copy.wait_timeout | int | exporter default | Seconds to wait for the COPY statement to complete. |
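The coalescing behaviour of copy.interval_seconds can be sketched as a small state machine: uploads accumulate, and a COPY is triggered only when the minimum interval has passed since the previous one. This is an illustration of the idea, not the exporter's implementation; the class and method names are hypothetical, and the clock is passed in explicitly so the logic is easy to follow.

```python
class CopyCoalescer:
    """Decide when a COPY INTO should run, given a minimum interval."""

    def __init__(self, interval_seconds: float):
        self.interval_seconds = interval_seconds
        self.last_copy_at = None   # time of the previous COPY, if any
        self.pending_files = []    # uploads not yet covered by a COPY

    def on_upload(self, path: str, now: float) -> list:
        """Record an upload; return the files to load if a COPY is due, else []."""
        self.pending_files.append(path)
        due = (
            self.last_copy_at is None
            or now - self.last_copy_at >= self.interval_seconds
        )
        if not due:
            return []
        batch, self.pending_files = self.pending_files, []
        self.last_copy_at = now
        return batch
```

A production version would also need a timer that flushes pending files when no further upload arrives, otherwise the last files before a quiet period would wait indefinitely.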

Advanced

| Parameter | Type | Default | Description |
|---|---|---|---|
| advanced.max_batch_rows | int | exporter default | Cap on records per uploaded file. |
| advanced.max_batch_bytes | int | exporter default | Cap on uncompressed bytes per uploaded file. Larger files are split. |
| advanced.timeout | int | exporter default | HTTP request timeout in seconds (for both file upload and COPY). |
| advanced.retry_on_failure | bool | true | Enable automatic retries on send failures. |
| advanced.backpressure_queue | bool | true | Enable a sending queue (with optional disk backing). |
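The interaction of the two batch caps can be sketched as follows: a new file is started whenever adding the next record would exceed either max_batch_rows or max_batch_bytes. The byte accounting here (JSON length plus one newline) is an assumption about how uncompressed size is measured, and `split_batches` is a hypothetical helper.

```python
import json


def split_batches(records: list, max_rows: int, max_bytes: int) -> list:
    """Group records into batches that respect both a row cap and a byte cap."""
    batches, current, current_bytes = [], [], 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8")) + 1  # +1 for the newline
        full = len(current) >= max_rows or current_bytes + size > max_bytes
        if current and full:
            batches.append(current)
            current, current_bytes = [], 0
        # A single record larger than max_bytes still gets its own batch.
        current.append(record)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```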

TLS (advanced.tls)

| Parameter | Type | Default | Description |
|---|---|---|---|
| insecure_skip_verify | bool | false | Skip TLS server cert verification. Not for production. |
| ca_file | string | | Custom CA bundle for the workspace endpoint. |
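To show what these two settings mean in practice, here is a sketch that maps them onto a Python `ssl.SSLContext`. The exporter's own TLS stack is not described here, so treat this purely as an analogy; `build_tls_context` is a hypothetical name.

```python
import ssl
from typing import Optional


def build_tls_context(insecure_skip_verify: bool = False,
                      ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a client TLS context from the two advanced.tls settings."""
    # With ca_file=None the system trust store is used.
    context = ssl.create_default_context(cafile=ca_file)
    if insecure_skip_verify:
        # Disables both hostname and certificate checks. Not for production.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```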

Retry settings (advanced.retry_on_failure_settings)

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_interval | int | exporter default | Initial backoff in seconds. |
| max_interval | int | exporter default | Maximum backoff in seconds. |
| max_time_elapsed | int | exporter default | Maximum total seconds spent retrying a batch. |
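The three retry settings combine into a backoff schedule. The sketch below assumes plain exponential backoff with a doubling multiplier and no jitter; the exporter's actual multiplier and randomization are not documented here, and `backoff_schedule` is a hypothetical helper.

```python
def backoff_schedule(initial_interval: float, max_interval: float,
                     max_time_elapsed: float, multiplier: float = 2.0) -> list:
    """Return the waits (in seconds) between retries until the time budget is spent."""
    delays, elapsed, delay = [], 0.0, float(initial_interval)
    while elapsed + delay <= max_time_elapsed:
        delays.append(delay)
        elapsed += delay
        # Grow the wait, but never beyond max_interval.
        delay = min(delay * multiplier, float(max_interval))
    return delays


if __name__ == "__main__":
    print(backoff_schedule(initial_interval=5, max_interval=60, max_time_elapsed=600))
```

Under these assumptions, the waits grow from initial_interval until they reach max_interval and then stay flat; retries stop once the next wait would push the total past max_time_elapsed.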

Backpressure queue settings (advanced.backpressure_queue_settings)

| Parameter | Type | Default | Description |
|---|---|---|---|
| queue_size | int | exporter default | Maximum buffered batches. |
| number_of_consumers | int | exporter default | Worker count draining the queue. |
| enable_disk_backed_queue | bool | true | Persist the queue to a file_storage extension at ${COL_HOME}/metadata/databricks/<node_name> so it survives collector restarts. Default flipped to true in v0.3. |

Pre-flight: catalog / schema / volume / table

-- Run as a workspace admin
USE CATALOG <catalog>;
USE SCHEMA <schema>;

CREATE VOLUME IF NOT EXISTS <volume_name>;

CREATE TABLE IF NOT EXISTS <table_name> (
  timestamp TIMESTAMP,
  observed_timestamp TIMESTAMP,
  severity STRING,
  body VARIANT,
  attributes VARIANT,
  resource_attributes VARIANT,
  scope_name STRING,
  trace_id STRING,
  span_id STRING
);

-- The service principal needs WRITE VOLUME on the volume and MODIFY on the table,
-- plus USE SCHEMA and USE CATALOG on the parent schema and catalog.
GRANT WRITE VOLUME ON VOLUME <catalog>.<schema>.<volume_name> TO `<service_principal>`;
-- COPY INTO reads the staged files, so the principal that runs it also needs READ VOLUME.
GRANT READ VOLUME ON VOLUME <catalog>.<schema>.<volume_name> TO `<service_principal>`;
GRANT MODIFY ON TABLE <catalog>.<schema>.<table_name> TO `<service_principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO `<service_principal>`;
GRANT USE CATALOG ON CATALOG <catalog> TO `<service_principal>`;

Example Configuration

{
  "endpoint": "https://adb-1234567890.0.azuredatabricks.net",
  "warehouse_id": "1234abcd56ef7890",

  "target": {
    "catalog": "telemetry",
    "schema": "logs",
    "table": "praxis_logs"
  },

  "volume": {
    "catalog": "telemetry",
    "schema": "logs",
    "name": "praxis_staging",
    "subpath_template": "dt=%Y-%m-%d/hr=%H/"
  },

  "file": {
    "format": "JSON",
    "compression": "gzip"
  },

  "copy": {
    "mode": "auto",
    "interval_seconds": 60,
    "on_error": "continue",
    "wait_timeout": 120
  },

  "advanced": {
    "max_batch_rows": 50000,
    "max_batch_bytes": 16777216,
    "timeout": 60,
    "retry_on_failure": true,
    "retry_on_failure_settings": {
      "initial_interval": 5,
      "max_interval": 60,
      "max_time_elapsed": 600
    },
    "backpressure_queue": true,
    "backpressure_queue_settings": {
      "queue_size": 1000,
      "number_of_consumers": 4,
      "enable_disk_backed_queue": true
    },
    "tls": {
      "insecure_skip_verify": false
    }
  }
}
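Before deploying a configuration like the one above, it can help to check the required fields locally. The sketch below encodes only the rules stated in the parameter tables (required fields, the three copy.mode values, and warehouse_id being optional only with copy.mode=upload_only). It is not the exporter's own validator, and `validate_config` is a hypothetical helper.

```python
REQUIRED = {
    "target": ("catalog", "schema", "table"),
    "volume": ("catalog", "schema", "name"),
}
COPY_MODES = ("auto", "external", "upload_only")


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not config.get("endpoint"):
        errors.append("endpoint is required")
    for section, keys in REQUIRED.items():
        for key in keys:
            if not config.get(section, {}).get(key):
                errors.append(f"{section}.{key} is required")
    mode = config.get("copy", {}).get("mode", "auto")
    if mode not in COPY_MODES:
        errors.append(f"copy.mode must be one of {', '.join(COPY_MODES)}")
    if mode != "upload_only" and not config.get("warehouse_id"):
        errors.append("warehouse_id is required unless copy.mode=upload_only")
    return errors
```

Loading the file with `json.load` first also catches syntax problems such as trailing commas, which strict JSON parsers reject.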

Limitations

  • Logs only. Metrics and traces are not supported.
  • No DDL. The exporter never creates catalogs, schemas, volumes, or tables. Pre-provision everything (see the pre-flight script above).
  • At-least-once. Network retries during file upload or COPY INTO may produce duplicate rows. Add a primary key + MERGE INTO downstream if exactly-once is required.
  • One target per destination node. Routing to multiple tables requires multiple destination nodes.
  • Warehouse cost. Every COPY INTO starts the SQL warehouse or keeps it running, and a running warehouse is billed. Use copy.interval_seconds to coalesce uploads, and consider a serverless warehouse with aggressive auto-stop.
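For the at-least-once limitation, a downstream deduplication step might look like the sketch below, which renders a MERGE INTO statement that inserts only rows whose key is not yet present. Everything specific here is an assumption: the pre-flight schema has no key column, so `record_id` is a hypothetical column you would have to add, and the table names are placeholders. The QUALIFY clause collapses duplicates inside the source before the merge.

```python
def merge_dedup_statement(source_table: str, dedup_table: str, key_column: str) -> str:
    """Render a MERGE INTO that copies unseen keys from source_table into dedup_table."""
    return (
        f"MERGE INTO {dedup_table} AS t\n"
        "USING (\n"
        f"  SELECT * FROM {source_table}\n"
        f"  QUALIFY ROW_NUMBER() OVER (PARTITION BY {key_column} "
        "ORDER BY observed_timestamp) = 1\n"
        ") AS s\n"
        f"ON t.{key_column} = s.{key_column}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    print(merge_dedup_statement(
        "telemetry.logs.praxis_logs",        # table loaded by COPY INTO
        "telemetry.logs.praxis_logs_dedup",  # hypothetical deduplicated table
        "record_id",                         # hypothetical unique key column
    ))
```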