Databricks
Overview
The Databricks destination streams logs from a Praxis collector into a Databricks Unity Catalog table by:
- Uploading batches of log records as files (default: NDJSON gzip) into a Unity Catalog volume.
- Triggering a `COPY INTO` statement against a SQL warehouse to ingest those files into a target table.
The destination is logs-only, and the user owns the table schema. The `COPY INTO` step is optional: set `copy.mode=upload_only` to keep only the file-drop step (useful when an external Databricks Workflow or DLT pipeline performs the load).
Supported types: Logs
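
The two steps above can be sketched as a pair of HTTP requests. This is a minimal illustration, assuming the upload goes through the Databricks Files API and the load through the SQL Statement Execution API; the exporter's actual calls are not documented here, and the helper names and the file name in the usage note are hypothetical.

```python
import json
from urllib.parse import quote


def build_upload_request(endpoint, catalog, schema, volume, relative_path):
    """Describe the HTTP call that would put one staged file into the volume."""
    volume_path = f"/Volumes/{catalog}/{schema}/{volume}/{relative_path.lstrip('/')}"
    return {
        "method": "PUT",
        # Files API: the absolute volume path is appended to /api/2.0/fs/files.
        "url": f"{endpoint.rstrip('/')}/api/2.0/fs/files{quote(volume_path)}",
        "volume_path": volume_path,
    }


def build_copy_request(endpoint, warehouse_id, table, source_dir):
    """Describe the HTTP call that would submit COPY INTO to a SQL warehouse."""
    # Assumed statement shape; the exporter may add format or copy options.
    statement = f"COPY INTO {table} FROM '{source_dir}' FILEFORMAT = JSON"
    return {
        "method": "POST",
        "url": f"{endpoint.rstrip('/')}/api/2.0/sql/statements",
        "body": json.dumps({"warehouse_id": warehouse_id, "statement": statement}),
    }
```

Calling `build_upload_request(endpoint, "telemetry", "logs", "praxis_staging", "dt=2024-05-17/hr=09/batch-0001.json.gz")` followed by `build_copy_request(endpoint, warehouse_id, "telemetry.logs.praxis_logs", "/Volumes/telemetry/logs/praxis_staging/")` yields the two request descriptions; neither function sends anything.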
Authentication
| Credential type | Token kind | When to use |
|---|---|---|
| Databricks PAT (`databrickspat`) | Workspace user PAT, sent as `Authorization: Bearer <pat>` | Quick start; fine for non-prod / single-user installs. |
| OAuth2 (`oauth2`) | Service-principal client credentials grant against the workspace | Production; use a service principal so the token is rotatable and the SP can be granted least-privilege Unity Catalog permissions. |
The exporter rejects any other credential type at config time.
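
For orientation, the two credential types map to the following request shapes. This is a sketch, assuming the OAuth2 path uses the workspace token endpoint at `/oidc/v1/token` with the `all-apis` scope, as in Databricks machine-to-machine OAuth; the function names are hypothetical and the exporter's internals may differ.

```python
import base64
from urllib.parse import urlencode


def bearer_header(token):
    """Header used once a PAT or an OAuth access token is in hand."""
    return {"Authorization": f"Bearer {token}"}


def build_token_request(endpoint, client_id, client_secret):
    """Describe a client-credentials token request for a service principal."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return {
        "method": "POST",
        "url": f"{endpoint.rstrip('/')}/oidc/v1/token",
        "headers": {
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        # Assumed scope; a narrower scope may apply in some workspaces.
        "body": urlencode({"grant_type": "client_credentials", "scope": "all-apis"}),
    }
```

The access token returned by the token endpoint would then be sent with `bearer_header`, the same way a PAT is.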
Basic Configuration
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `endpoint` | string | — | Yes | Workspace URL, e.g. `https://adb-1234567890.0.azuredatabricks.net`. |
| `warehouse_id` | string | — | Yes (unless `copy.mode=upload_only`) | SQL Warehouse ID that runs the `COPY INTO`. Find it under SQL Warehouses → ⓘ. |
| `target.catalog` | string | — | Yes | Target Unity Catalog catalog. |
| `target.schema` | string | — | Yes | Target schema within the catalog. |
| `target.table` | string | — | Yes | Target table. The exporter does not issue DDL; pre-create the table with the schema you want. |
| `volume.catalog` | string | — | Yes | Catalog holding the staging volume. |
| `volume.schema` | string | — | Yes | Schema holding the staging volume. |
| `volume.name` | string | — | Yes | Unity Catalog volume name where uploaded files land. The exporter does not create the volume; pre-create it. |
| `volume.subpath_template` | string | — | No | Subdirectory template under the volume root. Supports time placeholders (e.g. `dt=%Y-%m-%d/hr=%H/`) for partition-friendly layout. |
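
To make the `volume.*` parameters concrete, the sketch below expands a subpath template into a full volume path. It assumes the placeholders follow `strftime` conventions and are evaluated in UTC, which the table does not state; the function name and file name are hypothetical.

```python
from datetime import datetime, timezone


def staged_file_path(catalog, schema, volume, subpath_template, file_name, when):
    """Build the /Volumes/... path a staged file would land at."""
    root = f"/Volumes/{catalog}/{schema}/{volume}"
    # Assumption: time placeholders are strftime-style.
    subdir = when.strftime(subpath_template).strip("/")
    return "/".join(part for part in (root, subdir, file_name) if part)


path = staged_file_path(
    "telemetry", "logs", "praxis_staging",
    "dt=%Y-%m-%d/hr=%H/", "batch-0001.json.gz",
    datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc),
)
# path == "/Volumes/telemetry/logs/praxis_staging/dt=2024-05-17/hr=09/batch-0001.json.gz"
```

With an empty template the file lands directly under the volume root.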
File format
| Parameter | Type | Default | Description |
|---|---|---|---|
| `file.format` | string | `JSON` (NDJSON) | Format of uploaded files. |
| `file.compression` | string | `gzip` | File compression. Use `none` to disable. |
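
The default format, gzip-compressed NDJSON, is one JSON object per line. A minimal sketch of encoding and decoding such a file follows; the record fields are illustrative and the helper names are hypothetical, not the exporter's own.

```python
import gzip
import json


def encode_batch(records, compression="gzip"):
    """Serialize records as NDJSON (one JSON object per line), optionally gzipped."""
    payload = "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
    data = payload.encode("utf-8")
    if compression == "gzip":
        return gzip.compress(data)
    if compression == "none":
        return data
    raise ValueError(f"unsupported compression: {compression}")


def decode_batch(data, compression="gzip"):
    """Inverse of encode_batch, handy for inspecting a staged file."""
    if compression == "gzip":
        data = gzip.decompress(data)
    return [json.loads(line) for line in data.decode("utf-8").splitlines() if line]
```

`decode_batch(encode_batch(records))` returns the original list, which is a quick way to check what a staged file contains before it is loaded.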
COPY INTO control
| Parameter | Type | Default | Description |
|---|---|---|---|
| `copy.mode` | string | `auto` | One of `auto`, `external`, `upload_only`. `auto` runs `COPY INTO` after every batch upload. `external` skips the `COPY INTO` (some other process loads the files). `upload_only` skips both the `COPY INTO` and the warehouse; files just land in the volume. |
| `copy.interval_seconds` | int | exporter default | Minimum interval between `COPY INTO` triggers. Coalesces multiple uploads into one `COPY INTO`. |
| `copy.on_error` | string | exporter default | Pass-through to `COPY INTO ... ON_ERROR`. Values per Databricks SQL docs (`abort`, `continue`). |
| `copy.wait_timeout` | int | exporter default | Seconds to wait for the `COPY INTO` statement to complete. |
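
How `copy.mode` and `copy.interval_seconds` interact can be pictured with a small scheduler. This is only a sketch of the coalescing idea under assumed semantics (a trigger is suppressed until the interval has passed since the previous one); the class is hypothetical and the exporter's real scheduling may differ, for example by flushing pending uploads on a timer.

```python
class CopyScheduler:
    """Decide, per upload, whether a COPY INTO should be triggered now."""

    def __init__(self, mode="auto", interval_seconds=60):
        if mode not in ("auto", "external", "upload_only"):
            raise ValueError(f"unknown copy.mode: {mode}")
        self.mode = mode
        self.interval_seconds = interval_seconds
        self._last_copy = None  # time of the previous trigger, in seconds
        self._pending = 0       # uploads not yet covered by a trigger

    def on_upload(self, now):
        """Return how many uploads a COPY triggered now would cover (0 = no trigger)."""
        if self.mode != "auto":
            return 0  # external / upload_only: the exporter never runs the COPY
        self._pending += 1
        if self._last_copy is not None and now - self._last_copy < self.interval_seconds:
            return 0  # too soon; this upload waits for a later trigger
        covered, self._pending = self._pending, 0
        self._last_copy = now
        return covered
```

With a 60-second interval, uploads at 0 s, 10 s, 30 s and 61 s produce two triggers: one at 0 s covering one file and one at 61 s covering three.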
Advanced
| Parameter | Type | Default | Description |
|---|---|---|---|
| `advanced.max_batch_rows` | int | exporter default | Cap on records per uploaded file. |
| `advanced.max_batch_bytes` | int | exporter default | Cap on uncompressed bytes per uploaded file. Larger files are split. |
| `advanced.timeout` | int | exporter default | HTTP request timeout in seconds (for both file upload and `COPY INTO`). |
| `advanced.retry_on_failure` | bool | `true` | Enable automatic retries on send failures. |
| `advanced.backpressure_queue` | bool | `true` | Enable a sending queue (with optional disk backing). |
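
The two batch caps can be read as a splitting rule: start a new file whenever adding the next record would exceed either limit. The sketch below illustrates that rule, assuming sizes are measured on the uncompressed NDJSON line; the function is hypothetical and the exporter may count bytes differently.

```python
import json


def split_batches(records, max_rows, max_bytes):
    """Group records into files that respect both the row cap and the byte cap."""
    batches, current, current_bytes = [], [], 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8")) + 1  # +1 for the newline
        # Close the current file if this record would break either cap.
        if current and (len(current) >= max_rows or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(record)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

A single record larger than `max_bytes` still gets its own file in this sketch rather than being dropped.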
TLS (advanced.tls)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `insecure_skip_verify` | bool | `false` | Skip TLS server cert verification. Not for production. |
| `ca_file` | string | — | Custom CA bundle for the workspace endpoint. |
Retry settings (advanced.retry_on_failure_settings)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `initial_interval` | int | exporter default | Initial backoff in seconds. |
| `max_interval` | int | exporter default | Maximum backoff in seconds. |
| `max_time_elapsed` | int | exporter default | Maximum total seconds spent retrying a batch. |
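
To see how the three retry settings bound each other, the sketch below lists the waits a batch would go through. It assumes plain doubling with no jitter, which is an illustration only; the exporter's actual multiplier and randomization are not documented here.

```python
def backoff_schedule(initial_interval, max_interval, max_time_elapsed):
    """Return the successive waits (seconds) before retries are abandoned."""
    if initial_interval <= 0:
        raise ValueError("initial_interval must be positive")
    delays, elapsed, delay = [], 0, initial_interval
    while elapsed + delay <= max_time_elapsed:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, max_interval)  # assumed doubling, capped at max_interval
    return delays


schedule = backoff_schedule(5, 60, 600)
# schedule == [5, 10, 20, 40, 60, 60, 60, 60, 60, 60, 60, 60]
```

Under these assumptions a batch is retried 12 times over 555 seconds before it is given up.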
Backpressure queue settings (advanced.backpressure_queue_settings)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `queue_size` | int | exporter default | Maximum buffered batches. |
| `number_of_consumers` | int | exporter default | Worker count draining the queue. |
| `enable_disk_backed_queue` | bool | `true` | Persist the queue to a `file_storage` extension at `${COL_HOME}/metadata/databricks/<node_name>` so it survives collector restarts. The default changed to `true` in v0.3. |
Pre-flight: catalog / schema / volume / table
-- Run as a workspace admin
USE CATALOG <catalog>;
USE SCHEMA <schema>;
CREATE VOLUME IF NOT EXISTS <volume_name>;
CREATE TABLE IF NOT EXISTS <table_name> (
  timestamp TIMESTAMP,
  observed_timestamp TIMESTAMP,
  severity STRING,
  body VARIANT,
  attributes VARIANT,
  resource_attributes VARIANT,
  scope_name STRING,
  trace_id STRING,
  span_id STRING
);
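-- The service principal needs WRITE VOLUME on the volume and MODIFY on the table,
-- plus USE SCHEMA and USE CATALOG on the parent schema and catalog.
-- READ VOLUME is added on the assumption that the same principal runs COPY INTO,
-- which has to read the staged files.
GRANT READ VOLUME, WRITE VOLUME ON VOLUME <catalog>.<schema>.<volume_name> TO `<service_principal>`;
GRANT MODIFY ON TABLE <catalog>.<schema>.<table_name> TO `<service_principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO `<service_principal>`;
GRANT USE CATALOG ON CATALOG <catalog> TO `<service_principal>`;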
Example Configuration
{
  "endpoint": "https://adb-1234567890.0.azuredatabricks.net",
  "warehouse_id": "1234abcd56ef7890",
  "target": {
    "catalog": "telemetry",
    "schema": "logs",
    "table": "praxis_logs"
  },
  "volume": {
    "catalog": "telemetry",
    "schema": "logs",
    "name": "praxis_staging",
    "subpath_template": "dt=%Y-%m-%d/hr=%H/"
  },
  "file": {
    "format": "JSON",
    "compression": "gzip"
  },
  "copy": {
    "mode": "auto",
    "interval_seconds": 60,
    "on_error": "continue",
    "wait_timeout": 120
  },
  "advanced": {
    "max_batch_rows": 50000,
    "max_batch_bytes": 16777216,
    "timeout": 60,
    "retry_on_failure": true,
    "retry_on_failure_settings": {
      "initial_interval": 5,
      "max_interval": 60,
      "max_time_elapsed": 600
    },
    "backpressure_queue": true,
    "backpressure_queue_settings": {
      "queue_size": 1000,
      "number_of_consumers": 4,
      "enable_disk_backed_queue": true
    },
    "tls": {
      "insecure_skip_verify": false
    }
  }
}
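
Before deploying a configuration like this one, it can help to check the required parameters locally. The sketch below mirrors the Required column of the Basic Configuration table and the `copy.mode` values listed under COPY INTO control; it is a hypothetical helper, not the exporter's own config-time validation.

```python
REQUIRED = (
    "endpoint",
    "target.catalog", "target.schema", "target.table",
    "volume.catalog", "volume.schema", "volume.name",
)
COPY_MODES = ("auto", "external", "upload_only")


def lookup(cfg, dotted):
    """Resolve a dotted parameter name such as 'target.table' in a nested dict."""
    node = cfg
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate(cfg):
    """Return a list of human-readable problems; an empty list means none found."""
    errors = [f"missing required parameter: {p}" for p in REQUIRED if not lookup(cfg, p)]
    mode = lookup(cfg, "copy.mode") or "auto"
    if mode not in COPY_MODES:
        errors.append(f"copy.mode must be one of {', '.join(COPY_MODES)}")
    elif mode != "upload_only" and not lookup(cfg, "warehouse_id"):
        errors.append("warehouse_id is required unless copy.mode=upload_only")
    return errors
```

Passing the parsed configuration to `validate` returns an empty list when every required parameter is present; removing `warehouse_id` while `copy.mode` is `auto` yields a single error.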
Limitations
- Logs only. Metrics and traces are not supported.
- No DDL. The exporter never creates catalogs, schemas, volumes, or tables. Pre-provision everything (script above).
- At-least-once. Network retries during file upload or `COPY INTO` may produce duplicate rows. Add a primary key + `MERGE INTO` downstream if exactly-once is required.
- One target per destination node. Routing to multiple tables requires multiple destination nodes.
- Warehouse cost. Every `COPY INTO` keeps the SQL warehouse spinning. Use `copy.interval_seconds` to coalesce uploads, and consider a serverless warehouse with aggressive auto-stop.