Loading Data

Load data with Datui by passing it command-line options and a path to open. The path can be a local file or directory, an S3 URI (Amazon S3 or MinIO), a GCS URI (gs://), or an HTTP/HTTPS URL. Cloud and HTTP support are included by default.
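
For example (bucket names and paths below are placeholders):

# Local file or directory
datui ./data/myfile.parquet

# Remote: Amazon S3, GCS, or HTTP/HTTPS
datui s3://my-bucket/data/myfile.parquet
datui gs://my-bucket/data/myfile.parquet
datui https://example.com/data/myfile.csv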

Remote data (S3, GCS, and HTTP)

You can open data from Amazon S3, Google Cloud Storage (GCS), S3-compatible storage (e.g. MinIO), or HTTP/HTTPS by passing the appropriate URI. Support is included by default when you build or install datui.

Same behavior for all cloud and HTTP sources: Parquet files (and Parquet globs/prefixes) are read directly where supported; all other formats (CSV, JSON, NDJSON, etc.) are downloaded to a temporary file first, then loaded. This applies to AWS S3, S3-compatible backends (MinIO, etc.), and GCS.

One remote path at a time: if you pass more than one path and the first is a remote URL (S3, GCS, or HTTP), datui reports an error. Open a single remote URL per run; multiple local paths can still be opened together (they are concatenated), as shown below.
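
For example (file names are illustrative):

# Multiple local paths are concatenated
datui part1.csv part2.csv

# A remote URL must be the only path for that run
datui s3://my-bucket/data/file.parquet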

Amazon S3 (s3://)

Credentials: Datui does not add its own credential system. It uses the same credentials as the rest of the AWS ecosystem:

  1. Environment variables (good for scripts and one-off use):

    • AWS_ACCESS_KEY_ID — access key
    • AWS_SECRET_ACCESS_KEY — secret key
    • AWS_REGION (or AWS_DEFAULT_REGION) — e.g. us-east-1
    • Optionally AWS_SESSION_TOKEN for temporary credentials
  2. Shared config (good for daily use):

    • ~/.aws/credentials — profiles and keys
    • ~/.aws/config — region and other settings
  3. IAM roles — If you run on EC2, ECS, Lambda, or similar, the instance/task role is used automatically; no env or config needed.

Set at least one of these before running Datui. Example:

export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-east-1
datui s3://my-bucket/data/myfile.parquet

Usage

  • Single Parquet file:
    datui s3://bucket-name/path/to/file.parquet

  • Hive-style partitioned data on S3:
    Use a prefix or glob so Polars can discover partitions:

    • datui s3://bucket-name/prefix/
    • datui "s3://bucket-name/prefix/**/*.parquet"
  • Other formats (CSV, JSON, NDJSON, etc.) on S3:
    Datui downloads the object to a temporary file and then loads it like a local file. Use the same URI style:
    datui s3://bucket-name/path/to/file.csv
    The file is downloaded to your system temp directory (or --temp-dir if set), then opened normally. This is the same approach used for HTTP/HTTPS URLs.
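
A sketch of the non-Parquet case, combining an S3 URI with the --temp-dir option mentioned above (bucket, key, and directory are placeholders):

# Download to a chosen temp directory, then open like a local file
datui --temp-dir /tmp/datui-downloads s3://bucket-name/path/to/file.csv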

Configurable S3 (MinIO and other S3-compatible backends)

You can point S3 at a custom endpoint (e.g. MinIO) via config, environment variables, or CLI options. Priority is: CLI > env > config. That lets you keep a default in config and override per run when you use multiple backends.

Config — In ~/.config/datui/config.toml add a [cloud] section:

[cloud]
# MinIO example
s3_endpoint_url = "http://localhost:9000"
s3_access_key_id = "minioadmin"
s3_secret_access_key = "minioadmin"
s3_region = "us-east-1"

Environment variables (override config; standard for S3 tools):

  • AWS_ENDPOINT_URL or AWS_ENDPOINT_URL_S3 — custom endpoint (e.g. http://localhost:9000)
  • AWS_ACCESS_KEY_ID — access key
  • AWS_SECRET_ACCESS_KEY — secret key
  • AWS_REGION or AWS_DEFAULT_REGION — region (e.g. us-east-1)

CLI options (override env and config):

  • --s3-endpoint-url URL
  • --s3-access-key-id KEY
  • --s3-secret-access-key SECRET
  • --s3-region REGION

Examples for multiple backends without editing config each time:

# MinIO in another terminal/shell
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_REGION=us-east-1
datui s3://my-bucket/file.parquet

# Different MinIO or S3-compatible endpoint
datui --s3-endpoint-url https://s3.other.example s3://other-bucket/file.parquet
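
The CLI options can also be combined in a single invocation; the endpoint, credentials, and bucket below are illustrative:

datui --s3-endpoint-url http://localhost:9000 \
  --s3-access-key-id minioadmin \
  --s3-secret-access-key minioadmin \
  --s3-region us-east-1 \
  s3://my-bucket/file.parquet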

Google Cloud Storage (gs://)

You can open Parquet files from Google Cloud Storage using gs:// URIs. Authentication uses Application Default Credentials:

  1. User or service account key file: set GOOGLE_APPLICATION_CREDENTIALS to the path of your JSON key file.
  2. gcloud CLI: run gcloud auth application-default login.
  3. GCE/Cloud Run: workload identity is used automatically.

Example:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
datui gs://my-bucket/path/to/file.parquet

  • Parquet and Hive-style prefixes/globs work as for S3:
    datui gs://bucket/prefix/ or datui "gs://bucket/prefix/**/*.parquet".
  • Other formats (CSV, JSON, NDJSON, etc.) work the same as S3 and HTTP: the object is downloaded to a temporary file, then loaded. Example:
    datui gs://my-bucket/path/to/file.csv

Minimal build (no cloud)

To build without cloud support and avoid the extra cloud dependencies:

cargo build --release --no-default-features

If you pass an S3 or gs:// URI to a binary built that way, you will see an error suggesting a build with default features.
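
If you install from a local checkout rather than building in place, the same flag applies (this assumes you run it from the datui source directory):

cargo install --path . --no-default-features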

Supported Formats

Format                      Extensions                     Eager load only   Hive partitioning
Parquet                     .parquet                       No                Yes
CSV (or other-delimited)    .csv, .tsv, .psv, etc.         No                No
NDJSON                      .jsonl                         No                No
JSON                        .json                          Yes               No
Arrow IPC / Feather v2      .arrow, .ipc, .feather         No                No
Avro                        .avro                          Yes               No
Excel                       .xls, .xlsx, .xlsm, .xlsb      Yes               No
ORC                         .orc                           Yes               No

  • Eager load only — The file is read fully into memory before use; no lazy streaming.
  • Hive partitioning — Use the --hive flag with a directory or glob; see Hive-partitioned data below.
  • Excel — Use the --sheet flag to specify which sheet to open.
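
For example, to open a specific sheet of an Excel workbook (the sheet value and path are illustrative; check datui --help for the exact form --sheet expects):

datui --sheet Sheet2 /path/to/workbook.xlsx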

CSV date inference — By default, CSV string columns that look like dates (e.g. YYYY-MM-DD, YYYY-MM-DDTHH:MM:SS) are parsed as Polars Date/Datetime. Use --parse-dates false or set parse_dates = false in configuration to disable.
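
For example, to keep date-like strings as plain strings (the path is a placeholder):

datui --parse-dates false /path/to/data.csv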

Compression

Compressed files are identified by extension and decompressed before loading. Use the --compression option to specify the format when the file has no extension or the extension does not match the actual compression.

Supported Compression Formats

  • gz
  • zstd
  • bzip2
  • xz
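
For example, a zstd-compressed file whose name does not carry a compression extension (the path is a placeholder):

datui --compression zstd /path/to/data.csv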

Hive-partitioned data

You can load a Hive-style partitioned dataset (e.g. a directory tree with key=value segment names such as year=2024/month=01/) by using the --hive flag and passing a directory or a glob pattern instead of a single file.

  • Directory: point at the partition root, e.g. datui --hive /path/to/data
  • Glob: use a pattern that matches the partition layout, e.g. datui --hive /path/to/data/**/*.parquet
    You may need to quote the glob so your shell does not expand it (e.g. datui --hive "/path/to/data/**/*.parquet").

Only Parquet is supported for hive-partitioned loading. If you pass a single file with --hive, it is loaded as usual and the flag is ignored.

Schema from one file (default): For faster loading, datui infers the Parquet schema from a single file along one partition branch (single-spine) instead of scanning all files. This applies to both local Hive directories and S3/GCS prefixes (e.g. s3://bucket/prefix/ or gs://bucket/prefix/). If your dataset has inconsistent schemas or other complications and you prefer Polars to discover the schema over all files, disable this with --single-spine-schema=false or set single_spine_schema = false under [file_loading] in configuration.
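
Both ways of disabling it, using the flag and config key named above (the path is a placeholder):

# Per run, on the command line
datui --hive --single-spine-schema=false "/path/to/data/**/*.parquet"

# Or persistently, in ~/.config/datui/config.toml
[file_loading]
single_spine_schema = false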

Partition columns (the keys from the path, e.g. year, month) are shown first in the table and listed in the Info panel under the Partitioned data tab.