Azure Blob Storage

Prerequisites

Steps to Set Up Azure Blob Storage Source

  1. On the Set up the source page, select Azure Blob Storage from the dropdown under Source type.
  2. Provide a name for the Azure Blob Storage connector to easily identify it.
  3. Enter the Azure Storage Account name and container name associated with your Azure Blob Storage.
  4. Choose your preferred Authentication method:
    i. If you are using a Storage Account Key, select Authenticate via Storage Account Key and provide the key.
    ii. If you are using a Service Principal, select Authenticate via Client Credentials.
    iii. Review the IAM role bindings for the Service Principal and gather app registration details as needed.
    iv. Enter the Directory (tenant) ID from Azure Portal into the Tenant ID field.
    v. Provide the Application (client) ID from Azure Portal into the Client ID field (note: this is not the secret ID).
    vi. Enter the Secret Value from the Azure Portal into the Client Secret field.
  5. Add a stream:
    i. Specify the File Type.
    ii. Select the file format for replication in the Format field using the dropdown. Supported formats include CSV, Parquet, Avro, and JSONL. Additional configuration options for each format are available by toggling the Optional fields button. For more information, see the File Format section.
    iii. Assign a name to the stream.
    iv. (Optional) Provide an Input schema if you want to enforce a specific schema; otherwise, it will be inferred automatically. By default, the schema is set to {}. For more details, refer to the User Schema section.
    v. (Optional) Specify Globs for selecting the files to replicate. This is a glob-style pattern (not a regular expression) for file matching. To replicate all files in the container, use ** as the pattern. See the Path Patterns section for more details.
  6. (Optional) Enter a custom endpoint to use for data replication, if your storage account is not served from the default Azure endpoint.
  7. (Optional) Provide the Start Date for data replication to begin. This date must be in YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ format.
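The steps above can be summarized as a single source configuration. The sketch below assembles one in Python; the field names are illustrative assumptions, not the connector's exact specification, and the account, container, and credential values are placeholders.

```python
import json

# Illustrative source configuration built from the setup steps above.
# Property names are hypothetical; consult the connector specification
# for the exact schema. Credential values are placeholders.
config = {
    "azure_blob_storage_account_name": "mystorageaccount",
    "azure_blob_storage_container_name": "my-container",
    "credentials": {
        "auth_type": "client_credentials",
        "tenant_id": "<directory-tenant-id>",     # Directory (tenant) ID
        "client_id": "<application-client-id>",   # Application (client) ID, not the secret ID
        "client_secret": "<secret-value>",        # Secret Value from the Azure Portal
    },
    "streams": [
        {
            "name": "daily_exports",
            "format": {"filetype": "csv"},
            "globs": ["**/*.csv"],   # glob pattern selecting files to replicate
            "input_schema": "{}",    # default: schema inferred automatically
        }
    ],
    "start_date": "2024-01-01T00:00:00Z",  # YYYY-MM-DDTHH:MM:SSZ format
}

print(json.dumps(config, indent=2))
```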

Path Patterns

This connector supports glob-style patterns for matching multiple files, eliminating the need to list each file path individually.
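To illustrate how such patterns behave, the sketch below translates a simplified glob into a regular expression, where ** crosses path-segment boundaries and a single * stays within one segment. This is a sketch of typical glob semantics, not the connector's exact implementation.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob pattern into an anchored regex.

    '**' matches any characters, including '/'; a single '*' matches
    within one path segment; '?' matches one non-slash character.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            parts.append(".*")        # ** spans directory separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # * stays within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")

matcher = glob_to_regex("exports/**/*.csv")
print(bool(matcher.match("exports/2024/01/data.csv")))  # True
print(bool(matcher.match("exports/data.json")))         # False
```

With the ** pattern alone, every file in the container matches, which is why the setup steps suggest it for full replication.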


User Schema

By specifying a schema, you gain better control over the output of the stream. If no schema is provided, the column names and data types will be inferred from the first matching file in the container. You may want to enforce a schema when, for example, the first file's columns or value types are not representative of the rest of the files.
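The sketch below shows why first-file inference can be fragile, assuming CSV input: types are guessed from the first row's values alone, so an explicit schema is safer when later files differ. This is an illustrative simplification, not the connector's actual inference logic.

```python
import csv
import io

def infer_schema(csv_text: str) -> dict:
    """Guess column types from the first data row of a CSV file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)
    schema = {}
    for column, value in first_row.items():
        try:
            int(value)
            schema[column] = "integer"
        except ValueError:
            try:
                float(value)
                schema[column] = "number"
            except ValueError:
                schema[column] = "string"
    return schema

sample = "id,price,sku\n1,9.99,AB-100\n"
print(infer_schema(sample))  # {'id': 'integer', 'price': 'number', 'sku': 'string'}
```

If a later file held `id` values like `A17`, the inferred `integer` type would no longer fit, which is the kind of case an explicit Input schema avoids.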

File Format Settings

CSV

Parquet

Parquet is a columnar data format providing efficient compression and encoding. Currently, partitioned Parquet datasets are unsupported.

Avro

Avro uses the Fastavro library for parsing.

JSONL

No additional options are available for parsing JSONL files.
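JSONL needs no extra options because the format is self-describing: each line is a standalone JSON object. A minimal parsing sketch, using only the standard library:

```python
import json

def parse_jsonl(text: str) -> list:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_jsonl('{"id": 1}\n{"id": 2}\n')
print(records)  # [{'id': 1}, {'id': 2}]
```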