Azure Blob Storage
Prerequisites
- Tenant ID for the Microsoft Azure Application user
- Azure Blob Storage account name
- Name of the Azure Blob Storage container (bucket)
Steps to Set Up Azure Blob Storage Source
- On the Set up the source page, select Azure Blob Storage from the dropdown under Source type.
- Provide a name for the Azure Blob Storage connector to easily identify it.
- Enter the Azure Storage Account name and container name associated with your Azure Blob Storage.
- Choose your preferred Authentication method (a credential-verification sketch follows these steps):
i. If you are using a Storage Account Key, select Authenticate via Storage Account Key and provide the key.
ii. If you are using a Service Principal, select Authenticate via Client Credentials.
iii. Review the IAM role bindings for the Service Principal and gather app registration details as needed.
iv. Enter the Directory (tenant) ID from Azure Portal into the Tenant ID field.
v. Provide the Application (client) ID from Azure Portal into the Client ID field (note: this is not the secret ID).
vi. Enter the Secret Value from the Azure Portal into the Client Secret field.
- Add a stream:
i. Specify the File Type.
ii. Select the file format for replication in the Format field using the dropdown. Supported formats include CSV, Parquet, Avro, and JSONL. Additional configuration options for each format are available by toggling the Optional fields button. For more information, see the File Format section.
iii. Assign a name to the stream.
iv. (Optional) Provide an Input schema if you want to enforce a specific schema; otherwise, it will be inferred automatically. By default, the schema is set to `{}`. For more details, refer to the User Schema section.
v. (Optional) Specify Globs for selecting the files to replicate. This is a glob-style pattern for file matching. To replicate all files in the container, use `**` as the pattern. See the Path Patterns section for more details.
- (Optional) Enter the endpoint for data replication.
- (Optional) Provide the Start Date for data replication to begin. This date must be in YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ format.
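If you want to sanity-check the account name, container, and credentials before configuring the connector, the Azure SDK for Python can list a few blobs directly. This is a minimal sketch, not part of the connector; the account name, container name, and all credential values below are placeholders you would replace with your own.

```python
# pip install azure-storage-blob azure-identity
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_NAME = "mystorageaccount"  # placeholder: your Storage Account name
CONTAINER = "mycontainer"          # placeholder: your container name
ACCOUNT_URL = f"https://{ACCOUNT_NAME}.blob.core.windows.net"

# Use one of the following, matching your chosen Authentication method.

# Option A: Storage Account Key
service = BlobServiceClient(
    account_url=ACCOUNT_URL,
    credential="<storage-account-key>",  # placeholder
)

# Option B: Service Principal (Client Credentials)
credential = ClientSecretCredential(
    tenant_id="<directory-tenant-id>",      # Directory (tenant) ID
    client_id="<application-client-id>",    # Application (client) ID, not the secret ID
    client_secret="<client-secret-value>",  # Secret Value from the app registration
)
service = BlobServiceClient(account_url=ACCOUNT_URL, credential=credential)

# List a blob to verify access before setting up the source
container = service.get_container_client(CONTAINER)
for blob in container.list_blobs():
    print(blob.name)
    break
```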
Path Patterns
This connector supports using glob-style patterns to match multiple files, eliminating the need for specific file paths. This allows:
- Referencing all files with one pattern (e.g., `**` matches every file in the container).
- Referencing files that may not exist yet but will in the future.
The path pattern must be provided, and multiple patterns can be combined using `|` for more complex folder structures. The pattern should always start from the root of the container (the container name itself is excluded).
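To preview which paths a pattern would match before running a sync, you can approximate the matching locally. The sketch below assumes the `wcmatch` package and hypothetical blob paths; it illustrates glob semantics with `**` and `|`, and may not match the connector's behavior in every edge case.

```python
# pip install wcmatch
from wcmatch import glob

# Hypothetical blob paths, relative to the container root
paths = [
    "logs/2024/01/events.csv",
    "logs/2024/02/events.csv",
    "exports/users.parquet",
]

# Multiple patterns combined with | cover more complex folder structures
pattern = "logs/**/*.csv|exports/*.parquet"

for p in paths:
    # GLOBSTAR lets ** cross directory boundaries; SPLIT honors the | separator
    if glob.globmatch(p, pattern, flags=glob.GLOBSTAR | glob.SPLIT):
        print("matched:", p)
```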
User Schema
By specifying a schema, you gain better control over the output of the stream. If no schema is provided, the column names and data types will be inferred from the first matching file in the container. There are cases where you may want to enforce a schema, such as:
- Focusing on specific columns while the others are included in the `_ab_additional_properties` map.
- Handling a small initial dataset and ensuring that type inference is accurate for larger future datasets.
- Defining types for all columns manually.
- Anticipating future columns and including them in the schema rather than in the `_ab_additional_properties` map.
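For illustration, a user-provided schema is a JSON object mapping column names to types. The column names below are hypothetical, and the exact set of accepted type names is an assumption to verify against the connector's documentation.

```python
import json

# Hypothetical columns, mapped "column name" -> "type"
input_schema = json.dumps({
    "id": "integer",
    "created_at": "string",
    "amount": "number",
    "is_active": "boolean",
})
print(input_schema)  # paste this value into the Input schema field
```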
File Format Settings
CSV
- Header Definition: Choose whether the file has a header row or not. The options are:
- User Provided: No header row; headers will be supplied manually.
- Autogenerated: No header row; headers are automatically generated as `f{i}`, where `i` is the column index.
- Default: Uses the header from the file.
- Delimiter: Specifies the character that separates the values in the CSV (default is a comma, `,`). For tab-delimited files, use `\t`.
- Double Quote: Determines if two quotes represent a single quote in the data (default is True).
- Encoding: Specify the character set (default is utf8).
- Escape Character: If needed, use an escape character (such as a backslash, `\`) to prefix reserved characters for proper parsing.
- False Values: Specify strings (e.g., “false”) that should be interpreted as false.
- Null Values: Specify strings (e.g., “NA”) to be interpreted as null values.
- Quote Character: Wrap values in quotes to avoid confusion with reserved characters. Default is `"` (double quote).
- Skip Rows After Header: Skips a number of rows following the header.
- Skip Rows Before Header: Skips a number of rows before the header.
- Strings Can Be Null: If true, strings matching the null value set will be treated as null.
- True Values: Specify case-sensitive strings (e.g., “true”) to be interpreted as true.
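As a rough analogy only (this is not the connector's implementation), the settings above map onto familiar parameters of pandas' CSV reader; the file name and the value lists below are placeholders.

```python
import pandas as pd

df = pd.read_csv(
    "data.csv",              # placeholder path
    sep=",",                 # Delimiter (use "\t" for tab-delimited files)
    quotechar='"',           # Quote Character
    escapechar="\\",         # Escape Character
    doublequote=True,        # Double Quote
    encoding="utf8",         # Encoding
    header=0,                # Default: take headers from the file's first row
    skiprows=0,              # Skip Rows Before Header
    true_values=["true"],    # True Values
    false_values=["false"],  # False Values
    na_values=["NA"],        # Null Values
)
print(df.dtypes)
```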
Parquet
Parquet is a columnar data format providing efficient compression and encoding. Partitioned Parquet datasets are currently unsupported.
- Convert Decimal Fields to Floats: Whether to convert decimal fields to floats. Not recommended due to potential loss of precision.
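To see what this option implies, the pyarrow sketch below casts decimal columns to float64 after reading a file; the path is a placeholder, and this mirrors the described behavior rather than the connector's actual code.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")  # placeholder path

# Cast any decimal columns to float64; note the possible loss of precision
fields = [
    pa.field(f.name, pa.float64()) if pa.types.is_decimal(f.type) else f
    for f in table.schema
]
table = table.cast(pa.schema(fields))
print(table.schema)
```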
Avro
Avro parsing uses the fastavro library.
- Convert Double Fields to Strings: Converts double fields to strings, useful for preserving precision when dealing with high-precision decimals.
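For reference, reading an Avro file with fastavro looks like this; the file path is a placeholder.

```python
# pip install fastavro
from fastavro import reader

with open("data.avro", "rb") as f:  # placeholder path
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # the schema embedded in the file
    for record in avro_reader:
        print(record)  # each record is a plain dict
        break
```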
JSONL
No additional options are available for parsing JSONL files.