AWS S3
Prerequisites
- Access to the Amazon S3 bucket where the files to be replicated are stored.
- For private buckets, you must have an AWS account authorized to grant read permissions on the bucket.
Configuring the S3 Connector in xGen
- On the Set up the source screen, select S3 from the Source type dropdown menu.
- Provide a name for your S3 connector.
- Select the preferred data delivery method.
- Specify the name of the S3 bucket containing the files to replicate.
- Add a new stream:
i. Select the File Format.
ii. Use the dropdown in the Format box to choose the file type you want to replicate. Supported formats include CSV, Parquet, Avro, and JSONL. You can toggle the Optional fields button within this box to configure additional settings specific to the chosen format. See the File Format section below for more details.
iii. Assign a name to the stream.
iv. (Optional) Provide Globs, one or more glob-style patterns that specify which files to sync. Use ** to replicate all files in the bucket. For more advanced pattern matching, refer to the Globs section.
v. (Optional) Adjust the Days To Sync If History Is Full setting, which controls the lookback period used to determine files to sync when the state history reaches capacity. More information is available in the State section below.
vi. (Optional) Enter an Input schema if you want to enforce a specific schema. By default, this is {}, and the schema is inferred automatically from the files. See the User Schema section for guidance on custom schemas.
vii. (Optional) Enable Schemaless mode to skip validation against a schema. If selected, the schema defaults to {"data": "object"}, and all data will be nested under a "data" field. This is useful when your record schemas change frequently.
viii. (Optional) Choose a Validation Policy to define how xGen handles records that don’t match the schema. Options include emitting the record anyway (with possible missing fields at the destination), skipping the record, or deferring processing until the next discovery cycle (within 24 hours).
- To authenticate with a private bucket:
- If using an IAM role, enter the AWS Role ARN.
- If using IAM user credentials, provide the AWS Access Key ID and AWS Secret Access Key.
Globs
This connector supports syncing multiple files using glob-style patterns rather than requiring explicit paths for every file. This allows you to:
- Reference many files with a single pattern, e.g., ** to indicate all files in the bucket.
- Include files created in the future, even if they don’t exist yet.
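These matching semantics can be sketched with a small translator (a hypothetical helper for illustration only, not the connector's actual matcher), in which ** crosses / boundaries while * and ? stay within a single path segment:

```python
import re

def glob_to_regex(pattern: str) -> str:
    # Translate a glob pattern into a regex: '**' matches across '/'
    # boundaries, '*' and '?' do not. Everything else is taken literally.
    i, out = 0, []
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"

def matches(key: str, pattern: str) -> bool:
    return re.match(glob_to_regex(pattern), key) is not None

print(matches("logs/2024/data.csv", "**"))            # True: '**' matches everything
print(matches("logs/2024/data.csv", "logs/**/*.csv")) # True
print(matches("logs/data.csv", "*.csv"))              # False: '*' cannot cross '/'
```

Because a single `*` stops at directory separators, a pattern like `*.csv` only matches CSV files at the top level of the bucket, while `**/*.csv` reaches files in any subdirectory.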
State
For incremental syncing, xGen processes files in chronological order, from oldest to newest. Up to 10,000 files are tracked in the connection’s “history” state. When the history is full, older entries are dropped, and only files modified within the Days To Sync If History Is Full lookback window (counting backwards from the newest file’s modification date) are synced.
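The lookback rule can be sketched as follows (the function and parameter names are invented for this example and are not the connector's internal API):

```python
from datetime import datetime, timedelta

def files_to_sync(history, days_to_sync_if_history_is_full, history_capacity=10_000):
    """Apply the lookback rule: once the history is full, sync only files
    modified within the configured window, counting back from the newest
    file's modification date. `history` is a list of (key, last_modified)
    tuples, oldest to newest."""
    if len(history) < history_capacity:
        return history  # history not yet full: no lookback applies
    newest = max(modified for _, modified in history)
    cutoff = newest - timedelta(days=days_to_sync_if_history_is_full)
    return [(key, modified) for key, modified in history if modified >= cutoff]

# A tiny capacity so the "history full" branch triggers in this example.
history = [
    ("old.csv", datetime(2024, 1, 1)),
    ("mid.csv", datetime(2024, 1, 5)),
    ("new.csv", datetime(2024, 1, 10)),
]
print(files_to_sync(history, 3, history_capacity=3))  # only new.csv is within 3 days of Jan 10
```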
S3 Provider Settings
- AWS Access Key ID: Part one of the credentials needed for accessing private buckets.
- AWS Secret Access Key: The corresponding secret key for authentication.
- Endpoint: Optional field used to specify non-Amazon S3 compatible services. Leave blank to use the default Amazon endpoint.
- Start Date: Optional UTC timestamp to specify a replication start point. Files modified before this date/time will be excluded. Format: YYYY-MM-DDTHH:mm:ssZ. Leaving this blank replicates all files except those excluded by path patterns or prefixes.
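The Start Date filter amounts to a simple timestamp comparison; a minimal sketch (the helper name is invented for illustration):

```python
from datetime import datetime, timezone
from typing import Optional

def after_start_date(last_modified: datetime, start_date: Optional[str]) -> bool:
    """Return True if a file should be replicated, given an optional
    Start Date in the YYYY-MM-DDTHH:mm:ssZ format described above."""
    if not start_date:
        return True  # blank Start Date: replicate all files
    start = datetime.strptime(start_date, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return last_modified >= start

print(after_start_date(datetime(2024, 6, 1, tzinfo=timezone.utc), "2024-01-01T00:00:00Z"))  # True
```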
File Format Settings
CSV
CSV files are plain text and often require precise reader settings for accurate parsing. These options should align with how the CSV files are created or exported to ensure consistency over time. Key settings include:
- Header Definition: Defines header behavior. Options are:
- User Provided: Assumes no header row in the CSV and uses user-supplied headers.
- Autogenerated: Assumes no header row and generates headers as f{i} (with i starting at 0).
- Default: Uses the CSV file’s header row. Users can set the “Skip rows before header” option to ignore headers if necessary.
- Delimiter: Specifies the character that separates fields. Default is a comma (,), but you can use \t for tab delimiters.
- Double Quote: Controls whether two double quotes inside a quoted value represent a single quote. Defaults to true.
- Encoding: Character encoding for the file; default is utf8.
- Escape Character: Defines a prefix for reserved characters to ensure proper parsing, commonly a backslash (\). If left blank, escaping is disabled.
- False Values: List of case-sensitive strings to interpret as false.
- Null Values: List of case-sensitive strings to interpret as null.
- Quote Character: Character used to wrap fields containing delimiters. Defaults to the double quote (").
- Skip Rows After Header: Number of rows to skip immediately after the header.
- Skip Rows Before Header: Number of rows to skip before the header.
- Strings Can Be Null: Whether strings matching an entry in Null Values are treated as null or kept as literal strings.
- True Values: List of case-sensitive strings to interpret as true.
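To see how several of these settings interact, here is a sketch using Python's standard csv module with a tab delimiter and invented null/boolean token lists (the sample data and token sets are illustrative, not defaults of the connector):

```python
import csv
import io

# Sample tab-delimited CSV; the reader options mirror the settings above.
raw = "id\tname\tactive\n1\tAlice\tyes\n2\tNULL\tno\n"
reader = csv.DictReader(
    io.StringIO(raw),
    delimiter="\t",    # Delimiter
    quotechar='"',     # Quote Character
    doublequote=True,  # Double Quote
    escapechar=None,   # Escape Character disabled
)

NULL_VALUES = {"NULL", ""}      # Null Values (case-sensitive)
TRUE_VALUES = {"yes", "true"}   # True Values
FALSE_VALUES = {"no", "false"}  # False Values

def coerce(value):
    # Map raw strings to null/booleans according to the token lists.
    if value in NULL_VALUES:
        return None
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    return value

rows = [{k: coerce(v) for k, v in row.items()} for row in reader]
print(rows)
# [{'id': '1', 'name': 'Alice', 'active': True}, {'id': '2', 'name': None, 'active': False}]
```

Keeping these reader options aligned with however the CSVs are exported is what prevents parsing drift as new files arrive.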
Parquet
Parquet is a columnar data format designed for efficient storage and processing. Partitioned Parquet datasets are currently unsupported. Available option:
- Convert Decimal Fields to Floats: Converts decimal types to floats, which can cause precision loss, so use cautiously.
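The precision warning can be seen with Python's decimal module (a generic illustration of decimal-to-float rounding, independent of any Parquet library):

```python
from decimal import Decimal

exact = Decimal("1234567890.123456789")  # 19 significant digits
as_float = float(exact)                  # rounded to the nearest 64-bit float
# A double carries roughly 15-17 significant digits, so the round trip
# through float no longer equals the original decimal value.
print(Decimal(str(as_float)) == exact)   # False
```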
Avro
Uses the Fastavro library. Available option:
- Convert Double Fields to Strings: Recommended to prevent precision loss when working with high-precision decimals.
JSONL
Currently, no configurable parsing options are available.