AWS S3

Prerequisites

Configuring the S3 Connector in xGen

  1. On the Set up the source screen, select S3 from the Source type dropdown menu.
  2. Provide a name for your S3 connector.
  3. Select the preferred data delivery method.
  4. Specify the name of the S3 bucket containing the files to replicate.
  5. Add a new stream:
    i. Select the File Format.
    ii. Use the dropdown in the Format box to choose the file type you want to replicate. Supported formats include CSV, Parquet, Avro, and JSONL. You can toggle the Optional fields button within this box to configure additional settings specific to the chosen format. See the File Format section below for more details.
    iii. Assign a name to the stream.
    iv. (Optional) Provide Globs, a glob-style path pattern, to specify which files to sync. Use ** to replicate all files in the bucket. For more advanced pattern matching, refer to the Globs section.
    v. (Optional) Adjust the Days To Sync If History Is Full setting, which controls the lookback period used to determine files to sync when the state history reaches capacity. More information is available in the State section below.
    vi. (Optional) Enter an Input schema if you want to enforce a specific schema. By default, this is {}, and the schema is inferred automatically from the files. See the User Schema section for guidance on custom schemas.
    vii. (Optional) Enable Schemaless mode to skip validation against a schema. If selected, the schema defaults to {"data": "object"}, and all data will be nested under a "data" field. This is useful when your record schemas change frequently.
    viii. (Optional) Choose a Validation Policy to define how xGen handles records that don’t match the schema. Options include emitting the record anyway (with possible missing fields at the destination), skipping the record, or deferring processing until the next discovery cycle (within 24 hours).
  6. To authenticate with a private bucket:
    • If using an IAM role, enter the AWS Role ARN.
    • If using IAM user credentials, provide the AWS Access Key ID and AWS Secret Access Key.
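
The Schemaless and Validation Policy options in step 5 can be pictured with the minimal Python sketch below. It is not xGen's implementation: the names ValidationPolicy, SchemaMismatch, conforms, and process are hypothetical, the type check is deliberately loose, and the flat {"field": "type"} schema shape is assumed from the {"data": "object"} default mentioned in step 5.vii.

```python
from enum import Enum


class ValidationPolicy(Enum):
    # Hypothetical names for the three options described in step 5.viii.
    EMIT_RECORD = "emit"
    SKIP_RECORD = "skip"
    WAIT_FOR_DISCOVER = "wait"


class SchemaMismatch(Exception):
    """Raised in this sketch to signal that processing should be deferred."""


# Assumed mapping from schema type names to Python types (illustrative only).
TYPES = {"string": str, "integer": int, "number": (int, float),
         "boolean": bool, "object": dict}


def conforms(record, schema):
    """Loose check: every schema field is present with the expected type."""
    return all(
        field in record and isinstance(record[field], TYPES[type_name])
        for field, type_name in schema.items()
    )


def process(record, schema, policy, schemaless=False):
    """Return the record to emit, or None if the record is skipped."""
    if schemaless:
        # No validation; the whole record is nested under a "data" field.
        return {"data": record}
    if conforms(record, schema):
        return record
    if policy is ValidationPolicy.EMIT_RECORD:
        return record  # emitted anyway; fields may be missing downstream
    if policy is ValidationPolicy.SKIP_RECORD:
        return None
    raise SchemaMismatch("defer until the next discovery cycle")
```

With a schema of {"id": "integer", "name": "string"}, the record {"id": 1} is returned unchanged under the emit policy, dropped under the skip policy, and wrapped as {"data": {"id": 1}} when schemaless is enabled.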

Globs

This connector supports syncing multiple files using glob-style patterns, so you do not need to list an explicit path for every file.
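
As a rough illustration of glob-style selection, the Python sketch below translates a pattern into a regular expression and filters a list of object keys. The helper names and the sample keys are made up for this example, and the wildcard rules it assumes (** crosses directory boundaries, * and ? do not) are common glob conventions, not a statement of how the connector's own matcher behaves.

```python
import re


def glob_to_regex(pattern):
    """Translate a simple glob into a compiled regex (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")        # anything, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def match_keys(pattern, keys):
    """Return the keys that match the glob, preserving input order."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.match(key)]


# Hypothetical object keys, for illustration only.
KEYS = ["a.csv", "2024/a.csv", "2024/01/b.csv", "2024/notes.txt"]

print(match_keys("**", KEYS))          # every key
print(match_keys("*.csv", KEYS))       # ['a.csv']
print(match_keys("**/*.csv", KEYS))    # all three .csv keys
print(match_keys("2024/*.csv", KEYS))  # ['2024/a.csv']
```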

State

For incremental syncing, xGen processes files in chronological order, from oldest to newest. Up to 10,000 files are tracked in the connection’s “history” state. When the history is full, older entries are dropped, and xGen syncs only files whose last-modified date falls within the Days To Sync If History Is Full window, counted back from the newest file’s date.
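
The selection rule can be sketched in Python as follows. This is an interpretation of the paragraph above, not xGen's code: select_files and its parameters are hypothetical, and the sketch assumes that the "newest file" anchoring the lookback window is the newest file recorded in the history.

```python
from datetime import datetime, timedelta

MAX_HISTORY = 10_000  # history capacity stated in the text


def select_files(candidates, history, days_to_sync, max_history=MAX_HISTORY):
    """Return the names of files to sync, oldest first.

    candidates: dict of file name -> last-modified datetime (bucket contents)
    history:    dict of file name -> last-modified datetime (already synced)
    """
    # A file is pending if it is unseen or was modified after it was synced.
    pending = {
        name: modified
        for name, modified in candidates.items()
        if name not in history or modified > history[name]
    }

    if len(history) >= max_history:
        # Assumption: the window is counted back from the newest history entry.
        newest = max(history.values())
        cutoff = newest - timedelta(days=days_to_sync)
        pending = {n: m for n, m in pending.items() if m >= cutoff}

    # Process in chronological order; the name breaks ties deterministically.
    return sorted(pending, key=lambda name: (pending[name], name))
```

For example, with a full history whose newest entry is dated 10 January and a three-day window, a pending file modified on 8 January is selected, while one modified on 1 January is not.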

S3 Provider Settings

File Format Settings

CSV
CSV files are plain text and often require precise reader settings for accurate parsing. These options should align with how the CSV files are created or exported to ensure consistency over time. Key settings include:
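
As a generic illustration of why reader settings must match the way a file was written, the sketch below parses the same semicolon-delimited text twice with Python's standard csv module. It does not use the connector's reader, and the sample data and the parse helper are invented for this example.

```python
import csv
import io

# Sample export that uses ";" as the delimiter and '"' for quoting.
RAW = 'id;name;note\n1;Ada;"likes; semicolons"\n2;Grace;plain\n'


def parse(text, delimiter, quotechar='"'):
    """Parse CSV text into a list of rows with the given reader settings."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return list(reader)


wrong = parse(RAW, delimiter=",")  # mismatched delimiter
right = parse(RAW, delimiter=";")  # settings match the export

print([len(row) for row in wrong])  # [1, 1, 1]: each line becomes one column
print(right[1])                     # ['1', 'Ada', 'likes; semicolons']
```

With the wrong delimiter nothing fails outright; every line is silently read as a single column, which is why the settings should stay aligned with how the files are produced.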

Parquet
Parquet is a columnar data format designed for efficient storage and processing. Partitioned Parquet datasets are currently unsupported. Available option:

Avro
Avro files are read with the fastavro library. Available option:

JSONL
Currently, no configurable parsing options are available.