S3 destination for batch exports

With batch exports, data can be exported to an S3 bucket.

Models

This section describes the models that can be exported to S3.

Note: New fields may be added to these models over time, so any downstream process should tolerate additional fields appearing in the exported files.

Events model

This is the default model for S3 batch exports. When exported in the Parquet file format, the schema is:

Field	Type	Description
uuid	`STRING`	The unique ID of the event within PostHog
event	`STRING`	The name of the event that was sent
distinct_id	`STRING`	The `distinct_id` of the user who sent the event
person_id	`STRING`	The ID of the person the event is attributed to, resolved at ingestion time
properties	`STRING`	A JSON object with all the properties sent along with the event, stored as a JSON-encoded string in Parquet
person_properties	`STRING`	A JSON object with the person's properties at ingestion time, stored as a JSON-encoded string in Parquet
elements_chain	`STRING`	The chain of DOM elements for `$autocapture` events. Empty for other events
timestamp	`TIMESTAMP`	When the event occurred, as reported by the client. Microsecond precision, UTC
created_at	`TIMESTAMP`	When PostHog ingested the event. Microsecond precision, UTC
_inserted_at	`TIMESTAMP`	Internal field used by batch exports to track export progress. Included in files but safe to ignore

Note: The types above describe the Parquet file format. Other file formats represent the same data differently. For example, in JSONLines exports the timestamp fields (timestamp, created_at, and _inserted_at) are ISO 8601 strings, and properties and person_properties are nested JSON objects rather than JSON-encoded strings.
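Because the two file formats represent the same fields differently, a consumer that handles both may want to normalize rows before use. The following is a minimal sketch, not part of PostHog: `normalize_event` is a hypothetical helper, and it assumes each row has already been loaded into a Python dict by whatever reader you use.

```python
import json
from datetime import datetime, timezone

JSON_FIELDS = ("properties", "person_properties")
TIMESTAMP_FIELDS = ("timestamp", "created_at", "_inserted_at")


def normalize_event(row: dict) -> dict:
    """Return a copy of an exported event row with uniform Python types."""
    out = dict(row)
    # Parquet exports hold these as JSON-encoded strings; JSONLines nests them.
    for field in JSON_FIELDS:
        value = out.get(field)
        if isinstance(value, str):
            out[field] = json.loads(value) if value else {}
    # JSONLines exports carry ISO 8601 strings for the timestamp fields.
    for field in TIMESTAMP_FIELDS:
        value = out.get(field)
        if isinstance(value, str):
            # Older Python versions reject a trailing "Z" in fromisoformat.
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
            if parsed.tzinfo is None:
                # The schema above documents these fields as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            out[field] = parsed
    return out
```

Fields that are not listed, including any added in the future, pass through unchanged.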

Persons model

The schema of the persons model when exported in the Parquet file format is:

Field	Type	Description
team_id	`BIGINT`	The ID of the project (team) the person belongs to
distinct_id	`STRING`	A `distinct_id` associated with the person
person_id	`STRING`	The ID of the person for this (`team_id`, `distinct_id`) pair
properties	`STRING`	A JSON object with the latest person properties, stored as a JSON-encoded string in Parquet
person_distinct_id_version	`BIGINT`	Internal version of the person-to-`distinct_id` mapping, used by batch exports during merges
person_version	`BIGINT`	Internal version of the person's properties, used by batch exports during merges
created_at	`TIMESTAMP`	When the person was created. Microsecond precision, UTC
_inserted_at	`TIMESTAMP`	Internal field used by batch exports to track export progress. Included in files but safe to ignore
is_deleted	`BOOLEAN`	Whether the person has been deleted

Each export contains one row per (team_id, distinct_id) pair, mapped to their corresponding person_id and latest properties.

Note: The persons model only includes persons that have a person profile in PostHog. If your project has person profile processing disabled (via person_profiles: 'identified_only', person_profiles: 'never', or by sending events with $process_person_profile: false), anonymous users who have never been identified will not appear in the persons export. To count unique users including those without person profiles, you can fall back to distinct_id from the events model. See the example queries in each destination's documentation for details.

Note: As with the events model, these types describe the Parquet file format. In JSONLines exports, created_at and _inserted_at are ISO 8601 strings and properties is a nested JSON object rather than a JSON-encoded string.
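The fallback to distinct_id described above can be sketched as follows. This is an illustrative example only: `count_unique_users` is a hypothetical helper, rows are assumed to be dicts already loaded from the exported files, and both exports are assumed to come from the same project (the events schema above has no team_id, so the lookup is keyed on distinct_id alone).

```python
def count_unique_users(events, persons):
    """Count users across events, preferring person_id where a profile exists."""
    # Persons export: one row per (team_id, distinct_id) pair.
    person_by_distinct_id = {p["distinct_id"]: p["person_id"] for p in persons}

    users = set()
    for event in events:
        distinct_id = event["distinct_id"]
        person_id = person_by_distinct_id.get(distinct_id)
        if person_id is not None:
            # Several distinct_ids can map to one person; count the person once.
            users.add(("person", person_id))
        else:
            # No person profile: fall back to the raw distinct_id.
            users.add(("distinct_id", distinct_id))
    return len(users)
```

Tagging each entry with its kind keeps a person_id from colliding with a distinct_id that happens to share the same string value.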

Sessions model

You can view the schema for the sessions model in the configuration form when creating a batch export (there are a few too many fields to display here!).

Creating the batch export

Click Data management > Destinations in the left sidebar.
Click + New destination in the top-right corner.
Search for S3.
Click the + Create button.
Fill in the necessary configuration details.
Finalize the creation by clicking on "Create".
Done! The batch export will schedule its first run at the start of the next period.

S3 configuration

Configuring a batch export targeting S3 requires the following S3-specific configuration values:

Bucket name: The name of the S3 bucket where the data is to be exported.
Region: The AWS region where the bucket is located.
Key prefix: A key prefix to use for each S3 object created. This key prefix can include template variables.
Format: Select a file format to use in the export. See the S3 file formats section for details on which file formats are supported.
Max file size (MiB): If the size of the exported data exceeds this value, the data is split into multiple files. The limit is approximate, so actual files may be slightly larger. If this value is not set, or is set to 0, the data is exported as a single file.
Compression: Select a compression method (like gzip) to use for exported files or no compression.
Encryption: Select a server-side encryption method (AES256 or aws:kms) for AWS to encrypt data at rest.
AWS Access Key ID (required): An AWS access key ID with access to the S3 bucket.
AWS Secret Access Key (required): An AWS secret access key with access to the S3 bucket.
AWS KMS Key ID: The AWS KMS Key ID to use for server-side encryption. Only required when selecting aws:kms encryption.
Events to exclude: A list of events to omit from the exported data.
Endpoint URL: Required if exporting to an S3-compatible blob storage. Must resolve to a publicly accessible address (internal or private IPs are not allowed).
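Some of the values above depend on each other, so it can help to check a configuration before submitting it. The sketch below is an assumption-laden illustration, not PostHog's own validation: `validate_s3_config` and the dictionary keys (`aws_access_key_id`, `kms_key_id`, `endpoint_url`, and so on) are hypothetical names chosen for this example. It also only inspects literal IP addresses in the endpoint URL; checking what a hostname resolves to would need a DNS lookup, which is left out.

```python
import ipaddress
from urllib.parse import urlparse


def validate_s3_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # The two credentials are the values marked as required above.
    for key in ("aws_access_key_id", "aws_secret_access_key"):
        if not config.get(key):
            problems.append(f"{key} is required")

    # A KMS key ID is only needed for aws:kms encryption.
    if config.get("encryption") == "aws:kms" and not config.get("kms_key_id"):
        problems.append("kms_key_id is required when encryption is aws:kms")

    endpoint_url = config.get("endpoint_url")
    if endpoint_url:
        host = urlparse(endpoint_url).hostname
        if host is None:
            problems.append("endpoint_url is not a valid URL")
        else:
            try:
                address = ipaddress.ip_address(host)
            except ValueError:
                address = None  # A hostname; resolution is not checked here.
            if address is not None and address.is_private:
                problems.append("endpoint_url must not be a private address")
    return problems
```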

S3 key prefix template variables

The key prefix provided for data exporting can include template variables, which are formatted at runtime. All template variables are defined between curly brackets (for example {day}). This allows you to partition files in your S3 bucket, such as by date.

Template variables include:

Date and time variables:
- year.
- month.
- day.
- hour.
- minute.
- second.
Name of the table exported (for example, 'events' or 'persons'):
- table.
Batch export data bounds:
- data_interval_start.
- data_interval_end.

For example, setting {year}-{month}-{day}_{table}/ as the key prefix produces files with keys that start like 2023-07-28_events/.
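To preview what a key prefix will look like, the substitution can be approximated locally. This is a rough sketch rather than the exporter's actual code: `render_key_prefix` is a hypothetical helper, the instant that the date and time variables refer to is passed in explicitly as `at` because it is not specified here, and rendering the data bounds as ISO 8601 strings is an assumption.

```python
from datetime import datetime, timezone


def render_key_prefix(
    template: str,
    table: str,
    at: datetime,
    data_interval_start: datetime,
    data_interval_end: datetime,
) -> str:
    """Fill the curly-bracket template variables of a key prefix."""
    variables = {
        # Date and time variables, zero-padded.
        "year": f"{at:%Y}",
        "month": f"{at:%m}",
        "day": f"{at:%d}",
        "hour": f"{at:%H}",
        "minute": f"{at:%M}",
        "second": f"{at:%S}",
        # Name of the table exported.
        "table": table,
        # Batch export data bounds (ISO 8601 rendering is assumed).
        "data_interval_start": data_interval_start.isoformat(),
        "data_interval_end": data_interval_end.isoformat(),
    }
    return template.format(**variables)


start = datetime(2023, 7, 28, 0, 0, tzinfo=timezone.utc)
end = datetime(2023, 7, 28, 1, 0, tzinfo=timezone.utc)
prefix = render_key_prefix("{year}-{month}-{day}_{table}/", "events", end, start, end)
print(prefix)  # 2023-07-28_events/
```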

S3 file formats

PostHog S3 batch exports support two file formats for exporting data:

JSON lines.
Apache Parquet (latest version of the format specification is the only one supported).

The batch export format is selected from a dropdown menu when creating or editing an export.

We intend to add support for other common formats, and format-specific configuration options. You can follow the roadmap to track progress.

Compression

Each file format supports a variety of compression methods. The compression method you choose can have a significant effect on the exported file size and the overall time taken to export the data. Based on our internal testing, we recommend Parquet with zstd compression for the best combination of speed and file size.

Note on Parquet compression: The compression type is included in the file extension, even for Parquet files. For example, files compressed with zstd have the extension parquet.zst. Because compression is embedded in the Parquet format itself, read the file directly as a Parquet file; do not decompress it first.
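JSON lines exports are a different case: there the compression is applied to the file as a whole, so a reader has to decompress while reading. A minimal sketch using only the Python standard library, assuming gzip was selected as the compression method (`read_gzip_jsonl` is a hypothetical helper name):

```python
import gzip
import json


def read_gzip_jsonl(source):
    """Yield one dict per line from a gzip-compressed JSON lines file.

    `source` may be a file path or a binary file-like object.
    """
    with gzip.open(source, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # Skip blank lines defensively.
                yield json.loads(line)
```

Reading line by line keeps memory use flat even for large exported files.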

Manifest file

If you specify a max file size in your configuration, several files may be exported. In order to know when the export is complete, we send a manifest.json file (with the same prefix as the other files) once all the data files have been exported. This manifest file contains the key names of all the files exported.
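A downstream job can use the manifest as its completion signal: look for manifest.json among the keys under the prefix, then read the list of data files from it. The sketch below is illustrative only. `find_manifest` and `data_file_keys` are hypothetical helpers, and the assumption that the manifest is a JSON object with a `files` array is not confirmed by this page, so inspect a real manifest before relying on it.

```python
import json


def find_manifest(keys):
    """Return the manifest key among the listed object keys, or None if absent."""
    for key in keys:
        if key.endswith("manifest.json"):
            return key
    return None  # Export still in progress (or no max file size configured).


def data_file_keys(manifest_body):
    """Extract the exported file keys from the manifest contents."""
    manifest = json.loads(manifest_body)
    # "files" is an assumed field name; adjust to the actual manifest layout.
    return list(manifest["files"])
```

Listing the bucket and downloading the manifest body are left to whichever S3 client you already use.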

S3-compatible blob storage

PostHog S3 batch exports may also export data to an S3-compatible blob storage like MinIO, Cloudflare R2, or Google Cloud Storage (GCS). Here we describe configuration tweaks that are required for S3-compatible blob storage destinations that we have tested.

MinIO

Set the Endpoint URL configuration to your MinIO instance's host and port, for example: https://my-minio-storage:9000.

Cloudflare R2

Set the Endpoint URL configuration to the following, replacing <ACCOUNT_ID> with your account ID: https://<ACCOUNT_ID>.r2.cloudflarestorage.com.
From the Region dropdown, select the Cloudflare R2 region that corresponds to your bucket, like "Automatic (AUTO)".

Google Cloud Storage (GCS)

Accessing GCS for batch exports is similar to accessing BigQuery, because a Service Account is required:

Follow the steps in the BigQuery batch export documentation to create a Service Account.
Create an HMAC key for your Service Account.
Grant the Service Account the Storage Object User role or a custom role with at least the following permissions:
- storage.multipartUploads.abort
- storage.multipartUploads.create
- storage.multipartUploads.list
- storage.multipartUploads.listParts
- storage.objects.create
- storage.objects.delete
Use the HMAC key access key and secret key as AWS Access Key ID and AWS Secret Access Key respectively when configuring your batch export.
Finally, set the Endpoint URL configuration to: https://storage.googleapis.com.