Skip to main content

S3 Migration Guide

Upgrading to 1.0.0

This version introduces changes to the schema of data written to S3, which make it isomorphic to our V2 certified database destinations, as well as various improvements to format conversion.

  • Conversion failures are captured: values not matching the client schema will no longer break syncs using Avro or Parquet formats.
  • Changes introduced by the source or platform, such as NULLing bad values or truncating long ones, are visible in the metadata.
  • There is improved handling of various types in Avro and Parquet formats, including simplification of date and time types, more human-readable schemaless objects and arrays, and better support for unions in Parquet.
  • Sync and generation ids are made available, providing more information for debugging.

Schema Changes

The schema changes are as follows:

Old field nameNew field nameJSON TypeAvro TypeParquet TypeDescription
_airbyte_ab_id_airbyte_raw_idstring (UUID){ "type": "string", "logicalType": "uuid" }StringAirbyte-added unique identifier.
_airbyte_emitted_at_airbyte_extracted_atinteger (epoch ms){ "type": "string", "logicalType": "timestamp-millis" }Int64(logicalType=Timestamp(Milliseconds))Time at which the data was extracted from the source.
[NONEXISTENT]_airbyte_generation_idinteger{ "type": "long" }Int64Monotonically-increasing refresh id, if applicable.
[NONEXISTENT]_airbyte_metaobject (see below)Record (see below)Record (see below)Additional metadata, including change data capture info and sync id.
_airbyte_data[UNCHANGED]optional object(from client schema)(from client schema)Data payload when flattening is disabled.

The _airbyte_meta field is an object that currently has one field:

Field NameJSON TypeDescription

| changes | list | A record of any changes Airbyte made to the data for compatibility and/or to handle errors | | sync_id | integer | Monotonically-increasing integer representing the current sync |

The changes field is a list of objects, each of which represents a change to the data. Each object has the following fields:

Field NameJSON Type*Description
fieldstringThe name of the field that was affected
changestringThe type of change (currently only NULLED or TRUNCATED)
reasonstringThe reason for the change, including its origin (source, platform, or destination)

These schemas are subject to change, however any change that is not backward compatible (ie, additive) will be accompanied by a breaking change notice.

Data Changes

This version introduces changes to the data types used when writing to S3 in Avro or Parquet. The changes are as follows:

Primitive Types

All primitive types are unchanged:

Airbyte JSON Schema TypeOld Avro TypeOld Parquet TypeNew Avro TypeNew Parquet Type
string["null", "string"]Optional String[UNCHANGED][UNCHANGED]
boolean["null", "boolean"]Optional Boolean[UNCHANGED][UNCHANGED]
integer["null", "long"]Optional Int64[UNCHANGED][UNCHANGED]
number["null", "double"]Optional Double[UNCHANGED][UNCHANGED]

Date and Time Types

This change introduces simplification of the handling of dates, times, and timestamps in Parquet and Avro. Date and time types were stored in Avro as unions of integral logical types and strings, which in Parquet were represented as disjoint records. In practice, the date and time types were always converted, making the unions redundant. The resulting disjoint records were confusing.

Now all time types are converted to integral logical types. Values that cannot be converted will be nulled and tracked in _airbyte_meta.changes[].

Resulting Avro Changes

Avro users should only see changes at the schema level, the resulting data will appear the same.

Airbyte Time TypeOld Avro TypeNew Avro Type
date["null", { "type": "int", "logicalType": "date" }, "string"][null, int(logicalType="date")]
time without timezone["null", { "type": "long", "logicalType": "time-micros" }, "string"][null, long(logicalType="time-micros")]
time with timezone["null", { "type": "long", "logicalType": "time-micros" }, "string"][null, long(logicalType="time-micros")]
timestamp without timezone["null", { "type": "long", "logicalType": "timestamp-micros" }, "string"][null, long(logicalType="timestamp-micros")]
timestamp with timezone["null", { "type": "long", "logicalType": "timestamp-micros" }, "string"][null, long(logicalType="timestamp-micros")]

Resulting Parquet Changes

Airbyte Time TypeOld Parquet TypeNew Parquet Type
dateOptional Record { member0: Optional Int32(logical_type=Date), member1: Optional String }Optional Int32(logical_type=Date)
time without timezoneOptional Record { member0: Optional Int64(logical_type=Time(Microseconds)), member1: Optional String }Optional Int64(logical_type=Time(Microseconds))
time with timezoneOptional Record { member0: Optional Int64(logical_type=Time(Microseconds)), member1: Optional String }Optional Int64(logical_type=Time(Microseconds))
timestamp without timezoneOptional Record { member0: Optional Int64(logical_type=Timestamp(Microseconds)), member1: Optional String }Optional Int64(logical_type=Timestamp(Microseconds))
timestamp with timezoneOptional Record { member0: Optional Int64(logical_type=Timestamp(Microseconds)), member1: Optional String }Optional Int64(logical_type=Timestamp(Microseconds))

Note: the times and timestamps with timezones are converted to UTC. In Parquet, the field metadata is_adjusted_to_utc will always be true.

Time of Day Conversion Bugfix

Formerly, for the Airbyte type time_of_day_with_timezone, timezones were not respected when converting. This has been fixed.

Input DataOld Output DataNew Output Data
"12:00:00-01:00"43200000000 (12:00:00 UTC)46800000000 (13:00:00 UTC)
"04:00:00+02:00"14400000000 (04:00:00 UTC)720000000 (02:00:00 UTC)
"04:00:00+05:30"14400000000 (04:00:00 UTC)81000000000 (22:30:00 UTC)

Object Types

Formerly, in both Avro and Parquet formats, objects without schemas ("type": "object" without properties), and properties not listed in the properties field, were accumulated in _airbyte_additional_properties. The new behavior is

  • undocumented fields in object with schemas are silently dropped
  • objects without schemas are serialized into a JSON string
  • _airbyte_additional_properties is dropped entirely

For example, the following input schemas and data will result in the following outputs:

Input SchemaInput DataOutput SchemaOutput Data
{ "type": "object", "properties": { "id": { "type": "integer" } } }{ "id": 1, "name": "Alice" }{ "type": "object", "properties": { "id": { "type": "integer" } } }{ "id": 1 }
{ "type": "object" }{ "id": 1, "name": "Alice" }{ "type": "string" }"{\"id\": 1, \"name\": \"Alice\"}"

Note: dropped fields will not appear in _airbyte_meta.changes[]. Additionally, n object with null or empty properties ("properties": {}) will be treated as a schemaless object. This is because this usually indicates an upstream source is failing to report its schema properly, and the data is not actually extraneous.

Array Types

Formerly, arrays without types ("type": "array" with no items field) were converted to native arrays of string serializations of the underlying types. The same behavior was applied to arrays of union types ({ "type": "array", "items": { "oneOf": [ /* various types */ ] }).

Now:

  • Arrays without types are serialized into JSON array strings.
  • Arrays of unions are treated as arrays of mixed types.

For example, the following input schemas and data formerly resulted in:

Input SchemaInput DataOld Output SchemaOld Output Data
{ "type": "array", "items": { "type": "integer" } }[1, "Alice"]{ "type": "array", "items": ["null", "integer"] }[SYNC FAILED]
{ "type": "array" }[1, "Alice"]{ "type": "array", "items": [ "null", "string" ] }["1", "Alice"]
{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }[1, "Alice"]{ "type": "array", "items": [ "null", "string" ] }["1", "Alice"]
{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }[1, false]{ "type": "array", "items": [ "null", "string" ] }["1", "false"]

Now:

Input SchemaInput DataNew Output SchemaNew Output Data
{ "type": "array", "items": { "type": "integer" } }[1, "Alice"]{ "type": "array", "items": ["null", "integer"] }[1, null*]
{ "type": "array" }[1, "Alice"]{ "type": "string" }"[1, \"Alice\"]"
{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }[1, "Alice"]{ "type": "array", "items": [ "null", "integer", "string" ] }[1, "Alice"]
{ "type": "array", "items": { "oneOf": [ {"type": "integer", "type": "string"} ] } }[1, false]{ "type": "array", "items": [ "null", "integer", "string" ] }[1, null*]

*The nulled fields represent conversion failures and will appear in _airbyte_meta.changes[].

This behavior will be applied to both Avro and Parquet formats.

Union Types (Parquet Only)

Disjoint Record Improvements

Formerly, unions in Parquet were represented as anonymous disjoint records. For example:

Airbyte Union TypeOld Parquet Type
{"oneOf": [ {"type": "integer"}, {"type": "string"} ] }Optional Record { member0: Optional Int32, member1: Optional String }
{"oneOf": [ {"type": "boolean"}, {"type": "object", "properties": { "id": { "type": "integer" }, "name": { "type": string" } } } ] }Optional Record { member0: Optional Boolean, member1: Optional Record { id: Optional Uint32, name: Optional String } }

Now unions will be represented as typed disjoint records with named fields. For example:

Airbyte Union TypeNew Parquet Type
{"oneOf": [ {"type": "integer"}, {"type": "string"} ] }Optional Record { type: String, integer: Optional Int32, string: Optional String }
{"oneOf": [ {"type": "boolean"}, {"type": "object", "properties": { "id": { "type": "integer" }, "name": { "type": string" } } } ] }Optional Record { type: String, boolean: Optional Boolean, object: Optional Record { id: Optional Uint32, name: Optional String } }

Where

  • Only one field will always be set.
  • The type field will always be set and always be equal to the name of the field that was set.

Following the examples above:

Input DataNew Output Data
1{ "type": "integer", "integer": 1 }
{"id": 10, "name": "Alice"}{ "type": "object", "object": { "id": 10, "name": "Alice" } }

Disjoint Record Type Names

The following type and field names will be used for each Airbyte type:

Airbyte TypeParquet TypeField Name
{"type": "integer" }Optional Int64integer
{"type": "string" }Optional Stringstring
{"type": "boolean" }Optional Booleanboolean
{"type": "object" }Optional Recordobject
{"type": "array" }Optional Listarray
{"type": "string", "format": "date" }Optional Int32(logicalType=Date)date
{"type": "string", "format": "time", "airbyte_type": "time_with_timezone" }Optional Int64(Optional Int64(logical_type=Time(Microseconds))time_with_timezone
{"type": "string", "format": "time", "airbyte_type": "time_without_timezone" }Optional Int64(Optional Int64(logical_type=Time(Microseconds))time_without_timezone
{"type": "string", "format": "date-time", "airbyte_type": "timestamp_with_timezone" }Optional Int64(logical_type=Timestamp(Microseconds)))timestamp_with_timezone
{"type": "string", "format": "date-time", "airbyte_type": "timestamp_without_timezone" }Optional Int64(Optional Int64(logical_type=Timestamp(Microseconds)))timestamp_without_timezone

Merging Like Options in Unions

The above behavior applies only to unions of distinct types. Unions of the same type will continue to be merged. If the result is a single option, the union will be demoted to a field of that type. For example:

Airbyte Union TypeAvro TypeParquet Type
{"oneOf": [ {"type": "integer"}, {"type": "integer"} ] }["null", "integer"]Optional Int64
{"oneOf": [ {"type": "integer"}, {"type": "string"}, {"type": "integer" } ] }["null", "integer", "string"]Optional Record { type: Optional String, integer: Optional Int64, string: Optional String }

Unions of objects will continue to be merged into a single object with a union schema. If this is not possible due to field type conflicts, an exception will be thrown and the sync will fail.

Object 1 PropertiesObject 2 PropertiesMerged Object
{id: integer}{name: string}{id: integer, name: string}
{id: integer, name: string}{id: integer, birthday: date}{id: integer, name: string, birthday: date}
{id: integer}{id: string}[SYNC FAILED]

Unions of arrays with different item schemas continue not to be supported.