Avro in Java

Generic Record in Avro

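A GenericRecord is used to create an Avro object from a schema, the schema being supplied at runtime (in the snippet below it is parsed from a JSON string).

It's not the most recommended way of creating Avro objects: field names and value types are only checked against the schema at runtime, so mistakes that a compiler would otherwise catch surface as runtime failures.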

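import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;

// Parse the schema from its JSON definition
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(""); // Fill with the Avro schema as a JSON string

// Fields are set by name; the names are only checked against the schema at runtime
GenericRecordBuilder customerBuilder = new GenericRecordBuilder(schema);
customerBuilder.set("first_name", "John");
// etc.

GenericData.Record customer = customerBuilder.build();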

Specific Record in Avro

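A SpecificRecord is also an Avro object, but its class is obtained by code generation from an Avro schema, so fields are accessed through typed getters and setters that the compiler can check.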

Avro Tools

Avro Tools is a CLI that can be used to read Avro files.

Example:

java -jar avro-tools-1.8.2.jar tojson --pretty customer-generic.avro

This prints the binary content of the Avro file as JSON.

You can also use getschema, which prints the schema embedded in the Avro file instead.

Reflection in Avro

You can generate a schema from an existing Java class with ReflectData.get().getSchema(ReflectedCustomer.class);

Schema Evolution - Theory

4 kinds of Schema Evolution

  1. Backward: a backward compatible change is when a new schema can be used to read old data
  2. Forward: a forward compatible change is when an old schema can be used to read new data
  3. Full: which is both backward and forward
  4. Breaking: which is none of the above
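The four categories can be sketched with a toy model in plain Java. This is only an illustration, not Avro's API: a schema is reduced to a map from field name to "has a default", and only adding or removing fields is considered (type changes, enums and aliases are ignored). The class and method names (CompatibilitySketch, canRead, classify) are made up for this sketch.

```java
import java.util.Map;

public class CompatibilitySketch {

    enum Compatibility { FULL, BACKWARD, FORWARD, BREAKING }

    // Toy schema model (an assumption, not Avro's Schema class):
    // field name -> true if the field declares a default value.
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean wasWritten = writer.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            // A reader field that was never written needs a default to fall back on.
            if (!wasWritten && !hasDefault) {
                return false;
            }
        }
        return true;
    }

    static Compatibility classify(Map<String, Boolean> oldSchema, Map<String, Boolean> newSchema) {
        boolean backward = canRead(newSchema, oldSchema); // new schema reads old data
        boolean forward = canRead(oldSchema, newSchema);  // old schema reads new data
        if (backward && forward) return Compatibility.FULL;
        if (backward) return Compatibility.BACKWARD;
        if (forward) return Compatibility.FORWARD;
        return Compatibility.BREAKING;
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = Map.of("address", false, "city", false);

        // Add a field with a default: both directions still work.
        System.out.println(classify(v1, Map.of("address", false, "city", false, "isCool", true)));
        // Add a field without a default: only the old schema can read new data.
        System.out.println(classify(v1, Map.of("address", false, "city", false, "isCool", false)));
        // Remove a field that has no default: only the new schema can read old data.
        System.out.println(classify(v1, Map.of("address", false)));
        // Remove one required field and add another: neither direction works.
        System.out.println(classify(v1, Map.of("address", false, "isCool", false)));
    }
}
```

Running main prints FULL, FORWARD, BACKWARD and BREAKING, one per line. Note that a change classified as FULL is, by definition, also backward and forward compatible.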

Schema Evolution Compatibility Matrix - 202404101240

Backward:

Before

{
  "type" :  "record",
  "namespace" : "com.example",
  "name" : "CustomerAddress",
  "fields" : [
    { "name": "address", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "postcode", "type": ["int", "string"] },
    { "name": "type", "type": { "type": "enum", "name": "AddressType", "symbols": ["PO_BOX", "RESIDENTIAL", "ENTERPRISE"] } },
    { "name": "createStamp", "type" : { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}

After

{
  "type" :  "record",
  "namespace" : "com.example",
  "name" : "CustomerAddress",
  "fields" : [
    { "name": "address", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "postcode", "type": ["int", "string"] },
    { "name": "type", "type": { "type": "enum", "name": "AddressType", "symbols": ["PO_BOX", "RESIDENTIAL", "ENTERPRISE"] } },
    { "name": "createStamp", "type" : { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "isCool", "type" : "boolean", "default" : true } // new! with default
  ]
}

Forward

Before

{
  "type" :  "record",
  "namespace" : "com.example",
  "name" : "CustomerAddress",
  "fields" : [
    { "name": "address", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "postcode", "type": ["int", "string"] },
    { "name": "type", "type": { "type": "enum", "name": "AddressType", "symbols": ["PO_BOX", "RESIDENTIAL", "ENTERPRISE"] } },
    { "name": "createStamp", "type" : { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}

After

{
  "type" :  "record",
  "namespace" : "com.example",
  "name" : "CustomerAddress",
  "fields" : [
    { "name": "address", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "postcode", "type": ["int", "string"] },
    { "name": "type", "type": { "type": "enum", "name": "AddressType", "symbols": ["PO_BOX", "RESIDENTIAL", "ENTERPRISE"] } },
    { "name": "createStamp", "type" : { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "isCool", "type" : "boolean" } // new! without default
  ]
}

Full

Before

{
  "type" :  "record",
  "namespace" : "com.example",
  "name" : "CustomerAddress",
  "fields" : [
    { "name": "address", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "postcode", "type": ["int", "string"] },
    { "name": "type", "type": { "type": "enum", "name": "AddressType", "symbols": ["PO_BOX", "RESIDENTIAL", "ENTERPRISE"] } },
    { "name": "createStamp", "type" : { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}

After

{
  "type" :  "record",
  "namespace" : "com.example",
  "name" : "CustomerAddress",
  "fields" : [
    { "name": "address", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "postcode", "type": ["int", "string"] },
    { "name": "type", "type": { "type": "enum", "name": "AddressType", "symbols": ["PO_BOX", "RESIDENTIAL", "ENTERPRISE"] } },
    { "name": "createStamp", "type" : { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "isCool", "type" : "boolean", "default" : true } // new! with default
  ]
}

Breaking

Advice

  1. Make your primary key required
  2. Give default values to all the fields that could be removed in the future
  3. Be very careful when using Enums as they can't evolve easily over time (adding or removing a symbol can break compatibility)
  4. Don't rename fields, add aliases instead
  5. When evolving a schema ALWAYS give default values to the fields you add
  6. When evolving a schema NEVER delete a required field
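Points 4 to 6 can be illustrated with a small toy reader in plain Java. It is a sketch under simplifying assumptions, not Avro's actual resolution code: a written record is a plain map, and ReaderField, required, withDefault and resolve are hypothetical names invented here. The reader looks up each of its fields by name, then by alias, then falls back to the default, and fails if none of those is available.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReaderResolutionSketch {

    // Hypothetical reader-side field description (not Avro's Schema.Field).
    static final class ReaderField {
        final String name;
        final List<String> aliases;
        final boolean hasDefault;
        final Object defaultValue;

        ReaderField(String name, List<String> aliases, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.aliases = aliases;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        static ReaderField required(String name, String... aliases) {
            return new ReaderField(name, List.of(aliases), false, null);
        }

        static ReaderField withDefault(String name, Object defaultValue) {
            return new ReaderField(name, List.of(), true, defaultValue);
        }
    }

    // Resolve written data against the reader's fields: name, then alias, then default.
    static Map<String, Object> resolve(List<ReaderField> readerFields, Map<String, Object> written) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : readerFields) {
            String source = null;
            if (written.containsKey(field.name)) {
                source = field.name;
            } else {
                for (String alias : field.aliases) {
                    if (written.containsKey(alias)) {
                        source = alias;
                        break;
                    }
                }
            }
            if (source != null) {
                result.put(field.name, written.get(source));
            } else if (field.hasDefault) {
                result.put(field.name, field.defaultValue);
            } else {
                // A required field that was deleted or renamed without an alias ends up here.
                throw new IllegalStateException("No value or default for required field: " + field.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Data written with an older schema that used "zip" and had no "isCool" field.
        Map<String, Object> oldData = Map.of("address", "1 Main St", "zip", "12345");

        List<ReaderField> newSchema = List.of(
            ReaderField.required("address"),
            ReaderField.required("postcode", "zip"),   // renamed field, old name kept as alias
            ReaderField.withDefault("isCool", true));  // added field with a default

        System.out.println(resolve(newSchema, oldData));
    }
}
```

Here the old record is still readable: "zip" is picked up through the alias and "isCool" takes its default. Dropping the alias, or adding a field without a default, makes resolve throw, which is the toy equivalent of a breaking change.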
