Avro Schemas
What is Avro?
Avro is defined by a schema written in JSON
Advantages
- Fully typed
- Compressed automatically
- Schema comes along with the data
- Documentation is embedded in the schema
- Data can be read across any language
- Schema can evolve over time, in a safe manner
Disadvantages
- Avro support for some languages may be lacking
- Can't print without using avro tool since it's compressed and serialized
Avro is the only supported data format from Confluent Schema Registry
Avro Primitive Types
- null
- boolean
- int
- long
- float
- double
- bytes
- string
.avsc are Avro schema files
Example:
{"type" : "string"}
Avro Record Schema Definition
Defined using JSON
Fields include:
- Name
- Namespace
- Doc
- Aliases
- Fields
- Name
- Doc
- Type
- Default (value)
Example:
{
"type" : "record",
"name" : "Customer",
"namespace" : "com.example",
"doc" : "Avro schema for our Customer",
"fields" : [
{"name": "first_name", "type" : "string", "doc" : "First name of the customer" },
{"name": "last_name", "type" : "string", "doc" : "Last name of the customer" },
{"name": "age", "type" : "int", "doc" : "Age of the customer" },
{"name": "height", "type" : "float", "doc" : "Height in cms" },
{"name": "weight", "type" : "float", "doc" : "Weight in kgs" },
{"name": "automated_email", "type" : "boolean", "default" : true, "doc" : "true if the user wants marketing emails" },
]
}
Avro Complex Types
- Enums
- Once an enum is set, changing the enum values is forbidden if you want to maintain compatibility
- Arrays
- Maps (Object with types given a key type)
- Unions
- Most often used for optional values
- Default type on unions must be the first type in the array supplied
Example:
[{
"type" : "record",
"namespace" : "com.example",
"name" : "CustomerAddress",
"fields" : [
{ "name": "address", "type": "string" },
{ "name": "city", "type": "string" },
{ "name": "postcode", "type": ["int", "string"] },
{ "name": "type", "type": "enum", "symbols": ["PO BOX", "RESIDENTIAL", "ENTERPRISE"] }
]
},
{
"type" : "record",
"name" : "Customer",
"namespace" : "com.example",
"doc" : "Avro schema for our Customer",
"fields" : [
{"name": "first_name", "type" : "string", "doc" : "First name of the customer" },
{"name": "middle_name", "type" : "string", "doc" : "Last name of the customer" },
{"name": "last_name", "type" : "string", "doc" : "Last name of the customer" },
{"name": "age", "type" : "int", "doc" : "Age of the customer" },
{"name": "height", "type" : "float", "doc" : "Height in cms" },
{"name": "weight", "type" : "float", "doc" : "Weight in kgs" },
{"name": "automated_email", "type" : "boolean", "default" : true, "doc" : "true if the user wants marketing emails" },
{"name": "customer_emails", "type" : "array", "items" : "string", "default" : [], "doc" : "user emails" },
{"name": "customer_address", "type" : "com.example.CustomerAddress", "doc" : "user address" },
]
}]
Avro Logical Types
- decimals (bytes)
- date (int)
- time-millis (long)
- timestamp-millis (long)
Logical types don't play well with unions currently
Example
{
"type" : "record",
"namespace" : "com.example",
"name" : "CustomerAddress",
"fields" : [
{ "name": "address", "type": "string" },
{ "name": "city", "type": "string" },
{ "name": "postcode", "type": ["int", "string"] },
{ "name": "type", "type": "enum", "symbols": ["PO BOX", "RESIDENTIAL", "ENTERPRISE"] },
{ "name": "createStamp", "type" : "long", "logicalType": "timestamp-millis" }
]
}
The Complex Case of Decimals
Floats and Doubles are floating binary point types, they represent a number like this: 10001.100010110011
Decimal is a floating decimal point type. They represent a number like this:
12345.656788
Some decimals cannot be represent accurately as floats or doubles
People use floats and doubles for scientific computation (imprecise computations) because these are fast
People use decimals for money. That's why it got created. Use decimal when you need 'exactly accurate' results
Avoid using decimals as a logical type for now. Use a string instead.
References
Flashcards
In Avro, adding a field to a record without default is a breaking schema evolution
Which is an optional field in an Avro record?:: doc