This blog entry picks up the topic of Kafka again to provide a more detailed technical insight.

Apache Kafka has become one of the most important platforms for highly scalable systems and for processing large data volumes in modern IT systems. The trend of using Kafka for analytics and data hub projects is increasing continuously.

Kafka is a streaming platform that transports message streams as simple chains of bytes. These can contain anything from structured to completely unstructured data, including images.

Kafka does not care about the content. The data is passed on without any checks; only the Consumer validates the data.
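To illustrate this from the Producer's point of view, here is a minimal Java sketch (the broker address and topic name are assumptions chosen for illustration) that hands Kafka a plain array of bytes; Kafka never inspects the payload:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class RawBytesProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                // The payload is just bytes; Kafka performs no checks on the content.
                byte[] payload = "any structured or unstructured data".getBytes();
                producer.send(new ProducerRecord<>("example-topic", payload));
            }
        }
    }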

Kafka users know the problem: source data changes, fields are added or deleted, or the properties of a field change. If the communication between Producer and Consumer is not perfectly coordinated, errors occur on the Consumer's side, which can quickly lead to problems in a production environment.

To structure the data, the Producer usually provides it in one of the following formats:

CSV - XML - JSON - AVRO

CSV

Comma Separated Values is the oldest format for data exchange. The term CSV has been in use since approximately 1983, while the format itself has been used since the early 1970s.

The CSV format is not standardized. It is suitable for transferring simple data such as numbers, character strings, etc. However, character strings containing commas, apostrophes, or other special characters can lead to problems. Such values can be put in quotation marks; if a value already contains quotation marks, escape characters or escape sequences are used for clear delimitation.
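As a small sketch of these quoting rules (following the common RFC 4180 convention; the helper name is made up for illustration):

    // Fields containing commas, quotation marks, or line breaks are wrapped in
    // quotation marks; quotation marks inside the field are escaped by doubling them.
    static String quoteCsvField(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // quoteCsvField("Smith, John")      ->  "Smith, John"
    // quoteCsvField("He said \"hi\"")   ->  "He said ""hi"""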

CSV is suitable for streaming to Kafka but might not be the first choice, as parsing the data can become quite complex because of the issues mentioned above.

XML

XML can be used for data exchange, but it is not recommended. Parsing XML data is very CPU-intensive and therefore does not meet modern requirements for high-volume streaming.

JSON

Probably the most popular protocol for Kafka. JSON is omnipresent in almost every programming language, and almost all modern applications use it.

JSON is based on key-value pairs without external structure information. It is therefore possible to store new information, or different data types under the same name, without any problems.

This also means that the application processing the data has to verify that the expected fields exist and that the data has the required format.
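A minimal sketch of this obligation, assuming the Consumer uses the Jackson library to parse the message value (the field name is chosen only for illustration):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonFieldCheck {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            JsonNode record = mapper.readTree("{\"Number\": 1}");

            // The Consumer itself must verify presence and type of every field it relies on.
            JsonNode number = record.get("Number");
            if (number == null || !number.isInt()) {
                throw new IllegalArgumentException("Field 'Number' is missing or not an integer");
            }
            System.out.println("Number = " + number.intValue());
        }
    }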

In some cases, structure information is passed along with the JSON, meaning that a key-value pair becomes a document.

For example, "Number":1 becomes {"Name":"Number", "Value":1, "Type":"Integer", "Key position":0}.

AVRO

The fact that JSON data can have any structure and that it is the Consumer's task to check the data can lead to problems if the application producing the data is changed without informing the target application. Avoiding exactly this is the basic principle of AVRO: the structure of the data is always described by a schema that both sides know. AVRO's popularity in the Big Data community has grown tremendously.

Schemas in AVRO are defined via JSON. Various data types are supported, every field can be documented, and an AVRO object always consists of a schema and the data.
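For example, a schema for a simple record could look like the following minimal Java sketch (the record name, field names, and documentation strings are invented for illustration); the schema itself is plain JSON that the Avro library parses:

    import org.apache.avro.Schema;

    public class AvroSchemaExample {
        // The schema definition is plain JSON; the "doc" attribute documents each field.
        private static final String SCHEMA_JSON =
            "{"
            + " \"type\": \"record\","
            + " \"name\": \"Customer\","
            + " \"fields\": ["
            + "   {\"name\": \"Number\", \"type\": \"int\",    \"doc\": \"Customer number\"},"
            + "   {\"name\": \"Name\",   \"type\": \"string\", \"doc\": \"Full name\"}"
            + " ]"
            + "}";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            System.out.println(schema.getField("Number").doc());  // prints: Customer number
        }
    }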

These schemas are stored in a central location, and each schema has an id. If the version of a schema changes, e.g. because fields are inserted, the id changes as well.

When an AVRO message is sent to Kafka, the id of the schema used is placed at the beginning of the message. Each message therefore carries a reference to its structure information.
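A sketch of what this looks like with Confluent's serializer (the broker address, registry URL, topic, and schema are assumptions for illustration): the Producer only points at the schema registry, and the serializer registers or looks up the schema and prepends its id to the binary AVRO payload of every message:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class AvroProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");           // assumed broker address
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");  // Confluent serializer
            props.put("schema.registry.url", "http://localhost:8081");  // assumed registry address

            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
                + "{\"name\":\"Number\",\"type\":\"int\"}]}");

            GenericRecord value = new GenericData.Record(schema);
            value.put("Number", 1);

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                // The serializer talks to the registry and writes the schema id
                // in front of the binary AVRO payload.
                producer.send(new ProducerRecord<>("example-topic", value));
            }
        }
    }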

tcVISION communicates with the schema registry of our partner Confluent or the schema registry of Hortonworks.

The only slight drawback is that the development cycle with AVRO takes a bit longer than with JSON. In return, there are no issues with data formats. AVRO data is stored in binary form.

The output tables for Kafka in the tcVISION repository look like any other output tables (e.g. Oracle, SQL Server, etc.), independent of the protocol used.

The following principle applies here:

Changes at the data source should be communicated to the data target in a timely manner.

If this does not happen, it will only be noticed with CSV, JSON, or XML when values in the output are missing or remain unused over a period of time.

For AVRO, the structure information is necessary to read the messages. If a field is missing or has been added without adjusting the schema accordingly, there will be an error when parsing the message. Changes thus lead directly to errors instead of silently incorrect results.

Our tcVISION solution supports the Kafka protocols CSV, JSON, and AVRO. For AVRO the schema registries of Confluent and Hortonworks are supported.