Evolution of Data Ingestion and Product Instrumentation at Prezi
In this talk, we will cover how our need to better understand how users use our product led us to design a system for product instrumentation and event processing. Even though this does not sound hard, we were not as successful as we’d hoped at first, so we redesigned our data ingestion pipeline a couple of times to get to where we are today.
We will start by covering how our data ingestion pipeline evolved: from semi-structured event data copied to S3 with a bash script, to Avro-encoded events validated against the Confluent Schema Registry and ingested from Apache Kafka to S3 with Apache Gobblin. Even though Apache Kafka helps a lot with scaling, Kafka alone is not a silver bullet: we had to introduce multiple components, such as the Avro format and the schema registry, to fill in the missing pieces. We will also cover how we built around this ecosystem to make our system foolproof, to make it less painful to instrument the right events, and to define how instrumentation works across the platforms we support (Mac, Windows, iOS, Android, Web). Lastly, we will cover how this framework allows us to move faster by benefiting data science projects and self-service analytics in general.
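To give a flavor of what schema enforcement buys over free-form semi-structured events, here is a minimal sketch (not code from the talk; the event fields and schema are hypothetical) of the kind of per-field validation that Avro serialization backed by a schema registry performs at produce time, so malformed events never reach the pipeline:

```python
# Hypothetical Avro-style record schema for a product event.
EVENT_SCHEMA = {
    "type": "record",
    "name": "ButtonClicked",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "platform", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
    ],
}

# Mapping from Avro primitive types to Python types, for this sketch only.
AVRO_TO_PY = {"long": int, "string": str}

def validate(event: dict, schema: dict) -> bool:
    """Check that an event carries exactly the fields the schema declares,
    each with the expected type -- roughly what schema-registry-backed
    serialization enforces before an event is written to Kafka."""
    declared = {f["name"]: AVRO_TO_PY[f["type"]] for f in schema["fields"]}
    if set(event) != set(declared):
        return False  # missing or unexpected fields
    return all(isinstance(event[name], t) for name, t in declared.items())

good = {"user_id": 42, "platform": "web", "timestamp_ms": 1700000000000}
bad = {"user_id": "42", "platform": "web"}  # wrong type, missing field

print(validate(good, EVENT_SCHEMA))  # True
print(validate(bad, EVENT_SCHEMA))   # False
```

With the bash-script approach, the `bad` event above would land in S3 unchecked and only surface as a problem at query time; with a registered schema, it is rejected at the producer.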