An Introduction to Apache Spark
A flexible data processing framework. Apache Spark and MapReduce are the two most common big data processing frameworks. In this post, we will look at the features of Spark and show where it excels when compared with MapReduce. MapReduce uses a split-apply-combine strategy for data analysis and it involves storing the split data on to…
Introducing Beekeeper Time-To-Live (TTL)
Automate clean up of temporary Hive tables in the data lake. In 2019, we announced our open-source automated data clean up service Beekeeper (ExpediaGroup/beekeeper on github.com), a service that schedules orphaned paths and expired metadata for deletion. At Expedia Group™, we use Beekeeper to delete unreferenced data snapshots left behind by various data processing tools that follow the snapshot isolation pattern. Temporary Data in a Large Data Lake: Beekeeper is great at handling cases where data is often restated but what happens…
On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies
Author: Zachary Ennenga. Background: At Airbnb, our offline data processing ecosystem contains many mission-critical, time-sensitive jobs; it is essential for us to maximize the stability and efficiency of our data pipeline infrastructure. So when, a few months back, we encountered a recurring issue that caused significant outages of our data…
PinalyticsDB: A Time Series Database on top of Hbase
PinalyticsDB is Pinterest’s proprietary time series database. At Pinterest, we rely on PinalyticsDB as a backend for storing and visualizing thousands of time series reports such as the sample case below, segmented by country. PinalyticsDB was built several years ago on…
Evolution of Netflix Conductor:
By Anoop Panicker and Kishore Banala. Conductor is a workflow orchestration engine developed and open-sourced by Netflix. If you’re new to Conductor, this earlier blogpost and the documentation should help you get started and acclimatized to Conductor. In the last two years since inception, Conductor has seen wide adoption and…
MezzFS — Mounting object storage in Netflix’s media processing platform
MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. It’s used extensively in our media processing platform, which includes services like Archer and runs features like…
Building a dynamic and responsive Pinterest
In 2015, the majority of content on Pinterest was pregenerated for users prior to login. It was stored statically in HBase and served directly upon entering the service. (More details can be found in the blog post Building a smarter home feed.) …
Auto Scaling Production Services on Titus
Over the past three years, Netflix has been investing in container technology. A large part of this investment has been around Titus, Netflix’s container management platform that was open sourced in April of 2018. Titus schedules application containers…
Improving HBase backup efficiency at Pinterest
Pinterest has one of the largest HBase production deployments in the industry. HBase is one of the foundational building blocks of our infrastructure and powers many of our critical services including our graph database (Zen), our general-purpose key-value store (UMS), our…