This week’s system design refresher:
Caching Pitfalls Every Developer Should Know (YouTube video)
Encoding vs Encryption vs Tokenization
Kubernetes Tools Stack Wheel
Fixing bugs automatically at Meta Scale
The one-line change that reduced clone times by a whopping 99%, says Pinterest
SPONSOR US
Register for POST/CON 24 | April 30 - May 1 (Sponsored)
Postman’s annual user conference will be one of 2024’s top developer events and an unforgettable experience! Join the wider API community in San Francisco and be the first to learn about the latest Postman product advancements, elevate your skills in hands-on workshops with Postman experts, and hear from industry leaders, including:
James Q Quick
Shweta Palande
Kedasha Kerr
Hannah Seligson
Rui Barbosa
And more!
See the full agenda and register now to get a 30% Early Adopter discount!
Bonus: Did we mention there's an awesome after-party with a special celebrity guest?
Caching Pitfalls Every Developer Should Know
Encoding vs Encryption vs Tokenization
Encoding, encryption, and tokenization are three distinct processes that handle data in different ways for various purposes, including data transmission, security, and compliance.
In system designs, we need to select the right approach for handling sensitive information.
Encoding
Encoding converts data into a different format using a scheme that can be easily reversed. Examples include Base64 encoding, which encodes binary data into ASCII characters, making it easier to transmit data over media that are designed to deal with textual data.
Encoding is not meant for securing data. The encoded data can be easily decoded using the same scheme without the need for a key.
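For a concrete feel, here is a minimal Python sketch using the standard library's base64 module. Note that reversing the encoding requires no secret at all:

```python
import base64

# Encode arbitrary binary bytes as ASCII text for text-only channels.
payload = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(payload)   # b'iVBORw0KGgo='

# Decoding needs no key -- anyone holding the data can reverse it.
decoded = base64.b64decode(encoded)
assert decoded == payload
```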
Encryption
Encryption involves complex algorithms that use keys for transforming data. Encryption can be symmetric (using the same key for encryption and decryption) or asymmetric (using a public key for encryption and a private key for decryption).
Encryption is designed to protect data confidentiality by transforming readable data (plaintext) into an unreadable format (ciphertext) using an algorithm and a secret key. Only those with the correct key can decrypt and access the original data.
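As a minimal illustration of the symmetric case (the library choice is ours, not prescribed here), the Python cryptography package's Fernet recipe uses one secret key for both encryption and decryption:

```python
from cryptography.fernet import Fernet  # third-party "cryptography" package

key = Fernet.generate_key()   # the secret key; store it securely
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"card_number=4111111111111111")
plaintext = cipher.decrypt(ciphertext)   # fails without the correct key
assert plaintext == b"card_number=4111111111111111"
```

Without the key, the ciphertext is unreadable; with it, decryption recovers the original plaintext.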
Tokenization
Tokenization is the process of substituting sensitive data with non-sensitive placeholders called tokens. The mapping between the original data and the token is stored securely in a token vault. These tokens can be used in various systems and processes without exposing the original data, reducing the risk of data breaches.
Tokenization is often used for protecting credit card information, personal identification numbers, and other sensitive data. Tokenization is highly secure, as the tokens do not contain any part of the original data and thus cannot be reverse-engineered to reveal the original data. It is particularly useful for compliance with regulations like PCI DSS.
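A toy sketch of the idea, with a plain dictionary standing in for the token vault (a real vault is a hardened, access-controlled service, and the names tokenize, detokenize, and vault here are purely illustrative):

```python
import secrets

vault: dict[str, str] = {}   # stand-in for a secure token vault

def tokenize(sensitive_value: str) -> str:
    token = "tok_" + secrets.token_urlsafe(16)  # token carries nothing of the original
    vault[token] = sensitive_value
    return token

def detokenize(token: str) -> str:
    return vault[token]   # only the vault can map a token back

token = tokenize("4111 1111 1111 1111")
# Downstream systems store and pass around `token`; the card number never leaves the vault.
assert detokenize(token) == "4111 1111 1111 1111"
```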
Latest articles
If you’re not a paid subscriber, here’s what you missed this month.
To receive all the full articles and support ByteByteGo, consider subscribing:
Kubernetes Tools Stack Wheel
Kubernetes tools evolve continually, adding capabilities and simplifying container orchestration. The sheer number of tools speaks to the breadth of this dynamic ecosystem, which caters to diverse needs in the world of containerization.
In fact, just keeping track of the existing tools can be a significant endeavor. With new tools and updates released regularly, staying informed about their features, compatibility, and best practices is essential for Kubernetes practitioners, so they can make informed decisions and adapt to the ever-changing landscape.
This tool stack streamlines that decision-making and keeps pace with the ecosystem's evolution, ultimately helping you choose the right combination of tools for your use cases.
Over to you: I'm sure a few awesome tools are missing here. Which one would you add?
Fixing bugs automatically at Meta Scale
Wouldn't it be nice if a system could automatically detect and fix bugs for us?
Meta released a paper about how they automated end-to-end bug repair at Facebook scale. Let's take a closer look.
The goal of a tool called SapFix is to simplify debugging by automatically generating fixes for specific issues.
How successful has SapFix been?
Here are some details that have been made available:
Used on six key apps in the Facebook app family (Facebook, Messenger, Instagram, FBLite, Workplace and Workchat). Each app consists of tens of millions of lines of code
It generated 165 patches for 57 crashes in a 90-day pilot phase
The median time from fault detection to fix sent for human approval was 69 minutes.
Here’s how SapFix actually works:
Developers submit changes for review using Phabricator (Facebook’s CI system)
SapFix selects appropriate test cases from Sapienz (Facebook’s automated test case design system) and executes them on the Diff submitted for review
When SapFix detects a crash caused by the Diff, it tries to generate potential fixes. There are four types of fixes: template, mutation, full revert, and partial revert.
To pick a fix, SapFix runs the tests on the patched builds and checks what works. Think of it like solving a puzzle by trying different pieces (see the simplified sketch after these steps).
Once the patches are tested, SapFix selects a candidate patch and sends it to a human reviewer for review through Phabricator.
The primary reviewer is the developer who raised the change that caused the crash. This developer often has the best technical context. Other engineers are also subscribed to the proposed Diff.
The developer can accept the patch proposed by SapFix. However, the developer can also reject the fix and discard it.
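Below is a simplified Python sketch of that candidate-patch loop. It is not Meta's implementation; names such as generate_candidates and run_tests are placeholders for the machinery the paper describes:

```python
FIX_STRATEGIES = ["template", "mutation", "partial_revert", "full_revert"]

def propose_fix(diff, crash, generate_candidates, run_tests, send_for_review):
    """Try each fix strategy until a patch makes the crashing tests pass."""
    for strategy in FIX_STRATEGIES:
        for patch in generate_candidates(diff, crash, strategy):
            patched_build = diff.apply(patch)                 # build with the candidate fix
            if run_tests(patched_build, crash.test_cases):    # crash gone, tests still green?
                send_for_review(patch, reviewer=diff.author)  # a human has the final say
                return patch
    return None   # no candidate survived testing
```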
Reference:
The one-line change that reduced clone times by a whopping 99%, says Pinterest.
While it may sound cliché, small changes can definitely create a big impact.
The Engineering Productivity team at Pinterest witnessed this first-hand.
They made a small change in the Jenkins build pipeline of their monorepo codebase called Pinboard.
And it brought clone times down from 40 minutes to just 30 seconds.
For reference, Pinboard is the oldest and largest monorepo at Pinterest. Some facts about it:
350K commits
20 GB in size when cloned fully
60K git pulls on every business day
Cloning a monorepo with a lot of code and history is time-consuming, and that was exactly what was happening with Pinboard.
The build pipeline (written in Groovy) started with a “Checkout” stage where the repository was cloned for the build and test steps.
The clone options were set to shallow clone, no fetching of tags and only fetching the last 50 commits.
But it missed a vital piece of optimization.
The Checkout step didn’t use the Git refspec option.
This meant that Git was effectively fetching all refs on every build. For the Pinboard monorepo, that meant fetching more than 2,500 branches.
So, what was the fix?
The team simply added the refspec option and specified which ref they cared about. It was the “master” branch in this case.
This single change allowed Git clone to deal with only one branch and significantly reduced the overall build time of the monorepo.
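To make the effect concrete, here is roughly what the fetch looks like at the git level once a single refspec is pinned. This is an illustration in Python, not Pinterest's actual Groovy pipeline code:

```python
import subprocess

# Fetch only the master branch head instead of all ~2,500 branch refs.
subprocess.run(
    [
        "git", "fetch",
        "--depth", "50",   # shallow history, as the pipeline already configured
        "--no-tags",       # skip tags, as the pipeline already configured
        "origin",
        "+refs/heads/master:refs/remotes/origin/master",  # the one refspec that matters
    ],
    check=True,
)
```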
Reference:
SPONSOR US
Get your product in front of more than 500,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing hi@bytebytego.com.