Note: This article is written in collaboration with the engineering team of Amazon Key. Special thanks to Kaushik Mani (Director at Amazon) and Vijay Nagarajan (Engineering Leader) from Amazon Key for walking us through the architecture and challenges they faced while building this system. All credit for the technical details and diagrams shared in this article goes to the Amazon Key Engineering Team.
Picture a customer buzzing with excitement for their package, only to find a "delivery failed" slip because a locked gate stood in the way.
That’s what Amazon faced in 2018. At the time, delivery delays were frequently caused by delivery associates being unable to enter access-controlled areas in residential and commercial properties, including gated communities, leading to missed deliveries and poor customer experiences.
The access control systems in such areas were never designed to work with modern logistics. Moreover, the hardware systems were wildly fragmented, with no common standard. Many of these devices were hardwired into buildings in ways that made network access unreliable or downright impossible.
To bridge this gap, Amazon launched an initiative to address a common but impactful delivery challenge: how to provide drivers with secure access to gated residential and commercial communities and buildings.
The result was Amazon Key: a system that allows verified delivery associates to unlock gates and doors at the right time, with the right permissions, and only for the duration needed to complete a delivery. What started with a small internal team and a few device installations has now grown into a system that unlocks 100 million doors annually, across 10+ countries and 4 continents, with five unlocks every second.
We recently had the wonderful opportunity to sit with Kaushik Mani and Vijay Nagarajan from the Amazon Key team to learn how they built this system and the challenges they faced.
In this article, we are bringing you our findings and what we learned about Amazon’s culture of entrepreneurship that played a key role in making Amazon Key a reality.
The Beginnings
The origins of Amazon Key can be traced back to 2016.
Upon moving to Seattle to work for Amazon, Kaushik noticed a number of underutilized parking spots in buildings. However, technology to open them up to users was lacking. Kaushik worked on his own initiative to solve this problem for a Seattle apartment building. The solution needed to address a highly fragmented category of garage door mechanisms.
Kaushik invented a cloud-connected universal key that worked on any electrically controlled lock. This later became known as Amazon Key. Kaushik deployed it at customer locations to validate the solution. He wrote the business plan around his invention and pitched it to several leaders across Amazon. After 6 months of pitching the idea and receiving several refusals, Kaushik was eventually funded by Marketplace.
He worked on the idea for a year, but unfortunately, no customers signed up. Building owners loved the idea of selling parking but worried about the trust and safety implications of allowing access to anyone. So, in 2018, he pivoted the business case to focus solely on building access for last-mile deliveries and launched the first version of Amazon Key with one firmware and one hardware engineer.
To put the problem in perspective: Amazon delivery associates deliver over a billion packages annually to tens of millions of customers living in access-restricted apartment buildings in the US, EU, and Japan. Without Amazon Key, a driver facing an access control barrier must rely on the customer, property manager, or other residents to be let in, which often means packages arrive late or not at all. One example highlighted by the team involved a customer expecting a cereal delivery at 4 a.m., timed precisely so breakfast could be prepared on schedule. If the delivery associate couldn’t get past a gate, that delivery would fail, turning what should be a seamless experience into a broken promise.
Not having on-demand access to these buildings was a core capability gap, resulting in repeated defects, the impact of which started becoming more pronounced as customer demand and delivery speeds increased. This shifted the idea from “interesting” to “strategic necessity”.
By 2018, Amazon Key entered early-stage operations. The rollout was intentionally conservative, targeting just 100 buildings to start, followed by another 100 shortly after. But even in this narrow scope, the system’s potential became clear.
Still, scaling was far from straightforward. As Kaushik Mani put it brilliantly: "It takes $5 to build the solution, but $95 to make it secure." That ratio became even more daunting in the context of Amazon’s global footprint, where every expansion introduced a new set of hardware, protocols, and deployment challenges.
And that’s where the Amazon Key engineering team came together to build a solution that made this scale possible.
The First Attempt: Serverless
Amazon started where most pragmatic teams start: build the simplest thing that works.
They created a small, Ethernet-connected device that could physically integrate with most ACS (Access-Control System) hardware. When a delivery associate showed up at a gate, they’d tap in the Amazon Flex App, which triggered a cloud command to the device via AWS IoT to open the gate.
See the diagram below that shows this setup:
The system stack looked something like this:
Hardware: A small, Ethernet-connected device installed on-site, physically wired to the building’s existing access control systems. It could trigger unlocks for 95% of gate types.
AWS IoT: Provided secure, two-way communication between the device and the cloud.
AWS Lambda (Java): Handled unlock requests, triggered by delivery associates using the mobile app.
DynamoDB: Stored metadata like device status, gate mappings, and property permissions.
Amazon Flex App: The application delivery associates use to initiate access requests when approaching a gate.
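The request path through this stack can be sketched in Java. Everything here is illustrative: the class, the topic format, and the payload are our assumptions, and the AWS IoT publish call is stubbed behind an interface rather than using the real SDK.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the early unlock path: a Lambda-style handler
// resolves the gate's device from a DynamoDB-like mapping, then publishes
// an unlock command over an IoT-like channel. All names are illustrative.
public class UnlockFlowSketch {

    // Stand-in for AWS IoT's publish capability.
    interface IotPublisher {
        void publish(String topic, String payload);
    }

    // Stand-in for the DynamoDB table mapping gates to device IDs.
    private final Map<String, String> gateToDevice = new HashMap<>();
    private final IotPublisher iot;

    public UnlockFlowSketch(IotPublisher iot) {
        this.iot = iot;
    }

    public void registerGate(String gateId, String deviceId) {
        gateToDevice.put(gateId, deviceId);
    }

    // The handler: resolve the device for the gate and fire the command.
    public boolean handleUnlockRequest(String gateId) {
        String deviceId = gateToDevice.get(gateId);
        if (deviceId == null) {
            return false; // unknown gate: no device mapping on record
        }
        iot.publish("devices/" + deviceId + "/commands", "{\"action\":\"unlock\"}");
        return true; // fire-and-forget: no delivery confirmation in v1
    }

    public static void main(String[] args) {
        UnlockFlowSketch flow = new UnlockFlowSketch(
                (topic, payload) -> System.out.println(topic + " <- " + payload));
        flow.registerGate("gate-42", "device-7");
        flow.handleUnlockRequest("gate-42");
    }
}
```

Note the `return true` on the happy path: the handler only knows the command was published, not that the gate actually opened. That fire-and-forget behavior is exactly the field-resilience gap described later.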
The early design was lean, minimal, and pragmatic. The goal was to give Amazon delivery associates a reliable way to unlock residential gates. This approach worked well for limited properties.
However, as Amazon Key expanded, hundreds of devices became thousands. A few cities became multiple countries. The rapid expansion created issues for the solution’s scalability in different areas:
1 - Device and Infra Challenges
When deploying access devices across thousands of properties, physical installation constraints quickly became a major factor in hardware design.
Space limitations were a frequent issue. Callboxes, where devices were often installed, have very limited internal space. Larger hardware simply could not fit inside these enclosures. Similarly, at common installation points like mail rooms, front doors, or outdoor callboxes, Ethernet connectivity was often unavailable, making it impractical to rely on wired networking as a standard installation requirement. Running new Ethernet cables required drilling through walls, digging up pathways, or convincing building managers to modify shared infrastructure.
Initially, the device supported two relay ports and two Wiegand ports, allowing it to control multiple access points and interact with various types of legacy systems. However, based on field experience, the design was streamlined to one relay and one Wiegand port. This change reduced hardware complexity and footprint, which made installations easier and more reliable.
Over time, the team also decided to stop using the Wiegand interface during installations. The main reason was operational: Wiegand credentials often change periodically, making long-term integration unreliable and harder to maintain without frequent updates.
2 - Serverless Backend with Java Cold Starts
Another problem the team faced was the high cold start latency of Java-based Lambdas. A cold start might be fine for a background task, but not when a delivery associate is standing at a gate holding a package to deliver.
Also, the initial backend design used a shared gateway across countries. This meant launching in new regions required coordination with multiple teams, environments, and deployment pipelines.
3 - Field Resilience
The unlock commands were fire-and-forget.
The delivery associates had no feedback. They’d tap “unlock” and hope for the best. If a device was offline, the system didn’t know. Also, the operations team couldn’t see device health or network quality remotely. Troubleshooting meant sending people into the field, which was slow, expensive, and frustrating.
Phase 2: Re-architecting for Scale
The early Lambda-based design was fast to build, but faced issues in terms of scalability. Therefore, the Amazon Key engineering team decided to design the system for a global scale.
Two key changes they made were as follows:
1 - Moving Away from Ethernet
No matter how smart the backend was, if the device couldn’t connect reliably, nothing else mattered. So the hardware team built something better.
They created a new, compact, cellular-enabled device, small enough to install discreetly and robust enough to operate in the field. It had multi-carrier support baked in, spanning 70+ countries, with failover if one network dropped. It removed the earlier dependency on local building infrastructure.
This change shifted the device from “hard to deploy” to “install and forget.” It also gave Amazon a repeatable, global deployment model, which was critical for international rollout.
2 - The Move to Containers
The backend team knew Lambda wasn’t going to cut it anymore, not at this scale.
The delivery associates needed real-time unlocks. That meant persistent device connections, low tail latency, and guaranteed CPU availability. Lambda (especially with Java) wasn’t ideal for that. Therefore, they moved to ECS Fargate, Amazon’s container-native compute layer.
So, why was ECS Fargate chosen specifically?
Vijay Nagarajan from the Amazon Key engineering team gave a few important reasons for this choice:
Fargate allows fine-grained control over vCPU and memory settings, which helped the team optimize performance based on real-world load.
Deploying across multiple Availability Zones (AZs) allowed the system to balance cost and availability.
Fargate can scale based on concrete indicators like outstanding request count, memory pressure, or CPU usage.
Compared to running the same workload on EC2, Fargate proved to be more cost-efficient, especially since it avoided the operational burden of managing EC2 instances.
Unlike Lambda, which spins up and tears down execution contexts, Fargate tasks remain provisioned and continuously running, making them well-suited for latency-sensitive workflows like device unlocks.
In short, Fargate gave them the control and performance of EC2, without the ops overhead of managing VMs.
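Amazon Key’s actual scaling policies weren’t shared, but the indicator-driven scaling described above maps to AWS Application Auto Scaling’s target-tracking policies for ECS services. A generic configuration that tracks average CPU utilization (thresholds here are made up) looks roughly like this:

```json
{
  "TargetValue": 60.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 120
}
```

With a policy like this, the service adds tasks when average CPU rises above the target and removes them when it falls below, with cooldowns preventing thrashing.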
The hardest part of this evolution was surgically moving live traffic to the newer services. The team created feature flags to support both flows, so every time they introduced a new service, they could slowly migrate traffic over to it.
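The gradual, flag-controlled migration can be sketched as a percentage dial. All names here are hypothetical, not Amazon Key’s actual implementation:

```java
import java.util.function.Supplier;

// Illustrative sketch of feature-flag traffic shifting: a percentage dial
// routes each request to the new service, so traffic can be migrated
// gradually and dialed back instantly if problems appear.
public class TrafficShifter {

    private volatile int percentToNewService; // 0..100

    public TrafficShifter(int percentToNewService) {
        this.percentToNewService = percentToNewService;
    }

    public void setPercent(int percent) {
        this.percentToNewService = percent;
    }

    // Route based on a stable hash of the request key, so a given
    // device/property consistently hits the same backend mid-rollout.
    public <T> T route(String requestKey, Supplier<T> oldService, Supplier<T> newService) {
        int bucket = Math.floorMod(requestKey.hashCode(), 100);
        return bucket < percentToNewService ? newService.get() : oldService.get();
    }

    public static void main(String[] args) {
        TrafficShifter shifter = new TrafficShifter(10); // start at 10%
        String backend = shifter.route("device-123",
                () -> "old-service", () -> "new-service");
        System.out.println(backend);
    }
}
```

Hashing on a stable key (rather than rolling a die per request) keeps each caller’s experience consistent while the dial moves from 0 to 100.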
To make things clear, Lambda wasn’t completely abandoned. There are still a few use cases that rely on Lambda, such as:
One-time just-in-time registration (JITR) of devices, which happens at the factory.
During installation, installers also capture a photo of where the device is installed. The post-processing of that image, scanning it for malware and normalizing its size and resolution, is still a Lambda function.
Benefits
Together, cellular hardware and containerized backend unlocked a new class of capabilities such as:
Faster unlock response times (<1.5s consistently).
Global expansion without localized network constraints.
The ability to handle edge cases gracefully, like intermittent connectivity, retries, and failovers.
It also laid the foundation for the next evolution: breaking the system into modular services, standardizing workflows, and monitoring what matters.
Adopting Microservices
The move to ECS Fargate was also a chance to break apart the system into clean, well-scoped services. Amazon Key went from a collection of Lambda functions to a cohesive service-oriented architecture that could be evolved in the future.
See the diagram below for a high-level view of the system running on ECS Fargate:
This wasn’t microservices for microservices’ sake. Each service took care of a real need. Here are the details of the key services that were a part of the overall architecture:
1 - Provisioning App
This service was developed to simplify the installation process and support onboarding new properties into the Amazon Key backend.
2 - Key Gateway Service
This service handles requests originating from the Flex App when a delivery is scheduled to an Amazon Key-enabled property. It also enabled the system to support international launches by managing region-specific traffic routing.
3 - Access Management Service
This service maintains the relationships between gates, properties, and devices. It also manages mappings with Amazon Logistics and associated job workflows such as installation and maintenance.
4 - Device Management Service
Built on top of AWS IoT, this service provided a wrapper around commands sent to devices. It supported both synchronous and asynchronous APIs and streamed device performance metrics into Redshift for analysis.
5 - AMZL Onboarding Service
This service listens to events from the Access Management Service and onboards properties into the Amazon Logistics system, allowing them to be used in routing and delivery workflows.
6 - OTA Management Service
This service provides pipelines for deploying firmware updates to devices in the field. It supported two modes:
One-time OTA, used to test firmware with a small cohort of devices.
Campaign-based OTA, which rolled out updates across device pools grouped by geographical location until all were up to date.
7 - Flex App
The Flex App is used by Amazon delivery associates to complete package deliveries. It integrates with Amazon Key to support access control during deliveries.
8 - Data Lake and Analytics
Metrics from different services are pushed into Redshift for analytics. Dashboards built using AWS QuickSight provide both aggregated summaries and device-level drilldowns.
The Cellular Connectivity Challenge
Switching from Ethernet to cellular connectivity solved key deployment issues but introduced new challenges.
Devices were installed in a variety of environments, such as exposed call boxes outside communities or electrical rooms located deep within buildings, and faced varying connectivity conditions. Cellular performance was inconsistent and location-dependent, often changing throughout the day. These fluctuations impacted the reliability of time-sensitive operations like unlock requests, especially during delivery hours.
To address this, the team developed the Intelligent Connection Manager (ICM) to improve device behavior when cellular performance is inconsistent, achieving better availability, failover, and reconnection times.
ICM enables the system to respond to real-world variability in connectivity without manual intervention, helping maintain access availability during critical delivery windows.
Here’s what the ICM does in more detail:
Monitoring: ICM continuously monitors device performance using:
EventBridge to capture real-time events
Redshift for historical trend analysis
Step Functions to coordinate health checks and remediation workflows
S3 for storage and processing support
Analysis: The system identifies poorly performing devices by evaluating recent interaction history and applying defined rules.
Remediation: When issues are detected, ICM triggers automated corrective actions such as:
Rebooting the device
Rescanning the network
Switching cellular carriers
If these measures are not sufficient, the metrics help the operations team determine when manual servicing is required.
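ICM’s monitor-analyze-remediate loop can be sketched as an escalation ladder. The action names and ordering below are our illustration of the steps listed above, not ICM’s actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ICM's remediation ladder: try automated fixes
// in order of increasing cost, and escalate to manual servicing only if
// every automated step fails. Action names are illustrative.
public class RemediationLadder {

    interface Action {
        String name();
        boolean attempt(); // true if device health recovers
    }

    // Convenience factory for building a named step with a fixed outcome.
    static Action step(String name, boolean succeeds) {
        return new Action() {
            public String name() { return name; }
            public boolean attempt() { return succeeds; }
        };
    }

    // Runs each remediation step in order; returns the log of attempts,
    // ending with "manual-service" if every automated step failed.
    public static List<String> remediate(List<Action> actions) {
        List<String> log = new ArrayList<>();
        for (Action action : actions) {
            log.add(action.name());
            if (action.attempt()) {
                return log; // device recovered; stop escalating
            }
        }
        log.add("manual-service");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(remediate(List.of(
                step("reboot", false),
                step("rescan-network", true),
                step("switch-carrier", false))));
    }
}
```

The key property is that the ladder stops at the first success, so cheap fixes (a reboot) are always tried before expensive ones (switching carriers or rolling a truck).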
Phase 3: Platform Expansion
As Amazon Key matured, the team shifted its focus from solving Amazon's internal delivery challenges to building a general-purpose access platform. The system had already proven it could scale reliably. The next step was to extend that capability to external delivery providers.
In 2023, Amazon Key announced its integration with Grubhub. This marked the transition from a closed product to a more extensible platform. To support this, the team introduced a new architectural component: the Partner Gateway Service.
This service acts as a boundary between Amazon Key's internal systems and third-party applications. It exposes a clean, stable API that allows vetted external partners to request access to secure properties without revealing internal implementation details.
See the diagram below that shows the expanded architecture:
The Partner Gateway Service was designed to ensure that expansion did not compromise system integrity or performance. It provides features such as:
Partner onboarding workflows: Validates and registers new third-party delivery partners into the system.
Authentication and authorization: Ensures that only approved entities can request unlocks, and only under permitted conditions.
Rate limiting and security enforcement: Prevents abuse and isolates faults to protect core services.
Interface abstraction: Maintains a clear separation between partner-facing APIs and internal services to avoid tight coupling.
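As an illustration of the rate-limiting responsibility above, here is a minimal token-bucket limiter of the kind a gateway might apply per partner. Amazon Key’s actual limits and algorithm weren’t shared, so the numbers and names here are assumptions:

```java
// Illustrative per-partner token-bucket rate limiter: each partner gets a
// burst allowance (capacity) plus a sustained refill rate; requests beyond
// that are rejected before they can reach core services.
public class PartnerRateLimiter {

    private final long capacity;      // max burst size per partner
    private final double refillPerMs; // sustained request rate
    private double tokens;
    private long lastRefillMs;

    public PartnerRateLimiter(long capacity, double refillPerSecond, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillMs = nowMs;
    }

    // Time is passed in explicitly to keep the sketch deterministic.
    public synchronized boolean tryAcquire(long nowMs) {
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // over the limit: reject to protect core services
    }
}
```

Rejecting at the gateway, before a request touches internal services, is what gives the fault-isolation property the list above describes.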
Security and Access Control
Amazon Key uses different authentication and authorization mechanisms for internal and third-party delivery systems, tailored to their integration models and security requirements.
For Amazon’s internal delivery flow, authentication is performed by AMZL (Amazon Logistics) services when a driver uses the Flex App. Once the delivery associate arrives at the property and the system has verified that this specific driver has an assigned delivery for that property at that time, Amazon Key’s backend issues a short-lived token to unlock the gate. This token is used to authorize access during the delivery attempt.
For third-party delivery providers, mutual TLS (mTLS) is used for authentication during the onboarding process. A profile is created per partner, and all communication is secured through mTLS to validate both client and server identities. The Partner Gateway Service handles authorization and request orchestration for these external systems.
Some key points shared by Vijay Nagarajan regarding access expiry and revocation are as follows:
Access is controlled through a time-bound token issued when the driver arrives and parks at the location.
The token is extended at 30-second intervals during the delivery window to maintain access while the delivery is in progress.
In failure scenarios, such as device unavailability due to power loss, the system notifies the driver through the Flex App that 1-click access is not available. In such cases, the drivers can continue accessing the property through standard access mechanisms (for example, through the lobby). One such example cited was power loss during a hurricane in Florida, which made certain devices unreachable.
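The time-bound token behavior described above can be sketched as follows. The 30-second extension interval comes from the source; the class shape and the initial TTL are our assumptions:

```java
// Sketch of a time-bound access token: issued when the driver arrives,
// valid briefly, and extended in 30-second increments while the delivery
// is in progress. The initial TTL here is illustrative.
public class AccessToken {

    static final long EXTENSION_MS = 30_000; // 30-second extension interval

    private long expiresAtMs;
    private boolean revoked;

    public AccessToken(long issuedAtMs) {
        this.expiresAtMs = issuedAtMs + EXTENSION_MS;
    }

    // Called periodically while the delivery is in progress; a lapsed or
    // revoked token cannot be extended.
    public void extend(long nowMs) {
        if (isValid(nowMs)) {
            expiresAtMs = nowMs + EXTENSION_MS;
        }
    }

    public void revoke() {
        revoked = true;
    }

    public boolean isValid(long nowMs) {
        return !revoked && nowMs < expiresAtMs;
    }
}
```

The design means access expires on its own shortly after the delivery ends: no one has to remember to revoke it, and a stolen token is useless once the extensions stop.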
Architectural Boundaries and Team Ownership
Operating so many services and features requires Amazon Key to ensure a clean separation of concerns. Here’s a quick look at the distribution of responsibility and the overall team composition.
The Gateway Service is responsible for performing authentication and authorization, and for orchestrating requests from client applications.
The Access Management Service stores and manages the relationships between devices, access points, installation jobs, and address mappings.
The Partner Gateway Service also handles authentication and authorization, using mutual TLS (mTLS), and orchestrates requests from external delivery partners.
The Device Service does not hold contextual information about where a device is located. Instead, it maintains knowledge about the type of device and the operations it supports.
The backend infrastructure team is responsible for managing all these services. The broader team is organized into three focus areas:
App Development
Front-End Development
Backend Common Infrastructure
Results and Impact
By rethinking both hardware and software architecture, Amazon Key was able to evolve from a niche internal solution focused on delivery associates into a large-scale, extensible access control platform. The result is a system that is not only convenient, secure, reliable, and adaptable to operational realities, but one that now serves a wide variety of audiences, including property owners/managers, residents, and guests, across single-family, multifamily, and commercial properties.
Today, Amazon Key supports over 100 million successful unlocks annually with extremely high system availability and low end-to-end latency from app tap to physical unlock. As a result, first-attempt delivery success has improved, and defects per building have decreased.
These improvements directly contributed to more efficient deliveries, lower support costs, and a more seamless access control experience, at scale.
Learnings
Here are some key learnings from Amazon Key’s experience building their system:
Evolve as You Scale: The initial serverless design was ideal for fast iteration, but couldn’t support the demands of a global system with strict latency and connectivity requirements. Transitioning to ECS Fargate enabled more consistent performance and stateful processing.
Measure What Matters: Instead of measuring generic uptime, the team focused on availability during delivery hours. This shifted the optimization focus to periods that affect end users, resulting in more actionable metrics and better system tuning.
Standardization Enables Speed: Using a consistent tech stack (Java across services, Infrastructure as Code, and AWS-native tooling) allowed the team to move faster without sacrificing maintainability. Reuse became a strength rather than a constraint.
Plan for Imperfect Environments: Many design decisions assume ideal conditions. In practice, field deployments introduced a range of variables: poor signal strength, hardware variation, and environmental impact. Designing with these constraints in mind was key to building resilience.
Operate Based on Data: By feeding all system metrics into a centralized analytics pipeline, the team could proactively identify issues, validate changes, and understand usage patterns. This led to faster incident response and continuous system improvement.
Use Tools Where They Fit: Lambda was not abandoned entirely. It remained useful for stateless, low-latency tasks. However, core workflows were moved to ECS for predictability and control. Tool choice became a matter of fit, not philosophy.
Design for Growth: The Partner Gateway Service shows the value of designing for extensibility. It enabled Amazon Key to expand from a single-use product into a multi-tenant platform, supporting external partners without disrupting core operations.
At the end, Vijay Nagarajan shared one key point regarding the journey: “It’s easy to say we would have arrived at the scaling architecture right away. But there are so many unknowns when we scale, especially in the business we are in. We have to inevitably grow through the learning phase. We could have potentially accelerated the learning phase by getting some of the basic metrics/telemetry from the device and prioritizing the OTA infrastructure. But for the rest, we are evolving in the right direction.”