Towards An Asynchronous Architecture

Introduction

Scaling a stateful system is no easy feat, especially when the database backend becomes the bottleneck as your service grows. As traffic increases, you can squeeze every optimization and efficiency gain out of the existing backend, but if the scale requirements still aren't met, you need to evolve the system itself as demands change. A common first step is to introduce load balancing and sharding to distribute the load and enhance scalability. However, these methods can add unnecessary complexity to an already intricate system. A powerful yet simpler approach to boosting scalability is leveraging asynchronous processing.

In this blog, I’ll guide you through an example of a distributed system and show how using asynchronous systems can resolve bottlenecks and scalability issues. We’ll discuss why a simple synchronous design works initially, what challenges it faces as load increases, and how switching to asynchronous processing can transform your system into a scalable architecture.

Sample Case Study: An Order Processing System

An order processing system is a core component of many e-commerce businesses. It handles the entire lifecycle of an order, ensuring customer requests are processed efficiently and delivered on time. The core business use case it solves is to provide a seamless workflow for managing orders from creation to fulfillment. In its simplest form, the system manages transitions through different stages as orders are placed, inventory is checked, payments are processed, and items are shipped and delivered to customers. These state transitions ensure that the order lifecycle is tracked and completed successfully, providing visibility and consistency across the entire process.

The Synchronous Approach: Starting Simple

In the initial iteration of our order processing system, everything is designed to happen synchronously. Let’s break down the core components and workflow:

  • When a customer places an order, the backend API writes the order to a MySQL database and then immediately makes a call to the inventory service to reserve the necessary items.

  • If the inventory reservation is successful, the order proceeds to the next stage (e.g., payment and shipping). If not, the order is canceled.
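The synchronous flow above can be sketched in a few lines. This is an illustrative sketch, not the post's actual implementation: the function names (`save_order`, `reserve_inventory`, `place_order`) and the use of plain dicts in place of MySQL and the inventory service are assumptions made for clarity.

```python
def save_order(db, order):
    # Write the order row; in the real system this is a MySQL INSERT.
    db[order["id"]] = {**order, "state": "CREATED"}

def reserve_inventory(stock, order):
    # Synchronous call to the inventory service: succeed only if every
    # item in the order can be reserved, and decrement stock if so.
    if all(stock.get(item, 0) >= qty for item, qty in order["items"].items()):
        for item, qty in order["items"].items():
            stock[item] -= qty
        return True
    return False

def place_order(db, stock, order):
    save_order(db, order)
    # The user waits here: the request cannot complete until the
    # inventory service responds.
    if reserve_inventory(stock, order):
        db[order["id"]]["state"] = "RESERVED"
    else:
        db[order["id"]]["state"] = "CANCELLED"
    return db[order["id"]]["state"]
```

Note how the user-facing request blocks on `reserve_inventory`: every property of the inventory service (latency, availability) leaks directly into the order API.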

This design is simple, easy to implement, and provides immediate consistency. However, as traffic increases, several key issues arise:

  1. Scalability Bottlenecks: The inventory service is part of the critical path, meaning every order request depends on the inventory service being available and able to respond quickly. During peak loads, this dependency quickly becomes a bottleneck, limiting the overall throughput of the system.

  2. Reduced Availability: If the inventory service is temporarily unavailable, the entire order process fails, making it a single point of failure. Users are unable to place orders, leading to lost revenue and poor customer satisfaction.

  3. Increased Latency: Since the backend waits for the inventory reservation to complete, users may experience delays. This affects user experience, especially if the inventory service takes longer to respond or faces performance issues.

  4. Poor Resilience: There is limited scope for retry mechanisms or graceful degradation if the inventory service fails. Any failure directly impacts the user request, resulting in a suboptimal experience.

Moving to Asynchronous Inventory Reservation: Introducing Message Buses

To address these challenges, the second iteration of the design introduces asynchronous inventory reservation using a message bus/pubsub system like Kafka. Here’s how the transition solves the problems:

  1. Decoupled Processing: Instead of waiting for the inventory reservation to complete, the order management service simply emits an event to a message bus topic (e.g., "OrderCreated"). A subscriber to these events then processes reservations asynchronously and updates the order state from CREATED to RESERVED. This decoupling allows the order service to immediately respond to the user, significantly improving response times.

  2. Scalability with Message Bus: The message bus's partitioning mechanism allows the inventory service to scale horizontally. Multiple instances of the inventory service can consume events concurrently, distributing the load and ensuring high throughput, even during peak times.

  3. Increased Resilience: With the message bus acting as a buffer, the inventory service can process events at its own pace. If the inventory service is temporarily down, the messages remain in the bus, allowing for eventual processing once the service is back up. This significantly reduces downtime and improves the system’s overall resilience.

  4. Retry and Fault Tolerance: The message bus's event retention and replay capabilities provide an in-built retry mechanism. If inventory reservation fails due to a temporary issue, the event can be retried until it succeeds. This makes the system more robust to transient failures without negatively impacting the user experience.

  5. Loosely Coupled Services: By using a message bus to publish and consume events, the order service and inventory service become loosely coupled. This makes the system easier to maintain and modify, as changes to the inventory service do not directly affect the order service.
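The decoupled flow can be sketched as below. To keep the sketch self-contained, a `queue.Queue` stands in for the Kafka topic; the event shape, the `OrderCreated` type, and the function names are illustrative assumptions, not a real Kafka client API.

```python
import queue

# Stand-in for a Kafka topic: the order service publishes here and the
# inventory service consumes from here, at its own pace.
ORDER_CREATED_TOPIC = queue.Queue()

def place_order(db, order):
    # The order service only writes the row and publishes an event;
    # it responds to the user immediately, without waiting on inventory.
    db[order["id"]] = {**order, "state": "CREATED"}
    ORDER_CREATED_TOPIC.put({"type": "OrderCreated", "order_id": order["id"]})
    return "CREATED"

def inventory_worker(db, stock):
    # A subscriber drains OrderCreated events and advances the order
    # state asynchronously: CREATED -> RESERVED or CANCELLED.
    while not ORDER_CREATED_TOPIC.empty():
        event = ORDER_CREATED_TOPIC.get()
        order = db[event["order_id"]]
        if all(stock.get(i, 0) >= q for i, q in order["items"].items()):
            for i, q in order["items"].items():
                stock[i] -= q
            order["state"] = "RESERVED"
        else:
            order["state"] = "CANCELLED"
```

The key difference from the synchronous version: `place_order` returns before any reservation happens, and `inventory_worker` can run in another process, scale out across partitions, or catch up later after an outage.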

Example State Transitions in the Asynchronous Design

In this design, order state transitions are managed as follows:

  1. CREATED: The order is initially created in the database, and an event is published to the message bus.

  2. RESERVED: The inventory service consumes the event and tries to reserve inventory. If successful, the state moves to RESERVED. If inventory is unavailable, the state moves to CANCELLED.

  3. PAID: Once inventory is reserved, the payment process is initiated.

  4. SHIPPED and DELIVERED: Subsequent states like shipping and delivery follow, potentially also driven by asynchronous events.
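The transitions above can be made explicit as a small state machine, so each consumer validates a move before applying it. The transition table below is a sketch assembled from the states named in this post; the `advance` helper is an illustrative name.

```python
# Legal order-state transitions, derived from the lifecycle above.
TRANSITIONS = {
    "CREATED": {"RESERVED", "CANCELLED"},
    "RESERVED": {"PAID", "CANCELLED"},
    "PAID": {"SHIPPED"},
    "SHIPPED": {"DELIVERED"},
}

def advance(order, new_state):
    # Reject moves the lifecycle does not allow (e.g. a duplicate or
    # out-of-order event trying to ship an unpaid order).
    if new_state not in TRANSITIONS.get(order["state"], set()):
        raise ValueError(f"illegal transition {order['state']} -> {new_state}")
    order["state"] = new_state
    return order
```

Guarding transitions this way matters more in the asynchronous design, where events may arrive late, duplicated, or out of order.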

This approach ensures that each component of the system can operate independently, with the message bus facilitating communication between services without making them directly dependent on each other.

Summary

Benefits of Asynchronous Design

By transitioning to an asynchronous architecture using a message bus like Kafka, the system gains:

  • Scalability: Both the order and inventory services can handle more requests, scaling independently as needed.

  • Resilience: The decoupling ensures the system can continue processing orders even if some services are down temporarily.

  • Lower Latency: Users receive an immediate response when placing orders, improving the overall user experience.

  • Ease of Maintenance: Loose coupling allows independent development, deployment, and scaling of different services.

Trade-offs of Asynchronous Design

While asynchronous design brings several benefits, it also comes with challenges:

  • Complexity: Asynchronous systems are more complex to implement and debug. Handling events, retries, and consistency requires sophisticated architecture.

  • Eventual Consistency: Asynchronous designs often lead to eventual consistency, where the system may not reflect the most up-to-date state immediately.

  • Operational Overhead: Maintaining a message bus, managing partitions, and ensuring message delivery can add operational overhead.

  • Failure Handling: Failures in asynchronous workflows are harder to diagnose. Retry mechanisms, dead-letter queues, and idempotent consumers add complexity.

  • Debugging: Since services operate independently, debugging issues requires tracing messages across components, making it challenging to find the root cause.
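Two of the mitigations mentioned above, idempotent consumers and dead-letter queues, can be sketched together. This is a minimal illustration, not a prescription: the retry limit, event shape, and function names are assumptions, and a production consumer would persist the processed-ID set rather than hold it in memory.

```python
MAX_ATTEMPTS = 3  # illustrative retry budget before dead-lettering

def consume(event, processed_ids, handler, dead_letters):
    # Idempotency: if this event ID was already handled, a redelivery
    # (common with at-least-once message buses) becomes a no-op.
    if event["id"] in processed_ids:
        return "skipped"
    for _ in range(MAX_ATTEMPTS):
        try:
            handler(event)
            processed_ids.add(event["id"])
            return "ok"
        except Exception:
            continue  # transient failure: retry
    # Retries exhausted: park the event for offline inspection instead
    # of blocking the partition or losing the message.
    dead_letters.append(event)
    return "dead-lettered"
```

The combination keeps the happy path fast while making duplicate delivery harmless and persistent failures visible rather than silent.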

Conclusion

Moving from a synchronous design to an asynchronous architecture can significantly improve the scalability, resilience, and performance of your distributed system. By using a message bus as an event broker, you can decouple tightly integrated services, reduce latency, and ensure the system remains operational even in the face of failures. Asynchronous design patterns are powerful tools in your scalability toolkit—consider them as you face growing loads and complex workflows in your system.
