On API Design
External service APIs are very different from internal service APIs. There are certain factors to keep in mind before we publish an API, because API changes are not agile in nature.
How to classify External vs Internal?
If a single team can develop, test and update the API, then it is an internal API. If the changes have to be reviewed and approved by stakeholders beyond the team, then it is most likely an external API. The sign-off process is a step towards preventing production issues.
Once exposed for public consumption across teams:
- API definitions are etched in stone: The API becomes collectively owned. The service provider is accountable for the API, but every consumer is responsible for the API's integrity. The consumers of the API can expose it to other downstream systems and might build unforeseen usecases.
- Backward compatibility is a must: Not every team works on the same timelines/priorities. Backward compatible APIs allow flexibility in terms of project management across streams.
- Loose coupling is a feature: It is highly desirable to deprecate, refactor and continuously evolve software, and an application with fewer dependencies is easier to maintain. E.g: Services should not expose dependencies on internally used libraries (loggers, build libs, DI frameworks, serialization libs, metrics). It is worth designing a separate interface package and implementation package (see the sketch after this list).
- Facilitate multiple implementations of the API: The true test of a platform-agnostic API is the ability to have multiple implementations/representations. E.g: The API should facilitate Lambda-driven consumption with the same ease as an EC2-based implementation. Changes to the data store (Oracle, MySQL, RODB, RocksDB, Athena etc.) should not impact the client. Think of pagination, nullables, access-controls, and other platform-specific needs before you finalize the contract.
- Aesthetics of an API matter: An API should have a high signal-to-noise ratio. The API should be intuitive and unambiguous to new consumers (unlike setDestinationInfo or GetDI). API idempotency, or the lack of it (stateful-ness), should be documented clearly.
- Versioning is a necessary evil: Versioning becomes inevitable, and it is very difficult to expose versioning without violating the aesthetics of an API.
- Trace the usecase and not just the user: In a rapidly evolving environment, APIs get used in unintended ways. It is important to share the responsibility of API ownership with downstream consumers. Allowing clients to add metadata helps in making decisions related to non-functional usage (such as throttles, timeouts, retries, SLA definitions, cache TTLs), regulatory concerns (UCI compliance) and functional aspects (deprecation, contract change).
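To make the loose-coupling point concrete, here is a minimal sketch of keeping the interface package separate from the implementation package. The names (com.example.addressservice.api, DefaultAddressService) are hypothetical; the point is that the api package depends only on the JDK, while loggers, metrics and storage clients stay inside the impl package.

```java
// --- api package: what consumers depend on; only the JDK, no loggers/DI/serialization libs ---
// file: com/example/addressservice/api/AddressId.java
package com.example.addressservice.api;

public record AddressId(String value) { }

// file: com/example/addressservice/api/AddressService.java
package com.example.addressservice.api;

public interface AddressService {
    String getFormattedAddress(AddressId id);
}

// --- impl package: free to use internal libraries; they never leak into the contract ---
// file: com/example/addressservice/impl/DefaultAddressService.java
package com.example.addressservice.impl;

import com.example.addressservice.api.AddressId;
import com.example.addressservice.api.AddressService;

public class DefaultAddressService implements AddressService {
    @Override
    public String getFormattedAddress(AddressId id) {
        // Internal loggers, metrics, storage clients etc. stay behind this boundary.
        return "address for " + id.value();
    }
}
```

Consumers then build against the api package only, so the implementation can swap its internal libraries or storage without a contract change.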
Remember:
- A good API is not built in a day, it is iterative.
- Involve all the stakeholders when formalizing the design of an API.
Checklist for an API
- Should adhere to the single-responsibility principle. If an operation feels out of place, do not model it as part of the service. When in doubt, omit the feature rather than include it.
- API should be agnostic of implementation concerns. (Implementation is defined as: a. structure - the language of choice; b. behavior - non-functional characteristics; and c. functionality - how the result is achieved.) An API is all about the what and not about the how. E.g: The API should not expose 'transactional context' or 'concurrency' elements; these are implementation concerns.
- The identifiers/values used in an API should state their 'units' (Address-Id, Place-Id, Tracking-Id). It is possible for an API to have multiple identifiers (especially when working on data from multiple sources), so being explicit is a good idea. E.g: How would it work for a 64-bit address-id vs a legacy address-id?
- Backward compatibility is a must. Model all fields as optional (like protobuf), avoid changing field types, and make invariant changes only in collaboration with clients (even when the API does not take client input or return a result).
- Always have end-to-end tests for gauging backward compatibility. Interface handshake is a hard task; do not underestimate this.
- Input should always be validated and output should always be encoded
- Prefer value-objects/envelopes to primitives (see the sketches after this checklist)
- Prefer idempotent behavior where possible. Always document stateful behavior and allowed usages (do not repeat the mistakes of the Calendar API in the Java SDK)
- Circuit-breakers and Caching introduce a quasi-stateful-ness into the application. Document the normal and alternative behavior characteristics
- Filtering and pagination should be thought of early (a pagination sketch follows this checklist)
- Batch vs realtime uses should be designed and defined clearly. If need be, have separate fleets to handle these concerns, as the SLAs and usecases are going to be different
- Synchronous/Blocking, Asynchronous/Blocking and Asynchronous/Non-Blocking are concerns to keep in mind. The internal implementation/platform insights need to be accounted for when modelling these. [Coral async is async-blocking by default]
- Amazon as a convention prefers to fail fast rather than retry. Do not retry unless you are at the head/tail of the request/call hierarchy. Do not attempt retries on external service dependencies without consensus. Document the expectations clearly
- When returning values, identify the need to return collection/list/array types. Prefer returning an empty collection/list/array to returning null
- Modelling as a micro-service is very different from modelling as a monolith. Micro-services allow different concerns to be solved in an agile fashion. E.g: Introducing or changing the caching strategy or storage strategy (DBMS, key-value store) is easier in a micro-service-based approach.
- Think of feature-toggles aka weblabs. It is preferable to have a mechanism for toggling in real-time. Always test your toggles with automated tests
- Model the admin/control-plane in a reliable/responsive way (no single-point failures). The control plane should push to/poll the data plane and not the other way around.
- Access Controls and App-Security should be baked into design. If there is a need for Fine-Grained-Access, have provisions for that as part of API
- Do not invent Authentication/encryption for your service, always rely on Amazon conventions and info-sec approved mechanisms
- Do not model Tier-1 and Tier-2 to have separate availability concerns. Instead, model fleets with different latency/non-functional characteristics. Availability is a must have feature, but low-latency is a targeted/specific feature.
- Think of the sanity tests, warm-ups/priming needed before allowing the first call to hit the fleet
- Avoid fallback behavior/defaults. E.g: It is difficult to trace/replicate behavior/issues when a single host is serving requests in fallback mode. Design for consistency across the fleet (deployment/rollback being a documented exception)
- Model the exceptions returned clearly. Most clients would be interested in retryable vs non-retryable exceptions. Having scenario-specific error-codes helps clients and service owners build usage-centric dashboards (see the exception sketch after this checklist).
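A minimal sketch of the 'state the units' and 'prefer value-objects to primitives' items in the checklist above. The identifier types are hypothetical; the point is that a 64-bit address-id and a legacy address-id become distinct types that the compiler will not let you mix up.

```java
// Hypothetical identifier value objects (each public type in its own file).
// Passing a LegacyAddressId where an AddressId is expected no longer compiles.
public record AddressId(long value) { }          // 64-bit address-id
public record LegacyAddressId(String value) { }  // legacy address-id

public record TrackingId(String value) {
    public TrackingId {
        // Validation lives in the value object instead of being repeated by every caller.
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("tracking-id must be non-empty");
        }
    }
}
```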
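For the backward-compatibility, pagination and empty-collection items, a sketch of a response envelope under assumed names (ListAddressesResponse, nextPageToken); it illustrates the shape of such a contract, it is not an existing one.

```java
import java.util.List;
import java.util.Optional;

// Illustrative response envelope:
// - collections default to an empty list, never null
// - newer/optional fields are exposed as Optional so old clients keep working
// - pagination is part of the contract from day one
public final class ListAddressesResponse {
    private final List<String> addressIds;
    private final String nextPageToken; // null means "no more pages"

    public ListAddressesResponse(List<String> addressIds, String nextPageToken) {
        this.addressIds = addressIds == null ? List.of() : List.copyOf(addressIds);
        this.nextPageToken = nextPageToken;
    }

    public List<String> getAddressIds() { return addressIds; } // never null

    public Optional<String> getNextPageToken() { return Optional.ofNullable(nextPageToken); }
}
```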
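For the exception-modelling item, a sketch of how retryable vs non-retryable failures and scenario-specific error codes could be surfaced. The exception names and codes are made up for the example (each public class would live in its own file).

```java
// Clients branch on isRetryable(); dashboards group on the scenario-specific error code.
public abstract class AddressServiceException extends RuntimeException {
    private final String errorCode;
    private final boolean retryable;

    protected AddressServiceException(String errorCode, boolean retryable, String message) {
        super(message);
        this.errorCode = errorCode;
        this.retryable = retryable;
    }

    public String getErrorCode() { return errorCode; }
    public boolean isRetryable() { return retryable; }
}

public final class AddressNotFoundException extends AddressServiceException {
    public AddressNotFoundException(String addressId) {
        super("ADDRESS_NOT_FOUND", false, "No address found for id " + addressId);
    }
}

public final class DependencyTimeoutException extends AddressServiceException {
    public DependencyTimeoutException(String dependency) {
        super("DEPENDENCY_TIMEOUT", true, "Timed out calling " + dependency);
    }
}
```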
On Deployment Concerns
Distributed applications minimize component-failure risks. Availability and the ability to absorb surges improve by having the application deployed on multiple hosts and availability zones. Most applications in Amazon identify themselves as Tier-1/2 based on the impact their non-availability creates. The Operational Excellence (OE) processes are geared to reduce these issues. In 2020 we want to have similar availability characteristics for all usecases. We will have fleets catering to different consumers, to achieve client-specific asks (latency, regulatory compliance). The mechanisms to ensure high availability (monitoring, full CI/CD, sanity/health checks, canaries) will be the same across all the fleets.
- Always have a homogeneous fleet, i.e. all hosts in the fleet should have the same instance type. This keeps the maintenance and IMR overhead low.
- Multiple instance-types imply multiple load-tests and different non-functional characteristics (max-conns, heap-size, disk usage, cpu usage). Have mechanisms to ensure fleet specific load-tests and metrics
- Document the SLAs supported per fleet and alternate/exception behavior
- Document retry behavior. It should be consistent across all fleets
- Document the transient scale-up and scale-down behavior (may be fleet specific) and SOPs to do so
- Document dependencies and fail-over behavior (cache outage, dependent service outage, circuit-breaker outage)
- Document the deployment windows, dashboards, health-checks, rollback behaviors per fleet
- Always have an SOP for on-call to refer to
- Auto-scaling is a desirable feature. Enable this where possible
- Control plane and Data plane have separate characteristics, model them separately.
- Data plane should not speak to config store directly.
- Control plane should act as intermediary between data-plane and environment (config, data).
- Jobs should be triggered on control-plane and not data-plane
- Always have canary tests, sanity tests and health checks. The monitors and metrics should be automated (LPT) where possible (a minimal health-check sketch follows this list)
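A minimal sketch of a deep health check along the lines of the last item, under assumed names (DeepHealthCheck, one probe per dependency); a real fleet would wire something like this into load-balancer and deployment health checks and emit metrics per probe.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// One boolean probe per dependency; the host reports healthy only when every probe passes.
public final class DeepHealthCheck {
    private final Map<String, Supplier<Boolean>> probes;

    public DeepHealthCheck(Map<String, Supplier<Boolean>> probes) {
        this.probes = Map.copyOf(probes);
    }

    /** Returns the names of failing probes; an empty list means the host is healthy. */
    public List<String> failingProbes() {
        return probes.entrySet().stream()
                .filter(entry -> !passes(entry.getValue()))
                .map(Map.Entry::getKey)
                .toList();
    }

    private static boolean passes(Supplier<Boolean> probe) {
        try {
            return Boolean.TRUE.equals(probe.get());
        } catch (RuntimeException e) {
            return false; // a throwing probe counts as unhealthy
        }
    }
}
```

Usage might look like new DeepHealthCheck(Map.of("cache", cache::ping, "database", db::ping)), where the probe implementations are service-specific.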
On Implementation Concerns
- Always validate the input before using it
- Always encode the output before returning it
- Prefer Idempotency and Immutability
- Allow multiple Service-Provider-Implementations (SPI) to co-exist, but have a mechanism to control the exposure
- Prefer concurrency via Non-blocking mode to blocking mode
- Unit tests, integ-tests, version compatibility tests and end-to-end tests are a must-have for every feature added/removed
- App-Security and metrics are needed from day 1
- Access Control and Fine Grained Access Control are to be achieved by convention where possible
- Prefer existing team conventions over introducing new practices. Do not introduce inconsistently implemented practices.
- Separate internal data-structures from external data-structures. Allow external and internal interfaces to evolve separately (see the sketch after this list).
- Adhere to the team coding-guidelines consistently.
- Testability trumps Clean-Code: it is OK to have features built into code to facilitate ease of testing. (Corollary: modular, clean code is easy to test without multiple changes.)
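To illustrate separating internal data-structures from external ones (and validating at the boundary), a sketch with hypothetical types: the mapper is the single place where the two representations meet, so either side can evolve independently and internal-only fields are never exposed.

```java
// Internal model: free to change shape as storage/implementation evolves.
record InternalAddress(long id, String street, String city, String internalShardKey) { }

// External DTO: what the API returns; validated, and deliberately narrower
// than the internal model (internalShardKey is never exposed).
record AddressView(String addressId, String street, String city) {
    AddressView {
        if (addressId == null || addressId.isBlank()) {
            throw new IllegalArgumentException("addressId must be non-empty");
        }
    }
}

final class AddressMapper {
    // The only place where the two representations meet.
    static AddressView toView(InternalAddress internal) {
        return new AddressView(Long.toString(internal.id()), internal.street(), internal.city());
    }

    private AddressMapper() { }
}
```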
On Operational Concerns
Test for
- Correctness of metrics
- Simulated conditions (cached vs non-cached, dependency availability vs non-availability, error conditions, retries)
- Leverage Gremlin (CPU/RAM/IO - stress, soak, break tests)
- Throttle behavior
- Feature Toggle behavior (see the test sketch after this list)
- Cache recovery
- Fleet recovery
- Spill over/resource exhaustion (VIP, SQS)
- Monitor Snitch, JMX, Application
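A sketch of the feature-toggle behavior test mentioned above. The toggle wiring and pricing logic are hypothetical (JUnit 5 assumed); the point is that both toggle states are exercised by automated tests rather than only the currently enabled one.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.function.BooleanSupplier;
import org.junit.jupiter.api.Test;

// Hypothetical toggle-driven behavior: a discount path sits behind a toggle.
class PriceCalculator {
    private final BooleanSupplier discountToggle;

    PriceCalculator(BooleanSupplier discountToggle) {
        this.discountToggle = discountToggle;
    }

    long priceInCents(long baseCents) {
        return discountToggle.getAsBoolean() ? baseCents * 90 / 100 : baseCents;
    }
}

class PriceCalculatorToggleTest {
    @Test
    void discountAppliedWhenToggleOn() {
        assertEquals(900, new PriceCalculator(() -> true).priceInCents(1000));
    }

    @Test
    void fullPriceWhenToggleOff() {
        assertEquals(1000, new PriceCalculator(() -> false).priceInCents(1000));
    }
}
```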
Ease of testing/maintenance (level 0 being easiest, level 3 being hardest; illustrated in the sketch at the end)
Level 0: A pure static function with no side effects
Level 1: A class that has immutable state. Think of a Whole Value that replaces a primitive, like EmailAddress or PhoneNumber.
Level 2: A class that has mutable state and may operate against behaviorless dependencies like Level 1 Whole Values.
Level 3: A class that operates against a dependency with its own behaviors
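A small Java illustration of the four levels; the class names are made up for the example, and the Level 3 class is the one that typically needs a fake or mock in tests.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Level 0: a pure static function with no side effects.
final class Tax {
    static long addTaxCents(long amountCents, int taxBasisPoints) {
        return amountCents + amountCents * taxBasisPoints / 10_000;
    }
}

// Level 1: an immutable Whole Value that replaces a primitive.
record EmailAddress(String value) {
    EmailAddress {
        if (value == null || !value.contains("@")) {
            throw new IllegalArgumentException("not an email address: " + value);
        }
    }
}

// Level 2: mutable state, operating only against behaviorless Level 1 values.
final class MailingList {
    private final Set<EmailAddress> members = new HashSet<>();

    void subscribe(EmailAddress address) {
        members.add(Objects.requireNonNull(address));
    }

    Set<EmailAddress> members() {
        return Set.copyOf(members);
    }
}

// Level 3: operates against a dependency (Mailer) that has behavior of its own.
interface Mailer {
    void send(EmailAddress to, String body);
}

final class NewsletterSender {
    private final Mailer mailer;

    NewsletterSender(Mailer mailer) {
        this.mailer = mailer;
    }

    void sendTo(MailingList list, String body) {
        list.members().forEach(recipient -> mailer.send(recipient, body));
    }
}
```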