Designing Authentication for Enterprise SSO: Lessons from Building KID

When we started building KID, our authentication API, we assumed the hard part would be implementing the protocols correctly. OIDC and SAML are well-documented standards with mature libraries. How hard could it be?

The protocols were the easy part. The hard part was everything the specs do not say: how real identity providers behave, how enterprise IT teams configure them, and what happens at 2 a.m. when a certificate expires.

Enterprise SSO is a compatibility problem, not a protocol problem

Every enterprise deal eventually produces the same sentence: "We need this to work with our IdP." In practice that means Okta, Entra ID, Ping, OneLogin, and a long tail of on-premises ADFS installations, each with its own interpretation of the standards.

A few examples from our integration log:

One IdP sends SAML assertions that expire (NotOnOrAfter) exactly 60 seconds after they are issued. Combined with 30 seconds of clock skew on the customer side, roughly 1% of logins failed - but only for one tenant, and only intermittently.
Another IdP omits the email claim unless an admin enables a non-default mapping. The user exists, authentication succeeds, and provisioning still fails.
ADFS installations frequently sign with certificates that the IT team rotates manually, on a schedule nobody wrote down.

None of these are spec violations you can reject outright. They are conditions you have to absorb.

Design decisions that held up

Treat every assertion as untrusted input

We validate signatures, audience, issuer, and time windows before anything else touches the payload, and we normalize claims into an internal identity record with explicit per-tenant mapping rules. IdP quirks stay at the boundary; the rest of KID never sees them.
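
A minimal sketch of what that normalization boundary might look like, assuming the signature, audience, issuer, and time checks have already passed. The Identity fields, the TENANT_CLAIM_MAPS table, the tenant names, and the choice to lowercase email addresses are all illustrative assumptions, not KID's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Identity:
    """Internal identity record; field names are illustrative."""
    tenant_id: str
    subject: str
    email: str
    display_name: Optional[str] = None


# Hypothetical per-tenant rules: internal field -> claim name that tenant's IdP sends.
TENANT_CLAIM_MAPS = {
    "acme": {"subject": "sub", "email": "email", "display_name": "name"},
    "globex": {
        "subject": "NameID",
        "email": "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress",
        "display_name": "displayname",
    },
}


def _pick(claims: Mapping[str, Any], name: str) -> Optional[str]:
    """Return a trimmed string value, or None if the claim is absent or blank."""
    value = claims.get(name)
    if isinstance(value, (list, tuple)):  # SAML attributes are often multi-valued
        value = value[0] if value else None
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def normalize_claims(tenant_id: str, verified_claims: Mapping[str, Any]) -> Identity:
    """Map already-verified IdP claims into the internal record for one tenant."""
    mapping = TENANT_CLAIM_MAPS[tenant_id]
    subject = _pick(verified_claims, mapping["subject"])
    email = _pick(verified_claims, mapping["email"])
    if subject is None or email is None:
        raise ValueError(f"tenant {tenant_id}: required claim missing after mapping")
    return Identity(
        tenant_id=tenant_id,
        subject=subject,
        email=email.lower(),  # assumption: addresses are compared case-insensitively
        display_name=_pick(verified_claims, mapping["display_name"]),
    )
```

Everything downstream takes an Identity and never sees a raw claim name, so a new IdP quirk becomes a new mapping entry rather than a code change elsewhere.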

Allow generous, configurable clock skew - and log when you use it

We accept up to 120 seconds of skew by default, but every login that needs the allowance is logged. That log is how we caught the 60-second assertion window above before the customer noticed.
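
The check-and-log pattern can be sketched roughly as follows. This is an illustration under assumptions, not KID's implementation: the function name, the logger name, and the return shape are hypothetical, and timestamps are assumed to be timezone-aware UTC datetimes already parsed from the assertion.

```python
import logging
from datetime import datetime, timedelta
from typing import Tuple

log = logging.getLogger("kid.sso")  # hypothetical logger name

DEFAULT_MAX_SKEW = timedelta(seconds=120)


def check_time_window(
    not_before: datetime,
    not_on_or_after: datetime,
    now: datetime,
    tenant_id: str,
    max_skew: timedelta = DEFAULT_MAX_SKEW,
) -> Tuple[bool, bool]:
    """Return (valid, used_allowance) for an assertion's validity window."""
    # Strictly inside the window: no allowance needed, nothing to log.
    if not_before <= now < not_on_or_after:
        return True, False

    if now < not_before:
        drift = not_before - now          # our clock is behind the IdP's
        within = drift <= max_skew
    else:
        drift = now - not_on_or_after     # assertion looks expired to us
        within = drift < max_skew         # NotOnOrAfter is an exclusive bound

    if within:
        # The log line is the point: it surfaces tight windows and drifting clocks.
        log.warning(
            "tenant=%s login accepted using clock skew allowance, drift=%.0fs",
            tenant_id,
            drift.total_seconds(),
        )
        return True, True
    return False, False
```

Returning the used_allowance flag alongside the verdict lets the caller count these per tenant, which makes a tenant that routinely depends on the allowance stand out.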

Automate metadata and certificate rotation

Most SAML outages are certificate outages. KID polls IdP metadata endpoints daily, trusts both the current and the incoming signing certificate during a rotation window, and alerts the tenant admin 30 days before a pinned certificate expires. Since we shipped this, certificate-related login failures across all tenants have dropped to zero.
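
One way to model the dual-trust window and the expiry alert is sketched below. It leaves out metadata fetching and real X.509 parsing, and represents each certificate by a fingerprint and an expiry time. SigningCert, TenantTrust, and expiring_soon are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ALERT_LEAD = timedelta(days=30)


@dataclass(frozen=True)
class SigningCert:
    fingerprint: str     # e.g. a SHA-256 digest of the DER-encoded certificate
    not_after: datetime  # expiry, timezone-aware UTC


@dataclass
class TenantTrust:
    current: SigningCert
    incoming: Optional[SigningCert] = None  # staged when a metadata poll sees a new cert

    def is_trusted(self, fingerprint: str, now: datetime) -> bool:
        """During a rotation window, either certificate may have signed the assertion."""
        for cert in (self.current, self.incoming):
            if cert and cert.fingerprint == fingerprint and now < cert.not_after:
                return True
        return False

    def promote_incoming(self) -> None:
        """End the rotation window: the incoming certificate becomes the only one."""
        if self.incoming is not None:
            self.current, self.incoming = self.incoming, None


def expiring_soon(cert: SigningCert, now: datetime, lead: timedelta = ALERT_LEAD) -> bool:
    """True when the cert expires within the lead time (or has already expired)."""
    return cert.not_after - now <= lead
```

A daily job would then call expiring_soon for each tenant's pinned certificate and notify the tenant admin once it returns True.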

Make provisioning a first-class feature

Authentication answers "who is this"; enterprises also need "who should exist". We learned to stop treating SCIM and just-in-time provisioning as afterthoughts. JIT provisioning with per-tenant attribute mapping now covers most deployments, and the rule is simple: if a claim is missing, fail with a message the customer's IT admin can act on, not a generic 500.
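
As an illustration of the "actionable message" rule, here is a small sketch. The exception class, the function name, and the message wording are assumptions made for the example, not KID's actual error format.

```python
from typing import Any, List, Mapping, Sequence


class MissingClaimError(Exception):
    """Raised when JIT provisioning cannot proceed; the message targets an IT admin."""

    def __init__(self, tenant_id: str, idp_name: str, missing: List[str]):
        self.tenant_id = tenant_id
        self.idp_name = idp_name
        self.missing = missing
        super().__init__(
            f"SSO login for tenant '{tenant_id}' authenticated successfully, but the "
            f"user could not be provisioned: {idp_name} did not send the required "
            f"attribute(s): {', '.join(missing)}. Ask your IdP administrator to add "
            f"an attribute mapping for them in this application's SSO configuration."
        )


def require_claims(
    tenant_id: str,
    idp_name: str,
    claims: Mapping[str, Any],
    required: Sequence[str],
) -> None:
    """Fail early, naming every missing attribute at once rather than one per attempt."""
    missing = [name for name in required if not claims.get(name)]
    if missing:
        raise MissingClaimError(tenant_id, idp_name, missing)
```

Listing all missing attributes in a single error spares the admin a fix-retry loop when more than one mapping is absent.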

The testing matrix nobody warns you about

Our CI runs protocol-level tests against recorded fixtures from every IdP family we support: assertion variants, encrypted and signed combinations, missing claims, expired certificates, and clock skew scenarios. The fixture library started as a debugging aid and became one of our most valuable assets - every production incident adds a fixture, so no integration bug bites us twice.
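
The shape of such a fixture-driven suite can be sketched as follows. The Fixture fields, the error codes, the sample entries, and the stand-in validate function are all hypothetical; a real suite would replay recorded assertions through the actual validation pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional


@dataclass(frozen=True)
class Fixture:
    name: str                      # often the incident that produced the fixture
    idp_family: str
    claims: Mapping[str, object]   # stand-in for a recorded assertion
    expected_error: Optional[str]  # None means the login should succeed


def validate(claims: Mapping[str, object]) -> Optional[str]:
    """Toy stand-in for the validation pipeline: returns an error code or None."""
    if claims.get("cert_expired"):
        return "cert_expired"
    if not claims.get("email"):
        return "missing_claim:email"
    return None


FIXTURES = [
    Fixture("happy-path", "okta", {"email": "a@example.com"}, None),
    Fixture("email-mapping-disabled", "entra", {"sub": "u1"}, "missing_claim:email"),
    Fixture("manual-rotation-missed", "adfs",
            {"email": "b@example.com", "cert_expired": True}, "cert_expired"),
]


def run_fixtures(
    fixtures: List[Fixture],
    validator: Callable[[Mapping[str, object]], Optional[str]],
) -> List[str]:
    """Replay every fixture and return a description of each mismatch."""
    failures = []
    for f in fixtures:
        actual = validator(f.claims)
        if actual != f.expected_error:
            failures.append(
                f"{f.idp_family}/{f.name}: expected {f.expected_error!r}, got {actual!r}"
            )
    return failures
```

Adding a fixture after an incident then means appending one entry with the recorded input and the outcome that should have happened.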

What we would tell our past selves

Budget more time for the second IdP than the first. The first integration teaches you the protocol. The second teaches you which of your assumptions were actually IdP-specific.
Error messages are a product feature. The person debugging a failed SSO login is usually a customer IT admin, not your engineer. Write for them.
Rotation is the steady state. Certificates, metadata, signing keys, admin contacts - everything rotates. Design for the rotation, not the setup.

Enterprise SSO is unglamorous work, but it is also where trust is won. The customer's first real interaction with your platform is a login. It should be boring, in the best sense of the word.