50 Cloud & DevOps Interview Questions and Answers (2026)
50 cloud and DevOps interview questions covering AWS Lambda, Docker, Microservices, API Gateway, S3, serverless, and Azure Entra ID. With code examples.
Cloud and DevOps knowledge is no longer optional for backend engineers. AWS Lambda, Docker, and microservices show up in job requirements for roles that were purely application-focused a few years ago. Azure Entra ID is now a standard topic for any role touching enterprise Microsoft environments.
These 50 questions cover what interviewers are actually asking in 2026 across six cloud and DevOps topics: serverless and AWS Lambda, API Gateway, S3, Docker, microservices, and Azure Entra ID. Answers are written to be said out loud in an interview: specific, grounded in real behavior, with working examples where they help. If you're also brushing up on the application side, our Node.js interview questions guide pairs well with the Lambda and serverless sections here, since most serverless backends in 2026 are written in Node.js.
Category 1: Serverless and AWS Lambda (Q1-Q8)
Serverless and Lambda questions test whether you understand the execution model behind the trend, not just the marketing pitch. Expect questions about cold starts, concurrency, and the operational tradeoffs of going serverless.
Q1. What is serverless computing and what problem does it solve?
Serverless computing is a cloud execution model where the cloud provider manages the server infrastructure entirely. You write and deploy code (a function), define what triggers it, and the provider handles provisioning, scaling, patching, and availability. You pay only for actual execution time, not for idle capacity.
The problem it solves: traditional server deployments require you to provision capacity in advance. A VM or container running 24/7 costs money even at 3am when no traffic arrives. Auto-scaling helps, but you still have to manage the scaling configuration. Serverless removes most of that operational overhead.
Key characteristics:
- Stateless execution: functions do not persist state between invocations.
- Cold starts: functions take extra time on first invocation after an idle period.
- Maximum execution time: AWS Lambda caps at 15 minutes per invocation.
- Best fit: event-driven tasks, API backends, data processing, scheduled jobs.
When NOT to use serverless:
- Long-running processes like video encoding or ML training.
- Applications needing persistent connections, such as WebSocket servers (unless you use an API Gateway WebSocket API, which holds the connection and invokes Lambda per message).
- Workloads that run continuously at high volume, where an always-on EC2 instance or container may be cheaper.
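To make the last point concrete, here is a rough break-even sketch. The formula (a per-request charge plus a GB-second compute charge) mirrors how Lambda is billed, but `lambdaMonthlyCost` and `alwaysOnIsCheaper` are hypothetical helpers, and every price below is an illustrative placeholder rather than a quoted AWS rate.

```javascript
// Rough monthly Lambda cost: request charge + compute charge (GB-seconds).
// All prices are caller-supplied assumptions, not official AWS rates.
function lambdaMonthlyCost({ invocations, avgDurationMs, memoryMb, pricePerMillionRequests, pricePerGbSecond }) {
  const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
  const requestCost = (invocations / 1_000_000) * pricePerMillionRequests;
  return requestCost + gbSeconds * pricePerGbSecond;
}

// True when an always-on instance at a flat monthly price would cost less.
function alwaysOnIsCheaper(lambdaParams, instanceMonthlyPrice) {
  return lambdaMonthlyCost(lambdaParams) > instanceMonthlyPrice;
}

// Placeholder prices and a placeholder $50/month instance, for illustration only.
const placeholderPrices = { pricePerMillionRequests: 0.2, pricePerGbSecond: 0.00001 };
const spiky = { invocations: 1_000_000, avgDurationMs: 100, memoryMb: 1024, ...placeholderPrices };
const steady = { invocations: 1_000_000_000, avgDurationMs: 100, memoryMb: 1024, ...placeholderPrices };

console.log(alwaysOnIsCheaper(spiky, 50));  // false: low volume favors Lambda
console.log(alwaysOnIsCheaper(steady, 50)); // true: sustained high volume favors always-on
```

In an interview, the numbers matter less than showing you know the cost scales with invocations, duration, and memory together.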
Q2. How does AWS Lambda work? Explain the execution model.
Lambda runs your function code inside an execution environment: an isolated, managed sandbox (a Firecracker microVM under the hood) that AWS provisions, runs, and destroys. You never see the server.
Execution flow:
- A trigger event arrives (HTTP request via API Gateway, S3 upload, SQS message, EventBridge rule, etc.).
- Lambda allocates an execution environment with your chosen memory and CPU.
- AWS downloads your deployment package (code plus dependencies), initializes the runtime, and runs your init code outside the handler (once per environment).
- Lambda calls your handler function with the event and context objects.
- Your handler processes the event and returns a response.
- The execution environment is frozen between invocations and kept around to handle subsequent invocations (warm start). The idle window is commonly observed at roughly 5 to 15 minutes, but AWS does not document or guarantee it.
- After a period of inactivity, the environment is destroyed.
// Node.js Lambda handler
exports.handler = async (event, context) => {
  // event: the trigger payload (HTTP request, S3 event, etc.)
  // context: runtime info (function name, remaining time, etc.)
  console.log("Event:", JSON.stringify(event));
  console.log("Remaining time:", context.getRemainingTimeInMillis(), "ms");
  // Business logic here
  const result = await processEvent(event);
  // Return response (format depends on trigger type)
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(result),
  };
};
// Code OUTSIDE the handler runs once per execution environment (cold start)
// Good for: database connections, SDK clients, config loading
const dbClient = new DatabaseClient({ host: process.env.DB_HOST });
Lambda supports Node.js, Python, Java, Go, Ruby, .NET, and custom runtimes via the Runtime API.
Q3. What is a Lambda cold start and how do you reduce it?
A cold start occurs when Lambda must create a new execution environment before running your function. This adds 100ms to several seconds of latency before your handler even starts. Warm starts reuse an existing environment and run in milliseconds.
Cold starts happen when:
- The function has not been invoked recently (environment was destroyed).
- Traffic spikes cause more concurrent executions than existing environments.
- You deploy a new function version.
- The function runs inside a VPC (adds latency for ENI attachment, now much improved with Hyperplane ENI, but VPC still adds some overhead).
Cold start mitigation strategies:
1. Provisioned Concurrency: keeps N execution environments pre-initialized and always warm. Eliminates cold starts for those instances entirely. Costs money even when idle.
# Set provisioned concurrency on a specific version or alias
aws lambda put-provisioned-concurrency-config \
--function-name my-api \
--qualifier prod \
--provisioned-concurrent-executions 10
2. SnapStart (Java, and since late 2024 also Python and .NET): takes a snapshot of the initialized execution environment when you publish a version and restores from it on a cold start. Typically cuts multi-second Java cold starts to well under a second.
3. Keep deployment packages small: a smaller zip means a faster download and faster cold start. Remove unused dependencies and use Lambda Layers for shared libs.
4. Choose fast runtimes: Node.js, Python, and Go have the fastest cold starts. Java and .NET are the slowest.
5. Move init code outside the handler: database connections, SDK clients, and config loading done outside the handler run once per environment, not per invocation.
6. Schedule warm-up pings: invoke the function every 5 minutes via EventBridge to prevent the environment from going cold. Works but is imprecise and not reliable for burst traffic.
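Strategy 5 is the one interviewers most often ask you to demonstrate. The sketch below uses a hypothetical `createExpensiveClient` as a stand-in for a database or SDK client and counts how often initialization runs; in a real function you would also export the handler.

```javascript
// Stand-in for an expensive client (database, SDK). Hypothetical, for illustration only.
let initCount = 0;
function createExpensiveClient() {
  initCount += 1; // track how many times initialization actually runs
  return { query: (id) => ({ id, status: "ok" }) };
}

// Module scope: runs once per execution environment, during the cold start.
const client = createExpensiveClient();

// Handler scope: runs on every invocation and reuses the module-scope client.
async function handler(event) {
  return client.query(event.id);
}
```

Calling `handler` any number of times in the same environment leaves `initCount` at 1, which is exactly the saving a warm start gives you.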
Q4. What is Lambda concurrency and what are the two types?
Concurrency is the number of Lambda function instances executing at the same time. By default, Lambda can run up to 1,000 concurrent executions per account per region (a soft limit that can be raised).
Unreserved concurrency is the pool shared by all functions in your account. If one function consumes all available concurrency during a spike, other functions in the same account get throttled.
Reserved concurrency is a hard allocation for a specific function. It has two effects: it guarantees that function can always scale up to the reserved amount, and it hard-caps the function at that amount (it cannot use more even if the pool allows).
# Reserve 100 concurrent executions for the payments function
aws lambda put-function-concurrency \
--function-name payment-processor \
--reserved-concurrent-executions 100
# Remove reserved concurrency (return to unreserved pool)
aws lambda delete-function-concurrency \
--function-name payment-processor
When Lambda exceeds its concurrency limit it throttles, returning a 429 TooManyRequestsException for synchronous invocations. For asynchronous invocations, Lambda retries automatically.
Provisioned concurrency (different from reserved) pre-initializes environments. Reserved concurrency sets a hard maximum. You can combine both: set reserved concurrency to 50 and provisioned to 10 (10 always warm, up to 50 total).
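A quick way to size reserved or provisioned concurrency is the rule of thumb that concurrency is roughly average requests per second multiplied by average duration in seconds. The helper names below are hypothetical, and the sketch ignores traffic bursts, so treat the result as a starting estimate.

```javascript
// Estimated concurrency = average requests per second x average duration in seconds.
function estimateConcurrency(requestsPerSecond, avgDurationMs) {
  return Math.ceil(requestsPerSecond * (avgDurationMs / 1000));
}

// Synchronous invocations beyond the available concurrency would be throttled (429).
function wouldThrottle(requestsPerSecond, avgDurationMs, concurrencyLimit) {
  return estimateConcurrency(requestsPerSecond, avgDurationMs) > concurrencyLimit;
}

console.log(estimateConcurrency(200, 500)); // 100
```

With a reserved concurrency of 100, a steady 200 requests per second at 500 ms each just fits; doubling the traffic would start throttling.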
Q5. What are Lambda Layers and when do you use them?
A Lambda Layer is a .zip archive containing libraries, a custom runtime, configuration, or other dependencies. You attach up to 5 layers to a function. Lambda merges the layer contents into the /opt directory at runtime.
Use layers for:
- Sharing common libraries across multiple functions (avoid bundling the same 200MB dependency in every function zip).
- Keeping deployment packages small for faster cold starts and easier deployments.
- Distributing internal utilities or helper code across your team.
- Custom runtimes for unsupported languages.
# Create a layer from a zip containing Node.js dependencies
zip -r layer.zip nodejs/
aws lambda publish-layer-version \
--layer-name my-shared-libs \
--description "Shared utilities and database client" \
--zip-file fileb://layer.zip \
--compatible-runtimes nodejs20.x nodejs22.x
# Attach the layer to a function
aws lambda update-function-configuration \
--function-name my-api \
--layers arn:aws:lambda:us-east-1:123456789012:layer:my-shared-libs:3
In your function, access layer contents at /opt:
const { sharedUtil } = require("/opt/nodejs/shared-util");
AWS also publishes public layers, such as the AWS X-Ray SDK layer and the Lambda Powertools layer for Python and Node.js.
Q6. What are the two Lambda invocation types and how does error handling differ?
Synchronous invocation: the caller waits for Lambda to finish and return a response. API Gateway, ALB, and direct SDK calls use synchronous invocation. If the function throws, the error is returned to the caller immediately. Lambda does NOT automatically retry synchronous failures.
// Direct synchronous invocation (AWS SDK for JavaScript v3; run inside an async function)
const { LambdaClient, InvokeCommand } = require("@aws-sdk/client-lambda");
const lambda = new LambdaClient({});
const result = await lambda.send(new InvokeCommand({
  FunctionName: "my-function",
  InvocationType: "RequestResponse", // synchronous
  Payload: JSON.stringify({ key: "value" }),
}));
Asynchronous invocation: the caller sends the event and gets a 202 Accepted response immediately. Lambda processes the event in the background. S3 event notifications, SNS, EventBridge, and SES use async invocation. Lambda automatically retries failed async invocations up to 2 times (3 total attempts) with delays between retries.
// Asynchronous invocation (reuses the client created above)
await lambda.send(new InvokeCommand({
  FunctionName: "my-function",
  InvocationType: "Event", // asynchronous
  Payload: JSON.stringify({ key: "value" }),
}));
// Returns immediately with 202
Dead Letter Queue (DLQ): for async invocations that fail after all retries, configure a DLQ (SQS queue or SNS topic) to receive the failed event. Inspect DLQ messages to diagnose persistent failures.
aws lambda update-function-configuration \
--function-name my-function \
--dead-letter-config TargetArn=arn:aws:sqs:us-east-1:123456789012:my-dlq
Lambda Destinations (the newer approach): route the results of asynchronous invocations to SQS, SNS, EventBridge, or another Lambda function on success OR failure. More flexible than a DLQ because it captures both success and failure events.
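The async retry behavior described above (one initial attempt, up to two retries, then a failure target) can be modeled in a few lines. This is an in-memory simulation only: `invokeAsyncSimulated` is a hypothetical helper, the queue is a plain array, and the real delays between retries are omitted.

```javascript
// Simulates Lambda's async handling: 1 initial attempt + up to 2 retries,
// then the event is handed to a failure target (DLQ or on-failure destination).
// `fn` and `deadLetterQueue` are in-memory stand-ins, not AWS SDK objects.
function invokeAsyncSimulated(fn, event, deadLetterQueue, maxRetries = 2) {
  let attempts = 0;
  let lastError;
  while (attempts <= maxRetries) {
    attempts += 1;
    try {
      return { status: "succeeded", attempts, result: fn(event) };
    } catch (err) {
      lastError = err; // real Lambda waits between retries; omitted here
    }
  }
  deadLetterQueue.push({ event, error: lastError.message, attempts });
  return { status: "failed", attempts };
}
```

A function that always throws ends up in the queue after three attempts, which is why async handlers should be idempotent: the same event may be processed more than once.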
Q7. What are Lambda's key limits and how do you work around them?
| Limit | Value | Workaround |
|---|---|---|
| Max execution duration | 15 minutes | Use Step Functions for longer workflows |
| Max memory | 10,240 MB (10 GB) | Break into smaller functions |
| Deployment package (zip) | 50 MB zipped (direct upload), 250 MB unzipped (including layers) | Upload via S3 or use Lambda Layers; use container images past 250 MB |
| Container image size | 10 GB | Fine for most use cases |
| /tmp storage | 512 MB by default, configurable up to 10 GB (since 2022) | Use S3 for larger temp files |
| Concurrency (default) | 1,000 per region | Request limit increase |
| Environment variables | 4 KB total | Use SSM Parameter Store or Secrets Manager |
| Payload (sync invocation) | 6 MB request, 6 MB response | Stream response, use S3 for large payloads |
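The payload row is the limit people hit most often in practice. A minimal sketch of the usual workaround follows, assuming the 6 MB figure from the table and a hypothetical `choosePayloadStrategy` helper: measure the serialized size and fall back to passing an S3 pointer when the data is too large.

```javascript
const SYNC_PAYLOAD_LIMIT_BYTES = 6 * 1024 * 1024; // 6 MB, per the table above

// Decide how to pass data to a synchronous invocation.
function choosePayloadStrategy(payload, limitBytes = SYNC_PAYLOAD_LIMIT_BYTES) {
  const sizeBytes = Buffer.byteLength(JSON.stringify(payload), "utf8");
  return sizeBytes <= limitBytes
    ? { strategy: "inline", sizeBytes }
    : { strategy: "s3-pointer", sizeBytes }; // upload to S3 first, then pass bucket and key
}
```

Measuring bytes rather than string length matters because multi-byte characters make the UTF-8 payload larger than the character count suggests.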
VPC-specific behavior: Lambda functions inside a VPC can access private resources (RDS, ElastiCache) but cannot access the public internet unless routed through a NAT Gateway. Always add a NAT Gateway if VPC Lambda functions need outbound internet access.
Q8. How do you monitor and debug AWS Lambda in production?
CloudWatch Logs: every console.log(), print(), or fmt.Println() call from your handler is captured automatically. Each function gets its own log group (/aws/lambda/function-name). Use structured logging (JSON) for better queryability.
// Structured logging for CloudWatch Insights queries
console.log(JSON.stringify({
  level: "INFO",
  message: "Order processed",
  orderId: event.orderId,
  duration: Date.now() - startTime,
  userId: event.userId,
}));
CloudWatch Metrics: Lambda automatically publishes:
- Invocations: total calls.
- Duration: execution time (p50, p95, p99).
- Errors: function-level errors.
- Throttles: invocations rejected due to concurrency limits.
- ConcurrentExecutions: peak concurrent instances.
AWS X-Ray: distributed tracing. Add the X-Ray SDK to trace downstream calls (DynamoDB, S3, HTTP calls) and view flame graphs of where time is spent.
const AWSXRay = require("aws-xray-sdk-core");
const AWS = AWSXRay.captureAWS(require("aws-sdk"));
// All AWS SDK (v2) calls now appear in X-Ray traces
// For SDK v3 clients, wrap each client with AWSXRay.captureAWSv3Client(client)
Lambda Powertools (Node.js / Python, also available for Java and .NET): an AWS-maintained utility library adding structured logging, tracing, and metrics with minimal code.
const { Logger } = require("@aws-lambda-powertools/logger");
const logger = new Logger({ serviceName: "order-service" });
logger.info("Processing order", { orderId: event.orderId });
Category 2: AWS API Gateway (Q9-Q14)
API Gateway questions test whether you know which API type and integration to reach for, and how authorization and throttling actually work under the hood.
Q9. What is AWS API Gateway and what are the three API types?
API Gateway is a fully managed service for creating, publishing, securing, and monitoring APIs at any scale. It acts as the front door for applications to access backend services: Lambda functions, EC2, ECS, or any HTTP endpoint.
REST API: the original, most feature-rich option. Supports request/response transformation, request validation, usage plans, API keys, custom domain names, caching, and fine-grained IAM permissions. More expensive and complex to configure.
HTTP API: launched in 2020 as a simpler, cheaper alternative. Supports Lambda and HTTP integrations, JWT authorizers, and OIDC/OAuth 2.0. Up to 71% cheaper than REST API. Lacks some REST API features (no built-in response transformation, no usage plans). Best for most modern serverless APIs.
WebSocket API: for two-way stateful communication. Maintains persistent connections. Used for real-time chat, live dashboards, and multiplayer games. Supports $connect, $disconnect, and custom route keys.
Q10. What integration types does API Gateway support?
API Gateway can route requests to different backends depending on the integration type.
Lambda Proxy: the most common. API Gateway passes the full HTTP request (headers, query params, body, path params) to Lambda as a structured event. Lambda returns a structured response object. Zero request/response transformation by API Gateway.
// Lambda receives this event from API Gateway proxy integration
{
  "httpMethod": "POST",
  "path": "/users",
  "headers": { "Content-Type": "application/json", "Authorization": "Bearer ..." },
  "queryStringParameters": { "include": "profile" },
  "body": "{\"name\":\"Alice\",\"email\":\"alice@example.com\"}",
  "requestContext": { "requestId": "abc-123", "stage": "prod" }
}
- HTTP: proxy the request to any publicly routable HTTP endpoint. Useful for putting API Gateway in front of an existing server or third-party service.
- AWS Service: directly call an AWS service action (SQS SendMessage, DynamoDB PutItem) without going through Lambda. Reduces latency and cost by removing the Lambda layer.
- Mock: return a hardcoded response from API Gateway itself. Useful for mocking endpoints during development or returning maintenance-mode responses.
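Because Lambda Proxy is the integration you will be asked about most, it helps to show the handler side too. The sketch below handles the POST /users event shown earlier; `createUserHandler` and `respond` are hypothetical names, and the validation rule is only an example.

```javascript
// Build the response shape API Gateway expects from a proxy integration.
function respond(statusCode, payload) {
  return {
    statusCode,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload), // body must be a string, not an object
  };
}

// Minimal Lambda proxy handler for a POST /users event.
async function createUserHandler(event) {
  let body;
  try {
    body = JSON.parse(event.body || "{}"); // the request body arrives as a string
  } catch {
    return respond(400, { error: "Invalid JSON" });
  }
  if (!body.email) return respond(400, { error: "email is required" });
  const include = (event.queryStringParameters || {}).include; // may be null or absent
  return respond(201, { name: body.name, email: body.email, include });
}
```

The two details worth saying out loud: the body is always a string in both directions, and `queryStringParameters` can be null when the request has no query string.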
Q11. How does API Gateway handle authorization?
Three authorization mechanisms come up most often (REST APIs additionally support Cognito user pool authorizers):
IAM Authorization: requests must be signed with AWS Signature Version 4. Best for internal service-to-service calls and AWS CLI/SDK access. Not for public APIs since every caller needs AWS credentials.
Lambda Authorizer (formerly Custom Authorizer): API Gateway calls a Lambda function with the request token (or full request). The Lambda returns an IAM policy (allow/deny) and optionally a context object passed to the backend. Results can be cached by API Gateway to reduce Lambda calls.
// Lambda Authorizer handler (TOKEN type)
const jwt = require("jsonwebtoken"); // third-party package, bundled with the function
exports.handler = async (event) => {
  const token = event.authorizationToken; // Bearer <jwt>
  try {
    const decoded = jwt.verify(token.split(" ")[1], process.env.JWT_SECRET);
    return {
      principalId: decoded.sub,
      policyDocument: {
        Version: "2012-10-17",
        Statement: [{ Effect: "Allow", Action: "execute-api:Invoke", Resource: event.methodArn }],
      },
      context: { userId: decoded.sub, email: decoded.email },
    };
  } catch {
    throw new Error("Unauthorized");
  }
};
JWT Authorizer (HTTP API only): API Gateway validates JWT tokens directly without invoking a Lambda function. Configure the issuer URL (Cognito, Auth0, Okta) and API Gateway verifies signature and claims automatically. No Lambda cost, lower latency.
# Serverless Framework: HTTP API with JWT authorizer
provider:
  httpApi:
    authorizers:
      jwtAuthorizer:
        type: jwt
        identitySource: $request.header.Authorization
        issuerUrl: https://cognito-idp.us-east-1.amazonaws.com/us-east-1_XXX
        audience:
          - my-app-client-id
Q12. How does API Gateway throttling work?
API Gateway enforces throttling at two levels.
Account-level throttle: default steady-state rate of 10,000 requests/second with a burst capacity of 5,000 requests. Shared across all APIs in the account within a region.
Stage-level and method-level throttle: set per API stage or per individual route/method. Overrides the account default for that specific resource.
When throttled, API Gateway returns 429 Too Many Requests.
Usage Plans (REST API): pair with API keys to set per-customer throttle limits and monthly request quotas. Useful for monetized APIs.
# Set the default throttle for every method in a REST API stage via AWS CLI
# (replace /*/* with a specific resource path and HTTP method to target one method)
aws apigateway update-stage \
--rest-api-id abc123 \
--stage-name prod \
--patch-operations \
'op=replace,path=/*/*/throttling/rateLimit,value=1000' \
'op=replace,path=/*/*/throttling/burstLimit,value=500'
Behind the scenes, API Gateway uses a token bucket algorithm. The burst limit is the bucket capacity (tokens available instantly). The rate limit is the refill rate (tokens added per second).
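If an interviewer asks you to whiteboard the token bucket, something like the following is enough. It is a generic sketch of the algorithm, not API Gateway's actual implementation; the `TokenBucket` class and its injectable clock are assumptions made so the behavior is easy to test.

```javascript
class TokenBucket {
  // ratePerSecond: refill rate (the rate limit); burst: bucket capacity (the burst limit).
  // `now` returns milliseconds and is injectable so tests can control time.
  constructor(ratePerSecond, burst, now = () => Date.now()) {
    this.rate = ratePerSecond;
    this.capacity = burst;
    this.tokens = burst; // bucket starts full
    this.now = now;
    this.last = now();
  }

  // Returns true if the request is allowed, false if it should be throttled.
  tryConsume() {
    const t = this.now();
    const elapsedSeconds = (t - this.last) / 1000;
    // Refill based on elapsed time, never beyond capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.rate);
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // API Gateway would answer 429 Too Many Requests here
  }
}
```

A full bucket absorbs a sudden spike up to the burst size; after that, requests are admitted only as fast as tokens refill.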
Q13. What are API Gateway stages and how do you use them for deployments?
A stage is a named reference to a specific deployment of your API (e.g., dev, staging, prod). Each stage has its own URL, throttle settings, logging configuration, and stage variables.
# Stage URLs
https://abc123.execute-api.us-east-1.amazonaws.com/dev/users
https://abc123.execute-api.us-east-1.amazonaws.com/prod/users
Stage variables work like environment variables for API Gateway. Use them to point different stages to different Lambda function aliases or backends.
# Stage variable: lambdaAlias = "dev" in dev stage, "prod" in prod stage
# Lambda integration URI uses the stage variable:
# arn:aws:apigateway:us-east-1:lambda:path/.../functions/${stageVariables.lambdaAlias}/invocations
Canary deployments: route a percentage of traffic to a new stage deployment before full release. If metrics look good, promote the canary. If not, roll back.
aws apigateway create-deployment \
--rest-api-id abc123 \
--stage-name prod \
--canary-settings '{"percentTraffic":10,"stageVariableOverrides":{"lambdaAlias":"canary"}}'
Custom domains: map your own domain (api.yourcompany.com) to an API Gateway endpoint using Route 53 and ACM certificates, hiding the default execute-api URL.
Q14. What is the difference between REST API and HTTP API in API Gateway?
| Feature | REST API | HTTP API |
|---|---|---|
| Price | ~$3.50/million requests | ~$1.00/million requests |
| JWT authorizers | No native JWT authorizer (use a Cognito user pool authorizer or a Lambda authorizer) | Yes (native, no Lambda needed) |
| Response transformation | Yes (mapping templates) | No |
| Request validation | Yes | No |
| Usage plans + API keys | Yes | No |
| WebSocket | No | No (separate WebSocket API type) |
| Private integrations (VPC Link) | Yes | Yes |
| OpenAPI import/export | Yes | Partial |
| Caching | Yes | No |
| Latency | Slightly higher | Lower |
Choose HTTP API when you are building a new serverless API, you use JWT auth (Cognito, Auth0), and you do not need request/response transformation or usage plans. It is simpler, cheaper, and faster for the majority of use cases.
Choose REST API when you need built-in caching, response transformation via mapping templates, usage plans for monetization, or API keys with per-key throttle controls.
Category 3: AWS S3 (Q15-Q20)
S3 questions test your understanding of object storage fundamentals: storage classes, versioning, access control, and the presigned URL pattern that almost every file-upload feature depends on. If your data layer also includes a NoSQL store alongside S3, our NoSQL interview questions guide covers DynamoDB and MongoDB patterns that frequently pair with S3 for media and document storage.
Q15. What is Amazon S3 and what are its core concepts?
Amazon S3 (Simple Storage Service) is AWS's object storage service. It stores any file type as an object inside a container called a bucket. S3 is designed for 99.999999999% (11 nines) durability and scales from bytes to exabytes.
Bucket: a container for objects. Bucket names are globally unique across all AWS accounts. Each bucket lives in one AWS region.
Object: a file stored in S3. Consists of the object data (binary) plus metadata (key-value pairs). Maximum object size is 5TB. Objects over 5GB require multipart upload.
Key: the full path and name of the object within the bucket, for example images/profile/user-1001/avatar.jpg. S3 is flat (no real folders): the slash is just part of the key name, but the console displays it as folders.
URL format: https://bucket-name.s3.amazonaws.com/path/to/object
Or regional: https://bucket-name.s3.us-east-1.amazonaws.com/path/to/object
Key S3 capabilities:
- Object versioning: keep multiple versions of the same key.
- Lifecycle policies: automatically transition or delete objects by age.
- Replication: cross-region or same-region replication.
- Event notifications: trigger Lambda, SQS, or SNS on object events.
- Static website hosting: serve HTML/CSS/JS files as a static site.
- Presigned URLs: temporary access to private objects.
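The "flat namespace" point about keys is easier to explain with a small model. The sketch below imitates what a listing with a prefix and a delimiter returns (objects directly under the prefix, plus common prefixes that the console renders as folders). `listWithDelimiter` is a hypothetical in-memory helper, not an S3 SDK call.

```javascript
// Emulates how a prefix + delimiter listing groups flat keys into "folders".
function listWithDelimiter(keys, prefix = "", delimiter = "/") {
  const objects = [];
  const commonPrefixes = new Set();
  for (const key of keys) {
    if (!key.startsWith(prefix)) continue;
    const rest = key.slice(prefix.length);
    const idx = rest.indexOf(delimiter);
    if (idx === -1) {
      objects.push(key); // no further delimiter: an object directly under the prefix
    } else {
      // Everything up to and including the next delimiter acts as a "folder".
      commonPrefixes.add(prefix + rest.slice(0, idx + delimiter.length));
    }
  }
  return { objects, commonPrefixes: [...commonPrefixes] };
}

const keys = [
  "images/profile/user-1001/avatar.jpg",
  "images/logo.png",
  "readme.txt",
];
console.log(listWithDelimiter(keys, "images/"));
```

Nothing in the bucket is actually nested; the folder view is computed from key strings at listing time.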
Q16. What are S3 storage classes and when do you use each?
S3 offers several storage classes optimized for different access patterns and cost profiles.
- S3 Standard: the default. Low latency, high availability (99.99%). Best for frequently accessed data: active user uploads, application assets, frequently read datasets. Most expensive per GB.
- S3 Intelligent-Tiering: automatically moves objects between frequent-access, infrequent-access, and archive tiers based on access patterns. Small monthly monitoring fee per object. Best when access patterns are unpredictable.
- S3 Standard-IA (Infrequent Access): lower storage cost than Standard, but adds a per-GB retrieval fee. Minimum storage duration of 30 days. Best for data accessed once a month or less: backups, disaster recovery copies.
- S3 One Zone-IA: same as Standard-IA but stored in a single AZ. Lower cost, but lower availability, and the data is lost if that AZ is destroyed. Best for secondary backup copies or data you can recreate.
- S3 Glacier Instant Retrieval: archival storage, millisecond retrieval. Best for quarterly-accessed data (medical images, annual reports).
- S3 Glacier Flexible Retrieval: archival storage, retrieval in minutes to hours. Best for backups accessed a few times per year.
- S3 Glacier Deep Archive: the cheapest storage class. Retrieval within 12 hours (standard) or 48 hours (bulk). Best for regulatory long-term retention (7+ year compliance data).
# Upload directly to a specific storage class
aws s3 cp myfile.zip s3://my-bucket/backups/ \
--storage-class STANDARD_IA
# Configure lifecycle rule to transition to Glacier after 90 days
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration file://lifecycle.json
Q17. What is S3 versioning and what problems does it solve?
When versioning is enabled on a bucket, S3 stores every version of every object rather than overwriting. Each PUT creates a new version with a unique version ID. Deleting a versioned object adds a delete marker (it is not truly deleted until you delete all versions).
Problems it solves:
- Accidental overwrites: restore a previous version with one API call.
- Accidental deletions: delete the delete marker to restore the object.
- Ransomware protection: overwrites create new versions instead of destroying old ones, so an attacker who can only write objects cannot erase history.
- Audit trail: full history of every object mutation.
# Enable versioning
aws s3api put-bucket-versioning \
--bucket my-bucket \
--versioning-configuration Status=Enabled
# List all versions of an object
aws s3api list-object-versions \
--bucket my-bucket \
--prefix images/avatar.jpg
# Restore previous version (copy old version over current)
aws s3api copy-object \
--bucket my-bucket \
--copy-source "my-bucket/images/avatar.jpg?versionId=OLD_VERSION_ID" \
--key images/avatar.jpg
Combined with lifecycle policies, you can expire old versions automatically to control costs (keep last 5 versions, delete older ones after 90 days).
MFA Delete: requires MFA authentication to permanently delete a versioned object. Extra protection against accidental or malicious permanent deletion.
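The delete-marker behavior is the part candidates most often get wrong, so here is a toy in-memory model of it. `VersionedBucket` is a hypothetical class for illustration; real version IDs are opaque strings, not counters.

```javascript
// Toy model of a versioned bucket: every write appends, nothing is overwritten.
class VersionedBucket {
  constructor() {
    this.versions = new Map(); // key -> array of versions, oldest first
    this.counter = 0;
  }
  _push(key, entry) {
    const versionId = `v${++this.counter}`;
    const list = this.versions.get(key) || [];
    list.push({ versionId, ...entry });
    this.versions.set(key, list);
    return versionId;
  }
  put(key, data) { return this._push(key, { data, deleteMarker: false }); }
  // A plain delete only adds a delete marker on top of the existing versions.
  delete(key) { return this._push(key, { deleteMarker: true }); }
  // Deleting a specific version ID is the only permanent removal.
  deleteVersion(key, versionId) {
    const list = this.versions.get(key) || [];
    this.versions.set(key, list.filter((v) => v.versionId !== versionId));
  }
  get(key) {
    const list = this.versions.get(key) || [];
    const latest = list[list.length - 1];
    return !latest || latest.deleteMarker ? undefined : latest.data;
  }
}
```

Removing the delete marker with `deleteVersion` makes the previous version current again, which is the "undelete" operation described above.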
Q18. What are S3 presigned URLs and when do you use them?
A presigned URL is a time-limited, signed URL that grants temporary access to a specific S3 object to anyone who has the URL, without requiring AWS credentials. The URL encodes the identity of the signer, the target object, and the expiration time.
Use cases:
- Allow users to download a private file without making the bucket public.
- Allow users to upload directly to S3 from a browser, bypassing your server: a PUT presigned URL lets the browser PUT directly to S3 without proxying through your backend.
// Generate a presigned GET URL (download)
const { S3Client, GetObjectCommand, PutObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
const client = new S3Client({ region: "us-east-1" });
const url = await getSignedUrl(
  client,
  new GetObjectCommand({ Bucket: "my-bucket", Key: "reports/q1-2026.pdf" }),
  { expiresIn: 3600 } // expires in 1 hour
);
// URL is safe to send to the client, only works for 1 hour
// Generate a presigned PUT URL (upload)
const uploadUrl = await getSignedUrl(
  client,
  new PutObjectCommand({
    Bucket: "my-bucket",
    Key: `uploads/user-${userId}/avatar.jpg`,
    ContentType: "image/jpeg",
  }),
  { expiresIn: 300 } // 5 minutes to start the upload
);
// Client uses this URL to PUT the file directly to S3
// Your backend never touches the file data
The direct-upload pattern (presigned PUT) is the right way to handle large file uploads. It offloads all the bandwidth and processing from your server to S3.
Q19. How do S3 event notifications work with Lambda?
S3 can invoke a Lambda function when specific events happen on a bucket:
- ObjectCreated (PUT, POST, COPY, CompleteMultipartUpload).
- ObjectRemoved (DELETE).
- ObjectRestore (from Glacier).
- Replication events.
# 1. Add Lambda permission to allow S3 to invoke it
aws lambda add-permission \
--function-name image-processor \
--principal s3.amazonaws.com \
--statement-id s3-invoke \
--action lambda:InvokeFunction \
--source-arn arn:aws:s3:::my-upload-bucket \
--source-account 123456789012
# 2. Configure the S3 event notification
aws s3api put-bucket-notification-configuration \
--bucket my-upload-bucket \
--notification-configuration '{
  "LambdaFunctionConfigurations": [{
    "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:image-processor",
    "Events": ["s3:ObjectCreated:*"],
    "Filter": {
      "Key": { "FilterRules": [
        { "Name": "prefix", "Value": "uploads/" },
        { "Name": "suffix", "Value": ".jpg" }
      ]}
    }
  }]
}'
Lambda receives an S3 event:
exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    console.log(`Processing: s3://${bucket}/${key}`);
    await generateThumbnail(bucket, key);
  }
};
Q20. What is the difference between an S3 bucket policy and an ACL?
S3 Access Control Lists (ACLs) are a legacy mechanism. They define basic read/write permissions per object or bucket for specific AWS accounts or predefined groups (all users, authenticated users). AWS recommends disabling ACLs and using bucket policies instead for all new buckets.
S3 Bucket Policy: a JSON IAM-style resource policy attached to the bucket. Supports fine-grained conditions, IP restrictions, MFA requirements, and cross-account access. More powerful and auditable than ACLs.
// Bucket policy: allow a specific IAM role to read, deny any request not sent over HTTPS
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadFromAppRole",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/app-role" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    },
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}
Block Public Access settings: a separate account-level and bucket-level setting that overrides ACLs and bucket policies to prevent any public access, even if a policy accidentally grants it. Enable this on all non-public buckets.
Category 4: Docker (Q21-Q32)
Docker questions cover the full lifecycle: images, containers, networking, volumes, multi-stage builds, security, and orchestration. This is the largest category in the guide because Docker fluency underpins almost every microservices and CI/CD question that follows. Once you're comfortable with the concepts here, our NestJS interview questions guide shows how these containers typically run a Node.js backend in production.
Q21. What is Docker and how does it differ from a virtual machine?
Docker is a platform for packaging, distributing, and running applications in containers. A container is an isolated process running on the host OS, packaged with all its dependencies.
The key difference from a virtual machine:
VM: runs a full guest OS on top of a hypervisor. Each VM has its own kernel, virtual hardware, and OS installation. Boot time is typically tens of seconds to minutes, and overhead is gigabytes of memory per VM.
Container: shares the host OS kernel. Only packages the application and its user-space dependencies. Startup typically takes well under a second to a few seconds, and overhead is megabytes.
Practical result: you can run dozens of containers on a machine where only 3 to 4 VMs would fit, and the fast startup makes containers ideal for scaling and CI/CD.
The trade-off: VMs provide stronger isolation (separate kernel). Containers share the host kernel, so a kernel exploit can potentially escape container isolation. In practice, production environments often run containers inside VMs (e.g., EC2 instances running Docker) to get both performance and isolation.
Q22. Explain Docker's architecture (client, daemon, containerd, registry).
Docker uses a client-server architecture.
Docker CLI (client): the command-line tool you interact with. Sends REST API requests to the Docker daemon via a Unix socket (/var/run/docker.sock).
Docker daemon (dockerd): the long-running background service. Manages images, containers, networks, and volumes. Delegates container lifecycle management to containerd.
containerd: an industry-standard container runtime (CNCF project). Handles the actual container lifecycle: pulling images, creating, starting, and stopping containers. Kubernetes also uses containerd directly.
runc: the low-level container runtime that containerd calls to actually spawn containers using Linux namespaces and cgroups.
Docker Registry: stores and distributes Docker images. Docker Hub is the default public registry. ECR (AWS), Artifact Registry (Google Cloud, the successor to GCR), and ACR (Azure) are popular managed alternatives. You can also self-host with Harbor or a private registry.
CLI --> dockerd (REST API) --> containerd --> runc --> container process
           |
           --> registry (pull/push images)
Q23. What is the difference between a Docker image and a container?
Image: a read-only, immutable template. Built from a Dockerfile. Consists of stacked layers, each representing a filesystem change. An image is like a class in OOP.
Container: a running (or stopped) instance of an image. Created with docker run. Gets its own writable layer on top of the image layers. All writes go into this ephemeral writable layer: when the container is deleted, the writes are lost unless you use volumes. A container is like an object (instance) of the class.
Image layers (read-only):
Layer 3: COPY . /app
Layer 2: RUN npm install
Layer 1: FROM node:22-alpine
Container:
Writable layer (container-specific changes)
[Image layers below: shared, read-only]
Multiple containers can run from the same image simultaneously. They all share the same read-only image layers (saving disk space) but each has its own writable layer.
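The layering described above can be modelled in a few lines of Python. This is only an analogy, not how Docker's storage drivers work internally: the layer contents and the `start_container` helper are made up for illustration, and `collections.ChainMap` stands in for the union filesystem because it reads through a stack of mappings and sends every write to the top one.

```python
from collections import ChainMap

# Read-only image layers, modelled as path -> content mappings (illustrative data).
base_layer = {"/bin/node": "node 22 binary"}
deps_layer = {"/app/node_modules": "installed packages"}
app_layer = {"/app/server.js": "v1 source"}


def start_container(*image_layers):
    """Return a container view: a fresh writable layer on top of shared image layers."""
    writable = {}
    # ChainMap looks keys up left to right and sends every write to the first map.
    return ChainMap(writable, *reversed(image_layers))


container_a = start_container(base_layer, deps_layer, app_layer)
container_b = start_container(base_layer, deps_layer, app_layer)

# A write in container A lands in A's writable layer only.
container_a["/app/server.js"] = "patched in container A"

print(container_a["/app/server.js"])  # patched in container A
print(container_b["/app/server.js"])  # v1 source
print(app_layer["/app/server.js"])    # v1 source
```

Both containers read the same three layer objects, which is the disk-space saving the text describes, and discarding `container_a` discards its patch along with its writable layer.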
# Image vs container commands
docker images # list images
docker build -t myapp . # build image from Dockerfile
docker rmi myapp # remove image
docker ps # list running containers
docker ps -a # list all containers (including stopped)
docker run myapp # create and start a container from image
docker rm my-container # remove stopped container
Q24. How do you write an efficient Dockerfile?
A good Dockerfile produces a small, fast-to-build, secure image.
# Use a specific version tag, never just "latest"
FROM node:22-alpine
# Set working directory
WORKDIR /app
# Copy dependency files FIRST (before source code)
# This way the npm install layer is cached as long as package.json doesn't change
COPY package.json package-lock.json ./
# Install dependencies
RUN npm ci --omit=dev
# Copy source code AFTER installing dependencies
COPY src/ ./src/
# Run as non-root user for security
USER node
# Document the port (EXPOSE does not publish it; use docker run -p or Compose for that)
EXPOSE 3000
# Use ENTRYPOINT + CMD pattern
# ENTRYPOINT: the executable
# CMD: default arguments (can be overridden at runtime)
ENTRYPOINT ["node"]
CMD ["src/server.js"]
Key best practices:
- Order layers from least-to-most frequently changed (dependencies before source).
- Use .dockerignore to exclude node_modules, .git, test files, and README.
- Use alpine or distroless base images to minimize attack surface and size.
- Chain related apt-get commands in a single RUN with && instead of spreading them across separate RUN instructions, to avoid creating unnecessary intermediate layers.
- Never store secrets in ENV variables or COPY .env files: use secrets at runtime instead.
- Run as non-root with the USER directive or a dedicated user.
Q25. What is Docker layer caching and how does it affect build speed?
Every Dockerfile instruction is a build step that Docker can cache (RUN, COPY, and ADD produce filesystem layers; most other instructions only add metadata). When you rebuild, Docker reuses cached steps until it reaches an instruction whose inputs have changed, then it re-executes that instruction and every one after it.
This means layer order matters for build speed.
# BAD: source code changes every build, invalidates npm install cache
FROM node:22-alpine
WORKDIR /app
# Copies everything, including package.json AND source code
COPY . .
# Runs every time ANY file changes
RUN npm ci
# GOOD: separate dependency install from source code copy
FROM node:22-alpine
WORKDIR /app
# Only the package files
COPY package*.json ./
# Cached as long as package.json and package-lock.json are unchanged
RUN npm ci
# Source code; a cache miss here is cheap
COPY . .
In the GOOD example, if you change only a source file, Docker reuses the slow npm ci layer and re-executes only the fast COPY . . layer. Comments sit on their own lines because a Dockerfile treats # as a comment only at the start of a line; a trailing comment after COPY would be parsed as extra arguments.
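To see why the ordering matters, here is a toy model of the cache in Python. It is a simplified sketch, not BuildKit's actual algorithm: it assumes each step's cache key is a hash of the parent key, the instruction text, and a checksum of any files the step copies. The function names and checksum values are invented for the example.

```python
import hashlib


def layer_keys(instructions, file_checksums):
    """Return one cache key per instruction.

    Each key hashes the parent key, the instruction text, and (for COPY)
    the checksums of the copied inputs, so a change invalidates every later step.
    """
    keys = []
    parent = ""
    for text, inputs in instructions:
        material = parent + text + "".join(file_checksums[name] for name in inputs)
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(instructions, before, after):
    """List the instructions whose cache key differs between two builds."""
    old = layer_keys(instructions, before)
    new = layer_keys(instructions, after)
    return [text for (text, _), a, b in zip(instructions, old, new) if a != b]


bad = [("COPY . .", ["package.json", "src"]), ("RUN npm ci", [])]
good = [
    ("COPY package*.json ./", ["package.json"]),
    ("RUN npm ci", []),
    ("COPY . .", ["package.json", "src"]),
]

before = {"package.json": "aaa", "src": "111"}
after = {"package.json": "aaa", "src": "222"}  # only source code changed

print(rebuilt_layers(bad, before, after))   # ['COPY . .', 'RUN npm ci']
print(rebuilt_layers(good, before, after))  # ['COPY . .']
```

In the bad ordering a source-only change reruns `npm ci`; in the good ordering only the final copy is rebuilt.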
# Build with no cache (force full rebuild)
docker build --no-cache -t myapp .
# See image layers and their sizes
docker history myapp
docker inspect myapp
Q26. What is Docker Compose and when do you use it?
Docker Compose defines and runs multi-container applications using a single YAML file. Instead of running docker network create, docker run, and manually linking containers, you describe all services in docker-compose.yml and bring everything up with one command.
services:
  api:
    build: .
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgres://user:password@db:5432/myapp
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    volumes:
      - ./src:/app/src # mount source code for hot reload in dev
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: myapp
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
      interval: 5s
      timeout: 5s
      retries: 5
  cache:
    image: redis:7-alpine
    ports:
      - "6379:6379"
volumes:
  postgres_data:

docker compose up -d # start all services in background
docker compose logs -f api # follow logs for the api service
docker compose down # stop and remove containers and networks
docker compose down -v # also remove named volumes
Docker Compose is ideal for local development and CI/CD test environments. In production, use Kubernetes or ECS for orchestration.
Q27. What Docker network drivers exist and when do you use each?
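Docker ships several built-in network drivers. These five come up most often in interviews (ipvlan is the other one you may see).
bridge: the default for containers on the same host. Creates a virtual bridge network. On a user-defined bridge, containers can reach each other by container name (automatic DNS); the default bridge network does not provide name resolution. Use for local development multi-container setups.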
# Create a custom bridge network (recommended over default bridge)
docker network create my-network
docker run --network my-network --name api myapp
docker run --network my-network --name db postgres
# api can reach db at hostname "db"
- host: the container shares the host's network stack directly. No NAT, no port mapping needed. Best performance, no network isolation. Use for performance-sensitive workloads where networking overhead matters.
- overlay: spans multiple Docker hosts. Required for Docker Swarm multi-host communication. Containers on different machines communicate as if on the same network. Used in Docker Swarm clusters.
- macvlan: assigns a MAC address to the container, making it appear as a physical device on the network. Used for legacy applications that expect to be on the physical network.
- none: complete network isolation. The container has only a loopback interface, with no external communication.
Q28. What is the difference between Docker volumes and bind mounts?
Volumes: managed by Docker, stored in Docker's storage area (/var/lib/docker/volumes/). Created explicitly or automatically. Portable, shareable between containers, and managed with docker volume commands (create, ls, inspect, rm). Best for production persistent data.
Bind mounts: mount a specific host directory or file into the container. With --mount the host path must already exist; with -v Docker creates a missing path as a directory. Commonly used in development to mount source code for hot-reloading.
# Volume (Docker-managed)
docker run -v postgres_data:/var/lib/postgresql/data postgres:16
docker volume ls
docker volume inspect postgres_data
# Bind mount (host directory mounted)
docker run -v $(pwd)/src:/app/src myapp # source code hot reload
# Or the more explicit --mount syntax:
docker run --mount type=bind,source=$(pwd)/src,target=/app/src myapp
tmpfs mount: stored in host memory only, never written to disk. For temporary sensitive data (secrets, temp files) that must not persist.
Production rule: use volumes for databases and persistent state. Use bind mounts only in development for code mounting. Never bind-mount secrets.
Q29. What is a multi-stage Docker build and why is it important?
A multi-stage build uses multiple FROM instructions in a single Dockerfile. Each stage can use a different base image. Only the final stage becomes the shipped image; earlier stages are discarded.
This solves the "fat build image" problem: build tools (compilers, package managers, test frameworks) needed to build the app are not needed to run it.
# Stage 1: build
FROM node:22 AS builder
WORKDIR /app
COPY package*.json ./
# Installs all dependencies, including devDependencies
RUN npm ci
COPY . .
# Compile TypeScript, bundle assets
RUN npm run build
# Run tests during the build
RUN npm test
# Stage 2: production image
FROM node:22-alpine AS production
WORKDIR /app
# Only copy what is needed to RUN the app
COPY package*.json ./
# Production dependencies only
RUN npm ci --omit=dev
# Compiled output only
COPY --from=builder /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]
The production image contains only the Alpine Node.js runtime, production dependencies, and compiled output. The much larger node:22 build environment is discarded.
Result: a production image might be 80MB instead of 800MB. Smaller images mean faster push and pull times, a smaller attack surface, and lower ECR storage costs.
Q30. How do you debug a crashing Docker container?
# Step 1: Check container status and exit code
docker ps -a
# Exit code 0: clean exit | 1: application error | 137: SIGKILL, often the OOM killer | 143: SIGTERM
# Step 2: Check logs
docker logs my-container
docker logs --tail 100 my-container # last 100 lines
docker logs -f my-container # follow (stream)
# Step 3: Inspect container config
docker inspect my-container
# Look for: environment variables, mount points, network config, exit code
# Step 4: Shell into a running container
docker exec -it my-container sh # or bash if available
docker exec -it my-container env # print environment variables
# Step 5: Start the container with shell override (if it crashes on start)
docker run -it --entrypoint sh my-image
# Manually run the start command to see errors
# Step 6: Check resource usage
docker stats my-container # CPU, memory, network I/O
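# Step 7: For OOMKilled (exit 137)
docker inspect my-container --format='{{.State.OOMKilled}}' # true if the kernel OOM killer stopped it
docker inspect my-container --format='{{.HostConfig.Memory}}' # memory limit in bytes (0 = unlimited)
# Increase memory limit in run command or compose file
Common causes of container crashes: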
- Application crash at startup, caused by a misconfigured env var or missing dependency.
- OOM kill (exit 137): increase the memory limit.
- Port already in use: check what is bound to the host port.
- Volume mount issue: the path does not exist, or there's a permissions problem.
- Health check failure: the container exceeds its configured health check retries. Plain Docker only marks it unhealthy; an orchestrator is what restarts or replaces it.
Q31. What are Docker container security best practices?
Run as non-root:
# Create a user in the Dockerfile
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
Use minimal base images:
# Distroless: no shell, no package manager, minimal attack surface
FROM gcr.io/distroless/nodejs22-debian12
Scan images for vulnerabilities:
docker scout cves myapp:latest # Docker Scout
trivy image myapp:latest # Trivy (free, widely used)
Use a read-only filesystem:
docker run --read-only myapp
# Application can still write to explicitly defined tmpfs mounts
Drop capabilities:
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE myapp
Never run in privileged mode in production:
# Bad: gives container root-level host access
docker run --privileged myapp
# Good: only specific needed capability
docker run --cap-add SYS_PTRACE myapp
Limit resources:
docker run --memory 512m --cpus 0.5 myapp
Use Docker Content Trust for image signing and verification in production pipelines.
Q32. What is the difference between Docker Swarm and Kubernetes?
Both are container orchestration platforms that manage clusters of containers across multiple hosts.
Docker Swarm: Docker's native clustering tool. Simple to set up and operate. Built into Docker Engine. Uses docker-compose.yml (stack files) for service definitions. Suitable for smaller deployments and teams migrating from Compose.
Kubernetes: the industry-standard container orchestration platform (CNCF). More complex but far more powerful. Richer ecosystem (Helm, service meshes, operators). Supported by every major cloud provider as a managed service (EKS, GKE, AKS). The standard choice for production at scale.
Key differences:
- Setup: Swarm initializes in minutes; Kubernetes requires significant configuration (or use a managed service).
- Scaling: Swarm scales services manually (docker service scale) and has no built-in autoscaler; Kubernetes autoscales with HPA and VPA, and with add-ons such as KEDA.
- Networking: Kubernetes has a richer networking model (Ingress, NetworkPolicy, CNI plugins).
- Storage: Kubernetes has more storage options (PersistentVolumes, StorageClasses).
- Ecosystem: Kubernetes has a vastly larger ecosystem and community.
Category 5: Microservices (Q33-Q42)
Microservices questions test whether you understand the distributed systems trade-offs, not just the buzzwords. Expect questions about communication patterns, failure handling, and data consistency across service boundaries.
Q33. What is a microservices architecture and how does it differ from monolith?
A monolith is a single deployable unit. All features, business logic, and data access live in one codebase and process. Simple to develop initially but grows harder to scale, deploy, and maintain as teams and complexity grow.
Microservices architecture decomposes an application into small, independently deployable services. Each service owns its domain, its data store, and its deployment lifecycle. Services communicate over the network via APIs or message queues.
Benefits:
- Independent deployment: deploy the payment service without redeploying orders.
- Independent scaling: scale the image processing service 10x without scaling auth.
- Technology flexibility: each service can use the right language and database.
- Team autonomy: each team owns and operates their service end to end.
- Fault isolation: a crash in the notification service does not take down checkout.
Drawbacks:
- Distributed system complexity: network failures, latency, partial failures.
- Data consistency: no simple ACID transaction across service boundaries.
- Operational overhead: more services means more things to monitor, deploy, and scale.
- Debugging is harder: distributed tracing required across service boundaries.
When to start with a monolith: early-stage product, small team, unknown domain boundaries. Prematurely decomposing into microservices creates distributed system problems before you have the organizational scale to benefit.
Q34. How do microservices communicate with each other?
Two communication patterns: synchronous (request-response) and asynchronous (message-based).
Synchronous: the caller waits for a response.
- REST over HTTP/HTTPS: simple, widely understood, human-readable. Uses HTTP methods (GET, POST, PUT, DELETE). Good for CRUD operations and request-response where the caller needs an immediate answer.
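- gRPC: binary protocol using Protocol Buffers over HTTP/2. Typically faster than JSON over REST because payloads are smaller and cheaper to parse (the exact gain depends on the workload), with strongly typed contracts via .proto files and bidirectional streaming. Good for high-frequency inter-service calls, streaming data, and polyglot environments.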
- GraphQL: flexible query language. The client specifies exact data needed. Good for API aggregation layers and mobile clients with varied data needs.
Asynchronous: the caller publishes a message and continues.
- Message queues (SQS, RabbitMQ): point-to-point. Each message is processed by one consumer, even when several workers compete for messages on the same queue. Good for task queues, work distribution, and decoupling producer from consumer.
- Event streaming (Kafka, Kinesis): events are appended to a retained log that many independent consumers can read and replay. Good for event sourcing, real-time analytics, and decoupled event-driven architectures.
- Publish/Subscribe (SNS, Redis Pub/Sub): publisher sends to a topic, multiple subscribers receive. Good for broadcasting events to multiple services.
The general rule: use synchronous communication when you need an immediate response. Use async messaging for operations where eventual processing is acceptable (order placed event, then email notification, then inventory update).
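The difference between a queue and a topic is easiest to see side by side. The sketch below uses two small in-memory classes, `WorkQueue` and `Topic`, which are invented for illustration and stand in for services such as SQS and SNS; real brokers add persistence, acknowledgements, and retries that are left out here.

```python
from collections import deque


class WorkQueue:
    """Point-to-point: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Publish/subscribe: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)


# Two workers compete for jobs: each job is processed once.
queue = WorkQueue()
queue.send("resize image 1")
queue.send("resize image 2")
worker_a = queue.receive()
worker_b = queue.receive()

# Two services subscribe to the same event: both receive it.
email_inbox, inventory_inbox = [], []
topic = Topic()
topic.subscribe(email_inbox.append)
topic.subscribe(inventory_inbox.append)
topic.publish("OrderPlaced")
```

After this runs, each worker holds a different job and the queue is empty, while both inboxes contain the same `OrderPlaced` event.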
Q35. What is service discovery and why does it matter in microservices?
Service discovery is the mechanism by which microservices find each other's network locations. In a static environment, you could hardcode IP addresses. In a dynamic cloud environment where services scale up and down and containers get new IPs constantly, hardcoding is impractical.
Two patterns:
Client-side discovery: the calling service queries a service registry (Consul, Eureka, etcd) to get the current list of healthy instances for the target service. The client performs its own load balancing.
Server-side discovery: the client sends a request to a load balancer (AWS ALB, Kubernetes Service). The load balancer queries the registry and routes to a healthy instance. The client does not need to know about discovery.
Kubernetes handles service discovery automatically: every Service gets a DNS name from the cluster DNS (CoreDNS by default) that resolves to a stable virtual IP, and traffic to that IP is load-balanced across ready pods. The calling service uses http://payment-service/charge and Kubernetes routes it.
Service mesh solutions such as AWS App Mesh and Consul Connect add mTLS, circuit breaking, and observability on top of service discovery through sidecar proxies, without changing application code.
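Client-side discovery can be sketched in a few lines. The `ServiceRegistry` and `RoundRobinClient` classes below are hypothetical in-memory stand-ins, not the API of Consul or Eureka, and the addresses are made up; the point is only to show the lookup-then-balance sequence.

```python
import itertools


class ServiceRegistry:
    """In-memory stand-in for a registry such as Consul or Eureka."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, {})[address] = True

    def mark_unhealthy(self, service, address):
        self._instances[service][address] = False

    def healthy_instances(self, service):
        return [addr for addr, ok in self._instances.get(service, {}).items() if ok]


class RoundRobinClient:
    """Client-side discovery: look up healthy instances, then balance locally."""

    def __init__(self, registry):
        self._registry = registry
        self._counter = itertools.count()

    def pick(self, service):
        instances = self._registry.healthy_instances(service)
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        return instances[next(self._counter) % len(instances)]


registry = ServiceRegistry()
registry.register("payment-service", "10.0.0.5:8080")
registry.register("payment-service", "10.0.0.6:8080")
client = RoundRobinClient(registry)

first = client.pick("payment-service")   # 10.0.0.5:8080
second = client.pick("payment-service")  # 10.0.0.6:8080
registry.mark_unhealthy("payment-service", "10.0.0.5:8080")
third = client.pick("payment-service")   # 10.0.0.6:8080, the only healthy one
```

With server-side discovery the same lookup and balancing logic lives in the load balancer, so the client would only ever know one stable address.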
Q36. What is the Circuit Breaker pattern and how does it work?
The Circuit Breaker prevents a single failing service from causing cascading failures across an entire system. It monitors calls to a downstream service and "trips" when the failure rate exceeds a threshold.
Three states:
- Closed (normal): requests flow through. Success and failure rates are tracked.
- Open (tripped): failure threshold exceeded. All requests fail immediately without contacting the downstream service. Returns cached data or a fallback response. Gives the failing service time to recover.
- Half-Open (recovery check): after a timeout, a limited number of test requests are allowed through. If they succeed, the circuit closes. If they fail, it opens again.
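The three states can be sketched as a small state machine. This is a simplified illustration rather than any library's implementation: it trips on a count of consecutive failures instead of a failure percentage, allows a single trial call in the half-open state, and takes an injectable `clock` (an assumption made so the example runs without sleeping).

```python
import time


class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half_open -> closed or open."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()      # fail fast, downstream is not contacted
            self.state = "half_open"   # timeout elapsed: allow one trial request
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            return fallback()
        self.state = "closed"
        self.failures = 0
        return result


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0, clock=lambda: now[0])
calls = []


def failing_payment():
    calls.append("attempt")
    raise ConnectionError("payment service down")


def healthy_payment():
    calls.append("attempt")
    return "charged"


def fallback():
    return "deferred"


breaker.call(failing_payment, fallback)
breaker.call(failing_payment, fallback)              # second failure trips the breaker
state_after_failures = breaker.state                 # "open"
fast_fail = breaker.call(healthy_payment, fallback)  # "deferred", no attempt made
now[0] = 31.0                                        # reset timeout has elapsed
recovered = breaker.call(healthy_payment, fallback)  # "charged", breaker closes
```

The `calls` list ends up with three entries, not four, because the call made while the breaker was open never reached the downstream function.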
// Example using opossum (Node.js circuit breaker library)
const CircuitBreaker = require("opossum");
// Assumed to exist: callPaymentService (an async function that calls the payment service) and logger (console works)
const options = {
  timeout: 3000, // fail if takes longer than 3s
  errorThresholdPercentage: 50, // open if more than 50% fail
  resetTimeout: 30000, // try again after 30s
};
const breaker = new CircuitBreaker(callPaymentService, options);
breaker.fallback(() => ({ status: "deferred", message: "Payment queued for retry" }));
breaker.on("open", () => logger.warn("Payment service circuit OPEN"));
breaker.on("halfOpen", () => logger.info("Payment service circuit HALF-OPEN"));
breaker.on("close", () => logger.info("Payment service circuit CLOSED"));
// Usage (inside an async function)
const result = await breaker.fire(paymentData);
Q37. What is the Saga pattern and when do you use it?
The Saga pattern manages distributed transactions across multiple microservices without two-phase commit (2PC). Since each service owns its own database, you cannot do a traditional SQL transaction across them. The Saga breaks the distributed transaction into a series of local transactions, each with a compensating transaction for rollback.
Two Saga implementations:
Choreography: each service publishes events. Downstream services listen and react. No central coordinator.
OrderService: "OrderCreated" event -->
  PaymentService: charges card, publishes "PaymentCompleted" or "PaymentFailed"
  InventoryService: listens for PaymentCompleted, reserves stock
  NotificationService: listens for both, sends email
If payment fails: OrderService listens for PaymentFailed, cancels the order
Orchestration: a central Saga orchestrator sends commands to each service and waits for responses. Step Functions (AWS) is a managed orchestrator.
Orchestrator: Command "ChargeCard" --> PaymentService
PaymentService: "PaymentCompleted" --> Orchestrator
Orchestrator: Command "ReserveStock" --> InventoryService
InventoryService: "StockInsufficient" --> Orchestrator
Orchestrator: Command "RefundCard" --> PaymentService (compensating transaction)
Choreography is simpler for small flows. Orchestration is easier to reason about and debug for complex multi-step workflows.
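The orchestrated flow above, including the compensating refund, can be sketched as a small runner. Everything here is hypothetical: `run_saga` is an illustrative helper, and the payment and inventory "services" are plain functions over an in-memory dictionary rather than real network calls.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, run the compensations of the completed steps in
    reverse order and report the saga as rolled back.
    """
    completed = []
    log = []
    for name, action, compensation in steps:
        try:
            action()
            log.append(f"{name}: ok")
            completed.append((name, compensation))
        except Exception as error:
            log.append(f"{name}: failed ({error})")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return "rolled_back", log
    return "completed", log


# Hypothetical in-memory stand-ins for the payment and inventory services.
state = {"charged": False, "stock": 0}


def charge_card():
    state["charged"] = True


def refund_card():
    state["charged"] = False


def reserve_stock():
    if state["stock"] < 1:
        raise RuntimeError("StockInsufficient")
    state["stock"] -= 1


def release_stock():
    state["stock"] += 1


status, log = run_saga([
    ("ChargeCard", charge_card, refund_card),
    ("ReserveStock", reserve_stock, release_stock),
])
print(status)  # rolled_back
print(log)
```

Because stock is zero, the second step fails and the card charge is undone, which mirrors the RefundCard command in the diagram. A production orchestrator would also persist saga state so it can resume after a crash.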
Q38. What is event sourcing?
Event sourcing is a pattern where instead of storing the current state of an entity, you store the sequence of events that led to that state. The current state is derived by replaying all events.
Traditional: users table has row { id: 1, email: "bob@new.com", balance: 850 }
Event sourcing: events table has:
{ eventId: 1, type: "UserCreated", data: { email: "bob@old.com" } }
{ eventId: 2, type: "EmailChanged", data: { email: "bob@new.com" } }
{ eventId: 3, type: "Deposit", data: { amount: 1000 } }
{ eventId: 4, type: "Withdrawal", data: { amount: 150 } }
Current state = replay events 1-4: { email: "bob@new.com", balance: 850 }
Benefits:
- Complete audit trail by default.
- Time travel: reconstruct state at any point in history.
- Event replay: rebuild read models, fix bugs by replaying events.
- Natural fit for event-driven architectures.
Drawbacks:
- Querying current state requires replaying events (use projections or read models).
- Event schema changes require migration strategies.
- Increased complexity for simple CRUD use cases.
Event sourcing is commonly paired with CQRS (Command Query Responsibility Segregation), where writes go through events and reads come from optimized read models (materialized projections).
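Replaying the four events from the example takes only a fold over the log. The sketch below is a minimal illustration, with `apply` and `replay` as made-up helper names; a real system would load events from a store and usually snapshot state rather than replay from the start every time.

```python
events = [
    {"eventId": 1, "type": "UserCreated", "data": {"email": "bob@old.com"}},
    {"eventId": 2, "type": "EmailChanged", "data": {"email": "bob@new.com"}},
    {"eventId": 3, "type": "Deposit", "data": {"amount": 1000}},
    {"eventId": 4, "type": "Withdrawal", "data": {"amount": 150}},
]


def apply(state, event):
    """Return the new state after one event; the input state is not mutated."""
    kind, data = event["type"], event["data"]
    if kind == "UserCreated":
        return {"email": data["email"], "balance": 0}
    if kind == "EmailChanged":
        return {**state, "email": data["email"]}
    if kind == "Deposit":
        return {**state, "balance": state["balance"] + data["amount"]}
    if kind == "Withdrawal":
        return {**state, "balance": state["balance"] - data["amount"]}
    raise ValueError(f"unknown event type: {kind}")


def replay(event_log, up_to=None):
    """Fold events into a state, optionally stopping at a given eventId."""
    state = {}
    for event in event_log:
        if up_to is not None and event["eventId"] > up_to:
            break
        state = apply(state, event)
    return state


print(replay(events))           # {'email': 'bob@new.com', 'balance': 850}
print(replay(events, up_to=3))  # {'email': 'bob@new.com', 'balance': 1000}
```

The `up_to` argument is the "time travel" benefit in miniature: the same log yields the state as of any earlier event.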
Q39. What is distributed tracing and which tools implement it?
Distributed tracing tracks a single request as it flows through multiple microservices. Each service adds a trace span with timing information. You can see the full request lifecycle and identify where latency lives.
Without distributed tracing, debugging a slow request in a 20-service architecture is nearly impossible: the logs are split across 20 services with no shared request ID.
Key concepts:
- Trace: the complete journey of one request through all services.
- Span: a single unit of work within a trace (one service call).
- Trace ID: a unique ID propagated in request headers (for example the W3C traceparent header) so spans from different services can be linked.
- Parent Span ID: links child spans to their parent.
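As a rough sketch of how these IDs travel between services, the code below builds and forwards a header in the W3C Trace Context `traceparent` format (`version-traceid-spanid-flags`). The two helper functions are hypothetical; in practice an OpenTelemetry SDK generates and propagates this header for you.

```python
import secrets


def new_traceparent():
    """Start a trace: version 00, random trace ID and span ID, sampled flag 01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID and mint a new span ID for the outgoing call.

    Returns (outgoing_header, parent_span_id).
    """
    version, trace_id, parent_span_id, flags = incoming.split("-")
    outgoing = f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
    return outgoing, parent_span_id


# Service A starts the trace; service B receives the header and calls service C.
header_from_a = new_traceparent()
header_to_c, parent_span = child_traceparent(header_from_a)
```

Both headers share the same trace ID, which is what lets a tracing backend stitch the spans from different services into one trace, while the recorded parent span ID preserves the call hierarchy.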
OpenTelemetry is the vendor-neutral standard for generating traces, metrics, and logs. Instrument your service once, export to any backend.
// OpenTelemetry instrumentation (Node.js)
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://collector:4318/v1/traces" }),
  // Auto-instrumentation must be enabled explicitly
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// HTTP calls and supported database clients are now auto-instrumented
Popular backends: Jaeger (open source), Zipkin (open source), AWS X-Ray (native AWS), Datadog APM, Honeycomb, and Grafana Tempo.
Q40. What is a service mesh and when do you need one?
A service mesh is an infrastructure layer that manages all network communication between microservices. It runs as sidecar proxies (one per service pod) that intercept and handle all inbound and outbound traffic without changing application code.
Capabilities a service mesh provides:
- mTLS between all services (automatic encryption and mutual authentication).
- Circuit breaking and retry policies.
- Traffic splitting (canary deployments, A/B testing).
- Observability (automatic metrics and traces for all service-to-service calls).
- Service discovery.
Popular service meshes: Istio, Linkerd, Consul Connect, and AWS App Mesh.
When you need a service mesh:
- Zero-trust security: every service-to-service call must be encrypted and authenticated, and you cannot implement this in every service's code.
- Observability across 20+ services without adding SDK code everywhere.
- Advanced traffic management (canaries, dark launches) at the platform level.
When you probably do NOT need it:
- Fewer than 10 services.
- The team does not have Kubernetes expertise to operate Istio or Linkerd.
- The complexity of operating the mesh exceeds the benefit.
Start without a service mesh. Add it when you have a clear, specific pain point, usually around security or observability at scale.
Q41. What is the 12-factor app methodology?
The 12-factor app is a methodology for building cloud-native, scalable, maintainable software-as-a-service applications. Relevant factors for microservices interviews:
- III. Config: store config in environment variables, not in code or config files checked into source control. No hardcoded URLs, credentials, or environment-specific values.
- IV. Backing services: treat databases, queues, and caches as attached resources accessed via URL. Swapping a local Postgres for a managed RDS instance should require only a config change.
- VI. Processes: execute the app as one or more stateless processes. Shared state lives in a backing service (Redis, database), not in process memory. This makes horizontal scaling trivial.
- VII. Port binding: export services via port binding. The service is self-contained and does not rely on a web server being injected at runtime. Works naturally with Docker and Kubernetes.
- IX. Disposability: maximize robustness with fast startup and graceful shutdown. Handle SIGTERM, drain in-flight requests, release resources. Enables zero-downtime deployments and auto-scaling.
- XI. Logs: treat logs as event streams. Write to stdout. Let the platform (Docker, Kubernetes, CloudWatch) collect and route them.
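Factor III is the one most often asked about in code form. Here is a minimal sketch of loading configuration from the environment and failing fast when something is missing; the `Settings` fields and variable names are assumptions chosen to match the Compose example earlier in this guide.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    port: int


def load_settings(env=os.environ):
    """Build settings from environment variables, failing fast on missing ones."""
    missing = [name for name in ("DATABASE_URL", "REDIS_URL") if name not in env]
    if missing:
        raise RuntimeError(f"missing required config: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        port=int(env.get("PORT", "3000")),
    )


# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({
    "DATABASE_URL": "postgres://user:password@db:5432/myapp",
    "REDIS_URL": "redis://cache:6379",
})
```

Swapping the local Postgres for a managed instance then means changing `DATABASE_URL` only, which is the point of factor IV as well.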
Q42. How do you handle authentication and authorization across microservices?
Two patterns dominate.
Centralized authentication with token propagation: one auth service issues JWTs. Each downstream service validates the JWT independently by verifying the signature locally, with no network call to the auth service per request.
Client -> API Gateway -> (validates JWT) -> Order Service (validates JWT) -> Payment Service
                                            [extracts userId from JWT claims]
A common variant: the API Gateway or service mesh validates the token once at the edge, and internal services trust the validated identity propagated in headers (X-User-ID, X-User-Roles). This is safe only if internal services cannot be reached except through the gateway.
Service-to-service auth: services calling each other must also authenticate.
- Short-lived service JWTs signed with service-specific keys.
- Mutual TLS (mTLS) via a service mesh, with no application code changes.
- AWS IAM roles and SigV4 signing for AWS-native architectures (Lambda calling Lambda, ECS service calling DynamoDB).
Authorization: each service enforces its own authorization rules based on the user identity in the token. Central policy enforcement (Open Policy Agent, AWS IAM) handles complex permission models.
The key principle: never pass usernames and passwords between services. Use short-lived tokens. Rotate signing keys regularly.
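Local signature verification is worth being able to sketch. The example below signs and verifies a JWT with HS256 using only the standard library, so the mechanics are visible. It is a teaching sketch under stated assumptions: the key and claims are made up, and production systems typically use a maintained JWT library with asymmetric keys (RS256 or ES256) so that services hold only the public key.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: dict, key: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signature = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_token(token: str, key: bytes, now=None) -> dict:
    """Check algorithm, signature, and expiry locally; no call to the auth service."""
    header, payload, signature = token.split(".")
    if json.loads(_b64url_decode(header)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


key = b"demo-signing-key"  # illustrative only; never hardcode real keys
token = issue_token({"sub": "user-42", "roles": ["customer"], "exp": 2000}, key)
claims = verify_token(token, key, now=1000)
```

The `now` parameter exists only to make the expiry check deterministic in the example; a real service would use the current time and also validate issuer and audience claims.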
Category 6: Azure Entra ID (Q43-Q50)
Azure Entra ID questions test identity and access management knowledge for enterprise Microsoft environments. Expect questions on OAuth flows, managed identities, and the policy engine that controls conditional access.
Q43. What is Microsoft Azure Entra ID (formerly Azure Active Directory)?
Microsoft Entra ID (renamed from Azure Active Directory in 2023, and still often called Azure Entra ID or Azure AD) is Microsoft's cloud-based Identity and Access Management (IAM) service. It is the identity backbone for Microsoft 365, Azure resources, and any third-party application registered with a tenant.
Core functions:
- Authentication: verify who a user or application is (login).
- Authorization: control what an authenticated identity can access.
- Single Sign-On (SSO): users log in once and access many applications.
- Multi-Factor Authentication (MFA): an extra verification step.
- Conditional Access: a policy engine that controls access based on context, such as location, device compliance, and risk level.
- Application management: register apps and define their permissions.
Entra ID is not the same as Windows Active Directory (AD DS). AD DS is an on-premises directory service using LDAP and Kerberos. Entra ID is cloud-native, uses OAuth 2.0 and OpenID Connect, and manages cloud identities. Microsoft Entra Connect (formerly Azure AD Connect) synchronizes on-premises AD users to Entra ID for hybrid environments.
Q44. What is the difference between an App Registration and a Service Principal?
App Registration: the global definition of an application in Entra ID. You create one App Registration in the home tenant. It defines the app's identity, redirect URIs, the API permissions it requests, and its certificate or secret credentials. Think of it as the blueprint.
Service Principal: the local instance of the App Registration within a specific tenant. When you register an app in the portal, or when a multi-tenant app is consented to in another tenant, a Service Principal is automatically created in that tenant. When you register through the CLI or Microsoft Graph, you create the Service Principal as a separate step. It carries the actual permissions granted to the app in that tenant.
For a single-tenant app, one App Registration creates one Service Principal in the same tenant. For a multi-tenant app, one App Registration creates N Service Principals: one per tenant that installs or consents to the app.
# Create an app registration via Azure CLI
az ad app create \
  --display-name "my-backend-api" \
  --sign-in-audience AzureADMyOrg
# Create the Service Principal (the CLI does not create one automatically)
az ad sp create --id <app-id>
# Show the app's Service Principal, including its object ID
az ad sp show --id <app-id>
# Create a client secret for the app
az ad app credential reset \
  --id <app-id> \
  --append \
  --display-name "ci-cd-secret"
Q45. What are the OAuth 2.0 flows supported by Entra ID and when do you use each?
Entra ID supports several OAuth 2.0 and OpenID Connect flows. Picking the right one depends on whether a user is present and whether the client can keep a secret.
- Authorization Code Flow: the standard flow for web apps where a user interactively logs in. The browser redirects to Entra ID, the user authenticates, Entra returns an authorization code, and the backend exchanges the code for tokens. Use for web applications where a human is present.
- Authorization Code + PKCE: an extension of Authorization Code for public clients, such as single-page apps and mobile apps, that cannot securely store a client secret. PKCE (Proof Key for Code Exchange) replaces the client secret. Use for SPAs (React, Angular), mobile apps, and desktop apps.
- Client Credentials Flow: the application authenticates directly with Entra ID using its own credentials (client ID plus secret or certificate). No user is involved, and the flow returns an access token for the application itself. Use for daemon processes, background services, CI/CD pipelines, and service-to-service calls.
- On-Behalf-Of (OBO): a middle-tier API receives a user token and exchanges it for a token scoped to a downstream API, preserving the user's identity. Use for API-to-API calls where the user's identity must propagate downstream.
- Device Code Flow: for devices with no browser or limited input capability. The device displays a code, and the user enters it on another device to approve. Use for CLI tools, IoT devices, and TV apps.
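For the PKCE flow in the list above, the client-side work is small enough to show. This sketch generates a code verifier and its S256 code challenge as described in RFC 7636; the function names are made up, and libraries such as MSAL normally do this for you.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random URL-safe string; 32 random bytes encode to 43 characters."""
    return secrets.token_urlsafe(32)


def make_code_challenge(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) with the padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = make_code_challenge(verifier)
```

The client sends `challenge` with the authorization request and `verifier` with the token request. The authorization server recomputes the hash, so an attacker who intercepts only the authorization code cannot redeem it.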
# Client credentials flow (Python MSAL)
import msal
app = msal.ConfidentialClientApplication(
    client_id="<app-id>",
    client_credential="<client-secret>",
    authority="https://login.microsoftonline.com/<tenant-id>"
)
result = app.acquire_token_for_client(
    scopes=["https://graph.microsoft.com/.default"]
)
# On failure MSAL returns a dict with "error" and "error_description" instead of a token
if "access_token" not in result:
    raise RuntimeError(result.get("error_description", "token request failed"))
access_token = result["access_token"]
Q46. What are Managed Identities in Azure and why are they preferred?
A Managed Identity is a Service Principal whose credentials are automatically managed by Azure. You never create, store, or rotate a client secret or certificate: Azure handles the key lifecycle entirely.
There are two types:
- System-assigned managed identity: tied to a specific Azure resource, such as a VM, App Service, or Azure Function. Created with the resource and deleted with the resource, in a one-to-one relationship.
- User-assigned managed identity: created as a standalone Azure resource. Can be assigned to multiple Azure resources and has an independent lifecycle.
# Enable system-assigned managed identity on an App Service
az webapp identity assign \
--resource-group my-rg \
--name my-api
# Grant that identity permission to read from Key Vault
az keyvault set-policy \
--name my-keyvault \
--object-id <managed-identity-principal-id> \
--secret-permissions get list
# In application code, no credentials needed.
# The Azure SDK automatically acquires tokens using the managed identity.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://my-keyvault.vault.azure.net/", credential=credential)
secret = client.get_secret("database-password")
Why managed identities are preferred over client secrets:
- No secret to rotate, store, or accidentally leak to source control.
- Reduced attack surface, since there are no static credentials that can be stolen.
- Automatic credential rotation by Azure.
- Works seamlessly with Key Vault, Storage, SQL, Service Bus, and most Azure services.
- Full audit trail via Entra ID sign-in logs.
Q47. What is Conditional Access in Entra ID?
Conditional Access is Entra ID's policy engine for access control decisions based on contextual signals. Instead of a simple allow or deny on identity, it evaluates who is accessing what, from where, on what device, and at what risk level.
Policy structure: if a set of conditions is met, then grant or block controls apply.
Conditions include:
- User or group membership.
- The application being accessed.
- Sign-in risk level, detected by Identity Protection (low, medium, high).
- Device compliance, such as Intune-managed or hybrid joined.
- Location: named locations, IP ranges, or countries.
- Client app: browser, mobile app, or legacy auth.
Controls include:
- Block access.
- Require MFA.
- Require a compliant device.
- Require a Microsoft Entra hybrid joined device (formerly hybrid Azure AD joined).
- Require an approved client app.
- Require a password change.
Example policy: "Require MFA for all admin portal access from outside the corporate network"
Conditions:
  Users: Admins group
  Application: Azure Management Portal
  Location: Exclude corporate IP range
Controls: Require MFA
Common use cases:
- Require MFA for all external access.
- Block legacy authentication protocols that do not support MFA.
- Require compliant devices for accessing sensitive apps.
- Block access from high-risk sign-ins automatically.
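The "if conditions, then controls" structure is easier to explain with a small evaluator in hand. The sketch below is a toy model, not how Entra ID is implemented: the `SignIn` and `Policy` types, their field names, and the IP ranges are all hypothetical, and real policies are configured in the Entra admin center or through Microsoft Graph. It mirrors the example policy above, and assumes that controls from every matching policy accumulate and that a block control wins over any grant.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class SignIn:
    user_groups: set
    application: str
    ip: str


@dataclass
class Policy:
    name: str
    include_groups: set
    applications: set
    excluded_networks: list = field(default_factory=list)
    controls: list = field(default_factory=list)

    def applies_to(self, sign_in):
        """A policy applies only when every condition matches."""
        if not (self.include_groups & sign_in.user_groups):
            return False
        if sign_in.application not in self.applications:
            return False
        # Excluded locations (e.g. the corporate range) opt the sign-in out.
        addr = ip_address(sign_in.ip)
        return not any(addr in ip_network(net) for net in self.excluded_networks)


def required_controls(policies, sign_in):
    """Collect controls from all matching policies; a block overrides the rest."""
    controls = []
    for policy in policies:
        if policy.applies_to(sign_in):
            controls.extend(policy.controls)
    if "block" in controls:
        return ["block"]
    return sorted(set(controls))


admin_mfa = Policy(
    name="Require MFA for admin portal access outside the corporate network",
    include_groups={"Admins"},
    applications={"Azure Management Portal"},
    excluded_networks=["203.0.113.0/24"],  # hypothetical corporate range
    controls=["require_mfa"],
)

outside = SignIn({"Admins"}, "Azure Management Portal", "198.51.100.7")
inside = SignIn({"Admins"}, "Azure Management Portal", "203.0.113.10")

print(required_controls([admin_mfa], outside))  # ['require_mfa']
print(required_controls([admin_mfa], inside))   # []
```

The point to make in an interview is that exclusions are part of the conditions: the sign-in from the corporate range never matches the policy, so no control is applied to it at all.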
Q48. What is the difference between Azure RBAC and Entra ID roles?
Azure RBAC controls access to Azure resources: storage accounts, virtual machines, resource groups, and subscriptions. Roles are assigned at a scope (management group, subscription, resource group, or resource). Built-in roles include Owner, Contributor, Reader, and over 100 service-specific roles.
# Grant a service principal Contributor access to a resource group
az role assignment create \
--assignee <service-principal-object-id> \
--role "Contributor" \
--resource-group my-resource-group
# Grant read access to a specific storage account
az role assignment create \
--assignee <user-or-sp-object-id> \
--role "Storage Blob Data Reader" \
--scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>
Entra ID roles (directory roles) control access to Entra ID itself and to Microsoft 365 services. Examples include Global Administrator, User Administrator, Application Administrator, and Security Reader.
- Managing Azure infrastructure (VMs, storage, networking): use Azure RBAC.
- Managing users, groups, app registrations, and Conditional Access: use Entra ID roles.
- Accessing Azure resources from an application: use Azure RBAC on a managed identity.
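Azure RBAC assignments are inherited down the scope hierarchy, so a role granted on a resource group also applies to every resource inside it. The sketch below models that with plain string prefixes on resource IDs. The `has_role` helper, the principals, and the IDs are hypothetical; real authorization checks are performed by Azure Resource Manager, not by client code.

```python
def has_role(assignments, principal, role, resource_id):
    """Return True if `principal` holds `role` at the resource or any parent scope."""
    target = resource_id.lower().rstrip("/")
    for assigned_principal, assigned_role, scope in assignments:
        if assigned_principal != principal or assigned_role != role:
            continue
        scope = scope.lower().rstrip("/")
        # A scope covers itself and everything beneath it in the hierarchy.
        if target == scope or target.startswith(scope + "/"):
            return True
    return False


SUB = "/subscriptions/0000"
RG = SUB + "/resourceGroups/my-resource-group"
STORAGE = RG + "/providers/Microsoft.Storage/storageAccounts/mydata"
OTHER_RG_VM = SUB + "/resourceGroups/other-rg/providers/Microsoft.Compute/virtualMachines/vm1"

assignments = [
    ("sp-api", "Contributor", RG),                   # resource-group scope
    ("alice", "Storage Blob Data Reader", STORAGE),  # single-resource scope
]

print(has_role(assignments, "sp-api", "Contributor", STORAGE))         # True
print(has_role(assignments, "sp-api", "Contributor", OTHER_RG_VM))     # False
print(has_role(assignments, "alice", "Storage Blob Data Reader", RG))  # False
```

Inheritance only flows downward: the assignment on the storage account gives `alice` nothing at the resource group level, which is why narrow scopes are the safer default.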
Q49. How does Single Sign-On (SSO) work in Entra ID?
SSO lets users authenticate once with Entra ID and access multiple applications without re-entering credentials. Entra ID offers three main SSO methods.
- OpenID Connect (OIDC): modern, token-based. Entra ID returns an ID token (user identity) and an access token (API access). Best for new cloud-native apps.
- SAML 2.0: an XML-based federation standard, common for enterprise SaaS apps such as Salesforce and ServiceNow. Entra ID acts as the Identity Provider (IdP) and the app is the Service Provider (SP). No passwords are exchanged: assertions are signed XML documents.
- Password-based SSO: for apps that do not support federated SSO, Entra ID stores the app's credentials in a vault and fills them into the sign-in form through a browser extension. Treat this as a last resort.
SSO session flow (OIDC):
- User accesses App A while not logged in, and is redirected to the Entra ID login page.
- User authenticates with credentials and MFA.
- Entra ID sets a session cookie and issues tokens for App A.
- User accesses App B; the browser sends the Entra ID session cookie.
- Entra ID validates the existing session and issues tokens for App B without requiring re-authentication.
Token lifetime: access tokens last about an hour by default (Entra ID assigns a random lifetime between 60 and 90 minutes). Refresh tokens allow silent re-authentication for around 90 days before requiring an interactive login.
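The session flow above can be sketched with a toy identity provider. Everything here is hypothetical (the `ToyIdentityProvider` class, its methods, the token fields), and it skips redirects, signatures, and refresh tokens entirely. It only shows the core idea: one interactive login creates a session, and that session is enough to issue tokens for each additional app.

```python
import secrets
import time


class ToyIdentityProvider:
    """Stands in for Entra ID: one browser session, many per-app tokens."""

    def __init__(self, access_token_ttl=3600):
        self.sessions = {}  # session cookie -> user
        self.access_token_ttl = access_token_ttl
        self.interactive_logins = 0

    def sign_in(self, user, password_ok, mfa_ok):
        """Interactive login: credentials plus MFA, returns a session cookie."""
        if not (password_ok and mfa_ok):
            raise PermissionError("authentication failed")
        self.interactive_logins += 1
        cookie = secrets.token_urlsafe(16)
        self.sessions[cookie] = user
        return cookie

    def issue_token(self, cookie, app):
        """Silent token issuance for any app while the session is valid."""
        user = self.sessions.get(cookie)
        if user is None:
            raise PermissionError("no session: redirect to the login page")
        return {"sub": user, "aud": app, "exp": time.time() + self.access_token_ttl}


idp = ToyIdentityProvider()
cookie = idp.sign_in("alice@example.com", password_ok=True, mfa_ok=True)
token_a = idp.issue_token(cookie, "app-a")
token_b = idp.issue_token(cookie, "app-b")  # no second login prompt

print(token_a["aud"], token_b["aud"], idp.interactive_logins)  # app-a app-b 1
```

Each token carries its own audience (`aud`), which is why a token issued for App A cannot simply be replayed against App B even though both came from the same session.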
Q50. What is Privileged Identity Management (PIM) in Entra ID?
PIM is an Entra ID feature that manages just-in-time privileged access to Azure resources and Entra ID roles. Instead of giving permanent admin access (standing privileges), users are made eligible for privileged roles and must activate them when needed.
How it works:
- An administrator marks a user as eligible for the Global Administrator role.
- The user has no admin access by default.
- When needed, the user activates the role via PIM, providing justification and optionally MFA or manager approval.
- The user has admin access for a configurable time window, typically 1 to 8 hours.
- Access expires automatically, and all activation requests are logged.
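The eligible-versus-active distinction is the part candidates most often blur, so here is a minimal sketch of it. The `ToyPim` class and its methods are hypothetical and leave out MFA and approval workflows; the sketch assumes activation needs only eligibility and a non-empty justification.

```python
from datetime import datetime, timedelta, timezone


class ToyPim:
    def __init__(self):
        self.eligible = set()  # (user, role) pairs that may activate
        self.active = {}       # (user, role) -> expiry time
        self.audit_log = []    # every activation request, granted or not

    def make_eligible(self, user, role):
        self.eligible.add((user, role))

    def activate(self, user, role, justification, hours, now):
        """Grant time-boxed access if the user is eligible and gave a reason."""
        granted = (user, role) in self.eligible and bool(justification.strip())
        self.audit_log.append((now, user, role, justification, granted))
        if granted:
            self.active[(user, role)] = now + timedelta(hours=hours)
        return granted

    def is_active(self, user, role, now):
        expiry = self.active.get((user, role))
        return expiry is not None and now < expiry


pim = ToyPim()
start = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
role = "Global Administrator"
pim.make_eligible("alice", role)

print(pim.is_active("alice", role, start))                       # False
pim.activate("alice", role, "Rotate emergency credentials", 2, start)
print(pim.is_active("alice", role, start + timedelta(hours=1)))  # True
print(pim.is_active("alice", role, start + timedelta(hours=3)))  # False
```

Note that denied requests are logged too: the audit trail records attempts, not just successful activations.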
Benefits:
- Reduces attack surface: stolen credentials for a non-admin account cannot immediately be used for admin tasks.
- Requires justification for every role activation.
- Provides a full audit trail of who used which privileged role, when, and why.
- Supports approval workflows for sensitive roles.
# List Azure RBAC role assignments in the subscription
# (an activated PIM role for Azure resources appears here as a time-bound assignment)
az role assignment list --all
# Via Microsoft Graph (list eligible directory role assignments for a user)
GET /roleManagement/directory/roleEligibilitySchedules
?$filter=principalId eq '{user-id}'
Quick Reference: All 50 Questions at a Glance
Use this table to scan every question and its core concept in one pass. It's the fastest way to spot the topics you need to revisit before an interview.
| # | Question | Core concept |
|---|---|---|
| Q1 | What is serverless computing | No server management, pay-per-execution, trade-offs |
| Q2 | How does Lambda work | Execution environment lifecycle, handler, warm/cold start |
| Q3 | Lambda cold starts and prevention | Causes, provisioned concurrency, SnapStart, runtime choice |
| Q4 | Lambda concurrency types | Unreserved vs reserved vs provisioned |
| Q5 | Lambda Layers | Shared dependencies, /opt directory, 5 max layers |
| Q6 | Synchronous vs asynchronous invocation | Caller waits vs fire-and-forget, DLQ, retries |
| Q7 | Lambda limits | 15-minute timeout, 10GB memory, 1,000 concurrency, 6MB payload |
| Q8 | Lambda monitoring and debugging | CloudWatch Logs, Metrics, X-Ray, Powertools |
| Q9 | API Gateway and the three API types | REST vs HTTP vs WebSocket |
| Q10 | API Gateway integration types | Lambda proxy, HTTP, AWS Service, Mock |
| Q11 | API Gateway authorization | IAM, Lambda Authorizer, JWT Authorizer |
| Q12 | API Gateway throttling | Token bucket, account/stage limits, 429 response |
| Q13 | API Gateway stages and deployments | Stage variables, canary deployments, custom domains |
| Q14 | REST API vs HTTP API | Price, features, JWT support, caching |
| Q15 | S3 core concepts | Bucket, object, key, URL format |
| Q16 | S3 storage classes | Standard, IA, Glacier, Intelligent-Tiering, Deep Archive |
| Q17 | S3 versioning | Multiple versions, delete markers, MFA Delete |
| Q18 | Presigned URLs | Temporary access, GET and PUT, direct upload pattern |
| Q19 | S3 event notifications and Lambda | Event types, permission model, structured trigger event |
| Q20 | Bucket policy vs ACL | JSON resource policies vs legacy ACLs, Block Public Access |
| Q21 | Docker vs virtual machines | Shared kernel vs full OS, boot time, resource use |
| Q22 | Docker architecture | CLI, dockerd, containerd, runc, registry |
| Q23 | Docker image vs container | Read-only template vs running instance, writable layer |
| Q24 | Writing an efficient Dockerfile | Layer order, dependencies before code, non-root, .dockerignore |
| Q25 | Docker layer caching | Cache invalidation order, build speed optimization |
| Q26 | Docker Compose | Multi-service YAML, depends_on, healthcheck, volumes |
| Q27 | Docker network drivers | bridge, host, overlay, macvlan, none |
| Q28 | Volumes vs bind mounts | Docker-managed vs host path, production vs dev |
| Q29 | Multi-stage Docker builds | Fat build vs lean production image, --from copy |
| Q30 | Debugging a crashing container | ps -a, logs, inspect, exec -it, exit codes |
| Q31 | Container security best practices | Non-root, minimal image, scan, read-only FS, cap-drop |
| Q32 | Docker Swarm vs Kubernetes | Simplicity vs ecosystem, production scale |
| Q33 | Monolith vs microservices | Deployment, scaling, team autonomy, trade-offs |
| Q34 | Inter-service communication | REST, gRPC, message queues, event streaming |
| Q35 | Service discovery | Client-side vs server-side, Consul, Kubernetes DNS |
| Q36 | Circuit Breaker pattern | Closed, Open, Half-Open states, opossum library |
| Q37 | Saga pattern | Distributed transactions, choreography vs orchestration |
| Q38 | Event sourcing | Events as the source of truth, replay, projections |
| Q39 | Distributed tracing | Trace, span, OpenTelemetry, Jaeger, X-Ray |
| Q40 | Service mesh | Sidecar proxy, mTLS, traffic management, Istio |
| Q41 | 12-factor app methodology | Config, stateless processes, port binding, disposability, logs |
| Q42 | Auth across microservices | JWT propagation, mTLS, service-to-service tokens |
| Q43 | What is Azure Entra ID | IAM service, authentication, SSO, Conditional Access |
| Q44 | App Registration vs Service Principal | Blueprint vs instance, multi-tenant model |
| Q45 | OAuth 2.0 flows in Entra ID | Auth Code, PKCE, Client Credentials, OBO, Device Code |
| Q46 | Managed Identities | Auto-managed credentials, system vs user-assigned |
| Q47 | Conditional Access | Policy engine, signals, grant/block controls |
| Q48 | Azure RBAC vs Entra ID roles | Resource access vs directory access |
| Q49 | Single Sign-On (SSO) | OIDC, SAML, session cookies, token lifetime, CAE |
| Q50 | Privileged Identity Management (PIM) | Just-in-time access, eligible vs active, audit trail |
Frequently Asked Questions
What level of cloud and DevOps knowledge do these 50 questions target?
This guide spans junior fundamentals through senior architecture decisions. Questions 1 through 20 (serverless, Lambda, API Gateway, S3) cover the AWS building blocks that most backend roles touch directly, and are reasonable for mid-level candidates to answer confidently.
Questions 21 through 42 (Docker and microservices) go deeper into operational and architectural tradeoffs, such as the Saga pattern, service mesh, and distributed tracing, which separate mid-level from senior and staff candidates. Questions 43 through 50 (Azure Entra ID) target roles in enterprise Microsoft environments and security-focused positions.
How does AWS Lambda compare to running containers on ECS or Kubernetes?
Both run your code without you managing physical servers, but the operational model and cost profile differ significantly.
| | AWS Lambda | ECS / Kubernetes |
|---|---|---|
| Billing | Per invocation and execution time | Per running instance, regardless of traffic |
| Scaling | Automatic, near-instant, to zero | Configured autoscaling, rarely to zero |
| Max runtime | 15 minutes per invocation | Unbounded, long-running processes are fine |
| Cold starts | Yes, mitigated with provisioned concurrency (Q3) | No, containers stay warm |
| Best fit | Event-driven, bursty, API backends | Steady high-traffic services, persistent connections |
A common pattern is to start with Lambda for new features, since it has the lowest operational overhead, and move a service to ECS or Kubernetes once it runs continuously at high enough volume that always-on compute becomes cheaper than per-invocation billing.
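The crossover point is simple arithmetic, and being able to set it up on a whiteboard is a good sign in an interview. The sketch below is a rough model under stated assumptions: the prices and the container bill are illustrative placeholders rather than quoted AWS rates, the function names are hypothetical, and free tiers, data transfer, and API Gateway charges are ignored.

```python
def lambda_monthly_cost(requests, avg_duration_s, memory_gb,
                        price_per_million_requests, price_per_gb_second):
    """Per-invocation billing: a request fee plus a GB-second compute fee."""
    request_cost = requests / 1_000_000 * price_per_million_requests
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


def break_even_requests(container_monthly_cost, avg_duration_s, memory_gb,
                        price_per_million_requests, price_per_gb_second):
    """Monthly request volume at which Lambda costs the same as always-on compute."""
    cost_per_request = (price_per_million_requests / 1_000_000
                        + avg_duration_s * memory_gb * price_per_gb_second)
    return container_monthly_cost / cost_per_request


# Illustrative placeholder prices, not quoted AWS rates.
PRICES = dict(price_per_million_requests=0.20, price_per_gb_second=0.0000167)

threshold = break_even_requests(
    container_monthly_cost=30.0,  # hypothetical always-on container bill
    avg_duration_s=0.1,
    memory_gb=0.5,
    **PRICES,
)
print(f"Break-even at about {threshold / 1_000_000:.1f}M requests per month")
```

With these placeholder numbers the break-even lands near 29 million requests a month. Longer durations or larger memory settings pull the threshold down quickly, since the compute term dominates the per-request fee.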
How can I practice these AWS, Docker, and Azure concepts before an interview?
Most of these concepts can be tested locally without an AWS or Azure bill, using Docker Desktop, the AWS Free Tier, and LocalStack to emulate AWS services.
# Run LocalStack to emulate S3, Lambda, and API Gateway locally
docker run -d -p 4566:4566 --name localstack localstack/localstack
# Create a bucket against the local endpoint
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-test-bucket
# Build and run a multi-stage Dockerfile from Q29 locally
docker build -t myapp .
docker run -p 3000:3000 myapp
How do I generate an S3 presigned URL for a file upload?
Use the AWS SDK's request presigner to generate a time-limited PUT URL, then have the client upload directly to S3 without the file passing through your backend. This is the pattern covered in Q18.
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
const client = new S3Client({ region: "us-east-1" });
// Wrapped in an async function: CommonJS modules do not allow top-level await
async function createUploadUrl(userId) {
  return getSignedUrl(
    client,
    new PutObjectCommand({
      Bucket: "my-bucket",
      Key: `uploads/user-${userId}/avatar.jpg`,
      ContentType: "image/jpeg",
    }),
    { expiresIn: 300 } // 5 minutes to start the upload
  );
}
// Send the returned URL to the client; it PUTs the file directly to S3
What happens when a Lambda function hits its concurrency limit?
It depends on the invocation type, covered in Q4 and Q6. For synchronous invocations (API Gateway, ALB), Lambda returns a 429 TooManyRequestsException to the caller immediately, and the request is not retried automatically.
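- For asynchronous invocations (S3, SNS, EventBridge), Lambda keeps the event in its internal queue and retries throttled invocations automatically, by default for up to 6 hours (the maximum event age), before discarding the event or sending it to a DLQ or failure destination.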
- Reserved concurrency (Q4) can make this worse if set too low for a function's real traffic, since it hard-caps that function even when the account has unused capacity.
- Provisioned concurrency (Q3) does not prevent throttling on its own. It only keeps a fixed number of environments warm; traffic above that number spills over to on-demand environments, which can cold start and still count against the function's reserved concurrency (if set) and the account limit.
The fix is usually to request an account concurrency limit increase, raise the function's reserved concurrency if it is set too low, add a queue (SQS) in front of the function to smooth bursts, or a combination of these.
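To see why a queue helps, compare what happens to a burst with and without one. This is a deliberately simplified simulation, not Lambda's actual scaling behavior: the function names are hypothetical, each invocation is assumed to handle one message, and all invocations are assumed to take the same time, so the queue drains in equal "rounds".

```python
from collections import deque


def invoke_directly(burst, concurrency_limit):
    """Synchronous path: requests beyond the limit are throttled (429) and
    dropped unless the caller retries. Returns (served, throttled)."""
    served = min(burst, concurrency_limit)
    return served, burst - served


def invoke_through_queue(burst, concurrency_limit):
    """Queued path: the function drains messages in waves that fit the limit.
    Returns (served, throttled, rounds needed to drain the queue)."""
    if concurrency_limit < 1:
        raise ValueError("concurrency_limit must be at least 1")
    queue = deque(range(burst))
    served = rounds = 0
    while queue:
        for _ in range(min(concurrency_limit, len(queue))):
            queue.popleft()
            served += 1
        rounds += 1
    return served, 0, rounds


print(invoke_directly(burst=250, concurrency_limit=100))       # (100, 150)
print(invoke_through_queue(burst=250, concurrency_limit=100))  # (250, 0, 3)
```

The queue trades latency for completeness: nothing is throttled, but the last messages wait for two earlier waves to finish. That trade only works for work that can be processed asynchronously.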