Interview Questions
This file is split into two parts:
- Quick definitions — the classic “what is X” questions, so you don’t blank out on the basics. Start here.
- Scenario-based questions — real problems you’ll be asked in interviews to see how you think
Part 1: Definitions and technical questions
For each technology, the questions you’ll be asked in interviews.
How to use this section:
- Read the question and try to answer it yourself (out loud is best)
- If you’re stuck, open the hint — it gives you a lead without giving away the answer
- Open the answer to compare with yours
Q: What is Git?
💡 Hint
Think of a backup system, with a history and the ability to work as a team.
✅ Answer
A distributed version control system. It keeps a history of every code change and lets multiple people work together without stepping on each other's toes.
Q: Merge vs Rebase?
💡 Hint
Both are used to integrate changes from one branch into another. One preserves the history as-is, the other rewrites it.
✅ Answer
Merge preserves the history (creates a merge commit). Rebase rewrites it (linear history, cleaner, but more dangerous since you're rewriting commits).
Q: Pull vs Fetch?
💡 Hint
Both retrieve remote changes. One applies them directly, the other doesn't.
✅ Answer
Fetch downloads remote changes without applying them. Pull = fetch + merge: it downloads the changes and applies them to your current branch right away.
Q: You and a colleague modified the same line. What happens?
💡 Hint
Git can't decide on its own which version to keep.
✅ Answer
A merge conflict. Git marks both versions in the file, you choose which one to keep (or combine both), then you commit the resolution.
Q: What’s your team’s Git workflow?
💡 Hint
Think of the cycle: create a branch → work → push → request a review → merge.
✅ Answer
We create a branch per feature, commit on it, push, and open a Pull Request. A colleague reviews the code, and if it's good we merge into main. Nobody pushes directly to main: everything goes through a PR.
Q: What’s the difference between git add and git commit?
💡 Hint
One prepares files, the other saves them. Think of a box you fill then seal.
✅ Answer
git add stages files (staging area), git commit saves them in the history. It's like putting items in a box (add) then sealing and labeling the box (commit).
Q: What is a branch?
💡 Hint
Think of a parallel universe of the code where you can work without touching the main version.
✅ Answer
A parallel line of development (technically a lightweight pointer to a commit, not a full copy of the code). You develop on it without touching the main branch (main). When it's ready, you merge.
Q: What is a Pull Request?
💡 Hint
It's the mechanism to propose your changes to the team before integrating them into the main branch.
✅ Answer
A request to merge code. You create a branch, work on it, and when it's ready you open a PR on GitHub. A colleague reviews your code (code review), and if it's good, you merge into main. It ensures code is checked before it reaches production.
Q: Explain the 755 permissions
💡 Hint
3 blocks of 3 permissions (read, write, execute) for 3 categories of people. Each permission has a numeric value.
✅ Answer
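To make the arithmetic concrete, here is a minimal Python sketch that converts an octal mode into the rwx string shown by ls -l. The helper name describe_mode is made up for this example; the values read=4, write=2, execute=1 are the ones used in this question.

```python
def describe_mode(mode: str) -> str:
    """Turn an octal mode such as '755' into the 'rwxr-xr-x' form shown by ls -l."""
    bits = ((4, "r"), (2, "w"), (1, "x"))  # read=4, write=2, execute=1
    out = []
    for digit in mode:  # one digit each for owner, group, others
        value = int(digit, 8)
        out.append("".join(letter if value & bit else "-" for bit, letter in bits))
    return "".join(out)

print(describe_mode("755"))  # rwxr-xr-x
print(describe_mode("644"))  # rw-r--r--
```

Each digit is just a sum: 7 = 4+2+1 (everything), 5 = 4+1 (read and execute), 6 = 4+2 (read and write).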
3 blocks (owner/group/others). read=4, write=2, execute=1. 755 = owner can do everything (7 = 4+2+1), group and others can read and execute (5 = 4+1).
Q: What is an environment variable?
💡 Hint
Think of a way to pass configuration to an application without putting it in the code.
✅ Answer
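As a rough illustration, this is how a Python program might read its configuration from the environment. The variable names APP_DEBUG and DATABASE_URL and the default values are assumptions chosen for the example; the assignment near the top only simulates what export would do in a shell.

```python
import os

# Simulates `export APP_DEBUG="true"`; normally this is set outside the program.
os.environ["APP_DEBUG"] = "true"

def load_config() -> dict:
    """Read settings from the environment, falling back to a default when unset."""
    return {
        "debug": os.environ.get("APP_DEBUG", "false") == "true",
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///local.db"),
    }

config = load_config()
print(config["debug"])  # True
```

The code never contains the real database URL or API key: the same program can run in dev and prod with different values.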
A named value held in a process's environment and inherited by the programs it starts. Used to pass configuration (database URL, API keys, debug mode) without putting it in the code. Use export MY_VAR="value" to create one in the current shell.
Q: A process is eating all the CPU. How do you find and kill it?
💡 Hint
There's a command to list processes sorted by consumption, and another to stop a process by its number (PID).
✅ Answer
top or ps aux to find it (sort by CPU), kill <PID> to stop it cleanly (SIGTERM), kill -9 <PID> (SIGKILL) if that's not enough.
Q: How do you check disk space on a server?
💡 Hint
A short command with a flag that makes the output human-readable.
✅ Answer
df -h shows used and available disk space on each mounted filesystem. The -h flag means human-readable (GB, MB instead of raw block counts). A full disk is a frequent cause of production crashes.
Q: How do you see which process is listening on a port?
💡 Hint
The ss command with the right flags, combined with grep to filter.
✅ Answer
ss -tlnp | grep <port> shows the process listening on that port (run it with sudo to see processes owned by other users).
Q: What is sudo?
💡 Hint
Think "Run as administrator" on Windows.
✅ Answer
"Super User DO": run a command as administrator (root). Needed to install software, modify system config, etc.
Q: Difference between > and >>?
💡 Hint
Both redirect command output to a file. One overwrites, the other doesn't.
✅ Answer
> overwrites the file. >> appends to the end. Example: echo "log" > file.txt replaces the content, echo "log" >> file.txt adds a line.
Q: How do you view a service’s logs?
💡 Hint
There's a specific command for systemd services, and a classic directory for system logs.
✅ Answer
journalctl -u service_name for systemd services, or look in /var/log/ for classic system logs.
Q: What is a process?
💡 Hint
Every program running on your machine is one. Each has a unique number.
✅ Answer
A program currently running. When you run python3 main.py, it creates a process. Each process has a unique number (PID). You can see them with ps aux or top.
Q: What is the PATH?
💡 Hint
It's what the system checks when you type a program name in the terminal. If the program isn't there...
✅ Answer
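A simplified Python sketch of that lookup, assuming a POSIX system. The function name find_in_path and the fake program mytool are invented for the example, and real shells apply extra rules (caching, builtins, aliases) that are left out here.

```python
import os
import tempfile

def find_in_path(program, path):
    """Return the first executable file named `program` in a PATH-style string, else None."""
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        # Simplification: "executable" here means any execute bit is set.
        if os.path.isfile(candidate) and os.stat(candidate).st_mode & 0o111:
            return candidate
    return None  # this is the case where a shell reports "command not found"

# Demo with a throwaway directory and a fake program (names are illustrative).
bin_dir = tempfile.mkdtemp()
tool = os.path.join(bin_dir, "mytool")
with open(tool, "w") as f:
    f.write("#!/bin/sh\necho hello\n")
os.chmod(tool, 0o755)

search_path = os.pathsep.join(["/nonexistent-dir", bin_dir])
print(find_in_path("mytool", search_path))         # path to mytool inside bin_dir
print(find_in_path("mytool", "/nonexistent-dir"))  # None
```

Directories are tried in order and the first match wins, which is why the order of entries in PATH matters. Python's standard library offers shutil.which for the same kind of lookup.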
An environment variable containing the list of directories where the system looks for programs. When you type python3, the shell searches the PATH directories in order and runs the first match. If you get "command not found", it's often because the program isn't in the PATH.
Networking
Q: What is an IP address?
💡 Hint
It's an identifier. There are two types depending on whether you're on the Internet or on a local network.
✅ Answer
An identifier for a machine on the network. Public = visible on the Internet. Private = visible only on the local network.
Q: What is a port?
💡 Hint
A machine can run multiple services (web, SSH, database). The port identifies which one.
✅ Answer
A number (1-65535) that identifies a service on a machine. 22=SSH, 80=HTTP, 443=HTTPS, 5432=PostgreSQL.
Q: What is DNS?
💡 Hint
Think of a phone book that translates something human-readable into something machine-readable.
✅ Answer
The system that translates domain names (google.com) into IP addresses. Without DNS, you'd have to remember the IP of every website.
Q: Difference between TCP and UDP?
💡 Hint
One is reliable but slower, the other is fast but doesn't verify anything. Think HTTP vs video streaming.
✅ Answer
TCP is reliable: it establishes a connection, retransmits lost data, and delivers it in order. UDP is fast: no connection, no delivery or ordering guarantee. HTTP traditionally runs over TCP (HTTP/3 is the exception, running over QUIC on UDP); live video and voice calls often use UDP.
Q: A user tells you “the site doesn’t work”. Where do you start?
💡 Hint
A command that gives you the HTTP response code. The code tells you what type of problem it is (network, proxy, code).
✅ Answer
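The rules of thumb for this question can be written as a small lookup. This is only a sketch: the function name triage is invented, None stands for a request that timed out, and the last two branches are additions for completeness rather than part of the original answer.

```python
def triage(status):
    """Map the outcome of a request to a first guess about where the problem is."""
    if status is None:  # no response at all (timeout)
        return "network or DNS problem"
    if status == 502:
        return "the app behind the proxy is down"
    if status == 500:
        return "bug in the application code"
    if 200 <= status < 300:
        return "server responds normally, look elsewhere"
    return "check the logs for this status code"

for outcome in (None, 502, 500, 200):
    print(outcome, "->", triage(outcome))
```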
curl the site to see the response code (200, 502, timeout). If timeout → network/DNS problem. If 502 → the app behind the proxy is down. If 500 → bug in the code.
Q: What happens when you type a URL in your browser?
💡 Hint
5 steps: address resolution, request, server-side processing, response, rendering. Think DNS, HTTP, and the browser.
✅ Answer
1. DNS resolution: the browser asks a DNS resolver to translate the domain name into an IP address.
2. Sending the request: the browser opens a connection to that IP and sends an HTTP request to the server.
3. Server-side processing: the server receives the request and prepares the response.
4. Server response: the server sends back the content (HTML/CSS/JS, or data as JSON).
5. Rendering: the browser assembles and displays the page.
Q: What is a CIDR /24?
💡 Hint
It's a notation to describe a subnet. The number after / is the prefix length: it determines how many IP addresses are available.
✅ Answer
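Python's standard ipaddress module can be used to check the numbers for a /24. The network 10.0.1.0/24 is the example used in this question; the address 10.0.1.42 and the extra prefixes in the loop are picked only for illustration.

```python
import ipaddress

net = ipaddress.ip_network("10.0.1.0/24")
print(net.num_addresses)   # 256
print(net[0], net[-1])     # 10.0.1.0 10.0.1.255
print(ipaddress.ip_address("10.0.1.42") in net)  # True

# An IPv4 address has 32 bits: a /N prefix leaves 32 - N bits for hosts.
for prefix in (16, 24, 28):
    print(f"/{prefix}: {2 ** (32 - prefix)} addresses")
```

Each extra prefix bit halves the block, which is why a larger number after the slash means fewer addresses.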
A subnet of 256 IP addresses (2^(32-24) = 2^8). Example: 10.0.1.0/24 = 10.0.1.0 to 10.0.1.255. The higher the number after /, the fewer addresses.
Q: What is a firewall?
💡 Hint
Think of a bouncer controlling who enters and exits a building.
✅ Answer
A filter that controls incoming and outgoing network traffic. It allows or blocks traffic based on rules (port, source IP, protocol). On Linux, ufw is a simple tool to configure the firewall.
Q: What does a 502 code mean?
💡 Hint
It's a proxy problem: the server receiving your request can't reach the server behind it.
✅ Answer
Bad Gateway: the proxy/load balancer can't reach the application server behind it, or gets an invalid response from it. Common cause: the application has crashed.
Q: Difference between HTTP and HTTPS?
💡 Hint
The S stands for "Secure". Think of the padlock in the browser's address bar.
✅ Answer
HTTPS = HTTP + encryption (TLS, the successor to SSL). Data is encrypted between your browser and the server, so nobody can read it in transit. The padlock in the browser = HTTPS. Today, every serious site must use HTTPS.
Q: What is a reverse proxy?
💡 Hint
It's an intermediate server between users and your application. It can do several useful things (traffic distribution, HTTPS, caching).
✅ Answer
A server that sits in front of your application and receives requests on its behalf. It can distribute traffic between multiple servers, handle HTTPS, cache content, etc. Nginx is one of the most common reverse proxies.
Q: What is a load balancer?
💡 Hint
If you have multiple servers, how do you distribute requests between them?
✅ Answer
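One simple way to picture it is a round-robin rotation that skips servers marked as down. The class below is a toy model, not how any particular load balancer is implemented; the backend names app-1 to app-3 are hypothetical, and real products detect failures with health checks instead of a manual mark_down call.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy load balancer: rotate over backends, skipping ones marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._rotation = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = next(self._rotation)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backend available")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("app-2")  # simulate a server going down
second_round = [lb.pick() for _ in range(4)]
print(first_round)   # ['app-1', 'app-2', 'app-3']
print(second_round)  # ['app-1', 'app-3', 'app-1', 'app-3']
```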
A tool that distributes traffic across multiple servers. If you have 3 backend servers, the load balancer spreads requests across them (for example, round-robin) to share the load. If a server goes down, the load balancer stops sending traffic to it.
Docker
Q: Difference between image and container?
💡 Hint
Think of a recipe vs a cooked dish. One is a template, the other is a running instance.
✅ Answer
Image = read-only template (the recipe). Container = running instance (the cooked dish). One image can create multiple containers.
Q: What is a Dockerfile?
💡 Hint
A text file with instructions. Think of the keywords: FROM, COPY, RUN, CMD.
✅ Answer
A text file that describes step by step how to build a Docker image. FROM for the base, COPY for files, RUN for commands, CMD for the startup command.
Q: A container keeps crashing in a loop. How do you debug it?
💡 Hint
The first thing is always the logs. If the container isn't running anymore, there's a way to launch the image with a shell instead of the app.
✅ Answer
docker logs <container> to read the logs (this also works on a stopped container). If the container isn't running anymore and the logs aren't enough, docker run -it --entrypoint bash <image> to get inside and investigate manually (use sh if the image has no bash).
Q: Why use a multi-stage build?
💡 Hint
The goal is the final image size. You separate the build phase and the runtime phase.
✅ Answer
To reduce the final image size. You build in a heavy image (with build tools), then copy only the result into a lightweight image. For example, a frontend image can go from around 500 MB to around 20 MB.
Q: How do containers communicate with each other in Docker Compose?
💡 Hint
Docker Compose automatically creates something that lets containers find each other by their service name.
✅ Answer
Via an internal network created automatically. Each container is accessible by its service name (e.g., backend:8000, db:5432). It's service discovery through internal DNS.
Q: Difference between CMD and ENTRYPOINT?
💡 Hint
One is easily overridden at launch, the other stays fixed by default. Which one is used 90% of the time?
✅ Answer
CMD = default command, replaced by any arguments you pass to docker run. ENTRYPOINT = fixed command, arguments from docker run are appended after it (it can only be replaced with the --entrypoint flag). In practice, CMD is enough 90% of the time.
Q: What is Docker?
💡 Hint
Think of a way to package an application with everything it needs to run the same way everywhere.
✅ Answer
A tool that packages an application with all its dependencies into an isolated container. The container runs the same way everywhere (your PC, a server, the cloud).
Q: What is Docker Compose?
💡 Hint
When you have multiple containers (backend, frontend, database), you need a tool to manage them together.
✅ Answer
A tool for managing multiple containers together with a YAML file. You define services, networks, and volumes, then docker compose up launches everything at once.
Q: What is a Docker volume?
💡 Hint
By default, container data disappears when it's deleted. How do you persist data?
✅ Answer
Persistent storage. Without a volume, data disappears when the container is deleted. Essential for databases: with a volume, data survives when the container is removed and recreated.
Q: Difference between COPY and ADD in a Dockerfile?
💡 Hint
Both copy files. One does more than the other, but is that always desirable?
✅ Answer
Both copy files into the image. COPY does a simple copy. ADD can also extract local archives (.tar.gz) and download from URLs. In practice, prefer COPY unless you need one of those extras: it's more explicit.
Q: What is a Docker registry?
💡 Hint
Think of GitHub, but for Docker images instead of source code.
✅ Answer
A server that stores Docker images. Docker Hub is the default public registry. In the workplace, private registries (AWS ECR, GitHub Container Registry) are often used to store your own images.
Q: Why does the order of instructions in a Dockerfile matter?
💡 Hint
Docker uses a layer caching system. If a layer changes, all subsequent layers are rebuilt.
✅ Answer
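The "everything after a changed layer is rebuilt" rule can be modeled with chained hashes. This is a deliberately simplified model written for this explanation, not Docker's actual cache implementation; the step contents ("deps-v1", "code-v2") are placeholders standing in for file fingerprints.

```python
import hashlib

def layer_keys(steps):
    """One cache key per step; each key depends on the step AND on every step before it."""
    keys, parent = [], ""
    for instruction, content in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(parent)
    return keys

def rebuilt_steps(old, new):
    """Instructions that miss the cache when going from the old build to the new one."""
    pairs = zip(new, layer_keys(old), layer_keys(new))
    return [step[0] for step, old_key, new_key in pairs if old_key != new_key]

# Dependencies copied and installed BEFORE the application code.
good_v1 = [("COPY requirements.txt", "deps-v1"), ("RUN pip install", ""), ("COPY . .", "code-v1")]
good_v2 = [("COPY requirements.txt", "deps-v1"), ("RUN pip install", ""), ("COPY . .", "code-v2")]

# Everything copied first: any code change invalidates the install step too.
bad_v1 = [("COPY . .", "code-v1 deps-v1"), ("RUN pip install", "")]
bad_v2 = [("COPY . .", "code-v2 deps-v1"), ("RUN pip install", "")]

print(rebuilt_steps(good_v1, good_v2))  # ['COPY . .']
print(rebuilt_steps(bad_v1, bad_v2))    # ['COPY . .', 'RUN pip install']
```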
Because of caching. Docker builds the image layer by layer, one instruction at a time. If an instruction and its inputs haven't changed, Docker reuses the cached layer. By putting COPY requirements.txt + RUN pip install BEFORE COPY . ., dependencies are only reinstalled when they actually change, not on every code modification.
Q: What is CI/CD?
💡 Hint
CI = before deployment (verify). CD = the deployment itself (deliver).
✅ Answer
CI (Continuous Integration) = automatic verification on every push (lint, tests). CD (Continuous Delivery/Deployment) = automatic (or semi-automatic) deployment. The goal: detect bugs as early as possible and deploy with confidence.
Q: What is “fail fast”?
💡 Hint
If a quick step fails, do you still run the long steps?
✅ Answer
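A minimal sketch of the idea in Python, assuming each stage is a function that returns True on success. The stage names follow the usual lint, tests, build, deploy order; run_pipeline and the simulated results are made up for the example.

```python
def run_pipeline(stages):
    """Run stages in order and stop at the first failure (fail fast)."""
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            return executed, f"failed at {name}"
    return executed, "success"

stages = [
    ("lint", lambda: True),
    ("tests", lambda: False),  # simulated failing test suite
    ("build", lambda: True),
    ("deploy", lambda: True),
]
executed, status = run_pipeline(stages)
print(executed, status)  # ['lint', 'tests'] failed at tests
```

The build and deploy stages never start, so no time is spent on them and nothing broken reaches production.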
If lint fails, you don't run the tests. If tests fail, you don't build. You stop as soon as a problem is detected to avoid wasting time.
Q: Where do you put secrets in a pipeline?
💡 Hint
Never in the code, never in the committed YAML. There's a dedicated place in GitHub/GitLab for that.
✅ Answer
In the CI secrets (GitHub Secrets, GitLab CI/CD variables). They're injected at runtime and masked in the logs.
Q: A test passes locally but fails in CI. Why?
💡 Hint
Think about the differences between your machine and the CI runner: versions, environment variables, available services.
✅ Answer
Often an environment difference: different Python/Node version, missing environment variable, dependency not installed, or the test depends on a service (DB) that doesn't exist in CI.
Q: How do you roll back if a deployment breaks production?
💡 Hint
Docker images are tagged with the commit hash. How do you use that to go back?
✅ Answer
You redeploy the previous Docker image. That's why we tag images with the commit hash: you can go back to any version in a few minutes.
Q: What are the stages of a typical CI/CD pipeline?
💡 Hint
4 stages in order. If the first fails, the next ones don't run.
✅ Answer
Lint (code quality) → Tests → Build (artifact construction) → Deploy. Each stage blocks the next if it fails.
Q: Difference between Continuous Delivery and Continuous Deployment?
💡 Hint
Both start with "Continuous D...". The difference: does a human press a button before prod?
✅ Answer
Delivery = every change is ready to deploy, but a human triggers the release. Deployment = every change that passes the pipeline goes to prod automatically. Many companies do Delivery (a human validates before prod).
Q: What is a runner?
💡 Hint
The pipeline doesn't execute itself in a vacuum: it needs a machine to run on.
✅ Answer
The machine (server) that executes pipeline jobs. GitHub provides hosted runners (e.g., ubuntu-latest), free within usage limits. You can also use self-hosted runners for more control.
Q: What is a blue/green deployment?
💡 Hint
Two identical environments. One serves prod, the other waits for the new version. You switch traffic all at once.
✅ Answer
A deployment strategy with two identical environments. "Blue" serves prod, you deploy the new version to "green", test it, then switch the traffic. If it breaks, you switch back in seconds. Advantage: instant rollback.
Q: What is a canary deployment?
💡 Hint
Instead of deploying to everyone at once, you start with a small percentage. The name comes from canaries in coal mines.
✅ Answer
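The rollout logic can be sketched as a loop over traffic percentages. Everything specific here is an assumption for illustration: the function name canary_rollout, the 1% error threshold, and the error_rate_at callback that stands in for real monitoring.

```python
def canary_rollout(error_rate_at, steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Increase traffic step by step; roll back as soon as the error rate is too high."""
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return f"rolled back at {percent}%"
    return "fully deployed"

# Healthy release: the error rate stays low at every step.
print(canary_rollout(lambda percent: 0.002))  # fully deployed

# Faulty release: errors appear once 25% of traffic hits the new version.
print(canary_rollout(lambda percent: 0.05 if percent >= 25 else 0.002))  # rolled back at 25%
```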
You deploy the new version to a small percentage of servers or traffic (e.g., 5%). You monitor the metrics. If everything's fine, you gradually increase (25% → 50% → 100%). If it breaks, only about 5% of users are impacted.
Q: What is EC2?
💡 Hint
Think of renting a computer instead of buying one.
✅ Answer
A virtual server in the cloud. You choose the power (CPU, RAM) and the OS, and you pay for the time the instance runs.
Q: How do you connect to an EC2?
💡 Hint
A remote connection protocol + a key file downloaded when the instance was created.
✅ Answer
Via SSH with a key pair: ssh -i ~/devops-key.pem ubuntu@PUBLIC_IP. The .pem private key is downloaded when the key pair is created (typically while launching the instance).
Q: Your EC2 is unresponsive. What are the first things you check?
💡 Hint
3 things: the instance itself (is it running?), the network (is the port open?), and the address (does it have a public IP?).
✅ Answer
1. Is the instance "Running" in the AWS console?
2. Does the Security Group allow SSH (22) and HTTP (80) ports?
3. Does the instance have a public IP?
4. If everything looks fine on the AWS side, SSH in and check the app logs.
VPC and Networking
Q: What is a VPC?
💡 Hint
It's your private network in AWS. You put your resources in it and control who can access what.
✅ Answer
Virtual Private Cloud: an isolated network in AWS. You put your resources in it (EC2, RDS). You control the subnets, routing, and access.
Q: Difference between public and private subnet?
💡 Hint
One is accessible from the Internet, the other isn't. Think about where you'd put a web server vs a database.
✅ Answer
Public = accessible from the Internet (its route table has a route to an Internet Gateway). Private = no direct Internet access (no such route). You put web servers in public, databases in private.
Q: What is a Security Group?
💡 Hint
It's like a firewall. It controls traffic by port and source. It's "stateful": what does that mean?
✅ Answer
A virtual firewall attached to an instance. It filters inbound (ingress) and outbound (egress) traffic by port and source IP. "Stateful" = if you allow inbound traffic on a port, the outbound response is automatically allowed.
Q: What is an Internet Gateway?
💡 Hint
Without it, your VPC is completely isolated from the Internet. It's the door between your private network and the outside world.
✅ Answer
The door that connects your VPC to the Internet. Without an Internet Gateway, no resource in the VPC can access the Internet (and nobody can access it from the Internet).
Q: Why use RDS instead of installing PostgreSQL on an EC2?
💡 Hint
Think about everything you DON'T have to manage with RDS: backups, updates, high availability.
✅ Answer
RDS handles automatic backups, security updates, replication and high availability. You don't have to maintain the database server yourself. The extra cost is usually offset by the time saved.
Q: What is Multi-AZ on RDS?
💡 Hint
Your database is copied to a 2nd location. If the first one goes down...
✅ Answer
Your database is automatically replicated to a standby in a 2nd Availability Zone (a separate datacenter). If the first one fails, the 2nd takes over automatically. That's high availability.
Q: How do you protect your database on AWS?
💡 Hint
Think about the subnet (where it's placed) and the Security Group (who's allowed to connect to it).
✅ Answer
You put it in a private subnet (no public IP), with a Security Group that only allows port 5432 from the EC2's Security Group. Never direct access from the Internet.
Q: What is S3?
💡 Hint
File storage in the cloud. Unlimited, high durability, cheap.
✅ Answer
Simple Storage Service: virtually unlimited object (file) storage in the cloud. Used for backups, static files (images, CSS, JS from a frontend), logs, data exports.
Q: How do you secure an S3 bucket?
💡 Hint
By default a bucket is private. The danger is making it public by mistake.
✅ Answer
By default, an S3 bucket is private (that's good). You verify that "Block all public access" is enabled. You control access via bucket policies and IAM roles. Never allow public access except for intentionally public static content (frontend).
Q: What is IAM?
💡 Hint
AWS's permission system. Who is allowed to do what.
✅ Answer
Identity and Access Management. Manages users (Users), roles (Roles) and permissions (Policies). The key principle is least privilege: only grant the permissions that are strictly necessary.
Q: User vs Role: what’s the difference?
💡 Hint
One is permanent (a person or a program), the other is temporary (you "assume" it when needed).
✅ Answer
User = a permanent account for a person or a program (with long-lived credentials). Role = a set of permissions with temporary credentials that a service or user can "assume" (e.g., an EC2 that needs to access S3 uses a role, not a user).
Q: What is an IAM Policy?
💡 Hint
Think of the document that describes permissions. It's in JSON format.
✅ Answer
A JSON document that defines permissions: which actions (e.g., s3:GetObject) are allowed or denied, on which resources (e.g., a specific bucket). You attach it to a User, Group or Role to grant these rights.
Q: What is the principle of least privilege?
💡 Hint
A fundamental security rule: you grant the minimum permissions needed, nothing more.
✅ Answer
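As an illustration, a least-privilege policy for a read-only use case could look like the dictionary below, printed as the JSON you would attach to a role. The bucket name is hypothetical, and the allows helper is a deliberately naive check written for this example; it is not how IAM evaluates policies.

```python
import json
from fnmatch import fnmatchcase

BUCKET = "example-reports-bucket"  # hypothetical bucket name

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],               # read objects, nothing else
            "Resource": f"arn:aws:s3:::{BUCKET}/*",   # only this bucket's objects
        }
    ],
}

def allows(policy, action, resource):
    """Naive check: an Allow statement lists the action and its pattern matches the resource."""
    return any(
        stmt["Effect"] == "Allow"
        and action in stmt["Action"]
        and fnmatchcase(resource, stmt["Resource"])
        for stmt in policy["Statement"]
    )

print(json.dumps(read_only_policy, indent=2))
print(allows(read_only_policy, "s3:GetObject", f"arn:aws:s3:::{BUCKET}/2024/report.csv"))     # True
print(allows(read_only_policy, "s3:DeleteObject", f"arn:aws:s3:::{BUCKET}/2024/report.csv"))  # False
print(allows(read_only_policy, "s3:GetObject", "arn:aws:s3:::other-bucket/file.txt"))         # False
```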
Grant only the permissions strictly necessary to do the job, and nothing more. If a Lambda only needs to read an S3 bucket, you give it only s3:GetObject on that specific bucket, not AdministratorAccess. This limits the damage if credentials are compromised.
Lambda and SQS
Q: When to use Lambda vs EC2?
💡 Hint
Think about execution duration and frequency. One runs 24/7, the other runs on demand.
✅ Answer
Lambda = short tasks (<15 min), occasional, with automatic scaling (webhooks, file processing). EC2 = applications running continuously 24/7 (web API, server). With Lambda you pay per execution, with EC2 you pay for as long as the instance runs, even when idle.
Q: What is SQS and why is it useful?
💡 Hint
Think of a queue. Instead of processing messages directly (and risking losing them if it crashes), you put them in...
✅ Answer
Simple Queue Service: a managed message queue. You put messages in, another program consumes them. If the consumer crashes before deleting a message, the message becomes visible in the queue again (after the visibility timeout) and gets reprocessed. Useful for decoupling services, absorbing traffic spikes, and avoiding data loss.
ECS and EKS
Q: What’s the difference between ECS and EKS?
💡 Hint
Both run containers on AWS. One is AWS-specific and simpler, the other is a portable standard.
✅ Answer
ECS = AWS-specific container orchestration (simpler, no control plane fees). EKS = managed Kubernetes (standard, multi-cloud portable, but more complex and more expensive, ~$75/month base).
Q: What is Fargate?
💡 Hint
An ECS mode where you don't manage any servers. You just provide your Docker image and the amount of CPU/RAM.
✅ Answer
A "serverless" mode for ECS (also usable with EKS): you provide your Docker image, define CPU and RAM, and AWS launches the container somewhere in the cloud. You never see a machine, you don't manage any servers. You only pay for the CPU/RAM you reserve while the container runs.
Q: What is AWS?
💡 Hint
The world's largest cloud provider. You rent computing resources instead of buying them.
✅ Answer
A cloud computing provider. You rent servers (EC2), storage (S3), databases (RDS) and many other services, on demand. You pay for what you use.
Q: What is RDS?
💡 Hint
Think of a database where AWS handles all the maintenance for you.
✅ Answer
Relational Database Service: a database managed by AWS. You choose the engine (PostgreSQL, MySQL...), AWS handles backups, updates, and high availability.
Q: What is DynamoDB?
💡 Hint
AWS's NoSQL alternative. Instead of SQL tables with fixed columns, you store...
✅ Answer
A NoSQL database managed by AWS. Instead of SQL tables with fixed columns, you store flexible JSON-like items looked up by key. In on-demand mode, scaling is automatic and pricing is per-request.
Q: When to use RDS vs DynamoDB?
💡 Hint
Think about the data type: does it have relationships (users → orders → products)?
✅ Answer
RDS when your data has relationships and you need complex SQL queries. DynamoDB for simple data at very high traffic (sessions, cache, counters). When in doubt, RDS: it's more versatile.
Q: What is ECS?
💡 Hint
You give it Docker images, it runs, monitors and scales them. With Fargate, you don't even manage servers.
✅ Answer
Elastic Container Service: you give AWS your Docker images, and it runs, monitors and scales them. With Fargate, you manage no servers and pay only for the CPU and RAM you reserve.
Q: What is EKS?
💡 Hint
Managed Kubernetes on AWS. AWS manages one part, you manage the other. The advantage is portability.
✅ Answer
Elastic Kubernetes Service: managed Kubernetes on AWS. AWS manages the control plane, you manage the worker nodes. Advantage over ECS: K8s is a standard portable across any cloud.
Q: What is Lambda?
💡 Hint
Code that runs without a server. You only pay when your code runs.
✅ Answer
Serverless compute: you send your code, AWS runs it when needed, you pay per execution. No server to manage. Ideal for short, one-off tasks (<15 min).
Q: When to use Lambda vs EC2 vs ECS?
💡 Hint
Think about execution duration and whether the app needs to run continuously or not.
✅ Answer
Lambda for short (<15 min), one-off tasks. ECS/EKS for containerized apps running continuously with auto-scaling. EC2 when you need full server control or for small simple projects.
Q: What is a cold start?
💡 Hint
The first Lambda execution is slower. Why?
✅ Answer
The first Lambda execution (and any execution after an idle period or during a scale-up) is slower because AWS has to start a new execution environment. Subsequent executions (warm start) are faster because the environment is already ready.
Q: Difference between horizontal and vertical scaling?
💡 Hint
One adds power to a machine, the other adds machines. Which one has a physical limit?
✅ Answer
Vertical = increase the power of a machine (more CPU, more RAM). Horizontal = add more machines. Vertical has a physical limit, horizontal is virtually unlimited. In the cloud, horizontal scaling is preferred.
Q: What is the shared responsibility model?
💡 Hint
AWS and you each have a share of responsibility for security. Who manages what?
✅ Answer
AWS manages security of the cloud (datacenters, physical network, hypervisors). You manage security in the cloud (your data, your Security Groups, your IAM policies, your code). If your Security Group is open to everyone, that's your fault, not AWS's.
Terraform
Q: What is Infrastructure as Code?
💡 Hint
Instead of clicking in a console to create servers, you do what?
✅ Answer
You describe your infrastructure in code files instead of clicking in a console. Reproducible, versioned in Git, auditable, shareable.
Q: Explain plan, apply, destroy
💡 Hint
Three steps: preview, execute, delete. Which one do you always do first?
✅ Answer
plan shows what will change without doing anything. apply executes the changes. destroy deletes everything Terraform manages in that configuration. You always run plan before apply to verify.
Q: What is the state file and why is it important?
💡 Hint
Terraform needs to know what CURRENTLY exists to compare with what you want. It stores that in a file.
✅ Answer
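The comparison between recorded state and desired configuration can be pictured as a dictionary diff. This is a conceptual sketch only: the real state file has a richer format, and the resource addresses and attributes below are invented for the example.

```python
def plan(state: dict, desired: dict) -> dict:
    """Compare recorded state with the desired config, like a much simplified plan step."""
    return {
        "create": sorted(desired.keys() - state.keys()),
        "update": sorted(k for k in desired.keys() & state.keys() if desired[k] != state[k]),
        "delete": sorted(state.keys() - desired.keys()),
    }

# What the state file says exists today (hypothetical resources).
state = {
    "aws_instance.web": {"type": "t3.micro"},
    "aws_s3_bucket.logs": {"versioning": False},
}
# What the code files describe.
desired = {
    "aws_instance.web": {"type": "t3.small"},
    "aws_db_instance.main": {"engine": "postgres"},
}

print(plan(state, desired))
# {'create': ['aws_db_instance.main'], 'update': ['aws_instance.web'], 'delete': ['aws_s3_bucket.logs']}
```

Without the state, there would be nothing to compare against, and every run would try to create everything again.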
A JSON file (terraform.tfstate by default) that records the current state of the infrastructure. Terraform compares it with your code to know what to create/modify/delete. Never edit it by hand, never commit it (it can contain secrets); teams usually keep it in a remote backend.
Q: How do you interact with a resource that already exists on AWS but not in your Terraform?
💡 Hint
There's a keyword different from resource that FETCHES information instead of CREATING something.
✅ Answer
With a data block. Unlike resource which creates something, data fetches information that already exists (an AMI, a VPC, an existing Security Group).
Q: Someone modified the infrastructure by hand in the AWS console. What happens?
💡 Hint
The state file no longer matches reality. Terraform will detect the difference on the next plan. What is that called?
✅ Answer
That's drift. On the next terraform plan, Terraform shows the differences between the code and reality. Either you update the code to match the manual change, or apply overwrites it.
Q: What is Terraform?
💡 Hint
A tool for describing your infrastructure in code files instead of clicking in a console.
✅ Answer
An Infrastructure as Code tool. You describe your infra in HCL files, Terraform creates/modifies/deletes it. Versionable, reproducible, collaborative.
Q: Terraform vs CloudFormation?
💡 Hint
One is multi-cloud, the other is specific to a single cloud provider.
✅ Answer
Terraform is multi-cloud (AWS, GCP, Azure). CloudFormation is AWS-specific. Terraform has a large community, and its syntax is often considered more readable.
Q: What is a Terraform module?
💡 Hint
Think of a function in programming: reusable code you call with parameters.
✅ Answer
A reusable block of Terraform code. Instead of copy-pasting the same config for each environment, you create a module and call it with different parameters. It's like a function in programming.
Q: What is a Terraform provider?
💡 Hint
Terraform alone can't do anything. It needs plugins to talk to AWS, GCP, etc.
✅ Answer
A plugin that connects Terraform to a service (AWS, GCP, Azure, GitHub...). The AWS provider allows Terraform to create EC2s, S3 buckets, RDS instances. Without a provider, Terraform can't talk to anything.
Ansible
Q: What is Ansible?
💡 Hint
A server configuration tool. The keyword is "agentless": it doesn't need to install anything on the target server.
✅ Answer
A configuration management tool. Configures servers in an automated way, agentless (connects via SSH, no agent to install on the target server).
Q: Ansible vs Terraform?
💡 Hint
One creates the infrastructure, the other configures what runs on it. Think "building the house" vs "furnishing it".
✅ Answer
Terraform creates the infrastructure (the server exists). Ansible configures what runs on it (installs Docker, copies files, launches the app). Terraform builds the house, Ansible furnishes it.
Q: What is idempotence?
💡 Hint
What happens if you run the same playbook 10 times in a row?
✅ Answer
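A small Python sketch of an idempotent operation: it describes the desired end state and only acts when that state is not already reached. The function name ensure_line and the configuration lines are illustrative; the True/False return value loosely mirrors the way Ansible reports a task as changed or ok.

```python
def ensure_line(lines: list, line: str) -> bool:
    """Make sure `line` is present; return True only if something had to change."""
    if line in lines:
        return False  # already in the desired state: do nothing
    lines.append(line)
    return True

config = ["PermitRootLogin no"]
changes = [ensure_line(config, "PasswordAuthentication no") for _ in range(10)]

print(changes.count(True))  # 1
print(config)               # ['PermitRootLogin no', 'PasswordAuthentication no']
```

Ten runs, one change: the first run does the work, the other nine find nothing to do.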
Running a playbook multiple times always gives the same result. If Docker is already installed, Ansible doesn't reinstall it. That's what makes it safe to re-run.
Q: What is a playbook?
💡 Hint
It's a file in a format you know well (used everywhere in DevOps). It describes tasks to execute.
✅ Answer
A YAML file that describes tasks to execute on servers. Each task uses a module (apt, copy, service) and is named for readability.
Q: How do you manage secrets in Ansible?
💡 Hint
Ansible has a built-in tool to encrypt files. Its name makes you think of a safe.
✅ Answer
With Ansible Vault. You encrypt files containing secrets, and at execution time you pass --ask-vault-pass to decrypt them.
Q: What is an Ansible inventory?
💡 Hint
Ansible needs to know which machines to act on. There's a file for that.
✅ Answer
The file that lists the servers Ansible will act on. It contains IP addresses or hostnames, organized in groups (web, db, etc.). Ansible connects via SSH to each machine in the inventory to execute tasks.
Q: What is an Ansible role?
💡 Hint
When your playbook grows, you need to organize it into reusable components.
✅ Answer
A way to organize a playbook into reusable components. A role bundles tasks, files, templates, and variables related to a function (e.g., a "docker" role that installs and configures Docker). You can reuse the same role across multiple playbooks.
Kubernetes
Q: What is Kubernetes?
💡 Hint
Think of an orchestra conductor for containers. It manages 3 main things: deployment, scaling, and...
✅ Answer
A container orchestrator. It manages the deployment, scaling and high availability of your containers on a cluster of machines.
Q: What is a Pod?
💡 Hint
It's the basic unit. Most of the time, 1 pod = 1 container.
✅ Answer
The basic unit of K8s. 1 pod ≈ 1 container (a pod can hold several containers that share network and storage). Kubernetes doesn't manage containers directly: it manages pods.
Q: A pod crashes. What does Kubernetes do?
💡 Hint
K8s maintains the number of replicas defined in the Deployment. If one is missing, it...
✅ Answer
The Deployment detects that a pod is missing and automatically recreates one. That's self-healing. That's why you never create pods directly — you go through a Deployment.Q: What’s the difference between port and targetPort in a Service?
💡 Hint
One is the "entry" port of the Service, the other is the port the container actually listens on. They can be different.✅ Answer
port = the port to access the Service (from inside the cluster). targetPort = the port on the container that traffic is redirected to. Often the same, but you could map port 80 of the Service to port 8000 of the container.Q: How do you update an app without downtime on K8s?
💡 Hint
K8s replaces pods one by one, not all at once. It waits for the new one to be ready before deleting the old one. What is that called?✅ Answer
Rolling update (the default). Kubernetes creates a new pod with the new version, waits for it to be ready (health check), then deletes the old one. Pods are replaced one by one — users don't see any downtime.Q: Difference between Docker and Kubernetes?
💡 Hint
One runs ONE container, the other orchestrates dozens/hundreds across multiple machines.✅ Answer
Docker runs ONE container. Kubernetes orchestrates dozens/hundreds of containers across multiple machines (scheduling, scaling, self-healing).Q: What is a Deployment?
💡 Hint
You never create pods directly. You go through an object that manages them for you.
✅ Answer
An object that manages a group of identical pods. It maintains the desired replica count, manages updates (rolling update), and recreates crashed pods.
Q: What is a K8s Service?
💡 Hint
Pods have IPs that change on every restart. You need a stable access point.
✅ Answer
A stable network access point to a group of pods. Pods have ephemeral IPs, the Service has a fixed IP and distributes traffic across the pods.
Q: What is a Namespace?
💡 Hint
Think of folders to organize and isolate resources in a cluster.
✅ Answer
A way to logically isolate resources in a cluster. Useful for separating environments (dev, staging, prod) or teams.
Q: What is an Ingress?
💡 Hint
How do you make external HTTP requests reach the right Services inside the cluster?
✅ Answer
A K8s object that manages HTTP(S) routing to Services. It lets you say "requests to api.mysite.com go to the backend Service" and "requests to mysite.com go to the frontend Service". It's the HTTP entry point of the cluster.
Q: What is a ConfigMap and a Secret?
💡 Hint
How do you pass configuration and secrets to your pods without putting them in the Docker image?
✅ Answer
K8s objects for storing configuration. A ConfigMap stores non-sensitive data (URLs, feature flags). A Secret stores sensitive data (passwords, API keys) encoded in base64. Careful: base64 is an encoding, not encryption. Both are injected into pods as environment variables or files.
Q: What is a liveness probe and a readiness probe?
💡 Hint
K8s needs to know if your pods are alive and ready. It uses two different types of checks.
✅ Answer
Health checks that K8s runs on the containers in your pods. The liveness probe checks if the container is alive: if it fails, K8s restarts the container. The readiness probe checks if the pod is ready for traffic: if it fails, K8s stops sending requests to it without restarting it.
Q: Difference between ClusterIP, NodePort, and LoadBalancer?
💡 Hint
These are the three K8s Service types. Each exposes the Service at a different level of accessibility.
✅ Answer
Three K8s Service types. ClusterIP (default) = accessible only from inside the cluster. NodePort = accessible from outside via a port on each node. LoadBalancer = creates an external load balancer (cloud provider) redirecting to the Service. In production, you typically use an Ingress in front of a ClusterIP Service.
Monitoring
Q: What are the 3 pillars of observability?
💡 Hint
Three types of data: numbers, text, and the path of a request.
✅ Answer
Metrics (numbers: CPU, response time), Logs (text messages from applications), Traces (the path of a request through multiple services).
Q: How do you tell the difference between a code problem and an infrastructure problem?
💡 Hint
If all instances have the same problem, it's probably the code. If it's just one instance... think about resources.
✅ Answer
You check infrastructure metrics first (CPU, RAM, disk, network). If everything is normal on the infra side but the app returns errors → it's a code bug (ticket for the devs). If CPU is at 100% or disk is full → it's an infra problem (your problem).
Q: How do you know if your app is slow?
💡 Hint
You don't look at the average (it hides problems). You look at a percentile — which one?
✅ Answer
The p95 or p99 of latency in Grafana. The p95 = 95% of requests are faster than this value. If the p95 is at 2 seconds, 5% of your requests take more than 2 seconds.
Q: What’s a good alert vs a bad alert?
💡 Hint
A good alert prompts you to act. A bad alert, you end up ignoring. Think symptoms vs causes.
✅ Answer
Good: actionable, based on symptoms ("the 5xx error rate exceeds 5%"). Bad: noise ("CPU at 80%" — maybe that's normal). If you receive an alert and your reaction is "meh", delete the alert.
Q: What’s the difference between Prometheus and Grafana?
💡 Hint
One collects data, the other displays it. Think sensor vs dashboard.
✅ Answer
Prometheus collects and stores metrics (it scrapes /metrics at a regular interval, for example every 15s). Grafana displays them in dashboards. Prometheus = the sensor, Grafana = the dashboard.
Q: Why is monitoring important?
💡 Hint
Without monitoring, how do you know your app is working correctly?
✅ Answer
Without monitoring, you don't know if your app works correctly. With it, you detect problems before users do, identify bottlenecks, and have data to back your decisions.
Q: What is Prometheus?
💡 Hint
A metrics collection tool. It fetches data itself (pull model) instead of waiting for apps to send it.
✅ Answer
A metrics collection system using a pull model. It scrapes /metrics endpoints from applications at regular intervals and stores the data as time series.
Q: What is Grafana?
💡 Hint
It's the visualization tool that goes with Prometheus. Think dashboards and graphs.
✅ Answer
A visualization tool. It connects to data sources (Prometheus, etc.) and creates dashboards with graphs and alerts.
Q: Difference between pull and push model?
💡 Hint
Who initiates data collection? The monitoring server, or the application itself?
✅ Answer
Pull = Prometheus fetches the data (scrape). Push = applications send the data. Pull is simpler to manage and debug.
Q: What are SLI, SLO, and SLA?
💡 Hint
Three levels: what you measure, what you aim for, what you contractually commit to.
✅ Answer
SLI (Service Level Indicator) = the measured metric (e.g., 99.2% of requests respond in under 200ms). SLO (Service Level Objective) = the internal target (e.g., we aim for 99.5%). SLA (Service Level Agreement) = the contractual commitment with the client (e.g., if we drop below 99%, we refund). SLI measures, SLO guides, SLA commits.
Part 2: Scenario-based questions
Scenario 1 — Deploying a web app to production
“You join a startup. They have a web app (React frontend + backend API + PostgreSQL database). Everything runs on the CTO’s laptop. How do you put this in production?”
How to approach the question
Don’t dive straight into tools. Ask questions first:
- How many users? (10? 10,000? 1 million?)
- What budget? ($0? $50/month? $1,000/month?)
- What team? (1 dev, 10 devs? Is there a DevOps?)
- What are the availability requirements? (side project vs. banking app)
- Is the frontend static (just built HTML/JS) or does it need server-side rendering?
This last question is key, because it completely changes the architecture for the frontend.
The frontend — 3 different approaches
Our case: React with Vite = static frontend. The build produces static files (HTML/CSS/JS) that can be served from any web server or CDN.
Approach 1 — CDN / Static hosting (the simplest and most performant)
The built frontend is just static files. No need for a server for this.
| Service | What it is | Cost | Complexity |
|---|---|---|---|
| S3 + CloudFront | S3 bucket (storage) + AWS CDN (worldwide distribution) | ~$0-5/month | Low |
| Vercel | Specialized frontend hosting, auto deployment from Git | Free (hobby) | Very low |
| Netlify | Same concept as Vercel | Free (hobby) | Very low |
| AWS Amplify Hosting | AWS service to host frontend apps, auto deployment from Git | Free (Free Tier) | Low |
When to choose: Almost always for a static frontend (built React, Vue, Angular). It’s faster (CDN = servers close to users), cheaper, and you have no server to manage.
Approach 2 — Nginx in a container (what we do in the Hands-on Project)
You build the frontend, then serve the files with nginx in a Docker container. This is what we do in Module 3.
When to choose: When you want everything in the same docker-compose to simplify the deployment, or when you need a custom reverse proxy (complex routing rules).
Approach 3 — Server-Side Rendering (Next.js, Nuxt, etc.)
If the frontend does SSR (HTML is generated server-side), then it needs a Node.js server running at all times. In that case, you treat it like a backend (EC2, ECS, App Runner, etc.).
When to choose: Critical SEO (e-commerce, blog), dynamic content that changes often.
The backend + database — From simplest to most robust
Option A: 1 server, Docker Compose (MVP / side project)

    1 EC2 (t3.small)
    ├── Frontend (nginx)
    ├── Backend (API container)
    └── PostgreSQL (container with volume)

Pros: Quick to set up, cheap (~$15/month), a single machine.
Cons: Single point of failure. DB in Docker = risky (no automatic backup). No scaling.
When to choose: MVP, side project, <100 users, minimal budget.
Option B: EC2 + RDS (serious startup)

    VPC
    ├── Public subnet
    │   └── EC2 (backend in Docker)
    └── Private subnet
        └── RDS PostgreSQL (automatic backups)
    + S3 + CloudFront (static frontend)

Pros: The DB is managed (backups, auto updates). Network separation. The frontend on CDN is fast and nearly free. You can add a 2nd EC2 + load balancer later.
Cons: More expensive (~$50-100/month). You manage the EC2s yourself (OS updates, Docker, etc.).
When to choose: App in production, real users, need for reliability, small team.
Option C: ECS Fargate (scaling without managing servers)

    VPC
    ├── Public subnet
    │   └── Application Load Balancer
    ├── Private subnet
    │   ├── ECS Fargate (backend containers, auto-scaling)
    │   └── RDS PostgreSQL Multi-AZ
    + S3 + CloudFront (frontend)
    + Route 53 (DNS)

ECS (Elastic Container Service) runs your Docker containers without you managing servers. Fargate = you give it a Docker image, define CPU/RAM, it launches the container somewhere in the cloud. You never see a machine.
Pros: Auto-scaling, no servers to manage, high availability. You push a Docker image and it’s deployed.
Cons: More expensive than bare EC2 (~$100-300/month). More complex configuration (task definitions, services, target groups…).
When to choose: Variable traffic, need for scaling, don’t want to manage EC2s.
Option D: AWS App Runner (the simplest for containers)

    App Runner (backend container)
    + RDS PostgreSQL
    + S3 + CloudFront (frontend)

App Runner is the simplest AWS service to run a container web app. You give it your Docker image (or your source code) and it handles everything: build, deployment, scaling, HTTPS, load balancing.
Pros: Ultra simple. No network configuration. Auto-scaling included. Automatic HTTPS.
Cons: Less control than ECS. It does not run inside your VPC by default (you configure a VPC connector to reach private resources such as RDS). More expensive at high traffic.
When to choose: You want to deploy fast, you don’t want to configure VPC/ALB/ECS, small team without a dedicated DevOps.
Option E: AWS Amplify (integrated frontend + backend)
Amplify is a complete platform that can host a static frontend AND a backend (via Lambda functions or a GraphQL API).
Pros: All-in-one: hosting, auth, API, database. Auto deployment from Git. Ideal for fullstack devs who don’t want to touch infra.
Cons: Strong vendor lock-in (you’re tied to the Amplify way of doing things). Less control. Can become limiting for complex architectures.
When to choose: Small fullstack project, rapid prototyping, no DevOps on the team.
Option F: Kubernetes / EKS (large scale)

    EKS (managed Kubernetes)
    ├── Backend deployments (auto-scaling)
    ├── Worker deployments
    ├── Ingress Controller (HTTP routing)
    + RDS Multi-AZ
    + S3 + CloudFront (frontend)
    + Helm for packaging

Pros: Massive scaling, portability (not locked to AWS), fine-grained orchestration.
Cons: Complex to operate. EKS costs ~$75/month just for the control plane. Over-engineering if you don’t have 10+ microservices.
When to choose: Many microservices, large DevOps team, need for multi-cloud portability.
The global comparison table
| Option | Complexity | Monthly cost* | Scaling | Server management | Use case |
|---|---|---|---|---|---|
| EC2 + Docker Compose | Low | ~$15 | No | Yes | MVP |
| EC2 + RDS | Medium | ~$50-100 | Manual | Yes | Serious startup |
| App Runner + RDS | Low | ~$30-80 | Auto | No | Small team, fast to prod |
| ECS Fargate + RDS | High | ~$100-300 | Auto | No | Variable traffic, scaling |
| Amplify | Low | ~$0-50 | Auto | No | Prototyping, solo fullstack |
| EKS (K8s) | Very high | ~$200+ | Auto | Partially | Microservices, large scale |
*Approximate costs for a modest-sized app.
Outside AWS — the alternatives
| Service | What it is | When to use |
|---|---|---|
| Railway / Render | PaaS (Platform as a Service). You push your code, they deploy. | Side projects, small apps, don’t want to deal with AWS |
| Fly.io | Edge containers (close to users). | Global APIs, low latency |
| DigitalOcean App Platform | Simple PaaS, cheaper than AWS. | SMBs, startups that want simplicity |
| GCP Cloud Run | Google’s equivalent of App Runner. Serverless containers. | Already on GCP |
| Azure Container Apps | Microsoft’s equivalent of App Runner. | Already on Azure |
In an interview, mentioning that alternatives exist shows that you don’t only know one provider.
What the recruiter expects
Not the perfect answer. They want to see that you:
- Ask questions before answering (budget, scale, constraints, team)
- Know multiple options and can compare them (not just “EC2 and that’s it”)
- Separate concerns: the static frontend doesn’t need a server, the DB should be managed
- Can explain the trade-offs: simplicity vs. control vs. cost vs. scaling
- Don’t suggest Kubernetes for 50 users — but you can explain when K8s makes sense
Scenario 2 — The site is down in production
“It’s 2 PM, you get an alert: the site isn’t responding. Users are complaining. What do you do?”
The method (from broadest to most specific)
Step 1 — Confirm and scope the problem (30 seconds)

    # Is the site responding?
    curl -I https://mysite.com
    # If timeout → network/DNS/server down problem
    # If 502 → the proxy server is running but the app behind it is down
    # If 500 → the app is running but failing on requests

    # Is it just me or everyone?
    # Test from another network / a colleague

Step 2 — Check the infrastructure (2 minutes)

    # Is the server up?
    ssh user@server
    # If the connection times out or is refused → the server is down, sshd is not running, or the SSH port is blocked
    # → Check in the AWS console: instance running? Security Group OK?

    # Are resources OK?
    top      # CPU, RAM
    df -h    # Disk full?

Step 3 — Check the services (2 minutes)

    docker ps                        # Are the containers running?
    docker logs --tail 100 backend   # Recent errors?
    systemctl status nginx           # Is the reverse proxy running?

Step 4 — Check dependencies

    # Is the database responding?
    docker exec -it db psql -U user -c "SELECT 1;"

    # Are external services responding?
    curl https://external-api-we-use.com/health

Step 5 — Fix and communicate
- Fix the problem (restart the service, free up disk space, rollback the last deployment…)
- Communicate: notify the team, update the status page
- After the incident: write a post-mortem (what happened, why, how to prevent it from happening again)
The most common causes
| Symptom | Likely cause | Quick fix |
|---|---|---|
| Total timeout | Server down or Security Group | Restart the instance, check network rules |
| 502 Bad Gateway | The app crashed behind the proxy | docker restart backend, check the logs |
| 500 Internal Error | Bug in the code or DB unreachable | App logs, check the DB connection |
| Very slow site | CPU/RAM saturated, slow DB queries | top, check slow queries |
| Disk full | Logs accumulating, Docker images | df -h, docker system prune, log rotation |
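The first rows of the table can be read as a small lookup on what `curl -I` returns. Here is a minimal Python sketch of that reading. The helper name `triage` is hypothetical, and the mapping only restates the table: treat it as a memory aid, not as a diagnostic tool.

```python
from typing import Optional


def triage(status: Optional[int]) -> str:
    """Map the result of `curl -I` to the first thing worth checking.

    `status` is the HTTP status code, or None when the request timed out.
    `triage` is a hypothetical helper that mirrors the table above.
    """
    if status is None:
        return "timeout: check that the server is up and the Security Group allows traffic"
    if status == 502:
        return "502: the proxy is up but the app behind it is down, check the container and its logs"
    if 500 <= status < 600:
        return "5xx: the app is failing, check the app logs and the DB connection"
    return "no server error: look at latency, CPU/RAM and slow queries instead"


if __name__ == "__main__":
    for code in (None, 502, 500, 200):
        print(code, "->", triage(code))
```

The order of the checks matters: the 502 case is tested before the generic 5xx case, because it points at the proxy/app boundary rather than at the application code.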
What the recruiter expects
- A structured method, not panic
- You start by checking, not by changing things
- You communicate with the team during debugging
- You mention post-mortem (learning after the incident)
Scenario 3 — Setting up a CI/CD pipeline
“The team of 5 devs deploys manually via SSH. It takes 30 min and breaks one out of three times. How do you improve this?”
The concrete problem
Today:
- A dev finishes their code
- They SSH into the server
- They run git pull on the server
- They restart the app manually
- They cross their fingers
Problems: no tests before deployment, no rollback possible, only one dev knows how to do it, it breaks often.
The progressive solution
Phase 1 — CI (1-2 days to set up)

    # On every push to main:
    Lint → Tests → Build Docker image → Push to registry

- Devs get immediate feedback: “your code breaks the tests”
- You never deploy code that doesn’t compile or doesn’t pass tests
- Impact: we stop deploying broken code
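The key property of the Phase 1 chain is that it fails fast: a red stage blocks everything after it. Here is a minimal Python sketch of that behavior. The stage names and the `run_pipeline` helper are hypothetical; a real pipeline would live in your CI tool's configuration and shell out to a linter, a test runner, and `docker build`.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure (fail fast).

    Returns the names of the stages that were actually executed.
    """
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            break  # a failed stage blocks everything after it
    return executed


# Hypothetical stages standing in for real commands.
stages = [
    ("lint", lambda: True),
    ("tests", lambda: False),  # simulate a failing test suite
    ("build", lambda: True),
    ("push", lambda: True),
]

print(run_pipeline(stages))  # ['lint', 'tests']: build and push never run
```

With a failing test suite, no image is built or pushed, which is exactly why broken code stops reaching the server.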
Phase 2 — CD to a staging environment (3-5 days)

    # After CI:
    Automatic deploy to a staging server

- Devs and the product owner test on staging before production
- Staging is a copy of production (same config, same infra)
- Impact: we test in real conditions before production
Phase 3 — CD to production (when the team is confident)

    # If staging is OK (tests pass, QA validated):
    Manual approval → Deploy to prod

- A human validates before production (Continuous Delivery, not Deployment)
- Automatic rollback if the health check fails
- Impact: deployment in 5 min instead of 30, no SSH connection needed
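The "automatic rollback if the health check fails" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: `deploy_with_rollback` and the `health_check` callback are hypothetical names, and a real check would call something like a `/health` endpoint on the freshly deployed version.

```python
from typing import Callable


def deploy_with_rollback(
    current: str,
    candidate: str,
    health_check: Callable[[str], bool],
) -> str:
    """Return the version that ends up live after a deployment attempt."""
    live = candidate  # switch to the new version
    if not health_check(live):
        live = current  # health check failed: restore the previous version
    return live


# Hypothetical usage: the health check rejects v1.1, so v1.0 stays live.
print(deploy_with_rollback("v1.0", "v1.1", lambda version: version != "v1.1"))
```

The point to make in an interview: rollback is only "automatic" if the previous version is still available (an image tag in the registry, for example) and the health check is trustworthy.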
Why not automate everything at once?
Because trust is built progressively. If tests don’t cover enough cases, a 100% automated deployment to production will deploy bugs faster. Phase 1 → Phase 2 → Phase 3 lets the team build confidence at each step.
What the recruiter expects
- You don’t suggest “let’s set up Kubernetes” right away
- You think progressively (quick wins first)
- You mention staging (never directly to production)
- You mention rollback
Scenario 4 — Managing secrets
“A dev committed a database password to the Git repo. What do you do?”
Immediate reaction (emergency)
- Change the password immediately — the absolute priority. Even if “nobody saw it”, consider it compromised.
- Check access — has anyone used this password since the commit?
- Remove from Git — careful, a simple git rm is NOT enough. The password stays in the history. You’d need to rewrite the history (git filter-repo, bfg, or the older git filter-branch), but that’s heavy. The most important thing is point 1: change the password.
Set up protections
| Measure | What it does |
|---|---|
| .gitignore | Ignore .env files, credentials.json, etc. |
| Pre-commit hook | Scan changes BEFORE they’re committed (tools: gitleaks, detect-secrets) |
| GitHub Secret Scanning | GitHub automatically detects committed secrets and alerts you |
| Environment variables | Secrets live in the server’s env, not in the code |
| Secrets manager | AWS Secrets Manager, HashiCorp Vault — secure and centralized storage |
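To make the pre-commit idea concrete, here is a deliberately tiny Python sketch of what a secret scanner does: match text against known patterns. The two patterns and the `find_secrets` helper are illustrative assumptions, not the rule set of gitleaks or detect-secrets, which ship many more rules and handle false positives far better.

```python
import re
from typing import List

# Two illustrative patterns only; dedicated tools ship far more rules.
PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal assigned to something whose name ends in "password".
    "hardcoded_password": re.compile(
        r"password\s*[:=]\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def find_secrets(text: str) -> List[str]:
    """Return the names of the patterns that match somewhere in `text`."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]


# A hardcoded literal is flagged; reading from the environment is not.
print(find_secrets('DB_PASSWORD = "hunter2"'))
print(find_secrets('password = os.environ["DB_PASSWORD"]'))
```

The second line shows the pattern the table recommends: the code only names the variable, and the value lives in the server's environment or in a secrets manager.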
The rule
Code is public by default (even a private repo can leak). Secrets must never be in the code. Period.
Scenario 5 — Choosing the right infrastructure for each project
“We have 4 projects to host. How do you choose the infrastructure for each one?”
Project A: Internal REST API with 1,000 requests/day
Context: API used by an internal mobile app. Low traffic, minimal budget, one person to maintain.
Best choice: Lambda + API Gateway
Why: very low traffic, no need for a server running 24/7. Lambda = you only pay when a request comes in. Cost: nearly $0 (Free Tier). API Gateway handles HTTPS, rate limiting, and routing.
Possible alternatives:
- App Runner: if the API is containerized and you want something simple without adapting the code for Lambda. Slightly more expensive but zero code adaptation.
- EC2: overkill. You’re paying for a 24/7 server for 1,000 requests/day — that’s waste.
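A quick back-of-the-envelope check shows why Lambda lands at "nearly $0" for Project A. The sketch below is Python arithmetic under loudly stated assumptions: the average duration (200 ms) and memory size (256 MB) are hypothetical, and the free-tier allowances (1 million requests and 400,000 GB-seconds per month) should be verified against current AWS pricing before you quote them.

```python
from typing import Tuple

# Assumption: Lambda free-tier allowances per month; check current AWS pricing.
FREE_REQUESTS_PER_MONTH = 1_000_000
FREE_GB_SECONDS_PER_MONTH = 400_000


def monthly_usage(
    requests_per_day: int,
    avg_duration_s: float,
    memory_gb: float,
    days: int = 30,
) -> Tuple[int, float]:
    """Return (requests, GB-seconds) consumed in one month."""
    requests = requests_per_day * days
    gb_seconds = requests * avg_duration_s * memory_gb
    return requests, gb_seconds


# Project A: 1,000 requests/day. Duration and memory are hypothetical values.
requests, gb_seconds = monthly_usage(1_000, avg_duration_s=0.2, memory_gb=0.25)
within_free_tier = (
    requests <= FREE_REQUESTS_PER_MONTH
    and gb_seconds <= FREE_GB_SECONDS_PER_MONTH
)
print(f"{requests} requests, {gb_seconds:.0f} GB-seconds, free tier: {within_free_tier}")
```

Under these assumptions the API uses about 30,000 requests and 1,500 GB-seconds a month, far below both allowances. API Gateway is billed separately, so the total is small rather than strictly zero.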
Project B: Web SaaS with 10,000 users/day
Context: Web application (React + API + PostgreSQL). Regular traffic during the day, low at night. Team of 5 devs. Needs reliability.
Best choice: ECS Fargate + RDS + S3/CloudFront

    CloudFront (CDN) → S3 (static frontend)
    ALB → ECS Fargate (API containers, auto-scaling)
           └── RDS PostgreSQL (private subnet)

Why: regular traffic, the app must run at all times, persistent connection to the DB. ECS Fargate = no servers to manage, auto-scaling for spikes. RDS = managed DB.
Possible alternatives:
- EC2 + RDS: cheaper, but you manage the servers (updates, Docker, monitoring). Good choice if the budget is tight and someone on the team knows how to manage servers.
- App Runner + RDS: simpler than ECS, but less control over the network (VPC peering, custom security groups). Good for a quick v1.
- Lambda: technically possible, but cold starts degrade the user experience, and DB connections are complicated to manage (you need RDS Proxy).
Project C: Processing uploaded files (resizing images)
Context: Users upload photos. They must be resized to 3 sizes and stored. Variable volume: sometimes 10 uploads/day, sometimes 10,000.
Best choice: Lambda + S3 (event-driven architecture)

    User → upload → S3 bucket (originals)
                        │
                        └── trigger Lambda → resize → S3 bucket (results)

Why: purely event-driven. A file arrives in S3 → Lambda triggers automatically → processes the file → writes the result to a second S3 bucket (a separate bucket avoids the Lambda re-triggering itself). No need for a server between uploads. Automatic scaling (100 uploads at the same time → 100 Lambdas in parallel).
Possible alternatives:
- ECS with an SQS queue: if processing takes >15 min (Lambda’s limit) or requires a lot of memory (>10 GB). SQS = queue, ECS = workers that consume the queue.
- Step Functions + Lambda: if processing has multiple steps (resize → watermark → optimize → notify). Step Functions orchestrates the Lambdas.
Project D: Company showcase site / blog
Context: Marketing site with static content. No custom backend, just content that rarely changes. Near-zero budget.
Best choice: Amplify Hosting (or Vercel / Netlify)
Why: it’s static content. No need for a server, a container, or anything complex. You push to Git, the site is automatically deployed on a worldwide CDN.

    Git push → Amplify Hosting → Worldwide CDN → users

Cost: free (Amplify Free Tier, or Vercel/Netlify free plan).
Possible alternatives:
- S3 + CloudFront: same result, manual configuration. Better if you want full control on the AWS side.
- EC2 with nginx: absolute overkill. A 24/7 server to serve HTML files — that’s a waste of money and time.
The decision table
| Criteria | Lambda | App Runner | ECS Fargate | EC2 | Amplify / Vercel |
|---|---|---|---|---|---|
| Traffic | Sporadic | Low constant | Variable / high | Constant | Static |
| Execution duration | < 15 min | Unlimited | Unlimited | Unlimited | N/A |
| Stateful | No | No | Yes | Yes | No |
| DB connection | Complicated | Easy | Easy | Easy | No (or via API) |
| Scaling | Auto, instant | Auto | Auto (configurable) | Manual / ASG | Auto (CDN) |
| Server management | None | None | None | You | None |
| Low traffic cost | ~$0 | ~$5-15/month | ~$20-50/month | ~$15-30/month | ~$0 |
| High traffic cost | Can spike | Medium | Predictable | Predictable | Low (CDN) |
| Config complexity | Low | Very low | High | Medium | Very low |
What the recruiter expects
- You don’t give the same answer for all 4 projects
- You justify with concrete criteria (traffic, duration, cost, state, team)
- You know the limits of each solution AND the alternatives
- You know that “the best choice” depends on the context — there is no universal answer
- You separate static frontend / backend / async processing: each has a different solution
Scenario 6 — Infrastructure as Code: a colleague modified the infra by hand
“Your team uses Terraform. You run terraform plan and you see changes that nobody made in the code. What’s happening and how do you handle it?”
What happened
Someone modified the infrastructure directly in the AWS console (added a Security Group rule, changed an instance type, etc.) without going through Terraform. The real infrastructure no longer matches the Terraform code and state.
This is called drift.
How to resolve it
Option A — Import the change into Terraform (if the change is intentional)

    # 1. Identify what changed
    terraform plan
    # ~ aws_security_group.web will be updated in-place
    # - ingress rule for port 3306 (added manually)

    # 2. Add the rule in the Terraform code so it matches reality
    # 3. Re-plan → no changes → code and reality are synchronized
    terraform plan
    # No changes.

Option B — Force a return to the code (if the change is a mistake)

    # terraform apply will put the infra back to the state described by the code
    terraform apply
    # The manual change will be overwritten

Prevent the problem
- Team rule: you NEVER touch the console to modify the infrastructure. Everything goes through code + pull request.
- Restrictive IAM: limit modification permissions in the console for production environments.
- Drift detection: run terraform plan regularly (in CI) to detect drift.
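For the scheduled drift check, `terraform plan` has a `-detailed-exitcode` flag: exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. Here is a minimal Python sketch of a CI job built on it. The function names and the working directory are hypothetical, and the sketch assumes the job runs on a branch where everything has already been applied, so any pending change can be treated as drift.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in sync"  # no changes: code and reality match
    if code == 2:
        return "drift"    # the plan contains changes nobody applied via code
    return "error"        # 1 (or anything else): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)


if __name__ == "__main__":
    # In CI you would alert the team (or fail the job) on "drift" or "error".
    print(check_drift("."))
```

Keeping the exit-code interpretation in its own function makes the rule easy to unit test without a Terraform installation.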
Scenario 7 — Monitoring and alerting
“Your app has been running in production for 3 months. The CTO tells you: ‘We have users complaining it’s slow but we don’t know why.’ How do you set up monitoring?”
Step 1 — Define what you want to measure
The 4 Golden Signals (from Google’s SRE book):
| Signal | Question | Example metric |
|---|---|---|
| Latency | Is it fast? | Response time at the 95th percentile |
| Traffic | How many people? | Requests per second |
| Errors | Does it work? | 5xx error rate |
| Saturation | Is it full? | CPU, RAM, disk, DB connections |
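Why the latency signal uses a percentile rather than the average is easier to see with numbers. The Python sketch below uses made-up latencies (90 fast requests, 10 very slow ones) and a nearest-rank percentile. Prometheus and Grafana compute percentiles differently (from histogram buckets), so this only illustrates the idea.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up data: 90 requests at 100 ms, 10 requests at 3000 ms.
latencies_ms = [100] * 90 + [3000] * 10

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean} ms")                          # 390.0 ms
print(f"p50  = {percentile(latencies_ms, 50)} ms")  # 100 ms
print(f"p95  = {percentile(latencies_ms, 95)} ms")  # 3000 ms
```

The mean (390 ms) looks acceptable and the median looks great, yet one request in ten takes 3 seconds. Only the p95 makes that visible.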
Step 2 — Instrument the app

    App → exposes /metrics → Prometheus scrapes → Grafana displays

- Add the Prometheus library to the app (for our project: prometheus-fastapi-instrumentator)
- Deploy Prometheus + Grafana (docker-compose is the simplest)
Step 3 — Create the dashboards
One dashboard per “audience”:
- Technical dashboard: latency, errors, CPU, RAM, DB slow queries
- Business dashboard: number of active users, number of tasks created (for the CTO)
Step 4 — Configure alerts
Good alerts:
- “The 5xx error rate exceeds 5% for 5 minutes” → actionable (there’s a bug or a service is down)
- “The p95 response time exceeds 2 seconds for 10 minutes” → actionable (degraded performance)
Bad alerts:
- “CPU at 80%” → not actionable on its own (80% CPU might be normal if the app runs fine)
- “A single 404 error” → noise (a user typed a wrong URL, that’s normal)
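The "for 5 minutes" part of a good alert is what filters out short spikes. Here is a minimal Python sketch of that rule, assuming one error-rate sample per minute. The `should_fire` helper is hypothetical and only mimics the idea of a sustained condition; in practice the alerting tool evaluates this for you.

```python
from typing import Sequence


def should_fire(error_rates: Sequence[float], threshold: float, for_samples: int) -> bool:
    """True only if the last `for_samples` samples are all above `threshold`."""
    if len(error_rates) < for_samples:
        return False  # not enough history yet to call it sustained
    return all(rate > threshold for rate in error_rates[-for_samples:])


# One sample per minute (assumption). Threshold: 5% of requests in 5xx.
spike = [0.01, 0.09, 0.01, 0.02, 0.01]      # one bad minute, then recovery
sustained = [0.06, 0.07, 0.08, 0.06, 0.09]  # five bad minutes in a row

print(should_fire(spike, threshold=0.05, for_samples=5))      # False
print(should_fire(sustained, threshold=0.05, for_samples=5))  # True
```

A single bad minute does not page anyone; five bad minutes in a row does. That is the difference between noise and an actionable alert.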
What the recruiter expects
- You know the Golden Signals or a similar framework
- You distinguish between technical and business metrics
- You know that an alert must be actionable
- You don’t suggest monitoring 200 metrics at once
Scenario 8 — Blue-green / Canary deployment
“How do you deploy to production without downtime and without risking breaking it for all users?”
Option A — Blue-Green

                        ┌─── Blue (v1.0 — current) ◄── 100% of traffic
    Load Balancer ──────┤
                        └─── Green (v1.1 — new)    ◄── 0% of traffic

- You deploy v1.1 to Green (while Blue still serves users)
- You test Green (smoke tests, sanity check)
- You switch the load balancer: Green receives 100% of the traffic
- If it works → you delete Blue. If it breaks → you switch back to Blue in 10 seconds.
Pros: Instant rollback. Zero downtime.
Cons: Double infrastructure during the transition (cost). Problem if the DB schema changed between v1.0 and v1.1.
Option B — Canary

                        ┌─── v1.0 ◄── 95% of traffic
    Load Balancer ──────┤
                        └─── v1.1 ◄── 5% of traffic (the "canaries")

- You deploy v1.1 to a few instances
- You send 5% of traffic to v1.1
- You monitor the metrics (errors, latency)
- If everything is fine → 25% → 50% → 100%. If it breaks → 0% and rollback.
Pros: You detect bugs with limited impact (5% of users).
Cons: More complex to set up. Requires good monitoring to detect issues.
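The canary progression (5% → 25% → 50% → 100%, or back to 0% on failure) can be sketched as a simple loop. This Python example is an illustration under assumptions: `canary_rollout` and the `is_healthy` callback are hypothetical, and in a real setup the health decision would come from your error-rate and latency metrics while the weights would be applied on the load balancer.

```python
from typing import Callable, List, Sequence


def canary_rollout(steps: Sequence[int], is_healthy: Callable[[int], bool]) -> List[int]:
    """Walk through traffic percentages and return the history of weights applied.

    If the canary is unhealthy at any step, traffic goes back to 0 and the rollout stops.
    """
    history = []
    for percent in steps:
        history.append(percent)  # shift this share of traffic to the new version
        if not is_healthy(percent):
            history.append(0)    # rollback: all traffic returns to the old version
            break
    return history


# Healthy all the way: full rollout.
print(canary_rollout([5, 25, 50, 100], lambda percent: True))
# Hypothetical failure that only shows up under heavier load, at 50%.
print(canary_rollout([5, 25, 50, 100], lambda percent: percent < 50))
```

The second call ends with `[5, 25, 50, 0]`: the problem is caught at 50% and the remaining users never see the new version.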
Option C — Rolling Update
This is what Kubernetes does by default. You replace instances one by one:

    Start:  [v1.0] [v1.0] [v1.0] [v1.0]
    Step 1: [v1.1] [v1.0] [v1.0] [v1.0]
    Step 2: [v1.1] [v1.1] [v1.0] [v1.0]
    Step 3: [v1.1] [v1.1] [v1.1] [v1.0]
    End:    [v1.1] [v1.1] [v1.1] [v1.1]

Pros: Simple, native in K8s, no double infrastructure.
Cons: Slower rollback. During the transition, two versions coexist.
Which one to choose?
| Strategy | Complexity | Rollback | Use case |
|---|---|---|---|
| Blue-Green | Medium | Instant | Critical apps, few deployments |
| Canary | High | Fast | High-traffic apps, need to test in real conditions |
| Rolling | Low | Medium | Most cases, K8s default |