
Interview Questions

This file is split into two parts:

  1. Quick definitions: the classic “what is X” questions, so you don’t blank out on the basics. Start here.
  2. Scenario-based questions: real problems you’ll be asked in interviews to see how you think.

Part 1: Definitions and technical questions


For each technology, these are the questions you’re likely to be asked in interviews.

How to use this section:

  1. Read the question and try to answer it yourself (out loud is best)
  2. If you’re stuck, open the hint — it gives you a lead without giving away the answer
  3. Open the answer to compare with yours

Q: What is Git?

💡 HintThink of a backup system, with a history and the ability to work as a team.
✅ AnswerA distributed version control system. It keeps a history of every code change and lets multiple people work together without stepping on each other's toes.

Q: Merge vs Rebase?

💡 HintBoth are used to integrate changes from one branch into another. One preserves the history as-is, the other rewrites it.
✅ AnswerMerge preserves the history and creates a merge commit. Rebase rewrites it: you get a cleaner, linear history, but the rebased commits get new hashes, so it's risky on a branch other people have already pulled.

Q: Pull vs Fetch?

💡 HintBoth retrieve remote changes. One applies them directly, the other doesn't.
✅ AnswerFetch downloads remote changes without applying them. Pull = fetch + merge. Pull applies them directly.

Q: A colleague and you modified the same line — what happens?

💡 HintGit can't decide on its own which version to keep.
✅ AnswerA merge conflict. Git shows you both versions, you choose which one to keep (or combine both), then you commit the resolution.

Q: What’s your team’s Git workflow?

💡 HintThink of the cycle: create a branch → work → push → request a review → merge.
✅ AnswerWe create a branch per feature, commit on it, push, and open a Pull Request. A colleague reviews the code, and if it's good we merge into main. Nobody pushes directly to main — everything goes through a PR.

Q: What’s the difference between git add and git commit?

💡 HintOne prepares files, the other saves them. Think of a box you fill then seal.
✅ Answergit add stages files (staging area), git commit saves them in the history. It's like putting items in a box (add) then sealing and labeling the box (commit).

Q: What is a branch?

💡 HintThink of a parallel universe of the code where you can work without touching the main version.
✅ AnswerAn independent line of development. Technically it's a lightweight pointer to a commit, not a full copy of the code. You develop on it without touching the main branch (main). When it's ready, you merge.

Q: What is a Pull Request?

💡 HintIt's the mechanism to propose your changes to the team before integrating them into the main branch.
✅ AnswerA request to merge code. You create a branch, work on it, and when it's ready you open a PR on GitHub. A colleague reviews your code (code review), and if it's good, you merge into main. It ensures code is checked before it reaches production.

Q: Explain the 755 permissions

💡 Hint3 blocks of 3 permissions (read, write, execute) for 3 categories of people. Each permission has a numeric value.
✅ Answer3 blocks (owner/group/others). read=4, write=2, execute=1. 755 = owner can do everything (7), group and others can read and execute (5).
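
If you want to check your mental arithmetic, here is a minimal Python sketch that turns an octal mode into the rwx string you see in ls -l. The helper name describe_mode is made up for this example; stat.filemode comes from the standard library.

```python
import stat

def describe_mode(octal_text: str) -> str:
    """Turn an octal permission string such as "755" into "rwxr-xr-x"."""
    mode = int(octal_text, 8)
    # stat.filemode expects a full st_mode, so we add the "regular file" bit.
    # The result starts with "-" (the file type), which we strip off.
    return stat.filemode(stat.S_IFREG | mode)[1:]

print(describe_mode("755"))  # rwxr-xr-x
print(describe_mode("644"))  # rw-r--r--
```

Each digit is just the sum of read (4), write (2) and execute (1) for one category, so 7 = 4+2+1 and 5 = 4+1.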

Q: What is an environment variable?

💡 HintThink of a way to pass configuration to an application without putting it in the code.
✅ AnswerA value stored in the system, accessible by programs. Used to pass configuration (database URL, API keys, debug mode) without putting it in the code. export MY_VAR="value" to create one.

Q: A process is eating all the CPU — how do you find and kill it?

💡 HintThere's a command to list processes sorted by consumption, and another to stop a process by its number (PID).
✅ Answertop or ps aux --sort=-%cpu to find it, kill <PID> to ask it to stop cleanly (SIGTERM), kill -9 <PID> to force it (SIGKILL) if that's not enough.

Q: How do you check disk space on a server?

💡 HintA short command with a flag that makes the output human-readable.
✅ Answerdf -h — shows used and available disk space on each partition. The -h = human-readable (GB, MB instead of bytes). A full disk is a frequent cause of production crashes.

Q: How do you see which process is listening on a port?

💡 HintThe ss command with the right flags, combined with grep to filter.
✅ Answerss -tlnp | grep <port> — shows the process listening on that port.

Q: What is sudo?

💡 HintThink "Run as administrator" on Windows.
✅ Answer"Super User DO" — run a command as administrator (root). Needed to install software, modify system config, etc.

Q: Difference between > and >>?

💡 HintBoth redirect command output to a file. One overwrites, the other doesn't.
✅ Answer> overwrites the file. >> appends to the end. Example: echo "log" > file.txt replaces the content, echo "log" >> file.txt adds a line.
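
The same overwrite/append distinction exists in most languages. As a rough analogy (not what the shell does internally), Python's file modes "w" and "a" behave like > and >>. The file name and the helper names below are only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file.txt")

def write(text: str, mode: str) -> None:
    # mode "w" truncates the file first (like >), mode "a" appends (like >>)
    with open(path, mode) as f:
        f.write(text + "\n")

def read() -> str:
    with open(path) as f:
        return f.read()

write("first", "w")
write("second", "a")      # the file now holds two lines
after_append = read()

write("third", "w")       # the previous content is gone
after_overwrite = read()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
```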

Q: How do you view a service’s logs?

💡 HintThere's a specific command for systemd services, and a classic directory for system logs.
✅ Answerjournalctl -u service_name for systemd services, or look in /var/log/ for classic system logs.

Q: What is a process?

💡 HintEvery program running on your machine is one. Each has a unique number.
✅ AnswerA program currently running. When you run python3 main.py, it creates a process. Each process has a unique number (PID). You can see them with ps aux or top.

Q: What is the PATH?

💡 HintIt's what the system checks when you type a program name in the terminal. If the program isn't there...
✅ AnswerAn environment variable containing the list of directories where the system looks for programs. When you type python3, Linux searches through PATH directories to find the file. If you get "command not found", it's often because the program isn't in the PATH.
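
To make the lookup concrete, here is a small Python sketch that does roughly what the shell does: split PATH into directories and search them in order. The function name locate and the fake program name are assumptions for the example; shutil.which is the standard-library helper doing the search.

```python
import os
import shutil

# PATH is one string of directories separated by os.pathsep (":" on Linux)
directories = os.environ.get("PATH", "").split(os.pathsep)

def locate(program: str) -> str:
    """Return where `program` would be found, or a 'command not found' message."""
    found = shutil.which(program)  # searches the PATH directories in order
    return found if found else f"{program}: command not found"

print(len(directories), "directories in PATH")
print(locate("python3"))
print(locate("no-such-program-xyz"))
```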

Q: What is an IP address?

💡 HintIt's an identifier. There are two types depending on whether you're on the Internet or on a local network.
✅ AnswerAn identifier for a machine on the network. Public = visible on the Internet. Private = visible only on the local network.

Q: What is a port?

💡 HintA machine can run multiple services (web, SSH, database). The port identifies which one.
✅ AnswerA number (1-65535) that identifies a service on a machine. 22=SSH, 80=HTTP, 443=HTTPS, 5432=PostgreSQL.

Q: What is DNS?

💡 HintThink of a phone book that translates something human-readable into something machine-readable.
✅ AnswerThe system that translates domain names (google.com) into IP addresses. Without DNS, you'd have to remember the IP of every website.

Q: Difference between TCP and UDP?

💡 HintOne is reliable but slower, the other is fast but doesn't verify anything. Think HTTP vs video streaming.
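✅ AnswerTCP is reliable: it establishes a connection, retransmits lost packets, and delivers data in order. UDP is faster because it skips all of that: no connection, no delivery or ordering guarantee. HTTP/1.1 and HTTP/2 run over TCP, video streaming often uses UDP.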

Q: A user tells you “the site doesn’t work” — where do you start?

💡 HintA command that gives you the HTTP response code. The code tells you what type of problem it is (network, proxy, code).
✅ Answercurl the site to see the response code (200, 502, timeout). If timeout → network/DNS problem. If 502 → the app behind the proxy is down. If 500 → bug in the code.
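
The decision tree in that answer fits in a few lines. This is only a sketch of the reasoning, with a hypothetical triage function; it covers exactly the cases mentioned above and nothing more.

```python
from typing import Optional

def triage(status: Optional[int]) -> str:
    """Suggest where to look first. None means the request timed out."""
    if status is None:
        return "network/DNS problem"
    if status == 502:
        return "the app behind the proxy is down"
    if status == 500:
        return "bug in the application code"
    return "no obvious server-side failure, look elsewhere"

for code in (200, 500, 502, None):
    print(code, "->", triage(code))
```

In a terminal, curl -s -o /dev/null -w "%{http_code}" followed by the URL prints just the status code, which is the input this function expects.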

Q: What happens when you type a URL in your browser?

💡 Hint5 steps: address resolution, request, server-side processing, response, rendering. Think DNS, HTTP, and the browser.
✅ Answer1. DNS resolution: the browser asks a DNS resolver to translate the domain name into an IP address. 2. Sending the request: the browser opens a connection to that IP and sends an HTTP request to the server. 3. Server-side processing: the server receives the request and prepares the response. 4. Server response: the server sends back the content (HTML, CSS, JS, or JSON data). 5. Rendering: the browser assembles and displays the page.

Q: What is a CIDR /24?

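💡 HintIt's a notation to describe a subnet. The number after / is the prefix length: how many of the 32 bits are fixed. The remaining bits determine how many IP addresses are available.
✅ AnswerA subnet of 256 IP addresses: 32 - 24 = 8 bits are left for hosts, and 2^8 = 256. Example: 10.0.1.0/24 = 10.0.1.0 to 10.0.1.255. The higher the number after /, the fewer addresses.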
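
You can verify CIDR math with Python's standard ipaddress module. A minimal sketch, using the same example network as above:

```python
import ipaddress

net = ipaddress.ip_network("10.0.1.0/24")
print(net.num_addresses)   # 256
print(net[0], net[-1])     # 10.0.1.0 10.0.1.255

# Each extra prefix bit halves the block: 32 - prefix = number of host bits
for prefix in (16, 24, 28):
    print(f"/{prefix}", 2 ** (32 - prefix))
```

The loop prints 65536, 256 and 16, which shows why a higher number after the slash means a smaller subnet.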

Q: What is a firewall?

💡 HintThink of a bouncer controlling who enters and exits a building.
✅ AnswerA filter that controls incoming and outgoing network traffic. It allows or blocks traffic based on rules (port, source IP, protocol). On Linux, ufw is a simple tool to configure the firewall.

Q: What does a 502 code mean?

💡 HintIt's a proxy problem — the server receiving your request can't reach the server behind it.
✅ AnswerBad Gateway — the proxy/load balancer server can't reach the application server behind it. Common cause: the application has crashed.

Q: Difference between HTTP and HTTPS?

💡 HintThe S stands for "Secure". Think of the padlock in the browser's address bar.
✅ AnswerHTTPS = HTTP + encryption (TLS/SSL). Data is encrypted between your browser and the server — nobody can read it in transit. The padlock in the browser = HTTPS. Today, every serious site must use HTTPS.

Q: What is a reverse proxy?

💡 HintIt's an intermediate server between users and your application. It can do several useful things (traffic distribution, HTTPS, caching).
✅ AnswerA server that sits in front of your application and receives requests on its behalf. It can distribute traffic between multiple servers, handle HTTPS, cache content, etc. Nginx is one of the most common reverse proxies.

Q: What is a load balancer?

💡 HintIf you have multiple servers, how do you distribute requests between them?
✅ AnswerA tool that distributes traffic across multiple servers. If you have 3 backend servers, the load balancer sends each request to a different server to spread the load. If a server goes down, the load balancer stops sending traffic to it.
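
Here is a toy round-robin balancer in Python to make the idea tangible. It's a sketch under simple assumptions (in-memory server names, a manual mark_down call instead of real health checks), not how any particular load balancer is implemented.

```python
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # A real load balancer would decide this from failed health checks
        self.healthy.discard(server)

    def pick(self):
        # Walk the ring, skipping unhealthy servers; give up after one full lap
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy backend available")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("app-2")
second_round = [lb.pick() for _ in range(4)]

print(first_round)   # ['app-1', 'app-2', 'app-3']
print(second_round)  # ['app-1', 'app-3', 'app-1', 'app-3']
```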

Q: Difference between image and container?

💡 HintThink of a recipe vs a cooked dish. One is a template, the other is a running instance.
✅ AnswerImage = read-only template (the recipe). Container = running instance (the cooked dish). One image can create multiple containers.

Q: What is a Dockerfile?

💡 HintA text file with instructions. Think of the keywords: FROM, COPY, RUN, CMD.
✅ AnswerA text file that describes step by step how to build a Docker image. FROM for the base, COPY for files, RUN for commands, CMD for the startup command.

Q: A container keeps crashing in a loop — how do you debug it?

💡 HintThe first thing is always the logs. If the container isn't running anymore, there's a way to launch the image with a shell instead of the app.
✅ Answerdocker logs <container> to read the logs. If the container isn't running anymore, docker run -it --entrypoint bash <image> to get inside and investigate manually.

Q: Why use a multi-stage build?

💡 HintThe goal is the final image size. You separate the build phase and the runtime phase.
✅ AnswerTo reduce the final image size. You build in a heavy image (with build tools), then copy only the result into a lightweight image. A frontend image can go from around 500 MB to around 20 MB.

Q: How do containers communicate with each other in Docker Compose?

💡 HintDocker Compose automatically creates something that lets containers find each other by their service name.
✅ AnswerVia an internal network created automatically. Each container is accessible by its service name (e.g.: backend:8000, db:5432). It's service discovery through internal DNS.

Q: Difference between CMD and ENTRYPOINT?

💡 HintOne is replaced by the arguments you pass at launch, the other stays in place and receives those arguments. Which one is used 90% of the time?
✅ AnswerCMD = default command, replaced by any arguments you pass to docker run. ENTRYPOINT = fixed command, arguments from docker run are appended after it (you can still replace it explicitly with --entrypoint). In practice, CMD is enough 90% of the time.

Q: What is Docker?

💡 HintThink of a way to package an application with everything it needs to run the same way everywhere.
✅ AnswerA tool that packages an application with all its dependencies into an isolated container. The container runs the same way everywhere (your PC, a server, the cloud).

Q: What is Docker Compose?

💡 HintWhen you have multiple containers (backend, frontend, database), you need a tool to manage them together.
✅ AnswerA tool for managing multiple containers together with a YAML file. You define services, networks, and volumes, then docker compose up launches everything at once.

Q: What is a Docker volume?

💡 HintBy default, container data disappears when it's deleted. How do you persist data?
✅ AnswerPersistent storage that lives outside the container's own filesystem. Without a volume, data disappears when the container is deleted. Essential for databases: data survives when the container is removed and recreated.

Q: Difference between COPY and ADD in a Dockerfile?

💡 HintBoth copy files. One does more than the other — but is that always desirable?
✅ AnswerBoth copy files into the image. COPY does a simple copy. ADD can also auto-extract local tar archives (.tar.gz) and download from URLs. In practice, prefer COPY because it's more explicit, and use ADD only when you need one of those extras.

Q: What is a Docker registry?

💡 HintThink of GitHub, but for Docker images instead of source code.
✅ AnswerA server that stores Docker images. Docker Hub is the default public registry. In the workplace, private registries (AWS ECR, GitHub Container Registry) are often used to store your own images.

Q: Why does the order of instructions in a Dockerfile matter?

💡 HintDocker uses a layer caching system. If a layer changes, all subsequent layers are rebuilt.
✅ AnswerBecause of caching. Docker executes each instruction as a layer. If a layer hasn't changed, Docker reuses the cache. By putting COPY requirements.txt + RUN pip install BEFORE COPY . ., dependencies are only reinstalled when they actually change — not on every code modification.

Q: What is CI/CD?

💡 HintCI = before deployment (verify). CD = the deployment itself (deliver).
✅ AnswerCI (Continuous Integration) = automatic verification on every push (lint, tests). CD (Continuous Delivery or Deployment) = automatic (or semi-automatic) deployment. The goal: detect bugs as early as possible and deploy with confidence.

Q: What is “fail fast”?

💡 HintIf a quick step fails, do you still run the long steps?
✅ AnswerIf lint fails, you don't run the tests. If tests fail, you don't build. You stop as soon as a problem is detected to avoid wasting time.

Q: Where do you put secrets in a pipeline?

💡 HintNever in the code, never in the committed YAML. There's a dedicated place in GitHub/GitLab for that.
✅ AnswerIn the CI secrets (GitHub Secrets, GitLab CI/CD Variables). They're injected at runtime and can be masked in the logs, so they never sit in the repository.

Q: A test passes locally but fails in CI — why?

💡 HintThink about the differences between your machine and the CI runner: versions, environment variables, available services.
✅ AnswerOften an environment difference: different Python/Node version, missing environment variable, dependency not installed, or the test depends on a service (DB) that doesn't exist in CI.

Q: How do you rollback if a deployment breaks production?

💡 HintDocker images are tagged with the commit hash. How do you use that to go back?
✅ AnswerYou redeploy the previous Docker image. That's why we tag images with the commit hash — you can go back to any version in a few minutes.

Q: What are the stages of a typical CI/CD pipeline?

💡 Hint4 stages in order. If the first fails, the next ones don't run.
✅ AnswerLint (code quality) → Tests → Build (artifact construction) → Deploy. Each stage blocks the next if it fails.
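
The "each stage blocks the next" rule is easy to picture as a loop that stops at the first failure. A minimal Python sketch, where the stages are fake functions returning True or False (real pipelines run commands and check exit codes):

```python
from typing import Callable, List, Tuple

def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order and stop at the first failure. Returns the names that ran."""
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            print(f"{name} failed, skipping the remaining stages")
            break
    return executed

stages = [
    ("lint", lambda: True),
    ("tests", lambda: False),   # simulate a failing test suite
    ("build", lambda: True),
    ("deploy", lambda: True),
]
ran = run_pipeline(stages)
print(ran)  # ['lint', 'tests']
```

Build and deploy never run here, which is the whole point of fail fast: cheap checks go first so that expensive steps are not wasted on broken code.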

Q: Difference between Continuous Delivery and Continuous Deployment?

💡 HintBoth start with "Continuous D...". The difference: does a human press a button before prod?
✅ AnswerDelivery = every change is ready to deploy, but a human clicks the button. Deployment = every change that passes the pipeline goes to prod automatically. Many companies do Delivery (a human validates before prod).

Q: What is a runner?

💡 HintThe pipeline doesn't execute itself in a vacuum — it needs a machine to run on.
✅ AnswerThe machine (server) that executes pipeline jobs. GitHub provides hosted runners (e.g. ubuntu-latest), free for public repositories and within a monthly quota for private ones. You can also use self-hosted runners for more control.

Q: What is a blue/green deployment?

💡 HintTwo identical environments. One serves prod, the other waits for the new version. You switch traffic all at once.
✅ AnswerA deployment strategy with two identical environments. "Blue" serves prod, you deploy the new version to "green", test it, then switch the traffic. If it breaks, you switch back in seconds. Advantage: instant rollback.

Q: What is a canary deployment?

💡 HintInstead of deploying to everyone at once, you start with a small percentage. The name comes from canaries in coal mines.
✅ AnswerYou deploy the new version to a small percentage of servers (e.g., 5%). You monitor the metrics. If everything's fine, you gradually increase (25% → 50% → 100%). If it breaks, only 5% of users are impacted.
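
One way to picture a canary is as weighted routing. The sketch below is an assumption on my side: it splits requests by percentage rather than servers, and the route function and version labels are invented for the example.

```python
import random

def route(canary_percent: int, rng: random.Random) -> str:
    """Send roughly canary_percent % of requests to the new version (v2)."""
    return "v2" if rng.random() * 100 < canary_percent else "v1"

rng = random.Random(42)  # fixed seed so repeated runs behave the same
for percent in (5, 25, 50, 100):
    hits = sum(route(percent, rng) == "v2" for _ in range(10_000))
    print(f"target {percent}% -> observed {hits / 100:.1f}%")
```

Raising canary_percent step by step is the gradual rollout; setting it back to 0 is the rollback.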

Q: What is EC2?

💡 HintThink of renting a computer instead of buying one.
✅ AnswerA virtual server in the cloud. You choose the power (CPU, RAM), the OS, and you pay for the time the instance runs.

Q: How do you connect to an EC2?

💡 HintA remote connection protocol + a key file downloaded when the instance was created.
✅ AnswerVia SSH with a key pair: ssh -i ~/devops-key.pem ubuntu@PUBLIC_IP. The .pem key is downloaded when the instance is created.

Q: Your EC2 is unresponsive — what are the first things you check?

💡 Hint3 things: the instance itself (is it running?), the network (is the port open?), and the address (does it have a public IP?).
✅ Answer1. Is the instance "Running" in the AWS console? 2. Does the Security Group allow SSH (22) and HTTP (80) ports? 3. Does the instance have a public IP? 4. If everything looks fine on the AWS side, SSH in and check the app logs.

Q: What is a VPC?

💡 HintIt's your private network in AWS. You put your resources in it and control who can access what.
✅ AnswerVirtual Private Cloud — an isolated network in AWS. You put your resources in it (EC2, RDS). You control the subnets, routing, and access.

Q: Difference between public and private subnet?

💡 HintOne is accessible from the Internet, the other isn't. Think about where you'd put a web server vs a database.
✅ AnswerPublic = accessible from the Internet (via Internet Gateway). Private = no direct Internet access. You put web servers in public, databases in private.

Q: What is a Security Group?

💡 HintIt's like a firewall. It controls traffic by port and source. It's "stateful" — what does that mean?
✅ AnswerA virtual firewall attached to an instance. It filters inbound (ingress) and outbound (egress) traffic by port and source IP. "Stateful" = if you allow inbound traffic on a port, the outbound response is automatically allowed.

Q: What is an Internet Gateway?

💡 HintWithout it, your VPC is completely isolated from the Internet. It's the door between your private network and the outside world.
✅ AnswerThe door that connects your VPC to the Internet. Without an Internet Gateway, no resource in the VPC can access the Internet (and nobody can access it from the Internet).

Q: Why use RDS instead of installing PostgreSQL on an EC2?

💡 HintThink about everything you DON'T have to manage with RDS: backups, updates, high availability.
✅ AnswerRDS handles automatic backups, security updates, replication and high availability. You don't have to maintain the database server yourself. The extra cost is offset by the time saved.

Q: What is Multi-AZ on RDS?

💡 HintYour database is copied to a 2nd location. If the first one goes down...
✅ AnswerYour database is automatically replicated to a 2nd datacenter (Availability Zone). If the first one fails, the 2nd takes over automatically. That's high availability.

Q: How do you protect your database on AWS?

💡 HintThink about the subnet (where it's placed) and the Security Group (who's allowed to connect to it).
✅ AnswerYou put it in a private subnet (no public IP), with a Security Group that only allows port 5432 from the EC2's Security Group. Never direct access from the Internet.

Q: What is S3?

💡 HintFile storage in the cloud. Unlimited, high durability, cheap.
✅ AnswerSimple Storage Service — unlimited object (file) storage in the cloud. Used for backups, static files (images, CSS, JS from a frontend), logs, data exports.

Q: How do you secure an S3 bucket?

💡 HintBy default a bucket is private. The danger is making it public by mistake.
✅ AnswerBy default, an S3 bucket is private (that's good). You verify that "Block all public access" is enabled. You control access via bucket policies and IAM roles. Never public access unless for intentionally public static content (frontend).

Q: What is IAM?

💡 HintAWS's permission system. Who is allowed to do what.
✅ AnswerIdentity and Access Management. Manages users (Users), roles (Roles) and permissions (Policies). The key principle: least privilege — only grant the permissions that are strictly necessary.

Q: User vs Role — what’s the difference?

💡 HintOne is permanent (a person or a program), the other is temporary (you "assume" it when needed).
✅ AnswerUser = a permanent account for a person or a program (with long-lived credentials). Role = a set of permissions with no permanent credentials: a service or user "assumes" it and receives temporary credentials (e.g.: an EC2 that needs to access S3 uses a role, not a user).

Q: What is an IAM Policy?

💡 HintThink of the document that describes permissions. It's in JSON format.
✅ AnswerA JSON document that defines permissions: which actions (e.g.: s3:GetObject) are allowed or denied, on which resources (e.g.: a specific bucket). You attach it to a User, Group or Role to grant these rights.

Q: What is the principle of least privilege?

💡 HintA fundamental security rule: you grant the minimum permissions needed, nothing more.
✅ AnswerGrant only the permissions strictly necessary to do the job, and nothing more. If a Lambda only needs to read an S3 bucket, you give it only s3:GetObject on that specific bucket — not AdministratorAccess. This limits the damage if credentials are compromised.
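
To see what least privilege looks like on paper, here is a Python sketch that builds a read-only IAM policy document and prints it as JSON. The bucket name my-app-uploads and the function name are placeholders; the Version, Statement, Effect, Action and Resource keys follow the standard IAM policy layout.

```python
import json

def read_only_policy(bucket: str) -> dict:
    """Build a policy that allows reading objects from one bucket and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Objects inside this one bucket only, not every bucket in the account
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }

policy = read_only_policy("my-app-uploads")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Compare that with AdministratorAccess, which allows every action on every resource: if the credentials leak, the blast radius is a single bucket instead of the whole account.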

Q: When to use Lambda vs EC2?

💡 HintThink about execution duration and frequency. One runs 24/7, the other runs on demand.
✅ AnswerLambda = short tasks (<15 min), occasional, with automatic scaling (webhooks, file processing). EC2 = applications running continuously 24/7 (web API, server). With Lambda you pay per execution, with EC2 you pay for as long as the instance runs, even when idle.

Q: What is SQS and why is it useful?

💡 HintThink of a queue. Instead of processing messages directly (and risking losing them if it crashes), you put them in...
✅ AnswerSimple Queue Service, a managed message queue. You put messages in, another program consumes them. If the consumer crashes before deleting a message, the message becomes visible again after the visibility timeout and gets reprocessed. Useful for decoupling services, absorbing traffic spikes, and avoiding data loss.

Q: What’s the difference between ECS and EKS?

💡 HintBoth run containers on AWS. One is AWS-specific and simpler, the other is a portable standard.
✅ AnswerECS = AWS-specific container orchestration (simpler, no control plane fees). EKS = managed Kubernetes (standard, multi-cloud portable, but more complex and more expensive ~$75/month base).

Q: What is Fargate?

💡 HintAn ECS mode where you don't manage any servers. You just provide your Docker image and the amount of CPU/RAM.
✅ AnswerA "serverless" compute mode for containers, mostly used with ECS (it also works with EKS). You provide your Docker image, define CPU and RAM, and AWS launches the container somewhere in the cloud. You never see a machine, you don't manage any servers. You pay for the CPU/RAM you reserve for the task while it runs.

Q: What is AWS?

💡 HintThe world's largest cloud provider. You rent computing resources instead of buying them.
✅ AnswerA cloud computing provider. You rent servers (EC2), storage (S3), databases (RDS) and many other services, on demand. You pay for what you use.

Q: What is RDS?

💡 HintThink of a database where AWS handles all the maintenance for you.
✅ AnswerRelational Database Service — a managed database by AWS. You choose the engine (PostgreSQL, MySQL...), AWS handles backups, updates, and high availability.

Q: What is DynamoDB?

💡 HintAWS's NoSQL alternative. Instead of SQL tables with fixed columns, you store...
✅ AnswerA managed NoSQL database by AWS. Instead of SQL tables with fixed columns, you store items with flexible attributes, similar to JSON documents, looked up by key. In on-demand mode, scaling is automatic and pricing is per request.

Q: When to use RDS vs DynamoDB?

💡 HintThink about the data type: does it have relationships (users → orders → products)?
✅ AnswerRDS when your data has relationships and you need complex SQL queries. DynamoDB for simple data at very high traffic (sessions, cache, counters). When in doubt, RDS — it's more versatile.

Q: What is ECS?

💡 HintYou give it Docker images, it runs, monitors and scales them. With Fargate, you don't even manage servers.
✅ AnswerElastic Container Service — you give AWS your Docker images, and it runs, monitors and scales them. With Fargate, you manage no servers — you pay only for CPU and RAM used.

Q: What is EKS?

💡 HintManaged Kubernetes on AWS. AWS manages one part, you manage the other. The advantage is portability.
✅ AnswerElastic Kubernetes Service — managed Kubernetes on AWS. AWS manages the control plane, you manage the workers. Advantage over ECS: K8s is a standard portable across any cloud.

Q: What is Lambda?

💡 HintCode that runs without a server. You only pay when your code runs.
✅ AnswerServerless — you send your code, AWS runs it when needed, you pay per execution. No server to manage. Ideal for short, one-off tasks (<15 min).

Q: When to use Lambda vs EC2 vs ECS?

💡 HintThink about execution duration and whether the app needs to run continuously or not.
✅ AnswerLambda for short tasks (<15 min) and one-off. ECS/EKS for containerized apps running continuously with auto-scaling. EC2 when you need full server control or for small simple projects.

Q: What is a cold start?

💡 HintThe first Lambda execution is slower. Why?
✅ AnswerA Lambda invocation is slower when AWS first has to start a new execution environment (first call, after a period of inactivity, or when scaling out). Subsequent executions (warm start) are faster because the environment is already ready.

Q: Difference between horizontal and vertical scaling?

💡 HintOne adds power to a machine, the other adds machines. Which one has a physical limit?
✅ AnswerVertical = increase the power of a machine (more CPU, more RAM). Horizontal = add more machines. Vertical has a physical limit, horizontal is virtually unlimited. In the cloud, horizontal scaling is preferred.

Q: What is the shared responsibility model?

💡 HintAWS and you each have a share of responsibility for security. Who manages what?
✅ AnswerAWS manages security of the cloud (datacenters, physical network, hypervisors). You manage security in the cloud (your data, your Security Groups, your IAM policies, your code). If your Security Group is open to everyone, that's your fault, not AWS's.

Q: What is Infrastructure as Code?

💡 HintInstead of clicking in a console to create servers, you do what?
✅ AnswerDescribe your infrastructure in code files instead of clicking in a console. Reproducible, versioned in Git, auditable, shareable.

Q: Explain plan, apply, destroy

💡 HintThree steps: preview, execute, delete. Which one do you always do first?
✅ Answerplan shows what will change without doing anything. apply executes the changes. destroy deletes everything. You always run plan before apply to verify.

Q: What is the state file and why is it important?

💡 HintTerraform needs to know what CURRENTLY exists to compare with what you want. It stores that in a file.
✅ AnswerA JSON file that records the current state of the infrastructure. Terraform compares it with your code to know what to create/modify/delete. Never edit it by hand, never commit it (it can contain secrets). In a team, store it in a remote backend (e.g. an S3 bucket) with locking.

Q: How do you interact with a resource that already exists on AWS but not in your Terraform?

💡 HintThere's a keyword different from resource that FETCHES information instead of CREATING something.
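✅ AnswerWith a data block. Unlike resource which creates something, data fetches information that already exists (an AMI, a VPC, an existing Security Group). If you want Terraform to manage that resource from now on, you bring it into the state with terraform import instead.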

Q: Someone modified the infrastructure by hand in the AWS console — what happens?

💡 HintThe state file no longer matches reality. Terraform will detect the difference on the next plan. What is that called?
✅ AnswerThat's drift. On the next terraform plan, Terraform refreshes the state and shows the differences between the code and reality. Either you update the code to match the manual change, or apply overwrites the manual change.
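
Conceptually, drift detection is a diff between what the code says and what really exists. The Python sketch below is a toy model of that comparison, not Terraform's actual implementation; the attribute names and values are invented.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {attribute: (desired, actual)} for every attribute that differs."""
    keys = desired.keys() | actual.keys()
    return {
        key: (desired.get(key), actual.get(key))
        for key in sorted(keys)
        if desired.get(key) != actual.get(key)
    }

code = {"instance_type": "t3.micro", "port": 22}      # what the .tf files describe
console = {"instance_type": "t3.large", "port": 22}   # after someone clicked in the console

print(detect_drift(code, console))  # {'instance_type': ('t3.micro', 't3.large')}
```

An empty result means code and reality agree; anything else is what a plan would report as a change to make.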

Q: What is Terraform?

💡 HintA tool for describing your infrastructure in code files instead of clicking in a console.
✅ AnswerAn Infrastructure as Code tool. You describe your infra in HCL files, Terraform creates/modifies/deletes it. Versionable, reproducible, collaborative.

Q: Terraform vs CloudFormation?

💡 HintOne is multi-cloud, the other is specific to a single cloud provider.
✅ AnswerTerraform is multi-cloud (AWS, GCP, Azure). CloudFormation is AWS-specific. Terraform has a large community, and many people find HCL more readable than CloudFormation's YAML/JSON templates.

Q: What is a Terraform module?

💡 HintThink of a function in programming — reusable code you call with parameters.
✅ AnswerA reusable block of Terraform code. Instead of copy-pasting the same config for each environment, you create a module and call it with different parameters. It's like a function in programming.

Q: What is a Terraform provider?

💡 HintTerraform alone can't do anything. It needs plugins to talk to AWS, GCP, etc.
✅ AnswerA plugin that connects Terraform to a service (AWS, GCP, Azure, GitHub...). The AWS provider allows Terraform to create EC2s, S3 buckets, RDS instances. Without a provider, Terraform can't talk to anything.

Q: What is Ansible?

💡 HintA server configuration tool. The keyword is "agentless" — it doesn't need to install anything on the target server.
✅ AnswerA configuration management tool. Configures servers in an automated way, agentless (connects via SSH, no need to install anything on the target server).

Q: Ansible vs Terraform?

💡 HintOne creates the infrastructure, the other configures what runs on it. Think "building the house" vs "furnishing it".
✅ AnswerTerraform creates the infrastructure (the server exists). Ansible configures what runs on it (installs Docker, copies files, launches the app). Terraform builds the house, Ansible furnishes it.

Q: What is idempotence?

💡 HintWhat happens if you run the same playbook 10 times in a row?
✅ AnswerRunning a playbook multiple times always gives the same result. If Docker is already installed, Ansible doesn't reinstall it. That's what makes it safe to re-run.
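
Idempotence is easier to grasp with a tiny example. This Python sketch mimics the "check first, change only if needed" behaviour of a config-management task. The ensure_line helper and the app.conf file are hypothetical, and this is not Ansible code.

```python
import os
import tempfile

def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is in the file. Returns True only if the file was changed."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line in existing:
        return False           # already in the desired state: do nothing
    with open(path, "a") as f:
        f.write(line + "\n")
    return True

path = os.path.join(tempfile.mkdtemp(), "app.conf")
results = [ensure_line(path, "debug=false") for _ in range(3)]
print(results)  # [True, False, False]
```

The first run changes the file, the next two report "nothing to do", and the file ends up identical no matter how many times you run it.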

Q: What is a playbook?

💡 HintIt's a file in a format you know well (used everywhere in DevOps). It describes tasks to execute.
✅ AnswerA YAML file that describes tasks to execute on servers. Each task uses a module (apt, copy, service) and is named for readability.

Q: How do you manage secrets in Ansible?

💡 HintAnsible has a built-in tool to encrypt files. Its name makes you think of a safe.
✅ AnswerWith Ansible Vault. You encrypt files containing secrets, and at execution time you pass --ask-vault-pass to decrypt them.

Q: What is an Ansible inventory?

💡 HintAnsible needs to know which machines to act on. There's a file for that.
✅ AnswerThe file that lists the servers Ansible will act on. It contains IP addresses or hostnames, organized in groups (web, db, etc.). Ansible connects via SSH to each machine in the inventory to execute tasks.

Q: What is an Ansible role?

💡 HintWhen your playbook grows, you need to organize it into reusable components.
✅ AnswerA way to organize a playbook into reusable components. A role bundles tasks, files, templates, and variables related to a function (e.g., a "docker" role that installs and configures Docker). You can reuse the same role across multiple playbooks.

Q: What is Kubernetes?

💡 HintThink of an orchestra conductor for containers. It manages 3 main things: deployment, scaling, and...
✅ AnswerA container orchestrator. It manages the deployment, scaling and high availability of your containers on a cluster of machines.

Q: What is a Pod?

💡 HintIt's the basic unit. Most of the time, 1 pod = 1 container.
✅ AnswerThe basic unit of K8s. Most of the time 1 pod = 1 container, but a pod can hold several containers that share the same network and storage. Kubernetes doesn't manage containers directly: it manages pods.

Q: A pod crashes — what does Kubernetes do?

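💡 Hint: K8s maintains the number of replicas defined in the Deployment. If one is missing, it...
✅ Answer: If only the container crashes, the kubelet restarts it inside the same pod. If the pod itself disappears (deleted, node lost), the Deployment, through its ReplicaSet, detects that a pod is missing and automatically creates a replacement. That's self-healing, and it's why you never create pods directly: you go through a Deployment.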

Q: What’s the difference between port and targetPort in a Service?

💡 Hint: One is the "entry" port of the Service, the other is the port the container actually listens on. They can be different.
✅ Answer: port = the port to access the Service (from inside the cluster). targetPort = the port on the container that traffic is redirected to. Often the same, but you could map port 80 of the Service to port 8000 of the container.

Q: How do you update an app without downtime on K8s?

💡 Hint: K8s replaces pods progressively, not all at once. It waits for the new one to be ready before deleting the old one. What is that called?
✅ Answer: Rolling update (the default). Kubernetes creates a pod with the new version, waits for it to be ready (readiness probe), then deletes an old one. Pods are replaced gradually, one by one or in small batches depending on maxSurge and maxUnavailable, so users don't see any downtime.

Q: Difference between Docker and Kubernetes?

💡 Hint: One runs containers on ONE machine, the other orchestrates dozens/hundreds across multiple machines.
✅ Answer: Docker builds and runs containers on a single machine. Kubernetes orchestrates dozens/hundreds of containers across multiple machines (scheduling, scaling, self-healing).

Q: What is a Deployment?

💡 Hint: You never create pods directly. You go through an object that manages them for you.
✅ Answer: An object that manages a group of identical pods. It maintains the desired replica count, manages updates (rolling update), and recreates crashed pods.

Q: What is a K8s Service?

💡 Hint: Pods have IPs that change on every restart. You need a stable access point.
✅ Answer: A stable network access point to a group of pods. Pods have ephemeral IPs, the Service has a fixed IP and distributes traffic across the pods.

Q: What is a Namespace?

💡 Hint: Think of folders to organize and isolate resources in a cluster.
✅ Answer: A way to isolate resources in a cluster. Useful for separating environments (dev, staging, prod) or teams.

Q: What is an Ingress?

💡 Hint: How do you make external HTTP requests reach the right Services inside the cluster?
✅ Answer: A K8s object that manages HTTP(S) routing to Services. It lets you say "requests to api.mysite.com go to the backend Service" and "requests to mysite.com go to the frontend Service". It's the HTTP entry point of the cluster.

Q: What is a ConfigMap and a Secret?

💡 Hint: How do you pass configuration and secrets to your pods without putting them in the Docker image?
✅ Answer: K8s objects for storing configuration. A ConfigMap stores non-sensitive data (URLs, feature flags). A Secret stores sensitive data (passwords, API keys). Its values are base64-encoded, which is an encoding and not encryption, so access to Secrets must be restricted. Both are injected into pods as environment variables or files.
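
A quick way to convince yourself that base64 protects nothing is to round-trip a value in Python. This is only an illustration: the password is a made-up value, and the snippet mimics what ends up in the `data` field of a Secret manifest rather than calling any Kubernetes API.

```python
import base64

password = "s3cr3t-db-password"  # hypothetical value, for illustration only

# What would appear in the `data` field of a Secret manifest
encoded = base64.b64encode(password.encode("utf-8")).decode("ascii")

# Anyone who can read the Secret can reverse it: no key is involved
decoded = base64.b64decode(encoded).decode("utf-8")
```

In an interview, this is the follow-up to expect: "so is a Secret secure?" The honest answer is that it is only as secure as the access rules around it.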

Q: What is a liveness probe and a readiness probe?

💡 Hint: K8s needs to know if your pods are alive and ready. It uses two different types of checks.
✅ Answer: Health checks that K8s runs on your containers. The liveness probe checks whether the container is still alive: if it fails, K8s restarts the container. The readiness probe checks whether the pod is ready for traffic: if it fails, K8s stops sending it requests (the pod is removed from the Service endpoints) without restarting it.
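
The difference between the two probes fits in a few lines. The sketch below is a simplified decision table in Python, not Kubernetes code: the function name is hypothetical, and real probes also involve thresholds and delays that are ignored here.

```python
def probe_outcome(liveness_ok: bool, readiness_ok: bool) -> str:
    """Simplified view of what happens to a container given its probe results."""
    if not liveness_ok:
        return "restart the container"
    if not readiness_ok:
        return "keep it running, stop sending it traffic"
    return "keep it running, send it traffic"
```

A typical example of "alive but not ready" is an app that has started but is still loading its cache or waiting for the database.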

Q: Difference between ClusterIP, NodePort, and LoadBalancer?

💡 Hint: These are the three K8s Service types. Each exposes the Service at a different level of accessibility.
✅ Answer: Three K8s Service types. ClusterIP (default) = accessible only from inside the cluster. NodePort = accessible from outside via a port on each node. LoadBalancer = creates an external load balancer (cloud provider) redirecting to the Service. In production, you typically use an Ingress in front of a ClusterIP Service.

Q: What are the 3 pillars of observability?

💡 Hint: Three types of data: numbers, text, and the path of a request.
✅ Answer: Metrics (numbers such as CPU or response time), Logs (text messages from applications), Traces (the path of a request through multiple services).

Q: How do you tell the difference between a code problem and an infrastructure problem?

💡 Hint: If all instances have the same problem, it's probably the code. If it's just one instance... think about resources.
✅ Answer: You check infrastructure metrics first (CPU, RAM, disk, network). If everything is normal on the infra side but the app returns errors → it's a code bug (ticket for the devs). If CPU is at 100% or disk is full → it's an infra problem (your problem).

Q: How do you know if your app is slow?

💡 Hint: You don't look at the average (it hides problems). You look at a percentile. Which one?
✅ Answer: The p95 or p99 of latency in Grafana. The p95 = 95% of requests are faster than this value. If the p95 is at 2 seconds, 5% of your users are waiting more than 2 seconds.
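
Here is a small Python sketch of why the average is misleading. The latency sample is invented for the example, and the percentile uses the simple nearest-rank method, which is not necessarily the exact method your monitoring tool uses.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests (100 ms) and 10 slow ones (3000 ms)
latencies_ms = [100] * 90 + [3000] * 10

avg = mean(latencies_ms)            # 390 ms: looks acceptable
p50 = median(latencies_ms)          # 100 ms: the typical request is fast
p95 = percentile(latencies_ms, 95)  # 3000 ms: the slow tail becomes visible
```

The average says "390 ms, fine", while the p95 says "some users wait 3 seconds". That gap is exactly what the interviewer wants you to point out.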

Q: What’s a good alert vs a bad alert?

💡 Hint: A good alert prompts you to act. A bad alert, you end up ignoring. Think symptoms vs causes.
✅ Answer: Good: actionable, based on symptoms ("the 5xx error rate exceeds 5%"). Bad: noise ("CPU at 80%", which may be perfectly normal). If you receive an alert and your reaction is "meh", delete the alert.
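
One thing that makes a symptom-based alert less noisy is requiring the condition to hold for a while. A minimal Python sketch, assuming one error-rate sample per minute (the function name, the 5-sample window, and the sample values are all hypothetical):

```python
def should_alert(error_rates, threshold=0.05, for_samples=5):
    """Fire only if the error rate stayed above the threshold for the last `for_samples` samples."""
    if len(error_rates) < for_samples:
        return False
    return all(rate > threshold for rate in error_rates[-for_samples:])


# One sample per minute (hypothetical values)
brief_spike = [0.01, 0.01, 0.20, 0.01, 0.01, 0.01]
sustained = [0.01, 0.08, 0.09, 0.12, 0.10, 0.07]
```

With these values, `should_alert(brief_spike)` stays quiet because the spike lasted one minute, while `should_alert(sustained)` fires because users have been seeing errors for five minutes in a row.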

Q: What’s the difference between Prometheus and Grafana?

💡 Hint: One collects data, the other displays it. Think sensor vs dashboard.
✅ Answer: Prometheus collects and stores metrics (it scrapes /metrics at a configured interval, for example every 15s). Grafana displays them in dashboards. Prometheus = the sensor, Grafana = the dashboard.

Q: Why is monitoring important?

💡 Hint: Without monitoring, how do you know your app is working correctly?
✅ Answer: Without monitoring, you don't know if your app works correctly. With it, you detect problems before users do, identify bottlenecks, and have data to back your decisions.

Q: What is Prometheus?

💡 Hint: A metrics collection tool. It fetches data itself (pull model) instead of waiting for apps to send it.
✅ Answer: A metrics collection system that uses a pull model. It scrapes the /metrics endpoints of applications at regular intervals and stores the data as time series.

Q: What is Grafana?

💡 Hint: It's the visualization tool that goes with Prometheus. Think dashboards and graphs.
✅ Answer: A visualization tool. It connects to data sources (Prometheus, etc.) and creates dashboards with graphs and alerts.

Q: Difference between pull and push model?

💡 Hint: Who initiates data collection? The monitoring server, or the application itself?
✅ Answer: Pull = Prometheus fetches the data (scrape). Push = applications send the data. Pull is simpler to manage and debug.

Q: What are SLI, SLO, and SLA?

💡 Hint: Three levels: what you measure, what you aim for, what you contractually commit to.
✅ Answer: SLI (Service Level Indicator) = the measured metric (e.g., 99.2% of requests respond in under 200ms). SLO (Service Level Objective) = the internal target (e.g., we aim for 99.5%). SLA (Service Level Agreement) = the contractual commitment with the client (e.g., if we drop below 99%, we refund). SLI measures, SLO guides, SLA commits.
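
The three numbers from the answer can be put side by side in a few lines of Python. This is only a worked example: the request counts are invented so that the SLI comes out at the 99.2% used above.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Share of requests that met the criterion (here: answered in under 200 ms)."""
    return good_requests / total_requests


SLO = 0.995  # internal target
SLA = 0.99   # contractual commitment

# Hypothetical month: 9,920 of 10,000 requests were fast enough
measured = sli(good_requests=9_920, total_requests=10_000)  # 0.992

slo_met = measured >= SLO  # False: the team should react
sla_met = measured >= SLA  # True: no refund owed yet
```

This is why the SLO is stricter than the SLA: it gives the team a warning zone before the contract is breached.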

Scenario 1 — Deploying a web app to production

“You join a startup. They have a web app (React frontend + backend API + PostgreSQL database). Everything runs on the CTO’s laptop. How do you put this in production?”

Don’t dive straight into tools. Ask questions first:

  • How many users? (10? 10,000? 1 million?)
  • What budget? ($0? $50/month? $1,000/month?)
  • What team? (1 dev, 10 devs? Is there a DevOps?)
  • What are the availability requirements? (side project vs. banking app)
  • Is the frontend static (just built HTML/JS) or does it need server-side rendering?

This last question is key, because it completely changes the architecture for the frontend.

Our case: React with Vite = static frontend. The build produces static files (HTML/CSS/JS) that can be served from any web server or CDN.

Approach 1 — CDN / Static hosting (the simplest and most performant)

The built frontend is just static files. No need for a server for this.

| Service | What it is | Cost | Complexity |
| --- | --- | --- | --- |
| S3 + CloudFront | S3 bucket (storage) + AWS CDN (worldwide distribution) | ~$0-5/month | Low |
| Vercel | Specialized frontend hosting, auto deployment from Git | Free (hobby) | Very low |
| Netlify | Same concept as Vercel | Free (hobby) | Very low |
| AWS Amplify Hosting | AWS service to host frontend apps, auto deployment from Git | Free (Free Tier) | Low |

When to choose: Almost always for a static frontend (built React, Vue, Angular). It’s faster (CDN = servers close to users), cheaper, and you have no server to manage.

Approach 2 — Nginx in a container (what we do in the Hands-on Project)

You build the frontend, then serve the files with nginx in a Docker container. This is what we do in Module 3.

When to choose: When you want everything in the same docker-compose to simplify the deployment, or when you need a custom reverse proxy (complex routing rules).

Approach 3 — Server-Side Rendering (Next.js, Nuxt, etc.)

If the frontend does SSR (HTML is generated server-side), then it needs a Node.js server running at all times. In that case, you treat it like a backend (EC2, ECS, App Runner, etc.).

When to choose: Critical SEO (e-commerce, blog), dynamic content that changes often.

The backend + database — From simplest to most robust

Option A: 1 server, Docker Compose (MVP / side project)

1 EC2 (t3.small)
├── Frontend (nginx)
├── Backend (API container)
└── PostgreSQL (container with volume)

Pros: Quick to set up, cheap (~$15/month), a single machine.
Cons: Single point of failure. DB in Docker = risky (no automatic backup). No scaling.
When to choose: MVP, side project, <100 users, minimal budget.

Option B: EC2 + RDS (serious startup)

VPC
├── Public subnet
│   └── EC2 (backend in Docker)
└── Private subnet
    └── RDS PostgreSQL (automatic backups)
+ S3 + CloudFront (static frontend)

Pros: The DB is managed (backups, auto updates). Network separation. The frontend on a CDN is fast and nearly free. You can add a 2nd EC2 + load balancer later.
Cons: More expensive (~$50-100/month). You manage the EC2s yourself (OS updates, Docker, etc.).
When to choose: App in production, real users, need for reliability, small team.

Option C: ECS Fargate (scaling without managing servers)

VPC
├── Public subnet
│   └── Application Load Balancer
└── Private subnet
    ├── ECS Fargate (backend containers, auto-scaling)
    └── RDS PostgreSQL Multi-AZ
+ S3 + CloudFront (frontend)
+ Route 53 (DNS)

ECS (Elastic Container Service) runs your Docker containers without you managing servers. Fargate = you give it a Docker image, define CPU/RAM, it launches the container somewhere in the cloud. You never see a machine.

Pros: Auto-scaling, no servers to manage, high availability. You push a Docker image and it’s deployed.
Cons: More expensive than bare EC2 (~$100-300/month). More complex configuration (task definitions, services, target groups…).
When to choose: Variable traffic, need for scaling, don’t want to manage EC2s.

Option D: AWS App Runner (the simplest for containers)

App Runner (backend container)
+ RDS PostgreSQL
+ S3 + CloudFront (frontend)

App Runner is the simplest AWS service to run a containerized web app. You give it your Docker image (or your source code) and it handles everything: build, deployment, scaling, HTTPS, load balancing.

Pros: Ultra simple. No network configuration. Auto-scaling included. Automatic HTTPS.
Cons: Less control than ECS. Not attached to your VPC by default (you need to configure a VPC connector to reach a private RDS). More expensive at high traffic.
When to choose: You want to deploy fast, you don’t want to configure VPC/ALB/ECS, small team without a dedicated DevOps.

Option E: AWS Amplify (integrated frontend + backend)

Amplify is a complete platform that can host a static frontend AND a backend (via Lambda functions or a GraphQL API).

Pros: All-in-one: hosting, auth, API, database. Auto deployment from Git. Ideal for fullstack devs who don’t want to touch infra.
Cons: Strong vendor lock-in (you’re tied to the Amplify way of doing things). Less control. Can become limiting for complex architectures.
When to choose: Small fullstack project, rapid prototyping, no DevOps on the team.

Option F: Kubernetes / EKS (large scale)

EKS (managed Kubernetes)
├── Backend deployments (auto-scaling)
├── Worker deployments
└── Ingress Controller (HTTP routing)
+ RDS Multi-AZ
+ S3 + CloudFront (frontend)
+ Helm for packaging

Pros: Massive scaling, portability (not locked to AWS), fine-grained orchestration.
Cons: Complex to operate. EKS costs ~$75/month just for the control plane. Over-engineering if you don’t have 10+ microservices.
When to choose: Many microservices, large DevOps team, need for multi-cloud portability.

| Option | Complexity | Monthly cost* | Scaling | Server management | Use case |
| --- | --- | --- | --- | --- | --- |
| EC2 + Docker Compose | Low | ~$15 | No | Yes | MVP |
| EC2 + RDS | Medium | ~$50-100 | Manual | Yes | Serious startup |
| App Runner + RDS | Low | ~$30-80 | Auto | No | Small team, fast to prod |
| ECS Fargate + RDS | High | ~$100-300 | Auto | No | Variable traffic, scaling |
| Amplify | Low | ~$0-50 | Auto | No | Prototyping, solo fullstack |
| EKS (K8s) | Very high | ~$200+ | Auto | Partially | Microservices, large scale |

*Approximate costs for a modest-sized app.

| Service | What it is | When to use |
| --- | --- | --- |
| Railway / Render | PaaS (Platform as a Service). You push your code, they deploy. | Side projects, small apps, don’t want to deal with AWS |
| Fly.io | Edge containers (close to users). | Global APIs, low latency |
| DigitalOcean App Platform | Simple PaaS, cheaper than AWS. | SMBs, startups that want simplicity |
| GCP Cloud Run | Google’s equivalent of App Runner. Serverless containers. | Already on GCP |
| Azure Container Apps | Microsoft’s equivalent of App Runner. | Already on Azure |

In an interview, mentioning that alternatives exist shows that you don’t only know one provider.

The interviewer is not looking for the perfect answer. They want to see that you:

  • Ask questions before answering (budget, scale, constraints, team)
  • Know multiple options and can compare them (not just “EC2 and that’s it”)
  • Separate concerns: the static frontend doesn’t need a server, the DB should be managed
  • Can explain the trade-offs: simplicity vs. control vs. cost vs. scaling
  • Don’t suggest Kubernetes for 50 users — but you can explain when K8s makes sense

Scenario 2 — The site is down in production

“It’s 2 PM, you get an alert: the site isn’t responding. Users are complaining. What do you do?”

The method (from broadest to most specific)

Step 1 — Confirm and scope the problem (30 seconds)

# Is the site responding?
curl -I https://mysite.com
# If timeout → network/DNS/server down problem
# If 502 → the proxy server is running but the app behind it is down
# If 500 → the app is running but crashing
# Is it just me or everyone?
# Test from another network / a colleague
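
The same triage can be scripted, which is handy for a status check you run from your own machine. This is a minimal Python sketch using only the standard library: the function names are hypothetical, and the mapping simply restates the comments above.

```python
import urllib.error
import urllib.request


def triage(status):
    """Map an HTTP status (None = no response at all) to the likely cause."""
    if status is None:
        return "timeout: network, DNS, or server down"
    if status == 502:
        return "502: the proxy is up but the app behind it is down"
    if status >= 500:
        return "5xx: the app is running but failing"
    return "responding"


def probe(url, timeout=5.0):
    """Send a HEAD request (like `curl -I`) and return the status code, or None if nothing answers."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with a 4xx/5xx status
    except OSError:
        return None  # DNS failure, connection refused, timeout, ...
```

You would call it as `triage(probe("https://mysite.com"))`. Keeping `triage` separate from the network call makes the decision logic easy to test without a real outage.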

Step 2 — Check the infrastructure (2 minutes)

# Is the server up?
ssh user@server
# If "Connection timed out" → the server is down or the SSH port is blocked
# If "Connection refused" → the server is up but nothing is listening on the SSH port
# → Check in the AWS console: instance running? Security Group OK?
# Are resources OK?
top # CPU, RAM
df -h # Disk full?

Step 3 — Check the services (2 minutes)

docker ps # Are the containers running?
docker logs backend --tail 100 # Recent errors?
systemctl status nginx # Is the reverse proxy running?

Step 4 — Check dependencies

# Is the database responding?
docker exec -it db psql -U user -c "SELECT 1;"
# Are external services responding?
curl https://external-api-we-use.com/health

Step 5 — Fix and communicate

  • Fix the problem (restart the service, free up disk space, roll back the last deployment…)
  • Communicate: notify the team, update the status page
  • After the incident: write a post-mortem (what happened, why, how to prevent it from happening again)

| Symptom | Likely cause | Quick fix |
| --- | --- | --- |
| Total timeout | Server down or Security Group | Restart the instance, check network rules |
| 502 Bad Gateway | The app crashed behind the proxy | docker restart backend, check the logs |
| 500 Internal Error | Bug in the code or DB unreachable | App logs, check the DB connection |
| Very slow site | CPU/RAM saturated, slow DB queries | top, check slow queries |
| Disk full | Logs accumulating, Docker images | df -h, docker system prune, log rotation |

What the interviewer wants to see:

  • A structured method, not panic
  • You start by checking, not by changing things
  • You communicate with the team during debugging
  • You mention post-mortem (learning after the incident)

Scenario 3 — Setting up a CI/CD pipeline

“The team of 5 devs deploys manually via SSH. It takes 30 min and breaks one out of three times. How do you improve this?”

Today:

  1. A dev finishes their code
  2. They SSH into the server
  3. They run git pull on the server
  4. They restart the app manually
  5. They cross their fingers

Problems: no tests before deployment, no rollback possible, only one dev knows how to do it, it breaks often.

Phase 1 — CI (1-2 days to set up)

# On every push to main:
Lint → Tests → Build Docker image → Push to registry
  • Devs get immediate feedback: “your code breaks the tests”
  • You never deploy code that doesn’t compile or doesn’t pass tests
  • Impact: we stop deploying broken code
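
The key property of this phase is "fail fast": a failing stage stops everything after it. A minimal Python sketch of that behavior follows. The stage functions are placeholders (in a real pipeline they would invoke your linter, test runner, and Docker), and your CI tool gives you this behavior natively.

```python
def run_pipeline(stages):
    """Run stages in order; stop at the first failure so nothing broken gets built or pushed."""
    completed = []
    for name, step in stages:
        if not step():
            return {"ok": False, "failed_stage": name, "completed": completed}
        completed.append(name)
    return {"ok": True, "failed_stage": None, "completed": completed}


# Hypothetical stages, in the order given above
stages = [
    ("lint", lambda: True),
    ("tests", lambda: False),  # simulate a failing test suite
    ("build image", lambda: True),
    ("push to registry", lambda: True),
]
result = run_pipeline(stages)
```

Here the run stops at "tests": no image is built and nothing reaches the registry, which is exactly the guarantee you are selling to the team.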

Phase 2 — CD to a staging environment (3-5 days)

# After CI:
Automatic deploy to a staging server
  • Devs and the product owner test on staging before production
  • Staging is a copy of production (same config, same infra)
  • Impact: we test in real conditions before production

Phase 3 — CD to production (when the team is confident)

# If staging is OK (tests pass, QA validated):
Manual approval → Deploy to prod
  • A human validates before production (Continuous Delivery, not Deployment)
  • Automatic rollback if the health check fails
  • Impact: deployment in 5 min instead of 30, no SSH connection needed
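
The "automatic rollback if the health check fails" step can be sketched as follows. This is an assumption-laden Python outline, not a real deployment tool: `deploy`, `health_check`, and `rollback` are hypothetical callables that you would wire to your own scripts.

```python
import time


def deploy_with_rollback(deploy, health_check, rollback, attempts=3, delay_seconds=0.0):
    """Deploy, then poll the health check; roll back if it never passes."""
    deploy()
    for _ in range(attempts):
        if health_check():
            return "deployed"
        time.sleep(delay_seconds)  # give the app time to start before retrying
    rollback()
    return "rolled back"


events = []
outcome = deploy_with_rollback(
    deploy=lambda: events.append("deploy v1.1"),
    health_check=lambda: False,  # simulate a release that never becomes healthy
    rollback=lambda: events.append("rollback to v1.0"),
)
```

In this simulated run the outcome is a rollback, and nobody had to SSH into a server to get back to the previous version.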

Why proceed in phases? Because trust is built progressively. If tests don’t cover enough cases, a 100% automated deployment to production will deploy bugs faster. Phase 1 → Phase 2 → Phase 3 lets the team build confidence at each step.

What the interviewer wants to see:

  • You don’t suggest “let’s set up Kubernetes” right away
  • You think progressively (quick wins first)
  • You mention staging (never directly to production)
  • You mention rollback

“A dev committed a database password to the Git repo. What do you do?”

  1. Change the password immediately — the absolute priority. Even if “nobody saw it”, consider it compromised.
  2. Check access — has anyone used this password since the commit?
  3. Remove from Git: careful, a simple git rm is NOT enough. The password stays in the history. You’d need to rewrite the history (git filter-repo or BFG Repo-Cleaner; the Git documentation now recommends git filter-repo over the older git filter-branch), but that’s heavy. The most important thing is point 1: change the password.

| Measure | What it does |
| --- | --- |
| .gitignore | Ignore .env files, credentials.json, etc. |
| Pre-commit hook | Scan changes BEFORE they’re committed (tools: gitleaks, detect-secrets) |
| GitHub Secret Scanning | GitHub automatically detects committed secrets and alerts you |
| Environment variables | Secrets live in the server’s env, not in the code |
| Secrets manager | AWS Secrets Manager, HashiCorp Vault: secure and centralized storage |

Treat code as public by default: even a private repo can leak. Secrets must never be in the code. Period.
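
To give an idea of what a secret scanner does, here is a deliberately naive Python sketch. It is not how gitleaks or detect-secrets work internally: the two patterns and the function name are assumptions chosen for illustration, and real tools ship many more rules.

```python
import re

# Deliberately small pattern set, for illustration only
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded password": re.compile(r"(?i)password\s*[:=]\s*['\"][^'\"]+['\"]"),
}


def find_secrets(text):
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


# Hypothetical settings file about to be committed
sample = 'DEBUG = True\nDB_PASSWORD = "hunter2"\nDB_HOST = "localhost"\n'
findings = find_secrets(sample)
```

Run as a pre-commit hook, a check like this would flag line 2 and block the commit before the password ever reaches the history.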


Scenario 5 — Choosing the right infrastructure for each project

“We have 4 projects to host. How do you choose the infrastructure for each one?”

Project A: Internal REST API with 1,000 requests/day

Context: API used by an internal mobile app. Low traffic, minimal budget, one person to maintain.

Best choice: Lambda + API Gateway

Why: very low traffic, no need for a server running 24/7. Lambda = you only pay when a request comes in. Cost: nearly $0 (Free Tier). API Gateway handles HTTPS, rate limiting, and routing.

Possible alternatives:

  • App Runner: if the API is containerized and you want something simple without adapting the code for Lambda. Slightly more expensive but zero code adaptation.
  • EC2: overkill. You’re paying for a 24/7 server for 1,000 requests/day — that’s waste.

Context: Web application (React + API + PostgreSQL). Regular traffic during the day, low at night. Team of 5 devs. Needs reliability.

Best choice: ECS Fargate + RDS + S3/CloudFront

CloudFront (CDN) → S3 (static frontend)
ALB → ECS Fargate (API containers, auto-scaling)
└── RDS PostgreSQL (private subnet)

Why: regular traffic, the app must run at all times, persistent connection to the DB. ECS Fargate = no servers to manage, auto-scaling for spikes. RDS = managed DB.

Possible alternatives:

  • EC2 + RDS: cheaper, but you manage the servers (updates, Docker, monitoring). Good choice if the budget is tight and someone on the team knows how to manage servers.
  • App Runner + RDS: simpler than ECS, but less control over the network (VPC peering, custom security groups). Good for a quick v1.
  • Lambda: technically possible, but cold starts degrade the user experience, and DB connections are complicated to manage (you need RDS Proxy).

Project C: Processing uploaded files (resizing images)

Context: Users upload photos. They must be resized to 3 sizes and stored. Variable volume: sometimes 10 uploads/day, sometimes 10,000.

Best choice: Lambda + S3 (event-driven architecture)

User → upload → S3 bucket (originals)
└── trigger Lambda → resize → S3 bucket (results)

Why: purely event-driven. A file arrives in S3 → Lambda triggers automatically → processes the file → puts the result back in S3. No need for a server between uploads. Automatic scaling (100 uploads at the same time → 100 Lambdas in parallel).
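
The Lambda side of this flow could start like the Python sketch below. It only covers reading the S3 event and planning the three resizes: the bucket name, size labels, and widths are hypothetical, and the actual download, resize, and upload are left out because they need libraries outside the standard library.

```python
from urllib.parse import unquote_plus

# Hypothetical target widths and destination bucket
SIZES = {"thumb": 150, "medium": 800, "large": 1600}
RESULT_BUCKET = "my-app-resized"


def plan_resizes(event):
    """Turn an S3 "object created" event into the list of resize jobs to run."""
    jobs = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        for label, width in SIZES.items():
            jobs.append({
                "source": (bucket, key),
                "target": (RESULT_BUCKET, f"{label}/{key}"),
                "width": width,
            })
    return jobs


def handler(event, context):
    """Lambda entry point; the download/resize/upload step is omitted in this sketch."""
    jobs = plan_resizes(event)
    return {"jobs": len(jobs)}
```

Writing the results to a separate bucket, as in the diagram above, matters: if the Lambda wrote back into the bucket that triggers it, every resized image would trigger the function again.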

Possible alternatives:

  • ECS with an SQS queue: if processing takes >15 min (Lambda’s limit) or requires a lot of memory (>10 GB). SQS = queue, ECS = workers that consume the queue.
  • Step Functions + Lambda: if processing has multiple steps (resize → watermark → optimize → notify). Step Functions orchestrates the Lambdas.

Context: Marketing site with static content. No custom backend, just content that rarely changes. Near-zero budget.

Best choice: Amplify Hosting (or Vercel / Netlify)

Why: it’s static content. No need for a server, a container, or anything complex. You push to Git, the site is automatically deployed on a worldwide CDN.

Git push → Amplify Hosting → Worldwide CDN → users

Cost: free (Amplify Free Tier, or Vercel/Netlify free plan).

Possible alternatives:

  • S3 + CloudFront: same result, manual configuration. Better if you want full control on the AWS side.
  • EC2 with nginx: absolute overkill. A 24/7 server to serve HTML files — that’s a waste of money and time.

| Criteria | Lambda | App Runner | ECS Fargate | EC2 | Amplify / Vercel |
| --- | --- | --- | --- | --- | --- |
| Traffic | Sporadic | Low constant | Variable / high | Constant | Static |
| Execution duration | < 15 min | Unlimited | Unlimited | Unlimited | N/A |
| Stateful | No | No | Yes | Yes | No |
| DB connection | Complicated | Easy | Easy | Easy | No (or via API) |
| Scaling | Auto, instant | Auto | Auto (configurable) | Manual / ASG | Auto (CDN) |
| Server management | None | None | None | You | None |
| Low traffic cost | ~$0 | ~$5-15/month | ~$20-50/month | ~$15-30/month | ~$0 |
| High traffic cost | Can spike | Medium | Predictable | Predictable | Low (CDN) |
| Config complexity | Low | Very low | High | Medium | Very low |

What the interviewer wants to see:

  • You don’t give the same answer for all 4 projects
  • You justify with concrete criteria (traffic, duration, cost, state, team)
  • You know the limits of each solution AND the alternatives
  • You know that “the best choice” depends on the context: there is no universal answer
  • You separate static frontend / backend / async processing: each has a different solution

Scenario 6 — Infrastructure as Code: a colleague modified the infra by hand

“Your team uses Terraform. You run terraform plan and you see changes that nobody made in the code. What’s happening and how do you handle it?”

Someone modified the infrastructure directly in the AWS console (added a Security Group rule, changed an instance type, etc.) without going through Terraform. The Terraform state file no longer matches reality.

This is called drift.

Option A — Import the change into Terraform (if the change is intentional)

# 1. Identify what changed
terraform plan
# ~ aws_security_group.web will be updated in-place
# - ingress rule for port 3306 (added manually)
# 2. Add the rule in the Terraform code so it matches reality
# 3. Re-plan → no changes → code and reality are synchronized
terraform plan
# No changes.

Option B — Force a return to the code (if the change is a mistake)

# terraform apply will put the infra back to the state described by the code
terraform apply
# The manual change will be overwritten

How to prevent it from happening again:

  • Team rule: you NEVER touch the console to modify the infrastructure. Everything goes through code + pull request.
  • Restrictive IAM: limit modification permissions in the console for production environments.
  • Drift detection: run terraform plan regularly (in CI) to detect drift.
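
For the drift detection point, a scheduled CI job can rely on the `-detailed-exitcode` flag of `terraform plan`, which returns 0 when nothing would change, 2 when changes are pending, and 1 on error. The Python wrapper below is a minimal sketch: the function names are hypothetical, and it assumes terraform is installed and already initialized in the working directory.

```python
import subprocess


def interpret_plan_exit_code(code):
    """Meaning of the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in sync"
    if code == 2:
        return "drift or pending changes"
    return "error"


def check_drift(workdir="."):
    """Run the plan in `workdir` and classify the result."""
    completed = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(completed.returncode)
```

If the code on main has been fully applied, an exit code of 2 on a scheduled run means someone changed the infrastructure outside Terraform, and the job can notify the team.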

“Your app has been running in production for 3 months. The CTO tells you: ‘We have users complaining it’s slow but we don’t know why.’ How do you set up monitoring?”

Step 1 — Define what you want to measure

The four golden signals (from Google’s SRE book):

| Signal | Question | Example metric |
| --- | --- | --- |
| Latency | Is it fast? | Response time at the 95th percentile |
| Traffic | How many people? | Requests per second |
| Errors | Does it work? | 5xx error rate |
| Saturation | Is it full? | CPU, RAM, disk, DB connections |

App → exposes /metrics → Prometheus scrapes → Grafana displays
  • Add the Prometheus library to the app (for our project: prometheus-fastapi-instrumentator)
  • Deploy Prometheus + Grafana (docker-compose is the simplest)
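
To demystify what "exposes /metrics" means, here is a hand-rolled Python sketch of the text that Prometheus scrapes. In the real project the instrumentation library generates this for you. The metric name `http_requests_total` and the label are assumptions for the example.

```python
from collections import Counter

requests_by_status = Counter()


def record_request(status_code):
    """Count one handled request, grouped by HTTP status code."""
    requests_by_status[str(status_code)] += 1


def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests by status code.",
        "# TYPE http_requests_total counter",
    ]
    for status, count in sorted(requests_by_status.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


# Simulate six requests, half of them failing
for code in (200, 200, 500, 200, 500, 500):
    record_request(code)
metrics_text = render_metrics()
```

The resulting text is what the app would return on /metrics. From those two counters, Prometheus can derive both traffic (requests per second) and errors (share of 5xx), two of the four golden signals.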

One dashboard per “audience”:

  • Technical dashboard: latency, errors, CPU, RAM, DB slow queries
  • Business dashboard: number of active users, number of tasks created (for the CTO)

Good alerts:

  • “The 5xx error rate exceeds 5% for 5 minutes” → actionable (there’s a bug or a service is down)
  • “The p95 response time exceeds 2 seconds for 10 minutes” → actionable (degraded performance)

Bad alerts:

  • “CPU at 80%” → not actionable on its own (80% CPU might be normal if the app runs fine)
  • “A single 404 error” → noise (a user typed a wrong URL, that’s normal)

What the interviewer wants to see:

  • You know the Golden Signals or a similar framework
  • You distinguish between technical and business metrics
  • You know that an alert must be actionable
  • You don’t suggest monitoring 200 metrics at once

Scenario 8 — Blue-green / Canary deployment

“How do you deploy to production without downtime and without risking breaking it for all users?”

Blue-green deployment

                    ┌─── Blue (v1.0, current) ◄── 100% of traffic
Load Balancer ──────┤
                    └─── Green (v1.1, new)    ◄── 0% of traffic

  1. You deploy v1.1 to Green (while Blue still serves users)
  2. You test Green (smoke tests, sanity check)
  3. You switch the load balancer: Green receives 100% of the traffic
  4. If it works → you delete Blue. If it breaks → you switch back to Blue in 10 seconds.

Pros: Instant rollback. Zero downtime.
Cons: Double infrastructure during the transition (cost). Tricky if the DB schema changed between v1.0 and v1.1, because both versions share the same database.
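
The switch itself is a tiny piece of logic, which is what makes the rollback instant. A toy Python model of the load balancer, with a hypothetical class name and no real traffic involved:

```python
class BlueGreenRouter:
    """Toy load balancer that sends 100% of traffic to one environment at a time."""

    def __init__(self):
        self.versions = {"blue": "v1.0", "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        self.versions[self.idle] = version  # users are not affected yet

    def switch(self):
        self.live = self.idle  # also serves as rollback: just switch again

    def serving(self):
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy_to_idle("v1.1")
before_switch = router.serving()   # still v1.0
router.switch()
after_switch = router.serving()    # now v1.1
router.switch()                    # something broke: roll back
after_rollback = router.serving()  # v1.0 again
```

Rolling back is the same operation as releasing, which is why you keep the old environment around for a while after the switch.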

Canary deployment

                    ┌─── v1.0 ◄── 95% of traffic
Load Balancer ──────┤
                    └─── v1.1 ◄── 5% of traffic (the "canaries")

  1. You deploy v1.1 to a few instances
  2. You send 5% of traffic to v1.1
  3. You monitor the metrics (errors, latency)
  4. If everything is fine → 25% → 50% → 100%. If it breaks → 0% and rollback.

Pros: You detect bugs with limited impact (5% of users).
Cons: More complex to set up. Requires good monitoring to detect issues.
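
The ramp-up described in the steps above can be sketched in Python as follows. The function and its parameters are hypothetical: `error_rate_at` stands in for whatever your monitoring reports at a given traffic share, and the 5% error threshold is an assumption.

```python
def canary_rollout(error_rate_at, steps=(5, 25, 50, 100), max_error_rate=0.05):
    """Raise the canary's traffic share step by step; go back to 0% on the first bad reading."""
    history = []
    for percent in steps:
        history.append(percent)
        if error_rate_at(percent) > max_error_rate:
            history.append(0)  # rollback: all traffic returns to v1.0
            return "rolled back", history
    return "promoted", history


healthy = canary_rollout(lambda percent: 0.01)
# Hypothetical bug that only shows up under load, once v1.1 takes 50% of traffic
breaks_under_load = canary_rollout(lambda percent: 0.01 if percent < 50 else 0.20)
```

The second run shows why you ramp up in stages: the bug is invisible at 5% and 25%, appears at 50%, and the rollout stops there instead of reaching everyone.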

A rolling update is what Kubernetes does by default. You replace instances one by one:

Start: [v1.0] [v1.0] [v1.0] [v1.0]
Step 1: [v1.1] [v1.0] [v1.0] [v1.0]
Step 2: [v1.1] [v1.1] [v1.0] [v1.0]
Step 3: [v1.1] [v1.1] [v1.1] [v1.0]
End: [v1.1] [v1.1] [v1.1] [v1.1]

Pros: Simple, native in K8s, no double infrastructure.
Cons: Slower rollback. During the transition, two versions coexist.
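
The sequence above can be reproduced with a short Python simulation. It is a simplified model with a hypothetical function name: a real rolling update also waits for each new pod to be ready and may replace several pods at once.

```python
def rolling_update(replicas, old, new):
    """Replace instances one at a time, recording the fleet after each step."""
    fleet = [old] * replicas
    states = [list(fleet)]
    for index in range(replicas):
        # In a real cluster: start the new pod, wait until it is ready, stop the old one
        fleet[index] = new
        states.append(list(fleet))
    return states


states = rolling_update(4, "v1.0", "v1.1")
```

Every intermediate state contains both versions, which is the "two versions coexist" drawback: your API and database changes have to stay compatible with v1.0 until the last step.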

| Strategy | Complexity | Rollback | Use case |
| --- | --- | --- | --- |
| Blue-Green | Medium | Instant | Critical apps, few deployments |
| Canary | High | Fast | High-traffic apps, need to test in real conditions |
| Rolling | Low | Medium | Most cases, K8s default |