Interview Prep

Top 25 Real-Time DevOps Interview Questions & Answers

Real-World Scenarios | Practical Solutions

This is not a textbook of definitions. It is designed to help you answer questions such as "Tell me about a time when..." or "How do you handle X in production?".

These answers are written in a natural, conversational style suitable for experienced engineers.

1. Git & CI/CD (Real Scenarios)

1. A developer accidentally pushed AWS Access Keys to a public GitHub repo. How do you handle this?
Critical Security Incident

This is a P0 (Priority Zero) incident. I would follow these exact steps:

  1. Revoke Immediately: The moment keys are public, they are compromised. I would go to the AWS IAM console and deactivate/delete those keys immediately to stop any unauthorized access.
  2. Clean History: Simply deleting the file in a new commit isn't enough because the keys still exist in the commit history. I would use a tool like BFG Repo-Cleaner or git filter-repo (which the Git project now recommends over the older git filter-branch) to scrub that specific file from the entire Git history.
  3. Force Push: After cleaning, I would run git push --force to update the remote repository, and ask the team to re-clone so the old commits are not pushed back.
  4. Rotate Secrets: I would generate new keys and update them in our CI/CD secrets manager (like Jenkins Credentials or GitHub Secrets).
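
To stop this from happening again, a scan for key-shaped strings can run before every commit. The following is a minimal sketch, not a replacement for a dedicated secret scanner: the regex is a commonly used approximation of the AWS access key ID format, and the function name find_access_key_ids is made up for this example.

```python
import re

# Approximation of an AWS access key ID: a 4-letter prefix such as AKIA
# (long-term keys) or ASIA (temporary keys) plus 16 uppercase letters/digits.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_access_key_ids(text):
    """Return (line_number, match) pairs for anything shaped like a key ID."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.findall(line):
            hits.append((number, match))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = "region = us-east-1\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
print(find_access_key_ids(sample))  # [(2, 'AKIAIOSFODNN7EXAMPLE')]
```

A pre-commit hook could call this on the staged diff and refuse the commit when the list is not empty.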

2. Explain the Git Branching Strategy used in your project.
Workflow

In my current project, we follow the Feature Branch Workflow combined with Pull Requests:

  • Main/Master Branch: This is our production-ready code. It is locked; no one can push directly to it.
  • Feature Branches: When I start a task (e.g., creating a login page), I create a branch named feature/login-page from Main.
  • Pull Request (PR): Once I'm done, I push my branch and raise a PR. This triggers our CI pipeline which runs unit tests.
  • Review & Merge: A senior engineer reviews the code. If the tests pass and the review is approved, we merge it into Main using the "Squash and Merge" strategy to keep history clean.

3. What is the difference between Git Merge and Git Rebase? When do you use which?

This is a common debate, but here is how we apply it practically:

Git Merge: It preserves history exactly as it happened. When the branches have diverged, it creates a "Merge Commit". We use this when we want a true record of when features were added to Main; a squash merge, by contrast, collapses the whole branch into one commit on Main.

Git Rebase: It rewrites history to make it look linear. It picks up my changes and places them on top of the latest code. I use this locally on my feature branch before raising a PR. It keeps my branch clean and up-to-date with Main without creating messy "merge commits".

Rule of Thumb: Never rebase a public/shared branch, only your local private branch.

4. Your CI/CD pipeline passed, but the deployment failed in Production. How do you debug?
Troubleshooting

This usually happens due to environment differences. I would check:

  1. Config/Secrets: Did we add a new environment variable (e.g., DB_PASSWORD) in the Dev environment but forget to add it to the Production secrets manager?
  2. Networking: Is the Production server able to talk to the Database? I'd check Security Groups or Firewalls.
  3. Logs: I would immediately check the application logs (CloudWatch or Splunk) to see the crash error.
  4. Version Mismatch: Are we using the same Docker image tag? I’d verify the SHA/Tag of the deployed image.
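
The first check (a variable that exists in Dev but not in Production) is easy to automate. Here is a minimal sketch, assuming both environments' settings have already been exported into plain dictionaries; the function name and the sample values are made up for this example.

```python
def missing_in_target(source_env, target_env):
    """Names defined in source_env that are absent or empty in target_env."""
    return sorted(name for name in source_env if not target_env.get(name))


# Hypothetical settings; only the variable names matter for the comparison.
dev = {"DB_HOST": "dev-db.internal", "DB_PASSWORD": "set", "FEATURE_X": "on"}
prod = {"DB_HOST": "prod-db.internal", "FEATURE_X": "on"}

print(missing_in_target(dev, prod))  # ['DB_PASSWORD']
```

Running a comparison like this as a pipeline step before the deploy stage turns a production crash into a failed build.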

5. Explain how you set up a Jenkins Pipeline from scratch.

I usually write a Declarative Pipeline using a `Jenkinsfile` stored in the git repo. It has these stages:

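pipeline {
    agent any
    stages {
        // Pull the source code from the repository
        stage('Checkout') {
            steps { git 'https://github.com/my-repo.git' }
        }
        // Compile and package the application
        stage('Build') {
            steps { sh 'mvn clean package' }
        }
        // Run the unit tests
        stage('Test') {
            steps { sh 'mvn test' }
        }
        // Build the container image
        stage('Docker Build') {
            steps { sh 'docker build -t myapp:v1 .' }
        }
        // Hand over to the deployment script
        stage('Deploy') {
            steps { sh './deploy.sh' }
        }
    }
}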

I configure a Webhook in GitHub so that every time a developer pushes code, Jenkins automatically triggers this pipeline.

2. Linux & Shell (Real Scenarios)

6. A production server is running extremely slow. How do you troubleshoot?
Performance Tuning

I follow a standard drill to identify the bottleneck:

  1. Check CPU/Load: I run top. If the "Load Average" stays higher than the number of CPU cores, work is queuing and the server is likely overloaded. I look for the process consuming the most %CPU.
  2. Check Memory: I look at the memory line in top or run free -m. On Linux the "free" figure is often low because the kernel uses spare RAM for cache, so I look at "available" instead. If available memory is near zero and "Swap" is being used heavily, the system is thrashing (swapping to disk), which kills performance.
  3. Check Disk I/O: If CPU and RAM are fine, it might be the disk. I run iostat or iotop to see if a process is reading/writing too much data.
  4. Check Disk Space: df -h. A full disk can cause services to hang.
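
The load-average rule from step 1 can be sketched in a few lines. This is a rough illustration, assuming a Unix-like host where os.getloadavg() is available; the threshold of 1.0 per core is the rule of thumb above, not a hard limit, and the function names are made up for this example.

```python
import os


def load_per_core(load_average, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def is_overloaded(load_average, cores, threshold=1.0):
    """True when more work is queued than the cores can run at once."""
    return load_per_core(load_average, cores) > threshold


try:
    one_minute, five_minutes, fifteen_minutes = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"5-minute load per core: {load_per_core(five_minutes, cores):.2f}")
except (AttributeError, OSError):
    # os.getloadavg() is missing on some platforms (for example Windows).
    print("load average is not available on this platform")
```

Looking at the 5-minute or 15-minute figure rather than the 1-minute one helps to ignore short spikes.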

7. You get an error "No space left on device", but 'df -h' shows 30% space free. Why?

This is a classic Linux interview question. The issue is likely Inodes.

In Linux, every file uses one Inode (Index Node). If an application creates millions of tiny 0kb files (like session files or cache), you can run out of Inodes even if you have GBs of storage left.

Solution: I run df -i to check Inode usage. If it's 100%, I find the folder with millions of files and delete them.
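
The same numbers that df -i prints can be read from a script, which is handy for alerting. This is a small sketch, assuming a Unix-like system where os.statvfs is available; the function names are made up for this example.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use, or None when the filesystem reports no total."""
    if total_inodes == 0:
        return None  # some filesystems allocate inodes dynamically and report 0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage_for(path="/"):
    """Inode usage of the filesystem that contains path."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)


if hasattr(os, "statvfs"):
    print(inode_usage_for("/"))
```

A monitoring check could alert when this value crosses, say, 90%, long before writes start failing.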

8. How do you find which process is listening on Port 8080?

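I use the netstat or ss command (ss accepts the same flags). I run it with sudo so that processes owned by other users are shown too:

sudo netstat -tulpn | grep ':8080'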

This will show me the Process ID (PID) and the name of the service (e.g., java or python) holding that port. If I need to free the port, I can then kill that PID.
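
When this check has to run inside a script, the ss output can be parsed instead of read by eye. The sketch below assumes the usual column layout of ss -tulpn; the sample text is illustrative rather than captured from a real server, and the function name is made up for this example.

```python
import re

# Matches the ("name",pid=1234 part of the Process column in ss -p output.
PROCESS = re.compile(r'\("(?P<name>[^"]+)",pid=(?P<pid>\d+)')


def listeners_on_port(ss_output, port):
    """Return (process_name, pid) pairs listening on the given local port."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Field 4 is the local address, e.g. 0.0.0.0:8080, [::]:8080 or *:8080.
        if len(fields) < 5 or not fields[4].endswith(f":{port}"):
            continue
        for match in PROCESS.finditer(line):
            found.append((match.group("name"), int(match.group("pid"))))
    return found


SAMPLE = """Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*         users:(("sshd",pid=812,fd=3))
tcp   LISTEN 0      100    *:8080             *:*               users:(("java",pid=4321,fd=45))
"""

print(listeners_on_port(SAMPLE, 8080))  # [('java', 4321)]
```

Matching on the local address column avoids the false hits that a plain grep 8080 can produce, such as a PID that happens to contain 8080.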

9. What is a Zombie Process and how do you kill it?

A Zombie process (marked with `Z` in top) is a process that has completed execution but its entry is still in the process table because the Parent process hasn't read its exit code.

You cannot kill a Zombie directly because it's already dead! It disappears only when its exit code is collected. If the parent never does that, the usual fix is to kill the Parent Process: the zombie is then re-parented to init (PID 1), which reaps it.
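
The life cycle is easy to reproduce. The following is a minimal sketch, assuming a Unix-like system with os.fork; it reads the process state from /proc, which exists on Linux only, so on other systems the state is reported as None. The function names are made up for this example.

```python
import os
import time


def read_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat") as handle:
            data = handle.read()
    except OSError:
        return None
    # The state letter is the first field after the "(command)" part.
    return data.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie():
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit at once with a recognizable code

    # Parent: the child stays a zombie until its exit code is collected.
    state = None
    deadline = time.monotonic() + 2.0
    while time.monotonic() < deadline:
        state = read_state(pid)
        if state in ("Z", None):
            break
        time.sleep(0.01)

    _, status = os.waitpid(pid, 0)  # reading the exit code reaps the zombie
    return {
        "state_before_reap": state,
        "exit_code": os.WEXITSTATUS(status),
        "state_after_reap": read_state(pid),
    }


print(make_and_reap_zombie())
```

On Linux the state before reaping is Z, and after os.waitpid the entry is gone from the process table.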

10. How do you check live logs of a service?

I use the tail command. For example: tail -f /var/log/nginx/error.log. For a service managed by systemd, journalctl -u nginx -f does the same job.

The -f flag stands for "follow", which streams the logs in real-time to my screen as they are written.

3. AWS Cloud (Real Scenarios)

11. I created a web server in a Public Subnet, but I cannot access it from the internet. Why?
VPC Troubleshooting

I would check the "Path to the Internet" in this order:

  1. Internet Gateway: Does the VPC have an Internet Gateway (IGW) attached?
  2. Route Table: Does the Public Subnet's Route Table have a route 0.0.0.0/0 -> IGW?
  3. Security Group: Does the EC2 Security Group allow Inbound traffic on Port 80 (HTTP) from 0.0.0.0/0?
  4. NACL: Is the Network ACL blocking traffic at the subnet level?
  5. Public IP: Did the instance actually get a Public IP address assigned?

12. How do you reduce AWS costs for a Dev environment?
Cost Optimization

I implement a few strategies:

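  • Auto-Stop: I use AWS Lambda on a schedule to automatically stop all Dev EC2 instances at 7 PM on Friday and start them at 8 AM on Monday. This cuts about 36% of the weekly instance-hours (61 of 168), although attached EBS volumes are still billed while the instances are stopped.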
  • Spot Instances: For stateless workloads (like Jenkins agents or batch processing), I use Spot Instances which are up to 90% cheaper.
  • S3 Lifecycle: I set up rules to move old logs to S3 Glacier Deep Archive after 30 days.
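
As a rough check on the Auto-Stop schedule, the share of the week during which the instances are stopped can be worked out directly. The sketch counts instance-hours only and ignores storage and any other charges; the function name is made up for this example.

```python
HOURS_PER_WEEK = 7 * 24


def stopped_hours_per_week(stop_day, stop_hour, start_day, start_hour):
    """Hours between stop and start. Days are 0=Monday ... 6=Sunday, hours 0-23."""
    stop_at = stop_day * 24 + stop_hour
    start_at = start_day * 24 + start_hour
    # The modulo handles a schedule that wraps around the end of the week.
    return (start_at - stop_at) % HOURS_PER_WEEK


# Stop on Friday at 19:00, start on Monday at 08:00.
stopped = stopped_hours_per_week(4, 19, 0, 8)
print(stopped, f"{stopped / HOURS_PER_WEEK:.1%}")  # 61 36.3%
```

Stopping the instances on weeknights as well pushes the figure considerably higher.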

13. Explain the difference between Security Group and NACL.

Think of Security Group (SG) as the Doorman for the House (Instance). It is Stateful—if the doorman lets a guest in, he remembers them and lets them out automatically.

Think of NACL as the Guard at the Street Gate (Subnet). It is Stateless—the guard checks your ID when you enter AND when you leave. You must explicitly allow traffic in both directions.
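
The stateful versus stateless difference can be shown with a toy model. This is only an illustration of the idea, not how AWS implements either feature; the class names, the addresses and the port numbers are made up for this example.

```python
class StatefulFilter:
    """Security-Group-like: replies to traffic that was allowed in pass automatically."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.tracked = set()  # (client_ip, client_port) of accepted connections

    def inbound(self, client_ip, client_port, dest_port):
        if dest_port in self.inbound_ports:
            self.tracked.add((client_ip, client_port))
            return True
        return False

    def outbound_reply(self, client_ip, client_port):
        return (client_ip, client_port) in self.tracked


class StatelessFilter:
    """NACL-like: each direction is checked against its own rules."""

    def __init__(self, inbound_ports, outbound_port_ranges):
        self.inbound_ports = set(inbound_ports)
        self.outbound_port_ranges = list(outbound_port_ranges)

    def inbound(self, client_ip, client_port, dest_port):
        return dest_port in self.inbound_ports

    def outbound_reply(self, client_ip, client_port):
        return any(low <= client_port <= high for low, high in self.outbound_port_ranges)


sg = StatefulFilter({80})
nacl = StatelessFilter({80}, outbound_port_ranges=[])

print(sg.inbound("203.0.113.7", 50000, 80), sg.outbound_reply("203.0.113.7", 50000))      # True True
print(nacl.inbound("203.0.113.7", 50000, 80), nacl.outbound_reply("203.0.113.7", 50000))  # True False
```

The second line is the classic NACL mistake: the request gets in, but the reply is dropped until an outbound rule for the client's ephemeral port range (for example 1024-65535) is added.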

14. How do you securely store Database passwords?

We never store passwords in plain text in code or in Git. We use AWS Secrets Manager or Systems Manager (SSM) Parameter Store.

The application retrieves the password at runtime using the AWS SDK or an environment variable injection, ensuring the secret never touches the disk.
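
On the application side, the environment-variable route can look like the sketch below. It assumes the platform (for example the container orchestrator) has already injected the secret into the process environment; the names read_secret and MissingSecretError are made up for this example.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def read_secret(name, environ=None):
    """Fetch a secret at runtime and fail fast when it is missing."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        # Report only the name of the variable, never its value.
        raise MissingSecretError(f"{name} is not set")
    return value


# Illustrative stand-in for the real process environment.
fake_environment = {"DB_PASSWORD": "example-only"}
password = read_secret("DB_PASSWORD", fake_environment)
print(len(password) > 0)  # True
```

Failing at startup with a clear message is much easier to debug than a connection error deep inside the application.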

15. What is the difference between Horizontal and Vertical Scaling?

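Vertical Scaling (Scale Up): Increasing the size of the machine (e.g., t2.micro to t2.large). On EC2 the instance has to be stopped and started, so there is downtime, and you are capped by the largest instance type available.

Horizontal Scaling (Scale Out): Adding more machines (e.g., going from 2 servers to 5 servers). This is done using Auto Scaling Groups behind a load balancer. Capacity can be added without downtime and the ceiling is far higher, provided the application is stateless enough to run on many servers.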

4. Docker (Real Scenarios)

16. Why does a Docker container exit immediately after starting?
Container Fundamentals

A container is not a Virtual Machine; it is just a process. It only stays alive as long as the Main Process (PID 1) is running.

If you run a container with a command like echo "Hello", it prints hello and finishes. Since the task is done, the container exits. To keep it running, you need a long-running process like a Web Server (Nginx) or use sleep infinity for debugging.

17. How do you optimize Docker Image size?

Small images are faster to pull and more secure. My techniques:

  1. Use Alpine: Switch from ubuntu (roughly 70-80 MB) to alpine (roughly 5-8 MB) base images.
  2. Multi-Stage Builds: Use a large image with compilers to build the app, then copy only the executable binary to a tiny runtime image. This discards all the source code and build tools.
  3. .dockerignore: Exclude files like `.git`, `node_modules`, and local logs from the build context.

18. How do containers communicate with each other?

By default, they can talk via IP, but IPs change. The best way is to create a User-Defined Bridge Network.

docker network create my-net

When containers are on the same custom network, they can reach each other using their Container Name as a DNS hostname (e.g., `ping db-container`).

5. Kubernetes (Real Scenarios)

19. A Pod status shows 'CrashLoopBackOff'. What does this mean and how do you fix it?
K8s Debugging

This means the Pod starts, crashes, K8s restarts it, and it crashes again in a loop.

Fix: I immediately check the logs using kubectl logs [pod-name], adding --previous to see the output of the container that just crashed. Usually, it's an application error (like a missing database password or a syntax error in code). If logs are empty, I check kubectl describe pod to see if it ran out of memory (OOMKilled).
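
The "BackOff" part of the name refers to the growing delay between restarts. The sketch below assumes the kubelet defaults described in the Kubernetes documentation (a delay that starts at 10 seconds, doubles each time and is capped at 5 minutes); the function name is made up for this example.

```python
def crashloop_delays(restarts, initial=10, cap=300):
    """Seconds waited before each restart: exponential back-off with a cap."""
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a Pod that has been crashing for a while seems to sit idle for minutes before it tries again.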

20. Explain the difference between ClusterIP, NodePort, and LoadBalancer.

  • ClusterIP: The default. Exposes the service on an internal IP. Only reachable from inside the cluster. Good for Databases.
  • NodePort: Opens a specific port (e.g., 30005) on every Worker Node. You can access it via NodeIP:30005. Good for testing.
  • LoadBalancer: Provisions a real Cloud Load Balancer (on AWS, a Classic or Network Load Balancer; an ALB is normally created through an Ingress) to expose the app to the internet. Good for Production Web Apps.

21. What is the difference between Deployment and StatefulSet?

Deployment: Used for "Stateless" apps like Web Servers. Pods are interchangeable. If Pod-ABC dies, Pod-XYZ replaces it. They don't have unique identities.

StatefulSet: Used for Databases. Pods have fixed identities (db-0, db-1). Order matters (db-0 starts before db-1). Storage is sticky: if db-0 dies, the new db-0 reattaches to the exact same persistent volume.

22. What is Ingress?

If I have 10 microservices, I don't want to buy 10 Load Balancers (expensive!).

Ingress is a smart router. The rules are carried out by an Ingress Controller (for example NGINX), which sits behind one Load Balancer and routes traffic based on those rules. For example, domain.com/cart goes to Cart Service, and domain.com/user goes to User Service.
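
The routing decision itself is simple to picture. Here is a toy sketch of path-prefix routing where the longest matching prefix wins; it is an illustration only, not the code of any real Ingress Controller, and the service names are made up for this example.

```python
def route(path, rules, default=None):
    """Pick the backend whose path prefix matches; the longest prefix wins."""
    best = None
    for prefix, service in rules.items():
        # Match whole path segments, so /cart does not capture /cartoon.
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, service)
    return best[1] if best else default


RULES = {"/cart": "cart-service", "/user": "user-service"}

print(route("/cart/items/42", RULES))  # cart-service
print(route("/user", RULES))           # user-service
print(route("/cartoon", RULES))        # None
```

Requests that match no rule fall through to the default, which in a real cluster would be a default backend or a 404.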

6. Terraform (IaC)

23. What is Terraform State and why do we lock it?

The State File (`terraform.tfstate`) is the map. It tells Terraform what resources exist in the cloud.

We Lock it (for example, with a DynamoDB table alongside an S3 backend) to prevent race conditions. Imagine two developers run terraform apply at the exact same second. Without locking, they would both try to write to the state file and corrupt it. Locking ensures only one person can modify infrastructure at a time.
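
The idea behind the lock can be shown with a local file. This is an analogy, assuming an atomic "create only if it does not exist" operation; it is not Terraform's actual implementation, and the function names are made up for this example.

```python
import os
import tempfile


class LockHeldError(RuntimeError):
    """Raised when somebody else already holds the state lock."""


def acquire_lock(lock_path, owner):
    """Create the lock file atomically; fail if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as handle:
            holder = handle.read()
        raise LockHeldError(f"state is locked by {holder}") from None
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)


def release_lock(lock_path):
    os.remove(lock_path)


with tempfile.TemporaryDirectory() as workdir:
    lock = os.path.join(workdir, "state.lock")
    acquire_lock(lock, "alice")
    try:
        acquire_lock(lock, "bob")
    except LockHeldError as error:
        print(error)  # state is locked by alice
    release_lock(lock)
    acquire_lock(lock, "bob")  # succeeds once alice has released the lock
```

The second apply is rejected instead of silently overwriting the first one, which is exactly the behaviour you see as an "Error acquiring the state lock" message in Terraform.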

24. Someone manually deleted an EC2 instance. What happens when you run Terraform?
Drift Detection

This is called Drift. When I run terraform plan, Terraform refreshes the state and realizes "Hey, the State says there should be a server, but AWS says it's gone."

Terraform will show a plan to Create that resource again to bring the reality back in sync with the code.
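
The comparison behind that plan can be sketched with sets. This is a simplified model that tracks resource addresses only and ignores attribute changes; the function names and resource addresses are made up for this example.

```python
def refresh(recorded_state, actual_resources):
    """Drop from the state anything that no longer exists in the cloud."""
    return {address for address in recorded_state if address in actual_resources}


def plan_creates(configuration, refreshed_state):
    """Resources the code asks for that the refreshed state no longer has."""
    return sorted(set(configuration) - refreshed_state)


configuration = {"aws_instance.web", "aws_s3_bucket.logs"}
recorded_state = {"aws_instance.web", "aws_s3_bucket.logs"}
actual_resources = {"aws_s3_bucket.logs"}  # the instance was deleted by hand

refreshed = refresh(recorded_state, actual_resources)
print(plan_creates(configuration, refreshed))  # ['aws_instance.web']
```

The code is the source of truth: because the instance is still declared in the configuration, the plan proposes to create it again.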

Note: We have included 25 high-detail scenarios here. After completing the course, we will share all possible interview questions with practical answers including edge cases for Prometheus, Grafana, and Ansible.