Python Scenario-Based Interview Questions

Question: Suppose you have a 10 GB text file. Write a function that reads the file line by line without loading the entire file into memory.
Answer:

def read_large_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield line.strip()

🧠 How It Works

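open(file_path, 'r', encoding='utf-8'): Opens the file in read-only text mode and decodes it as UTF-8.

for line in file: Iterates over the file object, which reads buffered chunks and hands back one line at a time, so the whole file is never held in memory.

yield line.strip(): Makes this a generator function, so callers can iterate lazily through the lines. strip() removes the trailing newline along with any other leading or trailing whitespace.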
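To show how a caller consumes the generator, here is a minimal usage sketch. The helper count_non_empty_lines and the small temporary file are illustrative assumptions standing in for a real 10 GB file; only a running counter is kept in memory.

```python
import os
import tempfile


def read_large_file(file_path):
    """Yield one stripped line at a time; the file is never fully loaded."""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield line.strip()


def count_non_empty_lines(file_path):
    """Consume the generator lazily, keeping only a running count."""
    return sum(1 for line in read_large_file(file_path) if line)


# Small stand-in file for demonstration; a large file is read the same way.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\n\nsecond line\n")
    demo_path = tmp.name

non_empty = count_non_empty_lines(demo_path)
os.remove(demo_path)
print(non_empty)  # 2
```

Because the generator yields lines on demand, the same pattern works with any per-line processing, such as filtering or parsing, without changing memory usage.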

Question: You receive log files with millions of lines in the following format:
[2025-05-30 10:24:55] ERROR: Service failed due to timeout.
Write a Python script to count how many ERROR messages occurred per day.

Answer:

from collections import defaultdict
import re

log_counts = defaultdict(int)

with open("server_logs.txt", "r") as file:
    for line in file:
        # Capture the date part of the leading "[YYYY-MM-DD HH:MM:SS]" timestamp.
        match = re.match(r"\[(\d{4}-\d{2}-\d{2})", line)
        if match and "ERROR" in line:
            date = match.group(1)
            log_counts[date] += 1

for date, count in sorted(log_counts.items()):
    print(f"{date}: {count} ERRORs")
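The script reads from a hard-coded path, which makes it awkward to test. Below is a minimal sketch of the same counting logic as a function over any iterable of lines. The function name count_errors_per_day and the stricter pattern are assumptions, not part of the original answer: the pattern requires ERROR to be the log level right after the timestamp, so a line that merely mentions ERROR in its message is not counted.

```python
import re
from collections import Counter

# Assumed line layout: "[YYYY-MM-DD HH:MM:SS] LEVEL: message"
ERROR_LINE = re.compile(r"\[(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}\] ERROR:")


def count_errors_per_day(lines):
    """Return a Counter mapping each date string to its number of ERROR lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[2025-05-30 10:24:55] ERROR: Service failed due to timeout.",
    "[2025-05-30 10:25:01] INFO: Retrying after ERROR.",
    "[2025-05-30 11:02:13] ERROR: Service failed due to timeout.",
    "[2025-05-31 00:00:07] ERROR: Disk quota exceeded.",
]

errors_per_day = count_errors_per_day(sample)
for date, count in sorted(errors_per_day.items()):
    print(f"{date}: {count} ERRORs")
```

Since an open file is itself an iterable of lines, the function can be called directly on a file object, so a log with millions of lines is still processed one line at a time.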

Question: You built a data pipeline where you want to capture specific errors, such as missing columns in a DataFrame. How would you implement custom exception handling?

Answer:

import pandas as pd

class MissingColumnError(Exception):
    pass

def validate_columns(df, required_columns):
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise MissingColumnError(f"Missing columns: {', '.join(missing)}")

df = pd.DataFrame({"id": [1, 2], "amount": [100, 200]})
try:
    validate_columns(df, ["id", "amount", "timestamp"])
except MissingColumnError as e:
    print(f"Data validation failed: {e}")
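A common refinement is to store the missing column names on the exception itself, so calling code can react to them without parsing the message. The sketch below is one way to do that, under two assumptions: the missing attribute is an illustrative addition, and the validator takes a plain list of column names (for a DataFrame, df.columns would be passed) so the example runs without pandas.

```python
class MissingColumnError(Exception):
    """Raised when required columns are absent; keeps the names for callers."""

    def __init__(self, missing):
        self.missing = list(missing)
        super().__init__(f"Missing columns: {', '.join(self.missing)}")


def validate_columns(available_columns, required_columns):
    """Raise MissingColumnError if any required column is not available."""
    available = set(available_columns)
    missing = [col for col in required_columns if col not in available]
    if missing:
        raise MissingColumnError(missing)


try:
    validate_columns(["id", "amount"], ["id", "amount", "timestamp"])
except MissingColumnError as e:
    caught = e  # keep a reference; "e" is cleared when the except block ends
    print(f"Data validation failed: {e}")
    print(e.missing)
```

With the names available as a list, a pipeline could, for example, log them as structured data or decide to fill optional columns with defaults instead of failing.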

Question: Given a list of transactions, group them by user_id and calculate the total transaction amount per user.

transactions = [
    {"user_id": 1, "amount": 100},
    {"user_id": 2, "amount": 200},
    {"user_id": 1, "amount": 150},
]

Answer:

from collections import defaultdict

user_totals = defaultdict(int)

for tx in transactions:
    user_totals[tx["user_id"]] += tx["amount"]

print(dict(user_totals))  # {1: 250, 2: 200}
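If a follow-up asks which users spent the most, the same loop works with collections.Counter, whose most_common() method returns the totals sorted in descending order. A minimal sketch, reusing the sample transactions from the question; the variable name ranked is illustrative.

```python
from collections import Counter

transactions = [
    {"user_id": 1, "amount": 100},
    {"user_id": 2, "amount": 200},
    {"user_id": 1, "amount": 150},
]

user_totals = Counter()
for tx in transactions:
    # Missing keys start at 0, just as with defaultdict(int).
    user_totals[tx["user_id"]] += tx["amount"]

# List of (user_id, total) pairs, largest total first.
ranked = user_totals.most_common()
print(ranked)  # [(1, 250), (2, 200)]
```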

Question: How do you calculate age from a birth date in PySpark?

Answer:

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Reuse the active session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    data=[
        (1, "foo", datetime.strptime("1999-12-19", "%Y-%m-%d"), datetime.strptime("1999-12-19", "%Y-%m-%d").date()),
        (2, "bar", datetime.strptime("1989-12-14", "%Y-%m-%d"), datetime.strptime("1989-12-14", "%Y-%m-%d").date()),
    ],
    schema=T.StructType([
        T.StructField("id", T.IntegerType(), True),
        T.StructField("name", T.StringType(), True),
        T.StructField("birth_ts", T.TimestampType(), True),
        T.StructField("birth_date", T.DateType(), True),
    ]),
)

# Approximate age in whole years: days elapsed divided by the average year length.
df = df.withColumn("age_ts", F.floor(F.datediff(F.current_timestamp(), F.col("birth_ts")) / 365.25))
df = df.withColumn("age_date", F.floor(F.datediff(F.current_date(), F.col("birth_date")) / 365.25))
df.show()
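Dividing a day count by 365.25 is an approximation and can be off by one on or near a birthday. For spot-checking individual values, a calendar-exact age can be sketched in plain Python. The function name age_in_years and the fixed dates are illustrative assumptions, and the code runs without Spark.

```python
import math
from datetime import date


def age_in_years(birth_date, today):
    """Calendar-exact age: subtract one if this year's birthday has not happened yet."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


birth = date(2001, 1, 1)
first_birthday = date(2002, 1, 1)

exact_age = age_in_years(birth, first_birthday)
# Same approach as the day-count division: 365 days / 365.25 is just under 1.
approx_age = math.floor((first_birthday - birth).days / 365.25)
print(exact_age, approx_age)  # 1 0
```

In PySpark itself, a similar calendar-based result is often obtained by taking the floor of F.months_between(F.current_date(), F.col("birth_date")) divided by 12; treat that as a suggestion to verify against your Spark version rather than part of the original answer.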

* I will update this list daily.