Python Scenario-Based Interview Questions
Question: Suppose you have a 10 GB text file. Write a function to read the file line by line without loading the entire file into memory.
Answer:
def read_large_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield line.strip()
How It Works
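open(file_path, 'r', encoding='utf-8'): Opens the file in read-only text mode and decodes it as UTF-8.
for line in file: A file object is a lazy iterator, so each pass of the loop reads only the next line (through a small internal buffer) instead of the whole file.
yield line.strip(): Makes this a generator function, so callers can iterate lazily through the lines. strip() removes the trailing newline along with any other leading or trailing whitespace.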
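As a usage sketch, the generator can be consumed in a loop or handed to anything that accepts an iterable. The example below writes a small temporary file as a stand-in for the 10 GB file and counts its non-blank lines; the helper name count_nonempty_lines is hypothetical and not part of the original answer.

```python
import os
import tempfile


def read_large_file(file_path):
    """Yield stripped lines one at a time; only one line is held in memory."""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield line.strip()


def count_nonempty_lines(file_path):
    """Hypothetical helper: count the lines that are not blank."""
    return sum(1 for line in read_large_file(file_path) if line)


# Small temporary file standing in for the large one.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\n\nsecond\nthird\n")
    sample_path = tmp.name

nonempty = count_nonempty_lines(sample_path)
print(nonempty)
os.remove(sample_path)
```

Because the generator expression pulls one line at a time, memory use stays flat no matter how large the file is.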
Question: You receive log files with millions of lines in the following format:
[2025-05-30 10:24:55] ERROR: Service failed due to timeout.
Write a Python script to count how many ERROR messages occurred per day.
Answer:
from collections import defaultdict
import re

log_counts = defaultdict(int)

with open("server_logs.txt", "r") as file:
    for line in file:
        # The opening bracket must be escaped; a bare "[" starts a character class.
        match = re.match(r"\[(\d{4}-\d{2}-\d{2})", line)
        if match and "ERROR" in line:
            date = match.group(1)
            log_counts[date] += 1

# Print the per-day totals in chronological order.
for date, count in sorted(log_counts.items()):
    print(f"{date}: {count} ERRORs")
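Note that the "ERROR" in line check also counts lines at other levels whose message merely mentions the word ERROR. A slightly stricter sketch, assuming the level always follows the timestamp exactly as in the sample line, captures the date and the level together. The function name count_errors_per_day and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Captures the date and the level from lines such as
# "[2025-05-30 10:24:55] ERROR: Service failed due to timeout."
LOG_PATTERN = re.compile(r"\[(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}\] (\w+):")


def count_errors_per_day(lines):
    """Count ERROR entries per date from any iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(2) == "ERROR":
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines in the format from the question.
sample = [
    "[2025-05-30 10:24:55] ERROR: Service failed due to timeout.",
    "[2025-05-30 11:00:01] INFO: Retrying after ERROR.",
    "[2025-05-31 09:15:00] ERROR: Disk full.",
    "[2025-05-30 23:59:59] ERROR: Connection reset.",
]
counts = count_errors_per_day(sample)
for date, count in sorted(counts.items()):
    print(f"{date}: {count} ERRORs")
```

Since the function takes any iterable, an open file object can be passed in directly, which keeps the line-by-line memory profile for very large logs.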
Question: You built a data pipeline in which you want to capture specific errors, such as missing columns in a DataFrame. How would you implement custom exception handling?
Answer:
import pandas as pd

class MissingColumnError(Exception):
    pass

def validate_columns(df, required_columns):
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise MissingColumnError(f"Missing columns: {', '.join(missing)}")

df = pd.DataFrame({"id": [1, 2], "amount": [100, 200]})

try:
    validate_columns(df, ["id", "amount", "timestamp"])
except MissingColumnError as e:
    print(f"Data validation failed: {e}")
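One possible extension, sketched here as an assumption and not as part of the answer above, is to store the missing column names on the exception so that calling code can react to them programmatically. To stay runnable without pandas, the sketch uses a SimpleNamespace with a columns attribute as a stand-in for a DataFrame; the missing attribute is a hypothetical addition.

```python
from types import SimpleNamespace


class MissingColumnError(Exception):
    """Raised when required columns are absent; keeps the names for callers."""

    def __init__(self, missing):
        # Hypothetical attribute, not in the original answer.
        self.missing = list(missing)
        super().__init__(f"Missing columns: {', '.join(self.missing)}")


def validate_columns(df, required_columns):
    """Raise MissingColumnError if any required column is not in df.columns."""
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise MissingColumnError(missing)


# Stand-in for a DataFrame: only the .columns attribute is used here.
fake_df = SimpleNamespace(columns=["id", "amount"])

try:
    validate_columns(fake_df, ["id", "amount", "timestamp"])
except MissingColumnError as e:
    caught = e  # keep a reference; "e" is cleared when the except block ends
    print(f"Data validation failed: {e}")
    print(f"Columns to backfill: {e.missing}")
```

With the names available as a list, a pipeline could, for example, decide to fill optional columns with defaults and fail only on mandatory ones.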
Question: Given a list of transactions, group them by user_id and calculate the total transaction amount per user.
transactions = [
    {"user_id": 1, "amount": 100},
    {"user_id": 2, "amount": 200},
    {"user_id": 1, "amount": 150},
]
Answer:
from collections import defaultdict

user_totals = defaultdict(int)
for tx in transactions:
    user_totals[tx["user_id"]] += tx["amount"]

print(dict(user_totals))  # {1: 250, 2: 200}
Question: How do you calculate age from a birth date in PySpark?
Answer:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Reuse the active session or create one; the original snippet assumed `spark` already existed.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    data=[
        (1, "foo", datetime.strptime("1999-12-19", "%Y-%m-%d"), datetime.strptime("1999-12-19", "%Y-%m-%d").date()),
        (2, "bar", datetime.strptime("1989-12-14", "%Y-%m-%d"), datetime.strptime("1989-12-14", "%Y-%m-%d").date()),
    ],
    schema=T.StructType([
        T.StructField("id", T.IntegerType(), True),
        T.StructField("name", T.StringType(), True),
        T.StructField("birth_ts", T.TimestampType(), True),
        T.StructField("birth_date", T.DateType(), True),
    ]),
)

# Approximate age in years: days since birth divided by the average year length.
df = df.withColumn("age_ts", F.floor(F.datediff(F.current_timestamp(), F.col("birth_ts")) / 365.25))
df = df.withColumn("age_date", F.floor(F.datediff(F.current_date(), F.col("birth_date")) / 365.25))
df.show()
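Dividing a day count by 365.25 is an approximation, and it can be off by one on or near a birthday. The plain-Python sketch below (no Spark required; both function names are hypothetical) compares that arithmetic with a calendar-based calculation. In PySpark, one calendar-based option is F.floor(F.months_between(F.current_date(), F.col("birth_date")) / 12), though its handling of month-end dates such as February 29 is worth checking against your own requirements.

```python
from datetime import date
from math import floor


def age_by_day_count(birth, today):
    """Mirror of floor(datediff / 365.25): approximate age in years."""
    return floor((today - birth).days / 365.25)


def age_by_calendar(birth, today):
    """Calendar-based age: subtract one if this year's birthday has not occurred yet."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


birth = date(2001, 3, 1)
first_birthday = date(2002, 3, 1)

# No leap day falls between these dates, so they are 365 days apart
# and 365 / 365.25 is just under 1.
approx = age_by_day_count(birth, first_birthday)
exact = age_by_calendar(birth, first_birthday)
print(approx, exact)
```

On the first birthday the day-count version still reports the previous age, while the calendar version has already moved on, which matters when age thresholds are enforced to the day.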
* I will update this list on a daily basis.