Python for AI: Data Cleaning interview questions

Q1. You have a DataFrame with duplicate rows. Identify duplicates using duplicated() and drop them using drop_duplicates().
duplicates = df.duplicated()
df_clean = df.drop_duplicates()   # keeps first occurrence
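# or df.drop_duplicates(subset=['col1', 'col2'], keep='last')
df.duplicated() returns a boolean Series that is True for every row repeating an earlier one. Duplicates can skew analysis because repeated records are over-weighted in counts and averages.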
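As a minimal sketch of how the keep and subset options change the result, the example below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not from a real dataset):

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 are identical across all columns.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "a"],
    "col2": [1, 1, 2, 1],
    "col3": [10, 10, 20, 30],
})

# True only for rows that repeat an earlier row (default keep="first").
dup_mask = df.duplicated()

# keep=False marks every member of a duplicate group, useful for inspection.
all_copies = df[df.duplicated(keep=False)]

# Drop full-row duplicates, keeping the first occurrence.
df_first = df.drop_duplicates()

# Compare only col1 and col2, keeping the last row of each group.
df_last = df.drop_duplicates(subset=["col1", "col2"], keep="last")

print(dup_mask.tolist())
print(df_first.index.tolist())
print(df_last.index.tolist())
```

Inspecting the rows selected with keep=False before dropping anything is a cheap way to confirm that the duplicates really are redundant records.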

Q2. The 'salary' column has outliers. Replace values greater than 200,000 with 200,000 using boolean indexing.
df.loc[df['salary'] > 200000, 'salary'] = 200000
# or: df['salary'] = df['salary'].clip(upper=200000)
Capping values at a fixed threshold is a common way to limit the influence of outliers without dropping rows.
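The two approaches can be checked against each other. The sketch below assumes a small made-up salary column and confirms that boolean indexing and clip give the same capped values:

```python
import pandas as pd

# Hypothetical salaries; two values exceed the 200,000 cap.
df = pd.DataFrame({"salary": [45000, 120000, 250000, 200000, 999999]})
CAP = 200000

# Count how many rows will be changed before modifying anything.
n_capped = int((df["salary"] > CAP).sum())

# Approach 1: boolean indexing with .loc on a copy.
capped_loc = df.copy()
capped_loc.loc[capped_loc["salary"] > CAP, "salary"] = CAP

# Approach 2: clip with an upper bound.
capped_clip = df.copy()
capped_clip["salary"] = capped_clip["salary"].clip(upper=CAP)

print(n_capped)
print(capped_loc["salary"].tolist())
print(capped_loc.equals(capped_clip))
```

Working on a copy keeps the raw values available, which helps when you later need to report how many records were capped.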

Q3. You have a column 'date' with dtype object (string). Convert it to datetime and then extract year, month, and day into new columns.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
pd.to_datetime parses many common formats; pass format= when the layout is known, or errors='coerce' to turn unparseable values into NaT. Use df['date'].dt.dayofweek for the weekday (Monday is 0).
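Real date columns often contain a few unparseable entries. The sketch below, using made-up date strings, assumes an ISO-style layout and shows how errors="coerce" turns a bad value into NaT instead of raising:

```python
import pandas as pd

# Hypothetical input: two valid ISO dates and one bad value.
df = pd.DataFrame({"date": ["2024-01-15", "2023-12-31", "not a date"]})

# An explicit format avoids ambiguous parsing; bad values become NaT.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# The .dt accessor extracts components; rows with NaT yield NaN.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.dayofweek  # Monday is 0, Sunday is 6

n_bad = int(df["date"].isna().sum())
print(df)
print(n_bad)
```

Because the NaT row produces NaN, the extracted columns come back as floats here; counting the NaT rows first tells you whether to fix or drop them before converting to integers.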

Q4. The 'category' column has inconsistent text: 'Electronics', 'electronics', 'Electronics ' (trailing space). Clean it to proper case and strip spaces.
df['category'] = df['category'].str.strip().str.capitalize()
This chain strips leading and trailing whitespace, then uppercases the first character and lowercases the rest. For multi-word values, .str.title() capitalizes every word; for all lower case, use .str.lower(). String methods in pandas are vectorized.
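The difference between capitalize and title only shows up with multi-word values. The sketch below uses a made-up category column (the 'home appliances' entries are an illustrative assumption) to compare the two:

```python
import pandas as pd

# Hypothetical categories with mixed case and stray spaces.
df = pd.DataFrame({
    "category": [
        "Electronics",
        "electronics",
        "Electronics ",
        "home appliances",
        " HOME Appliances",
    ]
})

n_before = df["category"].nunique()

# Strip whitespace once, then compare the two casing options.
stripped = df["category"].str.strip()
capitalized = stripped.str.capitalize()  # only the first character is upper case
titled = stripped.str.title()            # every word starts with upper case

n_after = titled.nunique()

print(n_before, n_after)
print(capitalized.unique().tolist())
print(titled.unique().tolist())
```

Comparing nunique() before and after is a quick sanity check that the cleaning merged the variants you expected and nothing else.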

Q5. You have a dataset with NaN values in both numeric and categorical columns. Use SimpleImputer from sklearn to fill numeric with median and categorical with most frequent.
from sklearn.impute import SimpleImputer
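num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
df_num = num_imputer.fit_transform(df[['age', 'income']])   # NumPy array by default
df_cat = cat_imputer.fit_transform(df[['city']])            # NumPy array by default
fit_transform returns NumPy arrays by default, so assign the results back to the DataFrame columns if you want to keep working in pandas. In an ML workflow, fit the imputers on the training data only and reuse them to transform the test data.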
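A minimal end-to-end sketch, assuming a small made-up DataFrame with the same 'age', 'income', and 'city' columns, writes the imputed values back into the original frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "city": pd.Series(["Pune", np.nan, "Delhi", "Pune"], dtype=object),
})

num_cols = ["age", "income"]
cat_cols = ["city"]

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns arrays; assigning them back keeps the DataFrame layout.
df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

n_missing = int(df.isna().sum().sum())
print(df)
print(n_missing)
```

Keeping the column lists in variables makes it easy to apply the same fitted imputers to new data later with transform instead of fit_transform.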