
python-for-ai / Data Cleaning — interview

Q1. Scenario: You have a DataFrame with duplicate rows. Identify duplicates using duplicated() and drop them using drop_duplicates().
duplicates = df.duplicated() returns a boolean Series marking rows that repeat an earlier row; df_clean = df.drop_duplicates() removes them, keeping the first occurrence by default. Use df.drop_duplicates(subset=['col1', 'col2'], keep='last') to deduplicate on a subset of columns and keep the last occurrence instead. Duplicates can skew counts and aggregations.
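A minimal sketch of both calls, using a small made-up DataFrame (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical data with one exact duplicate row.
df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 1, 2]})

# Boolean mask: True for every row that repeats an earlier one.
mask = df.duplicated()
print(mask.tolist())  # [False, True, False]

# Drop duplicates, keeping the first occurrence (the default).
df_clean = df.drop_duplicates()

# Deduplicate on chosen columns, keeping the last occurrence.
df_subset = df.drop_duplicates(subset=["col1", "col2"], keep="last")
```

Note that drop_duplicates returns a new DataFrame; the original is unchanged unless you assign back or pass inplace=True.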

Q2. Scenario: The 'salary' column has outliers. Replace values greater than 200,000 with 200,000 using boolean indexing.
df.loc[df['salary'] > 200000, 'salary'] = 200000. Alternatively, Series.clip does the same cap in one call: df['salary'] = df['salary'].clip(upper=200000). Capping (winsorizing) is a standard technique for limiting the influence of outliers.
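Both approaches, sketched on a small made-up salary column (values are illustrative):

```python
import pandas as pd

# Hypothetical salary data with one outlier above the cap.
df = pd.DataFrame({"salary": [50_000, 250_000, 120_000]})

# Option 1: boolean indexing with .loc to overwrite values above the cap.
df.loc[df["salary"] > 200_000, "salary"] = 200_000

# Option 2: Series.clip caps in one call (idempotent here, same result).
df["salary"] = df["salary"].clip(upper=200_000)

print(df["salary"].tolist())  # [50000, 200000, 120000]
```

clip also takes a lower= bound, which is handy when outliers occur on both tails.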

Q3. Scenario: You have a 'date' column stored as dtype object (strings). Convert it to datetime and then extract year, month, and day into new columns.
df['date'] = pd.to_datetime(df['date']); df['year'] = df['date'].dt.year; df['month'] = df['date'].dt.month; df['day'] = df['date'].dt.day. pd.to_datetime infers many common formats. df['date'].dt.dayofweek gives the weekday (Monday = 0).
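The full conversion, sketched on two made-up ISO-format dates:

```python
import pandas as pd

# Hypothetical string dates stored as object dtype.
df = pd.DataFrame({"date": ["2023-01-15", "2023-06-30"]})

# Parse strings into datetime64; to_datetime infers the format.
df["date"] = pd.to_datetime(df["date"])

# The .dt accessor exposes datetime components as new columns.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.dayofweek  # Monday = 0, Sunday = 6

print(df[["year", "month", "day"]].values.tolist())  # [[2023, 1, 15], [2023, 6, 30]]
```

If the column may contain unparseable strings, pd.to_datetime(df["date"], errors="coerce") turns them into NaT instead of raising.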

Q4. Scenario: The 'category' column has inconsistent text: 'Electronics', 'electronics', 'Electronics ' (trailing spaces). Clean it to consistent case and strip the spaces.
df['category'] = df['category'].str.strip().str.capitalize(). The chain removes surrounding whitespace and capitalizes the first letter while lowercasing the rest. For lowercase, use .str.lower(). pandas string methods under the .str accessor are vectorized, so no explicit loop is needed.
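A minimal sketch showing the three inconsistent variants collapsing to one clean value:

```python
import pandas as pd

# The three messy variants from the scenario.
messy = pd.Series(["Electronics", "electronics", "Electronics "])

# Strip surrounding whitespace, then normalize case in one vectorized chain.
clean = messy.str.strip().str.capitalize()

print(clean.unique().tolist())  # ['Electronics']
```

For multi-word categories where each word should be capitalized (e.g. 'home appliances'), .str.title() is the usual choice instead of .str.capitalize().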

Q5. Scenario: You have a dataset with NaN values in both numeric and categorical columns. Use SimpleImputer from sklearn to fill numeric with median and categorical with most frequent.
from sklearn.impute import SimpleImputer; num_imputer = SimpleImputer(strategy='median'); cat_imputer = SimpleImputer(strategy='most_frequent'); df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']]); df[['city']] = cat_imputer.fit_transform(df[['city']]). Note that fit_transform returns a NumPy array, so assign it back to the DataFrame columns to keep working in pandas. This is a common step in ML preprocessing pipelines.
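A runnable sketch on a small made-up DataFrame (the column names 'age', 'income', and 'city' follow the answer above but the values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with NaN in both numeric and categorical columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50_000.0, 60_000.0, np.nan],
    "city": ["NY", np.nan, "NY"],
})

# Median for numeric columns, most frequent value for categorical ones.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns NumPy arrays; assign back to stay in pandas.
df[["age", "income"]] = num_imputer.fit_transform(df[["age", "income"]])
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df["age"].tolist())  # [25.0, 32.5, 40.0]
```

In a real pipeline you would fit the imputers on the training split only and call transform on the test split, so no test-set statistics leak into training.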