Q1. You have a DataFrame with duplicate rows. Identify duplicates using duplicated() and drop them using drop_duplicates().
duplicates = df.duplicated()  # boolean Series: True for each repeat of an earlier row
df_clean = df.drop_duplicates()  # keeps the first occurrence of each row
# or df.drop_duplicates(subset=['col1', 'col2'], keep='last')
Duplicate rows can skew counts and aggregates, so remove them before analysis.

Q2. The 'salary' column has outliers. Replace values greater than 200,000 with 200,000 using boolean indexing.
df.loc[df['salary'] > 200000, 'salary'] = 200000
# or: df['salary'] = df['salary'].clip(upper=200000)
Capping values at a fixed threshold like this is a common way to limit the influence of outliers.

Q3. The 'date' column has dtype object (strings). Convert it to datetime, then extract the year, month, and day into new columns.
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
pd.to_datetime parses many common date formats. Use df['date'].dt.dayofweek for the weekday (Monday=0, Sunday=6).

Q4. The 'category' column has inconsistent text: 'Electronics', 'electronics', 'Electronics ' (trailing space). Convert it to proper case and strip the spaces.
df['category'] = df['category'].str.strip().str.capitalize()
This chain strips surrounding whitespace, then capitalizes the first letter and lowercases the rest. For multi-word values, .str.title() capitalizes every word; for all lower case, use .str.lower(). Pandas string methods are vectorized, so they apply to the whole column at once.

Q5. You have a dataset with NaN values in both numeric and categorical columns. Use SimpleImputer from sklearn to fill numeric columns with the median and categorical columns with the most frequent value.
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
df_num = num_imputer.fit_transform(df[['age', 'income']])
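df_cat = cat_imputer.fit_transform(df[['city']])
Both calls return NumPy arrays rather than DataFrames. Imputing numeric and categorical columns separately like this is a routine preprocessing step before model training.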
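
Because fit_transform returns arrays, the imputed values still have to be written back into the DataFrame. The following is a minimal sketch of one way to do that, not the only approach. The sample data is made up for illustration, and the column names simply follow the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "city": ["Paris", np.nan, "Paris", "Lyon"],
})

num_cols = ["age", "income"]
cat_cols = ["city"]

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2-D NumPy array with one column per input column,
# so it can be assigned back to the same column selection.
df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

print(df)
```

In a real train/test workflow, the imputers would typically be fitted on the training split only and then applied to the test split with transform(), so that test statistics do not leak into the fill values.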