Data Analysis
data cleaning, data quality, ETL, analytics, data engineering

Data Cleaning Checklist

A dataset-specific cleaning checklist that catches structural errors, missing data, and outliers before they corrupt your analysis.

Prompt Template

Generate a comprehensive data cleaning checklist for the dataset described as: [DATASET_DESCRIPTION]. The dataset contains [NUMBER_OF_ROWS] rows and [NUMBER_OF_COLUMNS] columns. It will be used for: [INTENDED_ANALYSIS]. Known data quality issues: [KNOWN_ISSUES]. The checklist should cover:
(1) Structural checks: column names (standardized, no spaces, consistent case), data types (verify each column's type matches its content), duplicate rows (detection and removal criteria).
(2) Missing data: for each column, specify what % missing is acceptable, and the imputation or removal strategy.
(3) Outlier detection: which columns to check for outliers, the method to use (IQR, Z-score, domain-specific rules), and the threshold for flagging vs. removing.
(4) Consistency checks: cross-column validation rules (e.g., "end_date must be after start_date").
(5) Domain-specific checks: [DOMAIN_SPECIFIC_RULES, e.g., email format, phone number format, valid country codes].
(6) Documentation: what to log for each cleaning action (original value, new value, reason).
(7) Final validation: 3 sanity checks to run after cleaning to confirm the dataset is ready.
Format as a step-by-step checklist with checkboxes.
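
To make a few of the checks named in the template concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not part of the prompt. The sample rows, the column names (`id`, `amount`), and the helper names are all hypothetical, and the 1.5 IQR multiplier is a common convention rather than a rule from this page.

```python
from collections import Counter
from datetime import date
from statistics import quantiles


def row(id_, amount, start_day, end_day):
    """Build one hypothetical record (all dates fall in January 2024)."""
    return {
        "id": id_,
        "amount": amount,
        "start_date": date(2024, 1, start_day),
        "end_date": date(2024, 1, end_day),
    }


# Hypothetical sample data with a duplicate, a missing value,
# an inverted date range, and an extreme amount.
rows = [
    row(1, 10.0, 1, 5),
    row(2, 12.0, 2, 3),
    row(2, 12.0, 2, 3),   # exact duplicate of the previous row
    row(3, None, 3, 4),   # missing amount
    row(4, 11.0, 10, 1),  # end_date before start_date
    row(5, 500.0, 4, 6),  # suspiciously large amount
    row(6, 9.0, 5, 9),
    row(7, 13.0, 6, 8),
]


def missing_pct(rows, column):
    """Missing data: percentage of rows where `column` is None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return 100.0 * missing / len(rows)


def duplicate_rows(rows):
    """Structural check: rows that occur more than once (all fields equal)."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def iqr_outliers(values, k=1.5):
    """Outlier detection: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


def invalid_date_ranges(rows):
    """Consistency check: end_date must be after start_date."""
    return [r for r in rows if r["end_date"] <= r["start_date"]]


print(missing_pct(rows, "amount"))
print(duplicate_rows(rows))
print(iqr_outliers([r["amount"] for r in rows]))
print([r["id"] for r in invalid_date_ranges(rows)])
```

Each helper only flags candidates. Whether a flagged row is removed, imputed, or kept is the decision the generated checklist is meant to record, along with the reason.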

How to use this prompt

  1. Copy the prompt template above.
  2. Paste it into your preferred AI assistant (ChatGPT, Claude, Gemini, etc.).
  3. Replace all bracketed placeholders, such as [DATASET_DESCRIPTION], with your specific details.
  4. Send the prompt and refine the output as needed.
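
If you fill the template programmatically rather than by hand, the placeholder replacement step can be sketched as below. This is a hedged illustration: `fill_placeholders` and `unfilled` are hypothetical helper names, and the pattern assumes placeholders are upper-case names in square brackets. A placeholder that carries an inline hint, like the domain-specific rules one, will not match this simple pattern and still needs a manual edit.

```python
import re

# Matches placeholders such as [NUMBER_OF_ROWS]: upper-case letters and underscores.
PLACEHOLDER = re.compile(r"\[([A-Z][A-Z_]*)\]")


def fill_placeholders(template, values):
    """Replace each [NAME] with values[NAME]; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)


def unfilled(text):
    """List placeholder names that are still present in the text."""
    return PLACEHOLDER.findall(text)


# Excerpt of the template, with hypothetical values for two placeholders.
template = (
    "The dataset contains [NUMBER_OF_ROWS] rows and [NUMBER_OF_COLUMNS] columns. "
    "It will be used for: [INTENDED_ANALYSIS]."
)
filled = fill_placeholders(
    template, {"NUMBER_OF_ROWS": 50000, "NUMBER_OF_COLUMNS": 12}
)
print(filled)
print(unfilled(filled))
```

Checking `unfilled` before sending the prompt helps catch a placeholder that was overlooked.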