The most expensive mistake in AI implementation is discovering data problems after the project is already resourced and underway. Budget has been allocated, vendor agreements signed, timelines communicated to leadership — and then the data audit reveals that the training data is incomplete, inaccessible, or structured in a way that makes the intended use case infeasible.

A data readiness assessment, conducted before any architecture decisions are made, prevents this. It’s not a lengthy process — most assessments can be completed in two to three weeks — but it requires asking the right questions of the right people, with enough technical depth to get honest answers.

When to run a data readiness assessment: Before any AI initiative is formally scoped, budgeted, or vendor-evaluated. The output of the assessment should inform the project scope and timeline — not the other way around.

The Five Dimensions of Data Readiness

1. Volume

Machine learning models require sufficient training examples to generalize reliably. How much is enough depends on the problem type, model architecture, and the variability of your data — but there are useful rules of thumb.

For supervised classification problems, a common starting point is at least 1,000 labeled examples per class, and more if the classes are visually or semantically similar. For regression problems, the requirement depends heavily on the number of features and how strongly they are correlated. For large language model fine-tuning, requirements are lower: high-quality domain examples in the hundreds to low thousands can meaningfully improve performance.
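The per-class rule of thumb above is easy to check once the labels can be exported. The following is a minimal sketch, assuming the labels are available as a flat list; the function name, the threshold constant, and the example class names are illustrative, not part of any specific toolchain.

```python
from collections import Counter

MIN_PER_CLASS = 1000  # rule of thumb from the text; adjust for your problem


def classes_below_threshold(labels, min_per_class=MIN_PER_CLASS):
    """Return {class: shortfall} for every class with too few labeled examples."""
    counts = Counter(labels)
    return {
        cls: min_per_class - n
        for cls, n in counts.items()
        if n < min_per_class
    }


# Hypothetical label column from a ticket-routing dataset.
labels = ["billing"] * 1200 + ["outage"] * 950 + ["fraud"] * 40
print(classes_below_threshold(labels))  # {'outage': 50, 'fraud': 960}
```

A shortfall concentrated in one rare class (here the hypothetical "fraud" class) is usually a more important finding than the total row count.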

Volume questions to answer in your assessment:

2. Quality

Data quality problems compound in AI systems. A model trained on data with a 15% labeling error rate will learn many of those errors as if they were valid patterns. Quality issues manifest as systematic biases, unexpected failure modes, and models that perform well in testing but fail on real-world inputs that don’t match the quality profile of the training data.
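A sample-based audit can put rough numbers on these problems before any modeling starts. The sketch below is one possible shape for such an audit, not a complete checklist: the three checks (missing required fields, duplicate inputs, and duplicates whose labels disagree) are assumptions chosen for illustration, as are the `text` and `label` field names.

```python
import random


def audit_sample(records, required_fields, input_field="text",
                 label_field="label", sample_size=500, seed=0):
    """Sample records and report rates of a few common quality problems."""
    rng = random.Random(seed)  # fixed seed so the audit is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    n = len(sample)
    if n == 0:
        raise ValueError("no records to audit")

    # Records with any required field absent or empty.
    missing = sum(
        1 for r in sample
        if any(r.get(f) in (None, "") for f in required_fields)
    )

    # Repeated inputs, and repeats whose label differs from the first one seen.
    first_label = {}
    duplicates = 0
    conflicts = 0
    for r in sample:
        key = r.get(input_field)
        if key in first_label:
            duplicates += 1
            if first_label[key] != r.get(label_field):
                conflicts += 1
        else:
            first_label[key] = r.get(label_field)

    return {
        "sampled": n,
        "missing_rate": missing / n,
        "duplicate_rate": duplicates / n,
        "conflicting_label_rate": conflicts / n,
    }
```

Conflicting labels on identical inputs give a cheap lower bound on the labeling error rate; a manual review of the same sample is still needed to estimate the true rate.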

A practical quality audit samples records and checks for:

3. Accessibility

Data that exists but cannot be accessed at the volume and frequency required by an ML pipeline is not useful data. Accessibility issues are among the most common blockers we encounter — and among the most underestimated.

Common blockers: API rate limits, ERP export constraints, data warehouse performance issues, and manual export processes that can’t be automated.

Questions to answer: Can data be accessed programmatically? At what volume and frequency? What are the SLAs for data availability?
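The volume and frequency question often comes down to simple arithmetic on a rate limit. The sketch below estimates how long a full extract would take through a paginated API; the function name and all the figures are hypothetical, and real pulls will also be slowed by retries and latency.

```python
import math


def hours_to_extract(total_records, records_per_request, requests_per_minute):
    """Estimate wall-clock hours for a full pull under an API rate limit."""
    requests_needed = math.ceil(total_records / records_per_request)
    minutes = requests_needed / requests_per_minute
    return minutes / 60


# Hypothetical figures: 12M records, 500 per page, 60 requests per minute.
full_pull = hours_to_extract(12_000_000, 500, 60)
print(f"Full extract: {full_pull:.1f} hours")  # Full extract: 6.7 hours
```

If the pipeline needs a daily refresh and the estimate is close to or above 24 hours, the source is effectively inaccessible at the required frequency, whatever the documentation says.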

4. Governance and Compliance

Before you can use data for AI training, you need to confirm you’re permitted to. This is not a legal technicality. It’s a project risk that has killed AI initiatives at late stages, when compliance teams discover that the data being used to train models contains PII subject to GDPR, PHI subject to HIPAA, or commercially sensitive information that contractually cannot be used for machine learning.
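A quick pattern scan over free-text fields can surface obvious personal data early, before compliance review. The following is a rough first-pass sketch only: the regular expressions are deliberately simple assumptions that will miss many real formats and flag some false positives, and a clean result does not establish that data is cleared for use.

```python
import re

# Illustrative patterns only; not a substitute for a proper PII or PHI review.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_for_pii(texts):
    """Count how many texts contain at least one match for each pattern."""
    hits = {name: 0 for name in PII_PATTERNS}
    for text in texts:
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                hits[name] += 1
    return hits
```

Any non-zero count is a prompt to involve the compliance team before the data goes anywhere near a training pipeline.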

Key compliance checks:

5. Relevance

Relevance is the most overlooked dimension. Data can be voluminous, high quality, accessible, and legally cleared, and still be the wrong data for the intended use case. Relevance problems occur when the available data doesn’t reflect the conditions under which the model will operate, or when the target variable isn’t actually captured in the available records.

A demand forecasting model trained on pre-pandemic sales data will struggle to predict demand in the current environment. A customer churn model that relies on engagement signals won’t generalize to a customer segment that uses the product differently. Checking for relevance requires domain expertise, not just data engineering.
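Domain experts can be given something concrete to react to by comparing the distribution of a key feature in the historical data against a recent sample. The sketch below uses the Population Stability Index (PSI) with equal-width bins as one such measure; the choice of PSI, the bin count, and the small floor for empty bins are assumptions made here for illustration.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back when all reference values are equal

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps log() defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical weekly order volumes: historical data vs. a shifted recent sample.
historical = list(range(100))
recent = [v + 50 for v in historical]
print(population_stability_index(historical, recent) > 0.25)  # True
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a substantial shift, but the number only flags where to look. Whether the shift makes the data irrelevant for the use case remains a domain judgment.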

Prioritizing Remediation

Almost every data readiness assessment uncovers issues. The question is which ones to address before building, and which can be managed during the project.

Issues that must be addressed before build begins:

Issues that can be managed in parallel:

A thorough data readiness assessment takes two to three weeks and typically costs a small fraction of the project budget it’s protecting. The organizations that skip it don’t save time — they trade a known, manageable cost for an unknown, potentially project-ending discovery at the worst possible moment.

Build the assessment into every AI project, treat its findings as binding inputs to project scoping, and your AI initiatives will have a materially higher probability of reaching production on time and on budget.