

ICMR MIDAS 2.0 Framework

(Metric-based Integrity and Data Assessment System)

Dataset Quality and Trust Framework
(Technical Version)







Version 1.0 | Draft for Expert Validation

Date: 10.30.2025



Prepared by:

Indian Council of Medical Research (ICMR)

Division of Development Research

New Delhi, India



Document Purpose

The Technical Version of MIDAS 2.0 is a quantitative framework for assessing dataset quality, integrity, interoperability, and privacy. It is to be completed by the Nodal Centre using data from the Lite Version submitted by independent centres. The Nodal Centre may seek additional information from the submitting centre for validation and computation.


Confidential Draft – For Review Only

Please do not distribute without authorization from ICMR.

SECTION – A 

Basic Information

Dataset Title


Version / DOI / Handle


Submitting PI / Custodian


Date of Assessment


Assessor Name / Affiliation



What is MIDAS 2.0 (Technical Version)

MIDAS 2.0 (Metric-based Integrity and Data Assessment System) is a quantitative framework to evaluate the quality of biomedical and health datasets. The framework is based on fifteen domains, each focusing on different parameters contributing to the quality of a dataset. The framework can be implemented through two versions of the rubric: (i) the Lite Version and (ii) the Technical Version. This is the Technical Version, intended for nodal centres to evaluate and validate dataset quality based on the information submitted by the dataset custodian via the Lite Version. This version of the rubric includes detailed metrics, computation steps, and evidence-based validation requirements for calculating the Composite Quality Index (CQI) and Privacy-Risk Score (PRS).

How the MIDAS 2.0 scores are calculated

For a given dataset, quality is assessed across 15 domains. Each domain is scored from 0 to 4, where 0 = absent and 4 = exemplary. Each domain in the Technical Version corresponds to a domain in the Lite Version, supplemented with additional clarifications. Both the Lite Version of the rubric and the additional information are provided by the data custodians.

The Composite Quality Index (CQI) is computed as:

CQI = (Sum of domain scores ÷ Maximum possible score) × 100

Use the highest level at which all statements are true. If any statement at that level is missing, step down one level. Attach the requested evidence. If a domain is formally marked as “If Applicable” (see Domain 11), the CQI denominator must be reduced accordingly in the assessment record.
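
To make the computation auditable, the CQI can be scripted. The following is a minimal sketch, not an official ICMR tool: it computes the CQI from per-domain scores and excludes any "If Applicable" domain recorded as not applicable from the denominator.

```python
def compute_cqi(domain_scores: dict) -> float:
    """Composite Quality Index: (sum of domain scores / max possible) x 100.

    domain_scores maps domain name -> score in 0..4, or None when the
    domain is marked "If Applicable" (e.g. Domain 11) and does not apply,
    in which case it is excluded from the denominator.
    """
    applicable = {d: s for d, s in domain_scores.items() if s is not None}
    max_possible = 4 * len(applicable)  # each applicable domain is scored 0-4
    return 100 * sum(applicable.values()) / max_possible

# Illustrative: all 15 domains scored 3, except Domain 11 not applicable
scores = {f"domain_{i}": 3 for i in range(1, 16)}
scores["domain_11"] = None
print(round(compute_cqi(scores), 1))  # 42/56 * 100 = 75.0 -> Gold band
```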

The Privacy-Risk Score (PRS) is calculated separately on a scale of 0–100 using method detailed in Annexure I. PRS must be documented in the assessment record, including method, sensitivity class, and final score. 

How the Scores are Interpreted

Based on the Composite Quality Index (CQI), datasets can be classified into six performance bands as follows:

Aggregated CQI Band | Grade | Interpretation
≥ 95 | Diamond | Global exemplar; candidate for reference standard
85–94 | Platinum | Best-practice dataset
70–84 | Gold | High-quality dataset
50–69 | Silver | Permissible but improvement plan must be recorded
25–49 | Bronze | Embargo until targeted enhancements completed
< 25 | Remediation | Iterative QA and resubmission required


Further, the CQI × PRS matrix determines the level of dataset release (Open / Controlled / Restricted). We strongly recommend that clinical-genomic or high-stigma data default to Controlled unless the PRS is Low with strong, independently verified differential privacy.

CQI × PRS Release Matrix

PRS ↓ \ CQI → | ≥ 95 Diamond | 85–94 Platinum | 70–84 Gold | 50–69 Silver | 25–49 Bronze
Low (0–15) | Open | Open/Controlled | Controlled/Open | Controlled | Restricted
Moderate (16–40) | Open/Controlled | Controlled | Controlled | Restricted | Restricted
High (41–70) | Controlled | Controlled | Restricted | Restricted | Restricted
Very High (71–100) | Restricted | Restricted | Restricted | Restricted | Restricted
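
For assessment tooling, the matrix can be encoded directly. A minimal sketch under the band boundaries defined above (function and variable names are illustrative):

```python
def release_class(cqi: float, prs: float) -> str:
    """Map a (CQI, PRS) pair to a release level per the CQI x PRS matrix."""
    # Column index by CQI band: Diamond, Platinum, Gold, Silver, Bronze
    bands = [(95, 0), (85, 1), (70, 2), (50, 3), (25, 4)]
    col = next((i for cutoff, i in bands if cqi >= cutoff), None)
    if col is None:
        return "Remediation required (CQI < 25)"
    matrix = {
        "Low":       ["Open", "Open/Controlled", "Controlled/Open", "Controlled", "Restricted"],
        "Moderate":  ["Open/Controlled", "Controlled", "Controlled", "Restricted", "Restricted"],
        "High":      ["Controlled", "Controlled", "Restricted", "Restricted", "Restricted"],
        "Very High": ["Restricted"] * 5,
    }
    row = ("Low" if prs <= 15 else "Moderate" if prs <= 40 else
           "High" if prs <= 70 else "Very High")
    return matrix[row][col]

print(release_class(cqi=75, prs=30))  # Gold x Moderate -> "Controlled"
```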

How to Fill this Form

The Nodal Centre should:

  • Verify all information from the Lite Version and supporting documents.
  • Assign scores objectively based on the highest level where all statements are true.
  • Record evidence references, remarks, and computation notes.
  • Use Annexure I to calculate the Privacy-Risk Score (PRS).

Where data are incomplete, the Nodal Centre may seek clarifications or additional documentation from the respective data custodian to ensure transparency and accuracy.

What Happens After Submission

Once the Technical Version is completed, the Nodal Centre compiles the final CQI and PRS values and recommends the dataset’s release classification (Open, Controlled, or Restricted) using the CQI × PRS Matrix. The final report is reviewed internally and shared with the Steering Committee constituted by the Nodal Centre for endorsement and onboarding into the MIDAS repository. The Nodal Centre should maintain a record of all communications, calculations, and evidence for audit and reproducibility. Periodic re-evaluation may be undertaken if the dataset is updated or re-released.


SECTION – B 

Data Quality Domains 

1. Annotation Fidelity

Quality and reliability of labels/annotations produced by experts or trained readers.

Score | Criteria
0 | No documented labeling procedure. Labels created by a single person with unknown expertise; no checks for mistakes.
1 | Basic SOP exists but is informal; one primary annotator with occasional spot checks. No agreement statistics.
2 | Two independent annotators for a sample; disagreements resolved informally. Agreement reported only overall without class-wise detail.
3 | At least two independent annotators per item with adjudication by a senior reviewer. Per‑class agreement reported with confidence intervals; minimum sample per class ≥ 50 or CI width ≤ 0.10.
4 | Multi‑reader protocol with blinded reads, adjudication, and a documented label‑error audit. Per‑class κ ≥ 0.80 or Dice ≥ 0.85 with 95% CIs; fixes logged with versioned changelog.
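
Agreement statistics for Levels 3–4 can be computed with standard tooling. A minimal sketch using scikit-learn (an assumption; any statistics package works), with illustrative labels; a full audit would report per-class agreement with bootstrap 95% CIs:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two independent annotators to the same items (illustrative)
annotator_a = ["tumour", "normal", "tumour", "tumour", "normal", "tumour"]
annotator_b = ["tumour", "normal", "normal", "tumour", "normal", "tumour"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # Level 4 expects per-class kappa >= 0.80
```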


2. Metadata Completeness

Rich, machine‑actionable metadata with persistent identifiers to enable discovery and reuse.

Score | Criteria
0 | Minimal metadata (title only) without structured fields; no identifier.
1 | Basic fields (title, creator, date) filled; some free‑text descriptions; local IDs only.
2 | Standard metadata template used; dataset DOI/Handle minted; core fields populated but creator identifiers missing.
3 | DOI with rich schema: creator ORCIDs and custodian ROR present; keywords from controlled vocabularies; versioning documented.
4 | FAIRness assessment (FAIRshake/F‑UJI) or equivalent ‘passes’ with remediation notes; complete machine‑actionable metadata, including funder/grant IDs and related outputs.


3. Documentation Richness

Clarity and depth of human‑readable documentation for context, methods, and reuse.

Score | Criteria
0 | No public documentation beyond a short description.
1 | README available but lacks process details; license unclear.
2 | Full data card or README with collection context and variables; explicit license (SPDX identifier).
3 | Standard Operating Procedures (SOPs), changelog, and known limitations documented; external validation notes included.
4 | Comprehensive manual with worked examples, ODRL/DUO machine‑readable reuse terms, and a public ‘transparency & terms’ page.


4. Population Representativeness

How well the dataset reflects the intended population across sites, regions, age, sex, and other axes.

Score | Criteria
0 | Unknown sampling frame; no demographic fields.
1 | Some demographics present but incomplete; single‑site or convenience sampling.
2 | Multi‑site sampling with predefined axes but targets not set; imbalance described qualitatively.
3 | Pre‑specified targets across ≥ 2 axes; residual imbalance quantified with |SMD| per axis; trends over time plotted.
4 | Continuous representativeness monitoring; |SMD| ≤ 0.10 for key axes or corrective rebalancing performed; drift alarms using PSI/EMD with actions logged.
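
The standardized mean difference (SMD) referenced in Levels 3–4 can be computed per axis. A minimal sketch using the pooled-standard-deviation form for a continuous axis (numpy is an assumption; the data are simulated for illustration):

```python
import numpy as np

def smd(sample: np.ndarray, reference: np.ndarray) -> float:
    """Standardized mean difference between dataset and target population."""
    pooled_sd = np.sqrt((sample.var(ddof=1) + reference.var(ddof=1)) / 2)
    return (sample.mean() - reference.mean()) / pooled_sd

rng = np.random.default_rng(0)
age_dataset = rng.normal(52, 12, 1000)    # ages observed in the dataset
age_target = rng.normal(50, 12, 1000)     # ages in the intended population
print(abs(smd(age_dataset, age_target)))  # Level 4 requires |SMD| <= 0.10 on key axes
```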


5. Interoperability & Standards Conformance (with DQ checks)

Mapping to healthcare standards and automated data‑quality checks: conformance, completeness, plausibility.

Score | Criteria
0 | Proprietary format; no mapping or schema.
1 | Export scripts exist; partial field mapping to a known standard; no automated checks.
2 | Declared mapping to FHIR/DICOM/OMOP/ABDM with validation, but some critical errors persist; missingness not reported.
3 | Conforms to ≥ 1 standard with zero critical validator errors; key‑field missingness ≤ 5%; plausibility rules pass (ranges, cross‑field logic).
4 | Multi‑standard validation in CI; comprehensive DQ suite runs each release covering conformance, completeness, and plausibility; round‑trip fidelity and cross‑repository resolvability verified.


6. AI‑Readiness & Drift Monitoring

Packaging for benchmarking and safeguards against data leakage and distribution shift.

Score | Criteria
0 | Raw dump; no splits; no checks for leakage.
1 | Ad‑hoc train/test split without fixed seeds; no leakage tests; no drift monitoring.
2 | Documented splits with fixed seeds and manifest; basic leakage checks (duplicate IDs) performed once.
3 | Public benchmark kit (splits + seeds + manifests), leakage tests documented, at least one group‑fairness metric reported; initial drift baseline set.
4 | Continuous drift monitoring with alarms (e.g., PSI > 0.2) and human‑in‑the‑loop (HIL) adjudication; benchmark kit maintained with versioned releases; reproducible scoring script provided.
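
The PSI alarm in Level 4 compares a new batch’s distribution with the baseline split. A minimal sketch over fixed histogram bins (numpy is an assumption; binning choices are illustrative, while the 0.2 threshold comes from the rubric):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between baseline and current batches."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0].astype(float)
    a = np.histogram(actual, bins=edges)[0].astype(float)
    e = np.clip(e / e.sum(), 1e-6, None)  # bin fractions; avoid log(0)
    a = np.clip(a / a.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, 5000)
current = rng.normal(0.4, 1.0, 5000)      # shifted batch (simulated)
if psi(baseline, current) > 0.2:          # alarm threshold per the rubric
    print("Drift alarm: route to human-in-the-loop adjudication")
```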


7. Identifiability & Sensitivity (Privacy Risk)

Risk that individuals can be re‑identified or sensitive attributes can be inferred.

Score | Criteria
0 | No de‑identification; direct identifiers present.
1 | Identifiers removed but quasi‑identifiers left unassessed; no documented risk method.
2 | Formal risk quantification documented (e.g., k‑anonymity proxy or ε‑DP parameter), but no independent check.
3 | Risk quantified and mapped to PRS bands; independent assessor replicates results; linkage risk considered for rare combinations.
4 | Independent re‑identification/attack simulation (membership/attribute inference) shows <1% success under declared models; higher of ε‑risk and linkage‑risk is recorded.


8. Security & Governance (Operational)

Demonstrable operational security and auditability of access to data and systems.

Score | Criteria
0 | No access controls or audit trails.
1 | Role‑based access configured; basic logs enabled; no review routine.
2 | Periodic access review; vulnerability management exists but without SLAs; backup configured.
3 | Documented vulnerability‑management SLAs; last BCP/DR test recorded; SIEM/audit logs sampled and reviewed quarterly.
4 | Tamper‑evident hashing/watermarking of artifacts; red‑team/tabletop outcomes recorded; continuous monitoring with automated alerts and response playbooks.


9. Provenance & Lineage (Executable)

End‑to‑end trace from raw data to release, with executable pipelines.

Score | Criteria
0 | No pipeline or provenance records.
1 | High‑level steps described; manual processing; no environment capture.
2 | Scripted steps with basic version control; environment noted informally.
3 | Containerized workflow (Docker/Singularity) with CWL/WDL/Nextflow; lockfile for dependencies; lineage graph available.
4 | Independent rerun using the provided container/workflow reproduces outputs on separate hardware; full lineage captured automatically.


10. Ethical & Social Accountability

Processes for equity impact, stakeholder engagement, and redressal.

Score | Criteria
0 | No ethics or social‑impact documentation.
1 | IRB/IEC approval cited; consent model mentioned.
2 | Stakeholder engagement described; risks/benefits discussed; grievance email provided.
3 | Equity impact assessment template completed; public redress channel with median time‑to‑resolution tracked.
4 | Periodic equity updates and outcomes published; external stakeholder review minutes shared.


11. Synthetic‑Data Fidelity (if applicable)

Utility and privacy of synthetic data relative to real data.

Score | Criteria
0 | No testing of synthetic data; claims only.
1 | Basic descriptive comparison; no privacy tests.
2 | Task‑level utility evaluated (e.g., model accuracy) on synthetic vs. real; one privacy attack tested.
3 | Multiple utility metrics and calibration; membership or attribute‑inference attack success ≤ 5%.
4 | Nearest‑neighbor disclosure analysis and multiple attack models show <1% success; report includes attack budgets and defenses.


12. Stewardship & Data‑Protection Governance

Organizational readiness and compliance for responsible data stewardship.

Score | Criteria
0 | No assigned data owner; policies absent.
1 | Named owner/custodian; basic policy outline.
2 | Records of Processing (RoPA) and Data‑Protection Impact Assessment (DPIA/PIA) exist; DPO identified if required.
3 | DPIA refresh tied to release cadence; policy compliance included in KPIs and dashboards.
4 | Independent internal audit of stewardship with corrective‑action tracking; public governance summary updated each major release.


13. Model‑Linkage Integrity

Traceable and verifiable linkages between datasets and any released models.

Score | Criteria
0 | Models released without data linkage information.
1 | Manual notes on which data were used; no hashes.
2 | Data and model versions recorded; partial file lists.
3 | Cryptographic manifest of all files; semantic versioning for data and models; deployment references manifest IDs.
4 | Signed manifests; reproducible model‑build script that validates manifest integrity before training/inference.
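
The cryptographic manifest required at Levels 3–4 can be produced with standard hashing. A minimal sketch (the path and version string are illustrative; a signed manifest would additionally carry a detached signature, e.g. via GPG or sigstore):

```python
import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: str, dataset_version: str) -> dict:
    """SHA-256 digest of every file in a dataset release directory."""
    root = Path(data_dir)
    files = {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }
    return {"dataset_version": dataset_version, "files": files}

manifest = build_manifest("release/", "2.1.0")  # illustrative path and version
Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```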


14. Environmental & Sustainability

Measurement and management of energy and carbon footprints from storage and compute.

Score | Criteria
0 | No energy/footprint reporting.
1 | Qualitative statements only.
2 | Basic measurement using tools (e.g., codecarbon) for major jobs; storage reported in TB‑months.
3 | Numeric targets set (e.g., ≤ X kWh/TB‑month); optimization measures recorded; procurement policy referenced.
4 | Independent review of footprint; public summary per major release with actions taken.


15. Continuous Curation & Feedback (Freshness)

Release cadence, user feedback, and responsiveness; how ‘fresh’ the data are.

Score | Criteria
0 | No versioning or feedback channel.
1 | Occasional releases; informal issue handling.
2 | Planned release schedule; issue tracker used; freshness (data latency) measured but not enforced.
3 | Freshness SLA met in ≥ 90% cycles; issue triage and resolution SLAs met; public dashboard with metrics.
4 | Telemetry‑backed adherence to freshness and resolution SLAs; user advisory review minutes published each cycle.


SECTION – C 

Annexure – I

How to Compute the Privacy-Risk Score (PRS)

The Privacy-Risk Score (PRS) quantifies residual re-identification risk and potential harm exposure on a 0–100 scale. Each dataset must include a documented PRS calculation in the assessment record. The PRS determines the applicable access category (Open / Controlled / Restricted) when combined with the Composite Quality Index (CQI) in the CQI × PRS Matrix.

Step 1. Baseline Identity-Disclosure Risk

Select the method appropriate to the dataset type.

1A Structured / Tabular / Clinical Data

Estimate the “prosecutor risk”, i.e., the maximum probability that any record can be uniquely re-identified from quasi-identifiers (age, sex, region, diagnosis, timestamp, etc.).
Let p = worst-case re-identification probability (0 ≤ p ≤ 1).
BaselineRisk = 100 × p (approximately 100 / k for the smallest equivalence class). BaselineRisk represents the raw disclosure risk before the sensitivity multiplier is applied.

Examples:

  • k = 10 → p ≈ 0.10 → BaselineRisk = 10
  • k = 4 → p ≈ 0.25 → BaselineRisk = 25
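
A minimal sketch of Step 1A using pandas (an assumption; equivalent grouping works in any tool): group records by the declared quasi-identifiers, take the smallest equivalence class k, and derive BaselineRisk ≈ 100 / k. Column names are illustrative:

```python
import pandas as pd

def baseline_risk_1a(df: pd.DataFrame, quasi_identifiers: list) -> float:
    """Prosecutor-risk baseline: 100 / k for the smallest equivalence class."""
    k = df.groupby(quasi_identifiers).size().min()
    return 100.0 / k

records = pd.DataFrame({                      # illustrative records
    "age_band": ["40-49"] * 6 + ["50-59"] * 4,
    "sex":      ["F", "F", "F", "M", "M", "M", "F", "F", "M", "M"],
    "district": ["D1"] * 10,
})
print(baseline_risk_1a(records, ["age_band", "sex", "district"]))  # k = 2 -> 50.0
```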

1B Differentially Private / Synthetic Release

If data are generated or protected by differential privacy, derive an equivalent disclosure risk from ε.
BaselineRisk = min(100, 20 × ε)

Examples:
ε = 0.5 → BaselineRisk = 10 | ε = 1.0 → 20 | ε = 3 → 60 | ε ≥ 5 → 100

Lower ε indicates stronger privacy; ε > 3 suggests high risk.

This mapping is intended for typical tabular / quasi-tabular health data and should not be naively applied to, for example, full-resolution DICOM images or raw WGS reads, where ε may not capture all linkage risks. For non-tabular data, use simulated linkage or membership-inference rate to estimate p.

Step 2. Sensitivity Multiplier

Adjust for intrinsic sensitivity or potential social harm.

Sensitivity Class | Definition / Examples | Multiplier
A (Routine) | Non-stigmatizing, routine health or operational data. | 1.0
B (High Stigma / Clinical-Genomic) | HIV/TB records, mental-health, sexual-health, genomic data, rare disease, household geocodes. | 1.5
C (Critical / Sovereign) | Forensic genomics, security-sensitive, tribal / conflict-zone registries, data subject to national security control. | 2.0


AdjustedRisk = BaselineRisk × Multiplier
(capped at 100)
If AdjustedRisk > 100, set PRS = 100.

Step 3. Final PRS Computation

PRS = round (AdjustedRisk)

Record in the assessment log:

  • Method used (1A or 1B)
  • Sensitivity Class (A / B / C)
  • Final PRS value
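
Putting Steps 1–3 together, a minimal end-to-end sketch covering both Method 1A and Method 1B (function and parameter names are illustrative; it reproduces the worked example at the end of this annexure):

```python
MULTIPLIERS = {"A": 1.0, "B": 1.5, "C": 2.0}    # sensitivity classes (Step 2)

def compute_prs(sensitivity_class: str, k: int = None, epsilon: float = None) -> int:
    """Privacy-Risk Score: Step 1 baseline, Step 2 multiplier, Step 3 rounding."""
    if k is not None:                 # Method 1A: prosecutor risk ~ 100 / k
        baseline = 100.0 / k
    elif epsilon is not None:         # Method 1B: DP / synthetic release
        baseline = min(100.0, 20.0 * epsilon)
    else:
        raise ValueError("provide k (Method 1A) or epsilon (Method 1B)")
    adjusted = min(100.0, baseline * MULTIPLIERS[sensitivity_class])
    return round(adjusted)

# Mental-health registry with k = 5, Sensitivity Class B (see Example Calculation)
print(compute_prs("B", k=5))   # 20 x 1.5 = 30 -> Moderate -> Controlled Access
```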

PRS Range | Risk Level | Interpretation / Use in Matrix
0–15 | Low | Eligible for Open Release (only if CQI ≥ Gold and no override).
16–40 | Moderate | Controlled Access required with DUA / DAC approval.
41–70 | High | Controlled or Restricted Access with enhanced governance.
71–100 | Very High | Restricted Access only; release limited to aggregates or DP outputs.

For datasets containing genomic, mental-health, or high-stigma attributes, Controlled Access is mandatory unless PRS ≤ 15 and independent DP verification exists.

Step 4. Recording and Audit

Each PRS calculation must document:

  • The selected method (1A or 1B).
  • The variables considered as quasi-identifiers or the ε value used.
  • The Sensitivity Class and justification.
  • The final PRS numeric score and risk level.

All computations must be reproducible and, when feasible, verified by an assessor not involved in the dataset’s original curation.

Institutions may use additional automated risk-assessment tools (k-map, ARX, sdcMicro, SmartNoise) provided they yield the same PRS bands as defined above. Store the worksheet and tool logs in the assessment archive.

Example Calculation

A mental-health registry with k = 5 → BaselineRisk = 20. Sensitivity Class B (multiplier 1.5) → AdjustedRisk = 30 → PRS = 30 → Moderate Risk → Controlled Access.