Data Quality
To maintain the integrity, reliability, and usability of the DataNovaAI platform, all data submitted by researchers must adhere to strict quality standards. These requirements ensure that the decentralized data sharing ecosystem remains a valuable resource for scientific collaboration and AI-driven analysis. This section outlines the expectations for data quality, acceptable formats, and specific content requirements.
The DataNovaAI platform enforces a comprehensive set of data quality standards to guarantee that submitted datasets meet the needs of global researchers and AI agents. Data submissions are evaluated based on the following criteria:
Accuracy: Data must reflect true experimental or observational results, verified through rigorous scientific methods. Any discrepancies or errors must be documented.
Completeness: Datasets should include all necessary components to be fully usable, avoiding missing values unless explicitly justified (e.g., experimental constraints). Metadata must accompany all submissions.
Consistency: Data should maintain internal consistency (e.g., units, formats, and naming conventions should be uniform throughout the dataset).
Relevance: Submitted data must align with scientific research purposes and provide value to the DataNovaAI community.
Innovativeness: Preference is given to datasets that offer novel insights or contribute to advancing scientific knowledge.
To ensure compatibility with the platform’s decentralized storage (IPFS) and AI analysis tools, DataNovaAI specifies the following acceptable formats for data submission:
CSV (Comma-Separated Values):
Preferred Format: CSV is the primary format for structured tabular data due to its simplicity, wide compatibility, and ease of parsing by AI agents.
Requirements:
Use UTF-8 encoding to support international characters.
Include a header row with clear, descriptive column names (e.g., "Temperature_C", "Pressure_atm").
Avoid merged cells or embedded formatting; data should be flat and machine-readable.
Separate fields with commas; avoid using commas within data values unless properly escaped.
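The CSV requirements above can be sketched with Python's standard `csv` module, which quotes values containing commas automatically. The column names and values here are illustrative placeholders, not real submissions:

```python
import csv
import io

# A minimal sketch of the CSV requirements above: UTF-8 text, a descriptive
# header row, flat machine-readable rows, and comma escaping handled by the
# csv module (values containing commas are quoted automatically).
rows = [
    {"Temperature_C": 21.5, "Pressure_atm": 1.01, "Notes": "baseline, pre-heating"},
    {"Temperature_C": 85.0, "Pressure_atm": 1.03, "Notes": "post-heating"},
]

buf = io.StringIO()  # stand-in for a file opened with encoding="utf-8"
writer = csv.DictWriter(buf, fieldnames=["Temperature_C", "Pressure_atm", "Notes"])
writer.writeheader()
writer.writerows(rows)

csv_text = buf.getvalue()
# The value with an embedded comma is quoted, so it round-trips correctly:
parsed = list(csv.DictReader(io.StringIO(csv_text)))
assert parsed[0]["Notes"] == "baseline, pre-heating"
```

Writing through `csv.DictWriter` rather than string concatenation is what keeps embedded commas from corrupting the column layout.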
JSON (JavaScript Object Notation):
Suitable for hierarchical or nested data structures (e.g., experimental metadata or multi-dimensional results).
Requirements:
Well-formed JSON with consistent key-value pairs.
Keys should be descriptive and follow a logical naming convention (e.g., "experiment_id", "results").
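A hierarchical record meeting these JSON requirements might look like the following sketch; `experiment_id` and the nested field names are hypothetical examples of the descriptive, consistently named keys described above:

```python
import json

# Illustrative hierarchical record: descriptive snake_case keys and nested
# objects for conditions and per-trial results. All values are invented.
record = {
    "experiment_id": "EXP-0042",
    "conditions": {"temperature_c": 21.5, "pressure_atm": 1.01},
    "results": [
        {"trial": 1, "yield_percent": 92.4},
        {"trial": 2, "yield_percent": 91.8},
    ],
}

text = json.dumps(record, indent=2, ensure_ascii=False)
# Round-tripping confirms the document is well-formed JSON:
assert json.loads(text) == record
```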
Other Formats:
Formats like HDF5, Parquet, or raw text may be accepted for specific use cases (e.g., large-scale scientific datasets or unstructured logs), but must be pre-approved by the platform’s data validation team.
Binary formats (e.g., images, audio) must be accompanied by a metadata file in CSV or JSON describing the content and context.
Submitted datasets must meet specific content standards to ensure usability and compliance with platform goals:
Metadata:
Every dataset must include a metadata file (in CSV or JSON format) containing:
Title: A concise, descriptive title of the dataset.
Description: A summary of the data’s purpose, source, and scientific relevance (min. 50 words).
Author(s): Names and affiliations of the data contributors.
Date: Date of data collection or creation.
Methodology: Brief description of experimental or observational methods used (e.g., "Temperature measured using a thermocouple with ±0.1°C accuracy").
Units: Clear specification of measurement units for all variables (e.g., "Temperature in Celsius", "Pressure in atmospheres").
License: Licensing information (e.g., Creative Commons, Open Data Commons) for data usage rights.
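A JSON metadata file covering the required fields above could be sketched as follows; every concrete value is an invented placeholder:

```python
import json

# Sketch of a companion metadata file with the required fields listed above.
# All values are placeholders, not a real submission.
metadata = {
    "title": "Thermocouple Temperature Series",
    "description": "A summary of the data's purpose, source, and scientific "
                   "relevance, expanded to at least 50 words in a real submission.",
    "authors": [{"name": "A. Researcher", "affiliation": "Example University"}],
    "date": "2024-01-15",
    "methodology": "Temperature measured using a thermocouple with ±0.1°C accuracy",
    "units": {"Temperature_C": "Celsius", "Pressure_atm": "atmospheres"},
    "license": "CC-BY-4.0",
}

with open("dataset_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)
```

The same fields could equally be laid out as two-column CSV (field, value); JSON is shown here because the `units` and `authors` entries nest naturally.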
Data Structure:
Tabular Data: For CSV submissions, each row should represent a single observation or record, and each column a distinct variable. Avoid including calculated fields unless explicitly labeled (e.g., "Calculated_Average").
Hierarchical Data: For JSON, data should be organized logically, with nested objects representing relationships (e.g., experiments with sub-conditions).
Size Limits: Individual datasets should not exceed 10 GB unless split into manageable parts with clear documentation on how parts relate.
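One way to split an oversized CSV into independently parseable parts is to repeat the header row in each part, as in this sketch (`rows_per_part` would in practice be chosen so each part stays under the 10 GB limit; the part-naming scheme is an assumption):

```python
import csv

def split_csv(path, rows_per_part, part_prefix="part"):
    """Split a CSV into part files of at most rows_per_part data rows each,
    repeating the header row in every part so parts stand alone."""
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        part = 0
        out = writer = None
        written = rows_per_part  # forces a new part at the first data row
        for row in reader:
            if written == rows_per_part:
                if out:
                    out.close()
                part += 1
                out = open(f"{part_prefix}_{part:03d}.csv", "w",
                           newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)  # repeat header in every part
                written = 0
            writer.writerow(row)
            written += 1
        if out:
            out.close()
    return part  # number of parts written
```

Documenting the part order (e.g. the numeric suffix) in the accompanying metadata satisfies the requirement that parts clearly relate to one another.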
Validation Checks:
Data must pass automated validation checks for:
Missing values (flagged if >5% unless justified).
Outliers (detected via statistical methods like Z-scores; must be explained).
Format compliance (e.g., consistent decimal places, no mixed data types in columns).
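The three checks above can be sketched per column as follows. The 5% missing-value threshold comes from the text; the |z| > 3 outlier cutoff is an assumed convention, and `None` is used here to mark a missing entry:

```python
import statistics

def validate_column(values, missing_threshold=0.05, z_threshold=3.0):
    """Sketch of the automated checks: missing-value ratio, mixed data
    types, and z-score outlier detection. Returns a list of issue strings."""
    issues = []
    missing = sum(v is None for v in values)
    if missing / len(values) > missing_threshold:
        issues.append(f"missing values: {missing}/{len(values)}")
    present = [v for v in values if v is not None]
    types = {type(v) for v in present}
    if len(types) > 1:
        issues.append(f"mixed data types: {sorted(t.__name__ for t in types)}")
        return issues  # z-scores are meaningless on mixed types
    if len(present) > 1 and all(isinstance(v, (int, float)) for v in present):
        mean = statistics.fmean(present)
        stdev = statistics.stdev(present)
        if stdev > 0:
            outliers = [v for v in present
                        if abs(v - mean) / stdev > z_threshold]
            if outliers:
                issues.append(f"outliers (|z| > {z_threshold}): {outliers}")
    return issues
```

Flagged columns are not rejected outright; per the rules above, missing values and outliers pass review when the submitter documents a justification.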
Preparation: Researchers prepare their data according to the above standards, ensuring all required metadata is included.
Upload: Data is uploaded via the DataNovaAI web interface or API, with the platform automatically generating an IPFS hash upon successful submission.
Validation: An initial automated check verifies format and completeness, followed by a decentralized peer review process for quality scoring.
Approval: Once validated and scored, the dataset is made available on the platform, with rewards issued based on quality metrics.
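The IPFS hash in the workflow above is generated by the platform on upload. As a local pre-upload integrity check, a researcher might fingerprint the file first; note that this SHA-256 digest is not the IPFS CID (which uses its own chunking and multihash scheme), only a stand-in for verifying that the uploaded bytes match:

```python
import hashlib

def fingerprint(path, chunk_size=1 << 20):
    """SHA-256 digest of a file, read in 1 MiB chunks so large datasets
    never need to fit in memory. Not an IPFS CID; a local integrity check."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```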
Peer Review: Community members and experts review submissions, assigning scores (1-5) for accuracy, completeness, and innovativeness. Scores influence token rewards.
AI Validation: DataNovaAI AI agents perform automated quality checks, flagging issues for human review.
Rewards: Higher-quality datasets (based on scores and usage metrics) receive greater Nova token rewards, incentivizing excellence in submissions.
By enforcing these rigorous data quality requirements, DataNovaAI ensures that its platform remains a trusted, high-value resource for the global scientific community, supporting both human researchers and AI-driven advancements.