Contributor workflow

Add datasets with proof

DefendableDatasets accepts registry-first contributions. Contributors add metadata, receipts, samples, and manifests through pull requests; large files are transferred through NAS or object storage rather than committed to the repository.

Required Metadata

  • id, title, slug, description, domain, category, version
  • status, access, license, formats, tasks, language
  • record_count, size_bytes, created_at, updated_at
  • source_type, provenance_summary, intended_use, not_intended_use
  • quality_score, validation, files, hashes, receipts
  • tags, compatible_models, example_records, citation, links
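
A contributor may want to check an entry against the field list above before opening a pull request. The following is a minimal sketch, not the project's validator: it assumes each dataset's metadata is a JSON object with these names as top-level keys, and `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here for illustration.

```python
# Field names copied from the Required Metadata list.
# Assumption: they appear as top-level keys of one JSON object per dataset.
REQUIRED_FIELDS = (
    "id", "title", "slug", "description", "domain", "category", "version",
    "status", "access", "license", "formats", "tasks", "language",
    "record_count", "size_bytes", "created_at", "updated_at",
    "source_type", "provenance_summary", "intended_use", "not_intended_use",
    "quality_score", "validation", "files", "hashes", "receipts",
    "tags", "compatible_models", "example_records", "citation", "links",
)


def missing_fields(entry: dict) -> list[str]:
    """Return the required field names absent from `entry`, in list order."""
    return [name for name in REQUIRED_FIELDS if name not in entry]
```

This only checks that each key is present. It does not check types or values, which `defendable-datasets validate` may handle differently.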

License and Provenance

Every dataset needs a license, a source summary, a receipt path, and a SHA-256 hash for every file. Public records still need source URLs, retrieval dates, and a terms review. Private or member data must stay out of public pull requests and route through gated access controls.
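
The per-file hash requirement can be met with Python's standard `hashlib` module. This is a rough sketch, not the implementation of `defendable-datasets hash`. The helper names `sha256_file` and `hash_tree` are hypothetical, and the output shape (relative path mapped to hex digest) is an assumption about what the `hashes` field might hold.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file under `root` (as a POSIX-style relative path) to its digest."""
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Reading in chunks matters for large split files, which may not fit in memory.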

Dataset Folder Structure

datasets/
  cre/
    cre_underwriting_royal_jelly_v1/
      dataset.card.md
      manifest.json
      samples/
      receipts/
      splits/
  compute/
    compute_gpu_market_comps_v1/
      dataset.card.md
      manifest.json
      samples/
      receipts/
      splits/
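
The layout above can be checked mechanically before a pull request is opened. The following is a sketch under stated assumptions: `layout_problems` is a hypothetical helper, and it treats all five entries shown in the tree as expected, including `splits/`. Whether an empty `splits/` folder is required is not stated in this guide.

```python
from pathlib import Path

# Entries taken from the folder tree; treating every one as mandatory is an assumption.
EXPECTED_FILES = ("dataset.card.md", "manifest.json")
EXPECTED_DIRS = ("samples", "receipts", "splits")


def layout_problems(dataset_dir: Path) -> list[str]:
    """Return human-readable problems found in one dataset folder (empty if none)."""
    problems = []
    for name in EXPECTED_FILES:
        if not (dataset_dir / name).is_file():
            problems.append(f"missing file: {name}")
    for name in EXPECTED_DIRS:
        if not (dataset_dir / name).is_dir():
            problems.append(f"missing directory: {name}/")
    return problems
```

For example, `layout_problems(Path("datasets/cre/cre_underwriting_royal_jelly_v1"))` returns an empty list when the folder matches the tree.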

Example PR Checklist

  • Add or update `/data/registry/datasets.json`.
  • Add `datasets/[domain]/[dataset_id]/manifest.json` and `dataset.card.md`.
  • Add small samples under `samples/` and receipt files under `receipts/`.
  • Put real train, validation, test, or full split files under `splits/` only when licensing permits.
  • Run `npm run lint` and `npm run build` before opening the pull request.

CLI Roadmap

Available commands: `defendable-datasets validate`, `defendable-datasets hash`, and `defendable-datasets pack`. Planned next: `defendable-datasets receipt`.