Search for bisulfite data

Published

July 22, 2025

This post was originally published on Medium. You can also read it there and leave comments.

The main DNA methylation data types are bisulfite sequencing and arrays. Here I focus on finding bisulfite sequencing datasets.

Query NCBI SRA advanced search

With the SRA Advanced Search Builder:

For the Strategy field, click Show index list and choose “bisulfite seq”.
Click Search. I filter for species and DNA on the next page.

https://www.ncbi.nlm.nih.gov/sra/advanced

Filter runs

I filtered for Source: DNA and Organism: Homo sapiens.

You can also run the search by editing the query in Search details. Mine currently reads:

"bisulfite seq"[Strategy] AND "Homo sapiens"[orgn] AND "biomol dna"[Properties]

You get the same results with this URL:

https://www.ncbi.nlm.nih.gov/sra?term=%22bisulfite%20seq%22%5BStrategy%5D%20AND%20%22Homo%20sapiens%22%5Borgn%5D%20AND%20%22biomol%20dna%22%5BProperties%5D&cmd=DetailsSearch
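If you want to tweak the query and rebuild the link without clicking through the form, the URL can be assembled in R. This is a minimal sketch using base R's URLencode; the variable names are my own, and it only reproduces the link above rather than calling any NCBI API.

```r
# Search term copied from the "Search details" box.
term <- '"bisulfite seq"[Strategy] AND "Homo sapiens"[orgn] AND "biomol dna"[Properties]'

# reserved = TRUE also percent-encodes quotes, spaces and square brackets.
encoded <- utils::URLencode(term, reserved = TRUE)

# Same base URL and cmd parameter as the link shown above.
url <- paste0("https://www.ncbi.nlm.nih.gov/sra?term=", encoded, "&cmd=DetailsSearch")
print(url)
```

Changing the organism or strategy in term and re-running gives a shareable link for the new search.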

I got 54368 results as of 2025.07.23.

Send to SRA Run Selector

Click Send to -> Choose Destination: Run Selector

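This brings you to a URL like the one below. The query_key and WebEnv parameters belong to my browser session, so this exact link will probably not work for you: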

https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=4&WebEnv=MCID_687f8c820ad1cd6f51d404fa&o=acc_s%3Aa

Download Metadata

Click Download Metadata. You get a CSV file.

It contains per-run metadata such as AGE, although not every dataset reports age.
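A quick way to see how patchy the age field is would be to compute, per BioProject, the share of runs with a non-missing AGE. The sketch below uses a small made-up table in place of the real download; the accession numbers and values are hypothetical, and only the column names Run, BioProject and AGE are taken from the metadata file.

```r
library(dplyr)

# Toy stand-in for the downloaded Run Selector table (hypothetical values).
runs <- tibble(
  Run        = c("SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"),
  BioProject = c("PRJNA000001", "PRJNA000001", "PRJNA000002", "PRJNA000002"),
  AGE        = c("34", NA, NA, NA)
)

# Per BioProject: total runs, runs with an age, and the share with an age.
age_coverage <- runs %>%
  group_by(BioProject) %>%
  summarise(
    n_runs         = n(),
    n_with_age     = sum(!is.na(AGE)),
    share_with_age = mean(!is.na(AGE))
  ) %>%
  arrange(desc(share_with_age))

print(age_coverage)
```

On the real file you would replace the toy tibble with the table read from the CSV.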

Process metadata

My goal in this step was to find datasets with age metadata from certain tissues. Unfortunately, the metadata is not standardized across submissions, so this takes some work. I processed it in R.

library(tidyverse)

# Run Selector metadata downloaded in the previous step
df_read <- read_csv("data/sra_human_bisulfite.csv")

# One row per BioProject, counting only runs that report an age
df <- df_read %>%
  filter(!is.na(AGE)) %>%
  group_by(BioProject) %>%
  summarise(
    n = n(),
    # Tissue information is scattered across several columns, so collect them all
    unique_tissues = list(unique(tissue)),
    unique_celltype = list(unique(cell_type)),
    unique_tissue_cell_type = list(unique(`tissue/cell_type`)),
    unique_tissue_type = list(unique(tissue_type)),
    unique_sample_type = list(unique(sample_type)),
    unique_isolate = list(unique(isolate)),
    unique_age = list(unique(AGE)),
    unique_source_name = list(unique(source_name))
  ) %>%
  # Largest datasets first
  arrange(desc(n))

This aggregates the runs by dataset (BioProject) and sorts the datasets by the number of runs that have an age value. I then looked more closely at the most interesting ones.
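One way to narrow the summary down to a tissue of interest is to search the list columns for a keyword. This is a sketch on a toy table shaped like the summary above, not the exact filtering I did; the helper mentions(), the accessions and the values are made up for illustration.

```r
library(dplyr)
library(purrr)
library(stringr)

# Toy stand-in for the per-BioProject summary (hypothetical values).
df_toy <- tibble(
  BioProject         = c("PRJNA000001", "PRJNA000002", "PRJNA000003"),
  n                  = c(120L, 45L, 10L),
  unique_tissues     = list(c("Whole blood", "PBMC"), c("liver"), NA_character_),
  unique_source_name = list(NA_character_, c("hepatocytes"), c("peripheral blood"))
)

# TRUE if any non-missing value in one list element matches the pattern.
mentions <- function(values, pattern) {
  any(str_detect(values, regex(pattern, ignore_case = TRUE)), na.rm = TRUE)
}

# Keep projects where either column mentions "blood".
blood_projects <- df_toy %>%
  filter(
    map_lgl(unique_tissues, function(x) mentions(x, "blood")) |
      map_lgl(unique_source_name, function(x) mentions(x, "blood"))
  )

print(blood_projects$BioProject)
```

Checking several columns matters because submitters put the tissue in different fields, so a project can have an empty tissue column and still name the tissue in source_name.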
