Search for bisulfite data

Published

July 22, 2025

This post was originally published on Medium. You can also read it there and leave comments.

The two main types of DNA methylation data are bisulfite sequencing and methylation arrays. Here I'll focus on searching for bisulfite sequencing datasets.

Query NCBI SRA advanced search

With the SRA Advanced Search Builder:

https://www.ncbi.nlm.nih.gov/sra/advanced

Filter runs

I filtered for Source: DNA and Organism: Homo sapiens.

https://www.ncbi.nlm.nih.gov/sra

You can also search by editing the query under Search details. Mine currently reads:

"bisulfite seq"[Strategy] AND "Homo sapiens"[orgn] AND "biomol dna"[Properties]

You get the same results with this URL:

https://www.ncbi.nlm.nih.gov/sra?term=%22bisulfite%20seq%22%5BStrategy%5D%20AND%20%22Homo%20sapiens%22%5Borgn%5D%20AND%20%22biomol%20dna%22%5BProperties%5D&cmd=DetailsSearch

I got 54368 results as of 2025.07.23.
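If you would rather build that URL in code than copy it from the browser, here is a minimal sketch in base R. It only percent-encodes the search term shown above and appends it to the SRA search address; the variable names are my own and nothing is sent to NCBI.

```r
# Search term exactly as shown under Search details.
term <- '"bisulfite seq"[Strategy] AND "Homo sapiens"[orgn] AND "biomol dna"[Properties]'

# reserved = TRUE is needed so the square brackets are percent-encoded too.
encoded_term <- utils::URLencode(term, reserved = TRUE)

# Same address as the URL above, assembled from its parts.
search_url <- paste0(
  "https://www.ncbi.nlm.nih.gov/sra?term=",
  encoded_term,
  "&cmd=DetailsSearch"
)
print(search_url)
```

Changing the organism or strategy then only means editing the term string.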

Send to SRA Run Selector

Click Send to -> Choose Destination: Run Selector

This brings you to:

https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=4&WebEnv=MCID_687f8c820ad1cd6f51d404fa&o=acc_s%3Aa

Download Metadata

Click Download Metadata to get a CSV file.

csv downloaded from run selector

It contains metadata such as AGE for every run, although not every dataset reports age.
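To see how sparse a column like AGE is before doing anything else, a quick coverage check can help. The sketch below is an assumption about how one might do it, not part of my original workflow: it runs on a small made-up table (the `toy` values are invented) and computes the share of non-missing values per column. With the real file you would pass the data frame returned by read_csv instead.

```r
library(tidyverse)

# Toy stand-in for the Run Selector CSV; all values are invented.
toy <- tibble(
  Run        = c("R1", "R2", "R3", "R4"),
  BioProject = c("P1", "P1", "P2", "P2"),
  AGE        = c("34", NA, NA, "61"),
  tissue     = c("blood", "blood", NA, NA)
)

# Share of non-missing values in each column, most complete columns first.
coverage <- toy %>%
  summarise(across(everything(), ~ mean(!is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "share_present") %>%
  arrange(desc(share_present))

print(coverage)
```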

Process metadata

My goal in this step was to find datasets with Age metadata from certain tissues. Unfortunately, the metadata is not standardized, so it takes some work. I processed it in R.

library(tidyverse)

df_read <- read_csv("data/sra_human_bisulfite.csv")

# Keep runs that report an age, then summarise each BioProject:
# number of runs plus the distinct values of every tissue-like column.
df <-
  df_read %>%
  filter(!is.na(AGE)) %>%
  group_by(BioProject) %>%
  summarise(
    n = n(),
    unique_tissues = list(unique(tissue)),
    unique_celltype = list(unique(cell_type)),
    unique_tissue_cell_type = list(unique(`tissue/cell_type`)),
    unique_tissue_type = list(unique(tissue_type)),
    unique_sample_type = list(unique(sample_type)),
    unique_isolate = list(unique(isolate)),
    unique_age = list(unique(AGE)),
    unique_source_name = list(unique(source_name))
  ) %>%
  arrange(desc(n))  # largest datasets first

This keeps only the runs that have an AGE value, aggregates them by dataset (BioProject), and sorts the datasets by their number of such runs. I then looked more closely at the most interesting ones.

SRA CSV aggregated by BioProject
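Because tissue information is spread over several columns and age is free text, a follow-up step could merge them before filtering. The sketch below is one possible approach under stated assumptions, not what I ran: the `toy` table and its values are invented, the priority order of the tissue columns is arbitrary, and `tissue_any`, `age_num` and `blood_with_age` are names I made up for illustration. Real AGE values may need more careful parsing (units, ranges, gestational ages).

```r
library(tidyverse)

# Toy stand-in for the Run Selector CSV; all values are invented.
toy <- tibble(
  Run         = c("R1", "R2", "R3", "R4"),
  BioProject  = c("P1", "P1", "P2", "P3"),
  AGE         = c("34", "52 years", NA, "adult"),
  tissue      = c("Blood", NA, NA, NA),
  tissue_type = c(NA, "blood ", NA, NA),
  source_name = c("whole blood", "PBMC", "liver", "Colon")
)

harmonised <- toy %>%
  mutate(
    # First non-missing value across the tissue-like columns, in priority order,
    # lower-cased and with stray whitespace removed.
    tissue_any = str_to_lower(str_squish(coalesce(tissue, tissue_type, source_name))),
    # First number found in AGE; free text such as "adult" becomes NA.
    age_num = suppressWarnings(parse_number(AGE))
  )

# Example filter: runs with a numeric age whose tissue label mentions blood.
blood_with_age <- harmonised %>%
  filter(!is.na(age_num), str_detect(tissue_any, "blood"))

print(blood_with_age)
```

Taking the first non-missing column is a blunt rule; for a real analysis it is worth checking per BioProject which column actually carries the tissue label.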