Can you clarify the method used for ordering?

#1
by vince62s - opened

You say "This release uses an internally consistent allele-ordering strategy."

Is the decision statistical over the dataset, or what exactly did you do to decide whether flipping A1/A2 was necessary? Did you just use the FRQ_A/FRQ_U numbers?
I am using hg38, but sometimes it seems off. I am not sure whether hg19 and hg38 agree.

Thanks for flagging this. I clarified the dataset card because the original sentence was too compressed.

Short answer: the ordering is a deterministic, internal allele ordering. It is not a statistical inference over the dataset, and it is not a REF/ALT resolution step.

For each row, the pipeline takes the two source-reported allele columns, upper-cases them, and sorts them lexicographically. The variant_key is then written as chr:pos:allele1:allele2 using that ordered pair. If the original first allele sorts after the second allele, the signed effect_size is multiplied by -1 and was_flipped is set to true; otherwise the sign is left unchanged.
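As a rough illustration of the rule described above, here is a minimal Python sketch. It is not the pipeline's actual code: the function name `canonicalize`, its argument names, and the assumption that the chromosome value is written into the key exactly as given are all mine.

```python
def canonicalize(chrom, pos, a1, a2, effect_size):
    """Order two alleles lexicographically; negate the effect if they swap.

    Hypothetical helper, written only to illustrate the rule in this thread.
    """
    # Upper-case both source-reported alleles before comparing them.
    first, second = a1.upper(), a2.upper()

    # The row counts as flipped when the original first allele sorts after the second.
    was_flipped = first > second
    if was_flipped:
        first, second = second, first
        effect_size = -effect_size

    # Key layout chr:pos:allele1:allele2, built from the ordered pair.
    variant_key = f"{chrom}:{pos}:{first}:{second}"
    return variant_key, effect_size, was_flipped


# Example rows with made-up values.
print(canonicalize("1", 12345, "t", "C", 0.05))
print(canonicalize("1", 12345, "A", "G", 0.05))
```

Note that nothing in this sketch looks at frequencies or at a reference genome. The outcome depends only on the two allele strings in the row.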

So, no: FRQ_A, FRQ_U, maf, eaf, etc. are not used to decide whether a row is flipped. Frequency fields are preserved when available, but they are not used for strand inference, REF/ALT inference, or genome-build-specific allele orientation.

Also important: this release does not perform external-reference allele resolution, dbSNP normalization, or hg19/hg38 liftover. Coordinates are inherited from the selected OpenMed/PGC source configuration. Many PGC summary-statistics releases are GRCh37/hg19-based, but the safe interpretation here is that coordinates are in whatever build the source used, unless the specific upstream release documents otherwise. If you are joining against hg38 resources, you should lift over the positions and verify the alleles against an hg38 reference before treating them as REF/ALT.
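For the verification step, something along these lines could work once positions have been lifted over. This is only a sketch under several assumptions: `match_to_reference` is a hypothetical helper, `ref_base` must come from your own hg38 lookup (not shown), and only single-base alleles are handled.

```python
def match_to_reference(variant_key, ref_base):
    """Report which allele in a variant_key equals a given reference base.

    Hypothetical helper. ref_base is assumed to be the hg38 base at the
    lifted-over position, obtained by the caller from their own reference.
    """
    # Key layout from this thread: chr:pos:allele1:allele2
    _chrom, _pos, allele1, allele2 = variant_key.split(":")
    ref = ref_base.upper()

    if ref == allele1:
        return "allele1_is_ref"
    if ref == allele2:
        return "allele2_is_ref"
    # Could indicate a strand difference, a build mismatch, or a bad liftover;
    # this sketch does not try to tell those cases apart.
    return "no_match"


print(match_to_reference("1:12345:C:T", "t"))
```

Rows that come back as `no_match` are the ones worth inspecting by hand before any join.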

In other words, variant_key is an internally consistent association key, not a guarantee that allele1/allele2 correspond to hg38 REF/ALT.
