Raw DNA File Format Reference

What a Raw DNA File Is

A raw DNA file is the plain-text data layer beneath a consumer DNA test. When a company like 23andMe or AncestryDNA shows you ancestry percentages or trait reports, those reports are generated from a file — and that file is available to download.

The file itself is unglamorous: a long list, several hundred thousand rows deep, where each row records one specific position in your genome and the genotype observed there. It is not a whole genome sequence. It is a targeted snapshot of the specific positions — called SNPs, single nucleotide polymorphisms — that the testing company's genotyping chip is designed to read.

That snapshot is enough to drive ancestry analysis, trait prediction, and — when interpreted against published research — pathway and genetic insight reports. Understanding the file's structure is the first step in working with it.

Anatomy of a Genotype Row

Every raw DNA file, regardless of provider, encodes the same four fundamental pieces of information per row:

rsID: the SNP's reference identifier, e.g. rs1801133. A standardized name from the dbSNP database that lets the same SNP be referenced consistently across any dataset or study.
Chromosome: where the SNP sits. This is one of the autosomes 1–22, the sex chromosomes X or Y, or mitochondrial DNA (MT), which is not one of the 23 chromosome pairs but is reported in the same column.
Position: the base-pair coordinate of the SNP on that chromosome. Only meaningful relative to a specific genome build.
Genotype: the two alleles observed at that position, one from each parent, e.g. AG. This is the actual test result.

The differences between providers come down to two things: how these four fields are formatted (delimiter, file extension, header style) and whether the genotype is stored as one column or split into two.
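The four fields can be held in one provider-independent record. The following is a minimal sketch in Python; the `GenotypeRow` name and its field names are illustrative assumptions, not part of any provider's specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenotypeRow:
    """One normalized row; the names here are illustrative, not a provider spec."""
    rsid: str        # dbSNP identifier, e.g. "rs1801133"
    chromosome: str  # kept as text: "1".."22", "X", "Y", "MT"
    position: int    # base-pair coordinate; only meaningful for a given build
    genotype: str    # observed alleles as one string, e.g. "AG"


# The same record built from a one-column source and a two-column source.
one_column = GenotypeRow("rs3094315", "1", 752566, "AG")
two_column = GenotypeRow("rs3094315", "1", 752566, "A" + "G")
print(one_column == two_column)
```

Keeping the chromosome as text avoids special-casing X, Y, and MT, while the position is stored as an integer so it can be sorted and compared numerically.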

Format Comparison Table

The four major consumer DNA testing services, compared on the attributes that matter when reading or processing their raw files.

Attribute	23andMe	AncestryDNA	MyHeritage	FamilyTreeDNA
File extension	.txt	.txt	.csv	.csv
Delimiter	Tab	Tab	Comma	Comma
Download container	.zip	.zip	.zip	.zip / .csv
Header style	# comment lines	# comment lines	Quoted CSV header	CSV header
Genotype column(s)	1 column (e.g. AG)	2 columns (allele1, allele2)	1 column (e.g. AG)	1 column (e.g. AG)
Approx. SNP count	~640,000	~640,000–700,000	~700,000	~700,000
Genome build	GRCh37 (hg19)	GRCh37 (hg19)	GRCh37 (hg19)	GRCh37 (hg19)
Genotyping chip	Illumina GSA (v5)	Illumina (custom OmniExpress / GSA)	Illumina GSA	Illumina GSA
Mitochondrial DNA	Included (chr MT)	Limited	Included	Included

The single most important takeaway: all four services use genome build GRCh37 and the standard rsID system, so their files can be cross-referenced. But they differ in file format, delimiter, and, crucially, whether the genotype is one column or two. Any tool that ingests "raw DNA" must normalize these differences first.

23andMe Format

The 23andMe raw data file is a tab-delimited .txt file delivered inside a .zip archive. It opens with a block of comment lines — each prefixed with # — that explain the file and document which genome build it uses. After the comment block comes a single header row, then the data.

23andMe stores the genotype as a single column: two characters representing both alleles, e.g. AG. For positions on the Y chromosome or mitochondrial DNA, where there is only one allele, the genotype is a single character.

# This data file generated by 23andMe at: [timestamp]
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# Below is a text version of your data. Fields are TAB-separated.
# Each line corresponds to a single SNP. For each SNP, we provide its identifier,
# its location on the reference human genome, and the genotype call oriented
# with respect to the plus strand on the human reference sequence.
rsid    chromosome    position    genotype
rs4477212    1    82154    AA
rs3094315    1    752566    AG
rs3131972    1    752721    GG
rs12124819    1    776546    AA

Key characteristics: comment lines must be skipped when parsing; the four columns are rsid, chromosome, position, genotype; chromosomes are labeled 1–22, X, Y, and MT; and a no-call (a position the chip could not read) appears as --.
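The characteristics listed above translate into a short reader. This is a sketch under the assumption that the file matches the sample shown on this page; `parse_23andme` and `NO_CALL` are hypothetical names chosen for illustration.

```python
import io

NO_CALL = "--"  # marker for a position the chip could not read


def parse_23andme(stream):
    """Yield (rsid, chromosome, position, genotype) tuples from a text stream."""
    for line in stream:
        line = line.rstrip("\r\n")
        # Skip blank lines, '#' comments, and the column-name header row.
        if not line or line.startswith("#") or line.startswith("rsid"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield rsid, chromosome, int(position), genotype


sample = io.StringIO(
    "# Fields are TAB-separated.\n"
    "rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t82154\tAA\n"
    "rs3094315\t1\t752566\t--\n"
)
rows = list(parse_23andme(sample))
called = [row for row in rows if row[3] != NO_CALL]
print(len(rows), len(called))
```

The generator keeps no-call rows and leaves the filtering to the caller, so the decision about how to treat them stays visible in the calling code.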

AncestryDNA Format

The AncestryDNA raw data file is also a tab-delimited .txt file inside a .zip, and like 23andMe it opens with #-prefixed comment lines. The structural difference that matters: AncestryDNA splits the genotype across two separate columns — allele1 and allele2 — rather than combining them into one.

#AncestryDNA raw data download
#This file was generated by AncestryDNA at: [timestamp]
#Data was collected using AncestryDNA array version: V2.0
#Below is a text version of your DNA file. Fields are TAB-separated.
#Each line corresponds to a SNP. Each SNP has an rsID, a chromosome,
#a position, and two alleles representing the genotype.
rsid    chromosome    position    allele1    allele2
rs4477212    1    82154    A    A
rs3094315    1    752566    A    G
rs3131972    1    752721    G    G
rs12124819    1    776546    A    A

To get a 23andMe-style genotype from an AncestryDNA file, you concatenate allele1 and allele2. This two-column structure is the most common reason a tool built for one format fails on the other. AncestryDNA labels no-calls as 0 in the allele columns.
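The concatenation step can be sketched as follows. The function name `ancestry_genotype` is hypothetical, and mapping a 0 allele to the 23andMe-style -- marker is a choice made for this example rather than anything AncestryDNA prescribes.

```python
def ancestry_genotype(allele1, allele2):
    """Join the two allele columns into one genotype string.

    A "0" in either column is treated as a no-call and returned as "--";
    that mapping is an assumption of this sketch.
    """
    if allele1 == "0" or allele2 == "0":
        return "--"
    return allele1 + allele2


line = "rs3094315\t1\t752566\tA\tG"
rsid, chromosome, position, allele1, allele2 = line.split("\t")
print(rsid, ancestry_genotype(allele1, allele2))
```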

Path 1 vs Path 2 in a processing pipeline: if you are building a pipeline that accepts raw DNA uploads, the 23andMe single-genotype-column format and the AncestryDNA two-allele-column format are the two primary cases to branch on. Detecting which file you have, by inspecting the header row, is the first processing step.
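One way to implement the header-row check is sketched below. The function name `detect_layout` and the labels it returns are hypothetical, and the sketch recognizes only the header spellings shown on this page.

```python
def detect_layout(lines):
    """Return "one-column" or "two-column" from the first non-comment row."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Normalize case, quoting, and delimiter before comparing names.
        columns = stripped.lower().replace('"', "").replace(",", "\t").split("\t")
        if "allele1" in columns and "allele2" in columns:
            return "two-column"
        if "genotype" in columns or "result" in columns:
            return "one-column"
        raise ValueError(f"unrecognized header row: {stripped!r}")
    raise ValueError("no header row found")


ancestry_head = [
    "#AncestryDNA raw data download",
    "rsid\tchromosome\tposition\tallele1\tallele2",
]
quoted_csv_head = ['"RSID","CHROMOSOME","POSITION","RESULT"']
print(detect_layout(ancestry_head), detect_layout(quoted_csv_head))
```

Raising on an unknown header, instead of guessing, makes an unsupported file fail loudly at the first step rather than produce misaligned columns later.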

MyHeritage Format

MyHeritage departs from the 23andMe/AncestryDNA convention. Its raw data file is a comma-delimited .csv file, and its header and values are typically wrapped in double quotes. The genotype is stored as a single column, like 23andMe.

# MyHeritage DNA raw data.
# For more information visit: https://www.myheritage.com/dna
# Below is a text version of your DNA file from MyHeritage.
"RSID","CHROMOSOME","POSITION","RESULT"
"rs4477212","1","82154","AA"
"rs3094315","1","752566","AG"
"rs3131972","1","752721","GG"
"rs12124819","1","776546","AA"

Note the differences from 23andMe: comma delimiter instead of tab; values quoted; the genotype column is named RESULT rather than genotype; and column names are uppercase. The underlying four-field structure is identical — only the formatting differs.
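Python's standard `csv` module handles the quoting and the comma delimiter directly. The sketch below assumes a file shaped like the sample above and renames the uppercase columns to the lowercase 23andMe-style names; the output key names are an illustrative choice.

```python
import csv
import io

sample = io.StringIO(
    "# MyHeritage DNA raw data.\n"
    '"RSID","CHROMOSOME","POSITION","RESULT"\n'
    '"rs4477212","1","82154","AA"\n'
    '"rs3094315","1","752566","AG"\n'
)

# Drop '#' comment lines first; csv.DictReader then removes the quotes
# and uses the first remaining row as the column names.
data_lines = (line for line in sample if not line.startswith("#"))
reader = csv.DictReader(data_lines)

# Map the uppercase column names onto lowercase 23andMe-style names.
rows = [
    {
        "rsid": record["RSID"],
        "chromosome": record["CHROMOSOME"],
        "position": int(record["POSITION"]),
        "genotype": record["RESULT"],
    }
    for record in reader
]
print(rows[1])
```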

FamilyTreeDNA Format

FamilyTreeDNA (FTDNA) Family Finder raw data follows essentially the same convention as MyHeritage: a comma-delimited .csv file with a quoted header, the genotype in a single column named RESULT, and the same four-field structure.

"RSID","CHROMOSOME","POSITION","RESULT"
"rs4477212","1","82154","AA"
"rs3094315","1","752566","AG"
"rs3131972","1","752721","GG"
"rs12124819","1","776546","AA"

FTDNA files generally have minimal or no comment header block — the CSV header row often comes first. Because MyHeritage and FTDNA share the quoted-CSV convention, a parser written for one usually handles the other with little or no modification.

Genome Builds: GRCh37 vs GRCh38

A position number in a raw DNA file — say, 752566 on chromosome 1 — is only meaningful relative to a genome build: the specific version of the human reference genome the coordinates are mapped against.

All four major consumer testing services use GRCh37, also known as hg19, released in 2009. The newer build, GRCh38 (hg38), released in 2013, uses different coordinates. The same physical SNP has a different position number in each build.

Build	Also called	Released	Used by
GRCh37	hg19	2009	23andMe, AncestryDNA, MyHeritage, FTDNA — all consumer chip tests
GRCh38	hg38	2013	Most current research databases; some whole genome sequencing services

⚠ Why this matters when cross-referencing: if you look up a SNP's position in a research database that uses GRCh38 and compare it to the position in a consumer raw DNA file (GRCh37), the numbers will not match. Nothing is wrong; the two builds are simply different coordinate systems. Always confirm the build before comparing positions. The rsID, by contrast, is build-independent and can be matched directly.
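Matching on rsID can be sketched as a dictionary join in which positions are never compared. The function name `shared_genotypes` and the two small mappings are hypothetical; the rsIDs and genotypes are taken from the sample rows on this page.

```python
def shared_genotypes(file_a, file_b):
    """Pair up genotypes from two sources by rsID, ignoring positions.

    Each argument maps rsID -> genotype. Because coordinates are not
    involved, the result does not depend on either source's genome build.
    """
    common = file_a.keys() & file_b.keys()
    return {rsid: (file_a[rsid], file_b[rsid]) for rsid in common}


source_a = {"rs4477212": "AA", "rs3094315": "AG"}
source_b = {"rs3094315": "AG", "rs3131972": "GG"}
print(shared_genotypes(source_a, source_b))
```

Only rs3094315 appears in both mappings, so it is the only key in the result.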

Chip Data vs Whole Genome Sequencing

Everything above describes genotyping chip data — the format used by mainstream ancestry tests. It is important to understand what it is not.

A genotyping chip reads a pre-selected set of roughly 600,000–700,000 SNPs: positions chosen because they are known to be informative for ancestry or traits. Whole genome sequencing (WGS), offered by services like Nebula Genomics and Dante Labs, reads all 3.2 billion base pairs of the genome. WGS output is thousands of times larger and is delivered in entirely different file formats — typically FASTQ, BAM, or VCF — not the simple four-column text file described here.

	Genotyping Chip	Whole Genome Sequencing
Positions read	~600K–700K SNPs	~3.2 billion base pairs
File format	4-column .txt / .csv	FASTQ, BAM, VCF
File size	~15–25 MB	Tens to hundreds of GB
Services	23andMe, AncestryDNA, MyHeritage, FTDNA	Nebula Genomics, Dante Labs
Best for	Ancestry, common trait SNPs	Comprehensive sequence analysis

For the purposes of SNP-based analysis — looking up specific well-studied positions like rs1801133 in MTHFR — chip data is sufficient, because those SNPs are exactly what the chip is designed to capture.
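The "thousands of times" comparison follows from the figures quoted in this section. A quick check, using only the approximate counts given above:

```python
genome_base_pairs = 3_200_000_000   # whole genome, as quoted above
chip_snps_low = 600_000             # lower end of the typical chip range
chip_snps_high = 700_000            # upper end of the typical chip range

# Share of genome positions a chip reads, as a percentage.
chip_share = chip_snps_high / genome_base_pairs
print(f"{chip_share:.3%}")

# How many times more positions WGS reads than a chip.
ratio_min = round(genome_base_pairs / chip_snps_high)
ratio_max = round(genome_base_pairs / chip_snps_low)
print(ratio_min, ratio_max)
```

This prints 0.022% and then 4571 5333: a chip covers roughly two hundredths of one percent of the genome, and WGS reads about 4,600 to 5,300 times as many positions.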

From raw file to readable insight

Understanding the file format is step one. The NuGenia Peptide Insight Report takes a raw DNA file from any of the four major services described here and generates a personalized genomic report — mapping your genotype data across biological pathways. It handles the format differences described on this page automatically.

Common File Issues

1. Opening the file in a spreadsheet corrupts it

Opening a raw DNA file in Excel and saving it again can silently alter it. The spreadsheet guesses a type for every cell, so long numbers can be shown in scientific notation and some genotype values (like 1/2 in certain encodings) can be reinterpreted as dates. Always work with the file as plain text, or import it explicitly with every column formatted as text.

2. Forgetting to skip comment lines

The #-prefixed comment block at the top of 23andMe and AncestryDNA files is not data. A parser that does not skip these lines will fail on the first row. The number of comment lines varies, so detect them by the # prefix rather than assuming a fixed count.

3. Assuming one genotype column

The most common cross-provider bug: code written for 23andMe's single genotype column breaks on AncestryDNA's two-column allele1/allele2 structure. Detect the format from the header row before parsing the body.

4. Mismatched genome builds

Comparing a GRCh37 position to a GRCh38 position produces silent, wrong results. When in doubt, match on rsID instead of position — the rsID is build-independent.

5. No-call values

Positions the chip could not confidently read appear as no-calls — -- in 23andMe files, 0 in AncestryDNA allele columns. These must be handled explicitly rather than treated as valid genotypes.
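Explicit handling can be as simple as mapping every known no-call spelling to a single sentinel. In this sketch `normalize_call` and `NO_CALL_MARKERS` are hypothetical names; the marker set covers the two spellings named above plus "00", which is what results if two AncestryDNA 0 alleles are concatenated. Other providers or file versions may use other markers.

```python
# No-call spellings covered by this sketch; treat the set as a starting
# point, since other files may use different markers.
NO_CALL_MARKERS = {"--", "0", "00"}


def normalize_call(genotype):
    """Return the upper-cased genotype, or None when it is a no-call."""
    value = genotype.strip().upper()
    if not value or value in NO_CALL_MARKERS:
        return None
    return value


calls = ["AG", "--", "0", "gg"]
print([normalize_call(call) for call in calls])
```

Returning None forces downstream code to check for a missing value instead of accidentally comparing "--" against real genotypes.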

6. Character encoding and line endings

Files from different services and operating systems may use different line endings (Windows CRLF vs Unix LF). A robust parser normalizes line endings rather than assuming one.
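In Python, text-mode reading already normalizes line endings when the `newline` argument is left at its default. The sketch below also decodes with `utf-8-sig` so that a leading byte-order mark, if one is present, is dropped; that is a precaution assumed for this example, and the sample bytes are made up to exercise both cases.

```python
import io

# Sample bytes with a UTF-8 byte-order mark and Windows (CRLF) line endings.
raw = (
    b"\xef\xbb\xbfrsid\tchromosome\tposition\tgenotype\r\n"
    b"rs4477212\t1\t82154\tAA\r\n"
)

# encoding="utf-8-sig" removes a leading byte-order mark if present;
# newline=None (the default) translates \r\n and \r to \n while reading.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline=None)
lines = [line.rstrip("\n") for line in text]
print(lines)
```

The same two arguments can be passed to the built-in `open()` when reading from disk.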

Frequently Asked Questions

What is a raw DNA file?

A raw DNA file is a plain text file containing the genotype results from a consumer DNA test. It lists several hundred thousand specific positions in the genome (SNPs) and the genotype observed at each one. It is the underlying data behind the ancestry and trait reports a testing company shows you, and it can be downloaded and used with third-party analysis tools.

What is the difference between the 23andMe and AncestryDNA file formats?

Both are tab-delimited text files with comment header lines, but they structure the genotype differently. 23andMe places the full genotype in a single column as a two-character string, for example AG. AncestryDNA splits the genotype across two separate columns, allele1 and allele2. Any tool that reads both formats has to account for this structural difference.

Which genome build do consumer DNA tests use?

All four major consumer testing services — 23andMe, AncestryDNA, MyHeritage, and FamilyTreeDNA — report positions on genome build GRCh37, also known as hg19. This matters when cross-referencing positions against external databases, because a position number is only meaningful relative to its build. Build GRCh38, also called hg38, uses different coordinates.

How many SNPs are in a raw DNA file?

Consumer raw DNA files typically contain between roughly 600,000 and 700,000 SNPs, depending on the testing company and the chip version used. This is a small targeted fraction of the genome — a whole genome sequence covers all 3.2 billion base pairs, which is thousands of times more data.

What do the columns in a raw DNA file mean?

A raw DNA file has four core pieces of information per row: the SNP identifier (rsID), the chromosome it sits on, the base-pair position on that chromosome, and the genotype — the two alleles observed at that position. 23andMe keeps the genotype as one column; AncestryDNA splits it into two.

Can raw DNA files from different companies be compared directly?

They can be cross-referenced because all four major services use the same genome build (GRCh37) and the standard rsID system for naming SNPs. However, the files are not identical: they cover overlapping but different sets of SNPs, use different file formats and delimiters, and structure the genotype column differently. A tool comparing them must normalize these differences first.

Is it safe to download my raw DNA file?

Downloading your own raw DNA file from a testing service you already use does not create new risk — the data already exists in that company's system. The considerations come with what you do next: where you upload it, which third-party services you share it with, and how you store your own copy. Treat the file as sensitive personal data.

Why is the file so small if it contains my DNA?

Because it does not contain all of your DNA. A genotyping chip reads only a curated set of roughly 600,000–700,000 informative positions, not the full 3.2-billion-base-pair genome. That is why a raw DNA file is only 15–25 MB — small enough to email — while a whole genome sequence runs to tens or hundreds of gigabytes.

Informational Reference

This page describes consumer DNA file formats for educational and technical reference purposes. File formats, chip versions, and SNP counts are set by the testing companies and may change; figures here reflect formats current as of the last update and are approximate. Always refer to the testing company's own documentation for authoritative, current specifications.

This page does not provide medical or genetic advice. Genetic data is sensitive personal information — handle and share it accordingly.
