What a Raw DNA File Is
A raw DNA file is the plain-text data layer beneath a consumer DNA test. When a company like 23andMe or AncestryDNA shows you ancestry percentages or trait reports, those reports are generated from a file — and that file is available to download.
The file itself is unglamorous: a long list, several hundred thousand rows deep, where each row records one specific position in your genome and the genotype observed there. It is not a whole genome sequence. It is a targeted snapshot of the specific positions — called SNPs, single nucleotide polymorphisms — that the testing company's genotyping chip is designed to read.
That snapshot is enough to drive ancestry analysis, trait prediction, and — when interpreted against published research — pathway and genetic insight reports. Understanding the file's structure is the first step in working with it.
Anatomy of a Genotype Row
Every raw DNA file, regardless of provider, encodes the same four fundamental pieces of information per row:
- rsID: the SNP's reference identifier, e.g. rs1801133. A standardized name from the dbSNP database that lets the same SNP be referenced consistently across any dataset or study.
- Chromosome: where the SNP sits. Values cover the 23 chromosome pairs (1–22, plus X and Y) and, as a separate label, mitochondrial DNA.
- Position: the base-pair coordinate of the SNP on that chromosome. Only meaningful relative to a specific genome build.
- Genotype: the two alleles observed at that position, one from each parent, e.g. AG. The actual result.
The differences between providers come down to two things: how these four fields are formatted (delimiter, file extension, header style) and whether the genotype is stored as one column or split into two.
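As a minimal sketch, the four fields can be modeled as one normalized record that every provider's rows are converted into. The class name `GenotypeRow` and the sample position are illustrative assumptions, not part of any provider's specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenotypeRow:
    """One row of a raw DNA file, normalized to the four core fields."""
    rsid: str        # dbSNP identifier, e.g. "rs1801133"
    chromosome: str  # kept as text so "X", "Y" and "MT" fit alongside "1".."22"
    position: int    # base-pair coordinate, valid only for one genome build
    genotype: str    # both alleles as a single string, e.g. "AG"


# Illustrative values only: the position is a placeholder, not a real coordinate.
example = GenotypeRow(rsid="rs1801133", chromosome="1", position=12345, genotype="AG")
print(example.rsid, example.genotype)
```

Keeping the chromosome as text avoids special-casing the non-numeric labels, and storing the genotype as one string means a two-column provider only needs a conversion step at read time.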
Format Comparison Table
The four major consumer DNA testing services, compared on the attributes that matter when reading or processing their raw files.
| Attribute | 23andMe | AncestryDNA | MyHeritage | FamilyTreeDNA |
|---|---|---|---|---|
| File extension | .txt | .txt | .csv | .csv |
| Delimiter | Tab | Tab | Comma | Comma |
| Download container | .zip | .zip | .zip | .zip / .csv |
| Header style | # comment lines | # comment lines | Quoted CSV header | CSV header |
| Genotype column(s) | 1 column (e.g. AG) | 2 columns (allele1, allele2) | 1 column (e.g. AG) | 1 column (e.g. AG) |
| Approx. SNP count | ~640,000 | ~640,000–700,000 | ~700,000 | ~700,000 |
| Genome build | GRCh37 (hg19) | GRCh37 (hg19) | GRCh37 (hg19) | GRCh37 (hg19) |
| Genotyping chip | Illumina GSA (v5) | Illumina (custom OmniExpress / GSA) | Illumina GSA | Illumina GSA |
| Mitochondrial DNA | Included (chr MT) | Limited | Included | Included |
23andMe Format
The 23andMe raw data file is a tab-delimited .txt file delivered inside a .zip archive. It opens with a block of comment lines, each prefixed with #, that explain the file and document which genome build it uses. The column names come at the end of that block (in 23andMe exports the header line is typically itself prefixed with #), and the data rows follow.
23andMe stores the genotype as a single column: two characters representing both alleles, e.g. AG. For positions on the Y chromosome or mitochondrial DNA, where there is only one allele, the genotype is a single character.
Key characteristics: comment lines must be skipped when parsing; the four columns are rsid, chromosome, position, genotype; chromosomes are labeled 1–22, X, Y, and MT; and a no-call (a position the chip could not read) appears as --.
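The characteristics above can be sketched as a small line parser. The function names and the sample rows are hypothetical (the rsIDs and positions are made up), and the sketch assumes comment lines have already been removed.

```python
def parse_23andme_line(line):
    """Split one tab-delimited data line into its four fields."""
    rsid, chromosome, position, genotype = line.rstrip("\r\n").split("\t")
    return rsid, chromosome, int(position), genotype


def describe_genotype(genotype):
    """Classify a 23andMe-style genotype string by its shape."""
    if genotype == "--":
        return "no-call"
    if len(genotype) == 1:
        return "single allele"  # e.g. Y-chromosome or MT positions
    return "two alleles"


# Made-up rows for illustration only.
SAMPLE_LINES = [
    "rs0000001\t1\t752566\tAG\n",
    "rs0000002\tMT\t100\tA\n",
    "rs0000003\t2\t5000\t--\n",
]

parsed = [parse_23andme_line(line) for line in SAMPLE_LINES]
shapes = [describe_genotype(row[3]) for row in parsed]
print(shapes)
```

Unpacking into exactly four names makes a malformed row fail loudly with a `ValueError` instead of being silently misread.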
AncestryDNA Format
The AncestryDNA raw data file is also a tab-delimited .txt file inside a .zip, and like 23andMe it opens with #-prefixed comment lines. The structural difference that matters: AncestryDNA splits the genotype across two separate columns — allele1 and allele2 — rather than combining them into one.
To get a 23andMe-style genotype from an AncestryDNA file, you concatenate allele1 and allele2. This two-column structure is the most common reason a tool built for one format fails on the other. AncestryDNA labels no-calls as 0 in the allele columns.
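A minimal sketch of that concatenation, assuming the columns appear in the order rsid, chromosome, position, allele1, allele2. The function names and the sample row are hypothetical.

```python
def combine_alleles(allele1, allele2):
    """Join the two allele columns into one genotype string.

    Returns None when either allele is the no-call marker "0".
    """
    if allele1 == "0" or allele2 == "0":
        return None
    return allele1 + allele2


def parse_ancestry_line(line):
    """Parse one tab-delimited five-column line into a four-field tuple."""
    rsid, chromosome, position, allele1, allele2 = line.rstrip("\r\n").split("\t")
    return rsid, chromosome, int(position), combine_alleles(allele1, allele2)


# Made-up row for illustration only.
print(parse_ancestry_line("rs0000001\t1\t752566\tA\tG\n"))
```

Returning `None` for a no-call, rather than the string "00", keeps a missing result from being mistaken for a genotype later on.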
MyHeritage Format
MyHeritage departs from the 23andMe/AncestryDNA convention. Its raw data file is a comma-delimited .csv file, and its header and values are typically wrapped in double quotes. The genotype is stored as a single column, like 23andMe.
Note the differences from 23andMe: comma delimiter instead of tab; values quoted; the genotype column is named RESULT rather than genotype; and column names are uppercase. The underlying four-field structure is identical — only the formatting differs.
FamilyTreeDNA Format
FamilyTreeDNA (FTDNA) Family Finder raw data follows essentially the same convention as MyHeritage: a comma-delimited .csv file with a quoted header, the genotype in a single column named RESULT, and the same four-field structure.
FTDNA files generally have minimal or no comment header block — the CSV header row often comes first. Because MyHeritage and FTDNA share the quoted-CSV convention, a parser written for one usually handles the other with little or no modification.
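The shared quoted-CSV convention can be read with Python's standard `csv` module, which strips the double quotes on its own. This is a sketch under assumptions: the column names RSID, CHROMOSOME and POSITION are inferred from the uppercase naming described above, `read_quoted_csv` is a hypothetical helper, and the sample rows are made up.

```python
import csv

# Made-up sample in the quoted-CSV style described above.
SAMPLE = (
    '"RSID","CHROMOSOME","POSITION","RESULT"\n'
    '"rs0000001","1","752566","AG"\n'
    '"rs0000002","MT","100","A"\n'
)


def read_quoted_csv(text):
    """Parse quoted-CSV text into (rsid, chromosome, position, genotype) tuples."""
    # Drop blank lines and any '#' comment lines before the CSV header.
    data_lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(data_lines)
    return [
        (r["RSID"], r["CHROMOSOME"], int(r["POSITION"]), r["RESULT"])
        for r in reader
    ]


rows = read_quoted_csv(SAMPLE)
print(rows[0])
```

Because `csv.DictReader` accepts quoted and unquoted fields alike, the same function should cope with a header that is not wrapped in quotes.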
Genome Builds: GRCh37 vs GRCh38
A position number in a raw DNA file — say, 752566 on chromosome 1 — is only meaningful relative to a genome build: the specific version of the human reference genome the coordinates are mapped against.
All four major consumer testing services use GRCh37, also known as hg19, released in 2009. The newer build, GRCh38 (hg38), released in 2013, uses different coordinates. The same physical SNP has a different position number in each build.
| Build | Also called | Released | Used by |
|---|---|---|---|
| GRCh37 | hg19 | 2009 | 23andMe, AncestryDNA, MyHeritage, FTDNA — all consumer chip tests |
| GRCh38 | hg38 | 2013 | Most current research databases; some whole genome sequencing services |
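One way to keep builds from being mixed by accident is to carry the build name alongside every coordinate and refuse cross-build comparisons. The sketch below is one possible guard, not an established convention; `Coordinate` and `same_locus` are hypothetical names.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    build: str       # "GRCh37" or "GRCh38"
    chromosome: str
    position: int


def same_locus(a: Coordinate, b: Coordinate) -> bool:
    """Compare two coordinates, refusing to mix genome builds."""
    if a.build != b.build:
        raise ValueError(f"cannot compare {a.build} with {b.build} coordinates")
    return (a.chromosome, a.position) == (b.chromosome, b.position)


chip = Coordinate("GRCh37", "1", 752566)
print(same_locus(chip, Coordinate("GRCh37", "1", 752566)))
```

Raising an error turns what would otherwise be a silent mismatch into a visible failure at the point of comparison.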
Chip Data vs Whole Genome Sequencing
Everything above describes genotyping chip data — the format used by mainstream ancestry tests. It is important to understand what it is not.
A genotyping chip reads a pre-selected set of roughly 600,000–700,000 SNPs: positions chosen because they are known to be informative for ancestry or traits. Whole genome sequencing (WGS), offered by services like Nebula Genomics and Dante Labs, reads all 3.2 billion base pairs of the genome. WGS output is thousands of times larger and is delivered in entirely different file formats — typically FASTQ, BAM, or VCF — not the simple four-column text file described here.
| | Genotyping Chip | Whole Genome Sequencing |
|---|---|---|
| Positions read | ~600K–700K SNPs | ~3.2 billion base pairs |
| File format | 4-column .txt / .csv | FASTQ, BAM, VCF |
| File size | ~15–25 MB | Tens to hundreds of GB |
| Services | 23andMe, AncestryDNA, MyHeritage, FTDNA | Nebula Genomics, Dante Labs |
| Best for | Ancestry, common trait SNPs | Comprehensive sequence analysis |
For the purposes of SNP-based analysis — looking up specific well-studied positions like rs1801133 in MTHFR — chip data is sufficient, because those SNPs are exactly what the chip is designed to capture.
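Looking up a short list of SNPs in parsed chip data might look like the sketch below. It assumes rows have already been normalized to (rsid, chromosome, position, genotype) tuples; `lookup` is a hypothetical helper, and the genotype, positions, and the identifier rs9999999 are made up for illustration.

```python
def lookup(rows, wanted_rsids):
    """Return {rsid: genotype or None} for each requested rsID."""
    index = {rsid: genotype for rsid, _chrom, _pos, genotype in rows}
    return {rsid: index.get(rsid) for rsid in wanted_rsids}


# Illustrative rows only; positions and genotypes are placeholders.
rows = [
    ("rs1801133", "1", 12345, "AG"),
    ("rs0000002", "2", 5000, "CC"),
]

# rs9999999 stands in for a SNP that is not present in the file.
report = lookup(rows, ["rs1801133", "rs9999999"])
print(report)
```

A `None` in the result means the SNP is absent from the file, which is different from a no-call, where the SNP is listed but could not be read.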
From raw file to readable insight
Understanding the file format is step one. The NuGenia Peptide Insight Report takes a raw DNA file from any of the four major services described here and generates a personalized genomic report — mapping your genotype data across biological pathways. It handles the format differences described on this page automatically.
Common File Issues
1. Opening the file in a spreadsheet corrupts it
Opening a raw DNA file in Excel and saving it again can silently alter it: cells are auto-formatted on import, so long numeric values may be rewritten (for example into scientific notation), and some genotype values (like 1/2 in certain encodings) can be reinterpreted as dates. Always work with the file as plain text, or import it with every column explicitly set to text.
2. Forgetting to skip comment lines
The #-prefixed comment block at the top of 23andMe and AncestryDNA files is not data. A parser that does not skip these lines will fail on the first row. The number of comment lines varies, so detect them by the # prefix rather than assuming a fixed count.
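Detecting comments by prefix rather than by count can be sketched as follows. `split_comments` is a hypothetical helper, and the sample lines are made up.

```python
def split_comments(lines):
    """Separate '#'-prefixed comment lines from data lines."""
    comments, data = [], []
    for line in lines:
        (comments if line.startswith("#") else data).append(line)
    return comments, data


# Made-up file contents for illustration only.
SAMPLE = [
    "# This file was exported by a hypothetical provider",
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t752566\tAG",
    "rs0000002\t1\t800000\tCC",
]

comments, data = split_comments(SAMPLE)
print(len(comments), len(data))
```

Keeping the comment lines instead of discarding them is a deliberate choice here: they are where the genome build is documented.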
3. Assuming one genotype column
The most common cross-provider bug: code written for 23andMe's single genotype column breaks on AncestryDNA's two-column allele1/allele2 structure. Detect the format from the header row before parsing the body.
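Header-based detection could be sketched like this. The function name and the returned labels are assumptions; the column names checked (allele1, allele2, genotype, RESULT) are the ones described on this page.

```python
def detect_format(header_line):
    """Guess the genotype layout from the column-name line."""
    # Tolerate a leading '#', surrounding quotes, and either delimiter.
    cleaned = header_line.strip().lstrip("#").strip()
    delimiter = "\t" if "\t" in cleaned else ","
    columns = [c.strip().strip('"').lower() for c in cleaned.split(delimiter)]
    if "allele1" in columns and "allele2" in columns:
        return "two-column"     # AncestryDNA-style
    if "genotype" in columns or "result" in columns:
        return "single-column"  # 23andMe, MyHeritage, FTDNA-style
    raise ValueError(f"unrecognized header: {header_line!r}")


print(detect_format("rsid\tchromosome\tposition\tallele1\tallele2"))
print(detect_format('"RSID","CHROMOSOME","POSITION","RESULT"'))
```

Raising on an unknown header is safer than guessing, since a wrong guess would shift every column in the body.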
4. Mismatched genome builds
Comparing a GRCh37 position to a GRCh38 position produces silent, wrong results. When in doubt, match on rsID instead of position — the rsID is build-independent.
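Matching on rsID can be sketched as a dictionary join that never looks at the position column. `shared_genotypes` is a hypothetical helper, and both sample files are made up; their positions differ on purpose to show that the join still succeeds.

```python
def shared_genotypes(file_a, file_b):
    """Pair up genotypes from two files by rsID, ignoring positions."""
    index_b = {rsid: genotype for rsid, _chrom, _pos, genotype in file_b}
    return {
        rsid: (genotype, index_b[rsid])
        for rsid, _chrom, _pos, genotype in file_a
        if rsid in index_b
    }


# Made-up rows; the same rsID carries a different position in each file.
file_a = [("rs0000001", "1", 752566, "AG"), ("rs0000002", "1", 800000, "CC")]
file_b = [("rs0000001", "1", 999999, "AG"), ("rs0000003", "2", 5000, "TT")]

shared = shared_genotypes(file_a, file_b)
print(shared)
```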
5. No-call values
Positions the chip could not confidently read appear as no-calls — -- in 23andMe files, 0 in AncestryDNA allele columns. These must be handled explicitly rather than treated as valid genotypes.
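One way to handle no-calls explicitly is to map every no-call spelling to `None` at read time. In this sketch the token "00" is an assumption covering two AncestryDNA "0" alleles that were concatenated; the function names are hypothetical.

```python
# "--" and "0" come from the formats described above; "00" is an assumed
# token for two concatenated "0" alleles.
NO_CALL_TOKENS = {"--", "0", "00"}


def normalize_genotype(raw):
    """Return the genotype string, or None for a no-call."""
    value = raw.strip()
    if not value or value in NO_CALL_TOKENS:
        return None
    return value


def call_rate(genotypes):
    """Fraction of genotypes that are actual calls."""
    calls = [normalize_genotype(g) for g in genotypes]
    called = sum(1 for c in calls if c is not None)
    return called / len(calls) if calls else 0.0


print(call_rate(["AG", "--", "00", "CC"]))
```

With no-calls represented as `None`, any later comparison or lookup has to decide what to do with them instead of treating "--" as if it were a pair of alleles.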
6. Character encoding and line endings
Files from different services and operating systems may use different line endings (Windows CRLF vs Unix LF). A robust parser normalizes line endings rather than assuming one.
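A minimal sketch of line-ending normalization, assuming the file is UTF-8 (the encoding is an assumption; providers do not all document it). `read_lines` is a hypothetical helper.

```python
def read_lines(raw_bytes):
    """Decode file bytes and split into lines, whatever the line-ending style."""
    # "utf-8-sig" decodes UTF-8 and also strips a byte-order mark if present.
    text = raw_bytes.decode("utf-8-sig")
    # str.splitlines() treats \r\n, \n and \r all as line boundaries.
    return text.splitlines()


windows_style = b"rs0000001\t1\t752566\tAG\r\nrs0000002\t1\t800000\tCC\r\n"
unix_style = b"rs0000001\t1\t752566\tAG\nrs0000002\t1\t800000\tCC\n"
print(read_lines(windows_style) == read_lines(unix_style))
```

Splitting once at read time means the rest of the parser never sees a stray carriage return at the end of the genotype column.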
Frequently Asked Questions
What is a raw DNA file?
A raw DNA file is a plain text file containing the genotype results from a consumer DNA test. It lists several hundred thousand specific positions in the genome (SNPs) and the genotype observed at each one. It is the underlying data behind the ancestry and trait reports a testing company shows you, and it can be downloaded and used with third-party analysis tools.
What is the difference between the 23andMe and AncestryDNA file formats?
Both are tab-delimited text files with comment header lines, but they structure the genotype differently. 23andMe places the full genotype in a single column as a two-character string, for example AG. AncestryDNA splits the genotype across two separate columns, allele1 and allele2. Any tool that reads both formats has to account for this structural difference.
Which genome build do consumer DNA tests use?
All four major consumer testing services — 23andMe, AncestryDNA, MyHeritage, and FamilyTreeDNA — report positions on genome build GRCh37, also known as hg19. This matters when cross-referencing positions against external databases, because a position number is only meaningful relative to its build. Build GRCh38, also called hg38, uses different coordinates.
How many SNPs are in a raw DNA file?
Consumer raw DNA files typically contain between roughly 600,000 and 700,000 SNPs, depending on the testing company and the chip version used. This is a small targeted fraction of the genome — a whole genome sequence covers all 3.2 billion base pairs, which is thousands of times more data.
What do the columns in a raw DNA file mean?
A raw DNA file has four core pieces of information per row: the SNP identifier (rsID), the chromosome it sits on, the base-pair position on that chromosome, and the genotype — the two alleles observed at that position. 23andMe keeps the genotype as one column; AncestryDNA splits it into two.
Can raw DNA files from different companies be compared directly?
They can be cross-referenced because all four major services use the same genome build (GRCh37) and the standard rsID system for naming SNPs. However, the files are not identical: they cover overlapping but different sets of SNPs, use different file formats and delimiters, and structure the genotype column differently. A tool comparing them must normalize these differences first.
Is it safe to download my raw DNA file?
Downloading your own raw DNA file from a testing service you already use does not create new risk — the data already exists in that company's system. The considerations come with what you do next: where you upload it, which third-party services you share it with, and how you store your own copy. Treat the file as sensitive personal data.
Why is the file so small if it contains my DNA?
Because it does not contain all of your DNA. A genotyping chip reads only a curated set of roughly 600,000–700,000 informative positions, not the full 3.2-billion-base-pair genome. That is why a raw DNA file is only 15–25 MB — small enough to email — while a whole genome sequence runs to tens or hundreds of gigabytes.
This page describes consumer DNA file formats for educational and technical reference purposes. File formats, chip versions, and SNP counts are set by the testing companies and may change; figures here reflect formats current as of the last update and are approximate. Always refer to the testing company's own documentation for authoritative, current specifications.
This page does not provide medical or genetic advice. Genetic data is sensitive personal information — handle and share it accordingly.