Canonical sequence

A canonical sequence is a sequence of DNA, RNA, or amino acids that reflects the most common choice of base or amino acid at each position. Many databases use the canonical sequence as their reference, and some provide only that sequence. UniProtKB/Swiss-Prot, for example, describes all the protein products encoded by one gene in a single entry and selects the canonical sequence for that entry according to the following criteria:^[1]

It is the most prevalent.
It is the most similar to orthologous sequences found in other species.
By virtue of its length or amino acid composition, it allows the clearest description of domains, isoforms, polymorphisms, post-translational modifications, etc.
In the absence of any information, the longest sequence is chosen.
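
The position-wise definition given at the start of this article can be illustrated with a minimal Python sketch. It assumes the input sequences are already aligned to the same length. The function name `consensus` and the example sequences are hypothetical, chosen only for illustration. This is not the UniProtKB/Swiss-Prot procedure, which selects among whole sequences using the criteria listed above.

```python
from collections import Counter


def consensus(aligned):
    """Return the most common residue at each column of equal-length sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) yields one tuple of residues per alignment column.
    # Ties go to the residue seen first in the input, an arbitrary choice
    # made for this sketch only.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


# Illustrative input: three aligned DNA fragments (made-up data).
example = ["ATGCA", "ATGTA", "ACGCA"]
result = consensus(example)
print(result)
```

For the example input, columns two and four each contain one variant base, and the majority base is kept at both positions.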


See also