- Researchers from the Functional Genomics Team at CNAG analysed a vast dataset of more than 120,000 open reading frames (ORFs) derived from human endogenous retroviruses (HERVs) and identified 17,540 retroviral protein domains.

- Published in NAR Genomics and Bioinformatics, the study paves the way for further research into how HERV proteins are expressed, how they influence the immune system, and how the genome has repurposed them in health and disease.
 
 
March 2, 2026. Nearly 8% of human DNA is made up of human endogenous retroviruses (HERVs), remnants of ancient viral infections that have been passed down from generation to generation. These sequences are often referred to as “genomic fossils” because, although most have lost their ability to act as viruses, they remain embedded in our genome. While many HERVs are fragmented or mutated, some still contain functional elements that can influence gene regulation, protein production, and other cellular processes. Until now, little has been known about their potential to code for proteins. To better understand this potential, the Functional Genomics Team at CNAG, led by Dr Anna Esteve-Codina, working with researcher Tomàs Montserrat-Ayuso and in collaboration with Dr Aurora Pujol at the Institut d’Investigació Biomèdica de Bellvitge (IDIBELL), has generated the most comprehensive map to date of protein domains within HERV sequences.
 
 
The study, published in NAR Genomics and Bioinformatics, analyses more than 120,000 HERV-derived open reading frames (ORFs), stretches of DNA that could encode proteins. Using a large-scale, reproducible pipeline based on HMMER and InterProScan, the researchers identified 17,540 protein domain matches within these sequences, most of them corresponding to parts of the viral pol gene, such as reverse transcriptase, RNase H and protease. These are core viral components that were once essential for copying viral genetic material and integrating it into the host genome. Remarkably, thousands of these domains remain highly conserved, with around 1,000 showing more than 95% alignment coverage, indicating that many ancient viral protein regions are still almost complete.
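
To illustrate what an alignment-coverage threshold of this kind means, here is a minimal Python sketch. It is not the study's code. It assumes coverage is defined as the fraction of a domain model's length spanned by the aligned region of a hit, which the article does not specify, and the `DomainHit` record, its field names, and the example hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """Hypothetical record for one domain match on an ORF (not the study's schema)."""
    orf_id: str
    domain: str
    hmm_from: int    # first aligned position on the domain model (1-based, inclusive)
    hmm_to: int      # last aligned position on the domain model (inclusive)
    hmm_length: int  # total length of the domain model


def alignment_coverage(hit: DomainHit) -> float:
    """Fraction of the domain model spanned by the alignment (assumed definition)."""
    if hit.hmm_length <= 0:
        raise ValueError("hmm_length must be positive")
    aligned = hit.hmm_to - hit.hmm_from + 1
    return aligned / hit.hmm_length


def highly_conserved(hits, threshold=0.95):
    """Keep hits whose coverage is strictly greater than the threshold."""
    return [h for h in hits if alignment_coverage(h) > threshold]


# Made-up example hits, for illustration only.
hits = [
    DomainHit("orf_example_1", "reverse transcriptase", 1, 200, 200),
    DomainHit("orf_example_2", "RNase H", 10, 100, 120),
    DomainHit("orf_example_3", "protease", 3, 100, 100),
]

conserved = highly_conserved(hits)
print([h.orf_id for h in conserved])
```

In this toy input, the first and third hits span the whole or nearly the whole model and pass the filter, while the second covers only about three quarters of its model and is dropped.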
 
 
“Proteins derived from HERVs are more than just genomic fossils; many still retain structural features that hint at residual or repurposed functions. We know that some of them play important roles today, from supporting placental development to contributing to neurodegenerative and inflammatory diseases such as multiple sclerosis and Alzheimer’s. Even so, there is still much to discover about their full impact. Our new dataset provides a comprehensive map of these ancient viral elements, allowing researchers to explore their functions across the entire human genome,” explains Dr Anna Esteve-Codina, corresponding author of the study.
 
 
The study highlights distinct patterns of preservation across HERV subfamilies. Some, like HERVK (HML-2), stand out for retaining nearly complete viral proteins at multiple sites in the genome. These include Gag, which forms the protective shell of the virus; Pol, which carries key enzymes such as reverse transcriptase, RNase H, and protease; and Env, the envelope protein that enables viruses to enter cells, with transmembrane regions that anchor it to membranes. Other subfamilies show different preservation patterns: HERVH tends to retain Pol’s enzymatic domains, while HERVE surprisingly preserves protease and reverse transcriptase regions. Altogether, both young and ancient HERV families maintain functional fragments, offering a glimpse into how these viral remnants may still influence gene activity and cellular processes today.
 
 
The dataset generated in this study and the code used for the analyses are available as an open-access resource on Zenodo, enabling further exploration of HERV proteins, their roles in gene regulation and immune responses, and their potential contributions to human health and disease.
 
 
REFERENCE ARTICLE
Montserrat-Ayuso, Tomàs, et al. ‘A Comprehensive Annotation of Conserved Protein Domains in Human Endogenous Retroviruses’. NAR Genomics and Bioinformatics, vol. 8, no. 1, Jan. 2026, p. lqag013. https://doi.org/10.1093/nargab/lqag013.