BMC Genomics
Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction
Research
Roderic Guigo[1], Barbara Uszczynska[1], Dmitri Pervouchine[2], Robert Petryszak[3], Nuno Fonseca[3], Alvis Brazma[3], Jonathan M Mudge[4], Adam Frankish[4], Jennifer Harrow[4], Jose M Gonzalez[4], Graham RS Ritchie[5]
[1] Centre for Genomic Regulation, Barcelona, Catalonia, Spain;Centre for Genomic Regulation, Barcelona, Catalonia, Spain;Faculty of Bioengineering and Bioinformatics, 119992 Moscow GSP-2, Leninskie Gory, Moscow State University 1-73, Russia;European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD, Hinxton, Cambridge, UK;Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, Cambridge, UK;Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA, Hinxton, Cambridge, UK;European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD, Hinxton, Cambridge, UK
Keywords: Variant Annotation; Nonsense Mediated Decay; Exome Sequencing Project; Transcript Annotation; Variant Effect Predictor
DOI: 10.1186/1471-2164-16-S8-S2
Source: Springer
【 Abstract 】
Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.

Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.

Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
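
To make the notion of discordant LoF annotation concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each variant has already been assigned one most-severe consequence label under each geneset (for example by a tool such as the Variant Effect Predictor). The variant IDs, the set of labels treated as LoF, and the function name `lof_discordance` are hypothetical choices for illustration, and the paper's own definition of discordance may differ.

```python
from typing import Dict, Set

# Assumption: consequence labels treated as loss-of-function for this sketch.
LOF_TERMS: Set[str] = {
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}


def lof_discordance(gencode: Dict[str, str], refseq: Dict[str, str]) -> float:
    """Return the fraction of LoF variants whose LoF status differs.

    The denominator is every variant called LoF under at least one geneset;
    the numerator is every variant called LoF under exactly one geneset.
    """
    lof_gencode = {v for v, c in gencode.items() if c in LOF_TERMS}
    lof_refseq = {v for v, c in refseq.items() if c in LOF_TERMS}
    either = lof_gencode | lof_refseq
    if not either:
        return 0.0
    discordant = lof_gencode ^ lof_refseq  # symmetric difference
    return len(discordant) / len(either)


# Hypothetical toy input: variant ID -> most severe consequence per geneset.
gencode_calls = {
    "var1": "stop_gained",
    "var2": "frameshift_variant",
    "var3": "intron_variant",
    "var4": "splice_donor_variant",
}
refseq_calls = {
    "var1": "stop_gained",
    "var2": "intron_variant",
    "var3": "intron_variant",
    "var4": "splice_donor_variant",
}

print(f"{lof_discordance(gencode_calls, refseq_calls):.2f}")  # prints 0.33
```

In this toy input, one of the three variants called LoF under either geneset changes status, which is why a figure such as "~30% discordant" depends on what is counted in the denominator.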
【 License 】
CC BY
© Frankish et al.; licensee BioMed Central Ltd. 2015