Journal of Biomedical Semantics
TopFed: TCGA tailored federated query processing and linking to LOD
Helena F Deus [2], Stefan Decker [1], Jonas S Almeida [4], Aftab Iqbal [1], Axel-Cyrille Ngonga Ngomo [3], Shanmukha S Padmanabhuni [1], Muhammad Saleem [3]
[1] Insight Centre for Data Analytics, National University of Ireland (NUIG), Galway, Ireland; [2] Foundation Medicine, Inc, Cambridge, MA 02141, USA; [3] Universität Leipzig, IFI/AKSW, PO 100920, D-04009 Leipzig, Germany; [4] Division Informatics, Department of Pathology, University of Alabama, Birmingham, USA
Keywords: RDF; TCGA; SPARQL; Federated queries
DOI: 10.1186/2041-1480-5-47
Received 2014-03-09, accepted 2014-11-03, published 2014
【 Abstract 】
Background
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such a large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis.
Methods
We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. With the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these endpoints through a TCGA-tailored federated SPARQL query processing engine named TopFed.
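To make the idea of federated source selection concrete, the following is a minimal sketch of one common approach: an index records which predicates each endpoint serves, and each triple pattern of a query is routed only to endpoints that can answer it. This is an illustration under stated assumptions, not TopFed's actual algorithm. The endpoint URLs, the `ex:` predicate names, and the functions `select_sources` and `plan_query` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object).
# Terms starting with "?" are variables.
TriplePattern = Tuple[str, str, str]

# Hypothetical capability index: endpoint URL -> predicates it can answer.
# The URLs and predicate names are placeholders, not real TCGA resources.
ENDPOINT_INDEX: Dict[str, Set[str]] = {
    "http://example.org/sparql/clinical": {"ex:patientBarcode", "ex:gender"},
    "http://example.org/sparql/methylation": {"ex:betaValue", "ex:chromosome"},
    "http://example.org/sparql/expression": {"ex:rpkm", "ex:chromosome"},
}


def select_sources(pattern: TriplePattern,
                   index: Dict[str, Set[str]]) -> List[str]:
    """Return the endpoints that may contribute results for one pattern."""
    _, predicate, _ = pattern
    if predicate.startswith("?"):
        # An unbound predicate could match anywhere, so keep every endpoint.
        return sorted(index)
    return sorted(url for url, preds in index.items() if predicate in preds)


def plan_query(patterns: List[TriplePattern],
               index: Dict[str, Set[str]]) -> Dict[TriplePattern, List[str]]:
    """Map each triple pattern of a query to its selected endpoints."""
    return {tp: select_sources(tp, index) for tp in patterns}


# Hypothetical two-pattern query: methylation values and their chromosome.
QUERY: List[TriplePattern] = [
    ("?result", "ex:betaValue", "?value"),
    ("?result", "ex:chromosome", "?chr"),
]

if __name__ == "__main__":
    for tp, sources in plan_query(QUERY, ENDPOINT_INDEX).items():
        print(tp, "->", sources)
```

In this sketch, contacting fewer endpoints per triple pattern reduces the number of remote requests, which is the general reason source selection affects query execution time in a federation engine.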
Results
We compare TopFed with a well-established federation engine, FedX, in terms of source selection and query execution time, using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall), with a query execution time equal to one third of that of FedX.
Conclusion
With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA, as the amount and diversity of data exceed the ability of local resources to handle their retrieval and parsing.
【 License 】
2014 Saleem et al.; licensee BioMed Central.