Dispatches from the Convergence Coalface: HPC, Big Data, and Large Scale Genomics

As we compute at ever-larger scale, and the problems we can model get richer and more complex, there is increasing discussion about the potential convergence of HPC, with its traditional focus on scientific simulation, and Big Data, with its history in analytics and predictive modelling. With one foot outside of each community, genomic bioinformatics tool builders have been quietly chipping away at the walls between the two. On the one hand, they build high-performance multithreaded C++ tools, with an emphasis on cache reuse and (increasingly) numerical methods, that would be familiar to those working the HPC tunnels; on the other, they build database-inspired distributed systems with complex cloud workflow orchestration that those mining the big data seams would immediately recognize.
 
In this talk, we discuss several current projects in large-scale genomics, from the perspective of someone new to the field but experienced in large-scale scientific computation, that illustrate the increasing need for convergence: large-scale whole-genome cancer genomics (the ICGC/TCGA PCAWG project), high-throughput nanopore sequencing, and CanDIG (http://www.distributedgenomics.ca), a distributed, federated platform for national-scale genomics analyses. We discuss open algorithm, tooling, and “data-plumbing” issues in these areas, where the field could learn more from HPC, and where the HPC and Big Data communities could learn from the work going on in genomics.
 

Bio

Jonathan Dursi has over twenty-five years of experience using large-scale computing to advance science. His personal research has focused on astrophysical fluids with the DOE ASCI ASAP program and on bioinformatics with the Ontario Institute for Cancer Research. He has also worked to support other researchers at Canada’s largest HPC centre, SciNet, and as Compute Canada’s first CTO. He currently works at Toronto’s Hospital for Sick Children on the CanDIG project, helping build a platform for national-scale analysis of locally-controlled private genomics data. He is very interested in tools that have the potential to make big scientific computing more productive and powerful, and occasionally blogs on these topics at http://www.dursi.ca.

Speaker

Jonathan Dursi

Date

Thursday, April 12, 2018

Time

1 pm - 2 pm

Location

IACS Seminar Room