CAREER: A Novel Bioinformatic Infrastructure for Metagenome Assembly and Functional Annotation



Principal Investigator:
Cuncong Zhong
Funding:
$722,398.00
Sponsor:
NATIONAL SCIENCE FOUNDATION
Sponsor Type:
Federal
Beginning Fiscal Year:
2021
Award Type:
Grant

Abstract

Understanding the behavior of microbes plays a crucial role in environmental monitoring, renewable energy, agriculture, and medicine. Metagenomics sequences the complete genomes of all microbes residing in an ecological niche as a whole, allowing for a comprehensive and unbiased investigation into the community's function and taxonomy. However, the bioinformatic analysis of the resulting sequence data remains challenging due to its high volume and complexity. The main goal of this Faculty Early Career Development (CAREER) project is to develop efficient and accurate computational methods for the analysis of metagenomics data and to develop the corresponding bioinformatics sections for the University of Kansas's STEM outreach programs. The project will provide a new bioinformatic infrastructure for metagenomic sequencing data analysis. It will also stimulate the interest and improve the retention of the younger generation and underrepresented minorities in STEM.



Reconstructing complete microbial genomes from fragmented metagenomic sequencing data (i.e., metagenome assembly) is a fundamental step towards understanding the function and taxonomy of the metagenome. Existing assembly algorithms are primarily based on two types of graph data structures, namely the string graph (which extends the overlap graph) and the de Bruijn graph. The string graph has higher accuracy but lower sensitivity and contiguity than the de Bruijn graph. Specific Aim 1 of this project will improve the sensitivity and contiguity of the string graph approach through an earlier consideration of the paired-end information and an adaptive overlap length threshold that accounts for the uneven coverage of metagenomics data. Specific Aim 2 of this project will leverage the read connectivity information from the assembly graph to improve the discovery, classification, and quantification of functional genes.
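
To make the idea of an adaptive overlap length threshold concrete, the following is a minimal toy sketch in Python. It is not the project's algorithm. It assumes, purely for illustration, that the minimum overlap required to connect two reads is scaled down when either read comes from a low-coverage region, so that genuine overlaps among rare organisms are not discarded. The function names, the scaling rule, the parameters (`base`, `floor`, `ref_cov`), and the example reads and coverage values are all hypothetical.

```python
from itertools import permutations


def longest_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no such overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def adaptive_min_overlap(cov_a, cov_b, base=5, floor=3, ref_cov=10.0):
    """Hypothetical rule: shrink the threshold for low-coverage read pairs.

    The threshold equals `base` when both reads have coverage >= `ref_cov`
    and shrinks proportionally below that, but never drops under `floor`.
    """
    scale = min(1.0, min(cov_a, cov_b) / ref_cov)
    return max(floor, round(base * scale))


def build_overlap_graph(reads, coverage, adaptive=True, base=5):
    """Return directed overlap edges as {(i, j): overlap_length}."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        if adaptive:
            min_len = adaptive_min_overlap(coverage[i], coverage[j], base=base)
        else:
            min_len = base  # one fixed threshold for every read pair
        k = longest_overlap(reads[i], reads[j], min_len)
        if k:
            edges[(i, j)] = k
    return edges


# Toy data: reads 0 and 1 come from an abundant organism (coverage 20),
# reads 2 and 3 from a rare one (coverage 4). All values are made up.
reads = ["GATTACAGGC", "ACAGGCTTAA", "CCGTTGAC", "TGACGGAA"]
coverage = [20, 20, 4, 4]

fixed_edges = build_overlap_graph(reads, coverage, adaptive=False)
adaptive_edges = build_overlap_graph(reads, coverage, adaptive=True)
print("fixed:   ", fixed_edges)
print("adaptive:", adaptive_edges)
```

With the fixed threshold of 5, only the 6-base overlap between the two high-coverage reads is kept. The adaptive rule lowers the threshold to 3 for the low-coverage pair and additionally recovers their 4-base overlap. A real assembler would also need to guard against the spurious short overlaps that a lower threshold admits, for example by using the paired-end information mentioned above.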