eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Probabilistic Graphical Inference of Pedigrees

Creative Commons 'BY-SA' version 4.0 license
Abstract

Inference of pedigrees from genetic data is a fundamental problem in population genetics with many applications, such as studying the inheritance of traits in natural populations or characterizing the transmission of genetic disorders. Current methods for reconstructing pedigrees fall into two categories. The first includes approaches that infer pedigrees composed of, at most, two generations. The best performing of these approaches, termed parentage or sibship inference procedures, make direct use of the pedigree likelihood given observed genetic data. Methods in the second category endeavor to infer multigenerational pedigrees; however, to date, these have been implemented only under restrictive assumptions (such as complete sampling) or with approximations to the pedigree likelihood (such as composite likelihoods or ad hoc approaches).

In this dissertation, I develop a novel representation of pedigrees as a type of factor graph that allows for the inference of multigenerational pedigrees within a proper, probabilistic framework. The factor-graph representation allows the rapid calculation of the pedigree likelihood under a variety of rearrangements, which provides an efficient mechanism for Metropolis-Hastings simulation of a Markov chain through the space of possible pedigrees. My software implementing this, pedFac, produces a sample of pedigrees from their posterior distribution, which allows for fully Bayesian, multigenerational pedigree inference.
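As a rough illustration of Metropolis-Hastings sampling over a discrete space of pedigree configurations, the sketch below samples which of three candidate fathers belongs to a single mother-offspring pair. It is a toy under stated assumptions, not pedFac's implementation: the genotypes are made up, the function names (`mendel`, `log_likelihood`, `metropolis_hastings`) and the error rate `eps` are hypothetical, the prior over candidates is uniform, and the likelihood is a plain per-locus Mendelian calculation rather than a factor-graph computation.

```python
import math
import random
from collections import Counter


def mendel(gc, gm, gf):
    """P(child genotype | parents); genotypes count copies of one allele (0, 1, 2)."""
    pm, pf = gm / 2.0, gf / 2.0  # chance each parent transmits that allele
    return [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf][gc]


def log_likelihood(father, child, mother, eps=0.01):
    """Log-likelihood of one candidate father across independent loci.

    eps is a hypothetical genotyping-error rate that mixes in a uniform
    distribution, so Mendelian incompatibilities are unlikely, not impossible.
    """
    return sum(
        math.log((1 - eps) * mendel(gc, gm, gf) + eps / 3.0)
        for gc, gm, gf in zip(child, mother, father)
    )


def metropolis_hastings(log_post, n_states, start, n_steps, rng):
    """Sample state indices with a symmetric 'jump to any other state' proposal."""
    state, current = start, log_post(start)
    samples = []
    for _ in range(n_steps):
        candidate = rng.choice([s for s in range(n_states) if s != state])
        proposed = log_post(candidate)
        # Symmetric proposal, so the acceptance probability is min(1, ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            state, current = candidate, proposed
        samples.append(state)
    return samples


# Made-up genotypes at four loci (illustrative only).
child = [1, 0, 2, 1]
mother = [0, 0, 2, 1]
fathers = [[2, 0, 2, 1], [1, 0, 1, 1], [0, 1, 0, 2]]

# Uniform prior over candidates, so the posterior is proportional to the likelihood.
log_liks = [log_likelihood(f, child, mother) for f in fathers]
samples = metropolis_hastings(
    lambda i: log_liks[i], len(fathers), 2, 20000, random.Random(1)
)

counts = Counter(samples)
sampled = [counts[i] / len(samples) for i in range(len(fathers))]
norm = sum(math.exp(ll) for ll in log_liks)
exact = [math.exp(ll) / norm for ll in log_liks]
print("sampled:", sampled)
print("exact:  ", exact)
```

With only three states the exact posterior is available by normalization, which makes it easy to check that the sampled frequencies converge to it. In a real pedigree space that normalization is intractable, which is the reason for sampling in the first place.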

I show that pedFac performs as well as other state-of-the-art software for inferring two-generation pedigrees, while providing a far superior estimate of uncertainty. pedFac is also successful in inferring multigenerational pedigrees when sampling is incomplete. This means that pedFac can reconstruct multiple true links in pedigrees through unobserved (unsampled) individuals, a task not performed well by any other software available today.

pedFac relies on the sum-product algorithm to calculate the full, joint likelihood of a pedigree factor graph. The sum-product algorithm delivers the exact joint likelihood only when the pedigree has no loops. For pedigrees that do have loops, I develop a “conditioning” approach that permits the likelihood calculation of cyclic pedigrees by conditioning on sampled genotype values over a set of loop breakers. I show that this conditioning approach allows pedFac to successfully sample the pedigree space, whether it be cyclic or acyclic.
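To make the acyclic case concrete, here is a minimal sketch of sum-product message passing on a tiny loop-free pedigree: a sampled grandfather, an unsampled grandmother, their unsampled offspring, that offspring's unsampled mate, and a sampled grandchild. Everything here is an assumption made for illustration: one biallelic locus, error-free genotypes, Hardy-Weinberg founder priors with allele frequency `p`, and hypothetical function names. It is not pedFac's code, and it does not cover the conditioning approach for cyclic pedigrees.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # copies of the alternate allele


def hardy_weinberg(p):
    """Founder genotype prior for alternate-allele frequency p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def mendel(gc, ga, gb):
    """P(child genotype gc | parental genotypes ga, gb)."""
    pa, pb = ga / 2.0, gb / 2.0
    return [(1 - pa) * (1 - pb), pa * (1 - pb) + (1 - pa) * pb, pa * pb][gc]


def likelihood_sum_product(g_grandfather, g_child, p):
    """Joint probability of the two observed genotypes via message passing."""
    prior = hardy_weinberg(p)
    # Message into the unsampled parent: sum out the unsampled grandmother.
    msg_parent = [
        sum(prior[ggm] * mendel(gp, g_grandfather, ggm) for ggm in GENOTYPES)
        for gp in GENOTYPES
    ]
    # Message into the child: sum out the unsampled mate, then the parent.
    total = sum(
        msg_parent[gp]
        * sum(prior[gmate] * mendel(g_child, gp, gmate) for gmate in GENOTYPES)
        for gp in GENOTYPES
    )
    return prior[g_grandfather] * total


def likelihood_brute_force(g_grandfather, g_child, p):
    """Same quantity by enumerating every joint state of the unsampled members."""
    prior = hardy_weinberg(p)
    return sum(
        prior[g_grandfather] * prior[ggm] * prior[gmate]
        * mendel(gp, g_grandfather, ggm) * mendel(g_child, gp, gmate)
        for ggm, gp, gmate in product(GENOTYPES, repeat=3)
    )


sp = likelihood_sum_product(2, 0, 0.3)
bf = likelihood_brute_force(2, 0, 0.3)
print(sp, bf)
```

The two functions agree because the graph is a tree, so each unsampled individual can be summed out locally. Brute-force enumeration grows exponentially with the number of unsampled individuals, whereas the message-passing cost grows with the number of factors.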

Finally, I present work relevant to identifying and scoring genetic markers that could be used as input to pedFac. A decade ago, most SNPs used in molecular ecology were typed singly on specialized chips using a variant of quantitative PCR; today, however, short-read technologies are commonly employed to generate genetic data. To genotype a small number of SNPs or short focal regions in many individuals, molecular ecologists now routinely sequence amplicons (short, PCR-amplified regions) on next-generation sequencing machines. These data can be analyzed not just in terms of the SNPs present, but as very short haplotypic variants termed microhaplotypes. I present the R package ‘microhaplot’ to assist in the extraction and curation of these microhaplotypes from short-read amplicon sequence data, and I briefly consider strategies for reducing microhaplotypes to a biallelic representation that can be used as input to pedFac.
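One simple way to picture a biallelic reduction is to treat the most common microhaplotype at a locus as one allele and lump all other haplotypes into a second allele. The sketch below shows that idea only; it is an assumed strategy chosen for illustration, not necessarily one of those considered in the dissertation, and it does not use the microhaplot package or its interface. The function name `to_biallelic`, the sample names, and the haplotype calls are all hypothetical.

```python
from collections import Counter


def to_biallelic(genotypes):
    """Recode diploid microhaplotype calls at one locus as 0/1/2 dosages.

    genotypes maps a sample name to a pair of haplotype strings.
    Returns the major haplotype and, per sample, the number of copies
    that are NOT the major haplotype.
    """
    counts = Counter(hap for pair in genotypes.values() for hap in pair)
    major = counts.most_common(1)[0][0]
    dosage = {
        sample: sum(hap != major for hap in pair)
        for sample, pair in genotypes.items()
    }
    return major, dosage


# Made-up calls at a single three-SNP amplicon.
calls = {
    "ind1": ("ACG", "ACG"),
    "ind2": ("ACG", "ATG"),
    "ind3": ("GTG", "ATG"),
    "ind4": ("ACG", "GTG"),
}
major, dosage = to_biallelic(calls)
print(major, dosage)
```

Lumping minor haplotypes discards information that the multiallelic calls carry, so a reduction of this kind trades statistical power for compatibility with software that expects biallelic markers.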
