PhD Topics

Applications for marked topics are especially encouraged during this year's call:

Probabilistic models for the population genetics of molecular evolution:
• Episodic selection histories and co-evolution
Population trees and selection

Inferring selection using Drosophila whole genome sequence data:
A whole genome survey of selected sites in African Drosophila melanogaster populations
Incorporation of demography in theoretical models

New algorithm and models to analyze population genetic massive parallel sequence data:
Phylogenetic sequence assembly of 454 and Illumina reads
A fast and efficient implementation of reference assembly

Experimental evolution in Drosophila

• Evolution of gene expression in Drosophila

• Evolution of transposable elements in Drosophila

• Natural variation in transposable element defense systems

Tracing the genomic signature of hybridization between D. mauritiana and D. simulans

Functionally important variation in lifespan and other life history traits in natural and experimental evolution populations

Mathematical models of spatially varying selection in subdivided populations

Statistical methods for detecting selective sweeps using genome-wide data

Population genetic estimators from NGS data: assessing the power for methods for genome scans of selection

The nature of differentiation between two closely related species of oak

The footprint of adaptive gene introgression after secondary contact

 

Probabilistic models for the population genetics of molecular evolution

Principal advisor: Carolin Kosiol

• Episodic selection histories and co-evolution

In genome-wide analyses for natural selection, we often identify the same genes to be under recent positive selection detected by population genetic scans (Bustamante et al., 2005; Sabeti et al., 2007) as well as under lineage- and clade-specific selection detected by phylogenetic scans (Kosiol et al., 2008). Furthermore, it has been observed that several pathways were strongly enriched for positively selected genes (PSGs), suggesting possible co-evolution of interacting genes. Thus selection histories are often considerably more complex than what would be expected for an idealized model of a selective sweep or simple lineage-specific selection. In this project, we propose to study the dynamics of selection, including episodic selection, with a new Bayesian approach. The Bayesian method allows a posterior distribution over all selection histories to be inferred for each gene. Unlike the standard likelihood ratio tests (LRTs) for lineage-specific positive selection (which are simple one-sided hypothesis tests and are not necessarily conservative about rejection of the null hypothesis), this framework would consider all candidate selection histories symmetrically and allow for soft (probabilistic), rather than absolute, choices of history at each gene. Further unraveling the co-evolutionary history of interacting PSGs promises to be a fertile area for future work. In addition to analyzing data on the interactions of separately predicted PSGs, it may be possible to improve power by jointly considering multiple genes and their interactions during PSG detection.
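The idea of a soft, probabilistic choice among candidate selection histories can be illustrated with a minimal sketch. It assumes that a log-likelihood has already been computed for each candidate history of a gene; the history names, the numerical values, and the function `posterior_over_histories` are hypothetical and only illustrate Bayes' rule, not the method to be developed in the project.

```python
import math

def posterior_over_histories(log_likelihoods, priors):
    """Return P(history | data) for each candidate history via Bayes' rule.

    log_likelihoods: dict mapping history name -> log P(data | history)
    priors:          dict mapping history name -> prior probability
    """
    # Work in log space and subtract the maximum for numerical stability.
    log_joint = {h: log_likelihoods[h] + math.log(priors[h])
                 for h in log_likelihoods}
    largest = max(log_joint.values())
    unnormalized = {h: math.exp(v - largest) for h, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {h: u / total for h, u in unnormalized.items()}

# Hypothetical candidate histories for one gene (values are made up).
example_loglik = {
    "no_selection": -1052.3,
    "lineage_A_only": -1049.8,
    "clade_AB": -1050.1,
    "episodic_A_then_B": -1049.9,
}
# A uniform prior treats all candidate histories symmetrically.
example_prior = {h: 1.0 / len(example_loglik) for h in example_loglik}
example_posterior = posterior_over_histories(example_loglik, example_prior)

for history, prob in sorted(example_posterior.items(), key=lambda kv: -kv[1]):
    print(f"{history}: {prob:.3f}")
```

Instead of accepting or rejecting a single history, each gene keeps a full set of posterior probabilities, which can then be carried forward into downstream analyses.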

Population trees and selection

By modeling genome evolution as a process by which a single genome sequence mutates along the branches of a species phylogeny, standard phylogenetic methods reduce entire populations to single points in genotypic space. In reality, each population consists of many individuals that are related by trees of genetic ancestry known as genealogies.
Some progress has been made towards the estimation of the population/species tree in a coalescent-theoretic setup. For the neutral case, it is already possible to estimate population trees (Liu and Pearl, 2007; Hey, 2009; Yang & Rannala, 2010). However, these recent papers use Bayesian methods and MCMC techniques, which make an accurate estimation of parameters possible only for a few individuals from a few populations. These methods will not scale to estimate parameter-rich models such as codon models.
On the other hand, frequency-based methods are suitable for large-scale genome-wide analysis; however, they do not consider the effects of selection on the genealogies. In this project, we will develop methods using information from the mutation frequency spectrum as well as the genealogies, and we expect both together to be more powerful than either used individually.

Inferring selection using Drosophila whole genome sequence data

Principal advisor: Claus Vogl

The pattern of variation (site frequency pattern) in neutrally evolving sequences can be used to infer demography and can serve as a reference to compare putatively selected sequences. Using whole genome data, we have studied putatively neutral sequence variation in African Drosophila melanogaster populations and inferred the ancestral states using closely related outgroup species. In short introns, polymorphism spectra and divergence patterns are close to neutral. On the other hand, four-fold degenerate sites clearly deviate from neutrality, and from the patterns observed in short introns, in both their site frequency spectra and their divergence patterns. Earlier, this deviation has been explained by codon bias. By comparison to the outgroups, however, we find stabilizing selection irrespective of the direction of mutation, which means that the strength and direction of selection vary among sites. We have also worked theoretically on directional selection in a bi-allelic model with low effective mutation rates (theta), which is appropriate for such sequence data. This work provides an extension of the neutral infinite sites theory to directional selection and unites Poisson-random-fields approaches with theory for selection-mutation-drift equilibrium.
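As an illustration of what a site frequency pattern is, the following minimal sketch tabulates an unfolded site frequency spectrum from derived-allele counts (ancestral states assumed to be known from outgroups) and compares it with the standard neutral expectation theta / i for a constant-size population under the infinite sites model. The function names and the example counts are hypothetical.

```python
from collections import Counter

def unfolded_sfs(derived_counts, n):
    """Tabulate the unfolded site frequency spectrum.

    derived_counts: derived-allele count at each site in a sample of n sequences
    Returns a list sfs where sfs[i] is the number of sites with derived count i
    (i = 1 .. n-1); index 0 is unused and sites with count 0 or n are ignored.
    """
    counts = Counter(derived_counts)
    return [0] + [counts.get(i, 0) for i in range(1, n)]

def neutral_expected_sfs(theta, n):
    """Expected number of sites with derived count i under neutrality: theta / i."""
    return [0.0] + [theta / i for i in range(1, n)]

# Made-up derived-allele counts for a sample of n = 5 sequences.
example_counts = [1, 1, 2, 3, 1, 4, 1, 2]
observed = unfolded_sfs(example_counts, 5)
expected = neutral_expected_sfs(theta=4.0, n=5)

for i in range(1, 5):
    print(f"count {i}: observed {observed[i]}, neutral expectation {expected[i]:.2f}")
```

An excess or deficit in particular frequency classes relative to the neutral reference (here, short introns would play that role) is the kind of deviation discussed above for four-fold degenerate sites.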

A whole genome survey of selected sites in African D. melanogaster populations

While we have shown deviations from neutrality and from an equilibrium of directional selection, mutation, and drift in fourfold degenerate sites in an African D. melanogaster population, we have not yet been able to model the observed pattern satisfactorily. A hierarchical Bayesian model, where selection coefficients are drawn from a normal distribution, seems a good candidate model for the spectra and divergence patterns in fourfold degenerate sites. Other forces may also influence the intensity of selection, e.g., recombination rates, which are known to vary over the genome. We plan to develop a Bayesian model to infer these parameters explicitly. Furthermore, the analysis could be extended to other sites, e.g., intergenic regions, long introns, and non-protein-coding genes. For this project, the emphasis would be on analysis of whole genome population data that are either publicly available or produced by other groups within Popgen Vienna. The probabilistic or simulation models would be developed mainly by the PI; the candidate would be mainly responsible for the handling of the large data sets, i.e., extracting information from databases, writing scripts, performing simulations, etc.

Incorporation of demography in theoretical models

Analysis of cosmopolitan D. melanogaster populations should provide a contrast to the African patterns due to their different population demography, i.e., a bottleneck associated with the migration out of Africa. So far, an analytical model that incorporates selection, drift, migration and demography is not available. An approach based on ancestral selection graphs seems most promising for incorporating population demography. Alternatively, simulation approaches (Approximate Bayesian Computation) can be used. These models would be developed jointly by a candidate, who should be strong in probability theory, and the PI, and applied to population genome data.
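The simulation alternative mentioned above can be sketched as a basic rejection sampler for Approximate Bayesian Computation. Everything specific here is an assumption made for illustration: the parameter is a hypothetical "bottleneck severity" between 0 and 1, and the toy simulator simply returns a noisy copy of that parameter as its summary statistic. In a real analysis the simulator would be a population genetic simulation and the summaries would be statistics such as the site frequency spectrum.

```python
import random

def abc_rejection(observed, simulate, sample_prior, distance, tolerance, n_draws, rng):
    """Basic ABC rejection sampling.

    Draw parameters from the prior, simulate a summary statistic, and keep the
    parameters whose simulated summary lies within `tolerance` of the observed one.
    """
    accepted = []
    for _ in range(n_draws):
        params = sample_prior(rng)
        simulated = simulate(params, rng)
        if distance(simulated, observed) <= tolerance:
            accepted.append(params)
    return accepted

# Toy stand-ins (hypothetical, for illustration only).
def sample_prior(rng):
    # Uniform prior on the bottleneck severity.
    return rng.uniform(0.0, 1.0)

def simulate(severity, rng):
    # Placeholder simulator: summary statistic = parameter + Gaussian noise.
    return severity + rng.gauss(0.0, 0.05)

def distance(a, b):
    return abs(a - b)

rng = random.Random(1)
accepted = abc_rejection(0.3, simulate, sample_prior, distance,
                         tolerance=0.02, n_draws=20000, rng=rng)
posterior_mean = sum(accepted) / len(accepted)
print(f"accepted {len(accepted)} draws, posterior mean about {posterior_mean:.3f}")
```

The accepted draws approximate the posterior distribution of the parameter; smaller tolerances give a better approximation at the cost of more simulations.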

New algorithm and models to analyze population genetic massive parallel sequence data

Principal advisor: Arndt v. Haeseler

While the determination of SNPs from single individuals using massive parallel sequencing is a straightforward, though not trivial, task (Wheeler et al., 2008; Bentley et al., 2008), the problem of determining SNPs for population genetic studies from short reads of pooled samples of individuals is not yet satisfactorily addressed. Currently, SNP calling methods are tuned to minimize false positives, which inevitably leads to biased estimates of population genetic parameters (Johnson et al., 2008). Especially for population genetic analyses, it is necessary to assemble short reads, to generate more contiguous consensus sequences, and to ensure high confidence in these alignments. Only if the alignment is reliable is the determination of, e.g., genetic differences possible. Then, we will be able to determine the amount of true genetic variation from reads of pooled samples of a large number of individuals.

• Phylogenetic sequence assembly of 454 and Illumina reads

The project goal is the development of phylogeny-based approaches to reliably map short Illumina reads of not-yet completely sequenced organisms to their closest relatives. If a fully sequenced reference genome is lacking, we propose to use not only one closely related species, but rather as many as are available. Hitherto, reads have been aligned to one reference genome; the novel strategy we want to employ is the comparative alignment of reads to many (completely) sequenced Drosophila genomes. This phylogenetic approach should increase the sensitivity of the alignment. Simulation studies that are based on realistic models of sequence evolution along an evolutionary tree will help us to understand the pitfalls of currently used approaches and to evaluate the advantages of phylogenetic sequence assembly. By employing parallel computing strategies, we want to modify the Smith-Waterman local pairwise alignment algorithm to map reads to multiple Drosophila genomes. Thereby we hope firstly to overcome the arbitrary thresholds introduced in many approaches to reduce computing time and secondly to be able to include insertions and deletions in a more sensible way. These alignment problems not only introduce uncertainties, but also bias the results. Thus, with current alignment tools, population genetic estimates based on SNPs might be biased across the genome, and questions about copy number polymorphisms cannot be addressed at all.

• A fast and efficient implementation of reference assembly

One open issue in reference assembly is the location of the best scoring position, where this location depends on the scoring scheme and the evolutionary distance as well as the sequence (repeat) content of the reference sequence. Many ad hoc solutions exist (Bateman and Quackenbush, 2009). However, they all rely on the determination of an arbitrary threshold. In PhD project 1 we will explore the potential of the Smith-Waterman local pairwise alignment algorithm in a comparative approach. The second project will also employ this strategy but will focus on computational efficiency. To avoid long running times, we are planning to port the reference assembly via Smith-Waterman to CUDA, the Compute Unified Device Architecture. Many PCs contain an Nvidia graphics card, which allows thousands of highly similar jobs to run at the same time. Here the score of a single read is computed for many positions in the reference at the same time. Thus the reference assembly implementation for graphics cards will be very fast and, since it does not require expensive infrastructure, could be widely used. We will use this fast implementation to explore the parameter space of scoring functions and to explore the limits of reference genome assembly. Moreover, we will be able to compare the performance of more flexible scoring methods to other methods. Finally, we will also investigate approaches to analyze and efficiently process paired-end reads.
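For orientation, the Smith-Waterman recursion that such an implementation would parallelize can be sketched on the CPU as follows. This is a plain reference version with a linear gap penalty, not the planned CUDA code; the scoring values (`match`, `mismatch`, `gap`) and the function name are illustrative assumptions.

```python
def smith_waterman_score(read, reference, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of `read` against `reference`.

    Uses the Smith-Waterman recursion with a linear gap penalty and keeps only
    two rows of the dynamic programming matrix. Returns (best_score, end), where
    `end` is the 1-based reference position at which the best local alignment ends.
    """
    prev = [0] * (len(reference) + 1)
    best_score, best_end = 0, 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            sub = match if read[i - 1] == reference[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # match or mismatch
                prev[j] + gap,       # gap in the reference
                curr[j - 1] + gap,   # gap in the read
            )
            if curr[j] > best_score:
                best_score, best_end = curr[j], j
        prev = curr
    return best_score, best_end

print(smith_waterman_score("ACGT", "TTACGTTT"))  # (8, 6)
```

Because no seed or score threshold is used, every reference position is scored, which is exactly the cost that a massively parallel implementation would aim to absorb. Changing `match`, `mismatch`, and `gap` is the simplest way to explore how the best-scoring position depends on the scoring scheme.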

• Experimental evolution in Drosophila

Principal advisor: Christian Schlötterer

Population genetics has a long-standing history of using experimental evolution to study adaptation processes. Previous work mainly focused on the evolution of phenotypic and life history traits. The new sequencing technologies now provide the unique opportunity to link phenotypic evolution with the underlying sequence changes.
The project builds on recently performed experimental evolution studies in Drosophila that examined the selection response to various temperature regimes. The analysis of genome-wide polymorphism data will not only permit the identification of selected alleles, but also of their trajectories during the adaptation process. Finally, the project will combine phenotypic and genomic changes to understand the genotype-phenotype map in evolving populations.

• Evolution of gene expression in Drosophila

Principal advisor: Christian Schlötterer

While variation in gene expression is a major source of phenotypic diversity, the processes driving changes in gene expression are still poorly understood. With the new sequencing technologies it will be possible to address many important questions about the evolution of gene expression. In this project the following aspects can be studied:

1) Evolution of sex-biased gene expression: are the differences in gene expression between males and females conserved among species?
2) Evolution of cis- and trans-effects: how does natural variation within and between species affect the regulation of gene expression?
3) Evolution of alternative splicing: to what extent do new splicing variants contribute to adaptation to different habitats?

• Evolution of transposable elements in Drosophila

Principal advisor: Christian Schlötterer

Transposable elements (TEs) are mobile genetic elements that parasitize genomes by semi-autonomously increasing their own copy number within the host genome. Interestingly, insertions of transposable elements can provide either fitness advantages or disadvantages to their host. We have recently developed a new software tool (PoPoolationTE), which estimates the population frequencies of transposable element insertions.
This project will take advantage of full genome sequences of many natural D. melanogaster and D. simulans populations to understand how transposable elements have contributed to the adaptation of natural populations to their environment.

• Natural variation in transposable element defense systems

Principal advisor: Christian Schlötterer

Organisms must cope with a constant onslaught of challenges from parasites, both from within and outside the host genome. Genome scans for selection show that immune systems that defend against viruses and bacteria account for a large fraction of the adaptive protein evolution occurring between species. Less well understood are the systems that defend against intra-genomic parasites, i.e., transposable elements. Transposable elements, which spread semi-autonomously through the genome, represent an especially successful example of this strategy: up to 80% of some genomes consists of TE-derived DNA. In Drosophila, TEs appear to invade regularly, and can cause female sterility and developmental abnormalities if unrepressed. In response, Drosophila has evolved a sophisticated RNA-mediated defense system to counter this attack. In this project, the student will work to understand how this system has evolved between populations and species.

• Tracing the genomic signature of hybridization between D. mauritiana and D. simulans

Principal advisor: Christian Schlötterer

The processes leading to the origin of new species are still poorly understood. It has been proposed that during the speciation process an increasing fraction of the genome resists gene flow between close relatives. With the arrival of the new sequencing technologies it has become feasible to address this fundamental question by measuring gene flow between closely related species on a genomic scale. Analysis of mtDNA and microsatellites suggests that D. mauritiana and D. simulans have been hybridizing since the speciation event. Using genome-wide polymorphism data from multiple populations in both species, this project will determine to what extent different genomic regions vary in their admixture signal. These population genetic analyses will be complemented with experimental crosses between both species.

Functionally important variation in lifespan and other life history traits in natural and experimental evolution populations

Principal advisor: Thomas Flatt

While molecular geneticists typically focus on major effects of induced mutations or transgenes, evolutionary geneticists work on much more subtle phenotypic differences caused by standing natural genetic variation, the substrate on which evolutionary change by natural selection is based. Although it is becoming increasingly clear that molecular and evolutionary geneticists have been studying qualitatively different forms of genetic variation at the same loci, it is still unclear whether this also holds for genes affecting life span. For example, not all candidate loci with major effects on longevity may exhibit segregating allelic variation in natural populations. Thus, while the major lifespan effects identified by molecular gerontology may be of biomedical interest, they may be of only limited relevance for our understanding of the evolution of aging in natural populations. On the other hand, the rapid progress made by molecular biologists in identifying candidate mechanisms affecting aging enables evolutionary biologists to determine whether there is standing genetic variation for longevity genes in natural populations and whether they are under selection. In this project offered by the Flatt group we are interested in functionally characterizing natural allelic variation in genes known to affect Drosophila life span.

Mathematical models of spatially varying selection in subdivided populations

Principal advisor: Reinhard Bürger

Although the relation between hard and soft selection is reasonably well understood for some simple models and limiting cases, this is not the case for general migration patterns and strong selection (Christiansen 1999; Karlin 1982; Nagylaki 1992; Nagylaki & Lou 2001). In particular, examples have been given for a single diallelic locus where hard selection can maintain polymorphism, whereas soft selection cannot. The relation between hard and soft selection shall be studied systematically and in detail. As a first step, existing results shall be extended to multiple alleles and, if possible, to multiple loci. Gene frequency patterns under juvenile migration may differ considerably from those under adult migration if selection is strong, and this shall be studied. If population size in (some) demes becomes small, as may be the case under hard selection, random loss of advantageous alleles becomes likely, and this needs to be addressed. An important problem in this context is the exploration of the consequences of population subdivision for a tightly linked neutral locus. This project will build on theoretical results concerning fixation probabilities in metapopulations (Whitlock & Gomulkiewicz 2005) and it will require extensive individual-based simulations.
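The distinction between soft and hard selection can be made concrete with a deterministic sketch for one diallelic locus. It covers only the simplest special case, in which all demes contribute to a single random-mating pool every generation (complete mixing), not the general migration patterns the project targets. Under soft selection each deme contributes a fixed fraction of the pool; under hard selection its contribution is proportional to its mean fitness. The function names, fitness values, and deme weights are illustrative assumptions.

```python
def next_frequency(p, fitnesses, deme_weights, soft=True):
    """Allele frequency after one generation of selection within demes and mixing.

    p:            frequency of allele A in the common mating pool
    fitnesses:    list of (w11, w12, w22) genotype fitnesses, one tuple per deme
    deme_weights: relative deme sizes c_i before selection (should sum to 1)
    soft:         True  -> each deme contributes the fixed fraction c_i
                  False -> contribution is proportional to c_i * mean fitness
    """
    q = 1.0 - p
    # Marginal fitness of allele A and mean fitness in each deme.
    marginal = [p * w11 + q * w12 for (w11, w12, w22) in fitnesses]
    mean = [p * p * w11 + 2 * p * q * w12 + q * q * w22
            for (w11, w12, w22) in fitnesses]
    if soft:
        return sum(c * p * wa / wb
                   for c, wa, wb in zip(deme_weights, marginal, mean))
    numerator = sum(c * p * wa for c, wa in zip(deme_weights, marginal))
    denominator = sum(c * wb for c, wb in zip(deme_weights, mean))
    return numerator / denominator

def iterate(p0, fitnesses, deme_weights, soft, generations):
    """Iterate the recursion and return the final allele frequency."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, fitnesses, deme_weights, soft)
    return p

# Two demes with opposite directional selection (made-up fitness values).
example_fitnesses = [(1.0, 0.9, 0.8), (0.7, 0.85, 1.0)]
example_weights = [0.5, 0.5]
for regime in (True, False):
    label = "soft" if regime else "hard"
    print(label, iterate(0.5, example_fitnesses, example_weights, regime, 2000))
```

Comparing the two regimes for the same fitness values shows how the mode of population regulation alone can change the long-term outcome, which is the starting point for the extensions to multiple alleles, multiple loci, and restricted migration described above.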

Statistical methods for detecting selective sweeps using genome-wide data

Principal advisor: Andreas Futschik

The detection of selective sweeps in genomic data has received considerable interest in recent years. According to what is sometimes called the "hitchhiking effect", a selective sweep affects the pattern of genetic variation in some neighborhood of the selected locus. Indeed, close to a selective sweep position, the number of segregating sites tends to be reduced (Kaplan et al., 1989), and the allele frequency spectrum as well as the linkage disequilibrium structure change (Fay, Wu, 2000; McVean, 2006). Based on these observations, several statistical methods have been developed to detect such a selective sweep signature from a genome scan. Some of them use summary statistics (Teshima et al., 2006), whereas others (Kim, Stephan, 2002; Nielsen et al., 2005) are based on the composite likelihood. Still others (Kim, Nielsen, 2004) try to exploit changes in the linkage disequilibrium structure.

So far, mostly shorter stretches of DNA have been scanned for selective sweeps. With the increasing amount of sequence data available, an obvious challenge is to adapt sweep detection methods to longer sequences and possibly the whole genome. With longer sequences, the probability of observing, by chance, signals at some positions that resemble a sweep increases. Thus the issue of multiple testing arises, and one might want to control or estimate the false discovery rate, i.e., the proportion of false positives among the detected sweep signals. In the (different) context of local sequence alignment this is well understood, and statistical significance is assigned to a score depending on the sequence length. In the same spirit, our goal is to develop statistical methodology in the context of sweep detection that permits global control of false discoveries. This will be done by studying the "measure of sweep evidence process" under neutrality as well as under several demographic scenarios on a sliding window along the genome. We plan to consider common summary statistics, like Tajima's D or Fay and Wu's H, as well as composite likelihood methods and our previously proposed Hidden Markov methods. With all methods, the cut-off points for the respective measures of evidence for a sweep need to be adjusted depending on the sequence length. The plan is to obtain approximate formulas for cut-off points as a function of sequence length such that the expected proportion of falsely detected sweep signals remains below a desired threshold. The methods could be applied to the Drosophila sequence data planned to be generated as part of projects within the Vienna Graduate School of Population Genetics. As a result of this work, one could obtain estimates (with well understood properties) of the proportion of a genome affected by sweeps. The methodology developed in this project will be empirically validated in close cooperation with Ines Hellmann.
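A "measure of sweep evidence process" on a sliding window can be illustrated with Tajima's D, one of the summary statistics named above. The sketch below computes D from allele counts at segregating sites using the standard formula and evaluates it in windows along a sequence. The function names, window parameters, and example data are hypothetical, and the sketch does not address the cut-off calibration that is the actual subject of the project.

```python
import math

def tajimas_d(allele_counts, n):
    """Tajima's D from allele counts at segregating sites in a sample of n sequences.

    allele_counts: count of one allele (derived or minor) at each site
    Returns None when D is undefined (no segregating sites or n < 4).
    """
    counts = [k for k in allele_counts if 0 < k < n]
    s = len(counts)
    if s == 0 or n < 4:
        return None
    # Mean number of pairwise differences.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

def sliding_window_d(positions, allele_counts, n, window, step, length):
    """Return (window_start, D) for windows of size `window` moved by `step`."""
    results = []
    start = 0
    while start < length:
        in_window = [k for pos, k in zip(positions, allele_counts)
                     if start <= pos < start + window]
        results.append((start, tajimas_d(in_window, n)))
        start += step
    return results

# Made-up data: rare variants on the left, intermediate frequencies on the right.
example_positions = [50, 120, 300, 410, 480, 1100, 1250, 1400, 1700, 1900]
example_counts = [1, 1, 1, 1, 1, 5, 5, 4, 6, 5]
for start, d in sliding_window_d(example_positions, example_counts, n=10,
                                 window=1000, step=1000, length=2000):
    print(start, d)
```

An excess of rare variants gives negative values and an excess of intermediate-frequency variants gives positive values; the longer the scanned sequence, the more windows are tested, which is why the cut-off must depend on sequence length.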

Population genetic estimators from NGS data: assessing the power for methods for genome scans of selection

Principal advisor: Ines Hellmann

Next generation sequencing data will allow us to conduct genome scans of selection. In the light of the new data structures, the question arises which statistics will have the most power to detect selection. In this project, the student will assemble a simulation pipeline to mimic the output of next generation sequencing data under a variety of population genetic and selective models. As a basis for this pipeline we will use a coalescent simulation program that has been developed in Joachim Hermisson's group by Greg Ewing. This program is a substantial improvement over previous coalescent simulation programs like Hudson's ms in that it can simulate complicated demographies as well as selection. For more complicated scenarios that include positive and negative selection acting concomitantly, we will use forward simulations with the program SFScode (Hernandez 2008). First, known algorithms for the inference of demography, for example dadi (Diffusion Approximation for Demographic Inference, by Ryan Gutenkunst) and Mimar (Becquet and Przeworski 2007), will be tested, and then, based on the estimate of demography, we will try to detect deviations that might have been caused by selection (Williamson et al. 2007; Jensen et al. 2008; Voight et al. 2006). This will allow us to estimate false positive and false negative rates for the various tests for more realistic populations and thus gain a better understanding of what we learn from the various tests of selection. The results of these tests will guide the further development of new analysis methods, such as those proposed here by Andreas Futschik. Furthermore, it will also help in the interpretation of the actual data that will be generated for D. mauritiana, Aquilegia, and Quercus.
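The last step of such a pipeline, turning simulated test statistics into false positive and false negative rates, can be sketched as follows. The sketch assumes that a statistic has already been computed on data simulated under the inferred neutral demography (`null_stats`) and under a model with selection (`alt_stats`), and that small values indicate selection; the function name and the numbers are hypothetical, and the simulators named above are not called here.

```python
def error_rates(null_stats, alt_stats, alpha):
    """Empirical error rates of a lower-tail test calibrated on neutral simulations.

    The rejection threshold is the k-th smallest neutral value with
    k = floor(alpha * number of neutral simulations), so that roughly a
    fraction alpha of neutral replicates is rejected.
    Returns (threshold, false_positive_rate, false_negative_rate).
    """
    ordered = sorted(null_stats)
    k = int(alpha * len(ordered))
    threshold = ordered[k - 1] if k > 0 else float("-inf")
    false_positives = sum(1 for x in null_stats if x <= threshold)
    detected = sum(1 for x in alt_stats if x <= threshold)
    fpr = false_positives / len(null_stats)
    fnr = 1.0 - detected / len(alt_stats)
    return threshold, fpr, fnr

# Made-up statistics from ten neutral and five selection replicates.
example_null = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
example_alt = [-3.0, -2.0, -1.0, 5.0, 6.0]
print(error_rates(example_null, example_alt, alpha=0.1))
```

Repeating this for each candidate statistic and each demographic scenario gives the comparison of power that the project aims for.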

The nature of differentiation between two closely related species of oak

Principal advisor: Magnus Nordborg

The two closely related European oak species, Quercus robur (pedunculate oak) and Q. petraea (sessile oak), may in many ways be analogous to the two columbine species mentioned above. Although sympatric and inter-fertile, they are perfectly good species. Moreover, it is known that, at least for some markers, polymorphism is shared between the species. As is the case for the columbine species, it is not clear whether this reflects recent shared ancestry or high levels of gene flow (Muir and Schlötterer, 2005; 2006; Lexer et al., 2006). Furthermore, although this pair of oak species has been the subject of years of study, very few markers have been used, and it is thus not at all clear what the genome-wide pattern of divergence looks like. The time is ripe to take these investigations into the genomic era and consider the genome-wide pattern of polymorphism using markers that have been chosen in an unbiased manner. As a first step in this direction, we propose to shotgun sequence several pooled samples from each species using paired-end libraries. The oak genome has been estimated to be roughly the same size as the Aquilegia genome. Thus, although no oak reference genome is available, we should be able to assemble most non-repetitive regions into contigs so that we can investigate the pattern of polymorphism within and between species. The goal is simply to investigate the amount of divergence between these two species, and whether the pattern of divergence varies spatially along the genome, as would be expected under strong selection (e.g., Nordborg & Innan, 2003).


The footprint of adaptive gene introgression after secondary contact

Principal advisor: Joachim Hermisson

Recent genome-wide analyses across plant and animal species show that hybridization is more common than previously thought and that substantial fractions of genomes are permeable to alleles from related species (Baack and Rieseberg 2007). Inter-specific gene flow touches on key issues of evolutionary research, including speciation and the spread of adaptations, but also has profound applied consequences in the context of genetic contamination and the long-term fate of hybrid zones. The theory of gene flow through hybrid zones can build on early work by Barton and co-authors (e.g. Barton and Bengtsson 1986). However, existing approaches lack the probabilistic analysis that is needed to assess the evolutionary impact of rare introgression events and to describe genome-wide patterns in new and emerging data sets.

Objectives:

  1. Derive the fixation probability and time from adaptive gene flow despite strong barriers
  2. Describe the resulting footprints of selection in linked neutral DNA variability
  3. Develop tests for the detection of adaptive introgression events from genome-wide polymorphism data.

ad 1. We will use an analytical approach based on multi-type branching processes to derive analytical approximations for the fixation processes. Computer simulations will be used to extend the analytical work and to validate approximations.
ad 2. Analytical and computational methods based on coalescent theory will be used to predict the genetic footprint. Previous work by Pennings and Hermisson (2006a,b) on the footprint of adaptation from recurrent migration will serve as a starting point. The two-locus model of Pennings and Hermisson (selected locus and neutral) will be extended to a three (or four) locus model, where deleterious alleles are linked to the beneficial migrant.
ad 3. We will assess various summary statistics measuring the site-frequency spectrum or LD for their power and utility to detect adaptive introgressions from polymorphism data.
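To give a flavor of the branching process approach in objective 1, the following sketch computes the establishment probability of a single beneficial allele in the simplest single-type case, assuming a Poisson-distributed number of offspring with mean 1 + s. This is a deliberate simplification: the project proposes multi-type branching processes, which track several genetic backgrounds at once. The function name and parameter values are illustrative.

```python
import math

def establishment_probability(s, tol=1e-13, max_iter=1_000_000):
    """Survival probability of a single-type branching process.

    Each individual leaves a Poisson(1 + s) number of offspring. The extinction
    probability q is the smallest solution of q = exp((1 + s) * (q - 1)); it is
    found by fixed-point iteration starting from q = 0.
    """
    if s <= 0:
        # Critical and subcritical processes go extinct with probability 1.
        return 0.0
    q = 0.0
    for _ in range(max_iter):
        q_next = math.exp((1.0 + s) * (q - 1.0))
        if abs(q_next - q) < tol:
            q = q_next
            break
        q = q_next
    return 1.0 - q

for s in (0.01, 0.05, 0.2):
    print(f"s = {s}: establishment probability {establishment_probability(s):.4f}, "
          f"2s = {2 * s:.4f}")
```

For small s the result is close to the classical approximation 2s. In a multi-type setting the single equation is replaced by one fixed-point equation per type, for example for a beneficial allele that is still linked to deleterious migrant alleles and for one that has recombined away from them.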

Partners: FWF - Der Wissenschaftsfonds, Vetmed Uni Vienna, Max F. Perutz Laboratories, Gregor Mendel Institute, Uniwien