
February 2014, Volume 30, Number 2, pp. 41–84

Editor: Rhiannon Macrae
Portfolio Manager: Milka Kostic
Journal Manager: Basil Nyaku
Journal Administrators: Ria Otten and Patrick Scheffmann

Advisory Editorial Board: K.V. Anderson, New York, USA; A. Clark, Ithaca, USA; G. Fink, Cambridge, USA; S. Gasser, Geneva, Switzerland; D. Goldstein, Durham, USA; L. Guarente, Cambridge, USA; Y. Hayashizaki, Yokohama, Japan; S. Henikoff, Seattle, USA; H.R. Horvitz, Cambridge, USA; L. Hurst, Bath, UK; E. Koonin, Bethesda, USA; E. Meyerowitz, Pasadena, USA; S. Moreno, Salamanca, Spain; A. Nieto, Alicante, Spain; C. Ponting, Oxford, UK; C. Scazzocchio, Orsay, France and London, UK; D. Tautz, Plön, Germany; O. Voinnet, Strasbourg, France; J. Wysocka, Stanford, California

Editorial Enquiries: Trends in Genetics, Cell Press

Opinions

41  Canalization: what the flux?
    Tom Bennett, Geneviève Hines, and Ottoline Leyser
49  Particle genetics: treating every cell as unique
    Gaël Yvert

Reviews

57  The domestication and evolutionary ecology of apples
    Amandine Cornille, Tatiana Giraud, Marinus J.M. Smulders, Isabel Roldán-Ruiz, and Pierre Gladieux
66  Neocentromeres: a place for everything and everything in its place
    Kristin C. Scott and Beth A. Sullivan
75  Mining cancer methylomes: prospects and challenges
    Clare Stirzaker, Phillippa C. Taberlay, Aaron L. Statham, and Susan J. Clark

600 Technology Square, 5th floor Cambridge MA 02139, USA Tel: +1 617 397 2818 Fax: +1 617 397 2810 E-mail: tig@cell.com

Cover: The apple is one of the most famous cultural symbols, from the Bible to iPhones. It is also one of the most important fruit crops in the world. The origin of the apple as we know it today, however, is not entirely clear, and the genetic makeup of the apples we eat is only just beginning to be understood. On pages 57–65 of this issue of Trends in Genetics, Amandine Cornille and colleagues discuss genomic data that have illuminated the domestication of the apple and the genetic history of this common fruit. Cover image from iStock/Sieboldianus.

Opinion

Canalization: what the flux?

Tom Bennett, Geneviève Hines, and Ottoline Leyser
Sainsbury Laboratory, University of Cambridge, Bateman Street, Cambridge, CB2 1LR, UK

Polarized transport of the hormone auxin plays crucial roles in many processes in plant development. A self-organizing pattern of auxin transport, canalization, is thought to be responsible for vascular patterning and shoot branching regulation in flowering plants. Mathematical modeling has demonstrated that membrane localization of PIN-FORMED (PIN)-family auxin efflux carriers in proportion to net auxin flux can plausibly explain canalization and possibly other auxin transport phenomena. Other plausible models have also been proposed, and there has recently been much interest in producing a unified model of all auxin transport phenomena. However, it is our opinion that lacunae in our understanding of auxin transport biology are now limiting progress in developing the next generation of models. Here we examine several key areas where significant experimental advances are necessary to address both biological and theoretical aspects of auxin transport, including the possibility of a unified transport model.

Auxin and self-organization in plant development

The hormone auxin (see Glossary) regulates almost every aspect of plant development, and the directional movement of auxin by a specialized transport system (polar auxin transport, PAT) is crucial for many of these processes (Box 1, Figure 1A) [1]. In simple cases, fine-scale redistribution of auxin allows for differential responses in different cells, driving patterning and specification events. However, in many cases patterns are generated not simply by auxin redistribution but emerge as a property of the system of feedback between the tissue, auxin, and auxin transport. It is widely supposed that these developmental systems, and the auxin transport patterns that drive them, are self-organizing; that is, little or no pre-pattern is needed [2].
Understanding these apparently self-organizing phenomena has long been an area of interest, as exemplified by research on phyllotaxis, the pattern of leaf initiation at the shoot meristem (Figure 1B), and the vascular patterns of leaves (Figure 1C). Because of their self-organizing properties, intuitive understanding of these systems is difficult and there has therefore been considerable interest in mathematically modeling these phenomena [3]. Vascular patterning and
Corresponding author: Leyser, O. (ol235@cam.ac.uk).
Keywords: auxin; auxin transport; self-organization; canalization; mathematical modeling.
0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.11.001

phyllotaxis have primarily been simulated using two fundamentally different (but non-exclusive) auxin transport heuristics, often respectively referred to as with-the-flux (WTF) and up-the-gradient (UTG) (Box 2). Although these models have been immensely useful in demonstrating the plausibility of self-organizing transport as a developmental mechanism, neither type of model is explicit about its biological basis, and both include parameters that are not based in current mechanistic understanding, such as assessment of auxin concentration in neighboring cells. Furthermore, it is probable that neither heuristic is inherently capable of capturing the full range of self-organizing auxin transport [3]. To understand better the role of self-organizing auxin transport in plant development, a new generation of models that are more deeply rooted in a mechanistic understanding of auxin biology is needed. However, our understanding of the biology of canalization and related phenomena has been somewhat outstripped by theoretical work on these problems, and now represents a limiting factor for modeling. The purpose of this article is thus not to propose a next-generation model but to examine the areas in which we need to improve our understanding of auxin transport and discuss how current models can be used to prioritize these experiments. We primarily discuss WTF models, particularly in the context of the canalization hypothesis, vascular patterning, and shoot branching. There has recently been considerable interest in attempting to unify models of auxin transport, and we also assess prospects for achieving this goal.

The canalization hypothesis of vascular patterning

Vascular patterning in plants is complex but orderly [4]: it is not hardwired but clearly proceeds according to firm principles such that the same general vascular topology is reproduced in almost every individual in a species (Figure 1C).
Local auxin application induces vascular differentiation in plant tissue, but in narrow strands running away from the application site, rather than in wide fields of cells [5]. These observations led to the singular and pioneering contributions of Tsvi Sachs, whose elegant experiments are still central to the field [4,6–8]. Sachs proposed that as auxin flows through tissues it upregulates and polarizes its own transport, which gradually becomes channeled or canalized into files of cells with very high auxin flux away from auxin sources (Figure 1D); these cell files can then differentiate to form vasculature (Figure 2) [7,8]. Sachs also demonstrated that new vasculature usually develops towards and unites with existing vasculature strands, leading to a connected vascular network (Figure 2) [4,7,8]. However, he also demonstrated that existing vasculature could be hyper-canalized by the addition of
Trends in Genetics, February 2014, Vol. 30, No. 2

Glossary
Angiosperms: flowering plants. By far the largest major grouping of plants and also the most recently evolved. Includes almost all crop species and model species such as Arabidopsis thaliana.

Apoplast: the space between plant cells, occupied by thick cellulosic walls (Figure 1A). There is a significant pH difference between the apoplast (pH 5.5) and cytoplasm (pH 7), and this directly affects auxin transport in accordance with the chemiosmotic hypothesis.

Arabidopsis thaliana: a principal plant model species, particularly for molecular genetic studies, due to its small size, small genome, and short life-cycle. Its small size, however, means that it is not ideally suited to canalization research.

Auxin, auxin transport: auxin (indole-3-acetic acid, IAA) is a low molecular weight, long-distance signal with many functions in plant development. Specific, polar auxin transport (PAT) through tissues seems to be an ancient characteristic of land plants.

Canalization: an apparently self-organizing pattern of auxin transport in which an initially broad domain of auxin-transporting cells is reduced to a narrow canal. This is thought to occur by auxin upregulating and polarizing its own transport.

Charophyte algae: a group of green algae that constitute the sister taxon of land plants.

Chemiosmotic hypothesis: see Box 1.

Gymnosperms: a diverse group of plants, including conifers, that produce seeds but not flowers. Together with angiosperms they make up the seed-plant (spermatophyte) clade.

Lycophytes: an ancient group of vascular plants; sister taxon to the clade containing ferns and seed plants.

Maximization: an apparently self-organizing pattern of auxin transport in which auxin is transported towards cells containing higher concentrations of auxin, leading to the formation of an auxin maximum.

Meristem: a specialized region of cell division in plants. Shoot meristems in angiosperms and gymnosperms combine cell division with the production of new organs, either leaves or reproductive structures. Shoot meristems in other plants are generally simpler in structure and contain far fewer cells. Root meristems are only present in vascular plants and do not directly produce new lateral organs.

Phyllotaxis: an apparently self-organizing developmental pattern describing the position of organs (e.g., leaves) along and around the stem. Different phyllotactic patterns occur in different species. Phyllotaxis in angiosperms results primarily from the positioning of new organ primordia on the flanks of the multicellular shoot meristem, and is established by maximization-like patterns of auxin transport in the meristem.

PIN auxin efflux carriers (PINs): a family of proteins that are generally accepted to be auxin efflux carriers. Canonical PIN proteins have plasma membrane localizations, often polarized, and are thought to be the principal determinants of the direction of auxin efflux, in line with the chemiosmotic hypothesis. Named after a founding member, PIN-FORMED1 (PIN1), in turn named for its mutant phenotype involving impaired organ initiation at the shoot meristem, a result of aberrant maximization.

PINOID-family kinases: a small family of serine/threonine kinases that phosphorylate the intracellular loop of canonical PIN proteins, thereby controlling their localization. Named after the founding member, PINOID, in turn named for the resemblance of its mutant phenotype to pin1.

Super-linear: a mathematical relationship in which one variable is influenced by another with a greater than linear effect; examples include quadratic (y = ax²), cubic (y = ax³), and exponential (y = a^x) functions.

Up-the-gradient (UTG): a modeling heuristic widely used to simulate maximization-like patterns of auxin transport (Box 2), in which PIN proteins are allocated to the plasma membrane in proportion to the concentration of auxin in cells neighboring that membrane.

Vascular patterning: an apparently self-organizing developmental phenomenon in which the position of future veins is established by canalization-like patterns of polar auxin transport through a tissue.

Vascular plants: the plant clade containing angiosperms, gymnosperms, ferns, and lycophytes. Defined by the presence of a differentiated vascular network. Non-vascular plants such as mosses lack specialized tissues for water transport and are limited in their size as a result.

Vasculature/veins: the vascular network in plants plays analogous roles to the vascular system in animals. It consists of two parallel systems, xylem (primarily water-conducting) and phloem (primarily sugar-conducting), that generally develop in association with each other.

With-the-flux (WTF): a modeling heuristic widely used to simulate canalization-like patterns of auxin transport (Box 2), in which PIN proteins are allocated to the plasma membrane in proportion to the net flux of auxin through that membrane.


Remarkably, recent investigations have supported his hypotheses at a molecular level, including the central canalization concept that, from an initially broad domain of cells with low auxin flux, a subset of cells become progressively more polarized and competent to transport auxin, and have shown that canalization is an important component of vascular patterning [9–11]. It should be emphasized that, although some auxin flows do undoubtedly canalize, not all auxin transport phenomena involve canalization. For instance, initiation of leaf primordia in angiosperm shoot meristems (Figure 1B) requires formation of an auxin maximum by a focused pattern of transport (maximization) (Figure 1D) [12]. Canalization has generally been explored through WTF models (Box 2), which can accurately simulate patterns of auxin transport in a number of developmental processes, including vascular formation in stems and leaves [13–15]. Canalization of auxin transport has also been recently modeled as an explanation for the inhibition of bud outgrowth by actively growing shoots, a scenario in which the development of vasculature is not directly considered, although it is an important additional outcome of the bud activation process [16]. Auxin transport canalization thus has the potential to explain multiple developmental phenomena in plants.

What is the flux?

All current models of canalization are based on a large corpus of research into polar auxin transport, and in particular the behavior of PIN-family auxin efflux carriers (Box 1). Examination of phyllotaxis and vein formation has shown very distinctive patterns of PIN protein localization consistent with canalization and maximization [9,12,17]. Almost all modern models of auxin transport therefore explicitly simulate membrane-localized PIN proteins that directly influence the amount and direction of auxin transport.
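As a minimal sketch of how membrane-localized PIN proteins can set both the amount and the direction of transport in such models, the snippet below computes the net auxin flux between two neighboring cells. The flux expression (a PIN-mediated efflux term plus a symmetric passive exchange term), the function name `net_flux`, and the parameters `T` and `D` are illustrative assumptions and are not taken from any specific published model.

```python
def net_flux(a_i, a_j, p_ij, p_ji, T=1.0, D=0.1):
    """Net auxin flux from cell i to cell j (positive means i -> j).

    Assumed form (hypothetical, for illustration only):
      PIN-mediated efflux: T * (p_ij * a_i - p_ji * a_j)
      passive exchange:    D * (a_i - a_j)
    a_i, a_j   : auxin concentrations in the two cells
    p_ij, p_ji : PIN levels on the membranes facing each other
    """
    return T * (p_ij * a_i - p_ji * a_j) + D * (a_i - a_j)


# Equal auxin in both cells, but PIN polarized towards j: auxin moves i -> j.
f_ij = net_flux(a_i=1.0, a_j=1.0, p_ij=0.8, p_ji=0.2)
# The same interface seen from cell j: equal magnitude, opposite sign.
f_ji = net_flux(a_i=1.0, a_j=1.0, p_ij=0.2, p_ji=0.8)
print(f_ij, f_ji)
```

Because the net flux is antisymmetric by construction, a large positive value in one direction necessarily means a negative value in the other, which is the property that couples neighboring cells in flux-based models.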
The main difference between the WTF and UTG models, based on the experimental observations of PIN protein localization in different scenarios, relates to the rules for allocating PIN proteins to membranes (Box 2). In WTF models PIN proteins are allocated to each membrane in a cell in proportion to flux, the net quantity of auxin that exits the cell across that membrane. Net flux efficiently couples cells together (because high net flux from cell i→j tends to prevent high flux from j→i), allowing cells to couple to larger-scale patterns of flux and speeding the emergence of global WTF patterns in the overall direction i→j (Box 2). Although mathematically this is a very neat solution, as a concept it is likely to be unrealistic because it requires a cell to calculate the net exchange of auxin across its membranes (including passive uptake). There is no known biological mechanism that achieves this, which is a common criticism of flux-based models [18]. Nevertheless, it is clear that cells in real systems do canalize auxin transport, and do so by allocating PIN proteins apparently in proportion to net auxin flux. It is thus the absolute crux of canalization research to establish how cells are able to localize PIN proteins in relation to larger-scale patterns in a self-organizing manner. The most plausible explanation for the apparent ability of cells to calculate net flux is that cells measure one or more other variables, the combined effect of which is

auxin, in which case developing vasculature could not find and unite with it (Figure 2) [7]. The work of Sachs pre-dated the advent of molecular genetics, and he therefore needed to infer upstream events based largely on terminal vascular differentiation patterns.
Box 1. Auxin transport
Auxin is transported in a polar manner through many tissues, and the canalization theory of Sachs [7,8] is framed in the context of PAT. Long-distance PAT has often been theorized as connecting auxin sources (regions of high auxin concentration or production) to sinks (regions of low auxin concentration or high turnover) [6]. In most canalizing systems, developing tissue (leaves, buds, etc.) acts as an auxin source and established vasculature acts as a sink (Figure 2). More recent work has shown that vasculature generally has high auxin concentrations [45], and therefore sink strength in this system is probably determined by auxin flux rapidly carrying auxin away from the source. Subsequent to Sachs' initial canalization work, a mechanistic basis for PAT was proposed in the chemiosmotic hypothesis. Central to this is the weakly acidic nature of auxin (pKa 4.75), which means that a significant fraction of auxin molecules in the apoplast (pH 5.5) are protonated and neutrally charged, and can passively enter cells through the lipid membrane; however, the largely deprotonated auxin in the cytoplasm (pH 7) cannot passively exit cells (Figure 1A). Specific efflux carriers are therefore required to mobilize auxin from cells, and it was proposed that polar localization of these proteins would explain the overall polarity of auxin transport [46,47]. The discovery of PIN-family auxin efflux carriers, transmembrane proteins which often have polar localization [44,48], confirmed the validity of the chemiosmotic hypothesis, and it is generally accepted that PIN proteins are the major determinants of the directionality of local auxin flux (Figure 1A) [28]. Members of the large ABC family of auxin transporters seem to act as non-polar auxin efflux carriers [49], and there are also auxin influx carriers of the AUX1/LAX family [50] (Figure 1A).
Auxin regulates its own transport, and in particular PIN protein abundance and localization, at multiple levels, both transcriptional and post-transcriptional [1,51]. For instance, intracellular auxin levels can regulate transcription of PIN genes through canonical auxin signaling [52], whereas apoplastic auxin can inhibit PIN endocytosis through the ABP1 receptor [53]. Work in this area has been greatly facilitated by live imaging of transporters fused to fluorescent proteins [54], and by proxy live imaging of intracellular auxin based on fluorescent reporters of the activity of various components of the transcriptional auxin signaling pathway [55,56].
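The chemiosmotic argument in Box 1 rests on simple acid–base arithmetic, which can be checked with the Henderson–Hasselbalch relation. The sketch below uses the pKa and pH values quoted in the box; the function name `protonated_fraction` is a hypothetical helper, and the calculation treats auxin as an ideal monoprotic weak acid.

```python
def protonated_fraction(pH, pKa=4.75):
    """Fraction of a monoprotic weak acid in its protonated (neutral) form,
    from the Henderson-Hasselbalch relation:
        [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa))
    """
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


apoplast = protonated_fraction(5.5)   # roughly 15% protonated
cytoplasm = protonated_fraction(7.0)  # well under 1% protonated
print(f"apoplast: {apoplast:.3f}, cytoplasm: {cytoplasm:.4f}")
```

Under these assumptions, the membrane-permeable form is roughly 25 to 30 times more abundant (as a fraction of total auxin) in the apoplast than in the cytoplasm, which is why passive entry is plausible but passive exit is not.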


proportional to net flux. It is not even necessary for these measurements to include any component of flux, but an attractive hypothesis is that cells can measure transporter-mediated efflux of auxin across a given membrane, and combine this with other information to regulate PIN protein allocation. For instance, it is possible that, as PIN proteins transport molecules of auxin, they (or a protein partner) produce a positive-feedback signal that reduces the removal of those PIN proteins from the membrane. This alone would be sufficient to maintain WTF patterns of PIN protein localization, but not to generate them in the first place, because this mechanism would not specifically orient PINs on the membrane opposite an auxin source. Achieving this aspect of PIN localization presumably requires at least some information from outside the cell. It is therefore likely that the canalization mechanism has at least two components, and these might include measurement of auxin concentrations on either (or both) sides of cell membranes, as for instance proposed in a recent model of auxin transport in which extracellular auxin concentration is the major determinant of PIN allocation [19]. A recently proposed framework for cell coupling, unrelated to concentration or flux-based models, but operating through bidirectional exchange of information across the apoplast, would also

theoretically be able to generate large-scale patterns of PIN localization [20]. The first step towards testing these ideas must be to probe the genetic basis of the canalization feedback mechanism, using the well-established toolkit in Arabidopsis, a goal distinct from understanding how PIN proteins are polarized in general [10] or providing descriptive analyses of the process of canalization [9,10,21]. A pure canalization system must be established in Arabidopsis, comparable to the original experiments of Sachs (even though its diminutive size makes this difficult), but with the addition of reporter lines such that the early stages of canalization can be visualized. By using this system to test the canalization response in mutants or under pharmacological treatments that impair known auxin-sensing, auxin transport, and PIN polarity-generating mechanisms, the role of those factors in the canalization process can be examined, helping to narrow down the mechanisms involved. Of course, as yet undiscovered factors might be central to the canalization mechanism, in which case screening for canalization-deficient mutants may be a sensible approach. The vascular patterning defects seen in pin1 pin6 double mutants [11] provide a possible reference point for screening for developmental phenotypes, but another approach to screening may be to look for mutants in which initially well-established but broad transport domains (visualized by reporter genes) fail to narrow down, the hallmark of canalization. Distinguishing potential canalization mutants from generalized auxin transport mutants will be important, and a sensitized genetic background might be preferable to help pick out otherwise relatively subtle phenotypes. This top-down approach to canalization should be accompanied by general research to allow improved parameterization of auxin transport models.
Trying to quantify, for example, the amounts of auxin and PIN protein in different parts of each cell, or the cycling rates for PIN proteins, will be fiendishly difficult, but even establishing a loose range would be an improvement over the current absence of data. Other important questions to address include whether the relationship between flux (or equivalent) and PIN allocation is linear or super-linear, whether there is a saturation point for flux-correlated PIN allocation, and whether the pool of PIN proteins is quasi-infinite and freely allocated or limited and proportionately (re)allocated according to flux, all aspects that current models have shown to be potentially important in pattern emergence [3,18,22]. Cell culture-based systems may prove useful in addressing these questions, as they have been in dissecting the action of auxin transporters and mechanisms of auxin homeostasis [23,24].

Increased understanding of PIN protein behavior will aid modeling of auxin transport

The experiments described above would be well complemented by bottom-up approaches to improve understanding of the behavior of PIN proteins. There is a significant body of work relating to the localization of PIN proteins, and it is known that in some cell types they can cycle rapidly between the plasma membrane and endosomal compartments [25]. It is this system that presumably

Figure 1. Auxin transport, plant development, and self-organization. (A) Schematic illustrating the chemiosmotic mechanism of polar auxin transport. Protonated auxin (indole-3-acetic acid, IAA) in the cell wall space, the apoplast (green), can move passively into cells through plasma membranes (black arrows). Influx may also be assisted by influx carriers (yellow circles). Deprotonated auxin (IAA−) in the cytoplasm can only move out of cells by the action of efflux carriers (red circles), and polar localization of these carriers (such as PIN proteins) generates overall polarity in auxin transport. (B) Schematic showing the phyllotactic pattern of organ initiation at Arabidopsis shoot meristems (top-down view). New organs are produced in a stable spiral pattern, with approximately 137° separating each new organ. I1 and I2 mark the position of the next two initiating organ primordia to form. P1 (youngest) to P11 (oldest) are existing organ primordia. Phyllotaxis is an apparently self-organizing developmental process that involves auxin maximization, and has primarily been modeled using an up-the-gradient (UTG) heuristic. (C) Schematic showing vascular patterning in an Arabidopsis leaf. The midvein (purple) forms first and joins the leaf to the main vascular bundles in the stem. First-order veins (dark blue) directly connect to the midvein and are associated with local auxin maxima at the edge of the leaf. The major auxin maxima associated with lobes/serrations are shown in red; others are omitted for clarity. Lower-order veins (light blue) connect first-order veins together to form a highly connective reticulate network that very efficiently serves the whole organ. The vascular network is specified by auxin transport through the leaf blade, towards the midvein and ultimately the stem. Vascular patterning is an apparently self-organizing developmental process that involves auxin canalization, and has primarily been modeled using a with-the-flux (WTF) heuristic.
(D) Schematic cross-section through an Arabidopsis shoot meristem showing organ initiation events at I1 and P1. Auxin in the meristem is transported (blue arrows) towards the site of I1 by PIN proteins (green bars), resulting in the formation of an auxin maximum (red shading). At P1 the pattern of auxin transport is partially reversed, with auxin being transported away from the maximum in a down-the-gradient pattern. Only a thin file of cells transports auxin, thus showing a canalized pattern of transport; these cells will become the midvein of the new organ. Organ initiation thus involves auxin canalization and maximization in tight spatiotemporal cooperation. Neither WTF nor UTG models of auxin transport have yet convincingly captured this complete range of behavior.
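The divergence of approximately 137° quoted for Figure 1B is close to the golden angle, and the resulting spiral arrangement can be reproduced with a few lines of arithmetic. This is a purely geometric sketch, assuming a constant divergence angle between consecutive organs; the function name `primordium_angles` is hypothetical and nothing here models auxin transport.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2
GOLDEN_ANGLE = 360.0 * (1 - 1 / GOLDEN_RATIO)  # about 137.5 degrees


def primordium_angles(n, divergence=GOLDEN_ANGLE):
    """Angular position (degrees, in [0, 360)) of n successive organs,
    assuming a constant divergence angle between consecutive organs."""
    return [(k * divergence) % 360.0 for k in range(n)]


# Eleven organs, matching the number of primordia (P1 to P11) in Figure 1B.
angles = primordium_angles(11)
print(round(GOLDEN_ANGLE, 1), [round(a, 1) for a in angles[:4]])
```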

allows for the dynamic changes in PIN protein localization necessary for both WTF and UTG patterns to emerge. However, the mechanisms that determine how PIN proteins are allocated to different membranes in different

situations are poorly understood, despite observations of the resultant patterns. PIN protein localization can be influenced by regulatory proteins such as PINOID-family protein kinases, which phosphorylate the long intracellular

Box 2. Auxin transport models
Most mathematical models so far published have generally taken a major experimental observation regarding auxin transport (e.g., PIN localization towards an auxin maximum) and abstracted it into a single mathematical concept (basic modeling terminology is summarized in Figure I). These observation-based models can be broadly allocated to two classes, flux-based or concentration-based, depending on the primary source of information they use to allocate PIN proteins to plasma membranes. In practice all flux-based models are explicitly of a WTF subset, and almost all concentration-based models are UTG. Within each broad class, the exact set of parameters and level of abstraction varies between models. These models are not mutually exclusive (mathematically they could be combined), but so far have been considered separately. A small number of mechanistically more explicit models have been proposed, for example one that proposes that auxin concentration in neighboring cells is measured via its effects on cell wall stress (also based on experimental observation) [57], although purely theoretical models, for example based on apoplastic transcriptional auxin gradients, have also been proposed [19].

WTF PIN allocation

In WTF models a positive feedback loop increases PIN insertion rate (or decreases PIN removal) in a given cell membrane when there is increased flux $\phi$, the net quantity of auxin exported through that membrane, per unit time and per unit area. Mathematically, a general formulation for the dynamics of PIN concentration ($p$) in the membrane section of a cell $i$ facing neighboring cell $j$ ($ij$) includes PIN insertion, both at a basal rate ($r_0$) and at an increased rate given by the auxin flux feedback [$f(\phi_{ij})$], and PIN removal ($m$):

$$
\frac{dp_{ij}}{dt} =
\begin{cases}
f(\phi_{ij}) + r_0 - m\,p_{ij}, & \phi_{ij} > 0 \\
r_0 - m\,p_{ij}, & \phi_{ij} \le 0
\end{cases}
$$

The exact feedback relationship between flux and PIN allocation has important ramifications for model function. Several different and purely theoretical relationships have been explored in models, including a simple linear relationship, $f(\phi) = \alpha\phi$ [40], quadratic, $f(\phi) = \alpha\phi^{2}$ [15], or a Hill function, $f(\phi) = \alpha\phi^{n}/(K^{n} + \phi^{n})$ [16].

UTG PIN allocation

In UTG models, PIN insertion rate is increased in membrane sections according to the auxin concentrations in cells neighboring those membranes. PIN proteins in cell i are preferentially inserted in the section of membrane that faces the neighboring cell j with the highest auxin concentration, at the expense of other membranes. This


increases the auxin concentration in $j$, thus driving positive feedback of PIN allocation. Although some models [39,41] explicitly include PIN cycling between an intracellular pool of non-allocated PINs (written here as $p_i^{*}$) and membrane-bound PINs ($p_{ij}$), the more streamlined model proposed in [38] assumes that all PINs ($p_i$) are instantly and competitively allocated between the different membrane sections proportionally to concentration. The sets of equations below describe the two situations; in both cases, $a_i$ is the auxin concentration in cell $i$ and $a_j$ is the auxin concentration in cell $j$, which is adjacent to cell $i$ along the membrane section $ij$. The set of cells $k$ adjacent to cell $i$ is the set of neighbors $N(i)$.

With PIN cycling [39,41]:

$$
\begin{cases}
\dfrac{dp_{ij}}{dt} = \alpha(a_j, p_i^{*})\,p_i^{*} - v\,p_{ij} \\[2ex]
\dfrac{dp_i^{*}}{dt} = \gamma(a_i, p_i^{*}) - m\,p_i^{*} - \displaystyle\sum_{k \in N(i)} \left[ \alpha(a_k, p_i^{*})\,p_i^{*} - v\,p_{ik} \right]
\end{cases}
$$

With instantaneous allocation [38]:

$$
\begin{cases}
p_{ij} = p_i\,\dfrac{\alpha(a_j)}{\sum_{k \in N(i)} \alpha(a_k)} \\[2ex]
\dfrac{dp_i}{dt} = \gamma(a_i, p_i) - m\,p_i
\end{cases}
$$
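The competitive, instantaneous UTG allocation described above can be sketched in a few lines. This is an illustration only: the function name `utg_allocate` and the dictionary-based representation of neighbors are hypothetical, and the weighting function (the identity by default) is an assumption, since the functional form of the concentration dependence is not specified here.

```python
def utg_allocate(p_i, neighbor_auxin, weight=lambda a: a):
    """Split a cell's total PIN (p_i) among its membrane sections in
    proportion to weight(auxin) of the neighbor each section faces.

    neighbor_auxin maps a neighbor label j to its auxin concentration a_j.
    `weight` is an assumed placeholder for the concentration dependence.
    """
    if not neighbor_auxin:
        return {}
    weights = {j: weight(a_j) for j, a_j in neighbor_auxin.items()}
    total = sum(weights.values())
    if total == 0:
        # No auxin signal from any neighbor: allocate PIN evenly.
        return {j: p_i / len(weights) for j in weights}
    return {j: p_i * w / total for j, w in weights.items()}


# A cell with two neighbors; the right-hand neighbor holds more auxin.
pins = utg_allocate(1.0, {"left": 1.0, "right": 3.0})
print(pins)
```

Swapping in a steeper weighting, for example `weight=lambda a: 2.0 ** a`, biases allocation more strongly towards the auxin-rich neighbor, which illustrates why the choice of this function matters for pattern emergence.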


Figure I. Basic modeling terminology. Schematic illustrating some of the major terms used in mathematical modeling of auxin transport, and their interrelation. The cell i faces its neighbor j at the membrane ij. The basal concentration of PIN protein (pij) in the membrane ij (indicated by a green bar) is determined by the relative rates of insertion (r0) from an intracellular pool (indicated by a green circle) and recycling from the membrane to the intracellular pool (m). PIN allocation to the membrane can be increased by positive feedback relating it either to the flux through membrane ij, φij (indicated by a blue arrow), or to the auxin concentration in cell j (aj).
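The WTF allocation rule of Box 2 can be explored numerically using the terms summarized in Figure I. The sketch below integrates membrane PIN concentration at a constant flux for the three feedback forms mentioned in the box (linear, quadratic, Hill). All parameter values, the function names, and the use of forward-Euler integration are illustrative assumptions rather than a reproduction of any published model.

```python
def feedback_linear(phi, alpha=1.0):
    return alpha * phi


def feedback_quadratic(phi, alpha=1.0):
    return alpha * phi ** 2


def feedback_hill(phi, alpha=1.0, K=1.0, n=2):
    return alpha * phi ** n / (K ** n + phi ** n)


def dp_dt(p, phi, feedback, r0=0.1, m=0.5):
    """Rate of change of membrane PIN at net flux phi.
    Flux feedback is applied only when net flux out of the cell is positive."""
    insertion = r0 + (feedback(phi) if phi > 0 else 0.0)
    return insertion - m * p


def integrate(phi, feedback, p0=0.0, dt=0.01, steps=5000, r0=0.1, m=0.5):
    """Forward-Euler integration of PIN concentration at constant flux."""
    p = p0
    for _ in range(steps):
        p += dt * dp_dt(p, phi, feedback, r0, m)
    return p


# At constant flux, PIN settles towards (r0 + feedback(phi)) / m.
for name, fb in [("linear", feedback_linear),
                 ("quadratic", feedback_quadratic),
                 ("hill", feedback_hill)]:
    print(name, round(integrate(2.0, fb), 3))
```

With these assumed parameters the quadratic feedback reaches a much higher PIN level than the linear one at the same flux, while the Hill form saturates; this is one way to see why the choice between linear, super-linear, and saturating feedback changes model behavior.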

loop domain of particular PIN proteins [26,27]. This loop domain shows extensive variation in structure between different types of PIN protein (Bennett et al., unpublished), meaning that each type of PIN protein could have an inherently different potential for localization; for instance, disruption of specific loop domains can result in different localizations within the same cell type [28]. Ultimately, canalization-like patterns are mediated through specific regulation of a subset of PIN proteins. A deep structure–function analysis of PIN proteins would therefore delineate how each part of the loop contributes to PIN localization, and how each PIN protein behaves under different circumstances. Indeed, it is possible that part of the loop in some PIN proteins is a specific regulatory element for flux-based feedback, mediating canalization-like behavior of the protein: in effect, a 'canalization motif'. In Arabidopsis PIN1 plays major roles in both canalization and maximization, but in other species including grasses these two processes may be mediated by structurally distinct PIN proteins (O'Connor et al., unpublished). Investigating the possible evolution of PIN protein structures specialized for canalization may

therefore also provide an entry point for dissecting how canalization is regulated at a molecular level. Cell typespecic factors [29], external stimuli including light [30] and long-distance signals such as cytokinins and strigolactones [31,32] can all inuence the localization of PIN proteins, and it is therefore also important to continue assessing how different combinations of these proximal factors might contribute in large-scale self-organizing PIN behavior. The role of the apoplast in auxin transport A frequent simplifying assumption in modeling auxin transport is to ignore the apoplast and assume that auxin is transported directly from one cell to the next. Given the chemiosmotic basis of auxin transport (Box 1) this may be a dangerous omission because apoplastic conditions are interconnected with auxin transport in multiple ways [33]. For instance, low extracellular pH results in increased passive movement of auxin into cells (Figure 1A), and auxin ion export through PIN proteins is likely to be energized by the proton motive force across the plasma membrane. Furthermore, a long-established activity of

Opinion

Trends in Genetics February 2014, Vol. 30, No. 2


Figure 2. Canalization phenomena. Schematics based on the classic experiments of Sachs on excised pea epicotyls (juvenile stems). Green cylinders indicate naive nonvascular tissue; gray cylinders indicate vascular bundles. Red semicircles indicate addition of exogenous auxin. Blue lines indicate newly induced vascular strands. (A) Simple demonstration of canalization: lateral auxin application induces vascular connection with the main vascular bundle. (B,C) Source–sink relationships in induced vascular strands. The vascular bundle is surgically removed and two sources of auxin are added to the apical end of the epicotyl. If added simultaneously (B), two new sets of vasculature are formed. In both cases canalization occurs towards the site of the former vascular bundle, indicating that it is still a strong sink for auxin. If one auxin source is added subsequent to the other (C), canalization now occurs from that source towards the new vascular tissue formed by the first source, indicating that it is now a stronger sink. (D) Sink-finding in canalization. A cut in the epicotyl does not prevent canalization occurring between an exogenous auxin source and the existing vascular bundle. (E) Hyper-canalization. Addition of a strong auxin source to the existing vascular bundle now prevents sink-finding by an exogenous auxin source. However, canalization and vascular formation from the auxin source can still occur in a non-connective fashion. Dotted blue lines indicate the discontinuation of the vascular strands.

intracellular auxin is to stimulate proton-pumping ATPases, thereby further acidifying the cell wall [34]. This gives rise to a potential positive feedback loop in which increased intracellular auxin in one cell, acting through apoplastic acidification, drives increased auxin uptake in neighboring cells and increased activity of its in situ PIN proteins. This mechanism can therefore contribute to the generation of net flux between cells, particularly at high auxin concentrations, and might have important ramifications in the switching behavior seen during organ initiation (Figure 1D), where auxin accumulation in the epidermis is associated with internalization and canalization of auxin flow. The apoplast is the central focus of a recently proposed model [19], which invokes a polarity-generating mechanism that is neither WTF nor UTG, but instead relies on gradients of auxin across the cell wall partitioning an extracellular receptor to generate PIN polarization in the adjoining cells. The apoplast is certainly a potential source of information for polarization mechanisms, but there is little biological evidence to support this model, which requires steep gradients of auxin in the tiny apoplastic space to make it work [3,19]. It will certainly be interesting to test the effect of apoplastic auxin and pH dynamics in both WTF and UTG models. However, although there is now a range of approaches for assessing intracellular auxin concentrations (albeit indirectly), there is currently no way to quantify apoplastic auxin, and tools to do so should be a priority for the field.

Unification of auxin transport models
The integration of modeling and molecular genetics has demonstrated that auxin transport dynamics provide a plausible explanation for vascular patterning and shoot branching regulation via canalization [9,15,16,18] and for

phyllotaxis via maximization [35,36]. Subsequently, there has been considerable interest in producing a unifying model of auxin transport that is capable of reproducing both canalization and maximization patterns with a single heuristic and set of parameters. Published models of this type have mostly been extensions of previous models with either purely WTF mechanisms [37] or purely UTG mechanisms [38], but they cannot straightforwardly reproduce both behaviors because they require significantly altering parameters in different parts of the simulation, making biologically improbable assumptions, or ignoring wet lab data [3]. It is fair to say that the consensus in the field, supported by reanalysis of current models [3], is that no satisfactory unifying model has been developed yet, perhaps not surprisingly given the current gaps in our understanding. From a biological perspective, an interesting question is not whether the models can be mathematically unified (after all, with enough parameters one could model anything [39,40]) but whether they should be unified. Are canalization and maximization really flip sides of the same coin, or are they fundamentally different processes using different mechanisms in different tissues? There are also other auxin transport patterns, particularly in the root, that do not resemble either canalization or maximization: are all these phenomena essentially the same process, or are they divergent mechanisms that share only some basic aspects? This question is particularly intriguing from an evolutionary perspective because PAT is present throughout land plants and in at least some charophyte algae [41]. However, there is currently little evidence for the specific phenomena of canalization or maximization outside angiosperms. Given its importance in vascular development, it seems a reasonable hypothesis that canalization

evolved early in the vascular plant clade and is present throughout it. PAT is present in the lycophyte Selaginella kraussiana and plays a role in vascular development, but whether this is canalization-driven is currently unclear [42]. Lycophytes and ferns have meristems with single apical cells, and initiate organs in a fundamentally different manner to seed plants. This suggests that auxin maximization in meristems arose specifically in the large meristems of the seed-plant lineage, although this does not necessarily preclude maximization-type phenomena in ferns and lycophytes. If generalized PAT, canalization, and maximization did evolve at different points in the evolutionary history of plants, then sequential innovations could have generated novel auxin transport phenomena. In turn, this would suggest that these phenomena are not equal or equivalent, but require process-specific genetic components, which could have included changes in the structure of the PIN proteins themselves; however, more work is necessary to establish the exact evolutionary history of auxin transport. In angiosperms, different PINs are probably specialized for, or act preferentially in, particular processes; for instance, the primary (but not sole) function of PIN2 in Arabidopsis is to control a specific shootward auxin flux in the root meristem [43,44]. Further investigation of auxin transport phenomena and PIN protein subfunctionalization outside angiosperms will not only be illuminating with regard to the evolution of development in land plants but will also help in dissecting the nature of auxin transport itself. Even though canalization and maximization both involve PIN1 in Arabidopsis, there is some molecular genetic evidence to suggest that they might not be identical processes; for instance, PINOID plays a crucial role in maximization but is less central to canalization [29].
This has led to suggestions that there is tissue- or context-specific switching between modes of auxin transport, an approach used in another model [17] in which maximization and canalization are effectively modeled separately. However, as with so much in biology, it is likely that the reality will be more nuanced, especially because we do not yet understand the mechanisms of either canalization or maximization. It is plausible that there is a core machinery for allocating PIN proteins to membranes that, given the inherent differences between contexts, is capable of generating both canalization and maximization and possibly all auxin transport phenomena. For example, if PIN allocation is achieved by the combined assessment of two or more factors inside and outside cells (as discussed above), then perhaps both patterns can be generated depending on the weightings given to those different factors in different contexts. This core machinery could have been elaborated upon during plant evolution to generate new patterns of auxin transport, while remaining the same fundamental unified mechanism. Ultimately, although computational work can tell us that the models are unifiable, we will only find out for sure by pushing forward our biological understanding of auxin transport across the whole plant kingdom.

Concluding remarks
The impressive progress of theoretical research into auxin transport phenomena has outstripped advances in our


biological understanding of these processes, particularly in the case of canalization, which has only received limited experimental attention in the recent molecular genetic era of plant development [9–11,21]. Further experiments along the lines proposed here are now required to gain a deeper understanding of the canalization mechanism, and must aim to unite physiological and genetic approaches in a single species. These will not only be relevant to canalization itself but also to the auxin transport field more generally, allowing construction of a new generation of models to examine self-organizing plant development.
Acknowledgments
Our research is funded by the Gatsby Foundation and the European Research Council (Project 294514 EnCoDe). We would like to thank Graeme Mitchison for critical reading of the manuscript.

References
1 Benjamins, R. and Scheres, B. (2008) Auxin: the looping star in plant development. Annu. Rev. Plant Biol. 59, 443–465
2 Leyser, O. (2011) Auxin, self-organisation, and the colonial nature of plants. Curr. Biol. 21, R331–R337
3 van Berkel, K. et al. (2013) Polar auxin transport: models and mechanisms. Development 140, 2253–2268
4 Sachs, T. (1968) On the determination of the pattern of vascular tissue in peas. Ann. Bot. 32, 781–790
5 Jacobs, W.P. (1952) The role of auxin in the differentiation of xylem around a wound. Am. J. Bot. 39, 301–309
6 Sachs, T. (1968) The role of the root in the induction of xylem differentiation in peas. Ann. Bot. 32, 391–399
7 Sachs, T. (1969) Polarity and the induction of organized vascular tissues. Ann. Bot. 33, 263–275
8 Sachs, T. (1981) The control of the patterned differentiation of vascular tissues. Adv. Bot. Res. 9, 151–162
9 Scarpella, E. et al. (2006) Control of leaf vascular patterning by polar auxin transport. Genes Dev. 20, 1015–1027
10 Sauer, M. et al. (2006) Canalization of auxin flow by Aux/IAA-ARF-dependent feedback regulation of PIN polarity. Genes Dev. 20, 2902–2911
11 Sawchuk, M.G. et al. (2013) Patterning of leaf vein networks by convergent auxin transport pathways. PLoS Genet. 9, e1003294
12 Reinhardt, D. et al. (2003) Regulation of phyllotaxis by polar auxin transport. Nature 426, 255–260
13 Mitchison, G.J. (1980) The dynamics of auxin transport. Proc. R. Soc. Lond. B: Biol. Sci. 209, 489–511
14 Mitchison, G.J. et al. (1981) The polar transport of auxin and vein patterns in plants. Philos. Trans. R. Soc. Lond. B: Biol. Sci. 295, 461–471
15 Rolland-Lagan, A.G. and Prusinkiewicz, P. (2005) Reviewing models of auxin canalization in the context of leaf vein pattern formation in Arabidopsis. Plant J. 44, 854–865
16 Prusinkiewicz, P. et al. (2009) Control of bud activation by an auxin transport switch. Proc. Natl. Acad. Sci. U.S.A. 106, 17431–17436
17 Bayer, E.M. et al. (2009) Integration of transport-based models for phyllotaxis and midvein formation. Genes Dev. 23, 373–384
18 Krupinski, P. and Jönsson, H. (2010) Modeling auxin-regulated development. Cold Spring Harb. Perspect. Biol. 2, a001560
19 Wabnik, K. et al. (2010) Emergence of tissue polarization from synergy of intracellular and extracellular auxin signaling. Mol. Syst. Biol. 6, 447
20 Abley, K. et al. (2013) An intracellular partitioning-based framework for tissue cell polarity in plants and animals. Development 140, 2061–2074
21 Balla, J. et al. (2011) Competitive canalization of PIN-dependent auxin flow from axillary buds controls pea bud outgrowth. Plant J. 65, 571–577
22 Feugier, F.G. et al. (2005) Self-organization of the vascular system in plant leaves: inter-dependent dynamics of auxin flux and carrier proteins. J. Theor. Biol. 236, 366–375
23 Barbez, E. et al. (2013) Single-cell-based system to monitor carrier driven cellular auxin homeostasis. BMC Plant Biol. 13, 20
24 Petrášek, J. et al. (2006) PIN proteins perform a rate-limiting function in cellular auxin efflux. Science 312, 914–918
25 Geldner, N. et al. (2001) Auxin transport inhibitors block PIN1 cycling and vesicle trafficking. Nature 413, 425–428
26 Huang, F. et al. (2010) Phosphorylation of conserved PIN motifs directs Arabidopsis PIN1 polarity and auxin transport. Plant Cell 22, 1129–1142
27 Dhonukshe, P. et al. (2010) Plasma membrane-bound AGC3 kinases phosphorylate PIN auxin carriers at TPRXS(N/S) motifs to direct apical PIN recycling. Development 137, 3245–3255
28 Wisniewska, J. et al. (2006) Polar PIN localization directs auxin flow in plants. Science 312, 883
29 Friml, J. et al. (2004) A PINOID-dependent binary switch in apical-basal PIN polar targeting directs auxin efflux. Science 306, 862–865
30 Ding, Z. et al. (2011) Light-mediated polarization of the PIN3 auxin transporter for the phototropic response in Arabidopsis. Nat. Cell Biol. 13, 447–452
31 Shinohara, N. et al. (2013) Strigolactone can promote or inhibit shoot branching by triggering rapid depletion of the auxin efflux protein PIN1 from the plasma membrane. PLoS Biol. 11, e1001474
32 Marhavý, P. et al. (2011) Cytokinin modulates endocytic trafficking of PIN1 auxin efflux carrier to control plant organogenesis. Dev. Cell 21, 796–804
33 Steinacher, A. et al. (2012) A computational model of auxin and pH dynamics in a single plant cell. J. Theor. Biol. 296, 84–94
34 Hager, A. (2003) Role of the plasma membrane H+-ATPase in auxin-induced elongation growth: historical and new aspects. J. Plant Res. 116, 483–505
35 Smith, R.S. et al. (2006) A plausible model of phyllotaxis. Proc. Natl. Acad. Sci. U.S.A. 103, 1301–1306
36 Jönsson, H. et al. (2006) An auxin-driven polarized transport model for phyllotaxis. Proc. Natl. Acad. Sci. U.S.A. 103, 1633–1638
37 Stoma, S. et al. (2008) Flux-based transport enhancement as a plausible unifying mechanism for auxin transport in meristem development. PLoS Comput. Biol. 4, e1000207
38 Merks, R.M. et al. (2007) Canalization without flux sensors: a traveling-wave hypothesis. Trends Plant Sci. 12, 384–390
39 Dyson, F. (2004) A meeting with Enrico Fermi. Nature 427, 297
40 Brown, K.S. and Sethna, J.P. (2003) Statistical mechanical approaches to models with many poorly known parameters. Phys. Rev. E: Stat. Nonlin. Soft Matter Phys. 68, 021904
41 Boot, K.J. et al. (2012) Polar auxin transport: an early invention. J. Exp. Bot. 63, 4213–4218


42 Sanders, H.L. and Langdale, J.A. (2013) Conserved transport mechanisms but distinct auxin responses govern shoot patterning in Selaginella kraussiana. New Phytol. 198, 419–428
43 Luschnig, C. et al. (1998) EIR1, a root-specific protein involved in auxin transport, is required for gravitropism in Arabidopsis thaliana. Genes Dev. 12, 2175–2187
44 Müller, A. et al. (1998) AtPIN2 defines a locus of Arabidopsis for root gravitropism control. EMBO J. 17, 6903–6911
45 Avsian-Kretchmer et al. (2002) Indole acetic acid distribution coincides with vascular differentiation pattern during Arabidopsis leaf ontogeny. Plant Physiol. 130, 199–209
46 Rubery, P.H. and Sheldrake, A.R. (1974) Carrier-mediated auxin transport. Planta 118, 101–121
47 Raven, J.A. (1975) Transport of indoleacetic acid in plant cells in relation to pH and electrical potential gradients, and its significance for polar IAA transport. New Phytol. 74, 163–172
48 Gälweiler, L. et al. (1998) Regulation of polar auxin transport by AtPIN1 in Arabidopsis vascular tissue. Science 282, 2226–2230
49 Geisler, M. and Murphy, A.S. (2006) The ABC of auxin transport: the role of p-glycoproteins in plant development. FEBS Lett. 580, 1094–1102
50 Péret, B. et al. (2012) AUX/LAX genes encode a family of auxin influx transporters that perform distinct functions during Arabidopsis development. Plant Cell 24, 2874–2885
51 Leyser, O. (2010) The power of auxin in plants. Plant Physiol. 154, 501–505
52 Vieten, A. et al. (2005) Functional redundancy of PIN proteins is accompanied by auxin-dependent cross-regulation of PIN expression. Development 132, 4521–4531
53 Robert, S. et al. (2010) ABP1 mediates auxin inhibition of clathrin-dependent endocytosis in Arabidopsis. Cell 143, 111–121
54 Blilou, I. et al. (2005) The PIN auxin efflux facilitator network controls growth and patterning in Arabidopsis roots. Nature 433, 39–44
55 Benková, E. et al. (2003) Local, efflux-dependent auxin gradients as a common module for plant organ formation. Cell 115, 591–602
56 Brunoud, G. et al. (2012) A novel sensor to map auxin response and distribution at high spatio-temporal resolution. Nature 482, 103–106
57 Heisler, M.G. et al. (2010) Alignment between PIN1 polarity and microtubule orientation in the shoot apical meristem reveals a tight coupling between morphogenesis and auxin transport. PLoS Biol. 8, e1000516


Opinion

Particle genetics: treating every cell as unique


Gaël Yvert
Laboratoire de Biologie Moléculaire de la Cellule, Ecole Normale Supérieure de Lyon, CNRS, Université de Lyon, Lyon, France

Genotype–phenotype relations are usually inferred from a deterministic point of view. For example, quantitative trait loci (QTL), which describe regions of the genome associated with a particular phenotype, are based on a mean trait difference between genotype categories. However, living systems comprise huge numbers of cells (the 'particles' of biology). Each cell can exhibit substantial phenotypic individuality, which can have dramatic consequences at the organismal level. Now, with technology capable of interrogating individual cells, it is time to consider how genotypes shape the probability laws of single cell traits. The possibility of mapping single cell probabilistic trait loci (PTL), which link genomic regions to probabilities of cellular traits, is a promising step in this direction. This approach requires thinking about phenotypes in probabilistic terms, a concept that statistical physicists have been applying to particles for a century. Here, I describe PTL and discuss their potential to enlarge our understanding of genotype–phenotype relations.

Genetics has largely remained Newtonian
When Isaac Newton described the link between forces and energy (momentum) in what is known as his 2nd principle, classical mechanics was born. Scientists could compute speeds and trajectories, and this knowledge initiated a profound transformation of occidental societies. New techniques appeared and the philosophical apprehension of the world was modified. It is tempting to consider that the 'Newtonian revolution' of genetics took place during the mid-20th century. When heredity (genes) was linked to biochemistry (enzymes), molecular biology was born. As happened three centuries earlier with mechanics, this discovery profoundly transformed society, in technological and philosophical terms. Over a few decades, it became plausible to explain and predict phenotypes from combinations of genetic and environmental determinants. Current research in genetics is probably still largely influenced by this excitement. Genomics has scaled up investigations and findings but did not profoundly change the (sometimes
Current research in genetics is probably still largely inuenced by this excitement. Genomics has scaled up investigations and ndings but did not profoundly change the (sometimes
Corresponding author: Yvert, G. (Gael.Yvert@ens-lyon.fr).
Keywords: QTL; GWAS; probabilistic trait locus (PTL); single cell; stochasticity; complex traits.
0168-9525/$ – see front matter © 2013 The Author. Published by Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.11.002

caricatural) view of a deterministic genotype–phenotype control. Most quantitative genetics studies are based on QTL (see Glossary) mapping or whole-genome association. In both cases, the phenotype is assumed to derive from the genotype in a deterministic manner. The mutations sought are those that cause an increase in trait values in individuals carrying them. An arsenal of statistical methods can efficiently detect them when this increase is large enough. However, mutations contributing little to the

Glossary
Expression probabilistic trait locus (ePTL): a PTL where the trait of interest is the abundance of a gene product.
Expression quantitative trait locus (eQTL): a QTL where the trait of interest is the abundance of a gene product. In some studies, eQTL refers to traits of mRNA levels and pQTL refers to traits of protein levels.
Penetrance: probability that an individual of genotype g displays a phenotype [44]. Usually associated with qualitative traits, such as disease versus control.
Probabilistic trait locus (PTL): a DNA polymorphism modifying a quantitative trait density function. A PTL is not necessarily associated with a change in mean trait value. It may affect the variance, skewness, normality, bimodality, or any other property of the trait density function. Genetic buffers of environmental or genetic perturbations are not PTL under this definition. They may affect interindividual variability across different environments or genotypes without necessarily modifying a trait density function defined within a precise environmental and isogenic context.
Quantitative trait density function: probability density function f of a quantitative trait among individuals of the same genotype g, such that
\[ P = \int_{t_1}^{t_2} f(g, t)\, dt \]
is the probability that one individual of genotype g displays a trait value falling within the interval [t1,t2]. To be informative, this function must be defined for given values of environmental, age, gender, and other factors that may obviously affect the trait in a deterministic manner. Here, 'same genotype' refers to fully isogenic individuals, such as isogenic strains or lines of experimental organisms. For many outbred organisms, trait density functions cannot be directly observed.
Quantitative trait locus (QTL): a DNA polymorphism underlying the genetic variation of a quantitative trait [45]. It is usually mapped within genetic intervals defined by markers. If it is precisely identified, its molecular implication can be studied. In most studies, QTL are associated with a change in mean or median trait value.
Single cell probabilistic trait locus (scPTL): a DNA polymorphism modifying a single cell quantitative trait density function. A scPTL is not necessarily a PTL if the difference in single cell properties does not modify the probability of a macroscopic trait of an individual.
Single cell quantitative trait density function: probability density function of a single cell quantitative trait, defined for cells of a given genotype, differentiation state, and environmental context. This can refer to individual cells of the same tissue within an individual (Figure 1C, main text), or cells of a clonal microbial colony.
Trait expressivity: degree to which trait expression varies among individuals of genotype g. Often used to describe traits that can be discretized, such as the clinical severity of syndromes. Expressivity E of trait T reflects the extent of trait variation but not the probability that an individual expresses T at a given level. If f is the quantitative trait density function of T for genotype g, then E(T,g) corresponds to all values of T where f > 0.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.
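The glossary definitions of a quantitative trait density function and of a PTL can be made concrete with a small Python sketch. It assumes, purely for illustration, that the trait of each genotype follows a normal distribution; the two hypothetical genotypes share the same mean (so a mean-based QTL scan would see no difference) but differ in variance. The names `interval_probability`, `genotype_a`, and `genotype_b`, and all numerical values, are invented for this example.

```python
from statistics import NormalDist


def interval_probability(density, t1, t2):
    """P(t1 <= trait <= t2), i.e. the integral of the density over [t1, t2]."""
    return density.cdf(t2) - density.cdf(t1)


# Two hypothetical isogenic genotypes: same mean trait value, different spread.
genotype_a = NormalDist(mu=10.0, sigma=1.0)
genotype_b = NormalDist(mu=10.0, sigma=3.0)

# Probability that an individual (or cell) shows an extreme trait value.
p_a = interval_probability(genotype_a, 13.0, 20.0)
p_b = interval_probability(genotype_b, 13.0, 20.0)

same_mean = genotype_a.mean == genotype_b.mean
```

Under these assumptions the mean trait is identical for both genotypes, yet the probability of falling in the extreme interval is roughly two orders of magnitude larger for the high-variance genotype, which is the kind of effect that a PTL, but not a mean-based QTL, is meant to capture.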

Trends in Genetics, February 2014, Vol. 30, No. 2

phenotype escape detection because their effect is small compared with intragenotype variation. Unfortunately for our understanding, these small-effect variants seem to be particularly important: they are abundant [1]; they were proposed to contribute to the missing heritability of complex traits [1,2]; and evolutionary selection might largely act through them [3]. Detection of these loci can be improved by studying larger cohorts and by applying judicious models that include cofactors (e.g., environmental factors) and nonadditivity. However, staying in a deterministic framework might be limiting when the small contribution of the locus is due to incomplete penetrance of the trait. If the genotype does not comprehensively predict the phenotype, as seen in heritable cardiac arrhythmia [4], polydactyly [5], and various cancer-predisposing syndromes [6], then adopting a probabilistic approach might be more appropriate.

Major macroscopic events can result from microscopic properties
Rare events occurring in a few cells can have dramatic consequences at the macroscopic level. We are all examples of this, because the macroscopic physiology of our body largely results from only two germ cells contributed by our parents. Peculiarities in these cells or their progenitors can potentially change our everyday life. Another striking example is cancer: macroscopic tumors appear from a single cell that escaped proliferation controls. Anticancer treatments do not eradicate all tumor cells, and the (few) cells that persist represent the major threat for clinical outcomes. Therefore, cancer is a statistical issue of controlling the probability that cells become tumorous (risk factors), and the probability that they escape elimination by the organism and treatment (persistence). The latency of infectious pathogens is also a statistical issue. HIV-1 can persist in a small reservoir of resting cells that later reactivate infection and disease. This mechanism of persistence represents an enormous challenge for long-term therapy [7,8]. Bacterial resistance to antibiotics represents a similar challenge [9]. Thus, some macroscopic phenotypes cannot be fully apprehended without taking into account cell-to-cell differences. However, if genetics has remained Newtonian, are we prepared for microscopic considerations? Objects of atomic scale violate Newton's laws. Colleagues from physics can study and manipulate these objects because their predecessors formulated the quantum theory. The revolutionary concept considered that the parameters of a particle did not determine its position and speed but changed the probability that the particle be at a given position or have a certain speed. Colleagues from statistical physics describe particles by wave functions, which carry this probabilistic information. Without this description, the diffraction of light or the spreading of liquid helium away from its container escape understanding. As with any other matter, cells comprise atoms, and the fact that quantum properties appear at higher and higher resolution is obvious. However, the consideration that multicellular organisms are statistical systems of cells has not been clearly formulated. Until only recently, biochemistry and molecular biology tests have typically been

conducted on extracts of millions of molecules or cells. Therefore, most experimental readouts report averaged values. Physiological trait measurements often reflect the averaged contribution of billions of cells to the function of an organ. However, biological processes, like mechanics, look profoundly different at lower scales. When gene expression is monitored at the single cell level, bursts can be observed, corresponding to activity fired in some cells and not others. This had been noticed long ago [10] and is now extensively studied. Therefore, our scientific language is changing: what we used to call the 'level of transcription' is replaced by more discrete terms such as 'burst size' and 'burst frequency' [11]. Single molecule studies have also revealed unanticipated activity dynamics [12]. Regarding phenotypes, microscopic heterogeneities can become apparent when traits are observed at single cell resolution. For example, the induction of apoptosis by tumor necrosis factor-related apoptosis-inducing ligand (TRAIL) in cancer cell lines was shown to vary among individual cells [13], as did the activation of nuclear factor (NF)-κB by TNF-α in mouse fibroblasts [14] and the triggering of proliferation in response to epidermal growth factor (EGF) stimulation [15]. Phenotypic variability among human cell cultures can be driven by local population contexts, such as local cell density [16], and nonuniform mechanical stress can generate heterogeneities within tissues [17]. Thus, the deterministic view of genetic control seems to be challenged by single cell analysis. Even though macroscopic traits result from the collective contribution of billions of cells, they do not necessarily follow the average of these contributions. Therefore, our classical apprehension of phenotypes might have long been blurred by the law of large numbers.

Cells and molecules: the particles in biological sciences
The boundary between Newtonian and quantum mechanics is a frontier between orders of magnitude. For the law of large numbers to apply, identical particles must be numerous enough in the object considered so that probabilistic considerations are not needed. What are the typical orders of magnitude under consideration in biological systems? For example, in a system such as a human body, how many particles (cells) are there? With the very crude approximation of an average cell size of 10 μm and a density of 1, a 100-kg human body comprises 10¹⁴ cells. Given that various body parts are devoid of cells, a lower estimate (10¹³) was proposed based on DNA mass [18]. However, many cells divide, and the total number of cell divisions in a human body in the course of a lifetime was said to be on the order of 10¹⁶ [19]. Notably, these numbers cover only human cells and not our microbiome, which is approximately ten times more abundant [20] and much more proliferative. To realize how big these numbers are, one can visit the Great Dune of Pyla near Arcachon, France. This tall (>100 m) sand dune is made of tiny quartz grains and its volume is estimated at 60 million m³. A 50-ml sample of sand from the dune weighed 80 g and 97 grains weighed 4 mg; therefore, the dune has approximately 2.5 × 10¹⁸ grains. Thus, the few hundred campers staying near the dune will altogether have produced in their lifetime as many human cells as the dune has grains.

Not only are cells incredibly numerous, but they also differ substantially. Cell identity is often categorized as a cell type, which reects a particular tissue, function, morphology, and differentiation state. However, even within cell types, cells have a large amount of variability. The stochastic nature of gene expression mentioned above illustrates that intracellular concentrations of molecules can range signicantly among so-called identical cells [21]. And cells also differ in the identity of these molecules. First, somatic mutations generate intra-cell type heterogeneities. Considering a somatic mutation rate of 106 per cell division for a human protein of middle size [22], the chance that one of the 20 000 human protein-coding sequence [23] gets mutated at every division is very high (approximately 2%). In addition, mutations also arise in nondividing cells, as shown recently for active transposition in brain neurons [24]. Second, delity of mRNA molecules is largely imperfect, with abundant nucleotide misincorporations and splicing errors [25], and transcript boundaries are extremely heterogeneous among individual molecules [26]. Third, DNA-coding sequences do not strictly dictate the nal identity of intracellular proteins. Errors in translation can generate mistakes in approximately 15% of the proteome [25]. Finally, individual protein molecules change with time. They are dynamically modied at many sites, accumulate oxidative damages, occasionally fail to fold into functionally equivalent conformations, and do not necessarily localize to the same subcellular compartments and macromolecular complexes. For all these reasons, multicellular organisms are dynamic mosaics of a huge number of cells that differ far beyond their differentiation type, and all these aspects can impact the relation between genotype and phenotype. Nondeterministic genetic effects The nonlinearity of biochemical reactions sometimes makes the analysis of single cell statistics essential. 
Properties such as cooperation, threshold effects, or feedback loops provide cells with the ability to switch between phenotypic states [27,28]. Genetic variation affecting these properties might change the probability of single cell outcomes without necessarily affecting the average trait value. Understanding trait variation in this context requires a statistical description of the behavior of individual cells.

Nondeterministic outcomes of genetic mutations can be studied in experimental organisms. In the nematode Caenorhabditis elegans, skinhead (skn)-1 mutants were shown to generate elevated variability in expression of ending (end)-1 transcripts. As a result, some, but not all, skn-1 mutant embryos failed to achieve proper intestinal development [29]. Nonlinear dependencies between phenotypic outcomes and molecular regulations can also be studied by directly manipulating the dosage of genes involved in developmental pathways. A recent study described how C. elegans vulval development can tolerate up to a fourfold variation in EGF signaling without any phenotypic perturbation. Combining dosage perturbations in EGF and Notch signaling enabled the authors to draft an experimental phase diagram of developmental outcomes as a function of quantitative variation in the two pathways [30]. These experiments are important because they estimate the boundaries within which developmental processes are robust.

In humans, a particularly interesting example is the case of autosomal dominant (AD) mutations predisposing to cancer [6]. A large number of such mutations cause diseases characterized by a wide spectrum of symptoms (syndromes), with varying clinical expressivity, and several considerations on AD mutations illustrate the need to perform genetics in a nondeterministic framework (Box 1). In addition, a surprising correlation was recently reported between morbidities of Mendelian disorders and complex diseases, suggesting that many Mendelian disease-causing mutations have probabilistic effects on complex traits [31].

Other informative observations are those collected on fluctuating asymmetry. Some organs, such as animal limbs or plant leaves, are represented more than once in the same individual. This offers the possibility to observe nondeterministic trait variation directly. For example, any difference between the left and right wing of a fly cannot be attributed to age, diet, or any environmental effect because the two wings developed simultaneously in the same animal. Fluctuating asymmetry (FA) quantifies such intraindividual morphological differences and provides a valuable readout of nondeterministic phenotypic outcomes. When measured on numerous individuals, FA enables phenotypic variability to be quantified, even if the causes of these differences remain unknown at the molecular and cellular level. A remarkable experiment showed that elevated FA and, therefore, phenotypic variability, can have large heritability. By designing successive crosses between Drosophila melanogaster flies displaying high FA, the authors were able to fix elevated FA from an outbred population [32]. This demonstrates that different levels of phenotypic noise can segregate in the wild. This likely explains the different levels of cell–cell trait variability that were recently observed in natural yeast strains [33,34].
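As a minimal sketch of how FA can be turned into a number, the snippet below computes a simple size-scaled asymmetry index, |L − R| / ((L + R) / 2), for each individual and averages it per genotype. The wing lengths, genotype labels, and function names are hypothetical illustrations rather than data or methods from the studies cited; published FA analyses typically also account for measurement error and directional asymmetry.

```python
from statistics import mean

def fa_index(left, right):
    """Unsigned left-right difference scaled by the mean organ size."""
    return abs(left - right) / ((left + right) / 2)

def mean_fa(pairs):
    """Average FA index over (left, right) measurement pairs."""
    return mean(fa_index(l, r) for l, r in pairs)

# Hypothetical wing lengths (mm) for two genotypes; not real data.
low_noise = [(2.00, 2.01), (1.98, 1.98), (2.03, 2.02), (2.01, 2.00)]
high_noise = [(2.00, 2.08), (1.95, 2.04), (2.06, 1.97), (2.02, 1.94)]

print(f"mean FA, low-noise genotype:  {mean_fa(low_noise):.4f}")
print(f"mean FA, high-noise genotype: {mean_fa(high_noise):.4f}")
```

Because both wings of an individual share genotype, age, and environment, a higher mean index in one genotype would point to a difference in developmental noise rather than in average wing size.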
Finding sources of phenotypic noise in the wild complements an earlier observation from an in-lab evolution experiment. Extreme selection on Pseudomonas fluorescens bacteria for phenotypic switching generated genotypes causing intraclonal trait bimodality [35]. These examples show that some genotypes can have nondeterministic consequences on phenotypic traits. To really understand how these genetic effects contribute to the physiology and evolution of living systems, classical genetics must be revised.

Mapping macroscopic and single cell probabilistic trait loci
From Mendelian mapping to current genome-wide association studies (GWAS), genetic linkage is always based on a simple principle: observing phenotype and genotype on a set of individuals, and deriving correlations. A genetic locus of sufficiently large effect on the phenotype is detected because data points (individuals) display covariation between the genotype at the locus and the phenotype. In this framework, all the microscopic diversity discussed above is compressed into a single parameter: the phenotype of the individual. The ability to acquire parameters on single cells from every individual has not yet been fully exploited.
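The principle of deriving genotype–phenotype correlations can be illustrated with a minimal sketch: a permutation test on the difference in mean trait value between carriers of two alleles. Everything here is assumed for illustration (the simulated trait values, group sizes, effect size, and the helper name `perm_test_mean_diff`); real linkage and association analyses rely on dedicated statistical models.

```python
import random

def mean(values):
    return sum(values) / len(values)

def perm_test_mean_diff(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign individuals to genotypes at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

rng = random.Random(1)
# Hypothetical cohort: allele B shifts the mean trait by one standard deviation.
allele_a = [rng.gauss(0.0, 1.0) for _ in range(100)]
allele_b = [rng.gauss(1.0, 1.0) for _ in range(100)]
p_value = perm_test_mean_diff(allele_a, allele_b)
print(f"permutation p-value: {p_value:.4f}")
```

Each individual contributes a single number to this test, which is the compression of microscopic diversity described above.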
Box 1. Possible nondeterministic effects of haploinsufficiency
Neurofibromatosis 1 is a typical case of an autosomal dominant disorder displaying a range of disease severity. It is caused by heterozygous loss-of-function mutations of the NF1 gene, and symptoms vary from café-au-lait spots on the skin to severe malignancy [46]. Variability in disease appearance and expressivity can be interpreted in two complementary ways. Mutations might appear in two hits: a first mutation is inherited from the parental germline and a second one occurs later somatically. This secondary mutation can occur via loss of heterozygosity, or via a novel mutation hitting the wild type allele. In this two-hit model, the probabilistic nature of the trait among carriers of the first mutation is strictly associated with the probability of the occurrence of the secondary mutation. The model remains deterministic in terms of genotype–phenotype control: heterozygous cells are healthy and homozygous −/− cells are pathogenic. The alternative interpretation is that haploinsufficiency alone might produce a subpopulation of pathogenic cells as a result of improper regulation of enzymatic activity in some heterozygous cells. In this case, the genotype–phenotype control is probabilistic because most heterozygous cells are healthy but some of them become pathogenic. Note that this alternative model does not necessarily exclude the two-hit interpretation: if the probabilistic cellular trait affects the mutation rate of the wild type allele, then haploinsufficiency facilitates secondary mutations and the two-hit model also applies.

Possible nondeterministic consequences of haploinsufficiency have been discussed [47] and explored in simulations [48,49]. Two scenarios are particularly plausible. First, haploinsufficiency might increase sensitivity to differential allelic expression. Two alleles of a gene are not necessarily fired simultaneously. If they both encode a fully functional protein, these temporal allelic differences do not generate significant fluctuations in gene activity. By contrast, firing a null allele is a dead end, and fluctuations between allelic transcription rates might generate variable enzymatic activity in −/+ heterozygous cells (Figure IA). Second, haploinsufficiency can render cells particularly sensitive to molecular noise because of the nonlinearity of enzymatic reactions. This is illustrated in Figure IB, where heterozygosity suppresses buffering against fluctuations. Experimental evidence supporting such scenarios is scarce, but an important observation is the elevated variability in single cell traits that has been observed among Nf1−/+ melanocytes compared with Nf1+/+ control samples [50].


[Figure I panels: (A) +/+ and −/+ cells. (B) Cellular phenotype (from disease to healthy) plotted against gene activity for the −/+ and +/+ genotypes, with input noise and output noise indicated.]

Figure I. Possible nondeterministic consequences of haploinsufficiency. (A) Diversity from fluctuations in allele-specific expression. Color of the cytoplasm represents the concentration of functional gene product (darker color indicates a higher concentration). (B) A cellular outcome is represented as a quantitative trait, from disease-causing low levels to healthy full levels, as a function of the activity of a gene product. The heterozygous genotype produces normal mean level activity but an increased variability in the outcome. Note that the input noise reflects variability in enzymatic activity, which can correspond to various parameters, such as variation in concentration or in the proportion of molecules that have the required post-translational modifications, subcellular localization, or conformation.
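The second scenario (Figure IB) can be sketched numerically. The snippet below assumes, purely for illustration, a saturating Hill-type relation between gene activity and cellular phenotype, the same absolute input noise for both genotypes, and arbitrary parameter values (`K`, `n`, and mean activities of 2 and 1 for +/+ and −/+ cells). None of these choices comes from the cited studies.

```python
import random
from statistics import mean, pstdev

def phenotype(activity, K=0.5, n=4):
    """Saturating (Hill-type) response of a cellular phenotype to gene activity."""
    a = max(activity, 0.0)  # activity cannot be negative
    return a**n / (K**n + a**n)

def simulate(mean_activity, noise_sd=0.25, n_cells=5000, seed=0):
    """Phenotypes of n_cells cells whose gene activity fluctuates around a mean."""
    rng = random.Random(seed)
    return [phenotype(rng.gauss(mean_activity, noise_sd)) for _ in range(n_cells)]

wild_type = simulate(mean_activity=2.0)     # +/+ : two functional alleles
heterozygous = simulate(mean_activity=1.0)  # -/+ : one functional allele

for label, cells in (("+/+", wild_type), ("-/+", heterozygous)):
    low = sum(c < 0.8 for c in cells) / len(cells)
    print(f"{label}: mean={mean(cells):.3f} sd={pstdev(cells):.3f} "
          f"fraction below 0.8={low:.3f}")
```

Under these assumed settings, +/+ cells sit on the flat part of the response curve and are buffered against input noise, whereas −/+ cells sit closer to the steep part, so the same input fluctuations translate into a much wider spread of phenotypes.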

Common traits are classically dissected by mapping QTL. A QTL is detected when one can reliably reject the null hypothesis of no difference in mean trait value between carriers of one allele and carriers of other alleles at the locus. Sometimes, the test is applied to the median value instead. In multilocus scans, more genotype combinations are considered, and the null hypothesis is also an equal mean (or median) trait value across genotypes. Therefore, QTL are mapped within a statistical framework, but they have a deterministic nature because they affect the average or median trait value of all individuals of the same genotype. If a genetic locus changes trait properties other than the mean or median, it is likely not detected. For example, interindividual variance might be changed, thereby generating more individuals with extreme trait values (Figure 1B). To better account for the probabilistic nature of common traits, one can consider the trait probability density function instead of the observed trait values only. This function not only relates to the trait expressivity, but also provides the probability of observing the trait at a given value, as does penetrance for dichotomous traits. Using this function, it is possible to refine the concept of QTL by considering any change of the trait density function in response to genetic variation. Let us define a probabilistic trait locus (PTL) as any DNA polymorphism that modifies a

trait probability density function. Under this definition, all QTL are PTL because they affect the mean or median trait value and, therefore, the trait density function. However, the reverse is not true: a PTL may change various properties of the trait probability density without necessarily affecting its average.

[Figure 1 panels: (A) QTL; (B) PTL; (C) scPTL.]
Figure 1. Quantitative trait loci (QTL), probabilistic trait loci (PTL), and single cell (sc)PTL effects. In each case, two genotypes at a given locus are compared, as indicated by the color outline of individuals (green versus purple). (A) The locus is a QTL and the purple genotype increases the trait value, as indicated by the size of individuals. (B) The locus is a PTL and the purple genotype increases the trait variance without changing the mean trait value of individuals. (C) Individual cells of a tissue are represented by dots, colored by their value of a single cell quantitative trait. Here, the locus is a scPTL: the purple genotype increases single cell trait variance within the tissue. This may or may not change the macroscopic phenotype of individuals.

Although not specifically named this way, PTL mapping has already been reported in several studies that looked at within-genotype interindividual trait variation. The earliest example was a QTL mapping strategy applied to stochastic variation in yeast gene expression [36]. The approach was recently followed up to derive additional PTL [37]. In these studies, the trait of interest was the expression level of a GFP construct reporting the activity of the yeast methionine-requiring (MET)17 promoter. Probability density functions were tracked by flow cytometry and several loci were associated with a change in the variance, and not the mean, of MET17 promoter activity. These loci can be qualified as expression (e)PTL because they affect the density function of a gene expression trait. Notably, three DNA polymorphisms causing increased variability were discovered. One was a uracil-requiring (ura)3 mutation that is widely used as an auxotrophic marker in yeast laboratories. Given that URA3 activity can affect transcriptional elongation efficiency, this ePTL revealed that elongation impairments increased the levels of stochasticity in gene expression [36]. Another ePTL was a frameshift mutation in ethionine resistance conferring (ERC)-1, a transmembrane transporter gene, which reduced MET17-GFP expression variability. A third one was the promoter region of the methionine uptake (MUP)1 gene, also encoding a transmembrane transporter, which probably increased the sensitivity of cells to microenvironmental fluctuations [37].

Another study also mapped ePTL using a transcriptomic data set of Arabidopsis thaliana [38]. Within-genotype coefficients of variation of mRNA levels were considered as quantitative traits and genetic loci linked to them were identified. Many, but not all, of them were also eQTL affecting mean expression. An apparently similar observation was made in humans, where the fat mass and obesity associated (FTO) gene locus was associated with both mean and variability of obesity in a GWAS [39]. However, in this case, variability was measured across individuals sharing a common allele at the FTO locus, but differing at numerous other loci and each having a specific history of exposure to various environmental factors.

Therefore, the observed PTL effect of FTO could result from fully deterministic gene–gene or gene–environment interactions that remain challenging to characterize. In this regard, the effect is comparable to results from a previous study mapping QTL of genetic and environmental robustness in mice [40]. Using approximately 20 animals from each of 19 inbred lines, the authors mapped numerous robustness QTL conferring different levels of variability across genetic backgrounds or among individuals without altering median trait values. A detailed dissection of the underlying gene–gene and gene–environment interactions would require more lines and animals, but the results already indicate the presence of abundant genetic loci implicated in trait buffering. Interestingly, the nonparametric method described in this mouse study can be applied to variability among isogenic individuals sharing a common environment and, therefore, provides a direct way to map ePTL systematically [40]. These pilot studies illustrate the feasibility and potential of PTL mapping. However, carrying out genetic mapping at the level of individuals without exploiting single cell data
faces a major limitation: sample size. It is well known in statistics that testing differences in variance or other higher-order moments of a distribution requires larger samples than testing differences in median or mean. This issue has been explored in a quantitative genetics model: a single nucleotide polymorphism (SNP) causing a change in phenotypic variance with a 1.1 multiplicative effect requires a minimum of 10 000 observations to be detected by GWAS [41]. Given the large samples already needed for classical QTL and GWAS studies, the experimental effort to identify PTL in a systematic approach seems enormous. To bypass this limitation, an attractive possibility is to remember that a single individual can provide a huge number of cells. If a genetic locus has an intrinsically nondeterministic effect on molecular or cellular regulation, then it likely affects the density function of one or more single cell traits.

This trait can be a gene expression level or any other intracellular concentration, a cellular shape, a cell division rate, a rate of secretion, or any other quantity relevant to the macroscopic phenotype under study. When this single cell trait can be measured experimentally on many cells collected from each individual of a cohort, the single cell trait density function can be estimated. This is the case, for example, if the trait is the cell size of a class of macrophages. Its density function can be obtained by drawing blood from donors, extracting macrophages, labeling the ones of interest with appropriate cell surface markers, and analyzing them by flow cytometry. I now define a single cell (sc)PTL as a genetic locus changing a single cell trait density function. Mapping scPTL can bypass the issue of statistical power because samples of large size (many cells) are available from every individual. Thus, comparing variances or other higher-order moments of the single cell trait density function becomes possible. Therefore, using flow cytometry or other high-throughput single cell measurements [42] to map scPTL might prove more powerful than mapping PTL of traits exhibited by the individual.

In what situation would scPTL mapping succeed and QTL detection fail? A scenario is presented in Figure 2, where a cellular quantitative trait is monitored and cells with high trait values are pathogenic. A trait like this can be, for example, the expression level of an oncogene such as v-erb-b2 erythroblastic leukemia viral oncogene homolog 2 (ERBB2), which can trigger tumorigenic processes when it is overexpressed in a single cell [43]. Alleles at a scPTL locus can change the statistical distribution of the trait among cells without necessarily changing the mean expression level (Figure 2A). This can be, for example, a variant in the promoter of the ERBB2 gene that increases cell-to-cell variability in ERBB2 expression [11]. In this case, some ERBB2 genotypes will increase the fraction of pathogenic cells appearing in the body and, therefore, will increase disease risk. The phenotype of the individuals (disease versus healthy) is poorly contrasted by the genotype because many individuals at risk are healthy (e.g., their immune system managed to clear the pathogenic cells). Owing to this low penetrance, standard QTL or GWAS detection has poor power (Figure 2B). By contrast, if single cell data are available, it becomes apparent that all carriers of the predisposition allele display a modified distribution of the single cell trait. Every individual is then highly informative for the genetic linkage test, and the scPTL can be detected (Figure 2C).

[Figure 2 panels: (A) Density of the single cell trait value for genotypes gref, g1, g2, and g3, with normal and pathogenic ranges indicated; g1, g2, and g3 produce more pathogenic cells and an increased probability of disease. (B) Phenotype (healthy versus disease) against genotype (gref versus g1): poor association. (C) Variance of the single cell trait against genotype (gref versus g1): strong association.]
Figure 2. A scenario where single cell probabilistic trait loci (scPTL) mapping has greater potential than a classical approach. (A) A single cell trait, such as the expression level of an oncogene, is pathogenic at high values. Graphs represent distributions of the trait value (x-axis) among isogenic cells for four individuals having genotype gref, g1, g2, and g3, respectively, at a given locus. This locus is a scPTL because the distributions are significantly different, but it is not a QTL because the mean trait value (orange broken line) is the same in all four individuals. Genotypes g1, g2, and g3 generate more pathogenic cells than genotype gref, thereby increasing disease risk. (B) Corresponding data set used if a classical approach is applied. Brown and red symbols represent healthy and diseased individuals, respectively. Due to incomplete penetrance, only a few individuals carrying the g1 genotype at the locus display the disease. Therefore, the correlation between the macroscopic phenotype (disease versus healthy) and the genotype (gref versus g1) is weak. (C) Data set used for scPTL mapping. Symbols represent the same individuals as in (B). This time, the phenotype on the y-axis is the variance of the distributions shown in (A). All g1 individuals display greater phenotypic values compared with gref controls. This greater correlation between the microscopic phenotype and the genotype enables detection of the locus.

Concluding remarks
The huge mosaic of cells that forms an individual constitutes both a challenge and an opportunity. There is no chance that we will exhaustively describe this complex system by Newtonian deterministic laws inherited from molecular biology, and this might seem like bad news. Fortunately, experimental measures are accessible to estimate probability functions from single cells and, therefore, the genetic properties of these functions can be dissected. A particle approach will probably not revolutionize genetics in general. However, for diseases that depend on the behavior of rare cells, several genetic factors might have been missed by classical QTL studies and GWAS because their reduced penetrance makes their overall effect small. The scPTL approach has the potential to reveal such variants. It is now, more than ever, time to talk to statistical physicists, to invite them to train our students, and to think in probabilistic terms about the roots of phenotypic control.
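The scenario of Figure 2 can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the author's model: single cell trait values are drawn from normal distributions with equal means and genotype-specific standard deviations, a cell counts as pathogenic above an arbitrary threshold, and the probability of disease is taken to be proportional to the fraction of pathogenic cells. All parameter values and names are hypothetical.

```python
import random
from statistics import pvariance

N_CELLS = 2000        # cells measured per individual (assumed)
N_PER_GENOTYPE = 40   # individuals per genotype (assumed)
THRESHOLD = 3.0       # trait value above which a cell is pathogenic (assumed)
RISK_SCALE = 10.0     # disease probability per unit fraction of pathogenic cells (assumed)
CELL_SD = {"gref": 1.0, "g1": 1.5}  # same mean, different single cell spread

def simulate_individual(genotype, rng):
    """Return (single cell trait variance, disease status) for one individual."""
    cells = [rng.gauss(0.0, CELL_SD[genotype]) for _ in range(N_CELLS)]
    frac_pathogenic = sum(c > THRESHOLD for c in cells) / N_CELLS
    diseased = rng.random() < min(1.0, RISK_SCALE * frac_pathogenic)
    return pvariance(cells), diseased

rng = random.Random(42)
cohort = {g: [simulate_individual(g, rng) for _ in range(N_PER_GENOTYPE)]
          for g in CELL_SD}

for g, individuals in cohort.items():
    variances = [v for v, _ in individuals]
    n_diseased = sum(d for _, d in individuals)
    print(f"{g}: diseased {n_diseased}/{N_PER_GENOTYPE}, "
          f"single cell variance {min(variances):.2f}-{max(variances):.2f}")
```

Under these assumptions, disease status is expected to separate the two genotypes poorly because many g1 carriers stay healthy, whereas the per-individual single cell variances of the two genotypes should not overlap, which is the contrast between panels B and C.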
Acknowledgments
I am grateful to Marie Delattre, Bernard Dujon, Marie-Anne Félix, and Jean-Louis Mandel for fruitful discussions, to Daniel Jost, Magali Richard, Orsolya Symmons, and Michel Yvert for critical reading of the manuscript, to present and past members of the laboratory for their support and commitment, to two anonymous reviewers for very constructive criticism, and to Théophile Yvert for a sample of sand from the Great Dune of Pyla. This work was supported by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013 Grant Agreement n° 281359).


References
1 Bloom, J.S. et al. (2013) Finding the sources of missing heritability in a yeast cross. Nature 494, 234–237
2 Eichler, E.E. et al. (2010) Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446–450
3 Rockman, M.V. (2012) The QTN program and the alleles that matter for evolution: all that's gold does not glitter. Evolution 66, 1–17
4 Giudicessi, J.R. and Ackerman, M.J. (2013) Determinants of incomplete penetrance and variable expressivity in heritable cardiac arrhythmia syndromes. Transl. Res. 161, 1–14
5 Furniss, D. et al. (2008) A variant in the sonic hedgehog regulatory sequence (ZRS) is associated with triphalangeal thumb and deregulates expression in the developing limb. Hum. Mol. Genet. 17, 2417–2423
6 Garber, J.E. and Offit, K. (2005) Hereditary cancer predisposition syndromes. J. Clin. Oncol. 23, 276–292
7 Finzi, D. et al. (1999) Latent infection of CD4+ T cells provides a mechanism for lifelong persistence of HIV-1, even in patients on effective combination therapy. Nat. Med. 5, 512–517
8 Weinberger, L.S. et al. (2008) Transient-mediated fate determination in a transcriptional circuit of HIV. Nat. Genet. 40, 466–470
9 Balaban, N.Q. et al. (2004) Bacterial persistence as a phenotypic switch. Science 305, 1622–1625
10 McAdams, H.H. and Arkin, A. (1997) Stochastic mechanisms in gene expression. Proc. Natl. Acad. Sci. U.S.A. 94, 814–819
11 Hornung, G. et al. (2012) Noise–mean relationship in mutated promoters. Genome Res. 22, 2409–2417
12 Lu, H.P. et al. (1998) Single-molecule enzymatic dynamics. Science 282, 1877–1882
13 Spencer, S.L. et al. (2009) Non-genetic origins of cell-to-cell variability in TRAIL-induced apoptosis. Nature 459, 428–432
14 Tay, S. et al. (2010) Single-cell NF-κB dynamics reveal digital activation and analogue information processing. Nature 466, 267–271
15 Albeck, J.G. et al. (2013) Frequency-modulated pulses of ERK activity transmit quantitative proliferation signals. Mol. Cell 49, 249–261
16 Snijder, B. et al. (2009) Population context determines cell-to-cell variability in endocytosis and virus infection. Nature 461, 520–523
17 Werfel, J. et al. (2013) How changes in extracellular matrix mechanics and gene expression variability might combine to drive cancer progression. PLoS ONE 8, e76122
18 Baserga, R. (1985) The Biology of Cell Reproduction, Harvard University Press
19 Alberts, B. et al. (2007) Molecular Biology of the Cell (5th revised edn), Garland Publishing
20 Berg, R.D. (1996) The indigenous gastrointestinal microflora. Trends Microbiol. 4, 430–435
21 Shalek, A.K. et al. (2013) Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236–240
22 Araten, D.J. et al. (2005) A quantitative measurement of the human somatic mutation rate. Cancer Res. 65, 8111–8117
23 Clamp, M. et al. (2007) Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. U.S.A. 104, 19428–19433
24 Perrat, P.N. et al. (2013) Transposition-driven genomic heterogeneity in the Drosophila brain. Science 340, 91–95
25 Drummond, D.A. and Wilke, C.O. (2009) The evolutionary consequences of erroneous protein synthesis. Nat. Rev. Genet. 10, 715–724
26 Pelechano, V. et al. (2013) Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497, 127–131
27 Ferrell, J.E. and Machleder, E.M. (1998) The biochemical basis of an all-or-none cell fate switch in Xenopus oocytes. Science 280, 895–898
28 Süel, G.M. et al. (2006) An excitable gene regulatory circuit induces transient cellular differentiation. Nature 440, 545–550
29 Raj, A. et al. (2010) Variability in gene expression underlies incomplete penetrance. Nature 463, 913–918
30 Barkoulas, M. et al. (2013) Robustness and epistasis in the C. elegans vulval signaling network revealed by pathway dosage modulation. Dev. Cell 24, 64–75
31 Blair, D.R. et al. (2013) A nondegenerate code of deleterious variants in Mendelian loci contributes to complex disease risk. Cell 155, 70–80
32 Carter, A.J. and Houle, D. (2011) Artificial selection reveals heritable variation for developmental instability. Evolution 65, 3558–3564
33 Yvert, G. et al. (2013) Single-cell phenomics reveals intra-species variation of phenotypic noise in yeast. BMC Syst. Biol. 7, 54
34 Ziv, N. et al. (2013) Genetic and nongenetic determinants of cell growth variation assessed by high-throughput microscopy. Mol. Biol. Evol. http://dx.doi.org/10.1093/molbev/mst138
35 Beaumont, H.J. et al. (2009) Experimental evolution of bet hedging. Nature 462, 90–93
36 Ansel, J. et al. (2008) Cell-to-cell stochastic variation in gene expression is a complex genetic trait. PLoS Genet. 4, e1000049
37 Fehrmann, S. et al. (2013) Natural sequence variants of yeast environmental sensors confer cell-to-cell expression variability. Mol. Syst. Biol. 9, 695
38 Jimenez-Gomez, J.M. et al. (2011) Genomic analysis of QTLs and genes altering natural variation in stochastic noise. PLoS Genet. 7, e1002295
39 Yang, J. et al. (2012) FTO genotype is associated with phenotypic variability of body mass index. Nature 490, 267–272
40 Fraser, H.B. and Schadt, E.E. (2010) The quantitative genetics of phenotypic robustness. PLoS ONE 5, e8635
41 Visscher, P.M. and Posthuma, D. (2010) Statistical power to detect genetic loci affecting environmental sensitivity. Behav. Genet. 40, 728–733
42 Brouzes, E. et al. (2009) Droplet microfluidic technology for single-cell high-throughput screening. Proc. Natl. Acad. Sci. U.S.A. 106, 14195–14200
43 Leung, C.T. and Brugge, J.S. (2012) Outgrowth of single oncogene-expressing cells from suppressive epithelial environments. Nature 482, 410–413
44 Lynch, M. and Walsh, B. (1998) Genetics and Analysis of Quantitative Traits, Sinauer
45 Geldermann, H. (1975) Investigations on inheritance of quantitative characters in animals by gene markers I. Methods. Theor. Appl. Genet. 46, 319–330
46 Brems, H. et al. (2009) Mechanisms in the pathogenesis of malignant tumours in neurofibromatosis type 1. Lancet Oncol. 10, 508–515
47 Veitia, R.A. (2005) Stochasticity or the fatal imperfection of cloning. J. Biosci. 30, 21–30
48 Cook, D.L. et al. (1998) Modeling stochastic gene expression: implications for haploinsufficiency. Proc. Natl. Acad. Sci. U.S.A. 95, 15641–15646
49 Bosl, W.J. and Li, R. (2010) The role of noise and positive feedback in the onset of autosomal dominant diseases. BMC Syst. Biol. 4, 93
50 Kemkemer, R. et al. (2002) Increased noise as an effect of haploinsufficiency of the tumor-suppressor gene neurofibromatosis type 1 in vitro. Proc. Natl. Acad. Sci. U.S.A. 99, 13783–13788


Review

The domestication and evolutionary ecology of apples


Amandine Cornille1,2, Tatiana Giraud1,2, Marinus J.M. Smulders3, Isabel Roldán-Ruiz4, and Pierre Gladieux1,2

1 Centre National de la Recherche Scientifique (CNRS), Unité Mixte de Recherche (UMR) 8079, Bâtiment 360, 91405 Orsay, France
2 Université Paris Sud, UMR 8079, Bâtiment 360, 91405 Orsay, France
3 Wageningen UR Plant Breeding, Wageningen University & Research Centre, Wageningen, The Netherlands
4 Growth and Development Group, Plant Sciences Unit, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium

The cultivated apple is a major fruit crop in temperate zones. Its wild relatives, distributed across temperate Eurasia and growing in diverse habitats, represent potentially useful sources of diversity for apple breeding. We review here the most recent ndings on the genetics and ecology of apple domestication and its impact on wild apples. Genetic analyses have revealed a Central Asian origin for cultivated apple, together with an unexpectedly large secondary contribution from the European crabapple. Wild apple species display strong population structures and high levels of introgression from domesticated apple, and this may threaten their genetic integrity. Recent research has revealed a major role of hybridization in the domestication of the cultivated apple and has highlighted the value of apple as an ideal model for unraveling adaptive diversication processes in perennial fruit crops. We discuss the implications of this knowledge for apple breeding and for the conservation of wild apples. The genus Malus as a model system for understanding fruit tree evolution Apple (Malus domestica Borkh) is one of the most important fruit crops in temperate regions worldwide in terms of production levels (http://faostat.fao.org/), and it occupies a central position in folklore, culture, and art [1]. Dessert apples are popular because of their taste, nutritional properties, storability, and convenience of use. Apple-based beverages such as ciders have been consumed for centuries by the peoples of Eurasia, even before the advent of the cultivated apple. Humans have been exploiting, selecting, and transporting apples for centuries, and several thousand apple cultivars have been documented [2]. Much of the diversity present in domesticated apples is currently maintained in germplasm (see Glossary) repositories and amateur collections, including a broad spectrum of
Corresponding author: Cornille, A. (amandine.cornille@gmail.com). Keywords: QTL; archaeological record; crabapple; genome; microsatellites; Malus domestica. 0168-9525/$ see front matter 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.10.002

cultivars of largely unknown history and pedigree [2,3]. Apples are self-incompatible, and it must soon have become apparent that offspring grown from seed frequently did not resemble the mother apple tree. This high degree of variability of the progeny, combined with a long juvenile phase, probably complicated and slowed the articial selection of interesting phenotypes by early farmers. The introduction of vegetative propagation by grafting, and the selection of dwarfed apple trees for use as rootstocks, were thus key events in the history of apples, facilitating the Glossary
Crabapple: wild apple species, usually producing profuse blossom and small acidic fruits. The word 'crab' comes from the Old English 'crabbe', meaning bitter or sharp tasting. Many crabapples are cultivated as ornamental trees and their apples are sometimes used for preserves. In Western Europe the term crabapple is often used to refer to Malus sylvestris (the European crabapple), in the Caucasus this term refers to Malus orientalis (the Caucasian crabapple) and, in Siberia, to Malus baccata (the Siberian crabapple). The native North American crabapples are Malus fusca, Malus coronaria, Malus angustifolia, and Malus ioensis. Malus sieversii, the main progenitor of the cultivated apple, is usually not referred to as a crabapple.
Diachronic: changing over time.
Germplasm: collection of genetic resources for a species.
Hybridization: the process of interbreeding between individuals from different gene pools.
Introgression: the transfer of genomic regions from one species into the gene pool of another species through an initial hybridization event followed by repeated backcrosses.
Isolation by distance: pattern of correlation between genetic differentiation and geographic distance, resulting from a balance between genetic drift and dispersal.
Panmictic group: random mating population.
QTL mapping: identification of genomic regions governing a phenotypic trait by the statistical association of molecular markers with the phenotypic trait in a population derived from a cross between two or more parental lines (a segregating population) or in a population of unrelated individuals (an association mapping population).
Rootstocks: part of a plant (usually the underground part) onto which a graft is fixed. In apples, the genotypes of the rootstocks used often result in dwarf apple trees.
Self-incompatible: a plant incapable of self-fertilization because its own pollen is prevented from germinating on the stigma or because the pollen tube is blocked before it reaches the ovule.
Simple sequence repeats (SSR): also known as microsatellites; tandem repetitions of di-, tri-, or tetranucleotide motifs.
SSR marker (or microsatellite marker): a marker based on PCR amplification and separation of SSR alleles with differences in repeat length.
Vegetative propagation: natural or assisted asexual reproduction in plants through the regeneration of tissues or plant organs from one plant part or tissue. Natural vegetative propagation methods include bulbs or corms, for example. Horticultural vegetative propagation may involve stem or root cuttings or grafting. Grafting is often used for the propagation and maintenance of cultivars of interest in perennial fruit crops.

Trends in Genetics, February 2014, Vol. 30, No. 2


Review
selection and propagation of superior genotypes derived from open-pollination progenies (i.e., chance seedlings) [4], and improving control over the size of mature apple trees [1]. Despite the immense diversity of cultivars available, apple production worldwide is now largely based on the cultivation of a few dozen ornamental and edible cultivars, grafted onto fewer than a dozen different clonal rootstocks, with high levels of chemical inputs. Wild Malus taxa constitute a useful source of variation for apple breeding. They are also of ecological importance as a source of food for wildlife or as components of hedges in agricultural landscapes and, as such, they make a significant contribution to ecosystem services. The use of these natural resources in breeding and the development of effective conservation programs require a good understanding of the genetic relationships within and between cultivated and related wild Malus taxa. More generally, the cultivated apple and its wild relatives constitute a suitable model system for studies of the domestication of long-lived perennial crops, about which much less is known than for seed-propagated annual crops. Recent population studies have provided insight into the origins of the cultivated apple, its domestication history, and its spread by human populations, the status and genetic integrity of its wild relatives, and the response to selection of these long-lived perennials. Here, we review recent progress in studies of the origin and diversity of cultivated apple and its wild relatives. We also explore future prospects for increasing our understanding of the nature and genetic architecture of the phenotypic changes


associated with apple domestication across the wide range of habitats and environmental conditions in which wild apples grow.

The status and genetic integrity of cultivated and wild apple taxa

The apple belongs to the Rosaceae family, as do other major temperate fruit tree species (e.g., pear, apricot, peach, plum, cherry, and almond). The genus Malus consists of about 30 species and several subspecies, but the taxonomy of this genus is complex, unclear, and likely to be revised in the future [5]. Species delimitation in apples is hampered by the paucity of diagnostic morphological features [6], the fact that records are lacking concerning the geographic range of some taxa, and the prevalence of hybridization between species. Hybridization among apples is facilitated by a lack of interspecific reproductive barriers, self-incompatibility, and the cultivation of apple in areas in which wild apple populations occur naturally [7–11]. The cultivated (or sweet) apple is usually designated as Malus pumila Mill. or M. domestica Borkh., but many other synonyms, now considered illegitimate, have been used. The name M. pumila seems to be more legitimate in terms of the rules of botanical nomenclature, but M. domestica is the name most widely used, and this name will be used here. M. domestica belongs to the section Malus, one of the five sections within the genus Malus, and is a diverse crop species containing a large number of wild-introgressed individuals and feral populations [12]. The cultivated apple was initially domesticated from the

Figure 1 labels. Panel (A): map with scale bar (0, 200, 400 km); (1) Origin: Tian Shan; (2) Hybridization along the Silk Routes; species shown: Malus sieversii, Malus orientalis, Malus sylvestris, Malus domestica. Panel (B): genealogy of BACC, SIEV, DOM, OR, and SYL. Legend entries: genetic markers (SSR* and chloroplast genome; SSR; SSR, chloroplast, and nuclear sequences); dates (10 000–4000 ya; 4500–1500 ya; 1500 ya; 100 ya); events (hybridization between wild and cultivated apples; crop-to-wild hybridization; bidirectional hybridizations). *SSR: simple sequence repeats (microsatellites).

Figure 1. Evolutionary history of the cultivated apple. (A) This history was revealed by recent population studies using different types of molecular markers for evolutionary inferences. (1) Origin in the Tian Shan Mountains from Malus sieversii, followed by (2) dispersal from Asia to Europe along the Silk Route, facilitating hybridization and introgression from the Caucasian and European crabapples. Arrow thickness is proportional to the genetic contribution of various wild species to the genetic makeup of Malus domestica. (B) Genealogical relationships between wild and cultivated apples. Approximate dates of the domestication and hybridization events between wild and cultivated species are detailed in the legend. Abbreviations: BACC, Malus baccata; DOM, M. domestica; OR, Malus orientalis; SIEV, M. sieversii; SYL, Malus sylvestris; ya, years ago.

wild apple Malus sieversii (Ldb.) Roem in the Tian Shan Mountains in Central Asia [11,13–15]. People then took the domesticated apples westwards along the great trade routes known as the Silk Route (Figure 1), where they came into contact with other wild apples, such as Malus baccata (L.) Borkh. in Siberia, Malus orientalis Uglitz. in the Caucasus, and Malus sylvestris Mill. in Europe (Box 1). These three crabapple species and M. sieversii are considered to be the closest relatives of M. domestica, with which they are fully interfertile [12]. The four wild apple relatives


are widely distributed across temperate areas in Eurasia where they often grow as low-density populations (except for M. sieversii) in a wide range of habitats and environmental conditions (Box 1). Several recent studies have used molecular markers to investigate the diversity, structure, and dispersal capacities of wild apple relatives (Box 2) other than M. baccata, which has been little studied to date. They have provided highly relevant and timely information, helping to identify and protect germplasm sources for apple breeding for traits

Box 1. Ecology of wild relatives of the cultivated apple


Three decades of field sampling expeditions in Eurasia by different research groups have provided a clearer picture of the geographic distributions and ecology of wild apple relatives (Figure I). The four wild apple species, Malus sylvestris, Malus orientalis, Malus sieversii, and Malus baccata, are mostly pollinated by bees and flies [36,51]. Diverse wild animals, including mammals and large birds, feed on the fruits, but their respective efficiencies as seed-dispersal vectors are unknown [1,36]. The wild apple species bear plentiful fragrant white flowers and red to yellow fruits of about 1–6 cm in diameter (Figure I) that are variable in shape, color, and taste. The fruits of M. orientalis, M. baccata (http://www.efloras.org/), and M. sylvestris are edible and were probably already used before the spread of cultivated apple [1]. M. sieversii (Ldb.) Roem grows at intermediate elevations (typically 900–1600 m) in the mountainous regions of Central Asia, and is extremely variable in growth habit, height, fruit quality, and fruit size [1,10,35]. Trees growing in this area, on their own roots, can bear fruit approaching the size of commercial cultivars (i.e., >60 mm diameter), with excellent characteristics that are highly appreciated locally. Currently, M. sieversii remains in only a few small intact forests of no more than several hundred hectares in extent in the Tian Shan, at the border of Kazakhstan and China, in southern Kazakhstan, with further small stands in Kyrgyzstan (Figure I). It is also reported to occur sporadically across an area comprising Turkmenistan, Uzbekistan, Tajikistan, and northeastern Afghanistan (Figure I). M. sieversii populations experienced severe overharvesting during the Soviet era, and the remaining populations are further threatened by forest destruction [18]. M. sylvestris Mill. occurs across Western and Central Europe [52] (Figure I).
The European crabapple has a high light requirement, resulting in its occurrence mostly at forest edges, in farmland hedges, and on marginal sites. It grows on almost all soils but prefers the wet edge of the forest (http://www.euforgen.org/). This species has suffered from the abandonment of coppicing, a traditional forest management practice designed to rejuvenate old trees or shrubs by cutting them to ground level, thus promoting the regeneration of new stems from the base. The European crabapple is now considered endangered in Belgium, the Czech Republic, and Germany [52]. M. orientalis Uglitzh occurs in the Caucasian region (Turkey, Armenia, Georgia, and southwestern Russia) but little is known about its ecology (Figure I). M. baccata occurs across Siberia and South Asia (India, Pakistan, and Nepal) and is probably relatively cold-tolerant, although little is known about its ecology (Figure I).

Figure I labels. Fruit diameters: Malus sylvestris, 1–3 cm; Malus baccata, 1 cm; Malus orientalis, 2–4 cm; Malus sieversii, up to 8 cm. Map scale bar: 200, 400 km. Disk-size legend (number of accessions): >30, 10, 1.

Figure I. Distribution and key morphological features of wild apples. The distribution of species can be inferred from the geographic origin of accessions included in published studies: Malus sylvestris (blue) [15,17,20,34,36], Malus orientalis (yellow) [15,17,20–22], Malus sieversii (red) [5,14,15,17–20,53], and Malus baccata (purple) [15]. Disk areas are proportional to the number of accessions. Pictures of the fruit of the different wild apples are provided, as well as their respective diameters.

Box 2. Phylogeography, population structure, and dispersal of wild apples
The population structure of the four wild apple species has recently been investigated with microsatellites [17,19,34]. A weak spatial genetic population structure was detected for each species, consistent with the high level of interpopulation gene flow. Estimates of historical gene flow have suggested that wild apple relatives have high dispersal capacities, but further investigations of contemporary and historical pollen and seed dispersal distances are required. The European crabapple, Malus sylvestris, is the wild apple relative with the most thoroughly investigated biogeographic history. M. sylvestris experienced range contraction in glacial refugia in Southern Europe during the last glacial maximum. At the beginning of the Holocene, when the climate became warmer in Europe, it recolonized Northern Europe. Three main populations have been identified in Western Europe, around the Carpathian Mountains, and in the Balkan Peninsula with admixture in their suture zones [17]. Further investigation, including high-density sampling in France, has revealed an alternative structure, with five distinct clusters in Italy, western France/Great Britain/Belgium, eastern France, Sweden/Denmark/Norway, and the Balkans (Cornille et al., unpublished). Malus sieversii exists as a main population spread over Central Asia and a smaller population (101 individuals) in the Tian Shan Mountains


[20]. Comparison of the population structure of M. sieversii with that of one of its major fungal pathogens, Venturia inaequalis, revealed similar patterns of genetic differentiation in the M. sieversii forests of the eastern mountains of Kazakhstan [54]. This correspondence suggests possible co-structuring of the populations of the main apple progenitor, M. sieversii, and its chief pathogen. The use of another dataset (949 individuals) [19] revealed stronger population differentiation in Kazakhstan than previously reported [20], with two large populations and two others with narrow distributions. The difference in sampling density (101 versus 949 individuals, respectively) may account for these apparent differences in population structure. Malus orientalis displays a weak north–south spatial genetic structure, with three distinct populations: a large population corresponding to most of the Armenian samples and two more narrowly distributed populations, one in the Southern Caucasus (Turkey and southern Armenia) and the other at more northerly latitudes (in Russia). The population structure of Malus baccata has not yet been investigated, mainly due to the lack of samples across its distribution. The four wild relatives may have geographic distributions contiguous with those of other wild species [1], but detailed information about the distribution of these other wild populations is lacking.
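Spatial genetic structure of the kind summarized in this box is often assessed as isolation by distance, that is, a correlation between pairwise genetic differentiation and geographic distance. The following is a minimal sketch of that correlation (the statistic underlying a Mantel test, without its permutation step). The function name and all distance values are hypothetical and are not taken from the studies cited here.

```python
from math import sqrt


def distance_correlation(genetic, geographic):
    """Pearson correlation between paired genetic and geographic distances.

    Each list holds one value per pair of populations, i.e. the flattened
    upper triangle of the corresponding distance matrix.
    """
    if len(genetic) != len(geographic) or len(genetic) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(genetic)
    mean_gen = sum(genetic) / n
    mean_geo = sum(geographic) / n
    cov = sum((g - mean_gen) * (d - mean_geo)
              for g, d in zip(genetic, geographic))
    var_gen = sum((g - mean_gen) ** 2 for g in genetic)
    var_geo = sum((d - mean_geo) ** 2 for d in geographic)
    if var_gen == 0 or var_geo == 0:
        raise ValueError("distances must not all be identical")
    return cov / sqrt(var_gen * var_geo)


# Hypothetical values for six population pairs (illustration only).
fst = [0.010, 0.015, 0.020, 0.018, 0.030, 0.035]  # pairwise differentiation
km = [100, 250, 400, 500, 900, 1200]              # pairwise distance in km

r = distance_correlation(fst, km)
```

These made-up values give a strong positive correlation. A weak isolation-by-distance pattern would instead show up as a small positive value of `r`, and its significance would still need to be tested, for example by permutation.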

such as resistance to pathogens (e.g., Venturia inaequalis, the fungus causing apple scab [16]). Population genetic analyses of large collections of wild apple relatives have uncovered signatures of postglacial recolonization from distinct glacial refugia in M. sylvestris [17], and high levels of diversity and complex population structures in M. sieversii [10,18–20] and M. orientalis [20–23]. Genetic data have also revealed weak isolation by distance (Box 2). Ubiquitous and high levels of introgression from domesticated apple were found in three of the four wild apple relatives (Box 3), revealing a threat to their genetic integrity. Thus, wild apple relatives clearly have a high dispersal capacity, and the widespread cultivation of domesticated apples in temperate latitudes has promoted extensive crop-to-wild gene flow. The mating system of apple species, characterized by a self-incompatibility mechanism that ensures outbreeding, may have played a crucial role in these introgressions. This mechanism is controlled by a single locus, the S locus, with a large number of alleles, up to several hundred in some species. In apples, individual trees can only mate with other individuals with different S alleles. The self-incompatibility of apples may thus have favored interspecific hybridization and introgression not only by forcing outcrossing, but also through adaptive introgression, because S locus alleles that introgress from a closely related species, if rare in the recipient species, can rapidly spread by positive selection [24,25]. Population genetic methods are increasingly used to resolve taxonomic uncertainties or to identify misclassified germplasm in apple collections. Admixture analyses have revealed that Malus kirghisorum, the wild apple from the forests of Kyrgyzstan, and Malus asiatica, the apple formerly cultivated in China, are in fact indistinguishable from M. sieversii [14].
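The mating advantage of a rare S allele can be illustrated with a toy calculation. This is a minimal sketch that assumes a gametophytic rule, under which a pollen grain carrying one S allele is rejected by any pistil whose genotype contains that allele. The allele names, genotypes, and function names are hypothetical.

```python
def pollen_accepted(pollen_allele, pistil_genotype):
    """Return True if the pistil shares no S allele with the pollen grain."""
    return pollen_allele not in pistil_genotype


def fraction_receptive(pollen_allele, population):
    """Fraction of trees in `population` whose pistils accept the pollen."""
    if not population:
        raise ValueError("population must not be empty")
    accepted = sum(pollen_accepted(pollen_allele, tree) for tree in population)
    return accepted / len(population)


# Hypothetical recipient population: S1-S3 are common, whereas S9 has
# only just arrived by introgression and is carried by a single tree.
population = [
    ("S1", "S2"),
    ("S1", "S3"),
    ("S2", "S3"),
    ("S1", "S2"),
    ("S3", "S9"),
]

common = fraction_receptive("S1", population)  # rejected by every S1 carrier
rare = fraction_receptive("S9", population)    # rejected by one tree only
```

In this example, pollen carrying the rare allele is accepted by four of the five trees, against two of five for the common allele. A difference of this kind is what could allow an introgressed S allele to spread.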
Multilocus microsatellite typing of the US Department of Agriculture (USDA) Agricultural Research Service (ARS) National Plant Germplasm System (NPGS) collection (Geneva, USA) suggested that hybrids and misclassifications may also be common in germplasm repositories [26,27]. Further surveys of the diversity and distribution of wild apples

should facilitate the conservation of wild genetic resources in situ (e.g., conservation of genetically differentiated populations) or ex situ (e.g., establishment of core collections of pure wild individuals, maximizing genetic diversity), ultimately enabling their optimal use [27,28].

The enduring riddle of the genetic makeup of cultivated apples

Do domesticated apples actually exist?

The debate concerning whether a crop or animal species should be considered domesticated is similar to that regarding the species concept [29]. Definitions of the boundaries of domesticated taxa are essential in studies of domestication because the classification of domestication-related traits and of the underlying genes can have a major impact on evolutionary inferences. De Queiroz [6] argued that all modern biologists agree that species correspond to segments of evolutionary lineages evolving independently from each other. The seemingly endless dispute about species concepts stems from the confusion between species definition and species criteria, with different recognition criteria corresponding to different events occurring during lineage divergence, rather than to fundamental differences in what is considered to represent a species [6]. Applying a similar reasoning to the domestication process, domesticated species can be defined as segments of evolutionary lineages diverging from their wild progenitors in response to artificial selection pressures and human control over reproduction. Domestication, similarly to speciation, is often quantitative in nature, and different means of quantifying lineage independence can be used to measure different arbitrary stages along this continuum [29–31]. The later stages of divergence along the domestication continuum are characterized by the fixation of marked discontinuities between crops and their wild ancestors (i.e., domestication syndrome) which can be used to classify plants categorically as domesticated.
Earlier stages of divergence are more readily detected with rapidly evolving genetic markers.

Box 3. Crop-to-wild gene flow in apple
Self-incompatibility and large dispersal capacities make the genus Malus an outstanding system for the study of crop-to-wild and interspecific gene flow. The differentiation between domesticated and wild gene pools makes it possible to track cultivated varieties with genetic tools, facilitating inferences about population history such as movement, population subdivision, hybridization, and introgression. Multilocus microsatellite typing and Bayesian clustering methods, making use of the differentiation between domesticated and wild gene pools, recently revealed the existence of extensive crop-to-wild gene flow in apples [20,27,34,36,41]. Most studies have focused on Malus domestica to Malus sylvestris gene flow (Table I). Statistical models investigating the factors (environmental and human) influencing the extent of crop-to-wild introgression in apple (gene flow from M. domestica to the European crabapple) have revealed a significant positive effect of the number of apple orchards on crop-to-wild introgression rates in the European crabapple (Cornille et al., unpublished). Thus, human activity influences rates of hybridization


and introgression from the cultivated apple to M. sylvestris. Recent studies have also investigated crop-to-wild gene flow from modern commercial varieties imported from Western countries to Malus sieversii and Malus orientalis [15,20]. Introgression from the cultivated apple into the wild progenitor M. sieversii, and to a lesser extent into M. orientalis, has also been detected. The high potential for crop-to-wild gene flow must be considered in the management of wild apple resources. Phenotypic data have been reported in parallel with genetic data and have been compared between wild and cultivated apple species (Table I). The results obtained for molecular markers matched, to some extent, the morphological traits studied, but no clear-cut morphological criteria distinguishing between cultivated apple and the European crabapple were identified [34,36]. Comparisons of phenotypic data showed that admixed M. sieversii individuals had, on average, larger fruits than pure wild individuals [27]. However, it is difficult to determine the extent to which these admixed individuals are actually feral trees.

Table I. Crop-to-wild gene flow in apple^a

Species | N | Geographic area | DNA marker^b | Percentage of hybrids | Phenotypic data | Refs
Malus sylvestris | 44 | Belgium, Germany | SSR, AFLP | 6.8 | Hairiness of the underside of the leaf | [34]
Malus sylvestris | 178 | Denmark | SSR | 11.2 | Fruit diameter, fruit color, pubescence of terminal part of long shoots, pubescence of abaxial surface of leaves from long shoots, and pubescence of abaxial surface of leaves from spur shoots | [36]
Malus sylvestris | 159 | The Netherlands, Belgium | SSR | 17 | | [41]
Malus sylvestris | 61 | Germany, Macedonia | SSR | 19.7 | | [27]
Malus sylvestris | 796 | Austria, Belgium, Bosnia Herzegovina, Bulgaria, Denmark, France, Germany, Great Britain, Hungary, Italy, Norway, Poland, Romania, Ukraine | SSR | 36.7 | | [20]
Malus orientalis | 46 | Russia, Turkey | SSR | 13.0 | | [27]
Malus orientalis | 217 | Turkey, Armenia, Russia | SSR | 3.2 | | [20]
Malus sieversii | 118 | Kazakhstan | SSR | 14.4 | Mean fruit weight, length, and width | [27]
Malus sieversii | 101 | Kazakhstan, China, Kyrgyzstan, Tajikistan, and Uzbekistan | SSR | 14.8 | | [20]

^a Material features of each study: wild species, sampling size (N), geographic area covered, molecular and phenotypic data used, percentage of hybrids.
^b Abbreviations: AFLP, amplified fragment length polymorphisms; SSR, simple sequence repeats.
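The percentages of hybrids in Table I are proportions of sampled trees classified as admixed. As a rough sketch of that kind of calculation, the snippet below counts trees whose inferred ancestry from the cultivated gene pool exceeds a cutoff. The 0.10 threshold, the function name, and the membership values are hypothetical, and the studies cited may have used different criteria.

```python
def percent_hybrids(cultivated_membership, threshold=0.10):
    """Percentage of sampled trees classified as hybrids.

    `cultivated_membership` holds, for each tree sampled as wild, the
    inferred proportion of ancestry from the cultivated gene pool (0 to 1).
    A tree counts as a hybrid when that proportion exceeds `threshold`.
    """
    if not cultivated_membership:
        raise ValueError("at least one tree is required")
    n_hybrids = sum(1 for q in cultivated_membership if q > threshold)
    return 100.0 * n_hybrids / len(cultivated_membership)


# Hypothetical membership coefficients for ten trees (illustration only).
q_values = [0.01, 0.02, 0.35, 0.05, 0.60, 0.00, 0.03, 0.12, 0.04, 0.02]

share = percent_hybrids(q_values)
```

With these made-up values, three of the ten trees pass the cutoff, giving 30%. By the same arithmetic, the 6.8% reported for N = 44 in Table I is consistent with 3 hybrids among 44 trees (3/44 ≈ 6.8%).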

The domestication process, which typically involved open pollination and unconscious selection over thousands of years, has been followed by the recent process of crop improvement, in which breeders intentionally carry out crosses for the selection of new traits in crop varieties [32]. Furthermore, crop improvement traits are typically variable among the cultivars of a crop [33]. The modern breeding phase can thus complicate studies of the genetic basis of phenotypic changes occurring during the initial domestication phase. However, it is theoretically possible to distinguish between the two phases of the evolution of domesticated plants by inferring the age and origin of alleles contributing to agronomically important traits. Analyses of population structure based on comprehensive samples of cultivated and wild gene pools have revealed that the cultivated apple forms a distinct, panmictic group, well separated from its Central Asian progenitor, M. sieversii, with similar levels of genetic variation

in the two groups [15,27]. Divergent selection between wild and cultivated gene pools, extensive selection on domesticates by humans outside the center of origin of the crop, and vegetative propagation have all contributed to a reduction of gene flow between cultivated apples and their progenitors, and the resulting pattern of clear population differentiation provides definitive evidence for the domesticated nature of apples. Several studies have investigated phenotypic variation in close relatives of M. domestica [10,23,27,34–36] but, to the best of our knowledge, no attempt has been made to quantify the morphological and physiological changes associated with apple domestication. These phenotypic changes are expected to be weaker in apples and other long-lived perennial fruit crops than in seed-propagated annuals because the lengthy juvenile phase and the use of grafting make domestication more recent, at least in terms of the number of generations [37]. Nevertheless, several traits clearly distinguish
domesticated long-lived perennials from their wild ancestors (see below), and the phenotypic changes associated with domestication may parallel those in annual crops [30].

Archaeological evidence

Several scholars have evaluated the archaeological and historical evidence relating to the origin and use of apples [1–3,12]. Evidence for the collection of wild apples has been found at Neolithic and Bronze Age archaeological sites across Europe. Similar evidence suggesting the early use of apples in Western and Central Asia is lacking, but some anomalies in the distribution of M. domestica, such as isolated patches of large sweet apples (i.e., different from native bitter crabapples) in the Caucasus Mountains, the Crimean Peninsula, parts of Afghanistan, Iran, Turkey, and the Kursk region of European Russia, and the large, possibly 3000-year-old apple discovered at Navan Fort (Northern Ireland), might be hallmarks of early apple seedling imports from Central Asia [1]. Advances in ancient DNA technology should facilitate diachronic investigations of the domesticated status of apple seed remains in the Neolithic and Bronze Age archaeological record across Eurasia, providing information about the transition from gathering to cultivation. The recent amplification of ribosomal DNA from apple remains (seed coat [testa] and pericarp) preserved in waterlogged layers from a Roman site in Switzerland suggests that it may be feasible to identify wild and domesticated apple from the archaeological record [38]. Insight into the human behavior responsible for the changes associated with domestication can also be gained by studying current practices such as the exploitation of wild apples in some areas of Western and Central Asia. Grafting, as a handy way to propagate elite cultivars, has probably played a key role in the spread of apple as a major fruit crop worldwide [1,11].
Grafting is a practice that is thought to have begun about 3800 years ago based on a cuneiform description of budwood importation for grape in Mesopotamia. Indirect evidence has been obtained for the cultivation of apples 3000 years ago in Mesopotamia [12], but the most compelling evidence for horticulture and for apple cultivation dates from the Greek period (i.e., from about the 3rd century BC) [1]. The Romans probably learned apple grafting, cultivation, harvesting, and storage from the Greeks, and brought the production chain technology to the rest of their empire. Dwarf apple trees were known 2300 years ago, and dwarfing rootstocks were probably brought to the West from the Caucasus. In the 19th century, dwarfing clones known as Paradise (or French Paradise) or Doucin (or English Paradise) were common in Europe, before the East Malling Research Station in England created a standardized collection of 10 rootstocks with new names. Two of the original East Malling selections, M9 (Paradis Jaune de Metz) and M7 (Doucin Reinette), are still widely used by horticulturists worldwide.

Genetic evidence

Having originated from M. sieversii in Central Asia about 4000 to 10 000 years ago, the cultivated apple then underwent hybridization with its wild relatives during its spread

from the Tian Shan Mountains westward along the Silk Route (Figure 1). The different timescales of the domestication process (ancient initial domestication in Central Asia followed by more recent hybridization during the route of spread of apple cultivation) have been reconstructed from panels of markers tracing back different evolutionary timescales (i.e., chloroplast, nuclear, and microsatellite markers). These genetic markers have revealed original patterns of diversity and differentiation within cultivated apple, and non-trivial relationships with its wild relatives. The various steps in the domestication of apple unraveled by the analysis of genetic markers are outlined below.

(i) Origin in the Tian Shan Mountains of Central Asia: was Malus sieversii the progenitor?

The morphological similarity between M. domestica and M. sieversii, and the extraordinary diversity of wild apples in Kazakhstan, were first reported by Vavilov [13] and then by Ponomarenko (1983, cited in [39]). These findings were recently confirmed by collaborative collection expeditions involving American, European, and Central Asian scientists [10,14,15,39]. The genetic data were found to be consistent with these morphological observations. Initial phylogenetic analyses of chloroplast and nuclear sequences confirmed a progenitor–descendant relationship, with M. sieversii–M. domestica being the pair of species most closely related genetically, based on sequences, and possibly not even distinguishable, but conclusions about the origin of this crop were hampered by the lack of strong statistical support and/or limited sample size [11,14]. Analyses of microsatellite data [i.e., simple sequence repeat (SSR) markers] for a comprehensive collection of wild and cultivated apples have provided new insight, revealing that M. domestica forms a distinct, panmictic group, well separated from M. sieversii [15,40]. This may indicate that the populations of M. sieversii from which the crop was domesticated have yet to be identified or may no longer exist. Malus sieversii is unusual in that it grows at high density over large areas, but these forests have been decimated since Vavilov reported their existence. The substantial genetic differentiation between M. domestica and M. sieversii may also result from a combination of genetic drift and introgressions from other wild apple species, erasing the genetic footprints of the initial contribution of M. sieversii. Genomic studies and further sampling of wild apple and cultivar diversity should provide quantitative and qualitative insight into the origin of M. domestica, and improve our understanding of the genetic architecture and basis of the phenotypic changes selected during domestication and subsequent crop improvement.

(ii) Diversification of the domesticated apple along the Silk Route

Several studies have highlighted the importance of the European crabapple, M. sylvestris, in the evolution and diversification of the cultivated apple and, to a lesser extent, that of the Caucasian crabapple M. orientalis, through the use of microsatellite markers and nuclear or chloroplast sequences [15,36,41,42].

M. sieversii was probably the initial progenitor, but other wild apple species subsequently contributed to the genetic makeup of cultivated apple through introgressive hybridization, affecting both nuclear and chloroplast DNA, during its dispersal westward along the Silk Route (Figure 1). These introgressions obscure the signal of any initial bottleneck occurring during the original domestication of M. domestica. Introgression has been so extensive that M. domestica now appears to be more similar genetically to the European crabapple M. sylvestris than to the Asian wild apple M. sieversii on the basis of microsatellite markers [15]. Most, but not all, apple cultivars now also appear to contain the chloroplast genome of M. sylvestris [42]. Genetic evidence has also demonstrated that current cider cultivars, which usually produce smaller fruits that are more bitter than those of dessert cultivars, are not the cultivars genetically closest to M. sylvestris [15], as initially hypothesized. This hypothesis was based on the known use of crabapples for the preparation of apple beverages, even before apple cultivation [1], and the astringent, bitter properties of these small apples, ideal for cider making. Dessert apple cultivars were actually found to be genetically closer to M. sylvestris than cider cultivars in studies using nuclear microsatellite markers [15]. Because cider is made and consumed across Eurasia, cider apple cultivars may also have originated in Central Asia where wild apple populations are highly diverse in terms of color, taste, and size.

Breeding methods for fruit tree crops and the phenotypic changes associated with apple domestication

The life-history traits of apples make it harder to fix the desired agronomic traits quickly in this crop than in annual seed-propagated crops. In particular, apples have long generation times and display obligate outcrossing due to their self-incompatibility system.
The development of appropriate breeding methods (i.e., open-pollination followed by the grafting of interesting phenotypes) resulted in a model of domestication fundamentally different from that for annual seed-propagated crops. Theoretically, the introduction of clonal propagation may have rapidly decreased the genetic diversity of cultivated apples because grafting permitted the genetic fixation of a limited set of elite clones that could only diversify by accumulating somatic mutations. However, no evidence of a domestication bottleneck or clonal population structure has been found in apple in studies considering a single representative of each variety [15]. This raises questions about how high levels of diversity have been maintained, and about the lack of a clonal or relatedness structure between cultivars. Long-prevailing apple breeding methods may account for this apparent paradox. If many farmers each independently selected trees producing good fruits from progeny arising from natural pollination (chance seedlings [10]), the obligate outcrossing of the crop, the isolation of farms, and differences in taste preferences across regions may have been powerful forces driving the maintenance of high levels of variation. The occasional

Trends in Genetics February 2014, Vol. 30, No. 2

selection of plants resulting from hybridization with wild relatives would have further increased diversity. These traditional methods may have contributed to a high level of diversity in the cultivated gene pool, but current breeding practice methods, in which a small set of cultivars is common to the pedigree of many new cultivars, may be eroding the genetic diversity available in the cultivated genepool. For instance, a genetic analysis of a comprehensive sample of commercial cultivars revealed extensive rst-, second- and third-degree relationships between popular apple varieties [43], as previously reported for grape [44]. This highlights the need to make better use of the diversity in apple germplasm [28,45]. Conversely, the current situation may also reect the extensive use of genetic diversity by breeders, but with the cultivars generated replacing each other in the production area rather than being grown alongside older varieties. This is the situation that applies for many vegetable and arable crops in which genetic diversity remains at a high level if considered over a longer period of time. Apple phenotypes selected by humans include: (i) higher productivity (e.g., number of fruits, fruit size, fruit weight); (ii) fruit quality (e.g., color, shape), avor (volatile compounds), taste (e.g., fruit acidity, sugar content, polyphenolics), texture (fruit esh rmness), and conservation (storage behavior); (iii) easier harvesting (e.g., growth habit, plant size); and (iv) greater phenological congruence with agricultural practices (e.g., shorter juvenile phases, synchronicity in blooming or fruit ripening). Crop improvement and cultivar diversication have affected many characteristics, including juvenile phase length, cold requirement, cold tolerance, drought tolerance, fruit tenacity, and disease resistance [3]. 
Rootstocks may also have been subject to selection for their effects on the productivity of aerial parts and their role in tolerance to edaphic, climatic, and biotic conditions. The domestication and improvement of cultivated apples may have benefited from the high heritability of quality traits [46]. Quantitative trait locus (QTL) mapping has been used to dissect the genetic architecture of several desired traits through crosses between cultivars [47]. However, QTL mapping has not been used to investigate the genetic architecture of domestication traits through wild × domesticated biparental crosses. Given the high costs of developing and maintaining mapping populations in apple, alternatives, such as candidate gene and bottom-up approaches, relying on population genetic analyses and functional information to connect selected (candidate) genes to phenotypes, are likely to be the best options for identifying the genomic regions that have undergone the greatest change during domestication [48,49]. The recently released Golden Delicious genome sequence [14], and the availability of large-scale genotyping tools [50], have made it possible to generate population-scale data for both domesticated and wild apples.
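As an illustration of the kind of population genetic analysis a bottom-up approach might rely on, the following Python sketch compares nucleotide diversity (π, the mean number of pairwise differences per site) between a wild and a domesticated sample for a single genomic window. The sequences are made-up toy data, and the function name `nucleotide_diversity` and the choice of a simple diversity ratio are assumptions made here for illustration; they are not taken from the studies cited above.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site among equal-length aligned sequences."""
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions for every pair of sequences.
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * n_sites)


# Toy alignments for one hypothetical genomic window (not real apple data).
wild = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGTAA", "ACGTACGAAC"]
domesticated = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]

pi_wild = nucleotide_diversity(wild)
pi_dom = nucleotide_diversity(domesticated)

# A ratio well below 1 flags a window that is less diverse in the crop,
# which would make it a candidate for closer functional follow-up.
ratio = pi_dom / pi_wild
print(f"pi_wild={pi_wild:.3f} pi_dom={pi_dom:.3f} ratio={ratio:.2f}")
```

In practice such a ratio would be computed across many windows and interpreted against the genome-wide distribution, because demography and introgression also shape local diversity.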

Concluding remarks

Recent studies of the evolutionary history of the domesticated apple and its wild relatives have shown this to be an outstanding biological model for exploring the evolutionary process at work during domestication. Full-genome resequencing and reference-assisted de novo assembly of domesticated and wild apple genomes will undoubtedly provide further insight into the history of this remarkable crop, facilitating the genetic dissection of agronomically and ecologically relevant traits. The main limiting factor may now be the availability of suitable accessions for evolutionary inferences and plant breeding rather than the development of genomic tools, highlighting the need for further field collections. Future sampling campaigns should focus on: (i) worldwide rootstock accessions (the origin of rootstocks is unknown despite their key role in the tolerance of cultivars to abiotic conditions); (ii) local and ancient cultivars from different parts of the world (Eastern Europe, Mediterranean, and Asia); and (iii) wild unsampled species and populations, from Central Asia in particular. Another unanswered question concerns the adaptive significance of the high levels of wild-to-crop gene flow: has human selection unwittingly favored gene flow from wild species? Do introgressions localize in neutral regions of the crop genome or at regions of functional significance? Regarding crop-to-wild gene flow, has natural selection retained M. domestica alleles introgressed into wild apples? Conversely, is introgression deleterious for crabapples, threatening wild populations, or will introgressions from the domesticated species be selected against? The application of population and comparative genomics to wild and cultivated apples provides new opportunities to investigate the molecular basis and genetic architecture of both (i) agronomic traits potentially selected and introgressed during domestication (e.g., fruit size, color, volatile content), and (ii) ecological traits contributing to the adaptation of wild apple populations to environmental and biotic factors (e.g., temperature, drought, pathogens).
Population and comparative genomics may also reveal which traits were selected during the early steps of domestication and which were selected during the modern breeding phase. Breeding programs will probably benefit from the incorporation of candidate alleles identified by evolutionary analyses, and there may be an increase in diversity from exotic germplasm in key genomic regions (such as apparently less diverse candidate regions). The ecological genomics of apples sensu lato will, therefore, not only advance our understanding of the history of this fascinating system but will also guide and accelerate the development of new breeding strategies for increasing the volume and quality of apple production in the face of dynamic abiotic and biotic threats.

Acknowledgments
We thank the editor and two referees for comments on a previous version of this manuscript. We thank Ilya Zacharov for photographs of M. baccata and Pascal Heitlzer for photographs of M. sieversii.


References
1 Juniper, B.E. and Mabberley, D.J. (2006) The Story of the Apple, Timber Press
2 Morgan, J. and Richards, A. (2003) The New Book of Apples, Brogdale Horticultural Trust, Ebury Press
3 Janick, J. (2005) The origins of fruits, fruit growing, and fruit breeding. Plant Breed. Rev. 25, 255–321
4 Gardiner, S.E. et al. (2007) Apple. In Genome Mapping and Molecular Breeding in Plants, Fruits and Nuts (Kole, C., ed.), pp. 1–62, Springer
5 Robinson, J.P. et al. (2001) Taxonomy of the genus Malus Mill. (Rosaceae) with emphasis on the cultivated apple, Malus domestica Borkh. Plant Syst. Evol. 226, 35–58
6 De Queiroz, K. (1999) The general lineage concept of species, species criteria, and the process of speciation. In Endless Forms: Species and Speciation (Howard, D.J. and Berlocher, S.H., eds), pp. 49–89, Oxford University Press
7 Korban, S.S. (1986) Interspecific hybridization in Malus. HortScience 21, 41–48
8 Way, R.D. et al. (1991) Apples (Malus). Acta Hort. 290, 3–46
9 Watkins, R. (1995) Apple and pear. In Evolution of Crop Plants (Smartt, J. and Simmonds, N.W., eds), pp. 418–422, Longman
10 Forsline, P.L. et al. (2003) Collection, maintenance, characterization and utilization of wild apples of Central Asia. Hort. Rev. 29, 1–61
11 Harris, S.A. et al. (2002) Genetic clues to the origin of the apple. Trends Genet. 18, 426–430
12 Zohary, D. and Hopf, M. (2000) Domestication of Plants in the Old World (3rd edn), Oxford University Press
13 Vavilov, N.I. (1926) Studies on the origin of cultivated plants. Tr. Byuro Prikl. Bot. 16, 139–245
14 Velasco, R. et al. (2010) The genome of the domesticated apple (Malus domestica Borkh.). Nat. Genet. 42, 833–839
15 Cornille, A. et al. (2012) New insight into the history of domesticated apple: secondary contribution of the European wild apple to the genome of cultivated varieties. PLoS Genet. 8, e1002703
16 Lê Van, A. et al. (2013) Differential selection pressures exerted by host resistance quantitative trait loci on a pathogen population: a case study in an apple × Venturia inaequalis pathosystem. New Phytol. 197, 899–908
17 Cornille, A. et al. (2013) Post-glacial recolonization history of the European crabapple (Malus sylvestris Mill.), a wild contributor to the domesticated apple. Mol. Ecol. 22, 2249–2263
18 Zhang, C. et al. (2007) Genetic structure of Malus sieversii population from Xinjiang, China, revealed by SSR markers. J. Genet. Genomics 34, 947–955
19 Richards, C. et al. (2009) Genetic diversity and population structure in Malus sieversii, a wild progenitor species of domesticated apple. Tree Genet. Genomes 5, 339–347
20 Cornille, A. et al. (2013) Crop-to-wild gene flow and spatial genetic structure in the closest wild relatives of the cultivated apple. Evol. Appl. 6, 737–748
21 Volk, G.M. et al. (2008) Genetic diversity and disease resistance of wild Malus orientalis from Turkey and Southern Russia. J. Am. Soc. Hortic. Sci. 133, 383–389
22 Volk, G.M. et al. (2009) Capturing the diversity of wild Malus orientalis from Georgia, Armenia, Russia, and Turkey. J. Am. Soc. Hortic. Sci. 134, 453–459
23 Höfer, M. et al. (2013) Assessment of phenotypic variation of Malus orientalis in the North Caucasus region. Genet. Resour. Crop Evol. 60, 1463–1477
24 Castric, V. et al. (2008) Repeated adaptive introgression at a gene under multiallelic balancing selection. PLoS Genet. 4, e1000168
25 Joly, S. and Schoen, D.J. (2011) Migration rates, frequency-dependent selection and self-incompatibility locus in Leavenworthia (Brassicaceae). Evolution 65, 2357–2369
26 Van Treuren, R. et al. (2010) Microsatellite genotyping of apple (Malus domestica Borkh.) genetic resources in The Netherlands: application in collection management and variety identification. Genet. Resour. Crop Evol. 57, 853–865
27 Gross, B.L. et al. (2012) Identification of interspecific hybrids among domesticated apple and its wild relatives. Tree Genet. Genomes 8, 1223–1235
28 Myles, S. (2013) Improving fruit and wine: what does genomics have to offer? Trends Genet. 29, 190–196
29 Larson, G. and Burger, J. (2013) A population genetics view of animal domestication. Trends Genet. 29, 197–205
30 Miller, A. and Gross, B.L. (2011) From forest to field: perennial fruit crops domestication. Am. J. Bot. 98, 1389–1414
31 Nosil, P. (2012) Ecological Speciation, Oxford University Press
32 Gross, B.L. and Olsen, K.M. (2009) Genetic perspectives on crop domestication. Trends Plant Sci. 15, 529–537
33 Olsen, K.M. and Wendel, J.F. (2013) A bountiful harvest: genomic insights into crop domestication phenotypes. Annu. Rev. Plant Biol. 64, 47–70
34 Coart, E. et al. (2003) Genetic variation in the endangered wild apple (Malus sylvestris (L.) Mill.) in Belgium as revealed by amplified fragment length polymorphism and microsatellite markers. Mol. Ecol. 12, 845–857
35 Dzhangaliev, A.D. (2003) The wild apple tree of Kazakhstan. Hort. Rev. 00, 63–303
36 Larsen, A. et al. (2006) Hybridization and genetic variation in Danish populations of European crab apple (Malus sylvestris). Tree Genet. Genomes 2, 86–97
37 Pickersgill, B. (2007) Domestication of plants in the Americas: insights from Mendelian and molecular genetics. Ann. Bot. 100, 925–940
38 Schlumbaum, A. et al. (2012) Towards the onset of fruit tree growing north of the Alps: ancient DNA from waterlogged apple (Malus sp.) seed fragments. Ann. Anat. 194, 157–162
39 Luby, J.J. et al. (2001) Field resistance to fire blight in a diverse apple (Malus sp.) germplasm collection. J. Am. Soc. Hortic. Sci. 127, 245–253
40 Gharghani, A. et al. (2009) Genetic identity and relationships of Iranian apple (Malus domestica Borkh.) cultivars and landraces, wild Malus species and representative old apple cultivars based on simple sequence repeat (SSR) marker analysis. Genet. Resour. Crop Evol. 56, 829–842
41 Koopman, W.J.M. et al. (2007) Linked vs. unlinked markers: multilocus microsatellite haplotype-sharing as a tool to estimate gene flow and introgression. Mol. Ecol. 16, 243–256
42 Nikiforova, S.V. et al. (2013) Phylogenetic analysis of 47 chloroplast genomes clarifies the contribution of wild species to the domesticated apple maternal line. Mol. Biol. Evol. 30, 1751–1760
43 Noiton, D.A.M. and Alspach, P.A. (1996) Founding clones, inbreeding, coancestry, and status number of modern apple cultivars. J. Am. Soc. Hortic. Sci. 121, 773–782
44 Myles, S. et al. (2011) Genetic structure and domestication history of the grape. Proc. Natl. Acad. Sci. U.S.A. 108, 3530–3535
45 Feuillet, C. et al. (2008) Cereal breeding takes a walk on the wild side. Trends Genet. 24, 24–32
46 Kouassi, A. et al. (2009) Estimation of genetic parameters and prediction of breeding values for apple fruit-quality traits using pedigreed plant material in Europe. Tree Genet. Genomes 5, 659–672
47 Maliepaard, C. et al. (1998) Aligning male and female linkage maps of apple (Malus pumila Mill.) using multi-allelic markers. Theor. Appl. Genet. 97, 60–73
48 Morrell, P.L. and Clegg, M.T. (2007) Genetic evidence for a second domestication of barley (Hordeum vulgare) east of the Fertile Crescent. Proc. Natl. Acad. Sci. U.S.A. 104, 3289–3294
49 Ross-Ibarra, J. et al. (2007) Plant domestication, a unique opportunity to identify the genetic basis of adaptation. Proc. Natl. Acad. Sci. U.S.A. 104, 8641–8648
50 Chagné, D. et al. (2012) Genome-wide SNP detection, validation, and development of an 8K SNP array for apple. PLoS ONE 7, e31745
51 Deguines, N. et al. (2012) The whereabouts of flower visitors: contrasting land-use preferences revealed by a country-wide survey based on citizen science. PLoS ONE 7, e45822
52 Jacques, D. et al. (2009) Natural distribution and variability of the wild apple (Malus sylvestris) in Belgium. Belg. J. Bot. 142, 39–49
53 Volk, G.M. et al. (2005) Ex situ conservation of vegetatively propagated species: development of a seed-based core collection for Malus sieversii. J. Am. Soc. Hortic. Sci. 130, 203–210
54 Gladieux, P. et al. (2010) Evolution of the population structure of Venturia inaequalis, the apple scab fungus, associated with the domestication of its host. Mol. Ecol. 19, 658–674

Review

Neocentromeres: a place for everything and everything in its place


Kristin C. Scott1,2,3 and Beth A. Sullivan2,3

1 Institute for Genome Sciences & Policy, Duke University, DUMC 3382, Durham, NC 27708, USA
2 Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, NC 27710, USA
3 Division of Human Genetics, Duke University Medical Center, Durham, NC 27710, USA

Centromeres are essential for chromosome inheritance and genome stability. Centromeric proteins, including the centromeric histone centromere protein A (CENP-A), define the site of centromeric chromatin and kinetochore assembly. In many organisms, centromeres are located in or near regions of repetitive DNA. However, some atypical centromeres spontaneously form on unique sequences. These neocentromeres, or new centromeres, were first identified in humans, but have since been described in other organisms. Neocentromeres are functionally and structurally similar to endogenous centromeres, but lack the added complication of underlying repetitive sequences. Here, we discuss recent studies in chicken and fungal systems where genomic engineering can promote neocentromere formation. These studies reveal key genomic and epigenetic factors that support de novo centromere formation in eukaryotes.

Eukaryotes exhibit a range of centromeres

Preserving genome integrity is a major goal of cell division, because genetic information is passed from mother to daughter cells. The centromere (see Glossary) is essential to faithful chromosome segregation and genome stability. It is generally recognized that both genomic and epigenetic pathways are critical for establishing and maintaining functional centromeres. Centromeres are often defined by repetitive DNA, but unique sequences are present at endogenous centromeres of Schizosaccharomyces pombe, Candida albicans, and Gallus gallus. Centromeres can be small and similar in size and sequence, such as the 125-bp point centromere of Saccharomyces cerevisiae. Centromeres in larger eukaryotes are regional; the site of kinetochore assembly occurs at variably sized genomic regions, ranging from 40 kb to 5 Mb. In Caenorhabditis elegans, the chromosomes are holocentric, in that the centromere is formed along the length of each chromosome [1]. Sometimes, chromosomes contain two centromere regions. These
Corresponding authors: Scott, K.C. (kristin.scott@duke.edu); Sullivan, B.A. (beth.sullivan@duke.edu).
Keywords: CENP-A; transcription; replication; heterochromatin; histone; gene conversion.
0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.11.003

dicentrics are usually products of chromosome fusion and are typically unstable during cell division; the activity of one centromere is suppressed so that dicentric segregation occurs in the manner of a monocentric chromosome [2]. Inactive centromeres represent a class of centromeres that remains to be fully characterized. Neocentromeres are an intriguing type of centromere arising at atypical chromosomal sites, including chromosome arms or telomeres (reviewed in [3,4]). They are unique models for studying de novo centromere formation because they usually form on nonrepetitive DNA, yet recruit centromere proteins, and generally segregate faithfully during cell division. Neocentromeres were first described in humans in 1993 and, since then, over 100 have been identified. They are usually ascertained due to their presence on chromosomes associated with abnormal phenotypes. These include marker chromosomes that have been deleted or duplicated from endogenous chromosomes [5–7] or native or marker chromosomes in which the normal centromere has been repressed [8,9]. Although neocentromeres originating from nearly every human chromosome have been described, some appear to cluster in similar locations, such as the long arms of chromosomes 3, 4, 8, 13, and 15 [4,10]. These are not hotspots per se, because precise mapping of centromere protein-binding regions showed that the different neocentromeres form on distinct DNA sequences, even within the same genomic interval [11,12]. Furthermore, the sizes of the CENP-A domains on neocentromeres in the same genomic region can range fourfold (approximately 100–400 kb), emphasizing the plasticity of centromere assembly. Understanding human neocentromere formation has been limited by the retrospective nature of many analyses.

Glossary
CENP-A: histone H3 variant that replaces canonical H3 at centromeres.
Centromere: chromosomal locus at which the kinetochore is assembled and spindle microtubules attach.
HJURP/Scm3: the chaperone protein that assembles CENP-A into chromatin.
Immature and/or incomplete centromere: a chromosomal locus that contains CENP-A at low levels and/or fails to recruit a full complement of centromere and/or kinetochore proteins.
Kinetochore: the multiprotein structure that is assembled on centromeric DNA and facilitates chromosomal connection to spindle microtubules.
mardel(10): one of the first human neocentromeres to be described and characterized; it is a marker chromosome derived from the long arm of chromosome 10 on which a neocentromere formed on noncentromeric DNA.
Neocentromere: a centromere that forms at a nontypical genomic region and usually at sequences that differ from endogenous centromeres.


Trends in Genetics, February 2014, Vol. 30, No. 2

At the time of study, human neocentromeres are already stabilized in the karyotype. Mechanisms of their formation can only be inferred from their structure and chromosomal origin, thus underscoring the need for strategies to induce neocentromere formation experimentally. In this review, we discuss exciting, recent studies of controlled neocentromere formation that have extended understanding of genomic and epigenetic factors that govern de novo centromere formation.

Centromere specification through unique chromatin assembly

The diversity of eukaryotic centromeric DNAs contrasts with the common chromatin organization that is largely independent of the underlying DNA sequence. Within centromeric chromatin, the histone H3 variant CENP-A fully replaces canonical histone H3 in a subset of nucleosomes, so that centromeres contain a mixture of H3 nucleosomes and CENP-A nucleosomes [13,14]. Replenishment of CENP-A during each cell cycle is critical to centromere stability. New CENP-A is loaded into chromatin by the CENP-A-specific chaperone, Holliday junction recognition protein (HJURP) [Scm3 in fungi and Chromosome Alignment 1 (CAL1) in Drosophila]. Tethering HJURP to noncentromeric sites can seed a de novo centromere [15] that persists following HJURP disassociation, emphasizing the important role for CENP-A in centromere specification. In addition to CENP-A-containing chromatin, eukaryotic centromeres are also enriched for other types of chromatin. CENP-A chromatin forms the centromeric core and is surrounded by chromatin marked by histone H3 lysine 9 (H3K9) and H3K27 trimethylation [16,17]. CENP-A nucleosomes within the centromeric core of metazoans are interspersed with H3 nucleosomes methylated at K4 and K36 [18,19]. Such distinct chromatin domains exist at centromeres in organisms ranging from fungi to plants to humans, suggesting that chromatin organization is fundamentally important for centromere specification and/or function.
Surprisingly, many neocentromeres lack common chromatin features. At the mardel(10) neocentromere, CENP-A-containing subdomains are interspersed with histone H3 subdomains, indicating shared chromatin organization with endogenous centromeres [20]. However, 13q neocentromeres lack interspersed H3 nucleosomes and are defined by one major and one minor CENP-A domain [12]. Some neocentromeres contain varying amounts of heterochromatin, whereas others lack heterochromatin altogether [11]. The absence of a consistent chromatin environment raises questions about genomic and epigenetic features that influence neocentromere formation. Targeting CENP-A to certain noncentromeric sites can promote de novo centromere formation and recruitment of centromere proteins [21]. Yet, despite the requirement for CENP-A at functional centromeres, the presence of CENP-A is not always sufficient for its continued maintenance. Studies in Drosophila and human cultured cells have shown that global, ectopically expressed CENP-A/CID incorporates at several different genomic sites [22,23]. However, a complete protein repertoire of a fully functional centromere is not always recruited to every ectopic locus.


Similar immature or incomplete centromeres have been observed at sites where HJURP and CENP-A have been tethered [21]. The presence of the endogenous centromere might inhibit maturation of additional centromeres elsewhere on the same chromosome. However, a more likely explanation is that certain chromatin environments favor CENP-A incorporation and new centromere formation and/or maturation [24] (see below).

Neocentromeres arise near sites of former centromere function

What makes certain genomic regions particularly amenable to centromere assembly is unclear. Inferences of mechanism are confounded by potential selection bias for retention of human neocentromeres that are associated with the most viable, least deleterious phenotypes. Two recent studies in chicken cells and C. albicans support the notion that experimentally derived neocentromeres form at specific genomic locations [25,26]. To induce neocentromere formation in these organisms, an endogenous centromere was physically removed and replaced with a selectable marker (bleomycin in DT40 chicken cells and URA3 in C. albicans). Cells that lacked the endogenous centromere but could still grow in media containing G418 (chicken) or media lacking uracil (C. albicans) were identified as those that had formed neocentromeres. The centromeres of chicken chromosomes 5 and Z comprise nonrepetitive DNA, and their CENP-A regions span 30–40 kb, similar to other chicken centromeres. When a large (127-kb) portion of the Z centromere was conditionally deleted, neocentromeres formed in several locations, ranging from near either chromosome end to the middle of the chromosome arm [25] (Figure 1A). Neocentromere formation occurred most frequently near the original Z centromere. Although endogenous chicken centromeres have approximately 35-kb regions of concentrated CENP-A accumulation, a much larger region (approximately 2 Mb) surrounding the centromere region contains small amounts of nonkinetochore-associated CENP-A [25].
CENP-A enrichment in the flanking regions was low but still higher than in the rest of the genome. The preference for nonrandom neocentromere formation near the endogenous centromere was thought to be due to the presence of CENP-A in the flanking regions. Indeed, deletion of a smaller region (67 kb) of centromere 5 resulted in 97% of neocentromeres forming within a 3-Mb region near the original centromere (Figure 1B). Similar experiments in C. albicans, a pathogenic yeast in which each centromere is approximately 4.5 kb in size and defined by unique, nonrepetitive sequences [27], support the notion that centromere-proximal sites are highly amenable to neocentromere formation. Varying amounts (4.5–30 kb) of endogenous centromere regions (CEN1, CEN5, and CEN7) were deleted and replaced with a selectable reporter gene (Figure 2), and neocentromeres formed both proximal and distal to the centromere [26], agreeing with a previous study of neocentromere formation on chromosome 7 [28]. In a recent study, most neocentromeres preferentially formed between 1 kb and 13 kb from the location of the original centromere (Figure 2A–C) [26]. Interestingly, the neocentromeres that formed farther
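[Figure 1 graphic. Panel (A): chromosome Z (p arm, Cen, q arm) with a CENP-A enrichment profile (high to low) peaking over a 35-kb centromere; after the 127-kb deletion and bleo marker insertion, neocentromeres (neo) at five positions are labeled 6%, 7%, 76%, 5%, and 6%. Panel (B): chromosome 5 with the same CENP-A profile and a 35-kb centromere; after the 67-kb deletion and bleo marker insertion, neocentromeres are labeled 97% (within a 3-Mb region) and 3%. Key: CENP-A nucleosome; H3 nucleosome; CEN deleted region; neocentromere.]
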

Figure 1. Engineered neocentromeres in DT40 chicken cells arise nonrandomly near the original centromere. Endogenous chicken chromosomes Z and 5 contain centromere protein A (CENP-A) chromatin regions (black-filled curves; reddish-blue nucleosomes) that are approximately 35 kb in size. (A) Removal of 127 kb of the centromere region of chromosome Z, including the 35-kb CENP-A domain, and replacement with a bleomycin selectable marker cassette (blue box) using Cre-loxP genome engineering led to neocentromere formation (yellow boxes) at various sites along chromosome Z. The location of neocentromere formation was preferentially skewed, with 76% of neocentromeres forming proximal to the original centromere. (B) Low levels of CENP-A were detected by ChIP-seq in a 2-Mb region surrounding the endogenous centromere of chromosome 5. To test whether these regions of more modest CENP-A incorporation are capable of nucleating a centromere in the absence of the adjacent, more enriched CENP-A domain, a smaller (67-kb) region of centromere 5 was deleted. Nearly all neocentromeres (97%) formed adjacent to the original centromere, suggesting that, in chicken cells, nonkinetochore CENP-A-enriched chromatin can seed neocentromere formation in the absence of the original centromere. Drawings are not precisely to scale.
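
The positional skew reported in Figure 1 can be tabulated in a few lines. The Python sketch below records the percentages printed in the figure and reports, for each chromosome, the total and the most frequent class. The dictionary keys and the helper name `summarize` are hypothetical labels introduced here for illustration; only the percentages and the identification of the 76% and 97% classes as centromere-proximal come from the figure and its caption.

```python
# Percentages as printed in Figure 1. The key names are hypothetical labels
# added for illustration; they do not appear in the source figure.
chr_z_pct = {
    "site_1": 6,
    "site_2": 7,
    "centromere_proximal": 76,
    "site_4": 5,
    "site_5": 6,
}
chr_5_pct = {
    "within_3Mb_of_centromere": 97,
    "elsewhere": 3,
}


def summarize(pct_by_site):
    """Return (total percentage, most frequent class, its percentage)."""
    total = sum(pct_by_site.values())
    top_site = max(pct_by_site, key=pct_by_site.get)
    return total, top_site, pct_by_site[top_site]


for name, data in (("Z", chr_z_pct), ("5", chr_5_pct)):
    total, top_site, top_pct = summarize(data)
    print(f"chromosome {name}: total={total}%, most frequent={top_site} ({top_pct}%)")
```

Checking that each panel sums to 100% is a quick guard against transcription errors when figure values are re-entered by hand.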

from the site of CEN7 contained 35% of the amount of CENP-A compared with endogenous CEN7 levels, yet they were still viable. These findings agree with studies in humans indicating that centromeres with <20% of the normal amount of CENP-A retain almost normal function [18,29]. Although none of the yeast neocentromere strains exhibited significant chromosome loss, the neocentromeres located 13 kb away from the deleted endogenous

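[Figure 2 graphic. Panel (A): 4.5-kb CEN7 replaced by URA3, with neocentromeres (neo) labeled 3 kb and 1 kb. Panel (B): 6.5-kb deletion of the CEN7 region replaced by URA3, with a neocentromere labeled 13 kb. Panel (C): 30-kb deletion of the CEN7 region replaced by URA3, with a neocentromere labeled 2 kb. Panel (D): neocentromeres labeled 3 kb and 13 kb flanking URA3, with restoration of CEN7 by gene conversion. Key: CENP-A nucleosome; H3 nucleosome; CEN deleted region; neocentromere; incomplete and/or reverted neocentromere.]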

Figure 2. Induced neocentromeres in Candida albicans form at high frequency near the original centromere. (A) Replacement of the 4.5-kb C. albicans centromere 7 (CEN7) with a URA3 marker (red box) resulted in neocentromere formation (yellow boxes) within 13 kb on either side of the original centromere. The amount of centromere protein A (CENP-A; reddish-blue nucleosomes) at the neocentromeres relative to the amount at the original CEN7 is denoted by the number of cartoon CENP-A nucleosomes (1 = reduced to 3 = normal amount at the endogenous centromere). (B,C) Larger deletions (6.5 kb or 30 kb) of the CEN7 region produced neocentromeres that were located 2–13 kb from the original centromere. Notably, neocentromeres that formed farther from the original centromere contained lower amounts of CENP-A. (D) In a subset of neocentromere-containing strains, the neocentromeres disappeared and the endogenous CEN7 was restored by gene conversion. In these strains, the neocentromeres contained lower amounts of CENP-A (denoted schematically by the blue boxes and few CENP-A nucleosomes), suggesting that the amount of CENP-A marks the completeness of a centromere, or its probability of being reverted by gene conversion. Drawings are not precisely to scale.

centromere were the only ones to show a low level of chromosome loss. These findings suggest that, despite reduced CENP-A enrichment at these distal neocentromeres, a generally functional kinetochore was formed. At least two interesting distinctions have emerged from the recent C. albicans and chicken neocentromere studies. First, C. albicans does not exhibit an obvious correlation between the size of the deleted centromere region and the centromere-proximal location of neocentromeres. Second, all of the chicken neocentromeres were comparable in size to endogenous centromeres, whereas C. albicans neocentromere sizes varied among the different chromosomes. Neocentromeres formed from deletion of CEN1 or CEN5 were two to four times larger (6–12 kb) than the endogenous centromeres (3–5 kb). This variability in neocentromere size is more similar to that observed for human neocentromeres [12]. The observed plasticity in neocentromere size could reflect the absence of genomic features that repress de novo centromere formation, given that Candida neocentromeres frequently form in large, intergenic regions [28]. Targeting CENP-A to intergenic regions that are variable in size and chromatin enrichment, while simultaneously deleting the endogenous centromere, could address this question.

Centromere assembly and replication timing: cause or effect?

An intriguing property of centromeres is that they replicate at a different time compared with bulk DNA. In the yeasts S. cerevisiae, S. pombe, and C. albicans, centromeres replicate early in S phase. Such early replication of centromeres appears to be crucial for proper kinetochore assembly in S. cerevisiae [30] and in S. pombe, where it is regulated by the centromere protein Swi6 [31]. In fungi at least, early-replicating domains may be preferred sites of neocentromere formation over late-replicating domains. CENP-A loading in early S phase might drive neocentromere formation at early-replicating sites.
However, neocentromere formation at a late-replicating domain in C. albicans created a replication shift to early S phase [32], suggesting that replication timing alone is not a primary determinant of de novo centromere assembly. More recent studies corroborate this nding in S. cerevisiae. The repositioning of the chromosome XIV centromere from its endogenous locus to a late-replicating domain not only results in a functional centromere, but also shifts timing of replication to early S phase [33]. Thus, it appears that, in these organisms, replication timing is an inherent property of endogenous centromeres that can be transferred to neocentromeres. In contrast to fungal centromeres, replication of centromeres in vertebrates and other multicellular organisms occurs in mid to late S phase [3436]. Perhaps CENP-A loading at neocentromeres in yeast is linked to replication timing. This might explain why early replicating regions are preferred sites of neocentromere formation, especially in C. albicans [32]. In the DT40 neocentromere studies, one neocentromere formed at an already late-replicating domain, and did not alter replication timing [25]. However, two other neocentromeres formed in early-replicating domains that shifted to late upon neocentromere
formation. Similarly, human neocentromere formation on chromosome 10 shifts replication timing of the region to a later time [37]. The mechanism by which centromere assembly alters replication timing (either late to early in fungi or early to late in metazoans) remains unclear. Engineered neocentromeres in fungi and chicken provide controllable experimental systems to now explore the effects of replication timing on centromere assembly and vice versa.

De novo centromeres and transcription
Given that they are typically embedded in pericentric heterochromatin, sites of kinetochore assembly were historically presumed to lack transcriptional activity. However, on the heels of discoveries that pericentric heterochromatin domains can be transcriptionally active in fission yeast [38,39], landmark studies in maize demonstrated that DNA interwoven with CENP-A-containing nucleosomes is permissive to RNA polymerase II (RNAPII)-mediated transcription [40]. Over the past decade, transcripts homologous to the primary sequence underlying native kinetochore assembly sites have been identified in S. cerevisiae [41] and S. pombe [42], rice [43,44], mouse [45,46], tammar wallaby [47], and humans [48]. Furthermore, centromeres present on human artificial chromosomes (HACs) are likewise transcriptionally active [18,49,50]. Defining the types and properties of centromere-derived transcripts, including both endogenous genes and noncoding RNAs (ncRNAs), is the next challenge in understanding centromeric transcription [40,43–45,47].

There are important links between the level of RNAPII transcriptional activity at CENP-A-containing chromatin domains and centromere identity and function. An emerging 'Goldilocks' model of centromeric transcription in both unicellular and multicellular eukaryotes posits that transcription that is too high or too low negatively affects centromere function. Instead, a 'just right' amount is important for proper centromere assembly and chromosome segregation [51].
In humans, studies have taken advantage of easily manipulated HACs to demonstrate that targeting of transcriptional activators to a HAC core domain not only alters gene expression, but also modifies chromatin structure and HAC stability [49,50]. When HAC transcriptional activity was reduced, CENP-A incorporation and mitotic stability were significantly compromised [18].

Where it has been studied, low transcriptional activity is also a feature of endogenous centromeres. For example, experimental manipulation of core domain transcription results in chromosome mis-segregation and lagging chromosomes in both S. cerevisiae and tammar wallaby [41,47]. Likewise, treatment of mammalian cells with inhibitors of RNAPII compromises centromere function [52]. Studies of endogenous centromeres in S. pombe support and extend the conclusion that a low level of transcription is a normal feature of eukaryotic centromeres [42].

In light of these findings, it is not surprising that neocentromeres in both C. albicans and G. gallus frequently form adjacent to genes or predicted genes [25,26,28]. Furthermore, recent studies predict that open reading frames (ORFs) associated with neocentromere formation

are transcriptionally active. The steady-state transcript level of neocentromere-adjacent genes is strongly reduced upon neocentromere formation in yeast [26,28]. Similarly, in S. pombe, neocentromere-adjacent genes that are typically induced by nitrogen starvation remain repressed upon nitrogen depletion [53]. In chicken cells, changes to gene transcription after neocentromere formation are less obvious, because neocentromeres form over both transcriptionally active and inactive genomic regions [25]. Unfortunately, at most loci in the chicken neocentromere study, the transcriptional activity of genes could not be ascertained due to technical limitations, although at one testable locus, transcription was downregulated.

Whether transcriptional effects are causes or consequences of neocentromere assembly remains unknown. Intriguingly, C. albicans neocentromeres assembled at or near the URA3 reporter gene can move locally in response to experimental manipulation of growth conditions that change the amount of URA3 transcription. Increased transcriptional activity prohibits CENP-A incorporation, whereas transcriptional repression results in CENP-A association at gene promoters [26,28]. Of the few human neocentromeres that have been studied, the mardel(10) neocentromere showed a distinct correlation between centromere function and long interspersed element-1 (LINE1) transcription [54]. Although it remains to be formally tested, the transcriptional activity associated with heterochromatin formation in S. pombe [55] may also contribute to the site of neocentromere formation.

Chromatin environments that favor new centromere formation
In fission yeast, neocentromeres rarely form adjacent to the excised endogenous centromere [53]. This is likely due to the nature of the specific engineered deletions that removed both CENP-A and pericentric heterochromatin domains in the neocentromere studies.
Low levels of CENP-A and/or heterochromatin would not be expected outside of the excised regions and, as a result, neocentromeres might preferentially assemble at subtelomeric regions that do contain heterochromatin [53,56]. These findings imply that a distinct chromatin environment promotes neocentromere assembly. Indeed, de novo centromere assembly on circular artificial chromosomes in S. pombe requires the presence of pericentric heterochromatin [57]. Similarly, in Drosophila, genomic regions near or within heterochromatin are preferred sites of neocentromere formation [22,24,58,59]. Even some human neocentromeres are located in or near heterochromatic regions, such as the acrocentric short arms [60]. Nevertheless, other human neocentromeres are formed in nonheterochromatic regions, and de novo centromeres in C. elegans are assembled in the absence of heterochromatin [61]. Although heterochromatin may strongly promote or support neocentromere formation, it is not the only type of chromatin environment in which neocentromere assembly can occur. Thus, questions regarding the perfect environment for neocentromere formation remain to be experimentally addressed.

A recent study in S. pombe suggests that regions depleted for nucleosomes that contain H2A.Z are particularly suited for neocentromere formation [56]. Indeed, increased

neocentromere formation in fission yeast was observed at regions lacking H2A.Z, suggesting that CENP-A and H2A.Z are typically not present in the same nucleosomes. These studies suggested that Scm3/HJURP has decreased affinity for nucleosomes containing H2A.Z, which consequently inhibits new CENP-A incorporation. Given that heterochromatin contains little H2A.Z, a feasible model is that centromeric and telomeric heterochromatin promotes maturation of new centromeres once CENP-A incorporation has occurred. If CENP-A is aberrantly loaded into sites that contain little or no H2A.Z or in regions that experience high histone turnover, neocentromere formation may be more easily seeded and reinforced by continued, efficient recruitment of Scm3/HJURP.

Mechanisms that counter spontaneous neocentromere formation
Low levels of CENP-A are found at noncentromeric sites in multiple organisms, including promoters, yet these regions do not mature to fully functional centromeres. In addition, in instances in which CENP-A is overexpressed or tethered at specific genomic regions, only partial centromere assembly occurs [21–23,59]. In fungi, transient neocentromeres can form that contain lower (<15%) amounts of CENP-A [28,56]. However, these immature or incomplete centromeres disappear and relocate, either naturally or under stress conditions, to more favorable genomic regions, where they become more enriched for CENP-A [28,56]. An open question, then, is why do new centromeres not arise regularly throughout the genome?

Several lines of evidence indicate that multiple mechanisms protect the genome against de novo centromere formation (Figure 3). CENP-A deposition at eukaryotic centromeres corresponds with several events, including its own transcription, the availability of chaperones that load it into chromatin, and regulation of the CENP-A assembly machinery by cyclin-dependent kinase (CDK) complexes [62–64].
For instance, the Drosophila centromere protein CAL1 (an HJURP/Scm3 homolog) is present in limiting amounts during the cell cycle to ensure that CENP-A and/or CID assembly occurs appropriately [65]. In addition, chromatin remodelers participate in both the incorporation of CENP-A at centromeres [66] and in the preservation of H3 chromatin, thereby ensuring that CENP-A is not incorporated at noncentromeric sites [42]. At all times, the cell is surveying H3 chromatin and misincorporated CENP-A. Given that promoter regions often contain higher than average amounts of H2A.Z, this variant histone may also help to prevent inappropriate CENP-A deposition [56]. Neocentromere formation may represent instances in which even slight perturbations in chromatin regulation or genome surveillance enable CENP-A to encroach into unauthorized genomic regions.

Although excess or inappropriately incorporated CENP-A can lead to partial or complete centromere formation, mechanisms exist to evict mislocalized CENP-A. Ubiquitin-mediated proteolysis has been demonstrated to prevent CENP-A misincorporation and effectively control normal CENP-A levels in several organisms [67–72]. If chromatin remodelers or E3 ubiquitin ligases are mutated
[Figure 3 schematic, panels (A)–(C). Panel labels: centromere loss; repositioning; maturation; chromosome loss; neocentromere formation; accurate segregation. Key: CENP-A chromatin; variant and/or H2A.Z chromatin; CENP-A; H3; HJURP and/or Scm3; kinetochore complex; chromatin remodelers and/or chaperones; promoter and/or gene; transcript.]

Figure 3. The formation and fate of de novo centromeres arising at atypical genomic locations. Noncentromeric chromosomal loci contain low levels of centromere protein A (CENP-A; red) and histone variants, including H2A.Z (yellow). Upon centromere loss, CENP-A is preferentially incorporated at existing CENP-A loci, whereas H2A.Z may guard against CENP-A incorporation. (A) Chromatin remodeling complexes and histone H3 chaperones monitor local chromatin structure and evict misincorporated CENP-A, resulting in centromere loss. (B) Alternatively, following centromere loss, CENP-A is incorporated at loci already containing a low level of CENP-A or other chromatin structures permissive to neocentromere formation, such as heterochromatin. Holliday junction recognition protein (HJURP) association enables maturation of incomplete centromeres, followed by recruitment of centromere and kinetochore proteins necessary for neocentromere function. (C) Failure to recruit a sufficient amount of CENP-A in diploid organisms can result in incomplete neocentromere formation, which may be corrected by repositioning that results in CENP-A incorporation at a more favorable site or by homologous recombination (not shown). Neocentromeres can form, perhaps preferentially, within or adjacent to genes, resulting in reduced transcriptional activity adjacent to the mature neocentromere.

or ineffective, a critical mass of misincorporated CENP-A may remain in certain genomic regions. Indeed, when CENP-A/Cse4p is overexpressed in S. cerevisiae strains mutated for Psh1, an E3 ubiquitin ligase, misincorporated CENP-A/Cse4p is not removed from noncentromeric loci [73]. CENP-A misincorporation and/or failure in eviction may represent an early step in new centromere formation. As CENP-A persists in a new location, levels of H2A.Z or other restrictive chromatin marks may decrease, enabling the neocentromere to mature, perhaps in concert with enrichment for permissive chromatin, such as heterochromatin or H3K4me and/or H3K36me. The minimal level of CENP-A that can bypass or escape eviction and proteolysis remains to be tested, although some studies suggest that only a few molecules of CENP-A can maintain centromere function [18,28,29]. Understanding the molecular switch between new centromere formation and centromere suppression is relevant beyond neocentromere biology. Similar mechanisms might
underlie centromere inactivation in de novo dicentric chromosomes and, when ineffective or mutated, might explain why some dicentrics fail to inactivate the second centromere [2,74–76]. CENP-A is also overexpressed in many cancers [77,78]. It is tempting to speculate that surveillance and/or eviction machinery might be compromised in these cells, and neocentromeres may arise more often and contribute to the genome instability that is a hallmark of cancers.

Finally, a new view of centromere maintenance has emerged from C. albicans in which genomic mechanisms related to centromere or chromosome pairing protect against new centromere formation [26]. Deletion of endogenous CEN7 led to neocentromere formation, although in a fraction of strains, the centromere was restored at the endogenous locus by gene conversion through recombination with CEN7 on the unaltered homolog. Notably, the C. albicans neocentromeres that disappeared contained lower amounts of CENP-A compared with the original

Box 1. Outstanding questions in neocentromere research
• Does replication timing direct centromere specification or does centromere assembly trigger a change in replication dynamics?
• Do neocentromeres preferentially assemble near origins of replication, at ncRNAs, and/or within domains enriched for cohesins?
• Do neocentromeres nonrandomly arise next to centromeres defined by repetitive DNA?
• Can neocentromeres arise in organisms with genetically determined centromeres?
• How do diseased cell states influence neocentromere formation?
• Does primary incorporation of ectopic CENP-A occur at sites of DNA damage?
• Do neocentromeres preferentially assemble within specific nuclear locations and/or territories?
• What are the molecular mechanisms that control the maturation of centromeres from incomplete sites of CENP-A incorporation to fully functional centromeres?
• Is there a molecular difference between incomplete and repressed (neo)centromeres?

centromere (Figure 2). These findings suggest that neocentromere formation in diploid organisms happens more often than appreciated. These events may go unobserved because incomplete and/or immature centromeres are reverted or removed, and endogenous centromere function is restored by recombination (Figure 3). Gene conversion has been reported at both budding yeast and maize centromeres [79,80], and is thought to occur at human centromeres, although the latter has been more difficult to study. In light of the new findings in C. albicans, models of centromere stability now include recombination-based mechanisms that maintain centromere location, diversify centromeric DNAs, and suppress propagation of unfavorable or disadvantageous new centromeric locations, perhaps based on the amount or extent of CENP-A incorporation.

Concluding remarks
The ability to engineer and recover neocentromeres efficiently in both fungal and vertebrate wild type cells represents a powerful strategy to study the establishment and maintenance of de novo centromeres. Recent studies have provided insight into some genomic and epigenetic factors that promote de novo centromere formation, but many intriguing questions remain (Box 1). It will be important to define roles for transcription, replication, and chromatin environment in neocentromere formation. Such studies have implications not only for basic centromere and chromosome biology, but also for developing strategies to create controllable centromeres or repress centromere function for therapeutic applications and disease treatments.
Acknowledgments
We apologize to our colleagues whose work on centromeres and neocentromeres was not acknowledged due to space constraints. We thank Megan Aldrup-MacDonald for critical reading of the manuscript. The Scott lab is supported by Duke Institute for Genome Sciences & Policy. Sullivan lab research is funded by grants R01 GM098500 (NIH) and #1-FY13-434 from the March of Dimes Foundation.

References
1 Maddox, P.S. et al. (2004) "Holo"er than thou: chromosome segregation and kinetochore function in C. elegans. Chromosome Res. 12, 641–653
2 Stimpson, K.M. et al. (2012) Dicentric chromosomes: unique models to study centromere function and inactivation. Chromosome Res. 20, 595–605
3 Burrack, L.S. and Berman, J. (2012) Neocentromeres and epigenetically inherited features of centromeres. Chromosome Res. 20, 607–619
4 Warburton, P.E. (2004) Chromosomal dynamics of human neocentromere formation. Chromosome Res. 12, 617–626
5 Depinet, T.W. et al. (1997) Characterization of neo-centromeres in marker chromosomes lacking detectable alpha-satellite DNA. Hum. Mol. Genet. 6, 1195–1204
6 Voullaire, L. et al. (1999) Trisomy 20p resulting from inverted duplication and neocentromere formation. Am. J. Med. Genet. 85, 403–408
7 Voullaire, L.E. et al. (1993) A functional marker centromere with no detectable alpha-satellite, satellite III, or CENP-B protein: activation of a latent centromere? Am. J. Hum. Genet. 52, 1153–1163
8 Amor, D.J. et al. (2004) Human centromere repositioning in progress. Proc. Natl. Acad. Sci. U.S.A. 101, 6542–6547
9 Hasson, D. et al. (2011) Formation of novel CENP-A domains on tandem repetitive DNA and across chromosome breakpoints on human chromosome 8q21 neocentromeres. Chromosoma 120, 621–632
10 Marshall, O.J. et al. (2008) Neocentromeres: new insights into centromere structure, disease development, and karyotype evolution. Am. J. Hum. Genet. 82, 261–282
11 Alonso, A. et al. (2010) A paucity of heterochromatin at functional human neocentromeres. Epigenetics Chromatin 3, 6
12 Alonso, A. et al. (2003) Genomic microarray analysis reveals distinct locations for the CENP-A binding domains in three human chromosome 13q32 neocentromeres. Hum. Mol. Genet. 12, 2711–2721
13 Kim, S.M. et al. (2003) Early-replicating heterochromatin. Genes Dev. 17, 330–335
14 Earnshaw, W.C. et al. (2013) Esperanto for histones: CENP-A, not CenH3, is the centromeric histone H3 variant. Chromosome Res. 21, 101–106
15 Foltz, D.R. et al. (2009) Centromere-specific assembly of CENP-a nucleosomes is mediated by HJURP. Cell 137, 472–484
16 Choo, K.H. (2001) Domain organization at the centromere and neocentromere. Dev. Cell 1, 165–177
17 Partridge, J.F. et al. (2000) Distinct protein interaction domains and protein spreading in a complex centromere. Genes Dev. 14, 783–791
18 Bergmann, J.H. et al. (2011) Epigenetic engineering shows H3K4me2 is required for HJURP targeting and CENP-A assembly on a synthetic human kinetochore. EMBO J. 30, 328–340
19 Sullivan, B.A. and Karpen, G.H. (2004) Centromeric chromatin exhibits a histone modification pattern that is distinct from both euchromatin and heterochromatin. Nat. Struct. Mol. Biol. 11, 1076–1083
20 Chueh, A.C. et al. (2005) Variable and hierarchical size distribution of L1-retroelement-enriched CENP-A clusters within a functional human neocentromere. Hum. Mol. Genet. 14, 85–93
21 Barnhart, M.C. et al. (2011) HJURP is a CENP-A chromatin assembly factor sufficient to form a functional de novo kinetochore. J. Cell Biol. 194, 229–243
22 Heun, P. et al. (2006) Mislocalization of the Drosophila centromere-specific histone CID promotes formation of functional ectopic kinetochores. Dev. Cell 10, 303–315
23 Van Hooser, A.A. et al. (2001) Specification of kinetochore-forming chromatin by the histone H3 variant CENP-A. J. Cell Sci. 114, 3529–3542
24 Olszak, A.M. et al. (2011) Heterochromatin boundaries are hotspots for de novo kinetochore formation. Nat. Cell Biol. 13, 799–808
25 Shang, W.H. et al. (2013) Chromosome engineering allows the efficient isolation of vertebrate neocentromeres. Dev. Cell 24, 635–648
26 Thakur, J. and Sanyal, K. (2013) Efficient neocentromere formation is suppressed by gene conversion to maintain centromere function at native physical chromosomal loci in Candida albicans. Genome Res. 23, 638–652
27 Sanyal, K. et al. (2004) Centromeric DNA sequences in the pathogenic yeast Candida albicans are all different and unique. Proc. Natl. Acad. Sci. U.S.A. 101, 11374–11379
28 Ketel, C. et al. (2009) Neocentromeres form efficiently at multiple possible loci in Candida albicans. PLoS Genet. 5, e1000400
29 Fachinetti, D. et al. (2013) A two-step mechanism for epigenetic specification of centromere identity and function. Nat. Cell Biol. 15, 1056–1066
30 Kitamura, E. et al. (2007) Kinetochore–microtubule interaction during S phase in Saccharomyces cerevisiae. Genes Dev. 21, 3319–3330
31 Hayashi, M.T. et al. (2009) The heterochromatin protein Swi6/HP1 activates replication origins at the pericentromeric region and silent mating-type locus. Nat. Cell Biol. 11, 357–362
32 Koren, A. et al. (2010) Epigenetically-inherited centromere and neocentromere DNA replicates earliest in S-phase. PLoS Genet. 6, e1001068
33 Pohl, T.J. et al. (2012) Functional centromeres determine the activation time of pericentric origins of DNA replication in Saccharomyces cerevisiae. PLoS Genet. 8, e1002677
34 Hultdin, M. et al. (2001) Replication timing of human telomeric DNA and other repetitive sequences analyzed by fluorescence in situ hybridization and flow cytometry. Exp. Cell Res. 271, 223–229
35 Sullivan, B. and Karpen, G. (2001) Centromere identity in Drosophila is not determined in vivo by replication timing. J. Cell Biol. 154, 683–690
36 Ten Hagen, K.G. et al. (1990) Replication timing of DNA sequences associated with human centromeres and telomeres. Mol. Cell. Biol. 10, 6348–6355
37 Lo, A.W. et al. (2001) A 330 kb CENP-A binding domain and altered replication timing at a human neocentromere. EMBO J. 20, 2087–2096
38 Volpe, T. et al. (2003) RNA interference is required for normal centromere function in fission yeast. Chromosome Res. 11, 137–146
39 Verdel, A. and Moazed, D. (2005) RNAi-directed assembly of heterochromatin in fission yeast. FEBS Lett. 579, 5872–5878
40 Topp, C.N. et al. (2004) Centromere-encoded RNAs are integral components of the maize kinetochore. Proc. Natl. Acad. Sci. U.S.A. 101, 15986–15991
41 Ohkuni, K. and Kitagawa, K. (2011) Endogenous transcription at the centromere facilitates centromere activity in budding yeast. Curr. Biol. 21, 1695–1703
42 Choi, E.S. et al. (2012) Factors that promote H3 chromatin integrity during transcription prevent promiscuous deposition of CENP-A(Cnp1) in fission yeast. PLoS Genet. 8, e1002985
43 Yan, H. et al. (2006) Genomic and genetic characterization of rice Cen3 reveals extensive transcription and evolutionary implications of a complex centromere. Plant Cell 18, 2123–2133
44 Yan, H. et al. (2005) Transcription and histone modifications in the recombination-free region spanning a rice centromere. Plant Cell 17, 3227–3238
45 Bouzinba-Segard, H. et al. (2006) Accumulation of small murine minor satellite transcripts leads to impaired centromeric architecture and function. Proc. Natl. Acad. Sci. U.S.A. 103, 8709–8714
46 Ferri, F. et al. (2009) Non-coding murine centromeric transcripts associate with and potentiate Aurora B kinase. Nucleic Acids Res. 37, 5071–5080
47 Carone, D.M. et al. (2009) A new class of retroviral and satellite encoded small RNAs emanates from mammalian centromeres. Chromosoma 118, 113–125
48 Wong, L.H. et al. (2007) Centromere RNA is a key component for the assembly of nucleoproteins at the nucleolus and centromere. Genome Res. 17, 1146–1160
49 Cardinale, S. et al. (2009) Hierarchical inactivation of a synthetic human kinetochore by a chromatin modifier. Mol. Biol. Cell 20, 4194–4204
50 Nakano, M. et al. (2008) Inactivation of a human kinetochore by specific targeting of chromatin modifiers. Dev. Cell 14, 507–522
51 Ohkuni, K. and Kitagawa, K. (2012) Role of transcription at centromeres in budding yeast. Transcription 3, 193–197
52 Chan, F.L. et al. (2012) Active transcription and essential role of RNA polymerase II at the centromere during mitosis. Proc. Natl. Acad. Sci. U.S.A. 109, 1979–1984
53 Ishii, K. et al. (2008) Heterochromatin integrity affects chromosome reorganization after centromere dysfunction. Science 321, 1088–1091
54 Chueh, A.C. et al. (2009) LINE retrotransposon RNA is an essential structural and functional epigenetic component of a core neocentromeric chromatin. PLoS Genet. 5, e1000354

55 Chen, E.S. et al. (2008) Cell cycle control of centromeric repeat transcription and heterochromatin assembly. Nature 451, 734–737
56 Ogiyama, Y. et al. (2013) Epigenetically-induced paucity of histone H2A.Z stabilizes fission yeast ectopic centromeres. Nat. Struct. Mol. Biol. 20, 1397–1406
57 Kagansky, A. et al. (2009) Synthetic heterochromatin bypasses RNAi and centromeric repeats to establish functional centromeres. Science 324, 1716–1719
58 Maggert, K.A. and Karpen, G.H. (2001) The activation of a neocentromere in Drosophila requires proximity to an endogenous centromere. Genetics 158, 1615–1628
59 Mendiburo, M.J. et al. (2011) Drosophila CENH3 is sufficient for centromere formation. Science 334, 686–690
60 Klein, E. et al. (2012) Five novel locations of neocentromeres in human: 18q22.1, Xq27.1 approximately 27.2, Acro p13, Acro p12, and heterochromatin of unknown origin. Cytogenet. Genome Res. 136, 163–166
61 Yuen, K.W. et al. (2011) Rapid de novo centromere formation occurs independently of heterochromatin protein 1 in C. elegans embryos. Curr. Biol. 21, 1800–1807
62 Jansen, L.E. et al. (2007) Propagation of centromeric chromatin requires exit from mitosis. J. Cell Biol. 176, 795–805
63 Shelby, R.D. et al. (1997) Assembly of CENP-A into centromeric chromatin requires a cooperative array of nucleosomal DNA contact sites. J. Cell Biol. 136, 501–513
64 Silva, M.C. et al. (2012) Cdk activity couples epigenetic centromere inheritance to cell cycle progression. Dev. Cell 22, 52–63
65 Mellone, B.G. et al. (2011) Assembly of Drosophila centromeric chromatin proteins during mitosis. PLoS Genet. 7, e1002068
66 Warburton, P.E. et al. (1993) Nonrandom localization of recombination events in human alpha satellite repeat unit variants: implications for higher-order structural characteristics within centromeric heterochromatin. Mol. Cell. Biol. 13, 6520–6529
67 Collins, K.A. et al. (2004) Proteolysis contributes to the exclusive centromere localization of the yeast Cse4/CENP-A histone H3 variant. Curr. Biol. 14, 1968–1972
68 Gross, S. et al. (2012) Centromere architecture breakdown induced by the viral E3 ubiquitin ligase ICP0 protein of herpes simplex virus type 1. PLoS ONE 7, e44227
69 Lermontova, I. et al. (2013) Arabidopsis KINETOCHORE NULL2 is an upstream component for centromeric histone H3 variant cenH3 deposition at centromeres. Plant Cell 25, 3389–3404
70 Lomonte, P. et al. (2001) Degradation of nucleosome-associated centromeric histone H3-like protein CENP-A induced by herpes simplex virus type 1 protein ICP0. J. Biol. Chem. 276, 5829–5835
71 Moreno-Moreno, O. et al. (2006) Proteolysis restricts localization of CID, the centromere-specific histone H3 variant of Drosophila, to centromeres. Nucleic Acids Res. 34, 6247–6255
72 Thakur, J. and Sanyal, K. (2012) A coordinated interdependent protein circuitry stabilizes the kinetochore ensemble to protect CENP-A in the human pathogenic yeast Candida albicans. PLoS Genet. 8, e1002661
73 Ranjitkar, P. et al. (2010) An E3 ubiquitin ligase prevents ectopic localization of the centromeric histone H3 variant via the centromere targeting domain. Mol. Cell 40, 455–464
74 Earnshaw, W.C. et al. (1989) Visualization of centromere proteins CENP-B and CENP-C on a stable dicentric chromosome in cytological spreads. Chromosoma 98, 1–12
75 Stimpson, K.M. et al. (2010) Telomere disruption results in nonrandom formation of de novo dicentric chromosomes involving acrocentric human chromosomes. PLoS Genet. 6, e1001061
76 Sullivan, B.A. and Willard, H.F. (1998) Stable dicentric X chromosomes with two functional centromeres. Nat. Genet. 20, 227–228
77 Tomonaga, T. et al. (2003) Overexpression and mistargeting of centromere protein-A in human primary colorectal cancer. Cancer Res. 63, 3511–3516
78 Wu, Q. et al. (2012) Expression and prognostic significance of centromere protein A in human lung adenocarcinoma. Lung Cancer 77, 407–414
79 Shi, J. et al. (2010) Widespread gene conversion in centromere cores. PLoS Biol. 8, e1000327
80 Symington, L.S. and Petes, T.D. (1988) Meiotic recombination within the centromere of a yeast chromosome. Cell 52, 237–240

Review

Mining cancer methylomes: prospects and challenges


Clare Stirzaker1,2*, Phillippa C. Taberlay1,2*, Aaron L. Statham1, and Susan J. Clark1,2
1 Epigenetics Program, Garvan Institute of Medical Research, The Kinghorn Cancer Centre, Sydney 2010, NSW, Australia
2 St Vincent's Clinical School, University of NSW, Sydney 2010, NSW, Australia

There are over 28 million CpG sites in the human genome. Assessing the methylation status of each of these sites will be required to understand fully the role of DNA methylation in health and disease. Genome-wide analysis, using arrays and high-throughput sequencing, has enabled assessment of large fractions of the methylome, but each protocol comes with unique advantages and disadvantages. Notably, except for whole-genome bisulfite sequencing, most commonly used genome-wide methods detect <5% of all CpG sites. Here, we discuss approaches for methylome studies and compare genome coverage of promoters, genes, and intergenic regions, and capacity to quantitate individual CpG methylation states. Finally, we examine the extent of published cancer methylomes that have been generated using genome-wide approaches.

DNA methylation and (de)regulation of the epigenome
Epigenetic regulation (see Glossary) of normal cellular processes is typically driven in a cell type-dependent manner, requiring a complex interplay between different layers of epigenetic information, including DNA methylation, nucleosome positions, histone modifications, and expression of noncoding RNA. Several epigenetic mechanisms help establish and consolidate the correct higher-order chromatin structures and gene-expression patterns during differentiation and development. Of these, DNA methylation is the best-studied epigenetic modification in mammals. Precise DNA methylation patterns are established during embryonic development and are mitotically inherited through multiple cellular divisions. DNA methylation is necessary for normal cell development [1,2], underpinning X chromosome inactivation [3,4], control of some tissue-specific gene expression, and regulation of imprinted alleles [2,5,6], with widespread effects on cellular growth and genomic stability [7–9].
DNA methylation in mammalian cells is characterized by the addition of a methyl group at the carbon-5 position of cytosine residues within CpG dinucleotides through the
Corresponding author: Clark, S.J. (s.clark@garvan.org.au).
Keywords: cancer methylome; epigenetics; DNA methylation.
* Equal first authors.
0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.11.004

action of DNA methyltransferase enzymes, forming 5-methylcytosine (5MeC) [10]. There are approximately 28 million CpG sites in the genome, but these are not evenly distributed; in fact, the bulk of the genome is depleted of CpG sites, with less than one quarter of the expected frequency. By contrast, clusters of CpG sites that occur at the expected frequency are termed CpG islands, and these commonly span promoters of house-keeping genes. Promoter CpG islands typically remain unmethylated in normal cells and are associated with active gene expression during differentiation (CpG island, promoter; Figure 1). By contrast, methylated CpG island promoters are associated with gene repression. Regions of intermediate CpG densities also exist across the genome, often in the body of genes. Unlike CpG island promoters, extensive exonic or genic methylation is typically associated with active gene expression (genic; Figure 1). CpG island shores are regions of comparatively low CpG density, located approximately 2 kb from CpG islands [11]. Shores also exhibit tissue- and cancer-specific differential methylation and are associated with gene repression [12]. Beyond CpG islands and shores, the remainder of the genome displays a lower than expected frequency of CpG sites and is typically methylated in normal cells (intergenic; Figure 1). This includes CpG-poor promoters and distal enhancers that regulate tissue-specific genes (tissue specific; Figure 1).

Despite extensive knowledge of DNA methylation events, the underlying biology largely remains an enigma, particularly the mechanism by which it is altered in diseased states, such as cancer. Normal epigenetic processes are disrupted during the initiation and progression of

Glossary
Bisulfite genomic sequencing: sequencing of bisulfite-treated DNA allowing resolution of the methylation state of every cytosine in the target sequence, at single-molecule resolution. This is considered the gold standard for DNA methylation analysis. Bisulfite modification: exploits the different sensitivities of cytosine and 5-meC to deamination by bisulfite under acidic conditions in which cytosine undergoes conversion to uracil, whereas 5-meC remains unreactive. Cancer methylome: the map of DNA methylation across a cancer cell genome. DNA methylation: the addition of a methyl CH3 group to the cytosine base at the carbon 5 position (5-meC) in DNA; found primarily in the context of CpG dinucleotides in eukaryotes. Epigenetic mechanisms: the mechanisms that govern the role of epigenetics in gene expression without changing the underlying DNA sequence; include chromatin structure, histone modifications, nucleosome positioning, and DNA methylation. Epigenetics: the study of changes in gene expression or phenotype of a cell, caused by mechanisms other than changes in the underlying DNA sequence. Methylome: the genome-wide map of DNA methylation.
Trends in Genetics, February 2014, Vol. 30, No. 2

[Figure 1 graphic. (A) Normal cell: CpG island shore (2 kb from CpG island), promoter CpG island, genic region, intergenic region, CpG-poor enhancer, CpG-poor promoter, genic region. (B) Cancer cell: CpG island hypermethylation; aberrant gene silencing (e.g., tumor suppressor genes); genomic instability; aberrant enhancer silencing; aberrant gene expression (e.g., oncogenes).]

Figure 1. DNA methylation and (de)regulation of the genome. A schematic representation of the methylome and a summary of major changes that occur in cancer cells. CpG islands are often associated with gene promoters and are resistant to DNA methylation in normal cells (A) (green). Gene expression can occur, and is highly correlated with high levels of gene body (genic) methylation. CpG-poor regions (intergenic), with the exception of enhancers, are typically methylated in normal cells. Similarly, CpG-poor promoters are silenced by DNA methylation and exhibit a closed chromatin structure unless gene expression is required (tissue specific). In cancer cells (B), CpG islands are prone to DNA hypermethylation, which results in aberrant gene silencing (e.g., of tumor suppressor genes). Concomitant hypomethylation of intergenic regions and CpG-poor promoters contributes to genomic instability and aberrant gene expression (e.g., of oncogenes), respectively. White circle, unmethylated CpG; black circle, methylated CpG.

cancer, including global changes in DNA methylation patterns [13]. CpG island hypermethylation is common and often associated with the silencing of tumor suppressor genes and downstream signaling pathways [13–16] (Figure 1). Whereas CpG islands become susceptible to DNA methyltransferase activity, CpG-poor regions undergo hypomethylation during transformation, resulting in an overall decrease in total genomic 5MeC in cancer cells [13,14,16] (Figure 1). The exceptions are CpG-poor distal enhancers, which are unmethylated in normal cells but often gain methylation in cancer cells [17,18] (Figure 1). Global hypomethylation in cancer is thought to contribute to genomic instability and aberrant expression of some oncogenes, such as MYC [19] (Figure 1), which results in deregulation of cellular processes. The opportunity now exists to provide more comprehensive maps of cancer DNA methylomes using whole genome-based technologies [20–25]. These technologies will help provide greater insight into the underlying mechanism and location of cancer-specific methylation changes at individual CpG residues and may aid in further identification of potential epigenetic-based cancer biomarkers.

Genome-wide methylome technologies
DNA methylation analyses were initially restricted to relatively localized CpG-rich regions of the genome, but several methods have now been developed to map DNA methylation on a genomic scale. Here, we describe four different genome-wide approaches (summarized in Figure 2): whole-genome bisulfite sequencing (WGBS); methyl-binding domain capture sequencing (MBDCap-Seq); reduced representation bisulfite sequencing (RRBS); and Infinium HumanMethylation450 BeadChips (HM450, Illumina). We discuss some of the requirements, merits, and challenges that should be considered when choosing a methylome technology to ensure that it will be informative.

Whole-genome bisulfite sequencing
Bisulfite sequencing, which was developed in 1992–1994 by Frommer and Clark [26,27], is considered the gold standard for DNA methylation analyses because CpG methylation can be measured at single-base resolution. DNA is treated with sodium bisulfite to convert cytosine to uracil, which is converted to thymine after PCR amplification, whereas 5MeC residues are not converted and remain as cytosines [27]. Clonal sequencing of bisulfite-converted PCR products from a single genomic region typified the approach until recently; however, the development of high-throughput sequencing now facilitates the generation of genome-wide, single-base resolution DNA methylation maps from bisulfite-converted DNA (Figure 2). To perform WGBS, genomic DNA (1–5 µg) is sheared and ligated to methylated adaptors before size selection and bisulfite conversion, followed by library construction and high-throughput sequencing (Figure 2). More than 500 million paired-end reads are required to achieve approximately 30-fold coverage of the 28 217 009 CpG sites on autosomes and sex chromosomes; typically, approximately 95% of all CpG sites in the genome can be assessed using WGBS. The first methylome was generated from the Arabidopsis thaliana genome in 2008 [28,29], and the first human methylomes of embryonic stem cells and IMR90 fibroblasts were

[Figure 2 graphic, panel content reconstructed. Platforms are ordered by increasing coverage: HM450, RRBS, MBDCap-Seq, WGBS. WGBS, MBDCap-Seq, and RRBS are read out by high-throughput Illumina sequencing; HM450 is read out on an array.]
WGBS. Workflow: sonication of DNA; library preparation; gel-size selection; bisulfite treatment; library amplification. Input requirement: genomic DNA 1–5 µg; FFPE DNA 1–5 µg. Coverage: >500 million reads per sample; 95% of CpG sites.
MBDCap-Seq. Workflow: sonication of DNA; capture of 5mC by MBD; library preparation; gel-size selection; library amplification. Input requirement: genomic DNA 0.2–1 µg; FFPE DNA 0.5–1 µg. Coverage: 30 million reads per sample; 17.8% of CpG sites.
RRBS. Workflow: digestion with MspI; library preparation; gel-size selection; bisulfite treatment; library generation. Input requirement: genomic DNA 0.01–0.03 µg; FFPE DNA 1 µg. Coverage: 10 million reads per sample; 3.7% of CpG sites.
HM450. Workflow: bisulfite treatment; hybridization; single-base extension; stain BeadChip; HiScan scan of BeadChip. Input requirement: genomic DNA 0.5–1 µg; FFPE DNA 0.5–1 µg. Coverage: array; 1.7% of CpG sites.

Figure 2. Summary of techniques to interrogate whole-genome DNA methylation. The figure compares the maximum coverage of whole-genome bisulfite sequencing (WGBS, orange; most genomic coverage), MBD capture sequencing (MBDCap-Seq, purple), reduced representation bisulfite sequencing (RRBS, blue), and HumanMethylation450 BeadChip (HM450, green) assays for measuring genome-wide DNA methylation. A summary of the standard workflow for each method is shown (colored boxes, left). The amount of genomic DNA or formalin-fixed paraffin-embedded tissue (FFPE) needed to perform each technique reliably ranges from 0.01 µg (RRBS) to 5 µg (WGBS), which may influence platform selection. The minimum number of unique sequencing reads varies from 10 million reads (RRBS) to >500 million reads (WGBS), whereas the HM450 platform utilizes array technology. Therefore, the cost of each technique is approximately proportional to the amount of data needed for reliable analysis and to the coverage of the genome [range, 1.7% (HM450) to 95% (WGBS)].

reported by Lister et al. in 2009 [22]. To date, relatively few WGBS human cancer [30–32] or related [33] methylomes have been generated, likely due to the overall cost of the assay, the technical expertise needed, and the downstream computational requirements. WGBS has the advantage of providing single-nucleotide resolution and whole-genome coverage. However, it typically requires relatively large quantities of DNA (1–5 µg), and accurate interpretation requires computational expertise. Commercial bisulfite conversion reagents exist in kit form; yet, standard WGBS protocols and library preparation methods are only beginning to emerge. Sequencing providers are performing WGBS using customized in-house methods, but the technique currently is not particularly amenable to high-throughput use, particularly in a clinical setting, partly due to the extensive hands-on time and depth of sequencing required. Finally, the bioinformatics requirements for data interpretation present additional challenges. Initial WGBS studies relied upon in-house adaptations of genome sequencing pipelines to bisulfite data and unpublished bespoke analysis pipelines [22,34–36]; however, public tools for the analysis of WGBS data are being developed as the technique becomes more accessible.
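The chemistry that underlies bisulfite sequencing can be illustrated with a small in silico sketch. The function names, the toy reference sequence, and the assumption that reads are already aligned to the forward strand are all illustrative and are not taken from any published pipeline; real WGBS aligners must also handle the reverse strand, sequencing errors, and incomplete conversion.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on one DNA molecule.

    Unmethylated C is read as T (C -> U -> T); methylated C stays C.
    `methylated_positions` holds 0-based indices of methylated cytosines.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference, reads):
    """Estimate the methylation level at each CpG cytosine of `reference`.

    `reads` is a list of (start, sequence) pairs for forward-strand reads
    that are assumed to be aligned already. A read base of C counts as
    methylated, T as unmethylated. Returns {position: fraction or None}.
    """
    counts = {
        pos: [0, 0]  # [methylated, unmethylated]
        for pos in range(len(reference) - 1)
        if reference[pos:pos + 2] == "CG"
    }
    for start, read in reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos in counts:
                if base == "C":
                    counts[pos][0] += 1
                elif base == "T":
                    counts[pos][1] += 1
    return {
        pos: (m / (m + u) if (m + u) else None)
        for pos, (m, u) in counts.items()
    }


if __name__ == "__main__":
    reference = "ACGTTCGACCGA"  # toy sequence with CpGs at positions 1, 5, 9
    molecule_1 = bisulfite_convert(reference, {1, 9})
    molecule_2 = bisulfite_convert(reference, {1})
    print(call_cpg_methylation(reference, [(0, molecule_1), (0, molecule_2)]))
```

Because each read reports one molecule, the per-site fraction is simply the share of reads that retained a cytosine, which is why accuracy at individual CpG sites depends on sequencing depth.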

Enrichment-based technologies
Genome-wide affinity-based methods rely on enrichment of methylated regions, followed by microarray hybridization or next-generation sequencing (Figure 2). Two of the common enrichment approaches are methyl-DNA immunoprecipitation (MeDIP), which uses a monoclonal antibody specific for 5-methylcytosine [37], and affinity capture with methyl-binding domain (MBD) proteins (MBDCap) [38,39]. Both MeDIP and MBDCap can be combined with next-generation sequencing (MeDIP-Seq and MBDCap-Seq). However, due to bias in the different capture technologies, the two methods commonly interrogate distinct genomic regions [40]. MeDIP is based on immunoprecipitation of single-stranded DNA fragments and targets methylated regions of low CpG density (e.g., intergenic regions). By contrast, the MBD-based strategy captures double-stranded methylated DNA fragments and favors enrichment of CpG-dense regions (e.g., CpG islands) [41]. Here, we highlight MBDCap-Seq as one of the most widely used capture approaches. The workflow for MBDCap-Seq exhibits similarities to WGBS, but is devoid of a bisulfite conversion step (Figure 2). To perform MBDCap-Seq, genomic DNA (0.2–1 µg) is sonicated before capturing methylated DNA with MBD protein coupled to streptavidin beads. Following capture, the bound methylated DNA can be eluted as a single fraction or in a step-wise elution series to enrich different CpG densities. Enriched DNA is then subjected to library preparation and high-throughput sequencing (Figure 2). Although the method is more efficient with >0.2 µg of DNA from fresh-frozen tissue, genomic DNA for cancer methylomes can also be isolated from formalin-fixed paraffin-embedded (FFPE) tissue, which is amenable to MBDCap-Seq using as little as approximately 0.5 µg of DNA. Approximately 30 million single-end reads are required for accurate interpretation of data. MBDCap-Seq performed on fully methylated DNA can yield approximately 18% coverage of the genome because it captures approximately 5 million methylated CpG sites (Figure 2). MBDCap-Seq is a simple approach that does not require bisulfite conversion and can be used to identify differentially methylated regions [40,41]. However, a notable disadvantage of MBDCap-Seq is that it does not provide single-nucleotide resolution. Rather, it identifies regions containing multiple methylated CpG sites, typically at CpG-rich regions, in a readout similar to chromatin immunoprecipitation sequencing (ChIP-Seq). Furthermore, MBDCap-Seq is only marginally quantitative because the number of reads mapping to a particular region of the genome depends on the density of methylated CpG sites [41].

Reduced representation bisulfite sequencing
RRBS is an efficient and high-throughput technique used to analyze methylation profiles at a single-nucleotide level from regions of high CpG content (e.g., CpG islands), but does not interrogate intergenic or lowly methylated regions of the genome (Figure 2) [24,42]. RRBS relies first on the digestion of genomic DNA (0.01–0.03 µg) with a methylation-insensitive restriction enzyme, such as MspI (C^CGG), that selects genomic regions with moderate to high CpG density, such as CpG islands, followed by DNA size fractionation (Figure 2). This reduced representation of the genome is sequenced similarly to WGBS to generate a single-base pair resolution DNA methylation map [24,42]. A minimum of approximately 10 million sequencing reads is required for the downstream analysis of RRBS data sets, leading to approximately 3.7% actual coverage of CpG dinucleotides genome-wide, or approximately 1 million CpG sites. One of the main advantages of RRBS is that it is more cost-effective than WGBS, because it targets bisulfite sequencing to an enriched population of the genome, while retaining single-nucleotide resolution. RRBS data are restricted to regions with moderate to high CpG density, and are enriched for promoter-associated CpG islands. However, RRBS interrogates fewer than 4% of the approximately 28 million CpG dinucleotides distributed throughout the human genome. Thus, a lack of coverage at intergenic and distal regulatory elements is a potential disadvantage of the method. In addition, although RRBS data can be processed using pipelines similar to those for WGBS (e.g., [43,44]), data analysis requires a similar level of expertise and, hence, involves similar challenges.
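The enrichment step of RRBS can be sketched as an in silico digest: cut a sequence at every MspI site (C^CGG) and keep only fragments within a size window. This is a minimal illustration; the function name and the 40–220 bp window are assumptions chosen for the example rather than values given in the text, and a real digest would also account for both strands and end repair.

```python
def mspi_fragments(seq, min_len=40, max_len=220):
    """Return size-selected fragments flanked by MspI cuts on both ends.

    MspI recognizes CCGG and cuts after the first C (C^CGG), so every
    internal fragment starts with CGG and ends with C. The default size
    window (in bp) is an illustrative assumption.
    """
    site = "CCGG"
    cut_positions = []
    index = seq.find(site)
    while index != -1:
        cut_positions.append(index + 1)  # cut between C and CGG
        index = seq.find(site, index + 1)

    # Only fragments between two consecutive cut sites are retained.
    fragments = [
        seq[start:end]
        for start, end in zip(cut_positions, cut_positions[1:])
    ]
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    toy = "AAA" + "CCGG" + "T" * 50 + "CCGG" + "A" * 10 + "CCGG" + "TT"
    for fragment in mspi_fragments(toy):
        print(len(fragment), fragment)
```

Because every retained fragment begins and ends at a CCGG site, each one carries at least two CpG dinucleotides, which is the basis for the enrichment of CpG-dense regions.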

Infinium HumanMethylation450 BeadChip
The HM450 is an attractive option for genome-wide DNA methylation analyses in a variety of cell types. It is suitable for clinical samples, including FFPE tissue, requires little starting material (approximately 0.5 µg), is cost effective, and can be used in a high-throughput manner. The technology is distinct from the other methylation technologies described above, in that it does not depend on capture or enrichment, or on the use of restriction enzymes or high-throughput sequencing for data generation (Figure 2). The HM450 protocol begins with the bisulfite conversion of genomic DNA (0.5–1 µg) (Figure 2). Converted genomic DNA is hybridized to arrays that contain predesigned probes that distinguish methylated cytosines (which remain as cytosine) from unmethylated cytosines (which are converted to uracil). A single-base extension step incorporates a labeled nucleotide that is fluorescently stained. Scanning of the array detects the ratio of fluorescent signal arising from the unmethylated probe compared with the methylated probe, allowing the level of methylation to be determined (Figure 2). The HM450 BeadChip interrogates 482 422 cytosines across the human genome, which represents only approximately 1.7% of all CpG sites in the human genome (Figure 2), substantially less than other methods. However, these sites are enriched for CpG (99.3%) residues and almost half (>41%, approximately 197 790 CpG sites) of the probes on the array cover intergenic regions, such as bioinformatically predicted enhancers, DNase I hypersensitive sites, and validated differentially methylated regions (DMRs) [45,46]. HM450 can be performed on both fresh-frozen and FFPE DNA, and methods are now being optimized to enable smaller amounts (0.2 µg) to be profiled efficiently [47]. Therefore, HM450 has become the method of choice for genome-wide DNA methylation profiling of large cohorts, because it requires a low amount of input material and is cost effective.
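The methylation level derived from the two probe signals is commonly summarized as a beta value between 0 and 1. The sketch below shows one way to compute it; the function name is hypothetical, and the stabilizing offset of 100 is a widely used convention for array data that is assumed here rather than stated in the text.

```python
def beta_value(methylated_signal, unmethylated_signal, offset=100.0):
    """Estimate the methylation level of one CpG from two probe intensities.

    beta = M / (M + U + offset), where M and U are the methylated and
    unmethylated fluorescent signals. Negative intensities (possible after
    background subtraction) are clipped to zero. The offset keeps the ratio
    stable when both signals are weak; its default value is an assumption.
    """
    m = max(float(methylated_signal), 0.0)
    u = max(float(unmethylated_signal), 0.0)
    return m / (m + u + offset)


if __name__ == "__main__":
    probes = {
        "mostly methylated": (9000, 500),
        "mostly unmethylated": (400, 8500),
        "weak signal": (30, 20),
    }
    for label, (m, u) in probes.items():
        print(f"{label}: beta = {beta_value(m, u):.3f}")
```

A site with weak signal in both channels yields a beta value pulled toward zero rather than an unstable ratio, which is one reason low-intensity probes are usually filtered before analysis.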
However, when using HM450 BeadChip technology, there are also some issues to consider. First, the design is heavily biased due to preselection and inclusion of probes that interrogate only certain CpG sites that have been previously identified in methylation-based assays and, therefore, the design is not hypothesis neutral. Second, it is assumed that CpG sites located adjacent to those interrogated by the probes will be similarly methylated or unmethylated, which is known as the co-methylation assumption [48]. Finally, there are behavioral differences between the two types of probe design on the array, and the filtering of probes may be affected by single nucleotide polymorphisms, which need to be factored in to the data analysis pipelines [49].

Comparison of genome-wide coverage
The major advantage of WGBS is that, in theory, the methylation state of almost every single CpG dinucleotide (total 28 217 009) in the genome can be determined at single-molecule resolution (Figure 3A,B). By contrast, with MBDCap-Seq, RRBS, and HM450, there is substantially less coverage, with approximately 5 040 790, approximately 1 054 280, and 482 422 individual CpG sites, respectively, interrogated (Figure 3A,B). Notably, only a proportion of CpG sites are commonly interrogated by all three techniques (Figure 3A). MBDCap-Seq has greater

[Figure 3 graphic, panel content reconstructed. (A) Venn diagram of CpG sites covered by WGBS, MBDCap-Seq, RRBS, and HM450. (C) Bar chart of the percentage of all CpG sites in the human genome (y axis, 0–100) covered by each technique, subdivided into promoter, genic, and intergenic sites.]

(B) Number of CpG sites assayed by each technique.

Genome wide            WGBS          MBDCap       RRBS         HM450
Promoter               1 962 844     1 281 138    504 446      187 791
Genic                  11 951 925    2 188 593    312 957      175 760
Intergenic             13 116 432    1 571 060    236 875      118 871
Total CpGs assayed     27 031 201    5 040 791    1 054 278    482 422

CpG-rich regions       WGBS          MBDCap       RRBS         HM450
CpG islands            2 019 500     1 572 591    641 182      150 253
CpG shores             1 936 549     902 062      127 090      111 988

Figure 3. Proportion of promoters, genic, and intergenic regions interrogated by each technique. (A) The overlap and relative proportions of whole-genome bisulfite sequencing (WGBS), MBD capture sequencing (MBDCap-Seq), reduced representation bisulfite sequencing (RRBS), and HumanMethylation450 BeadChip (HM450) are plotted in a Venn diagram. (B) The total number of CpG dinucleotides (based on fully methylated DNA) covered by each technique is shown, and ranges from 482 422 (HM450) to 27 031 201 (WGBS). Each CpG site is located in a promoter, genic, or intergenic region of the genome and the distribution of these sites is detailed in the upper panel. The number of CpG sites covered by each technique that overlap CpG-rich regions (CpG islands and CpG shores) is shown in the lower panel. (C) WGBS covers approximately 95% of all CpG sites in the genome, most of which are located in intergenic or genic regions (approximately 12 million in each category) and the remainder in promoters (approximately 2 million). By contrast, HM450 interrogates the DNA methylation state of approximately 120 000 intergenic, approximately 170 000 genic, and approximately 180 000 promoter CpG sites. Data are expressed as a percentage of all CpG sites in the human genome.

coverage of promoter (approximately 1 281 140) and CpG island (approximately 1 572 590) CpG sites, as well as greater regional coverage of intergenic regions (approximately 1 571 060) and shores (approximately 902 060), compared with RRBS and HM450 arrays (Figure 3B). Moreover, when the genome is sorted into functional categories (promoter, genic, or intergenic; Figure 3B), it becomes clear that each technique, except for WGBS, is biased for different regulatory regions of the genome (Figure 3B,C). For example, MBDCap-Seq interrogates 1 572 591 CpG island sites (approximately 31% of all CpG sites assayed using MBDCap-Seq) compared with WGBS, which interrogates all 2 019 500 CpG island sites in the genome (approximately 7.5% of all CpG sites assayed using WGBS). Although RRBS covers less than 5% of all CpG sites in the human genome (Figure 3B), it enriches for regions of the genome that have a high CpG content: of the approximately 1 million CpG sites interrogated, almost 50% (504 446) are within promoter regions and 641 182 are within CpG islands. Although HM450 arrays cover the fewest CpG sites (Figure 3B,C), the arrays provide good coverage of methylation at CpG island promoters. Nonetheless, WGBS is to date the method that best represents regions of lower CpG density, such as intergenic gene deserts, partially methylated domains, and distal regulatory elements (e.g., enhancers) that potentially facilitate control of tissue-specific expression and noncoding RNA expression, which are commonly deregulated in cancer.

Comparison of DNA methylation data output
Consistent with variations in genomic coverage, the data output of the genome-wide DNA methylation approaches differs considerably (summarized in Figure 4). We have used CAV1 and GSTP1 gene promoters to illustrate the differences in methylation signal and coverage across CpG island gene promoters and adjacent intergenic and genic regions (Figure 4A,B). Unlike MBDCap-Seq, the WGBS, RRBS, and HM450 approaches all measure both unmethylated and methylated cytosines at single CpG sites and, therefore, are fully quantitative, although the accuracy depends on coverage. Notable is the explicit detail of CpG methylation in WGBS data (Figure 4C,D). With sufficient sequencing depth, individual WGBS and RRBS sequencing reads allow the separation of DNA methylation data for each strand, and the detection of cytosine methylation in a non-CpG context [22], heterogeneous patterns, and allele-specific DNA methylation (Figure 4C,D). Current bisulfite-based methodologies cannot distinguish between 5mC and other structurally similar DNA modifications that have recently been discovered, including 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC). This
[Figure 4 graphic, panel content reconstructed. (A) Genome browser view (hg19) of CAV1, chr7:116 137 500–116 143 500 (scale bar, 2 kb), with tracks for CpG island, WGBS, MBDCap signal, MBDCap, RRBS, and HM450. (B) Genome browser view (hg19) of GSTP1, chr11:67 340 000–67 365 000 (scale bar, 10 kb), with the same tracks. (C) Allelic methylation (hg18), chr21:10 128 500–10 129 000 (scale bar, 1 kb), SNP rs11184058. (D) Stranded WGBS (hg18), chr20:26 137 500–26 139 500 (scale bar, 2 kb), showing forward and reverse strands. Panels C and D carry MBDCap signal and WGBS tracks and CpG counts of 201 and 38. Key: unmethylated CpG; methylated CpG.]

Figure 4. Comparison of DNA methylation approaches. The signal from MBD capture sequencing (MBDCap-Seq) is averaged and not fully quantitative, whereas explicit detail can be viewed in whole-genome bisulfite sequencing (WGBS) data. (A,B) Screenshots of glutathione-S-transferase P1 gene (GSTP1) and caveolin 1 (CAV1) gene promoters and adjacent intergenic and genic regions show that reduced representation bisulfite sequencing (RRBS) and HumanMethylation450 BeadChip (HM450) largely capture CpG sites surrounding promoters, but few CpG sites in the genic and intergenic regions of these genes. MBDCap is not fully quantitative and relies on accurate analysis and interpretation of the raw signal. (C) Individual sequencing reads allow the separation of DNA methylation data by genomic sequence (e.g., single nucleotide polymorphisms; SNP), demonstrating the phenomenon of allele-specific DNA methylation. (D) Heterogeneous methylation (defined as either sporadic methylation within an individual DNA molecule or differential levels of methylation between individual DNA molecules) can be observed in WGBS data, as can the unique information obtained from the forward- and reverse-sequencing strands.

may have consequences for data interpretation, potentially leading to an overestimation of DNA methylation levels. However, innovative detection methods are being developed, such as those that allow specific detection of 5mC and 5hmC [50,51], which open up future possibilities to develop whole-genome approaches to assess all methylation modifications simultaneously.

Bioinformatics
A particular challenge of any genome-wide approach is the downstream computational requirements for obtaining meaningful outcomes. The main disadvantages of WGBS at the present time are the onerous computational resources needed for read alignment [43,52–54], and the current need to develop custom bioinformatics scripts. Additionally, WGBS studies to date have performed few, if any, replicates, which severely limits statistical power and the ability to distinguish actual alterations from biological variability [55]. RRBS also requires bioinformatics expertise for analysis; however, the greatly reduced amount of data produced per experiment requires comparatively modest computational resources [56]. MBDCap-Seq requires less bioinformatics expertise and can be analyzed using established algorithms [41,57]. However,

[Figure 5 graphic, panel content reconstructed.]
(A) Share of all cancer methylomes by cancer type: bladder 2%; blood 4%; brain 11%; breast 14%; cervix 2%; colon 6%; endometrium 1%; head and neck 4%; kidney 9%; liver 1%; lung 12%; nerve 0%; ovary 8%; pancreas 1%; prostate 3%; rectum 2%; sarcoma 1%; skin 4%; stomach 4%; thyroid 7%; uterine 6%.
(B) Bar chart of the number of methylomes (y axis, 0–1200) for each cancer type (x axis), colored by technique (WGBS; MeDIP-Seq/MBDCap-Seq; RRBS; HM450K). Annotation: WGBS (4).
(C) Methylation data portals and online resources:
Cancer methylome system: http://cbbiweb.uthscsa.edu/KMethylomes/
Encyclopedia of DNA elements (ENCODE): http://genome.ucsc.edu/ENCODE/
Gene expression omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
The cancer genome atlas (TCGA): http://cancergenome.nih.gov/
International cancer genome consortium (ICGC): http://icgc.org/
Array express: http://www.ebi.ac.uk/arrayexpress/

Figure 5. Cancer methylomes. (A) Cancer methylomes of at least 21 broad cancer types have been completed, representing >8000 individual data sets. Most DNA methylomes have been produced for breast (approximately 14% of all methylomes), lung (approximately 12% of all methylomes), and brain tumors (approximately 11% of all methylomes). Data from rare cancers are also beginning to be generated (nerve; one methylome, approximately 0.01% of all methylomes). The data are expressed as a percentage of all methylomes produced, regardless of tumor origin, and show a wide distribution of methylomes across a broad range of cancer types. (B) We compared the techniques used to generate each methylome. HumanMethylation450 BeadChip (HM450; green) clearly dominates as the method of choice for high-throughput methylation studies. Currently, only four whole-genome bisulfite sequencing (WGBS; orange) data sets have been produced (colon). MBD capture sequencing (MBDCap-Seq; purple) has been used to measure DNA methylation in blood, brain, endometrial, and lung tumors, whereas the use of reduced representation bisulfite sequencing (RRBS; blue) has been limited to blood cancer. (C) Key online resources for accessing publicly available methylation data are summarized.

MBDCap-Seq is not fully quantitative and, therefore, relies on accurate analysis and interpretation of the raw signal. In particular, failure to control for copy-number alterations can lead to inaccuracies in methylation measurements, an issue that affects cancer samples [57]. More mature bioinformatics analysis pipelines exist for HM450 [58–60], and these pipelines already include normalization measures to analyze data [49,59], meaning that these arrays may be the most accessible genome-wide DNA methylation assay.

Sequencing coverage
The amount of sequencing needed to yield meaningful results differs substantially between techniques. The main disadvantage of WGBS at the present time is the cost of sequencing, which requires >500 million reads (100 bp paired-end) per sample (approximately 30× coverage), or approximately three sequencing lanes on the Illumina HiSeq. At a shallow sequencing depth (15× coverage), regions of high and low average methylation can be quantitated, whereas at a deep sequencing depth (30×), individual CpG sites can be accurately quantitated. With sufficient coverage, it is possible to apply adaptations of genomic variant detection algorithms [61] to interrogate the genotype and methylation status of the samples simultaneously, enabling applications such as the assessment of allele-specific methylation (Figure 4C). By contrast, MBDCap-Seq only requires short read chemistry (50 bp single-end) and a relatively shallow sequencing depth (approximately 30 million reads per sample), allowing six samples to be multiplexed per HiSeq lane. RRBS, in turn, requires only 10 million reads per sample (Figure 2). Notably, the sequencing depth required correlates with the genome coverage capability of each approach. HM450 arrays do not rely on high-throughput sequencing for data generation.

Summary of cancer methylome studies
To date, approximately 8000 cancer methylomes have been generated (Figure 5A). Most major cancers have at least one representative methylome, with no one type being overrepresented as a proportion of all methylomes available (Figure 5A). However, it is clear that the HM450 arrays dominate studies investigating cancer methylomes (Figure 5B). Indeed, the Cancer Genome Atlas consortium (TCGA; http://cancergenome.nih.gov) is a portal for understanding the genomic basis of more than 200 human cancer types. Among the massive data sets that are accessible to all researchers, TCGA has profiled the DNA methylome in approximately 7500 samples using the HM450 methodology [62–66]. These data sets largely comprise the newer, HM450 array. To date, only two deeply sequenced WGBS methylomes of primary tumors have been completed [30,32], along with three shallowly sequenced WGBS tumor methylomes (all colon; Figure 5B) [31] and approximately 55 RRBS analyses, of which most investigate primary blood cancers (Figure 5B). However, the limited number of WGBS cancer methylomes is likely to change drastically as the cost of the technology falls and the ease of bioinformatics analysis improves. A summary of DNA methylation data portals is shown in Figure 5C.
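The coverage percentages and sequencing depths quoted for each platform can be checked with simple arithmetic. The sketch below uses the CpG counts reported in the text and Figure 3B; the function names are hypothetical, the haploid genome size of 3.1 Gb is an approximation assumed for illustration, and the depth estimate treats the >500 million figure as read pairs and ignores duplicates and unmapped reads.

```python
TOTAL_CPG_SITES = 28_217_009  # CpG sites on autosomes and sex chromosomes

# Total CpG sites assayed by each technique (fully methylated DNA).
CPGS_ASSAYED = {
    "WGBS": 27_031_201,
    "MBDCap-Seq": 5_040_791,
    "RRBS": 1_054_278,
    "HM450": 482_422,
}


def percent_of_genome_cpgs(sites_assayed):
    """Percentage of all genomic CpG sites covered by a technique."""
    return 100.0 * sites_assayed / TOTAL_CPG_SITES


def mean_depth(read_pairs, read_length_bp, genome_size_bp=3.1e9):
    """Approximate raw fold coverage from paired-end sequencing.

    The genome size is an assumed approximation, not a value from the text.
    """
    return read_pairs * 2 * read_length_bp / genome_size_bp


if __name__ == "__main__":
    for name, sites in CPGS_ASSAYED.items():
        print(f"{name}: {percent_of_genome_cpgs(sites):.1f}% of CpG sites")
    depth = mean_depth(500e6, 100)
    print(f"WGBS, 500 million 100 bp read pairs: about {depth:.0f}x raw depth")
```

Under these assumptions, 500 million 100 bp read pairs correspond to roughly 30-fold raw coverage, in line with the figure quoted for WGBS.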
What have we learnt from cancer methylome studies?
The development of next-generation sequencing technologies and the ability to map changes in DNA methylation across many cancer types have led to huge advances in knowledge. DNA methylation studies have revealed that changes are not restricted to CpG island promoters, but occur genome wide, including in genic and intergenic regions. The intergenic space is vast and houses distal regulatory elements, including enhancers and noncoding RNA genes, and is a frequent site of mutation hotspots in cancer [67,68]. It is now clear that DNA methylation in distal regulatory regions is also associated with transcriptional regulation. Methylation in genic or exonic regions is also associated with changing levels of transcription, where high methylation occurs in active genes and lower methylation in repressed genes [22]. Somatic mutations in noncoding regions add another dimension to the complexity of deregulation of the cancer epigenome, given that mutation hotspots can be caused by DNA methylation [69,70] and that genetic mutations can be strongly associated with changes in methylation patterns [9,66,71–74]. Cancer methylomes now face finer interpretation as we try to understand architectural differences, such as long-range epigenetic silencing (LRES; [75]) or long-range epigenetic activation (LREA; [76]), as well as discrete changes, such as atypical DNA methylation at localized CpG sites, partially methylated domains (PMDs; [30,31,77]), and DMRs [31] that may be responsible for disabling or enabling key gene regulatory elements. The identification of DNA methylation valleys (DMVs) in embryonic stem cells points to novel genomic features that may also be evident in tumor methylomes [36]. Altered cancer methylomes are commonly associated with changes in transcriptional output and altered genomic stability. Indeed, cancer cells undergo a multitude of step-wise and cumulative methylation changes that impinge on crucial biological pathways that potentially influence proliferation rates, response to extracellular signals, and the response to DNA damage. Yet, not all aberrant DNA methylation changes drive disease. It is, and will be, important to distinguish driver from passenger roles [78], which will enable an even more precise stratification of cancer subtypes [66,79] and personalized therapeutic programs [9,80]. One of the first studies investigating the role of DNA methylation drivers and passengers demonstrated that cancer cells are potentially addicted to the modified epigenome [78]. Future analyses will reveal the specific DNA methylation signatures that are either associated with or drive the survival capacity of cancer cells. In the meantime, distinct methylation patterns are already being used to classify distinct subtypes [81–84]. For example, the CpG Island Methylator Phenotype (CIMP), first described in colorectal cancer [85] and evident in many other cancer types [86], indicates that DNA methylation is potentially useful for disease classification. In fact, CIMP has recently been reported to be associated with underlying genetic mutations, such as somatic isocitrate dehydrogenase-1 (IDH1) mutations and mutations in ten–eleven translocation (TET) methylcytosine dioxygenase-2 (TET2).

Advances in genome-wide DNA methylation technology have also enabled new strategies for the identification of novel early diagnostic and prognostic cancer biomarkers [87,88]. Already, the measurement of promoter hypermethylation of individual genes has been successfully implemented in the clinic. For example, the glutathione S-transferase P1 (GSTP1) gene is methylated in >90% of prostate cancers [89] and Septin 9 (SEPT9) is hypermethylated in colorectal cancer; both are currently being used for early cancer detection in tissue samples and body fluids [90]. Moreover, promoter hypermethylation of the MGMT DNA-repair gene is a clear predictor of tumor responsiveness to alkylating agents in patients with glioblastoma [91,92]. These examples highlight the promise of translating epigenetic markers into a clinical setting, especially given that the deregulation of cellular epigenetic patterns is an early event in carcinogenesis.

Concluding remarks and future perspectives
The advent of genome-wide approaches to map the cancer methylome, and the ability to identify differentially methylated loci, is leading to the development of panels of biomarkers that increase the specificity and sensitivity for improved diagnostic potential [93,94]. In cancer treatment, one of the major challenges is to stratify tumor types, because most cancer subtypes do not behave as a single entity in response to current therapies. The ability to identify epigenetic events associated with survival from archival cancer samples is revealing epigenetic prognostic signatures that can be used to cluster subtypes upon diagnosis to enable better treatment options. The future production of cancer methylomes, especially with detailed information on the approximately 28 million CpG sites in each different cancer cell type, will further advance understanding of the role of DNA methylation in epigenetic-based molecular function and disease progression.
Ultimately, however, the choice of which whole-genome methylation approach to use will depend on the quantity and quality of DNA available, access to next-generation sequencing, bioinformatics expertise, cost, and, finally, consideration of the question being asked and the required coverage of the genome.
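To make the sensitivity and specificity trade-off of a biomarker panel concrete, the following is a minimal Python sketch and not a method taken from any study cited here. The sample set, the two markers, and the function names `panel_call` and `sensitivity_specificity` are all hypothetical and chosen purely for illustration; real panels are evaluated on far larger cohorts with quantitative methylation measurements.

```python
from typing import Sequence, Tuple


def panel_call(marker_states: Sequence[bool], min_positive: int = 1) -> bool:
    """Call a sample positive if at least `min_positive` markers are methylated."""
    return sum(marker_states) >= min_positive


def sensitivity_specificity(
    calls: Sequence[bool], is_tumor: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of binary calls against the known status."""
    tp = sum(1 for c, t in zip(calls, is_tumor) if c and t)
    fn = sum(1 for c, t in zip(calls, is_tumor) if not c and t)
    tn = sum(1 for c, t in zip(calls, is_tumor) if not c and not t)
    fp = sum(1 for c, t in zip(calls, is_tumor) if c and not t)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one tumor and one normal sample")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: ((marker A methylated, marker B methylated), is_tumor)
SAMPLES = [
    ((True, False), True),
    ((False, True), True),
    ((True, True), True),
    ((False, False), True),
    ((False, False), False),
    ((False, False), False),
    ((True, False), False),
    ((False, False), False),
]

truth = [status for _, status in SAMPLES]
marker_a_only = [markers[0] for markers, _ in SAMPLES]
either_marker = [panel_call(markers, min_positive=1) for markers, _ in SAMPLES]
both_markers = [panel_call(markers, min_positive=2) for markers, _ in SAMPLES]

for label, calls in [
    ("marker A alone", marker_a_only),
    ("panel, either marker", either_marker),
    ("panel, both markers", both_markers),
]:
    sens, spec = sensitivity_specificity(calls, truth)
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy data, calling a sample positive when either marker is methylated raises sensitivity relative to marker A alone, whereas requiring both markers raises specificity at the cost of sensitivity. How a real panel combines its markers is a design decision that depends on the clinical question.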
Acknowledgments
We thank Elena Zotenko for the MBDCap-Seq coverage analysis and members of the Clark Laboratory for helpful discussions and careful reading of the manuscript. P.C.T. is a Cancer Institute NSW Career Development Fellow. S.J.C. is a National Health and Medical Research Council (NH&MRC) Senior Principal Research Fellow. This work was further supported by NH&MRC Project Grants (to S.J.C., P.C.T., and C.S.).

References
1 Bird, A.P. (1986) CpG-rich islands and the function of DNA methylation. Nature 321, 209–213
2 Li, E. et al. (1993) Role for DNA methylation in genomic imprinting. Nature 366, 362–365
3 Mohandas, T. et al. (1981) Reactivation of an inactive human X chromosome: evidence for X inactivation by DNA methylation. Science 211, 393–396
4 Gartler, S.M. and Riggs, A.D. (1983) Mammalian X-chromosome inactivation. Annu. Rev. Genet. 17, 155–190
5 Swain, J.L. et al. (1987) Parental legacy determines methylation and expression of an autosomal transgene: a molecular mechanism for parental imprinting. Cell 50, 719–727
