
Citation: Michael-Christian Mörl, Tilo Zülske, Robert Schöpflin, Gero Wedemann. Data formats for modelling the spatial structure of chromatin based on experimental positions of nucleosomes[J]. AIMS Biophysics, 2019, 6(3): 83-98. doi: 10.3934/biophy.2019.3.83
The spatial structure of chromatin has an important impact on gene regulation [1]. It is closely linked to the positions of nucleosomes in the genome, which are actively regulated by cells [2]. Computer models based on Monte Carlo, Brownian Dynamics or Molecular Dynamics can be applied to investigate the connection between the positions of nucleosomes and the spatial structure of chromatin [3],[4]. Computer models differ with respect to their resolution level, among other properties. For chains with many nucleosomes, coarse-grained models are used, whereas molecular models with atomic resolution are more detailed and can represent single nucleosomes. In the present work, we study the usage of coarse-grained models, beginning with the experimental determination of nucleosome positions and ending with the analysis of data from computer simulations. We focus on the level of a particular locus in the genome of a specific cell type. For all steps in this process, a multitude of bioinformatics and computational biophysics tools are applied that require different input formats and deliver various output formats, which makes the process tedious [5].
For the development of data formats, we analysed the workflow used in our research group as an example. We identified three domains: experimental, simulation and analysis (Figure 1). Each domain involves several steps generating different artefacts.
The experimental domain is where chromatin is prepared and experimentally investigated. Nucleosomes are isolated from chromatin by applying, e.g., micrococcal nuclease (MNase), and the DNA fragments are isolated and sequenced to generate raw data in the form of short reads. Nucleosome positions are determined by mapping the obtained sequences to a reference genome using dedicated software [6]–[8] or are taken from existing nucleosome position databases [9],[10]. These positions usually overlap, reflecting different positions in different cells of a sample as well as experimental shortcomings. For computer models, individual non-overlapping nucleosome positions at a particular locus of interest are identified using appropriate software tools [11].
The simulation domain comprises all steps required to prepare and perform a computer simulation of a locus. Locus files contain the nucleosome positions for a genomic region of interest. Additionally, a presetting file contains the settings for the 3D computer simulation model and the simulation process, such as the nucleosome geometry and the parameters for potentials and simulations. Locus and presetting parameters are aggregated into a single blueprint, and the simulation software creates a trajectory that contains the positions and orientations of the model elements for all simulation steps.
In the exchange domain, simulation results are made accessible to other research groups. This can be done by converting the trajectory into an easy-to-understand exchangeable format, although this requires more disk space. Published data can then be used for visualisation and analysis.
Computer simulations are performed differently in different research groups [12],[13], utilising various data formats and different chromatin models [14]–[16]. In the present work, we propose common data formats for presetting and locus parameters, and an exchange trajectory, all based on the Extensible Markup Language (XML) [17]. We believe that data can be expressed in a uniform way for different applications, since the information contained within is essentially of a universal structure (Figure 2). We demonstrate the viability of the approach herein using a working example. The value for other groups is corroborated by examples of Extensible Stylesheet Language Transformations (XSLT) [18] for the analysis of exchange trajectories and their transformation into other data formats. The XML Schema files and examples of XML files are published in open source repositories.
The locus, presetting and exchange trajectory data formats should be freely available, which means that (1) the format specification is accessible without restriction, (2) use is not subject to any fees or licenses, and (3) the format is interoperable (i.e., platform-, hardware- and software-independent). A data file must be correct in terms of its structure and content. The clarity and usefulness of data can be assessed by means of a defined syntax and grammar. Data for an exchange trajectory can be validated utilising a schema that describes the valid types and value ranges. In order to ensure the usability of data, data formats must be not only machine-readable but also human-readable (i.e., text instead of a binary format). Broad and advanced software support should be available (e.g., communities, libraries, etc.) to minimise the effort for processing an exchange trajectory. All the above criteria are met by the Extensible Markup Language (XML) [17], specified by the World Wide Web Consortium (W3C) [19]. The aforementioned data schema can be defined using an XML Schema.
In software development, object-oriented methods are regularly used to model structure and behaviour. The Unified Modelling Language (UML) is a widespread standard [20]. In this article, we use UML class diagrams to model classes and their relations as the foundation for the XML formats. See the UML standard [20] for details, since most software engineering textbooks and even Wikipedia use the outdated UML 2.0 standard.
A class defines common properties of objects. In this context, objects are also referred to as instances of a class. For example, Figure 3 shows the nucleosome as a class with the properties pos and mpos. The text after the colon describes the type of the property, e.g., integer. Instances of a class carry its properties with concrete values. A class may be labelled as <abstract>, which means that it cannot be instantiated itself; only classes derived from it can be instantiated (see section Associations between Objects). An example of an abstract class is the class Bead in Figure 6. Here, only DNABead and NucleosomeBead can be instantiated.
In UML, different relations between objects can be modelled.
An association denotes a link between two classes and is depicted by a line between the two classes. A filled dot at a class signifies that the other class of the association owns the association. For example, in Figure 5, Presetting owns the association to ParameterList. If no numbers or symbols are attached to the end of the line, exactly one instance of a class is associated with the other. A star indicates that an indefinite number of instances of that class are associated. For example, in Figure 4, the association between Locus and Nucleosome indicates that a single instance of Locus can be associated with an indefinite number of instances of Nucleosome.
Inheritance describes a specialisation of an object A to an object B. One also says that B is derived from A. This means that B inherits the properties and associations of A and has additional properties or associations that are specific to B. A line with an open triangle as arrowhead pointing towards the generalisation graphically represents an inheritance. As an example of inheritance, the class DNABead in Figure 6 is derived from Bead.
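For readers more familiar with code than with UML, the notions of abstract class, inheritance and one-to-many association can be mirrored in a few lines of Python. This is only an illustrative sketch: the class names Bead, DNABead and NucleosomeBead are taken from Figure 6, whereas the `position` attribute, the `kind` method and the `Chain` container are assumptions introduced here for the example and are not part of the proposed formats.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Tuple


class Bead(ABC):
    """Abstract base class: it cannot be instantiated directly."""

    def __init__(self, position: Tuple[float, float, float]):
        self.position = position  # hypothetical property shared by all beads

    @abstractmethod
    def kind(self) -> str:
        """Return the name of the concrete bead type."""


class DNABead(Bead):
    """Derived class: inherits `position` from Bead."""

    def kind(self) -> str:
        return "DNA"


class NucleosomeBead(Bead):
    """Derived class: inherits `position` from Bead."""

    def kind(self) -> str:
        return "nucleosome"


@dataclass
class Chain:
    """Hypothetical owner of a one-to-many ('*') association to Bead."""
    beads: List[Bead] = field(default_factory=list)


chain = Chain([DNABead((0.0, 0.0, 0.0)), NucleosomeBead((10.0, 0.0, 0.0))])
kinds = [bead.kind() for bead in chain.beads]

# Instantiating the abstract class itself is rejected with a TypeError.
try:
    Bead((0.0, 0.0, 0.0))
    abstract_rejected = False
except TypeError:
    abstract_rejected = True
```

The list attribute plays the role of the star at the end of an association line, and the failed instantiation of `Bead` corresponds to the abstract label in the class diagram.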
Herein, we propose common data formats for locus, presetting and exchange trajectory files. For illustration purposes, we demonstrate our format for the blueprint.
A locus is located on a chromosome and is delimited by a first and a last nucleotide (base). It contains the positions of identified nucleosomes. The position of a nucleosome is defined by the positions of the two bases that delimit the nucleosome in the genome. The positions are derived experimentally, and these relations can be modelled in a straightforward way using two classes with a one-to-many association, as shown in Figure 3. The results are presented using an XML schema, as shown in Figure 4.
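The two-class structure described above can be sketched in Python as follows. The element and attribute names (`locus`, `nucleosome`, `chromosome`, `start`, `end`) are assumptions chosen for readability and do not reproduce the published schema. The locus coordinates are those of the example locus analysed later in this article, while the two nucleosome positions are invented placeholders.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Nucleosome:
    start: int  # first base delimiting the nucleosome (assumed name)
    end: int    # last base delimiting the nucleosome (assumed name)


@dataclass
class Locus:
    chromosome: str
    first_base: int
    last_base: int
    nucleosomes: List[Nucleosome] = field(default_factory=list)  # one-to-many

    def to_xml(self) -> ET.Element:
        """Serialise the locus; reject nucleosomes outside its boundaries."""
        root = ET.Element("locus", chromosome=self.chromosome,
                          start=str(self.first_base), end=str(self.last_base))
        for n in self.nucleosomes:
            if not (self.first_base <= n.start < n.end <= self.last_base):
                raise ValueError(f"{n} lies outside the locus")
            ET.SubElement(root, "nucleosome",
                          start=str(n.start), end=str(n.end))
        return root


# Placeholder nucleosome positions, for illustration only.
locus = Locus("chr14", 47589000, 47593031,
              [Nucleosome(47589100, 47589246),
               Nucleosome(47589300, 47589446)])
xml_text = ET.tostring(locus.to_xml(), encoding="unicode")
```

In the proposed workflow, constraints of this kind are expressed in the XML Schema rather than in program code, so that any schema-aware tool can check a locus file.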
The presetting file defines the settings for the simulation model and the simulation process. These are mostly the geometry of the DNA at the nucleosomes, the parameters of the different energy potentials, and the simulation parameters [12]. The geometry of a nucleosome is defined as the location and orientation of the DNA at the nucleosome, and can be described by seven angles (α, β, γ, δ, ε, φ and ζ), the segment length in the elastically relaxed state, and the distance between the centre of gravity of the nucleosome and the segment centre (Figure 7) [21]. The nucleosome potential is determined by selecting one of the models implemented in our simulation software (e.g., Lennard-Jones or Gay-Berne). To exclude invalid angles or distances, the data schema uses its own data types for floating point values. It is also possible to define additional simulation parameters. The presetting structure is shown schematically in Figure 5.
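The idea of dedicated data types that exclude invalid angles or distances can be illustrated with a small Python sketch. In an XML Schema, such types are typically declared as restrictions of a floating point type with facets such as `minInclusive` and `maxInclusive`. The concrete ranges used below are assumptions made for the example and are not taken from the published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedFloat:
    """A float restricted to the closed interval [low, high]."""
    low: float
    high: float

    def parse(self, text: str) -> float:
        """Convert text to float and reject values outside the interval."""
        value = float(text)
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value} is outside [{self.low}, {self.high}]")
        return value


# Assumed value ranges, for illustration only.
ANGLE_DEG = BoundedFloat(-180.0, 180.0)
DISTANCE_NM = BoundedFloat(0.0, float("inf"))

alpha = ANGLE_DEG.parse("26.0")      # accepted
distance = DISTANCE_NM.parse("3.3")  # accepted

try:
    ANGLE_DEG.parse("400")           # rejected: not a valid angle here
    rejected = False
except ValueError:
    rejected = True
```

Declaring the ranges once in the schema means that every presetting file can be checked before a simulation is started, independent of the simulation software.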
An exchange trajectory file contains the results of a computer simulation, in particular the type, positions and orientations of all elements for different simulation steps. It includes information about the origin of the data (i.e., the conversion of a particular trajectory format, the time stamp of the conversion, the name and version of the converter, and the version of the data schema on which the exchange trajectory is based). The parameter list contains parameters for simulations as given in the presetting file. Parameters can be changed during the simulation (e.g., the temperature in simulations can be altered by applying simulated annealing) [22]. Altered parameters can be associated to configurations in simulation steps, at which point new parameter values are valid. The resulting structure of an exchange trajectory can be represented as a UML class diagram (Figure 7).
The workflow and the suitability of the schema files are demonstrated using an example locus. Eight nucleosomes are defined, and their positions are included in a locus.xml file (Figure 8), whose correctness was validated against the corresponding XML Schema file (Figure 4). The simulation conditions are specified in a presetting.xml file (Figure 9).
The locus and presetting files were used to generate a blueprint.xml file (Figure 10). For the generation process, an internal tool from our simulation software was employed. The blueprint was used to generate a trajectory format suitable as an input for our simulation software. All steps described so far were performed prior to the simulation being executed. For the example simulation, 5000 Monte Carlo steps were performed. The results are in the form of an extended trajectory file containing the generated configurations, and this is converted to an exchange trajectory XML file (Figure 11).
To demonstrate the possible usage of the file format, we show two examples utilising XSLT for the analysis and transformation of a trajectory. In the first example, the end-to-end distance of the nucleosome chain is computed (Figure 12). This is the Euclidean distance between the starting point of the first segment and the end point of the last segment.
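The same quantity can also be computed outside XSLT. The following Python sketch is not the XSLT stylesheet of Figure 12. It uses the standard library XML parser instead and operates on a heavily simplified, hypothetical trajectory layout. The element and attribute names (`trajectory`, `configuration`, `segment`, `x1` … `z2`) are assumptions and must be adapted to the published schema.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical, simplified exchange trajectory with two configurations.
EXAMPLE = """
<trajectory>
  <configuration step="0">
    <segment x1="0" y1="0" z1="0" x2="3" y2="0" z2="0"/>
    <segment x1="3" y1="0" z1="0" x2="3" y2="4" z2="0"/>
  </configuration>
  <configuration step="2000">
    <segment x1="0" y1="0" z1="0" x2="0" y2="0" z2="2"/>
    <segment x1="0" y1="0" z1="2" x2="0" y2="0" z2="5"/>
  </configuration>
</trajectory>
"""


def end_to_end_distances(xml_text):
    """Return one end-to-end distance per configuration."""
    root = ET.fromstring(xml_text)
    result = []
    for configuration in root.iter("configuration"):
        segments = configuration.findall("segment")
        first, last = segments[0], segments[-1]
        # Starting point of the first segment, end point of the last one.
        start = [float(first.get(k)) for k in ("x1", "y1", "z1")]
        end = [float(last.get(k)) for k in ("x2", "y2", "z2")]
        result.append(math.dist(start, end))  # Euclidean distance
    return result


distances = end_to_end_distances(EXAMPLE)
```

For the two toy configurations, the function returns a distance of 5.0 each, since the chain ends are displaced by (3, 4, 0) and (0, 0, 5), respectively.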
As a second example, we transformed the exchange trajectory into another trajectory format, VTF [23], which can be used with the VMD viewer [24]. This can also be done easily with XSLT (Figure 13).
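To give an impression of what such a transformation has to produce, the following Python sketch writes a minimal VTF text from a list of configurations. It is not the XSLT stylesheet of Figure 13. The bead names `DNA` and `NUC`, the fixed radius and the linear bonding of consecutive beads are assumptions made for this example, and the generated keywords should be checked against the VTF documentation [23] before use.

```python
def to_vtf(frames, bead_names):
    """Build VTF text.

    frames: list of configurations, each a list of (x, y, z) tuples,
            one tuple per bead in the order given by bead_names.
    bead_names: one name per bead (hypothetical naming scheme).
    """
    lines = []
    # Structure block: one atom line per bead.
    for index, name in enumerate(bead_names):
        lines.append(f"atom {index} radius 1.0 name {name}")
    # Bond consecutive beads to form a linear chain.
    for index in range(len(bead_names) - 1):
        lines.append(f"bond {index}:{index + 1}")
    # One ordered timestep block per configuration.
    for frame in frames:
        if len(frame) != len(bead_names):
            raise ValueError("each frame needs one coordinate per bead")
        lines.append("timestep ordered")
        for x, y, z in frame:
            lines.append(f"{x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"


names = ["DNA", "NUC", "DNA"]
frames = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 1.0, 0.0), (9.0, 3.0, 0.0)],
]
vtf_text = to_vtf(frames, names)
```

In practice, the coordinates would be read from the exchange trajectory rather than written by hand.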
As an example for the whole workflow, we analysed a locus in mouse cells on chromosome 14 from base pair 47589000 to 47593031 (genome mm9). We evaluated experimental data from MNase-seq experiments on mouse embryonic stem cells (ESC) (GSM1004653), neural precursor cells (NPC) (GSM1004652) and mouse embryonic fibroblasts (MEF) (GSM1004654) (BED files, zero-based coordinates, right-open interval) [25],[26]. Nucleosome positions were determined with the software NucPosSimulator using default settings [11], which identified between 22 and 24 nucleosomes (Table 1). We simulated the three cell types according to the described workflow with our simulation software [3] for 1000000 Monte Carlo steps, saved every 2000th step, and transformed the last configuration of each exchange trajectory for visualisation with the raytracer POV-Ray [27] (Figure 14). Moreover, we analysed the average end-to-end distances of the chain. We excluded from the analysis the first 25 values, corresponding to the 50000 steps needed for equilibration [28]. The results are presented in Table 1. Surprisingly, the chromatin model for MEF cells has the largest end-to-end distance despite having the largest number of nucleosomes; this is caused by the positions of the nucleosomes.
Type of Cell | End-To-End Distance in nm | #Nucleosomes |
ESC | 90.50 ± 32.31 | 22 |
NPC | 89.44 ± 29.30 | 23 |
MEF | 97.68 ± 28.67 | 24 |
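The averaging step behind Table 1 can be sketched in Python as follows. The bookkeeping follows the numbers given in the text: 1000000 steps saved every 2000 steps yield 500 values, of which the first 25 (50000 steps) are discarded. Two points are assumptions of this sketch: that the ± values denote the sample standard deviation, and the synthetic distance values, which are placeholders and not simulation results.

```python
import statistics

TOTAL_STEPS = 1_000_000
SAVE_INTERVAL = 2000          # one configuration saved every 2000 steps
EQUILIBRATION_STEPS = 50_000  # discarded at the start of the trajectory


def summarise(distances):
    """Return (mean, sample standard deviation) after equilibration."""
    skip = EQUILIBRATION_STEPS // SAVE_INTERVAL  # 25 saved values
    production = distances[skip:]
    return statistics.mean(production), statistics.stdev(production)


n_saved = TOTAL_STEPS // SAVE_INTERVAL  # 500 saved values

# Synthetic placeholder data: 25 equilibration values, then 475 values
# alternating between 90 and 100 nm, plus a single value of 95 nm.
distances = [200.0] * 25 + [90.0, 100.0] * 237 + [95.0]

mean_nm, std_nm = summarise(distances)
```

With these placeholder values, the mean is 95.0 nm and the standard deviation is 5.0 nm. The large equilibration values at the start do not affect the result because they are excluded.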
In this article, we present data formats for use in a workflow for computer simulation of chromatin based on experimental nucleosome positions that are easy to read and process. The formats establish inputs and outputs for the simulation domain as XML schemas. We used an example workflow on a locus to demonstrate the viability of our approach. Specifically, three XSL transformations demonstrated the ease with which the XML format can be processed. XML is superior to other text formats such as JSON [29] since it has standardised schema definitions and permits transformations using XSLT. Sophisticated libraries exist for nearly all programming languages. XML is also the basis of standard file formats used by the systems biology community, notably the Systems Biology Markup Language (SBML) [30]. On the downside, XML files tend to require much more disk space than terser formats such as our internal trajectory format or binary formats (Table 2). However, these terser formats are less readable, which hinders analysis by other groups. For the publication of trajectories, we recommend publishing only uncorrelated configurations and compressing the trajectory using standard tools such as bzip2 to minimise disk space.
#Configurations | Block-Trajectory | XML-Trajectory | Compressed |
100 | 2.55 MB | 26.5 MB | 895.6 KB |
1250 | 31.33 MB | 327.8 MB | 10.8 MB |
2500 | 63.00 MB | 655.1 MB | 21.9 MB |
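According to Table 2, the XML trajectory is roughly ten times larger than the block trajectory, while the compressed XML file shrinks to roughly 3% of the uncompressed XML size. The effect of compression can be reproduced in principle with the bz2 module of the Python standard library, which implements the same algorithm as the bzip2 tool. The sample below is a synthetic, highly repetitive XML fragment with hypothetical element names, so the ratio it yields is not representative of real trajectories.

```python
import bz2


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size."""
    return len(bz2.compress(data)) / len(data)


# Synthetic, highly repetitive sample (hypothetical element names).
sample = (
    b"<configuration>"
    + b'<segment x="1.000" y="2.000" z="3.000"/>' * 1000
    + b"</configuration>"
)

ratio = compression_ratio(sample)

# Compression is lossless: decompressing restores the original bytes.
restored = bz2.decompress(bz2.compress(sample))
```

Because compression is lossless, a compressed exchange trajectory remains fully usable with XML tools after decompression.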
The proposed formats are limited to systems containing DNA and nucleosomes. Next-generation computer models include explicit models of linker histones [31], CTCF or Cohesin [32]. The format will be advanced in the future to include these elements. The proposed file format has possible shortcomings that may inhibit use by other groups. For example, the angles describing the geometry of nucleosomes are very specific for our model. However, chromatin has been investigated for more than 20 years using computer simulations [33], and a common file format for model and simulation results has not yet been proposed. Simulation data are usually not published, and the present work is the first to attempt to establish a common format. We intend to publish all future data in the proposed format and invite other groups to participate in this endeavour.
All mentioned files are included in the supplemental material and are publicly available at https://github.com/HOSTbioinformatics/ExchangeFormats
[1] | Lanctôt C, Cheutin T, Cremer M, et al. (2007) Dynamic genome architecture in the nuclear space: regulation of gene expression in three dimensions. Nat Rev Genet 8: 104–115. doi: 10.1038/nrg2041 |
[2] | Diermeier S, Kolovos P, Heizinger L, et al. (2014) TNFα signalling primes chromatin for NF-κB binding and induces rapid and widespread nucleosome repositioning. Genome Biol 15: 536. doi: 10.1186/s13059-014-0536-6 |
[3] | Müller O, Kepper N, Schöpflin R, et al. (2014) Changing chromatin fiber conformation by nucleosome repositioning. Biophys J 107: 2141–2150. doi: 10.1016/j.bpj.2014.09.026 |
[4] | Bajpai G, Padinhateeri R (2018) Irregular chromatin: packing density, fiber width and occurrence of heterogeneous clusters. bioRxiv: 453126. |
[5] | Busch N, Wedemann G (2009) Modeling genomic data with type attributes, balancing stability and maintainability. BMC Bioinf 10: 97. doi: 10.1186/1471-2105-10-97 |
[6] | Teif VB (2015) Nucleosome positioning: resources and tools online. Briefings Bioinf 17: 745–757. |
[7] | Bowtie: Bowtie 2: fast and sensitive read alignment. Available from: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml. |
[8] | BWA-Mapping: BWA Mapper. Available from: https://www.ridom.de/u/BWA_Mapper.html. |
[9] | Zhao Y, Wang J, Liang F, et al. (2019) NucMap: a database of genome-wide nucleosome positioning map across species. Nucleic Acids Res 47: D163–D169. doi: 10.1093/nar/gky980 |
[10] | Marti-Renom MA, Almouzni G, Bickmore WA, et al. (2018) Challenges and guidelines toward 4D nucleome data and model standards. Nat Genet 50: 1352. doi: 10.1038/s41588-018-0236-3 |
[11] | Schöpflin R, Teif VB, Müller O, et al. (2013) Modeling nucleosome position distributions from experimental nucleosome positioning maps. Bioinformatics 29: 2380–2386. doi: 10.1093/bioinformatics/btt404 |
[12] | Rippe K, Stehr R, Wedemann G (2012) Monte Carlo Simulations of nucleosome chains to identify factors that control DNA compaction and access. In: Schlick T, editor, Innovations in Biomolecular Modeling and Simulations. Cambridge: Royal Society of Chemistry, 198–235. |
[13] | Nordenskiöld L (2017) Coarse-Grained Modeling of Biomolecules. In: Papoian GA, editor, Coarse-Grained Modeling of Biomolecules, CRC Press, 297–340. |
[14] | Jung J, Nishima W, Daniels M, et al. (2019) Scaling molecular dynamics beyond 100,000 processor cores for large-scale biophysical simulations. J Comput Chem 40: 1919–1930. |
[15] | Perišić O, Portillo-Ledesma S, Schlick T (2019) Sensitive effect of linker histone binding mode and subtype on chromatin condensation. Nucleic Acids Res 47: 4948–4957. doi: 10.1093/nar/gkz234 |
[16] | Nordenskiöld L, Soman A, Korolev N, et al. (2019) Structure and Dynamics of the Telomeric Nucleosome and Chromatin. Biophys J 116: 71a. |
[17] | W3C: XML Technology. Available from: https://www.w3.org/standards/xml/. |
[18] | W3C: The Extensible Stylesheet Language Family (XSL). Available from: https://www.w3.org/Style/XSL/. |
[19] | W3C: World Wide Web Consortium (W3C). Available from: https://www.w3.org/. |
[20] | Object Management Group: About the Unified Modeling Language Specification Version 2.5.1. Available from: https://www.omg.org/spec/UML/About-UML/. |
[21] | Kepper N, Foethke D, Stehr R, et al. (2008) Nucleosome geometry and internucleosomal interactions control the chromatin fiber conformation. Biophys J 95: 3692–3705. doi: 10.1529/biophysj.107.121079 |
[22] | Stehr R, Kepper N, Rippe K, et al. (2008) The effect of internucleosomal interaction on folding of the chromatin fiber. Biophys J 95: 3677–3691. doi: 10.1529/biophysj.107.120543 |
[23] | Lenz O (2018) The VTF Plugin is a plugin for the VMD software that reads the VTF format. Available from: https://github.com/olenz/vtfplugin/wiki. |
[24] | Lenz O: VMD-Visual Molecular Dynamics. Available from: http://www.ks.uiuc.edu/Research/vmd/. |
[25] | National Center for Biotechnology Information: Data Series GSE40896. Available from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40896. |
[26] | Teif VB, Vainshtein Y, Caudron-Herger M, et al. (2012) Genome-wide nucleosome positioning during embryonic stem cell development. Nat Struct Mol Biol 19: 1185–1192. doi: 10.1038/nsmb.2419 |
[27] | POV-Ray: POV-Ray-The Persistence of Vision Raytracer. Available from: http://www.povray.org/. |
[28] | Wedemann G, Langowski J (2002) Computer simulation of the 30-nanometer chromatin fiber. Biophys J 82: 2847–2859. doi: 10.1016/S0006-3495(02)75627-0 |
[29] | ECMA-404: ECMA-404 The JSON Data Interchange Standard. Available from: https://www.json.org/. |
[30] | Hucka M, Finney A, Sauro HM, et al. (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19: 524–531. doi: 10.1093/bioinformatics/btg015 |
[31] | Bascom GD, Schlick T (2018) 5-Mesoscale Modeling of Chromatin Fibers. In: Lavelle C, Victor J-M, editors. Nuclear Architecture and Dynamics. Boston: Academic Press, 123–147. |
[32] | Bascom GD, Sanbonmatsu KY, Schlick T (2016) Mesoscale modeling reveals hierarchical looping of chromatin fibers near gene regulatory elements. J Phys Chem B 120: 8642–8653. doi: 10.1021/acs.jpcb.6b03197 |
[33] | Ehrlich L, Münckel C, Chirico G, et al. (1997) A Brownian dynamics model for the chromatin fiber. Comput Appl Biosci 13: 271–279. |