UniBind documentation

Last updated: August 9, 2018

What is UniBind?

UniBind is a comprehensive map of direct transcription factor (TF)-DNA interactions in the human genome. These interactions were obtained by uniformly processing thousands of ChIP-seq data sets, from raw reads to high-confidence TF binding site (TFBS) predictions, using the ChIP-eat software. The uniform processing up to ChIP-seq peak calling was performed by ReMap, and the entire collection of ChIP-seq peaks is also available in the ReMap database. ChIP-eat used the MACS2 peak caller to identify ChIP-seq peaks on the hg38 version of the human genome. Next, these genomic regions were analysed with four different TF binding models to predict direct TF-DNA interactions: DiMO-optimized position weight matrices (PWMs), transcription factor flexible models (TFFMs), DNA shape-based models (DNAshaped), and binding energy models. An entropy-based algorithm was used to automatically delineate an enrichment zone containing direct TF-DNA interactions supported by both strong computational evidence and strong experimental evidence. The UniBind database hosts the complete set of TFBS predictions for each prediction model, as well as the models themselves, the original ChIP-seq peaks, and cis-regulatory modules derived from these direct TF-DNA interactions. All the data is publicly available. For further details, please refer to the associated publication (DOI: https://doi.org/10.1093/nar/gky1210).

How was the data processed?

The entire collection of ChIP-seq data sets was uniformly processed in ReMap up to ChIP-seq peak calling, and the resulting peaks are also available in the ReMap database. These peaks served as input for the ChIP-eat data processing pipeline. The complete pipeline is designed to uniformly process ChIP-seq data sets, from raw reads to the identification of direct TF-DNA binding events. It is implemented in the ChIP-eat software, whose source code is freely available at https://bitbucket.org/CBGR/chip-eat/. Specifically, ChIP-eat allows for: (i) aligning and filtering raw ChIP-seq data, (ii) calling ChIP-seq peaks, (iii) training the TFBS computational models, and (iv) automatically defining the enrichment zone in the landscape plots to predict TFBSs. Four TFBS computational models are provided: DiMO-optimized PWMs, TFFMs, DNAshaped models, and binding energy models. Only the ChIP-seq data sets for which a TF binding profile for the targeted TF was available in JASPAR were used for TFBS predictions. The enrichment zone containing high-confidence direct TF-DNA interactions was automatically defined for each data set using an entropy-based algorithm. The diagram below illustrates the processing steps.

What data does UniBind host?

The UniBind database contains millions of transcription factor (TF) binding site (TFBS) predictions across the human genome. These predictions were derived from the uniform processing of 1,983 publicly available ChIP-seq data sets accounting for 232 distinct TFs in 346 cell lines. For each ChIP-seq data set, the user can download the following: the set of predicted direct TF-DNA interactions in BED6 and FASTA formats, the initial set of ChIP-seq peaks, a visual representation of the delineated enrichment zone, and the trained prediction model used. For each ChIP-seq data set present in UniBind, four prediction models were used (see previous section), and the data is available for download for each of the four models individually.
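As an illustration of working with a downloaded BED6 file, the following minimal Python sketch reads the six standard BED columns (chromosome, start, end, name, score, strand). The names Bed6Record and read_bed6 are hypothetical, the example lines are made up, and no assumption is made here about what UniBind stores in the name and score columns.

```python
from typing import Iterable, Iterator, NamedTuple


class Bed6Record(NamedTuple):
    """One line of a BED6 file (hypothetical helper type)."""
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive (BED convention)
    name: str
    score: str   # kept as text; its meaning in UniBind files is not assumed
    strand: str


def read_bed6(lines: Iterable[str]) -> Iterator[Bed6Record]:
    """Yield one record per data line, skipping blank and header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
        chrom, start, end, name, score, strand = fields[:6]
        yield Bed6Record(chrom, int(start), int(end), name, score, strand)


# Made-up example lines; real files would be opened with open(path).
example = [
    "chr1\t1000\t1012\tsite_1\t0\t+",
    "chr2\t500\t511\tsite_2\t0\t-",
]
records = list(read_bed6(example))
lengths = [r.end - r.start for r in records]  # site lengths in bp
```

Because BED intervals are half-open, end - start gives the length of each predicted site directly.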

What does each entry in UniBind contain?

When searching for or clicking on a TF name or a ChIP-seq data set of interest, a page is displayed with all the information available. A summary at the top of the page describes which cell line, TF, data source, and JASPAR profile were used in the prediction models. The user is also provided with external links pointing to more details about this entry’s components.

In the middle section of the page, all the prediction models used for this database entry are displayed in a tabbed layout. Download buttons are also available in this section, from which the user can obtain the set of TF binding site predictions and/or the original set of ChIP-seq peaks. At the bottom of this section, a plot consisting of four panels gives a visual representation of the data through the selected prediction model. Please see the section “How to interpret the enrichment zone and centrality p-value?” for how to interpret these plots.

Below this section, a summary of the statistics for all the prediction models used for this data set is displayed. Namely, for each prediction model, the thresholds that define the set of direct TF-DNA interactions are displayed, as well as a CentriMo p-value providing information about the centrality of the predictions with respect to the ChIP-seq peak summits. The lower the p-value, the stronger the centrality. Expressed as log10(p-value), the closer the value is to 0 (that is, the closer the p-value is to 1), the lower the quality of the data and/or the performance of the prediction model on this particular data set.
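The relationship between a centrality p-value and the log10 scale used by the advanced search filter (log10(p-value) < 0) can be made concrete with a minimal Python sketch. The function names and the example p-values are hypothetical and serve only as an illustration; this is not code from UniBind or ChIP-eat.

```python
import math


def log10_p(p_value: float) -> float:
    """Convert a p-value in (0, 1] to log10 scale."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    return math.log10(p_value)


def passes_centrality_filter(p_value: float) -> bool:
    """Criterion quoted in this documentation: log10(p-value) < 0."""
    return log10_p(p_value) < 0


# Made-up p-values for three hypothetical data sets.
p_values = {"dataset_A": 1e-50, "dataset_B": 1.0, "dataset_C": 0.03}
kept = [name for name, p in p_values.items() if passes_centrality_filter(p)]
```

A p-value of exactly 1 gives log10(p-value) = 0 and is therefore excluded, while very small p-values give strongly negative values and indicate strong centrality.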

Finally, at the bottom of the entry page, other ChIP-seq data sets targeting the same TF are suggested.

How can I download the data?

The user can download the set of transcription factor binding sites (TFBSs), the initial ChIP-seq peaks, and the trained prediction models for each data set individually by navigating to the UniBind entry of interest. To download the available data in bulk, use the Download section of the UniBind database. For the TFBS bulk download, BED and FASTA files are available for each model. Note that for the DNAshaped model, only the 4bits model is available for bulk download. For TFs with variants (e.g., TFAP2C, JUND), individual files are available for each variant. The file names follow the format dataset_cell-line_tf_pfm-id so that the user can trace back which TFBSs correspond to which variant. The Download section also allows the user to download the cis-regulatory modules derived using the entire set of TFBS predictions.
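The file name format dataset_cell-line_tf_pfm-id can be split into its parts with a short Python sketch. It assumes that the four fields are separated by underscores, that no field itself contains an underscore, and that the file extension has already been removed; none of this is confirmed by the documentation. The names UniBindFileName and parse_file_name, and the example file names, are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class UniBindFileName(NamedTuple):
    """Fields of a dataset_cell-line_tf_pfm-id name (hypothetical type)."""
    dataset: str
    cell_line: str
    tf: str
    pfm_id: str


def parse_file_name(stem: str) -> UniBindFileName:
    """Split a file name (without extension) into its four fields.

    Assumption: fields never contain underscores, so exactly four
    parts are expected after splitting on "_".
    """
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields in {stem!r}")
    return UniBindFileName(*parts)


# Made-up names: one data set with two profile variants of the same TF.
stems = [
    "EXP000001_HepG2_JUND_MA0491.1",
    "EXP000001_HepG2_JUND_MA0492.1",
]

# Group profile identifiers by (dataset, cell line, TF) to see the variants.
variants = defaultdict(list)
for stem in stems:
    parsed = parse_file_name(stem)
    variants[(parsed.dataset, parsed.cell_line, parsed.tf)].append(parsed.pfm_id)
```

Grouping on the first three fields leaves the profile identifier as the only part that distinguishes the files of a TF with variants.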

How to interpret the enrichment zone and centrality p-value?

For each entry in UniBind, a visual representation of the results is available. It consists of a plot composed of four panels. Each point in panel (A) represents the top-scoring sequence in one ChIP-seq peak, that is, the sequence within the ChIP-seq peak with the best score computed from the TFBS computational model used. The Y-axis shows the motif score, and the X-axis shows the distance to the ChIP-seq peak summit. The dashed lines represent the thresholds on the TFBS computational model scores and on the distance to the peak summits, delimiting the area that contains direct TF-DNA interactions. Panel (B) is a heat map of the data from panel (A) that helps to visualize the density of the data and to better understand the defined thresholds. The bottom panels show where the thresholds on the motif score (C) and on the distance from the peak summit (D) were automatically defined by the entropy-based algorithm. The centrality p-value is provided in the legend of panel (D). Expressed as log10(p-value), the closer the value is to 0 (that is, the closer the p-value is to 1), the lower the quality of the data and/or the performance of the prediction model on this particular data set.
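The role of the two dashed thresholds in panel (A) can be sketched in Python. The sketch assumes that a point lies in the enrichment zone when its motif score is at or above the score threshold and its absolute distance to the peak summit is at or below the distance threshold; whether the boundaries are inclusive is an assumption. The names PeakHit and in_enrichment_zone and all numbers are hypothetical, and the thresholds themselves are taken as given rather than computed by the entropy-based algorithm.

```python
from typing import NamedTuple


class PeakHit(NamedTuple):
    """Top-scoring sequence of one ChIP-seq peak (hypothetical type)."""
    distance_to_summit: int  # signed distance to the peak summit, in bp
    score: float             # motif score from the TFBS model


def in_enrichment_zone(hit: PeakHit,
                       score_threshold: float,
                       distance_threshold: int) -> bool:
    """True if the hit is both high-scoring and close to the summit."""
    close_enough = abs(hit.distance_to_summit) <= distance_threshold
    strong_enough = hit.score >= score_threshold
    return close_enough and strong_enough


# Made-up hits and thresholds, for illustration only.
hits = [
    PeakHit(-5, 12.3),   # near the summit, high score
    PeakHit(40, 13.0),   # high score but too far from the summit
    PeakHit(3, 6.1),     # near the summit but low score
    PeakHit(-18, 9.5),   # within both thresholds
]
flags = [in_enrichment_zone(h, score_threshold=9.0, distance_threshold=20)
         for h in hits]
predicted_sites = [h for h, keep in zip(hits, flags) if keep]
```

Only hits that satisfy both conditions are retained, which mirrors the idea that predictions need both computational support (score) and experimental support (proximity to the summit).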

How to perform an advanced search?

The user can perform an advanced search of the data available in UniBind. On the Search page, reached by clicking “Search” at the top of the page, an “Advanced Options” link is available under the search bar. Here, the user can select the TF of interest, the cell line of interest, and the preferred prediction model or data source. Another option is to browse only the high-quality data sets by checking the P-value option. This keeps only the data sets with a significant centrality p-value (log10(p-value) < 0). Once the desired options are selected, the user can click the search button and come back to modify those options at any time during browsing. The diagram below illustrates the use of advanced search:

ChIP-eat pipeline source code is available on Bitbucket