ChIP-seq (chromatin immunoprecipitation followed by sequencing) is a method for comprehensively measuring where, and how often, specific transcription factors bind and histone modifications occur across the genome. In this article you will start from building the environment, analyze data from published papers, view the results in a genome browser, and finally call peaks. Let's start by setting up the environment!
First, install the package manager miniconda. A package manager makes it easy to install and manage tools when setting up an environment.
For Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
For Mac
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
Run the installer, follow the prompts, and press ENTER or type yes when asked; answering yes to every question is fine. When the installation finishes, close the terminal once. After reopening it, add the following channels to miniconda by running the commands below. **Be sure to run them in this order.**
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
If you are told that you don't have conda here, see Help 1 below.
Now, let's install the necessary tools using miniconda. This time, we will install the following tools.
- SRA Toolkit: tools for downloading sequence data from the SRA
- Trimmomatic: adapter trimming tool
- FastQC: a tool for QC of fastq files
- Bowtie2: mapping tool
- Picard: a tool for removing PCR duplicates
- samtools: utilities used in many places
- deepTools: tools for normalizing data
- HOMER: a collection of various analysis tools
Install them with miniconda's conda command as follows.
conda install sra-tools
conda install trimmomatic
conda install fastqc
conda install bowtie2
conda install picard
conda install samtools
conda install deeptools
conda install homer
Answer y to every Proceed ([y]/n)? prompt.
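If you prefer, the same packages can also be installed in one line (a minimal sketch using the same package names as above):
conda install sra-tools trimmomatic fastqc bowtie2 picard samtools deeptools homer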
The environment is now ready! From here on we will work with the data. To avoid confusion, everything below is written **assuming that all commands are executed in the same directory**. Let's go get the data.
Download the sequence data from the SRA (Sequence Read Archive). This time we use the Med1 ChIP-seq data from mouse ES cells from Kagey MH *et al.* (https://www.nature.com/articles/nature09380) (GSM560348) and the H3K27Ac ChIP-seq data from mouse ES cells from Creyghton *et al.* (GSM594579). We also use the input sequence data (GSM560357) from the same paper as Med1. The numbers in parentheses are the GEO accession numbers for each dataset; they are what you need to find each dataset's page in NCBI's Gene Expression Omnibus (GEO).
Before downloading the data, let's first find these datasets in the browser, taking Med1 as an example. Open the GEO site (https://www.ncbi.nlm.nih.gov/geo/) in your browser and type GSM560348 into the search window highlighted in red in the image below.
Please take a close look at the various information about this data as shown in the image below.
Now click on the number to the right of the entry labeled SRA at the bottom of this page.
You will be taken to the following page. The number highlighted in red in the image is called the SRR number, the accession number required to download this dataset.
These numbers are written somewhere in the paper so that readers can find the data.
Use the SRA Toolkit command fastq-dump. It is very easy to use. For single-end data:
fastq-dump (SRR number of the data you want to download)
For paired-end data:
fastq-dump (SRR number of the data you want to download) --split-files
This should download the fastq file, which is the raw data output from the sequencer, to the directory where you ran this command.
Let's download this data.
fastq-dump SRR058988 #Med1
fastq-dump SRR066767 #H3K27Ac
fastq-dump SRR058997 #input
You can also write them as follows and download them all at once.
fastq-dump SRR058988 SRR066767 SRR058997
This process downloads the data as an .sra file, the SRA-specific compressed format, and converts it to a fastq file. It takes some time, so be patient. How about spending that time on my introduction to bioinformatics for a general audience?
https://laborify.net/2019/11/30/michida-bioinformatics/
After downloading, the file name is given by the SRR number, so rename it to make it easier to understand.
mv SRR058988.fastq Med1.fastq
mv SRR066767.fastq H3K27Ac.fastq
mv SRR058997.fastq Input.fastq
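As a quick optional sanity check, you can count the reads in each fastq file (each read occupies four lines); this is just a sketch assuming the file names above:
for f in Med1 H3K27Ac Input; do
  echo "${f}: $(( $(wc -l < ${f}.fastq) / 4 )) reads"
done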
Now let's clean the sequence results using Trimmomatic.
trimmomatic SE -threads 4 -phred33 Med1.fastq Med1_trimmed.fastq ILLUMINACLIP:$HOME/miniconda3/share/trimmomatic/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
The command is long, but it is kept on one line to prevent accidents caused by line breaks. Immediately after trimmomatic, specify whether the data are single-end (SE) or paired-end (PE). Next, -threads specifies the number of threads to use. -phred33 specifies the quality-score encoding; treat it as boilerplate and always include it. After that come the name of the file to trim and the name of the trimmed output file.
The path after ILLUMINACLIP points to the adapter sequence file, which should be under the miniconda3 directory; rewrite it to match where your miniconda3 is installed. (If you installed with the defaults, it should be in your home directory as shown.) The numbers 2:30:10 are the allowed number of mismatches, the palindrome clip threshold, and the simple clip threshold, respectively; you normally do not need to change them. LEADING:3 and TRAILING:3 remove bases with a quality score below 3 from the beginning and end of each read. SLIDINGWINDOW:4:15 looks at the average quality score in 4-bp windows and trims the read where the average falls below 15. Finally, MINLEN:36 discards reads shorter than 36 bp from the analysis. These are the "Quick start" settings from the Trimmomatic page (http://www.usadellab.org/cms/?page=trimmomatic).
When it finishes, a file called Med1_trimmed.fastq is generated. Run the same command with the same options on the remaining two datasets, for example as in the sketch below.
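A minimal sketch, assuming the file names from the renaming step above and the same adapter path:
for s in H3K27Ac Input; do
  trimmomatic SE -threads 4 -phred33 ${s}.fastq ${s}_trimmed.fastq ILLUMINACLIP:$HOME/miniconda3/share/trimmomatic/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done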
Use FastQC to check the quality of the trimmed fastq file.
fastqc --threads 4 --nogroup -o . Med1_trimmed.fastq
Write the number of threads with --threads immediately after fastqc. Adding --nogroup makes FastQC report every base position individually, so the quality toward the 3' end of the reads is also shown. After -o, write the directory where the results should be output. Finally, write the name of the file to QC.
A file called Med1_trimmed_fastqc.html
will be created in the directory specified in -o
, so let's open it with a browser.
If the Summary panel on the left is mostly green, the quality is fine. This dataset is almost suspiciously clean... Each metric is explained at https://bi.biopapyrus.jp/rnaseq/qc/fastqc.html. I am always grateful to this site for bioinformatics help.
Run the same command with the same options on the remaining two files, for example as in the sketch below.
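A minimal sketch, assuming the trimmed file names above (FastQC accepts multiple input files at once):
fastqc --threads 4 --nogroup -o . H3K27Ac_trimmed.fastq Input_trimmed.fastq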
Finally, we will map the reads with Bowtie2! Before that, though, we have to build the genome index; in other words, we need to prepare the reference genome required for mapping.
Download the full mm10 sequence from the UCSC page (https://hgdownload.soe.ucsc.edu/downloads.html) with the wget command. Here I create a folder called ref_genome and download it there.
mkdir ref_genome
cd ref_genome
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
tar -xzf chromFa.tar.gz
This time we also discard the random scaffolds, the unplaced (chrUn) sequences, and the mitochondrial genome (chrM.fa).
rm *random.fa
rm chrUn*
rm chrM.fa
Use cat to combine them into a single file called mm10.fa.
cat *.fa > mm10.fa
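If you like, you can quickly check the concatenated file (an optional sketch; the first command counts the sequences, the second shows the first few chromosome names):
grep -c '^>' mm10.fa
grep '^>' mm10.fa | head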
Create a directory called mm10_index at the same level as ref_genome and save the index there.
cd .. # if you are currently inside ref_genome
mkdir mm10_index
bowtie2-build -f ./ref_genome/mm10.fa ./mm10_index/mm10_index --threads 4
bowtie2-build is the command that builds the Bowtie2 index. Give the path to the genome sequence with the -f option, followed by the path (prefix) for the index. You can specify the number of threads with --threads. When it finishes, there should be six files named mm10_index*.bt2 in the mm10_index directory you created earlier. You only have to do this once.
It takes a lot of time. Use this time to study the R language, which is often used for statistical analysis in bioinformatics. (Like it!)
https://qiita.com/roadricefield/items/001c882f84dd093f4407
I won't use R in this article, but ...
.........................
Good morning everybody! It took me 7 hours because I forgot to specify --threads
! Let's map.
bowtie2 -p 4 -x ./mm10_index/mm10_index -U Med1_trimmed.fastq -S Med1.sam
-p is the number of threads, -x the index prefix, -U the fastq file to map, and -S the output file name. The result is first written as a SAM file. By the way, mapping also takes quite a while, about 30 minutes.
When you're done, use samtools to convert the sam file to a bam file.
samtools view -b -o Med1.bam Med1.sam
This completes the mapping! Run the same commands with the same options on the remaining two datasets, for example as in the sketch below.
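A minimal sketch, assuming the trimmed file names above:
for s in H3K27Ac Input; do
  bowtie2 -p 4 -x ./mm10_index/mm10_index -U ${s}_trimmed.fastq -S ${s}.sam
  samtools view -b -o ${s}.bam ${s}.sam
done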
Finally, use Picard to remove PCR duplicates. This step is not strictly required, but it is common practice.
samtools sort Med1.bam > Med1_sorted.bam #You need to sort the bam file to use Picard.
picard MarkDuplicates I=Med1_sorted.bam O=Med1_rm_dups.bam M=Med1_report.txt REMOVE_DUPLICATES=true
Give I= the name of the bam file from which you want to remove PCR duplicates, O= the name of the output file after duplicate removal, and M= the name of a report file that summarizes the results of the run.
Process the other two datasets with the same options, for example as in the sketch below.
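A minimal sketch, assuming the bam file names above:
for s in H3K27Ac Input; do
  samtools sort ${s}.bam > ${s}_sorted.bam
  picard MarkDuplicates I=${s}_sorted.bam O=${s}_rm_dups.bam M=${s}_report.txt REMOVE_DUPLICATES=true
done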
The working directory is getting crowded, so let's organize the data here. Move the Med1 ChIP-seq data into a directory called Med1_data, the H3K27Ac ChIP-seq data into H3K27Ac_data, and the input data into Input_data.
mkdir Med1_data
mv Med1* Med1_data
mkdir H3K27Ac_data
mv H3K27Ac* H3K27Ac_data
mkdir Input_data
mv Input* Input_data
First, let's convert the bam file to bigWig format, which is easy to view in a genome browser. For this we use deepTools' bamCoverage. In bigWig format, a normalized read count is calculated for bins of a fixed width across the whole genome; in other words, it gives the ChIP-seq signal strength for each bin on the genome. Before creating it, you need a bam index (.bam.bai) file, so let's make one with samtools.
samtools index Med1_data/Med1_rm_dups.bam
Doing this will create an index file called Med1_rm_dups.bam.bai
.
Now run bamCoverage
. Be sure to put the bam file and its bam.bai file in the same directory and run it.
bamCoverage -b Med1_data/Med1_rm_dups.bam -p 4 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --binSize 1 -o Med1_data/Med1.bigwig
Write the name of the bam file to convert to bigWig after -b. -p is the number of threads. --normalizeUsing selects the type of normalized value computed in each bin; you can choose RPKM, CPM, BPM, RPGC, or None. If you select None, the raw read count in each bin is used as the value for that bin. --effectiveGenomeSize is the length (bp) of the mappable portion of the genome; for mm10 (also known as GRCm38) it is 2652783500 (see https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html). --binSize is the bin length (bp) used for the calculation, and -o is the name of the output file.
The calculation takes time, so let's install the genome browser in the meantime.
A genome browser is a tool for visualizing sequencing results. You have probably seen figures showing the signal of some *-seq experiment at a specific position on the genome; that is what a genome browser does. Let's install one now!
Download the installer for your OS from the IGV download page. IGV is a GUI (graphical user interface) application operated with the mouse and keyboard. If you are a Windows user working in WSL, choose the Windows version here. Launch the downloaded installer and install it following the instructions. An IGV shortcut like the one below will then be created on the desktop.
Double-click it to start it (it takes about 30 seconds to start). After starting, the following window will appear.
Now that hg19 is loaded, let's download and load mm10. Click the down arrow in the red box on the screen above and you will see "More ...". Click on it. Then click "Mouse mm10", check "Download Sequence" at the bottom left and click "OK". This will start downloading mm10.
By now bamCoverage should be about finished...?
When you're done, drag and drop Med1.bigwig
into the IGV window.
Did the Med1 ChIP-seq profile appear as in the image above? In this view you are looking at all the chromosomes from a bird's-eye perspective, so you cannot see any detail; type various gene names into the search window highlighted in red to jump to the corresponding gene body. Here is just one example.
Create bigWig files for the remaining two datasets in the same way, for example as in the sketch below, and check them in IGV.
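A minimal sketch, assuming the directory layout created above:
for s in H3K27Ac Input; do
  samtools index ${s}_data/${s}_rm_dups.bam
  bamCoverage -b ${s}_data/${s}_rm_dups.bam -p 4 --normalizeUsing RPGC --effectiveGenomeSize 2652783500 --binSize 1 -o ${s}_data/${s}.bigwig
done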
Now let's call peaks, that is, detect peaks in the signal according to statistical criteria. This time we use HOMER's findPeaks. Another commonly used peak caller is MACS2 (https://github.com/taoliu/MACS); if you are interested, compare the results.
Now, before doing findPeaks
, we have to convert the bam file into a form of TagDirectory that HOMER can handle, using HOMER's makeTagDirectory
.
makeTagDirectory Med1_data/Med1_TagDir -single Med1_data/Med1_rm_dups.bam
Write the name of the TagDirectory to create immediately after makeTagDirectory, then the options, and finally the name of the bam file. The only option used here is -single, which keeps the contents of the TagDirectory tidy. See http://homer.ucsd.edu/homer/ngs/tagDir.html for the other options. Create TagDirectories for the remaining two datasets as well, for example as in the sketch below.
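A minimal sketch, assuming the file layout above:
makeTagDirectory H3K27Ac_data/H3K27Ac_TagDir -single H3K27Ac_data/H3K27Ac_rm_dups.bam
makeTagDirectory Input_data/Input_TagDir -single Input_data/Input_rm_dups.bam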
Now let's run findPeaks
.
findPeaks Med1_data/Med1_TagDir -style factor -o auto -i Input_data/Input_TagDir
Write the name of the TagDirectory on which to call peaks immediately after findPeaks. For the -style option, we treat Med1 as a transcription factor this time and enter factor. The -o option sets where the results are written; if you set it to auto, they are saved inside the TagDirectory used for peak calling. The -i option takes the name of the TagDirectory of the input data; if you have no input data, simply omit the -i option and the calculation is done without input.
Next, make a peak call for the data of H3K27Ac ChIP-seq.
findPeaks H3K27Ac_data/H3K27Ac_TagDir -style histone -o auto -i Input_data/Input_TagDir
Since H3K27Ac is histone-modified data, set the -style
option to histone
.
Let's look at the result file of the peak call. There should be a file named peaks.txt inside the TagDirectory (for H3K27Ac it is named regions.txt). Since it is a plain text file, it is easy to view if you open it in Excel.
After some header information at the top, the peak information is listed as in the lower half of the image. The chr, start, and end columns give the genomic position of each peak. Normalized Tag Count is the signal strength of each peak; divide it by 10 to get reads per million (rpm). For more about what findPeaks does, see http://homer.ucsd.edu/homer/ngs/peaks.html.
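As a quick optional check from the terminal, you can count the peaks by counting the non-header lines (header lines start with #); this is just a sketch:
grep -vc '^#' Med1_data/Med1_TagDir/peaks.txt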
Next, let's look at these results in IGV together with the bigWig file from earlier. To make the peak information easy for IGV to read, it is best to convert it to bed format. A bed file is a very simple text file, like the image below, that records the position of each peak in three columns from the left: chr, start, end. (Additional information may appear in the fourth and later columns.)
You can create it from peaks.txt
in the output of findPeaks
with the following command.
sed '/^#/d' Med1_data/Med1_TagDir/peaks.txt | awk -v 'OFS=\t' '{print $2, $3, $4}' > Med1_data/Med1_TagDir/peaks.bed
Try dragging and dropping the peaks.bed
you just created onto the IGV.
You should see the peak called position like this. Easy to understand ...
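If you also want to view the H3K27Ac peaks in IGV, the same conversion works on its regions.txt (a sketch following the command above):
sed '/^#/d' H3K27Ac_data/H3K27Ac_TagDir/regions.txt | awk -v 'OFS=\t' '{print $2, $3, $4}' > H3K27Ac_data/H3K27Ac_TagDir/regions.bed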
Transcription factors bind to specific sequences (binding motifs), so given a set of peak regions you can infer which transcription factors bind there by examining which sequences are enriched in those regions. For example, by running binding-motif analysis on the Med1 ChIP-seq peak regions, we can predict which transcription factors bind at the sites where Med1 is bound. We will do this with HOMER's findMotifsGenome.pl.
This also requires some preparation: the genome must be installed into HOMER.
perl $HOME/miniconda3/share/homer-4.10-0/configureHomer.pl -install mm10
Install mm10 with configureHomer.pl, which is located under the miniconda3 directory as shown above (the path assumes miniconda3 is in your home directory). Note that the homer-4.10-0 part may differ depending on when and in what environment HOMER was installed with conda install homer (the HOMER version may be different). If you are on WSL and this step did not work, see Help 2 below.
Now run findMotifsGenome.pl
.
findMotifsGenome.pl Med1_data/Med1_TagDir/peaks.txt mm10 Med1_data/Med1_motif -size given -p 4
Write the findPeaks result file immediately after findMotifsGenome.pl. (If you want to use the results of another peak caller, see Help 3 below.) Then write the genome version, and then the name of the directory in which to save the results. The -size option takes an integer length, and the calculation is performed on a region of that length centered on the center of each peak. This is a little hard to grasp, so here is an example: with -size 100, the calculation is performed over +/- 50 bp from the center of each peak. This time we use -size given, so the calculation is performed over the actual extent of each peak region passed to the program. Enter the number of threads to use with -p.
When the calculation finishes, a directory called Med1_motif is created inside Med1_data, so let's look inside. The file to note is knownResults.html; open it in your browser. That is all for the calculation; if you got an error, refer to Help 2 below.
This calculation tests which motif sequences in the HOMER database are enriched in the input peak regions relative to automatically generated random regions, and the results are displayed in ascending order of p-value. Because random regions are used, the results change slightly each run. Looking at this result, the Med1 peak regions contain many motifs of transcription factors important for maintaining stem cell identity, such as KLF, OCT, and SOX, suggesting that Med1 colocalizes with these transcription factors.
Separately, homerResults.html summarizes sequences that are abundant in the input peak regions and resemble motif sequences in the HOMER database. Basically, checking knownResults.html should be enough.
For more information on findMotifsGenome.pl
, please visit http://homer.ucsd.edu/homer/ngs/peakMotifs.html.
Finally, I will show you how to check the overlap between peaks. To do this, use HOMER's mergePeaks
. This time, we will investigate the overlap between the peak area of Med1 ChIP-seq and the peak area of H3K27Ac ChIP-seq.
mergePeaks -d given Med1_data/Med1_TagDir/peaks.txt H3K27Ac_data/H3K27Ac_TagDir/regions.txt -prefix mergePeaks -venn venn.txt
With -d set to given, the overlap between the input peak regions is computed using the regions exactly as given. List the input peak files side by side; you can give three or more. With -prefix XXX, files whose names start with XXX are written out separately for the merged regions where the peak sets overlap and for the regions present in only one peak set. With -venn YYY.txt, a table named YYY.txt is created summarizing the number of overlapping peak regions and the number of regions unique to each set, which you can use to draw a Venn diagram. For details on the options, see http://homer.ucsd.edu/homer/ngs/mergePeaks.html.
Running this command produces mergePeaks_H3K27Ac_data_H3K27Ac_TagDir_regions.txt, mergePeaks_Med1_data_Med1_TagDir_peaks.txt, a third file whose name combines both peak files (truncated above), and venn.txt: respectively, the regions present only in the H3K27Ac ChIP-seq peaks, the regions present only in the Med1 ChIP-seq peaks, the merged regions where the Med1 and H3K27Ac ChIP-seq peaks overlap, and the table for drawing the Venn diagram.
Let's draw a Venn diagram with Python's matplotlib. We will use a package called matplotlib_venn, so install it first:
conda install matplotlib-venn
Then open any editor you like and write the following code.
from matplotlib import pyplot as plt
from matplotlib_venn import venn2

# Open venn.txt and check the number of peaks found only in Med1 ChIP-seq,
# the number found only in H3K27Ac ChIP-seq, and the number of overlapping peaks.
venn2(subsets=(770, 25254, 2738), set_labels=("Med1", "H3K27Ac"))
# subsets = (Med1 only, H3K27Ac only, overlapping peaks)
plt.savefig("./venn.png")
If you can write it, save it with a name such as venn_plot.py
and execute the following command in the saved location.
python venn_plot.py
Then a file called venn.png
will be created in that directory, so open it.
About 80% of the peaks of Med1 ChIP-seq, a protein that activates transcription, overlap with the peaks of H3K27Ac ChIP-seq, which is likewise a marker of transcriptional activity. Try comparing the two datasets in IGV as well.
Thank you for reading this far. The post-peak-call analysis introduced here is only a small part of ChIP-seq analysis; now that you can map reads and call peaks, carry out analyses suited to your own goals using HOMER, R, Python, and so on. Information on bioinformatics, including articles like this one, is now abundant online. The same procedure can also be applied to ATAC-seq (Assay for Transposase-Accessible Chromatin Sequencing), which comprehensively measures the open and closed state of chromatin across the genome. We hope this article helps you in your research. If you have any questions, I will help as much as I can, so feel free to get in touch!
Help 1: the conda command doesn't work! The likely cause is that the PATH was not set after installing miniconda3. Do the following.
cd # move to the home directory
vim .bash_profile # open .bash_profile in vim to add the path
When it opens, press the esc key first, then press the i key; you can now edit in Insert mode. Type the following:
PATH=$PATH:~/miniconda3/bin
Write the path to miniconda3/bin after PATH=$PATH:. The above assumes miniconda is in your home directory. Check that you have made no mistakes, press esc again, then type :wq and press enter.
Then restart the terminal, or run
source .bash_profile
Now the PATH should be set and the conda command should work.
Help 2: configureHomer.pl -install mm10 doesn't work! This is probably an error only WSL users will see; it may be because commands required to run configureHomer.pl are not installed. Run all of the following:
which gcc
which g++
which make
which perl
which zip
which gzip
which wget
Each of these should print a path to the command, such as /usr/bin/make. If no path is shown for some command, install it, for example:
sudo apt install zip # if zip was missing
After installing everything that was missing, run
perl $HOME/miniconda3/share/homer-4.10-0/configureHomer.pl -install mm10
again and see if it works.
Help 3: If you want to use a bed file created by a program other than HOMER as input to HOMER, it is safest to convert it to a HOMER-format bed file beforehand. To do this, use HOMER's bed2pos.pl.
bed2pos.pl (Bed file you want to convert) > Converted_file.hb
The extension of the HOMER bed file is "hb".
References
http://rnakato.hatenablog.jp/entry/2017/07/06/110926
https://bi.biopapyrus.jp/rnaseq/qc/trimmomatic.html
http://www.usadellab.org/cms/?page=trimmomatic
https://bi.biopapyrus.jp/rnaseq/qc/fastqc.html
https://bi.biopapyrus.jp/rnaseq/mapping/bowtie2/