This article is really awesome, so don't really take it.
It is said that the Human Genome Project has revealed the entire sequence of human DNA, but there must be a lot of unknown parts. Draw a graph of what you don't know.
You can download the sequence of the human genome from Gencode. https://www.gencodegenes.org/human/
Here, download ALL luxuriously.
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.p13.genome.fa.gz
You can unzip it, but you can see it to some extent by using the zcat
command.
First, let's display head
.
zcat GRCh38.p13.genome.fa.gz | head
>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Yes, it came out immediately. It is a mountain of N. In other words, the part where the sequence cannot be specified comes from the beginning. Next, let's check the chromosomes for the time being.
zcat GRCh38.p13.genome.fa.gz | grep "^>"
Well, it comes out in large quantities like this.
Next, let's check if the array really consists of 5 types of strings that are only TCGAN. I'm sure there is a way to use command line tools, but I'm going to write a small program using the Crystal language. The reason for the Crystal language is that it is explosive.
count.cr
counter = Hash(Char, Int32).new(0)
while l = gets
l.chomp.each_char do |c|
counter[c] += 1
end
end
p counter
Yes, it's just plain code for anyone who understands Ruby. The only point is that you specified the type when you generated the Hash. Everything else is the same as Ruby. This is Crystal. Build.
crystal build counter.cr --release
If you include this build time, the execution speed may not be much different from Julia etc., but I will not worry about it. Let's count the number of TCGAN
zcat GRCh38.p13.genome.fa.gz | grep -v "^>" | ./count
The line that defines the chromosome at the beginning of the line with grep -v" ^> "
is omitted. Result is
{'N' => 161331703, 'T' => 916993511, 'A' => 914265135, 'C' => 635937481, 'G' => 638590158}
have become. Certainly, it was confirmed that there are no strings other than TCGA and N. Next, let's count the continuous distribution of N. In other words, let's find out how many characters N are consecutive.
nseq.cr
temp = 0
while l = gets
l.chomp.each_char do |c|
if c == 'N'
temp += 1
elsif temp != 0
puts temp
temp = 0
end
end
end
puts temp if temp != 0
Build.
crystal build nseq.cr --release
First, let's find out how many consecutive N points there are.
zcat GRCh38.p13.genome.fa.gz | grep -v "^>" | ./nseq | wc -l
1234
You can see that there are not so many. If you display them side by side in long order, it will look like this.
I think it's strange that 1 is, and I get the impression that there are many 100s and 50,000s.
Now, write a program that calculates the ratio of N, AT, and CG for every 10000 characters.
n2
tcgan = Hash(Char, Int32).new(0)
target = ARGV[0]
chr = ""
loc = 1
flag = false
while l = gets
if l.starts_with?(">")
exit if flag
if l == target
puts "loc\tN\tAT\tCG"
flag = true
end
next
end
if flag
l.chomp.each_char do |c|
tcgan[c] += 1
loc += 1
if loc % 10000 == 0
total = tcgan.values.sum.to_f
ta = (tcgan['A'] + tcgan['T']) / total
cg = (tcgan['G'] + tcgan['C']) / total
n = tcgan['N'] / total
puts "#{loc}\t#{n}\t#{ta}\t#{cg}"
tcgan = Hash(Char, Int32).new(0)
end
end
end
end
Run it as a trial.
zcat GRCh38.p13.genome.fa.gz | ./n2 ">chr1 1" | head
loc N AT CG
10000 1.0 0.0 0.0
20000 0.0001 0.4076 0.5923
30000 0.0 0.4826 0.5174
40000 0.0 0.5288 0.4712
50000 0.0 0.6439 0.3561
60000 0.0 0.6346 0.3654
70000 0.0 0.6669 0.3331
80000 0.0 0.6199 0.3801
90000 0.0 0.6294 0.3706
It seems that it is working well. Creating a program with the same behavior in Ruby takes a lot of execution time, but Crystal is super fast. Throw this to uplot on the command line.
uplot is a Ruby tool that I personally make that allows me to display graphs on the terminal using UnicodePlots.rb.
From here I'm doing something pretty confusing to draw a graph on the terminal, take a screenshot and paste it into Qiita.
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr 15
chr 16
chr 17
chr 18
chr 19
chr 20
chr 21
Added scatter
chr 22
Added scatter
chr X
chr Y
According to Wikipedia It seems that the heterochromatin region is mainly undeciphered and that region is inactive. I think it means that it is rarely transcribed. In other words, it may be biologically meaningless.
However, when I try this, it seems that there are many parts where the arrangement is unknown.
If you are familiar with it, please feel free to drop in.
That's all for this article.