It's a picture diary of summer vacation where you can see the unspecified sequence of the human genome.

Introduction

This article is really awesome, so don't really take it.

Purpose

It is said that the Human Genome Project has revealed the entire sequence of human DNA, but there must be a lot of unknown parts. Draw a graph of what you don't know.

Download the human genome fasta

You can download the sequence of the human genome from Gencode. https://www.gencodegenes.org/human/

Here, download ALL luxuriously.

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.p13.genome.fa.gz

You can unzip it, but you can see it to some extent by using the zcat command.

First, let's display head.

zcat GRCh38.p13.genome.fa.gz | head
>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Yes, it came out immediately. It is a mountain of N. In other words, the part where the sequence cannot be specified comes from the beginning. Next, let's check the chromosomes for the time being.

zcat GRCh38.p13.genome.fa.gz | grep "^>"

hoge.png

Well, it comes out in large quantities like this.

Next, let's check if the array really consists of 5 types of strings that are only TCGAN. I'm sure there is a way to use command line tools, but I'm going to write a small program using the Crystal language. The reason for the Crystal language is that it is explosive.

count.cr


counter = Hash(Char, Int32).new(0)

while l = gets
    l.chomp.each_char do |c|
        counter[c] += 1
    end
end

p counter

Yes, it's just plain code for anyone who understands Ruby. The only point is that you specified the type when you generated the Hash. Everything else is the same as Ruby. This is Crystal. Build.

crystal build counter.cr --release

If you include this build time, the execution speed may not be much different from Julia etc., but I will not worry about it. Let's count the number of TCGAN

zcat GRCh38.p13.genome.fa.gz | grep -v "^>" | ./count

The line that defines the chromosome at the beginning of the line with grep -v" ^> " is omitted. Result is

{'N' => 161331703, 'T' => 916993511, 'A' => 914265135, 'C' => 635937481, 'G' => 638590158}

have become. Certainly, it was confirmed that there are no strings other than TCGA and N. Next, let's count the continuous distribution of N. In other words, let's find out how many characters N are consecutive.

nseq.cr



temp = 0 

while l = gets
    l.chomp.each_char do |c|
        if c == 'N'
             temp += 1
        elsif temp != 0
            puts temp
            temp = 0
        end
    end
end
puts temp if temp != 0

Build.

crystal build nseq.cr --release

First, let's find out how many consecutive N points there are.

zcat GRCh38.p13.genome.fa.gz | grep -v "^>" | ./nseq | wc -l

1234

You can see that there are not so many. If you display them side by side in long order, it will look like this.

fuga.png

I think it's strange that 1 is, and I get the impression that there are many 100s and 50,000s.

Main subject

Now, write a program that calculates the ratio of N, AT, and CG for every 10000 characters.

n2


tcgan = Hash(Char, Int32).new(0)

target = ARGV[0]
chr = ""
loc = 1
flag = false

while l = gets
  if l.starts_with?(">")
    exit if flag
    if l == target
      puts "loc\tN\tAT\tCG"
      flag = true
    end
    next
  end
  if flag
    l.chomp.each_char do |c|
      tcgan[c] += 1
      loc += 1
      if loc % 10000 == 0
        total = tcgan.values.sum.to_f
        ta = (tcgan['A'] + tcgan['T']) / total
        cg = (tcgan['G'] + tcgan['C']) / total
        n  = tcgan['N'] / total
        puts "#{loc}\t#{n}\t#{ta}\t#{cg}"
        tcgan = Hash(Char, Int32).new(0)
      end
    end
  end
end

Run it as a trial.

zcat GRCh38.p13.genome.fa.gz | ./n2 ">chr1 1" | head
loc	N	AT	CG
10000	1.0	0.0	0.0
20000	0.0001	0.4076	0.5923
30000	0.0	0.4826	0.5174
40000	0.0	0.5288	0.4712
50000	0.0	0.6439	0.3561
60000	0.0	0.6346	0.3654
70000	0.0	0.6669	0.3331
80000	0.0	0.6199	0.3801
90000	0.0	0.6294	0.3706

It seems that it is working well. Creating a program with the same behavior in Ruby takes a lot of execution time, but Crystal is super fast. Throw this to uplot on the command line.

uplot is a Ruby tool that I personally make that allows me to display graphs on the terminal using UnicodePlots.rb.

From here I'm doing something pretty confusing to draw a graph on the terminal, take a screenshot and paste it into Qiita.

chr1

image.png

chr2

image.png

chr3

image.png

chr4

image.png

chr5

image.png

chr6

image.png

chr7

image.png

chr8

image.png

chr9

image.png

chr10

image.png

chr11

image.png

chr12

image.png

chr13

image.png

chr14

image.png

chr 15

image.png

chr 16

image.png

chr 17

image.png

chr 18

image.png

chr 19

image.png

chr 20

image.png

chr 21

image.png

Added scatter image.png

chr 22

image.png

Added scatter image.png

chr X

image.png

chr Y

image.png

According to Wikipedia It seems that the heterochromatin region is mainly undeciphered and that region is inactive. I think it means that it is rarely transcribed. In other words, it may be biologically meaningless.

However, when I try this, it seems that there are many parts where the arrangement is unknown.

If you are familiar with it, please feel free to drop in.

That's all for this article.

Recommended Posts

It's a picture diary of summer vacation where you can see the unspecified sequence of the human genome.
Site where you can see the version relation of spring (?)
A world where you can release a product without knowing the javac command