Just reading the title makes my stomach tighten... This is a memo based on my experience so far, written about half for my own reference.

Indicators that can be used to investigate the cause

HTTP status code

Reference: https://developer.mozilla.org/ja/docs/Web/HTTP/Status
Of these, the codes that indicate an error are the 400 series and the 500 series. Roughly, they differ as follows:

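--400 series: the access itself fails
  --Name resolution is not possible
  --Unauthorized
  --The server is down
--500 series: processing inside the server stops with an error
  --The server settings are incorrect
  --Errors in the framework running the site

Note that when name resolution fails or the server is completely down, the browser usually shows a connection error rather than an HTTP status code. They are grouped with the 400 series here because the symptom looks the same from the outside.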

The response differs depending on which of the two you are facing, so isolate that first.

Log files

On Linux, unless you have changed the output path, you will find most logs under /var/log/.

apache: /var/log/httpd/
nginx : /var/log/nginx/
php-fpm (check this for PHP errors when running nginx + php-fpm): /var/log/php-fpm/
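As a rough sketch of the first look at the logs, the helper below prints which of the directories above actually exist on a host. The name list_log_dirs is made up for this memo, and the paths assume the default package layout.

```shell
# list_log_dirs is a hypothetical helper: print each candidate directory that exists.
list_log_dirs() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
    fi
  done
}

# Default locations from the list above (assumes the output paths were not changed).
list_log_dirs /var/log/httpd /var/log/nginx /var/log/php-fpm
```

Running ls -lt on each directory it prints then shows the most recently written files first.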

To see which processes are running, use ps aux. It prints a large amount of output, so pipe it through grep when you are looking for something specific.
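A minimal sketch of combining ps aux with grep. The function name show_procs is hypothetical. The second grep is there only to drop the grep process itself from the results.

```shell
# show_procs is a hypothetical helper: list running processes whose ps line matches a pattern.
show_procs() {
  # "grep -v grep" removes the line for the grep command we just started.
  ps aux | grep -- "$1" | grep -v grep
}

# Example: show_procs nginx
```

If this prints nothing for nginx or php-fpm, the process is probably not running.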

AWS monitoring

With AWS, you can check a lot of information from the console

EC2:
--CPU usage
--Number of requests

ELB:
--Latency
--Number of running servers

RDS:
--CPU usage
--Number of connections
--Write / Read throughput

How to respond

1. Calm down

This matters more than you might think. Nothing good comes from panicking... Calm yourself by writing down the current situation or by consulting someone senior.

2. Check the status code

As mentioned above, there are many reasons a site can be unreachable, so start narrowing it down from here. In most cases the browser shows the status code on the error page.
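If the browser does not show the code, it can also be read from the command line. The following is only a sketch: classify_status and check_site are names invented for this memo, and curl is assumed to be installed.

```shell
# classify_status is a hypothetical helper: map an HTTP status code to a rough category.
classify_status() {
  case "$1" in
    1??|2??|3??) echo "not an error" ;;
    4??)         echo "400 series" ;;
    5??)         echo "500 series" ;;
    *)           echo "no HTTP response" ;;
  esac
}

# check_site is also hypothetical: fetch only the status code with curl, then classify it.
# curl prints 000 for %{http_code} when it received no HTTP response at all.
check_site() {
  # curl exits non-zero on connection failure, but the code is still printed.
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1") || true
  echo "$code $(classify_status "$code")"
}

# Example: check_site https://example.com/
```

A result of "000 no HTTP response" points at the network, DNS, or a stopped server rather than at the application.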

3-1. 400 series

These are relatively easy to deal with because the codes are finely divided by cause. The ones I see most often are:

400 Bad Request --The request is invalid. If it appears right after releasing a new feature, suspect a malformed request being sent internally.
403 Forbidden --A permission problem. Check IP restrictions and file permissions.
404 Not Found --See below.

3-2. 500 series

The ones I see most often are listed below. Start by checking the error log. What to do next depends on what the log says, so I will not go into it here.

500 Internal Server Error --Something went wrong inside the server. Examples of errors I have encountered are described below.
502 Bad Gateway --A server acting as a gateway or proxy received an invalid response from the upstream server while trying to handle the request.
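Checking the error log usually starts with pulling out the most recent lines that look like errors. A small sketch follows. The function name recent_errors is hypothetical, and the keyword list is my own guess at common severity words, not an official list.

```shell
# recent_errors is a hypothetical helper: print the last N lines of a log that look like errors.
recent_errors() {
  # $1 = log file, $2 = number of lines to show (default 20)
  grep -iE 'error|crit|alert|emerg|fatal' "$1" | tail -n "${2:-20}"
}

# Example (default nginx path, assuming it was not changed):
# recent_errors /var/log/nginx/error.log 50
```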

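From here on, some concrete examples.

Handling 404 Not Found

There are various causes, so I cover this one separately. The likely candidates are:

--The server is down
--Name resolution is not possible

so I check these first. (Strictly speaking, a 404 means the server answered but found nothing at that path, so if all the checks below pass, also check the URL and the deployed files.)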

Try ping

ping {IP/hostname}

This tests whether the host is reachable. (localhost is used here as a dummy target.)

$ ping localhost
PING localhost (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=6.893 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.115 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.117 ms
^C
--- localhost ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.076/1.800/6.893/2.940 ms

If the host is alive, replies come back like this. Plain ping cannot target a specific port number, so use another tool for that. I use nping, which is convenient: https://qiita.com/Yu-s/items/4b4f683fda374c8ddcc9
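When nping is not installed, a port can also be probed with bash alone. This is a different technique from nping, shown only as a sketch. The name port_open is hypothetical, and it assumes bash (for the /dev/tcp redirection) and the coreutils timeout command are available.

```shell
# port_open is a hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Assumes bash (for /dev/tcp) and the coreutils timeout command are installed.
port_open() {
  # Host and port are passed as arguments to the inner bash, not pasted into the command string.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example:
# port_open localhost 80 && echo "port 80 is open" || echo "port 80 is closed or filtered"
```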

Try to log in

In most cases you should be able to log in with ssh or similar. If a command that used to work no longer lets you in, the server is probably down.

Check from the console on AWS

You can check from EC2 Dashboard > Instances > Instance Status. If the state is stopped, the server is down. (If you are not using aws-cli or autoscaling, it is possible that someone stopped it deliberately... it should not stop on its own...) You can also see whether the status checks have failed. If they have, treat the server as down as well.

Note, however, that the state may still show running while the instance is automatically restarting (that is, while it is actually down).
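The same information is available from aws-cli. The sketch below assumes aws-cli is installed and has credentials configured. The function names instance_state and status_checks are hypothetical, and the instance ID in the example is a placeholder.

```shell
# instance_state is a hypothetical helper: print the state name of one EC2 instance
# (pending, running, stopping, stopped, ...). Assumes aws-cli is installed and configured.
instance_state() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].State.Name' \
    --output text
}

# status_checks is also hypothetical: print the system and instance status check results.
# --include-all-instances makes stopped instances appear in the output too.
status_checks() {
  aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
    --output text
}

# Example (placeholder ID): instance_state i-0123456789abcdef0
```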

If you have confirmed that the server is alive but still cannot reach it by its domain name, name resolution is probably failing.

Try dig

dig is a quick way to check: https://www.atmarkit.co.jp/ait/articles/1711/09/news020.html

You can also use nslookup: https://www.atmarkit.co.jp/ait/articles/1710/27/news021.html

$ dig www.google.com

; <<>> DiG 9.10.6 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3344
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		89	IN	A	172.217.24.132

;; Query time: 13 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Sep 11 13:51:58 JST 2020
;; MSG SIZE  rcvd: 59

If the output has no ;; ANSWER SECTION:, the name could not be resolved.
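For scripting, the same check can be done without reading the full output. This is a sketch with a hypothetical function name, resolves. It relies on dig +short printing only the answer data.

```shell
# resolves is a hypothetical helper: succeed if dig returns at least one answer for the name.
# "dig +short" prints only the answer data; lines starting with ";" are dig's own
# messages (for example a timeout notice), so they are not counted as answers.
resolves() {
  dig +short "$1" A | grep -qv '^;'
}

# Example:
# resolves www.google.com && echo "resolved" || echo "name resolution failed"
```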

Handling 500 Internal Server Error

The error may have occurred in one of two places:

--The server software
--The framework that runs the code

So I start by looking at both kinds of logs. How to deal with it depends on the content of the error, but below are some errors I have actually encountered.

Forgot to set up log rotation

The following appeared in the error log of nginx

write() to "/var/log/nginx/access.log" was incomplete: 83 of 314 while logging request

In other words, "I couldn't write the access log." When I looked, only that day's access log was garbled, so I guessed that the storage had run out. Deleting the offending file should fix it for the time being, but deleting the access log did not help... (There were probably other large files as well.) I did not want to spend time hunting for them, and the site was behind autoscaling, so I launched a new server and swapped it in as a stopgap.

Later, when I investigated the root cause, I found that the logs were not being rotated by logrotate... I had updated nginx a few days earlier, and it seems I forgot to restore the log rotation configuration at that time. Apart from this, running out of storage or memory also makes a site inaccessible, so it is worth keeping the commands for checking them at hand.

$ df -h    # Storage check
$ free -m  # Memory check
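Once df shows a nearly full disk, the next question is what is filling it. The two helpers below are a sketch of that follow-up. The names full_filesystems and largest_files are hypothetical, and the 90% threshold is an arbitrary default.

```shell
# full_filesystems is a hypothetical helper: print mount points at or above a usage threshold.
full_filesystems() {
  # $1 = threshold in percent (default 90). df -P forces one line per filesystem.
  df -P | awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}

# largest_files is also hypothetical: list the N biggest files under a directory, sizes in KB.
largest_files() {
  # $1 = directory, $2 = how many files to show (default 10)
  find "$1" -xdev -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example:
# full_filesystems 90
# largest_files /var/log 10
```

Mount points containing spaces are not handled by the awk part. That is rare on a server, but worth knowing.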

Forgot to apply a settings change to the launch configuration (autoscaling)

One day the site suddenly went down with a 500 error, so I checked the error log. In the cakephp2 log I found an error related to a new feature I had built a few days earlier. Hadn't that been fixed...? Then it dawned on me: I had forgotten to apply the settings change to the launch configuration. The flow was apparently:

Traffic goes up => Autoscaling starts a server that still reproduces the error => Site becomes inaccessible

The error was a cakephp2 cache problem that could be fixed by deleting the cache files, so I handled it as follows:

Clear the cache again => Create an AMI in that state => Specify that AMI in the launch configuration

Summary

You will be panicking like crazy, but:

・Calm down
・Isolate the cause
・Consult someone senior

Doing these gets you a long way.

What to do when you get "I can't see the site !!!!"