Cheat sheet for not causing accidents

"Adding (correct) features" and "causing no accidents": the hard part of being an engineer is that you have to do both. Are you ready? I am.

Definition of accident

An accident here means an event that prevents users from using the service, or an event that damages trust in the service. The following are examples of such events.

- Infrastructure failure (network failure, server failure)
- DB failure (data inconsistency, deadlock)
- Vulnerabilities (information leakage, injection, tampering, unauthorized access)
- Mis-sent e-mail or SNS posts
- Implementation bugs (double billing, server errors, inoperability, unintended behavior)

The first factor that determines how serious an accident is, is scale. Compare a failure in a service used by 10 people with one in a service used by 1 million people: the latter is clearly the more serious accident. The scale of the accident itself also matters, that is, whether the failure affected the entire service or only a small number of users. (If the scale is small enough, the root cause may be set aside and the matter settled with the affected users.) The second factor is the importance of the information the service handles. A service that handles no personal information has little worth stealing, so security needs less attention (although almost no such services exist nowadays). A banking system, by contrast, naturally needs to keep all network access logs and to prevent tampering and unauthorized access.

Humans make mistakes, so as long as a task is done manually, you will eventually repeat a mistake, or someone else will make the same one. Either way, it is better to automate checks in addition to checking manually. Many accidents originate outside what the implementer imagined, so when operating a large-scale system, share information among members. It is important to use keyword search and imagination to find out where a modification will have an effect (for example, when a change affects not only the main system but also subsystems outside your area of responsibility). It is also important to set up monitoring and notification and to actually read error messages. When building a system on cloud services such as AWS and GCP, Datadog is recommended for this.

No particular language or execution platform is assumed, but this is written with a typical web service built on a PaaS cloud in mind.

Infrastructure failure

Accidents here are caused mainly by misconfiguration of networking and authentication / authorization, and by insufficient server resources. The items fall into three groups: those rarely touched once set, those that need checking every time a feature is modified, and those that require constant monitoring.

Incorrect private network settings

When you build a virtual private network for server-to-server access with a VPC (Virtual Private Cloud) on AWS or similar, a misconfiguration can cut off communication between servers that could previously reach each other. As a countermeasure, check communication between each pair of servers in the network and from each server to the Internet. Once confirmed, this can only happen when the network is rebuilt.

DNS record setting error

An accident that occurs when you make a mistake in a CNAME record, A record, TXT record, etc. while setting DNS records in a service such as AWS Route53. It happens when you forget to set a record or to confirm communication after changing a domain. Once confirmed, it can only happen when the domain or a record is changed.

SSL certificate expired

An accident in which the SSL certificate expires a few years after it was set up, HTTPS communication suddenly stops working, and browsers show a warning. Take measures such as enabling automatic renewal of the certificate or sending a notification before the expiration date.
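The "notify before the expiration date" measure can be sketched as a small check that a scheduled job might run. This is only an illustration: the function names and the 30-day default are assumptions, and how the certificate's expiry date is obtained is left out.

```typescript
// Hypothetical helpers: decide whether to send an expiry warning for a certificate.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Whole days left until `notAfter`; negative once the certificate has expired. */
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

/** True when the certificate expires within `warnDays` days or has already expired. */
function shouldWarn(notAfter: Date, now: Date, warnDays: number = 30): boolean {
  return daysUntilExpiry(notAfter, now) <= warnDays;
}
```

Passing the current time in as an argument, instead of reading the clock inside the function, keeps the check easy to unit test.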

Authentication / authorization setting error

An accident that occurs when access permissions for a cloud service are changed incorrectly while adding or changing cloud-related functions. For example, with AWS IAM, S3 access is granted Read permission only and Write permission is forgotten. As a countermeasure, actually run the program, or act as the configured user, and check that the required reads and writes succeed.

Incorrect cache setting (CDN setting)

A CDN such as CloudFront can cache responses for specific URL paths. With a cache, responses are returned quickly from the second request onward. However, if a parameter is missing from the whitelist, it is not passed on to the ALB or the server. When an API is cached, remember that every parameter you add to the implementation must also be added to the whitelist.
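To make the whitelist problem concrete, here is a simplified model of a CDN that forwards only whitelisted query parameters and builds its cache key from them. It is a sketch under that assumption, not CloudFront's actual behavior or API; all names are hypothetical.

```typescript
// Simplified model: only whitelisted query parameters reach the origin,
// and only they are part of the cache key.
type Query = Record<string, string>;

function forwardedQuery(query: Query, whitelist: string[]): Query {
  const forwarded: Query = {};
  for (const name of Object.keys(query).sort()) {
    const value = query[name];
    if (value !== undefined && whitelist.indexOf(name) !== -1) {
      forwarded[name] = value;
    }
  }
  return forwarded;
}

function cacheKey(path: string, query: Query, whitelist: string[]): string {
  const q = forwardedQuery(query, whitelist);
  const pairs = Object.keys(q).map((name) => `${name}=${q[name]}`);
  return pairs.length > 0 ? `${path}?${pairs.join("&")}` : path;
}
```

In this model, if `sort` is added to the API but not to the whitelist, requests with `sort=asc` and `sort=desc` share one cache key and the server never sees the parameter.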

Server failure

The servers referred to here include API servers, DB servers, etc. All of them require constant monitoring. In principle, do not create a single point of failure. (For example, if there is only one server and it goes down, the entire system stops.) The following server resource problems are typical:

- Delays due to insufficient CPU performance
- Insufficient memory
- Insufficient disk space

Specifically, the following measures are available.

Raise CPU specs

If the CPU stays pinned at 100% because of high-load processing such as an infinite loop, other processing cannot run and performance drops. Monitor CPU utilization and API response time, and if CPU resources are insufficient, fix the processing that causes the load or raise the CPU specifications. Also note that burstable instances such as AWS t2, and shared plans such as Heroku dynos, cap CPU resources: once a t2 instance uses up its CPU credits it is throttled to its baseline performance, so under sustained load the server effectively stalls. You need to enable burst beyond the baseline (for an extra charge) or move to a larger instance.

Create swap space

Creating a swap area lets the server temporarily use disk space when memory runs short. Because swap is on disk, I/O occurs and performance drops, but at least the server does not stop when memory is exhausted. (You want to add memory before the swap area is used up.) When memory is exhausted you may not even be able to open a file in an editor such as vim, so restart the server, or open the file with less and edit it from there. (In my experience, less was the lightest.)

Set up log rotation, and do not save uploaded files on the API server

If you do nothing about log files, they fill the server disk and eventually exhaust it. The logrotate command can delete old log files on a regular basis, which prevents log files from exhausting the disk. When the service handles file uploads, store the files in a file storage service such as S3 instead of on the server itself.

Leave a backup

If you are using AWS EC2, you can back up an entire EC2 instance from the AWS console. This is mainly for problems that cannot be recovered from immediately: roll back using this backup together with the DB backup.

Set up monitoring

With Datadog you can send notifications (email, Slack) when a threshold for CPU, memory, or disk is exceeded, or when a health check gets no response for a certain period of time.

Redundancy, autoscale

If you run multiple servers behind a load balancer, another server can take over when one goes down, so the system does not have to stop. (The load balancer's health check monitors whether each server is alive.) If the load increases, autoscale by increasing the number of servers. On a serverless managed service such as AWS Lambda or Firebase Functions, scaling out is automatic and you only need to configure the memory size. If your server runs as a Docker container, you can also use AWS Fargate to autoscale the containers.

Check for failures in the cloud itself

In rare cases, the cloud service itself may be down. Check the status pages: AWS Service Health Dashboard for AWS and GitHub Status for GitHub. If service is not restored for a long time the business impact is very large, so countermeasures include distributing the system across regions to make it redundant.

Implementation bug countermeasures

This is the most common cause of accidents. It often happens because the implementer has a poor understanding of the system specifications, the language specifications, or the libraries. Accidents also become more likely when technical debt accumulates: design mistakes pile up and code readability, uniformity, and searchability are lost. What to look for in code review is summarized in Code Review Cheat Sheet.

Security accident

An accident that damages users' trust in the system. It can also lead to leaked personal information and, at worst, financial damage. Plain HTTP can be eavesdropped on (and browsers show a warning for it anyway), so use HTTPS. Pay particular attention to form inputs, where users can enter text relatively freely, as they tend to become security holes.

- XSS: A malicious user submits a form containing a JS script, which is saved in the DB. When another user views that data, the script runs in their browser and sends information such as the login token stored in the browser to the attacker. As a countermeasure, escape the input so that the front end never renders it as executable code. Frameworks such as React escape it automatically (except for dangerouslySetInnerHTML).
- SQL injection: A malicious user submits a SQL statement in a form to steal or tamper with DB information. Instead of embedding user input directly in the query, use the placeholder function so that the input is never executed as part of the SQL statement.
- Session hijacking: Stealing or guessing another user's login token and logging in as that user. If this succeeds, the attacker can impersonate the user and obtain their information. Make sure login tokens cannot be tampered with or leaked, and never switch sessions based on easy-to-guess information such as id=1 without issuing a login token per user. Use a tamper-evident JSON Web Token (JWT) as the login token.
- CSRF: If the API server accepts cross-origin requests (for example through a permissive CORS setting), a form on an external site can submit to it. A phishing site can therefore send requests to the real server and use that to steal user information. As a countermeasure, embed a one-time token in forms that send important information such as login information every time the form is displayed, and verify on submission that the form came from the legitimate page.
- DoS attack: An attack that tries to bring down a server with a large amount of meaningless access. Measures include temporarily banning the IP address with a firewall. On AWS, you can use WAF, for example.

There is much more, but at a minimum: never store raw passwords in the DB, store only hashes. It is also essential to separate APIs that require authentication from those that do not.
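Two of the countermeasures above, output escaping for XSS and placeholders for SQL injection, can be sketched as follows. This is a minimal illustration, not a complete defense: `escapeHtml` and the `sql` tag are hypothetical helpers, and the `$1, $2` placeholder style is an assumption (it is the PostgreSQL convention; other databases use `?`). In practice, rely on your framework's escaping and your DB driver's parameter binding.

```typescript
/** Escape characters that have special meaning in HTML so input renders as text. */
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

interface ParameterizedQuery {
  text: string;      // SQL text containing $1, $2, ... placeholders
  values: unknown[]; // user input, kept separate from the SQL text
}

/** Tagged template: interpolated values become placeholders, never SQL text. */
function sql(strings: TemplateStringsArray, ...values: unknown[]): ParameterizedQuery {
  // Joins the literal parts with $1, $2, ... in between.
  const text = strings.reduce((acc, part, i) => `${acc}$${i}${part}`);
  return { text, values };
}
```

With this tag, an input such as `x' OR '1'='1` ends up in `values`, not in the SQL string, so the database treats it as data.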

Accidents around routing

These occur when caches remain, or when adding a path overwrites another path.

Overwriting other routes when adding a route

For example, when you add an API path, it overrides an existing route and the original API can no longer be reached. (In an SPA, the same thing happens with client-side routing such as React Router: overriding a route makes the intended page impossible to display.)

POST /api/hoge
POST /api/hoge/:id //← Add
POST /api/hoge/:key //← The URL parameter name differs, but the route above matches first, so this one can no longer be reached.

When adding a route, check whether it overrides access to other paths, and if so change the order or use a different path. Concretely, you can write API call tests and run them as regression tests in CI. Since these tests only check which path is called, the function that is expected to be called can be mocked.
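As a rough sketch of such a call test, the following models a first-match-wins router with the three routes above and detects the shadowed one. The router, the handler names, and `shadowedHandlers` are all hypothetical; a real project would run the equivalent check against its actual framework's router.

```typescript
interface Route { method: string; pattern: string; handler: string; }

/** True when `pattern` (with :param segments) matches the concrete `path`. */
function matches(pattern: string, path: string): boolean {
  const p = pattern.split("/");
  const s = path.split("/");
  if (p.length !== s.length) return false;
  return p.every((seg, i) => seg.charAt(0) === ":" || seg === s[i]);
}

/** First-match-wins resolution: the earliest matching route is used. */
function resolve(routes: Route[], method: string, path: string): string | undefined {
  for (const route of routes) {
    if (route.method === method && matches(route.pattern, path)) return route.handler;
  }
  return undefined;
}

/** Handlers that can never be reached because an earlier route matches first. */
function shadowedHandlers(routes: Route[]): string[] {
  return routes
    .filter((r) => {
      const sample = r.pattern.replace(/:[^/]+/g, "x"); // fill params with a dummy value
      return resolve(routes, r.method, sample) !== r.handler;
    })
    .map((r) => r.handler);
}

const routes: Route[] = [
  { method: "POST", pattern: "/api/hoge", handler: "createHoge" },
  { method: "POST", pattern: "/api/hoge/:id", handler: "updateHogeById" },   // added
  { method: "POST", pattern: "/api/hoge/:key", handler: "updateHogeByKey" }, // shadowed
];
```

A CI test can then assert that `shadowedHandlers(routes)` is empty, which fails as soon as a new route hides an existing one.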

Deleting old APIs and page paths while caches remain

If you delete the URL of an old API or page, browsers that still hold a cache keep accessing the old URL, which is a problem. Especially in an SPA, the old bundle.js stays in use until the CDN cache and the browser cache expire and the browser is reloaded, so you need to clear the CDN cache and redirect the old API to the new one. Crawlers such as Googlebot also remember the old path and keep accessing it. As long as caches exist, old APIs and pages will be accessed, so set up a 301 redirect to the new page.

Accidents caused by external API

If you are calling the API of an external service, you need to check the specifications of the external API.

Processing at the time of error

Error handling tends to be overlooked. You need to handle errors even when the API specification does not say what kind of error is returned, and also when a communication error occurs. You also have to decide whether subsequent processing should continue when an error occurs.

API request limit

The request limit is also easy to overlook if you do not read the API specification. Locally the number of requests is small and nothing goes wrong, but in production you may send a large number of requests and start receiving errors. In many cases the limit can be raised by paying, but do that only when there is no alternative or when the cost can be recovered. If you cannot raise the limit, queue (batch) the requests so that the limit is not exceeded, or, if real-time behavior is required, return an error and have the user wait.
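The queueing approach could look roughly like the sketch below: a sliding-window limiter plus a function that sends queued jobs only while the limit allows. The class and function names, and the idea of passing the current time in explicitly, are assumptions for illustration; the real limit and window come from the external API's specification.

```typescript
/** Allows at most `limit` calls per `windowMs`; callers pass the current time. */
class RateLimiter {
  private timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Records the call and returns true if under the limit, otherwise returns false. */
  tryAcquire(nowMs: number): boolean {
    this.timestamps = this.timestamps.filter((t) => nowMs - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

/** Sends queued jobs while the limiter allows; returns the jobs still waiting. */
function drainQueue<T>(queue: T[], limiter: RateLimiter, nowMs: number, send: (job: T) => void): T[] {
  const remaining = queue.slice();
  while (remaining.length > 0 && limiter.tryAcquire(nowMs)) {
    send(remaining.shift() as T);
  }
  return remaining;
}
```

A batch job would call `drainQueue` periodically. For a real-time endpoint, a `false` from `tryAcquire` would instead be turned into an error response asking the user to wait.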

API call account session expired, session limit

Some APIs are stateful rather than stateless (REST API) and require you to keep a session for a user account. If the session expires you must log in again, so implement the re-login handling. For external services that limit sessions, make sure you do not exceed the session limit, etc.

Memory leak

In languages without GC (garbage collection), such as C and C++, dynamically allocated memory keeps consuming the memory area unless it is explicitly released after use. Leaks also occur in languages with GC (JavaScript, etc.): an instance allocated with new is not released as long as something still references it. With reference-counted memory management, a circular reference alone keeps the count above zero and causes a leak. Tracing collectors, such as those in modern JavaScript engines, can collect unreachable cycles, but a cycle that is still reachable from a long-lived object (a global cache, an event listener, a closure) is never released. Countermeasures are to locate the leak and to use a weak reference (WeakRef) or a smart pointer where a circular reference is needed. It also helps to avoid new and circular references as much as possible in the first place.

In JavaScript, WeakRef became part of the language standard in ECMAScript 2021. For finding leaks, NodeJS memory leak detection method is helpful.

Accident due to poor performance

Pay attention to performance on tables with a large number of records. When backend performance degrades, response times grow and the system may hang. For reads, add an index to the fields that are often used for searching. For mass writes in migration scripts and batch processing, use bulk operations so that the writes finish in a short time. Beyond that, a stopgap form of load distribution is to start one server process per CPU core (cluster). Output a flame graph to find system-wide performance bottlenecks.

For Node.js, for example, Node.js Performance Tuning Starting from 0 gives a well-organized investigation method.
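Bulk writing usually comes down to splitting the rows into batches and issuing one write per batch instead of one per row. A minimal sketch, where `insertMany` stands in for whatever bulk-insert call your DB client provides (a hypothetical callback, not a real library API):

```typescript
/** Split `items` into consecutive batches of at most `size` elements. */
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/** One `insertMany` call per batch; returns the number of calls made. */
function bulkWrite<T>(rows: T[], batchSize: number, insertMany: (batch: T[]) => void): number {
  const batches = chunk(rows, batchSize);
  batches.forEach((batch) => insertMany(batch));
  return batches.length;
}
```

The batch size is a tuning parameter: too small and the round trips dominate, too large and a single statement may hit size or lock-time limits.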

Data inconsistency accident

A mistake in a data migration script (for example one that inserts data), or an error partway through, leaves the data inconsistent. Implement such scripts in a transaction and roll back if there is a problem. Also back up the data before execution in case a problem is found afterwards. Likewise, important processes that write to multiple tables, such as billing, should be transactional to prevent inconsistencies.
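The commit-or-rollback idea can be illustrated with an in-memory stand-in for a database. This is only a model of the behavior: a real implementation would use the transaction API of your DB client, and `withTransaction`, `transfer`, and the account data are hypothetical.

```typescript
type Accounts = Record<string, number>;
interface TxResult { ok: boolean; error?: string; }

/** Runs `work` on a copy of the data and commits only if `work` does not throw. */
function withTransaction(db: { accounts: Accounts }, work: (tx: Accounts) => void): TxResult {
  const tx: Accounts = { ...db.accounts }; // stand-in for BEGIN: work on a copy
  try {
    work(tx);
    db.accounts = tx;                      // stand-in for COMMIT
    return { ok: true };
  } catch (e) {
    // stand-in for ROLLBACK: the modified copy is simply discarded
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}

/** Two writes that must succeed or fail together. */
function transfer(tx: Accounts, from: string, to: string, amount: number): void {
  tx[from] = (tx[from] || 0) - amount;                        // first write
  if (!(to in tx)) throw new Error(`unknown account: ${to}`); // fails after the first write
  tx[to] = (tx[to] || 0) + amount;                            // second write
}
```

Without the transaction, the failed transfer would leave the first write in place and the money would disappear from one account without arriving in the other.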

API migration accident

Suppose the main system references a subsystem API, and the subsystem API must be upgraded to return different data. You basically cannot release the subsystem and the main system at exactly the same moment, so proceed in steps:

  1. Implement the new API in the subsystem, and do not delete the old API yet. Release the subsystem.
  2. In the main system, implement the call to the subsystem's new API and remove the call to the old API. Release the main system.
  3. Delete the old API from the subsystem. Release the subsystem.

Accidents where changes affect subsystems

Consider whether a change affects not only the main system but also subsystems. This is hard without knowledge of the whole system, so there is no choice but to have people who know it implement and review the change. A mechanism for sharing information regularly is needed. This kind of accident is especially likely when a DB table field is changed or deleted: queries in BI tools such as Redash, or automatic data synchronization to another system such as Salesforce, may break.

Exclusive control accident

DB transactions, multi-thread exclusive control, etc. block other processing, so if you forget to release the lock, the system hangs (deadlock). One measure is to wrap the processing in a try block from the start of the transaction and always release the lock in the finally block.
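The try / finally pattern can be sketched with a toy lock. `SimpleLock` and `withLock` are hypothetical names; the point is only that the release sits in `finally`, so it runs whether the work succeeds or throws.

```typescript
/** Toy lock for illustration; real code would use the DB's or runtime's lock. */
class SimpleLock {
  private held = false;

  acquire(): void {
    if (this.held) throw new Error("lock already held");
    this.held = true;
  }

  release(): void {
    this.held = false;
  }

  isHeld(): boolean {
    return this.held;
  }
}

/** Runs `work` while holding the lock and always releases it afterwards. */
function withLock<T>(lock: SimpleLock, work: () => T): T {
  lock.acquire();
  try {
    return work();
  } finally {
    lock.release(); // runs on both success and failure
  }
}
```

Wrapping the pattern in one helper means callers cannot forget the release, which is the mistake the paragraph above warns about.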

Accidents associated with library (OSS) version upgrades

This happens when you manage third-party libraries with a package manager such as npm or gem. Use a library only when its benefit outweighs the maintenance cost: an over-specified library takes up space (which matters especially when users download the application or JS files), and you take on the duty of upgrading it. In the first place, implement with the language features and standard APIs and do not add unnecessary libraries. Pin library versions (major, minor, and patch) and do not raise them casually until operation has been confirmed. Lock files such as package-lock.json and yarn.lock record the versions of the libraries that your third-party libraries depend on, so do not delete them casually. If you delete them, reinstalling pulls in newer versions of those transitive dependencies, which may cause an accident (it has happened once).

Technical debt

Technical debt may not be the direct cause of an accident, but it can become one if neglected.

Implement in a typed language

In particular, the backend should be implemented in a statically typed language (Node.js + TypeScript, Go, Java). The reason is that compile-time type checking prevents careless mistakes.

- Type checking prevents arguments of an unintended type from being passed
- Type checking prevents you from forgetting to pass an argument
- With types, you can easily tell whether a value is primitive data or a class or object type
- With type checking, the type definition tells you whether an argument is optional (in the case of TypeScript)
- The return type is known
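A small TypeScript illustration of these points, using a hypothetical `ChargeRequest` type. The two commented-out calls are the kind of mistake the compiler rejects before the code ever runs.

```typescript
interface ChargeRequest {
  userId: string;
  amountYen: number;
  memo?: string; // the "?" shows in the type definition that this field is optional
}

/** The return type is declared, so callers know they get a string. */
function describeCharge(req: ChargeRequest): string {
  const memo = req.memo !== undefined ? ` (${req.memo})` : "";
  return `charge ${req.amountYen} yen to ${req.userId}${memo}`;
}

// describeCharge({ userId: "u1" });                   // compile error: amountYen is missing
// describeCharge({ userId: "u1", amountYen: "100" }); // compile error: string is not a number
const message: string = describeCharge({ userId: "u1", amountYen: 100 });
```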

For TypeScript implementation, Clean Code for TypeScript is helpful.

Design

Keep KISS (keep the design and implementation simple) in mind. When using classes, being aware of the SOLID principles and the Law of Demeter makes the code robust to specification changes and easy to test. (Separation of concerns)

- Give default arguments to make functions fail-safe. However, if the default is an empty function, an implementation omission goes unnoticed, so send an error log and a notification so that it is found immediately.
- Directly changing or deleting a DB table field causes accidents, so add a new field and delete the original field only after migrating the processing and the data.
- Following DRY, factor similar processing into shared functions. Utilities that rarely change may be shared, but excessive sharing unnecessarily widens the scope of impact ... (see Are you misunderstanding the DRY principle?)
- Be careful about the scope of impact when modifying functions and table fields that are referenced in many places. Understand it by, for example, outputting a dependency graph with a tool.
- It is desirable to abstract business logic behind an interface or abstract class, because that makes it easier to respond to specification changes. The implementation is left to each class, but the argument and return types are guaranteed.

Readability, ease of search

Unify naming rules and coding rules. Inconsistency is tolerable while the project is small, but in a large project work efficiency clearly drops if a file cannot be found immediately. Settle on one of camel case, snake case, etc. (do not mix!). Surprisingly important: prevent typos, do not reuse the same variable or function name for a different meaning, and eliminate spelling variations. This prevents omissions in search results and misunderstandings. (Unify the domain model.) You can use a fuzzy finder such as fzf to search for existing typos in the project. The number of bugs grows in proportion to the amount of code, so always keep YAGNI in mind (do not implement what is not needed, and do not leave it lying around).

- Introduce a linter to unify coding rules
- Write appropriate comments (mainly explaining functions and business logic specifications), as concisely as possible
- Unify naming conventions for file names, variable names, and function names, and choose names that are easy to understand (no typos)
- Do not make function calls (the call stack) deep, as that makes the code hard to follow
- Do not nest deeply (early return for conditional branches, await for asynchronous callbacks)
- Do not give a variable or DB table field two meanings
- Prefer composition over inheritance (inheritance risks inheriting unnecessary variables and methods and reduces readability and maintainability). There is no need to ban inheritance itself, but I think one level of child class is the limit
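The "do not nest deeply" item is easiest to see side by side. Both functions below implement the same hypothetical rule; the second uses early returns so that each precondition is handled and dismissed on its own line.

```typescript
interface Order { paid: boolean; items: string[]; shipped: boolean; }

// Nested version: the main path is buried three levels deep.
function canShipNested(order: Order | null): boolean {
  if (order !== null) {
    if (order.paid) {
      if (order.items.length > 0) {
        return !order.shipped;
      }
    }
  }
  return false;
}

// Early-return version: same behavior with a single level of nesting.
function canShip(order: Order | null): boolean {
  if (order === null) return false;
  if (!order.paid) return false;
  if (order.items.length === 0) return false;
  return !order.shipped;
}
```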

PR rules

PR rules help prevent accidents. Do not give one PR multiple roles (single responsibility). Since the number of errors grows in proportion to the amount of code, keep the amount of change small. Changes that span many source files, or that are highly interdependent, are especially dangerous, so do not mix changes as far as possible. Use a checklist for items that require post-release work and for important fixes.

- Do not refactor in the same PR as a feature addition or change; separate the PRs
- Split refactoring that is too large into small pieces and separate PRs
- Put a checklist in the PR template
- Run lint and tests when a PR is created

Test

Write tests that become an asset, and run them as regression tests in CI.

- Write unit tests for normal, boundary, and abnormal cases
- Do not write duplicate tests
- Make tests run in parallel as much as possible
- The more complicated the conditions, the harder a bug is to reproduce, so structure the code so that it can be unit tested and covered by unit tests
- Do not put different kinds of tests, such as unit tests and API tests, in the same file (separate them by folder)
- E2E tests are fragile but cover a lot, so use them selectively for core functions of the service
- Visual regression tests are good for checking display differences (they can also keep up with UI library version upgrades)
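As an illustration of covering normal, boundary, and abnormal cases, here is a hypothetical `applyDiscount` function with hand-rolled checks. The tiny `expectEqual` / `expectThrows` helpers are assumptions that keep the sketch self-contained; a real project would use its test framework's assertions.

```typescript
/** Price after a percentage discount, rounded down to a whole unit. */
function applyDiscount(price: number, percent: number): number {
  if (price < 0 || percent < 0 || percent > 100) {
    throw new RangeError("price must be >= 0 and percent must be within 0..100");
  }
  return Math.floor((price * (100 - percent)) / 100);
}

function expectEqual(actual: number, expected: number, label: string): void {
  if (actual !== expected) throw new Error(`${label}: expected ${expected}, got ${actual}`);
}

function expectThrows(fn: () => unknown, label: string): void {
  try {
    fn();
  } catch (e) {
    return; // an error was thrown, as expected
  }
  throw new Error(`${label}: expected an error`);
}

// Normal case
expectEqual(applyDiscount(1000, 20), 800, "normal");
// Boundary cases
expectEqual(applyDiscount(1000, 0), 1000, "0 percent");
expectEqual(applyDiscount(1000, 100), 0, "100 percent");
expectEqual(applyDiscount(0, 50), 0, "zero price");
// Abnormal cases
expectThrows(() => applyDiscount(1000, 101), "percent above 100");
expectThrows(() => applyDiscount(-1, 10), "negative price");
```

Because `applyDiscount` is a pure function, these checks need no setup and can run in parallel with other tests.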