Network problem caused by a repaired server restarting and pulling binlog from the master, and how to solve it

2021-01-03 21:06:36
OfStack
Problem Description:

A week ago, one of our MySQL servers went down with a hardware failure. We filed a repair request with the colleagues responsible for hardware, and they took over fixing the server. When the repair was finished today, they powered the server on. The four MySQL instances on it start automatically at boot, so they immediately began pulling binlog from their masters. Because the server had been down for so long, the slaves were far behind, and the bulk binlog transfer caused network problems on the master side.

Symptoms:

At first we did not realize that the problem was caused by the failed server coming back up and pulling binlog from the master, because we knew nothing about the state of that server beyond the fact that we had reported it for repair a week earlier. We did not know whether it had been fixed or whether it had been powered on.
All we had to go on was a sudden report from a colleague on the network team that one of the MySQL machines was generating too much network traffic, which made the business noticeably slow for 17 minutes in total. That was not much of a clue.

Investigation:

The processlist, the general log, and the slow log showed no problems.

The monitoring graphs showed that read IO on the server had risen sharply during that period.
The processlist history showed that, for a period of time, the threads of the replication user were in a state waiting on the network. Their IP address led us to the slave server that had failed a week earlier.

Conclusion:
There are 4 instances on this server. When the server boots, the MySQL instances start automatically and pull binlog from their masters. Each master produces about 6 GB of binlog per day, so after a week of downtime the 4 instances had roughly 160 GB of binlog to pull (4 × 6 GB × 7 days ≈ 168 GB).
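The backlog arithmetic can be sketched in a few lines of Python. This is only a rough estimate: the function names are made up for illustration, and the link speed of 125 MB/s (about 1 Gbit/s) is an assumption, not a figure from the incident.

```python
def binlog_backlog_gb(instances, daily_gb, days_down):
    """Total binlog (GB) that a returning server has to pull from its masters."""
    return instances * daily_gb * days_down


def catch_up_minutes(total_gb, link_mb_per_s):
    """Minutes needed to move total_gb over a link, taking 1 GB = 1024 MB."""
    return total_gb * 1024 / link_mb_per_s / 60


# Figures from the incident: 4 instances, about 6 GB per day, 7 days down.
backlog = binlog_backlog_gb(instances=4, daily_gb=6, days_down=7)
print(backlog)  # 168

# Assumed link speed (hypothetical): 125 MB/s, roughly a saturated 1 Gbit/s link.
print(round(catch_up_minutes(backlog, 125), 1))  # 22.9
```

Under that assumed link speed, the catch-up would occupy the link for a little over 20 minutes, which is the same order of magnitude as the 17 minutes of slowness that was reported.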

Problems:
1. We have no control over when a failed server is repaired or when it is powered back on, and we neither knew about it nor paid attention to it.
2. This is a simple, typical scenario that can cause an impact or an outage. We knew in principle that it is an easy problem to run into, but in practice we had no awareness of it and did not anticipate it, which is why the incident happened.
3. We lack effective monitoring of network traffic.
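For the third point, a minimal traffic check might look like the Python sketch below. It assumes a Linux host that exposes interface counters in /proc/net/dev. The interface name, sampling interval, and alert threshold are hypothetical values to be adapted, and this is not the monitoring we actually run.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def tx_rate_mb_s(before, after, iface, seconds):
    """Average outbound rate in MB/s (1 MB = 10**6 bytes) between two samples."""
    return (after[iface][1] - before[iface][1]) / seconds / 1e6


def read_counters(path="/proc/net/dev"):
    """Read the current counters from the kernel (Linux only)."""
    with open(path) as f:
        return parse_net_dev(f.read())


def check_once(iface="eth0", interval=5, threshold_mb_s=100.0):
    """Sample twice and report whether outbound traffic exceeds the threshold.

    All three defaults are hypothetical placeholders.
    """
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    rate = tx_rate_mb_s(before, after, iface, interval)
    return rate, rate > threshold_mb_s
```

Running `check_once()` periodically on a master and alerting when the second return value is true would have flagged the outbound binlog traffic while it was happening, rather than leaving us to reconstruct it afterwards.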

Solutions:
1. Disable automatic startup of MySQL at boot on all servers, and start the slave manually after the server is up. (With a large number of servers this may be too much trouble; in that case, at least keep a record of the affected servers first, so that they do not cause an impact when they come back.)
2. Be aware of this problem and add it to the team's knowledge base or runbook so that it is not repeated.