PHP realizes the method of locating fault cause through strace
- 2021-09-20 19:47:52
- OfStack
In this paper, an example is given to describe the method of locating the fault cause through strace by PHP. Share it for your reference, as follows:
As the saying goes: If you are not afraid of thieves stealing, you are afraid of thieves thinking about it. In the face of failure, I also have a similar feeling: I am not afraid of failure, I am afraid that you don't know the cause of failure, but the failure comes to you every 3 to 5.
The 101 holiday is not over yet, but the server frequently has high load, and Nginx has an error log:
connect() failed (110: Connection timed out) while connecting to upstream
connect() failed (111: Connection refused) while connecting to upstream
It seems that there is something wrong with Upstream, in this case Upstream is PHP (version: 5.2. 5). Unfortunately, the monitoring is not perfect, so I can't figure out what went wrong, so I have to restart PHP constantly to alleviate the failure.
If you manually restart the service every time, it is undoubtedly a chore. Fortunately, it can be performed every minute through CRON settings:
#/bin/bash
LOAD=$(awk '{print $1}' /proc/loadavg)
if [ $(echo "$LOAD > 100" | bc) = 1 ]; then
/etc/init.d/php-fpm restart
fi
Unfortunately, this is only an expedient measure. If you want to solve it completely, you must find out what the real cause of the failure is.
Don't talk about gossip, it's Strace's turn to appear, and count the time-consuming situation of each system call under 1:
shell> strace -c -p $(pgrep -n php-cgi)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
30.53 0.023554 132 179 brk
14.71 0.011350 140 81 mlock
12.70 0.009798 15 658 16 recvfrom
8.96 0.006910 7 927 read
6.61 0.005097 43 119 accept
5.57 0.004294 4 977 poll
3.13 0.002415 7 359 write
2.82 0.002177 7 311 sendto
2.64 0.002033 2 1201 1 stat
2.27 0.001750 1 2312 gettimeofday
2.11 0.001626 1 1428 rt_sigaction
1.55 0.001199 2 730 fstat
1.29 0.000998 10 100 100 connect
1.03 0.000792 4 178 shutdown
1.00 0.000773 2 492 open
0.93 0.000720 1 711 close
0.49 0.000381 2 238 chdir
0.35 0.000271 3 87 select
0.29 0.000224 1 357 setitimer
0.21 0.000159 2 81 munlock
0.17 0.000133 2 88 getsockopt
0.14 0.000110 1 149 lseek
0.14 0.000106 1 121 mmap
0.11 0.000086 1 121 munmap
0.09 0.000072 0 238 rt_sigprocmask
0.08 0.000063 4 17 lstat
0.07 0.000054 0 313 uname
0.00 0.000000 0 15 1 access
0.00 0.000000 0 100 socket
0.00 0.000000 0 101 setsockopt
0.00 0.000000 0 277 fcntl
------ ----------- ----------- --------- --------- ----------------
100.00 0.077145 13066 118 total
It seems that "brk" is very suspicious. It actually took 30% of the time. To be on the safe side, confirm it separately:
shell> strace -T -e brk -p $(pgrep -n php-cgi)
brk(0x1f18000) = 0x1f18000 <0.024025>
brk(0x1f58000) = 0x1f58000 <0.015503>
brk(0x1f98000) = 0x1f98000 <0.013037>
brk(0x1fd8000) = 0x1fd8000 <0.000056>
brk(0x2018000) = 0x2018000 <0.012635>
Note: In Strace, there are two options related to the time spent in operation, which are "-r" and "-T". The difference between them is that "-r" means relative time, while "-T" means absolute time. Simple statistics can use "-r", but it should be noted that under the background of multitasking, CPU may be switched out to do other things at any time, so the relative time is not accurate. At this time, it is best to use "-T", and the operation time can be seen at the end of the line, which can be found to be really slow.
Before continuing to locate the cause of the fault, let's check its meaning under "man brk":
brk() sets the end of the data segment to the value specified by end_data_segment, when that value is reasonable, the system does have enough memory and the process does not exceed its max data size (see setrlimit(2)).
Simply put, when there is not enough memory, use it to apply for new memory (data segment), but why?
shell> strace -T -p $(pgrep -n php-cgi) 2>&1 | grep -B 10 brk
stat("/path/to/script.php", {...}) = 0 <0.000064>
brk(0x1d9a000) = 0x1d9a000 <0.000067>
brk(0x1dda000) = 0x1dda000 <0.001134>
brk(0x1e1a000) = 0x1e1a000 <0.000065>
brk(0x1e5a000) = 0x1e5a000 <0.012396>
brk(0x1e9a000) = 0x1e9a000 <0.000092>
Through "grep", we can easily get the relevant context, run it several times, and find that whenever some PHP scripts are requested, there will be several time-consuming "brk", and these PHP scripts have a common feature, that is, they are very large, even hundreds of K, why are there such large PHP scripts? In fact, in order to avoid database operation, Persisting very large array variables into the PHP file through "var_export", Then, the corresponding variables are obtained through "include" in the program. Because the variables are too large, PHP has to execute "brk" frequently. Unfortunately, in the environment of this example, this operation is slow, which leads to too long request processing time. In addition, the number of PHP processes is limited, which leads to request congestion on Nginx and eventually leads to high load failure.
The following needs to verify whether the inference under 1 seems to be correct. First, query what is involved in the problem script under 1:
shell> find /path -name "*.php" | xargs grep "script.php"
Disable them all directly to see if the server can slow down. Maybe everyone thinks this is too Lumeng, but special decisions must be made under special circumstances. You can't be indecisive like a pussy. Before long, the server load returns to normal, and then count the time spent on system calls:
shell> strace -c -p $(pgrep -n php-cgi)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
24.50 0.001521 11 138 2 recvfrom
16.11 0.001000 33 30 accept
7.86 0.000488 8 59 sendto
7.35 0.000456 1 360 rt_sigaction
6.73 0.000418 2 198 poll
5.72 0.000355 1 285 stat
4.54 0.000282 0 573 gettimeofday
4.41 0.000274 7 42 shutdown
4.40 0.000273 2 137 open
3.72 0.000231 1 197 fstat
2.93 0.000182 1 187 close
2.56 0.000159 2 90 setitimer
2.13 0.000132 1 244 read
1.71 0.000106 4 30 munmap
1.16 0.000072 1 60 chdir
1.13 0.000070 4 18 setsockopt
1.05 0.000065 1 100 write
1.05 0.000065 1 64 lseek
0.95 0.000059 1 75 uname
0.00 0.000000 0 30 mmap
0.00 0.000000 0 60 rt_sigprocmask
0.00 0.000000 0 3 2 access
0.00 0.000000 0 9 select
0.00 0.000000 0 20 socket
0.00 0.000000 0 20 20 connect
0.00 0.000000 0 18 getsockopt
0.00 0.000000 0 54 fcntl
0.00 0.000000 0 9 mlock
0.00 0.000000 0 9 munlock
------ ----------- ----------- --------- --------- ----------------
100.00 0.006208 3119 24 total
Obviously, "brk" is gone and replaced by "recvfrom" and "accept", but these operations are inherently time consuming, so it is possible to locate "brk" as the cause of the failure.
…
Embrace faults, and every fault is experience. As the saying goes, if heaven will bring down great responsibilities to Sri Lankan people, they must first suffer from their minds, strain their bones and muscles, starve their body and skin, lack their body, and do whatever they do, so they should be patient and gain what they can't do.
For more readers interested in PHP related content, please check the topics on this site: "Summary of PHP Error and Exception Handling Methods", "Summary of php String (string) Usage", "Encyclopedia of PHP Array (Array) Operation Skills", "Summary of PHP Operation and Operator Usage", "Summary of PHP Network Programming Skills", "Introduction to PHP Basic Syntax" and "Introduction to php Object-Oriented Programming"
I hope this paper is helpful to everyone's PHP programming.