PHP realizes the method of locating fault cause through strace

  • 2021-09-20 19:47:52
  • OfStack

In this paper, an example is given to describe the method of locating the fault cause through strace by PHP. Share it for your reference, as follows:

As the saying goes: If you are not afraid of thieves stealing, you are afraid of thieves thinking about it. In the face of failure, I also have a similar feeling: I am not afraid of failure, I am afraid that you don't know the cause of failure, but the failure comes to you every 3 to 5.

The 101 holiday is not over yet, but the server frequently has high load, and Nginx has an error log:

connect() failed (110: Connection timed out) while connecting to upstream
connect() failed (111: Connection refused) while connecting to upstream

It seems that there is something wrong with Upstream, in this case Upstream is PHP (version: 5.2. 5). Unfortunately, the monitoring is not perfect, so I can't figure out what went wrong, so I have to restart PHP constantly to alleviate the failure.

If you manually restart the service every time, it is undoubtedly a chore. Fortunately, it can be performed every minute through CRON settings:


#/bin/bash
LOAD=$(awk '{print $1}' /proc/loadavg)
if [ $(echo "$LOAD > 100" | bc) = 1 ]; then
  /etc/init.d/php-fpm restart
fi

Unfortunately, this is only an expedient measure. If you want to solve it completely, you must find out what the real cause of the failure is.

Don't talk about gossip, it's Strace's turn to appear, and count the time-consuming situation of each system call under 1:


shell> strace -c -p $(pgrep -n php-cgi)
% time   seconds usecs/call   calls  errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.53  0.023554     132    179      brk
 14.71  0.011350     140    81      mlock
 12.70  0.009798     15    658    16 recvfrom
 8.96  0.006910      7    927      read
 6.61  0.005097     43    119      accept
 5.57  0.004294      4    977      poll
 3.13  0.002415      7    359      write
 2.82  0.002177      7    311      sendto
 2.64  0.002033      2   1201     1 stat
 2.27  0.001750      1   2312      gettimeofday
 2.11  0.001626      1   1428      rt_sigaction
 1.55  0.001199      2    730      fstat
 1.29  0.000998     10    100    100 connect
 1.03  0.000792      4    178      shutdown
 1.00  0.000773      2    492      open
 0.93  0.000720      1    711      close
 0.49  0.000381      2    238      chdir
 0.35  0.000271      3    87      select
 0.29  0.000224      1    357      setitimer
 0.21  0.000159      2    81      munlock
 0.17  0.000133      2    88      getsockopt
 0.14  0.000110      1    149      lseek
 0.14  0.000106      1    121      mmap
 0.11  0.000086      1    121      munmap
 0.09  0.000072      0    238      rt_sigprocmask
 0.08  0.000063      4    17      lstat
 0.07  0.000054      0    313      uname
 0.00  0.000000      0    15     1 access
 0.00  0.000000      0    100      socket
 0.00  0.000000      0    101      setsockopt
 0.00  0.000000      0    277      fcntl
------ ----------- ----------- --------- --------- ----------------
100.00  0.077145         13066    118 total

It seems that "brk" is very suspicious. It actually took 30% of the time. To be on the safe side, confirm it separately:


shell> strace -T -e brk -p $(pgrep -n php-cgi)
brk(0x1f18000) = 0x1f18000 <0.024025>
brk(0x1f58000) = 0x1f58000 <0.015503>
brk(0x1f98000) = 0x1f98000 <0.013037>
brk(0x1fd8000) = 0x1fd8000 <0.000056>
brk(0x2018000) = 0x2018000 <0.012635>

Note: In Strace, there are two options related to the time spent in operation, which are "-r" and "-T". The difference between them is that "-r" means relative time, while "-T" means absolute time. Simple statistics can use "-r", but it should be noted that under the background of multitasking, CPU may be switched out to do other things at any time, so the relative time is not accurate. At this time, it is best to use "-T", and the operation time can be seen at the end of the line, which can be found to be really slow.

Before continuing to locate the cause of the fault, let's check its meaning under "man brk":

brk() sets the end of the data segment to the value specified by end_data_segment, when that value is reasonable, the system does have enough memory and the process does not exceed its max data size (see setrlimit(2)).

Simply put, when there is not enough memory, use it to apply for new memory (data segment), but why?


shell> strace -T -p $(pgrep -n php-cgi) 2>&1 | grep -B 10 brk
stat("/path/to/script.php", {...}) = 0 <0.000064>
brk(0x1d9a000) = 0x1d9a000 <0.000067>
brk(0x1dda000) = 0x1dda000 <0.001134>
brk(0x1e1a000) = 0x1e1a000 <0.000065>
brk(0x1e5a000) = 0x1e5a000 <0.012396>
brk(0x1e9a000) = 0x1e9a000 <0.000092>

Through "grep", we can easily get the relevant context, run it several times, and find that whenever some PHP scripts are requested, there will be several time-consuming "brk", and these PHP scripts have a common feature, that is, they are very large, even hundreds of K, why are there such large PHP scripts? In fact, in order to avoid database operation, Persisting very large array variables into the PHP file through "var_export", Then, the corresponding variables are obtained through "include" in the program. Because the variables are too large, PHP has to execute "brk" frequently. Unfortunately, in the environment of this example, this operation is slow, which leads to too long request processing time. In addition, the number of PHP processes is limited, which leads to request congestion on Nginx and eventually leads to high load failure.

The following needs to verify whether the inference under 1 seems to be correct. First, query what is involved in the problem script under 1:


shell> find /path -name "*.php" | xargs grep "script.php"

Disable them all directly to see if the server can slow down. Maybe everyone thinks this is too Lumeng, but special decisions must be made under special circumstances. You can't be indecisive like a pussy. Before long, the server load returns to normal, and then count the time spent on system calls:


shell> strace -c -p $(pgrep -n php-cgi)
% time   seconds usecs/call   calls  errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.50  0.001521     11    138     2 recvfrom
 16.11  0.001000     33    30      accept
 7.86  0.000488      8    59      sendto
 7.35  0.000456      1    360      rt_sigaction
 6.73  0.000418      2    198      poll
 5.72  0.000355      1    285      stat
 4.54  0.000282      0    573      gettimeofday
 4.41  0.000274      7    42      shutdown
 4.40  0.000273      2    137      open
 3.72  0.000231      1    197      fstat
 2.93  0.000182      1    187      close
 2.56  0.000159      2    90      setitimer
 2.13  0.000132      1    244      read
 1.71  0.000106      4    30      munmap
 1.16  0.000072      1    60      chdir
 1.13  0.000070      4    18      setsockopt
 1.05  0.000065      1    100      write
 1.05  0.000065      1    64      lseek
 0.95  0.000059      1    75      uname
 0.00  0.000000      0    30      mmap
 0.00  0.000000      0    60      rt_sigprocmask
 0.00  0.000000      0     3     2 access
 0.00  0.000000      0     9      select
 0.00  0.000000      0    20      socket
 0.00  0.000000      0    20    20 connect
 0.00  0.000000      0    18      getsockopt
 0.00  0.000000      0    54      fcntl
 0.00  0.000000      0     9      mlock
 0.00  0.000000      0     9      munlock
------ ----------- ----------- --------- --------- ----------------
100.00  0.006208         3119    24 total

Obviously, "brk" is gone and replaced by "recvfrom" and "accept", but these operations are inherently time consuming, so it is possible to locate "brk" as the cause of the failure.

Embrace faults, and every fault is experience. As the saying goes, if heaven will bring down great responsibilities to Sri Lankan people, they must first suffer from their minds, strain their bones and muscles, starve their body and skin, lack their body, and do whatever they do, so they should be patient and gain what they can't do.

For more readers interested in PHP related content, please check the topics on this site: "Summary of PHP Error and Exception Handling Methods", "Summary of php String (string) Usage", "Encyclopedia of PHP Array (Array) Operation Skills", "Summary of PHP Operation and Operator Usage", "Summary of PHP Network Programming Skills", "Introduction to PHP Basic Syntax" and "Introduction to php Object-Oriented Programming"

I hope this paper is helpful to everyone's PHP programming.


Related articles: