The differences between select, poll and epoll

  • 2020-04-02 00:44:24
  • OfStack

Linux provides the select, poll and epoll interfaces for I/O multiplexing. Their prototypes are shown below. This article compares the three in terms of parameters, implementation and performance.
 
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
 

Select, poll, epoll_wait parameters and implementation comparison
1. The first parameter of select, nfds, is the maximum descriptor value in the fd_set plus 1. fd_set is a bit array whose size is limited to FD_SETSIZE (1024).

The second, third and fourth parameters of select are fd_sets for the descriptors whose read, write and error events we care about. They are both input and output parameters: the kernel modifies them to indicate which descriptors have pending events, so the fd_sets must be reinitialized before each call to select.

The timeout parameter specifies the timeout; the kernel modifies the structure, setting it to the time remaining.
 
select corresponds to sys_select in the kernel. sys_select first copies the fd_sets pointed to by the second, third and fourth parameters into the kernel, then polls each descriptor that is set and records the outcome in a temporary result fd_set. If any event occurred, select writes the temporary result to user space and returns. If no event occurred after one pass and a timeout was specified, select sleeps until the timeout, polls again after waking, writes the temporary result to user space, and returns.

After select returns, you need to check each descriptor of interest one by one to see whether it is still set (i.e. whether an event occurred), as the sketch below illustrates.
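
As a minimal sketch of this usage pattern (listen_fd and handle_fd() are hypothetical placeholders and error handling is abbreviated), a typical select loop looks roughly like this:

#include <sys/select.h>

void handle_fd(int fd);   /* hypothetical handler, defined elsewhere */

void select_loop(int listen_fd)
{
    for (;;) {
        fd_set readfds;
        struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

        /* Both the fd_set and the timeout are modified by the kernel,
         * so they must be re-initialized before every call. */
        FD_ZERO(&readfds);
        FD_SET(listen_fd, &readfds);

        int ready = select(listen_fd + 1, &readfds, NULL, NULL, &tv);
        if (ready <= 0)
            continue;                 /* timeout or error */

        /* After return, check each descriptor of interest. */
        if (FD_ISSET(listen_fd, &readfds))
            handle_fd(listen_fd);
    }
}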

2. Unlike select, poll passes the events of interest to the kernel through a pollfd array, so there is no limit on the number of descriptors. The events and revents fields of pollfd indicate, respectively, the events of interest and the events that occurred, so the pollfd array only needs to be initialized once.

poll's implementation mechanism is similar to select's; it corresponds to sys_poll in the kernel, except that the pollfd array is passed to the kernel and each descriptor in it is polled, which is more efficient than processing an fd_set.

After poll returns, the revents field of each element in the pollfd array must be checked to see whether an event occurred. The sketch below shows the pattern.
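
A minimal poll() sketch of the same pattern (again, listen_fd and handle_fd() are hypothetical placeholders) might look like:

#include <poll.h>

void handle_fd(int fd);   /* hypothetical handler, defined elsewhere */

void poll_loop(int listen_fd)
{
    /* The pollfd array only needs to be filled in once;
     * the kernel reports results through revents. */
    struct pollfd fds[1];
    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;

    for (;;) {
        int ready = poll(fds, 1, 5000);   /* timeout in milliseconds */
        if (ready <= 0)
            continue;                     /* timeout or error */

        /* Check the revents field of every element. */
        if (fds[0].revents & POLLIN)
            handle_fd(fds[0].fd);
    }
}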

3. epoll creates a descriptor for epoll polling through epoll_create, adds/modifies/removes events of interest through epoll_ctl, and checks for events through epoll_wait, whose second parameter is used to store the results.

epoll differs from select and poll. First, it does not copy the event descriptions to the kernel on every call; after the first call, the event information is associated with the corresponding epoll descriptor. Second, epoll does not poll: it registers a callback function on each watched descriptor, and when an event occurs the callback stores it in a ready-event linked list, which is finally written to user space.

When epoll_wait returns, the buffer the events parameter points to contains exactly the events that occurred; each element can be processed directly, without the per-descriptor checks that poll and select require. A minimal sketch follows.
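
A minimal epoll sketch of the same pattern (listen_fd and handle_fd() are again hypothetical placeholders) might look like:

#include <sys/epoll.h>

void handle_fd(int fd);   /* hypothetical handler, defined elsewhere */

void epoll_loop(int listen_fd)
{
    int epfd = epoll_create(1);   /* size hint, ignored by modern kernels */

    /* Register interest once; the kernel keeps the event information,
     * so nothing is re-copied on every epoll_wait call. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, 5000);   /* timeout in ms */

        /* Only descriptors with events are returned; no scanning needed. */
        for (int i = 0; i < n; i++)
            handle_fd(events[i].data.fd);
    }
}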

Select, poll, epoll_wait performance comparison
select and poll have similar internal implementations; the performance difference lies mainly in how the parameters are passed to the kernel and in the bit operations on the fd_set. In addition, select has a hard limit on the number of descriptors and cannot handle large descriptor sets. Here we mainly look at the difference between poll and epoll performance for descriptor sets of different sizes.

The test program counts the number of poll and epoll calls completed in one second for file descriptor sets of different sizes. The results are shown in the table below, and a sketch of such a measurement loop follows the table. For poll, the number of system calls per second drops rapidly as the set grows, while epoll stays essentially constant and scales well.

Descriptor set size    poll (calls/s)    epoll (calls/s)
1                      331598            258604
10                     330648            297033
100                    91199             288784
1000                   27411             296357
5000                   5943              288671
10000                  2893              292397
25000                  1041              285905
50000                  536               293033
100000                 224               285825
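
The original test program is not shown; as a rough sketch of the kind of measurement described above, one could count how many non-blocking calls complete in one second for a set of nfds already-opened, idle descriptors (descriptor setup and the analogous epoll_wait variant are omitted):

#include <poll.h>
#include <time.h>

/* Count non-blocking poll() calls completed in roughly one second. */
long poll_calls_per_second(struct pollfd *fds, nfds_t nfds)
{
    long calls = 0;
    time_t start = time(NULL);

    while (time(NULL) - start < 1) {
        poll(fds, nfds, 0);   /* zero timeout: return immediately */
        calls++;
    }
    return calls;
}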


One, number of connections
I have also used both select and epoll in projects. What struck me most about select was its maximum descriptor limit under Linux (there seems to be no such limit under Windows): each process's select can handle at most FD_SETSIZE fds (file handles).
If you want to handle more than 1024 handles, you have to use multiple processes.
The common multi-process model with select is as follows: one process is dedicated to accept; after a connection is accepted, the fd is passed to a child process over a Unix domain socket, and the parent distributes load across the children. I have used 1 parent process + 4 child processes to carry over 4000 connections; a sketch of the fd passing appears after this paragraph.
This model worked very well in our business at the time. epoll has no limit on the number of connections, although of course the user may need to raise the process's resource limits through the appropriate API.
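
As a hedged sketch (not code from the original project) of how the fd hand-off in that model is typically done, the parent can pass an accepted descriptor to a child over a Unix domain socket with SCM_RIGHTS:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd_to_pass to the peer of the connected Unix domain socket unix_sock. */
int send_fd(int unix_sock, int fd_to_pass)
{
    char dummy = 'F';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    memset(&ctl, 0, sizeof(ctl));

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    /* The descriptor travels as ancillary data of type SCM_RIGHTS. */
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(unix_sock, &msg, 0);
}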

Two, I/O differences
1. Implementation of select
This section is best read alongside the Linux kernel source; I used 2.6.28, and other 2.6 kernels should be similar.
First, take a look at select.
The code for the select system call is in fs/select.c:

asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
            fd_set __user *exp, struct timeval __user *tvp)
{
    struct timespec end_time, *to = NULL;
    struct timeval tv;
    int ret;
    if (tvp) {
        if (copy_from_user(&tv, tvp, sizeof(tv)))
            return -EFAULT;
        to = &end_time;
        if (poll_select_set_timeout(to,
                tv.tv_sec + (tv.tv_usec / USEC_PER_SEC),
                (tv.tv_usec % USEC_PER_SEC) * NSEC_PER_USEC))
            return -EINVAL;
    }
    ret = core_sys_select(n, inp, outp, exp, to);
    ret = poll_select_copy_remaining(&end_time, tvp, 1, ret);
    return ret;
} 

The fd_sets are first copied from user space into kernel space, and the real work is then done in core_sys_select.
core_sys_select -> do_select; the real core logic is in do_select:

int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
{
    ktime_t expire, *to = NULL;
    struct poll_wqueues table;
    poll_table *wait;
    int retval, i, timed_out = 0;
    unsigned long slack = 0;
    rcu_read_lock();
    retval = max_select_fd(n, fds);
    rcu_read_unlock();
    if (retval < 0)
        return retval;
    n = retval;
    poll_initwait(&table);
    wait = &table.pt;
    if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
        wait = NULL;
        timed_out = 1;
    }
    if (end_time && !timed_out)
        slack = estimate_accuracy(end_time);
    retval = 0;
    for (;;) {
        unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
        set_current_state(TASK_INTERRUPTIBLE);
        inp = fds->in; outp = fds->out; exp = fds->ex;
        rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
        for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
            unsigned long in, out, ex, all_bits, bit = 1, mask, j;
            unsigned long res_in = 0, res_out = 0, res_ex = 0;
            const struct file_operations *f_op = NULL;
            struct file *file = NULL;
            in = *inp++; out = *outp++; ex = *exp++;
            all_bits = in | out | ex;
            if (all_bits == 0) {
                i += __NFDBITS;
                continue;
            }
            for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
                int fput_needed;
                if (i >= n)
                    break;
                if (!(bit & all_bits))
                    continue;
                file = fget_light(i, &fput_needed);
                if (file) {
                    f_op = file->f_op;
                    mask = DEFAULT_POLLMASK;
                    if (f_op && f_op->poll)
                        mask = (*f_op->poll)(file, retval ? NULL : wait);
                    fput_light(file, fput_needed);
                    if ((mask & POLLIN_SET) && (in & bit)) {
                        res_in |= bit;
                        retval++;
                    }
                    if ((mask & POLLOUT_SET) && (out & bit)) {
                        res_out |= bit;
                        retval++;
                    }
                    if ((mask & POLLEX_SET) && (ex & bit)) {
                        res_ex |= bit;
                        retval++;
                    }
                }
            }
            if (res_in)
                *rinp = res_in;
            if (res_out)
                *routp = res_out;
            if (res_ex)
                *rexp = res_ex;
            cond_resched();
        }
        wait = NULL;
        if (retval || timed_out || signal_pending(current))
            break;
        if (table.error) {
            retval = table.error;
            break;
        }
        
        if (end_time && !to) {
            expire = timespec_to_ktime(*end_time);
            to = &expire;
        }
        if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
            timed_out = 1;
    }
    __set_current_state(TASK_RUNNING);
    poll_freewait(&table);
    return retval;
} 

There is a lot of code above, but the real key is this:

mask = (*f_op->poll)(file, retval ? NULL : wait); 
This is where the file system's poll function is called. Different file systems have different poll functions; since we are looking at TCP connections here, the relevant one is the socket file system, registered in net/socket.c:
register_filesystem(&sock_fs_type); 
The socket file operations are also defined in net/socket.c:
static const struct file_operations socket_file_ops = {
    .owner =    THIS_MODULE,
    .llseek =    no_llseek,
    .aio_read =    sock_aio_read,
    .aio_write =    sock_aio_write,
    .poll =        sock_poll,
    .unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = compat_sock_ioctl,
#endif
    .mmap =        sock_mmap,
    .open =        sock_no_open,    /* special open code to disallow open via /proc */
    .release =    sock_close,
    .fasync =    sock_fasync,
    .sendpage =    sock_sendpage,
    .splice_write = generic_splice_sendpage,
    .splice_read =    sock_splice_read,
};

Following sock_poll, we eventually arrive at tcp_poll in net/ipv4/tcp.c:
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
This is the final query function.
In other words, the core of select is to call the TCP file system's poll function and keep querying; if the desired data is not there yet, it actively schedules away (to avoid burning CPU) until some connection has the desired message.
From this it can be seen that select basically works by repeatedly calling poll on each descriptor until the desired message is available. When select handles many sockets, this is a cost to the performance of the whole machine.
2. Implementation of epoll
The epoll implementation is in fs/eventpoll.c.
Since epoll involves several system calls, instead of analyzing each one, let's look at just a few key points.
The first key point is:

static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                     struct file *tfile, int fd)

This is the function called when sys_epoll_ctl adds a socket to be managed. The key lines are as follows:

epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

revents = tfile->f_op->poll(tfile, &epq.pt);

This again calls the file system's poll function, but this time the poll_table is initialized with a callback for the poll function: ep_ptable_queue_proc.
When the poll function is called, that callback is executed; it adds the current process to the socket's wait queue.

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;
    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        
        epi->nwait = -1;
    }
}  

Note that the whead parameter is actually sk->sleep, so this simply adds the current process to sk's wait queue. When the socket receives data or some other event is triggered, sock_def_readable or sock_def_write_space is called to wake up the waiting process; both functions were installed in the sk structure when the socket was created.
From the analysis above, epoll is indeed much smarter and lighter-weight than select.
