C++: Sync is unreliable using std::atomic and std::condition_variable

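In a distributed job system written in C++11, I have implemented a fence (i.e. a thread outside the worker thread pool may ask to block until all currently scheduled jobs are done) using the following structure: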

struct fence
{
    std::atomic<size_t>                     counter;
    std::mutex                              resume_mutex;
    std::condition_variable                 resume;

    fence(size_t num_threads)
        : counter(num_threads)
    {}
};

The code implementing the fence looks like this:

void task_pool::fence_impl(void *arg)
{
    auto f = (fence *)arg;
    if (--f->counter == 0)      // (1)
        // we have zeroed this fence's counter, wake up everyone that waits
        f->resume.notify_all(); // (2)
    else
    {
        unique_lock<mutex> lock(f->resume_mutex);
        f->resume.wait(lock);   // (3)
    }
}

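This works very well if threads enter the fence spread out over a period of time. However, if they try to do so almost simultaneously, it sometimes seems to happen that between the atomic decrement (1) and the start of the wait on the condition variable (3), the thread yields the CPU and another thread decrements the counter to zero (1) and fires the condition variable (2). The former thread then waits forever in (3), because it starts waiting only after the notification has already been sent.

A hack that makes this workable is to sleep for 10 ms just before (2), but that is obviously unacceptable.

Any suggestions on how to fix this in a performant way?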

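Your diagnosis is correct: this code is prone to losing condition notifications in the way you describe. The notifying thread never takes the mutex, so even after one thread has locked the mutex, but before it starts waiting on the condition variable, another thread may call notify_all(), and the first thread misses that notification.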

A simple fix is to lock the mutex before decrementing the counter and to keep it locked while notifying:

void task_pool::fence_impl(void *arg)
{
    auto f = static_cast<fence*>(arg);
    std::unique_lock<std::mutex> lock(f->resume_mutex);
    if (--f->counter == 0) {
        f->resume.notify_all();
    }
    else do {
        f->resume.wait(lock);
    } while(f->counter);
}

In this case the counter does not have to be atomic, because every access to it happens with the mutex held.
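To make that concrete, here is a minimal self-contained sketch of the same idea with a plain size_t counter. The names simple_fence, arrive_and_wait and run_fence_demo are made up for illustration and are not part of your task_pool; the sketch also assumes the fence is used only once.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-shot fence. The counter is a plain size_t because
// every read and write of it happens while resume_mutex is held.
struct simple_fence
{
    std::size_t             counter;
    std::mutex              resume_mutex;
    std::condition_variable resume;

    explicit simple_fence(std::size_t num_threads) : counter(num_threads) {}

    void arrive_and_wait()
    {
        std::unique_lock<std::mutex> lock(resume_mutex);
        assert(counter > 0); // more arrivals than announced would be a bug
        if (--counter == 0)
            resume.notify_all(); // last thread in: wake everyone else
        else
            // Re-checks the counter after every wakeup, spurious or not.
            resume.wait(lock, [this] { return counter == 0; });
    }
};

// Hypothetical driver: returns how many threads got past the fence.
std::size_t run_fence_demo(std::size_t num_threads)
{
    simple_fence f(num_threads);
    std::size_t passed = 0;
    std::mutex passed_mutex;
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_threads; ++i)
        workers.emplace_back([&] {
            f.arrive_and_wait();
            std::lock_guard<std::mutex> guard(passed_mutex);
            ++passed;
        });
    for (auto &t : workers)
        t.join();
    return passed;
}
```

Since the decrement and the start of the wait both happen under resume_mutex, no thread can reach notify_all() in between, which is exactly the window that was open in the original code.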

An added bonus (or penalty, depending on your point of view) of locking the mutex before notifying is (from the POSIX documentation):

The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().

As for the while loop (again from the POSIX documentation):

Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
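In C++11 the same predicate re-evaluation can be written with the two-argument overload of std::condition_variable::wait, which behaves like while (!pred()) wait(lock);. Below is a small sketch under that assumption; one_shot_event and notify_before_wait_returns are hypothetical names used only for this illustration.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical one-shot event, for illustration only.
struct one_shot_event
{
    bool                    signaled = false;
    std::mutex              m;
    std::condition_variable cv;

    void set()
    {
        std::lock_guard<std::mutex> guard(m);
        signaled = true;  // change the predicate under the mutex
        cv.notify_all();  // notify while still holding the mutex
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m);
        // Equivalent to: while (!signaled) cv.wait(lock);
        // The predicate is tested before the first wait and again
        // after every wakeup, so spurious wakeups are harmless.
        cv.wait(lock, [this] { return signaled; });
    }
};

// Notifies first and waits afterwards, all on one thread.
bool notify_before_wait_returns()
{
    one_shot_event e;
    e.set();   // the notification fires while nobody is waiting
    e.wait();  // predicate is already true, so this does not block
    return e.signaled;
}
```

Because the predicate is checked before blocking, a notification that arrived early is not lost: the waiter sees the updated state and never goes to sleep.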

https://stackoverflow.com/questions/20982270/sync-is-unreliable-using-stdatomic-and-stdcondition-variable
