c - Swap is used when there is enough available RAM, and performance suffers

I wrote a simple program to study performance when using a large amount of RAM on Linux (64-bit Red Hat Enterprise Linux Server release 6.4). (Please ignore the memory leak.)

#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
  struct timeval time;
  if (gettimeofday(&time, NULL))
  {
    return 0;
  }
  return (double)time.tv_sec + (double)time.tv_usec * .000001;
}


int main()
{
  int *a;
  int n = 1000000000;
  do
  {
    time_t mytime = time(NULL);
    char * time_str = ctime(&mytime);
    time_str[strlen(time_str)-1] = '\0';
    printf("Current Time : %s\n", time_str);
    double start = getWallTime();
    a = new int[n];
    for (int i = 0; i < n; i++)
    {
      a[i] = 1;
    }
    double elapsed = getWallTime()-start;
    cout << elapsed << endl;
    cout << "Allocated." << endl;
  }
  while (1);

  return 0;
}

The output is

Current Time : Tue May  8 11:46:55 2018
3.73667
Allocated.
Current Time : Tue May  8 11:46:59 2018
64.5222
Allocated.
Current Time : Tue May  8 11:48:03 2018
110.419

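The output of top is shown below. Swap usage increases even though plenty of RAM is still available as cache. As a result, the run time jumps from 3 seconds to 64 seconds.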

top - 11:46:55 up 21 days,  1:14, 18 users,  load average: 1.24, 1.25, 0.95
Tasks: 819 total,   3 running, 816 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.6%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 127500344k used,  4609744k free,   262288k buffers
Swap: 10485752k total,     4112k used, 10481640k free, 45988192k cached

top - 11:47:01 up 21 days,  1:14, 18 users,  load average: 1.38, 1.27, 0.96
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  2.1%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131620156k used,   489932k free,   262288k buffers
Swap: 10485752k total,     4112k used, 10481640k free, 45844228k cached

top - 11:47:53 up 21 days,  1:15, 18 users,  load average: 1.25, 1.26, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131626300k used,   483788k free,   262276k buffers
Swap: 10485752k total,     5464k used, 10480288k free, 43056696k cached

top - 11:47:56 up 21 days,  1:15, 18 users,  load average: 1.23, 1.26, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131627568k used,   482520k free,   262276k buffers
Swap: 10485752k total,     5792k used, 10479960k free, 42949788k cached

top - 11:47:59 up 21 days,  1:15, 18 users,  load average: 1.21, 1.25, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131623080k used,   487008k free,   262276k buffers
Swap: 10485752k total,     6312k used, 10479440k free, 42840068k cached

top - 11:48:02 up 21 days,  1:15, 18 users,  load average: 1.21, 1.25, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131620016k used,   490072k free,   262276k buffers
Swap: 10485752k total,     6772k used, 10478980k free, 42729276k cached

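I have read this and this. My questions are:

1. Why does Linux sacrifice performance instead of using all of the cached RAM? Memory fragmentation? But putting data on swap surely causes fragmentation too.
2. Is there a workaround that keeps the time at a consistent 3 seconds until the physical RAM size is reached?

Thanks.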

Update 1:
Added more output from top.

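Update 2:
Following David's suggestion, looking at /proc//io shows that my program does no I/O. So David's first answer should explain this observation. Now for my second question: how can I improve performance as a non-root user (who cannot modify swappiness, etc.)?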

Update 3: Because some commands need sudo, I switched to another computer. It is a real machine (not a VM) with an Intel® Xeon® CPU E5-2680 0 @ 2.70GHz. The machine has 16 physical cores.

uname -a
2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Running osgx's modified code with more iterations gives

Iteration 451
Time to malloc: 1.81198e-05
Time to fill with data: 0.109081
Fill rate with data: 916.75 Mints/sec, 3667Mbytes/sec
Time to second write access of data: 0.049731
Access rate of data: 2010.82 Mints/sec, 8043.27Mbytes/sec
Time to third write access of data: 0.0478709
Access rate of data: 2088.95 Mints/sec, 8355.81Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 180800Mbytes
Iteration 452
Time to malloc: 1.09673e-05
Time to fill with data: 5.16316
Fill rate with data: 19.368 Mints/sec, 77.4719Mbytes/sec
Time to second write access of data: 0.0495219
Access rate of data: 2019.31 Mints/sec, 8077.23Mbytes/sec
Time to third write access of data: 0.0439548
Access rate of data: 2275.06 Mints/sec, 9100.25Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 181200Mbytes

When the slowdown happens, I do see the kernel switch from 2 MB pages to 4 KB pages.

vmstat 1 60
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 1217396 11506356 5911040 47499184    0    2    35    47    0    0 14  2 84  0  0  
 2  0 1217396 11305860 5911040 47499184    4    0     4    36 5163 3460  7  6 87  0  0  
 2  0 1217396 11112744 5911040 47499188    0    0     0     0 4326 3451  7  6 87  0  0  
 2  0 1217396 10980556 5911040 47499188    0    0     0     0 4801 3385  7  6 87  0  0  
 2  0 1217396 10845940 5911040 47499192    0    0     0    20 4650 3596  7  6 87  0  0  
 2  0 1217396 10712508 5911040 47499200    0    0     0     0 5743 3562  7  6 87  0  0  
 2  0 1217396 10583380 5911040 47499200    0    0     0    40 4531 3622  7  6 87  0  0  
 2  0 1217396 10449096 5911040 47499200    0    0     0     0 4516 3629  7  6 87  0  0  
 2  0 1217396 10187856 5911040 47499200    0    0     0     0 4499 3456  7  6 87  0  0  
 2  0 1217396 10053256 5911040 47499204    0    0     0     8 5334 3507  7  6 87  0  0  
 2  0 1217396 9921624 5911040 47499204    0    0     0     0 6310 3593  6  6 87  0  0   
 2  0 1217396 9788532 5911040 47499208    0    0     0    44 5794 3516  7  6 87  0  0   
 2  0 1217396 9660516 5911040 47499208    0    0     0     0 4894 3535  7  6 87  0  0   
 2  0 1217396 9527552 5911040 47499212    0    0     0     0 4686 3570  7  6 87  0  0   
 2  0 1217396 9396536 5911040 47499212    0    0     0     0 4805 3538  7  6 87  0  0   
 2  0 1217396 9238664 5911040 47499212    0    0     0     0 5940 3459  7  6 87  0  0   
 2  0 1217396 9000136 5911040 47499216    0    0     0    32 5239 3333  7  6 87  0  0   
 2  0 1217396 8861132 5911040 47499220    0    0     0     0 5579 3351  7  6 87  0  0   
 2  0 1217396 8733688 5911040 47499220    0    0     0     0 4910 3199  7  6 87  0  0   
 2  0 1217396 8596600 5911040 47499224    0    0     0    44 5075 3453  7  6 87  0  0   
 2  0 1217396 8338468 5911040 47499232    0    0     0     0 5328 3444  7  6 87  0  0   
 2  0 1217396 8207732 5911040 47499232    0    0     0    52 5474 3370  7  6 87  0  0   
 2  0 1217396 8071212 5911040 47499236    0    0     0     0 5442 3419  7  6 87  0  0   
 2  0 1217396 7807736 5911040 47499236    0    0     0     0 6139 3456  7  6 87  0  0   
 2  0 1217396 7676080 5911044 47499232    0    0     0    16 4533 3430  6  6 87  0  0   
 2  0 1217396 7545728 5911044 47499236    0    0     0     0 6712 3957  7  6 87  0  0   
 4  0 1217396 7412444 5911044 47499240    0    0     0    68 6110 3547  7  6 87  0  0   
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 1217396 7280148 5911048 47499244    0    0     0    68 6140 3516  7  7 86  0  0   
 2  0 1217396 7147836 5911048 47499244    0    0     0     0 4434 3400  7  6 87  0  0   
 2  0 1217396 6886980 5911048 47499248    0    0     0    16 7354 3393  7  6 87  0  0   
 2  0 1217396 6752868 5911048 47499248    0    0     0     0 5286 3573  7  6 87  0  0   
 2  0 1217396 6621772 5911048 47499248    0    0     0     0 5353 3410  7  6 87  0  0   
 2  0 1217396 6489760 5911048 47499252    0    0     0    48 5172 3454  7  6 87  0  0   
 2  0 1217396 6248732 5911048 47499256    0    0     0     0 5266 3411  7  6 87  0  0   
 2  0 1217396 6092804 5911048 47499260    0    0     0     4 6345 3473  7  6 87  0  0   
 2  0 1217396 5962544 5911048 47499260    0    0     0     0 7399 3712  7  6 87  0  0   
 2  0 1217396 5828492 5911048 47499264    0    0     0     0 5804 3516  7  6 87  0  0   
 2  0 1217396 5566720 5911048 47499264    0    0     0    44 5800 3370  7  6 87  0  0   
 2  0 1217396 5434204 5911048 47499264    0    0     0     0 6716 3446  7  6 87  0  0   
 2  0 1217396 5240724 5911048 47499268    0    0     0    68 3948 3346  7  6 87  0  0   
 2  0 1217396 5051688 5911008 47484936    0    0     0     0 4743 3734  7  6 87  0  0   
 2  0 1217396 4925680 5910500 47478444    0    0   136     0 5978 3779  7  6 87  0  0   
 2  0 1217396 4801744 5908552 47471820    0    0     0    32 4573 3237  7  6 87  0  0   
 2  0 1217396 4675772 5908552 47463984    0    0     0     0 6594 3276  7  6 87  0  0   
 2  0 1217396 4486472 5908444 47455736    0    0     0     4 6096 3256  7  6 87  0  0   
 2  0 1217396 4299908 5908392 47446964    0    0     0     0 5569 3525  7  6 87  0  0   
 2  0 1217396 4175444 5906884 47440024    0    0     0     0 4975 3141  7  6 87  0  0   
 2  0 1217396 4063472 5905976 47423860    0    0     0    56 6255 3147  6  6 87  0  0   
 2  0 1217396 3939816 5905796 47415596    0    0     0     0 5396 3143  7  6 87  0  0   
 2  0 1217396 3686540 5905796 47407152    0    0     0    44 6471 3201  7  6 87  0  0   
 2  0 1217396 3557596 5905796 47398892    0    0     0     0 7581 3727  7  6 87  0  0   
 2  0 1217396 3445536 5905796 47381812    0    0     0     0 5560 3222  7  6 87  0  0   
 2  0 1217396 3250272 5905796 47373364    0    0     0    60 5594 3343  7  6 87  0  0   
 2  0 1217396 3065232 5903744 47367156    0    0     0     0 5595 3182  7  6 87  0  0   
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 1217396 2951704 5903028 47350792    0    0     0    12 5210 3262  7  6 87  0  0   
 2  0 1217396 2829228 5902928 47342444    0    0     0     0 5724 3758  7  6 87  0  0   
 2  0 1217396 2575248 5902580 47334472    0    0     0     0 4377 3369  7  6 87  0  0   
 2  0 1217396 2527996 5897796 47322436    0    0     0    60 5550 3570  7  6 87  0  0   
 2  0 1217396 2398672 5893572 47322324    0    0     0     0 5603 3225  7  6 87  0  0   
 2  0 1217396 2272536 5889364 47322228    0    0     0    16 6924 3310  7  6 87  0  0   

iostat -xyz 1 60
Linux 2.6.32-642.4.2.el6.x86_64     05/09/2018  _x86_64_    (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.64    0.00    6.26    0.00    0.00   87.10

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.00    0.06    5.69    0.00    0.00   87.24

I managed to run "sudo perf top" and saw this on the top line while the slowdown was happening.

16.84%  [kernel]                                      [k] compaction_alloc

Output from top. Several other processes are also running (not shown).

Tasks: 799 total,   5 running, 787 sleeping,   4 stopped,   3 zombie
Cpu(s): 23.1%us, 16.7%sy,  0.0%ni, 60.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264503640k total, 256749480k used,  7754160k free,  5830508k buffers
Swap: 409259004k total,  1217112k used, 408041892k free, 50458600k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                   
23559 toddwz   20   0  165g 164g 1204 R 93.0 65.4   2:05.51 a.out                                                     

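Update 4
After turning THP off, I see the following. The fill rate stays at about 550 Mints/sec (about 900 Mints/sec with THP enabled) until my program uses 240 GB of RAM, at which point cached RAM is below 1 GB. Then swapping starts and the fill rate drops.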

Iteration 610
Time to malloc: 1.3113e-05
Time to fill with data: 0.181151
Fill rate with data: 552.025 Mints/sec, 2208.1Mbytes/sec
Time to second write access of data: 0.04074
Access rate of data: 2454.59 Mints/sec, 9818.36Mbytes/sec
Time to third write access of data: 0.0420492
Access rate of data: 2378.17 Mints/sec, 9512.67Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244400Mbytes
Iteration 611
Time to malloc: 1.88351e-05
Time to fill with data: 0.306215
Fill rate with data: 326.568 Mints/sec, 1306.27Mbytes/sec
Time to second write access of data: 0.045784
Access rate of data: 2184.17 Mints/sec, 8736.68Mbytes/sec
Time to third write access of data: 0.0441492
Access rate of data: 2265.05 Mints/sec, 9060.19Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244800Mbytes
Iteration 612
Time to malloc: 2.21729e-05
Time to fill with data: 1.33305
Fill rate with data: 75.016 Mints/sec, 300.064Mbytes/sec
Time to second write access of data: 0.048573
Access rate of data: 2058.76 Mints/sec, 8235.02Mbytes/sec
Time to third write access of data: 0.0495481
Access rate of data: 2018.24 Mints/sec, 8072.96Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 245200Mbytes

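Conclusion
With transparent huge pages (THP) turned off, the program's behavior is more transparent to me, so I will keep THP off. For my particular program, the cause was THP, not swap. Thanks for all the help.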

Best answer
Because of THP, the first iterations of the test are probably using huge pages (2 MB pages): Transparent Hugepage Support - https://www.kernel.org/doc/Documentation/vm/transhuge.txt
While the test is running, check /sys/kernel/mm/transparent_hugepage/enabled and grep AnonHugePages /proc/meminfo.

The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it’s not
of significant interest because it’ll also have the downside of
requiring larger clear-page copy-page in page faults which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland (so
reducing the enter/exit kernel frequency by a 512 times factor). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping.

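A large allocation made with new or malloc is served by a single mmap syscall, which usually does not "populate" the virtual memory with physical pages. See the MAP_POPULATE entry in man mmap: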

   MAP_POPULATE (since Linux 2.5.46)
          Populate (prefault) page tables for a mapping. ... This will help
          to reduce blocking on page faults later.

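Without MAP_POPULATE, mmap only registers the memory as virtual memory, with write access prohibited in the page table. When your test first writes to any page of that memory, a page-fault exception is generated and handled by the OS kernel. The Linux kernel allocates some physical memory and maps the virtual page to a physical page (populates the page). If THP is enabled (it usually is), the kernel may allocate a single 2 MB huge page, provided it has some free huge physical pages. If no huge pages are available, the kernel allocates 4 KB pages. So without huge pages you get 512 times more page faults (this can be checked while the test runs by running vmstat 1 180 in another console, or with perf stat -I 1000).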

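Later accesses to populated pages do not cause page faults, so you can extend your test with a second (and third) for (i in 0..N-1) a[i] = 1; loop and measure the time of each loop separately.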

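Your results still sound strange. Is your system real or virtualized? A hypervisor may support 2 MB pages as well, and a virtualized system may have a much higher cost for memory allocation and exception handling.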

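On a PC with less memory, I see about a 10% slowdown when page-fault handling switches from huge-page allocation to 4 KB page allocation (check the page-faults lines from perf stat: only about 2,000 page faults per second with 2 MB pages, and more than 200,000 page faults per second with 4 KB pages):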

$cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 8.10623e-06
Time to fill with data: 0.364378
Fill rate with data: 274.44 Mints/sec, 1097.76Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.90735e-05
Time to fill with data: 0.357983
Fill rate with data: 279.343 Mints/sec, 1117.37Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 1.69277e-05
#           time             counts unit events
     1.000414902         999.893040      task-clock (msec)
     1.000414902                  1      context-switches          #    0.001 K/sec
     1.000414902                  0      cpu-migrations            #    0.000 K/sec
     1.000414902              2,024      page-faults               #    0.002 M/sec
     1.000414902      2,664,963,857      cycles                    #    2.665 GHz
     1.000414902      3,072,781,834      instructions              #    1.15  insn per cycle
     1.000414902        559,551,437      branches                  #  559.611 M/sec
     1.000414902             25,176      branch-misses             #    0.00% of all branches
Time to fill with data: 0.357014
Fill rate with data: 280.101 Mints/sec, 1120.4Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 1.71661e-05
Time to fill with data: 0.358964
Fill rate with data: 278.579 Mints/sec, 1114.32Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.69277e-05
Time to fill with data: 0.356918
Fill rate with data: 280.177 Mints/sec, 1120.71Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.50204e-05
     2.000779126        1000.703872      task-clock (msec)
     2.000779126                  1      context-switches          #    0.001 K/sec
     2.000779126                  0      cpu-migrations            #    0.000 K/sec
     2.000779126              2,280      page-faults               #    0.002 M/sec
     2.000779126      2,686,072,244      cycles                    #    2.685 GHz
     2.000779126      3,094,777,285      instructions              #    1.16  insn per cycle
     2.000779126        563,593,105      branches                  #  563.425 M/sec
     2.000779126              9,661      branch-misses             #    0.00% of all branches
Time to fill with data: 0.371785
Fill rate with data: 268.973 Mints/sec, 1075.89Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.90735e-05
Time to fill with data: 0.418562
Fill rate with data: 238.913 Mints/sec, 955.653Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 2.09808e-05
     3.001146481        1000.436128      task-clock (msec)
     3.001146481                  1      context-switches          #    0.001 K/sec
     3.001146481                  0      cpu-migrations            #    0.000 K/sec
     3.001146481            217,415      page-faults               #    0.217 M/sec
     3.001146481      2,687,783,783      cycles                    #    2.687 GHz
     3.001146481      3,100,713,038      instructions              #    1.16  insn per cycle
     3.001146481        560,207,049      branches                  #  560.014 M/sec
     3.001146481             83,230      branch-misses             #    0.01% of all branches
Time to fill with data: 0.416297
Fill rate with data: 240.213 Mints/sec, 960.853Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.38283e-05
Time to fill with data: 0.41672
Fill rate with data: 239.969 Mints/sec, 959.877Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.40667e-05
Time to fill with data: 0.424997
Fill rate with data: 235.296 Mints/sec, 941.183Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.28746e-05
     4.001467773        1000.378604      task-clock (msec)
     4.001467773                  2      context-switches          #    0.002 K/sec
     4.001467773                  0      cpu-migrations            #    0.000 K/sec
     4.001467773            232,690      page-faults               #    0.233 M/sec
     4.001467773      2,655,313,682      cycles                    #    2.654 GHz
     4.001467773      3,087,157,016      instructions              #    1.15  insn per cycle
     4.001467773        557,266,313      branches                  #  557.070 M/sec
     4.001467773             95,433      branch-misses             #    0.02% of all branches
Time to fill with data: 0.413271
Fill rate with data: 241.972 Mints/sec, 967.888Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 1.21593e-05
Time to fill with data: 0.414624
Fill rate with data: 241.182 Mints/sec, 964.73Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.5974e-05
     5.001792272        1000.372602      task-clock (msec)
     5.001792272                  2      context-switches          #    0.002 K/sec
     5.001792272                  0      cpu-migrations            #    0.000 K/sec
     5.001792272            236,260      page-faults               #    0.236 M/sec
     5.001792272      2,687,340,230      cycles                    #    2.686 GHz
     5.001792272      3,134,864,968      instructions              #    1.17  insn per cycle
     5.001792272        565,846,287      branches                  #  565.644 M/sec
     5.001792272            104,634      branch-misses             #    0.02% of all branches
Time to fill with data: 0.412331
Fill rate with data: 242.524 Mints/sec, 970.094Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.3113e-05
Time to fill with data: 0.414433
Fill rate with data: 241.294 Mints/sec, 965.174Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.88351e-05
Time to fill with data: 0.417277
Fill rate with data: 239.649 Mints/sec, 958.596Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
     6.002129544        1000.404270      task-clock (msec)
     6.002129544                  1      context-switches          #    0.001 K/sec
     6.002129544                  0      cpu-migrations            #    0.000 K/sec
     6.002129544            215,269      page-faults               #    0.215 M/sec
     6.002129544      2,676,269,667      cycles                    #    2.675 GHz
     6.002129544      3,286,469,282      instructions              #    1.23  insn per cycle
     6.002129544        578,367,266      branches                  #  578.156 M/sec
     6.002129544            345,470      branch-misses             #    0.06% of all branches
    ....

After disabling THP with a root command (https://access.redhat.com/solutions/46111), I consistently get about 200,000 page faults per second and about 950 MB/s:

$cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.422322
Fill rate with data: 236.786 Mints/sec, 947.145Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.50204e-05
Time to fill with data: 0.415068
Fill rate with data: 240.924 Mints/sec, 963.698Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.19345e-05
#           time             counts unit events
     1.000162191         999.429856      task-clock (msec)
     1.000162191                 14      context-switches          #    0.014 K/sec
     1.000162191                  0      cpu-migrations            #    0.000 K/sec
     1.000162191            232,727      page-faults               #    0.233 M/sec
     1.000162191      2,664,896,604      cycles                    #    2.666 GHz
     1.000162191      3,080,713,267      instructions              #    1.16  insn per cycle
     1.000162191        555,116,838      branches                  #  555.434 M/sec
     1.000162191            102,262      branch-misses             #    0.02% of all branches
Time to fill with data: 0.440695
Fill rate with data: 226.914 Mints/sec, 907.658Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.09808e-05
Time to fill with data: 0.414463
Fill rate with data: 241.276 Mints/sec, 965.104Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.81198e-05
     2.000544564        1000.142465      task-clock (msec)
     2.000544564                 16      context-switches          #    0.016 K/sec
     2.000544564                  0      cpu-migrations            #    0.000 K/sec
     2.000544564            229,697      page-faults               #    0.230 M/sec
     2.000544564      2,621,180,984      cycles                    #    2.622 GHz
     2.000544564      3,041,358,811      instructions              #    1.15  insn per cycle
     2.000544564        547,910,242      branches                  #  548.027 M/sec
     2.000544564             93,682      branch-misses             #    0.02% of all branches
Time to fill with data: 0.428383
Fill rate with data: 233.436 Mints/sec, 933.744Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.5974e-05
Time to fill with data: 0.421986
Fill rate with data: 236.975 Mints/sec, 947.899Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.5974e-05
Time to fill with data: 0.413477
Fill rate with data: 241.851 Mints/sec, 967.406Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 1.88351e-05
     3.000866438         999.980461      task-clock (msec)
     3.000866438                 20      context-switches          #    0.020 K/sec
     3.000866438                  0      cpu-migrations            #    0.000 K/sec
     3.000866438            231,194      page-faults               #    0.231 M/sec
     3.000866438      2,622,484,960      cycles                    #    2.623 GHz
     3.000866438      3,061,610,229      instructions              #    1.16  insn per cycle
     3.000866438        551,533,361      branches                  #  551.616 M/sec
     3.000866438            104,561      branch-misses             #    0.02% of all branches
Time to fill with data: 0.448333
Fill rate with data: 223.048 Mints/sec, 892.194Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.50204e-05
Time to fill with data: 0.410566
Fill rate with data: 243.566 Mints/sec, 974.265Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.3113e-05
     4.001231042        1000.098860      task-clock (msec)
     4.001231042                 17      context-switches          #    0.017 K/sec
     4.001231042                  0      cpu-migrations            #    0.000 K/sec
     4.001231042            228,532      page-faults               #    0.229 M/sec
     4.001231042      2,586,146,024      cycles                    #    2.586 GHz
     4.001231042      3,026,679,955      instructions              #    1.15  insn per cycle
     4.001231042        545,236,541      branches                  #  545.284 M/sec
     4.001231042            115,251      branch-misses             #    0.02% of all branches
Time to fill with data: 0.441442
Fill rate with data: 226.53 Mints/sec, 906.121Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.5974e-05
Time to fill with data: 0.42898
Fill rate with data: 233.111 Mints/sec, 932.445Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.00272e-05
     5.001547227         999.982415      task-clock (msec)
     5.001547227                 19      context-switches          #    0.019 K/sec
     5.001547227                  0      cpu-migrations            #    0.000 K/sec
     5.001547227            225,796      page-faults               #    0.226 M/sec
     5.001547227      2,560,990,918      cycles                    #    2.561 GHz
     5.001547227      3,005,384,743      instructions              #    1.15  insn per cycle
     5.001547227        542,275,580      branches                  #  542.315 M/sec
     5.001547227            116,537      branch-misses             #    0.02% of all branches
Time to fill with data: 0.414212
Fill rate with data: 241.422 Mints/sec, 965.689Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.69277e-05
Time to fill with data: 0.411084
Fill rate with data: 243.259 Mints/sec, 973.037Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.40667e-05
Time to fill with data: 0.413644
Fill rate with data: 241.754 Mints/sec, 967.015Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.28746e-05
     6.001849796         999.913923      task-clock (msec)
     6.001849796                 18      context-switches          #    0.018 K/sec
     6.001849796                  0      cpu-migrations            #    0.000 K/sec
     6.001849796            236,912      page-faults               #    0.237 M/sec
     6.001849796      2,685,445,660      cycles                    #    2.686 GHz
     6.001849796      3,153,464,551      instructions              #    1.20  insn per cycle
     6.001849796        568,989,467      branches                  #  569.032 M/sec
     6.001849796            125,943      branch-misses             #    0.02% of all branches
Time to fill with data: 0.444891
Fill rate with data: 224.774 Mints/sec, 899.097Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes

The test as modified for perf stat, with rate printing and a limited number of iterations:

$cat test.c; g++ test.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
  struct timeval time;
  if (gettimeofday(&time, NULL))
  {
    return 0;
  }
  return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
  int *a;
  int n = 100000000;
  int j;
  double total = 0;
  for(j=0; j<15; j++)
  {
    cout << "Iteration " << j << endl;
    double start = getWallTime();
    a = new int[n];
    cout << "Time to malloc: " << getWallTime() - start << endl;
    for (int i = 0; i < n; i++)
    {
      a[i] = 1;
    }
    double elapsed = getWallTime()-start;
    cout << "Time to fill with data: " << elapsed << endl;
    cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;
    total += n*sizeof(int)*1./M;
    cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
  }

  return 0;
}

The test modified to add second and third write accesses:

$g++ second.c -o second
$cat second.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
  struct timeval time;
  if (gettimeofday(&time, NULL))
  {
    return 0;
  }
  return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
  int *a;
  int n = 100000000;
  int j;
  double total = 0;
  for(j=0; j<15; j++)
  {
    cout << "Iteration " << j << endl;
    double start = getWallTime();
    a = new int[n];
    cout << "Time to malloc: " << getWallTime() - start << endl;
    for (int i = 0; i < n; i++)
    {
      a[i] = 1;
    }
    double elapsed = getWallTime()-start;
    cout << "Time to fill with data: " << elapsed << endl;
    cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;


    start = getWallTime();
    for (int i = 0; i < n; i++)
    {
      a[i] = 2;
    }
    elapsed = getWallTime()-start;
    cout << "Time to second write access of data: " << elapsed << endl;
    cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;

    start = getWallTime();
    for (int i = 0; i < n; i++)
    {
      a[i] = 3;
    }
    elapsed = getWallTime()-start;
    cout << "Time to third write access of data: " << elapsed << endl;
    cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;


    total += n*sizeof(int)*1./M;
    cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
  }

  return 0;
}

Without THP, the second and third accesses run at about 1.25 GB/s:

$cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$./second
Iteration 0
Time to malloc: 9.05991e-06
Time to fill with data: 0.426387
Fill rate with data: 234.529 Mints/sec, 938.115Mbytes/sec
Time to second write access of data: 0.318292
Access rate of data: 314.177 Mints/sec, 1256.71Mbytes/sec
Time to third write access of data: 0.321722
Access rate of data: 310.827 Mints/sec, 1243.31Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 3.50475e-05
Time to fill with data: 0.411859
Fill rate with data: 242.802 Mints/sec, 971.206Mbytes/sec
Time to second write access of data: 0.317989
Access rate of data: 314.476 Mints/sec, 1257.91Mbytes/sec
Time to third write access of data: 0.321637
Access rate of data: 310.91 Mints/sec, 1243.64Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.81334e-05
Time to fill with data: 0.411918
Fill rate with data: 242.767 Mints/sec, 971.067Mbytes/sec
Time to second write access of data: 0.318647
Access rate of data: 313.827 Mints/sec, 1255.31Mbytes/sec
Time to third write access of data: 0.321041
Access rate of data: 311.487 Mints/sec, 1245.95Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.5034e-05
Time to fill with data: 0.411138
Fill rate with data: 243.227 Mints/sec, 972.909Mbytes/sec
Time to second write access of data: 0.318429
Access rate of data: 314.042 Mints/sec, 1256.17Mbytes/sec
Time to third write access of data: 0.321332
Access rate of data: 311.205 Mints/sec, 1244.82Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 3.71933e-05
Time to fill with data: 0.410922
Fill rate with data: 243.355 Mints/sec, 973.421Mbytes/sec
Time to second write access of data: 0.320262
Access rate of data: 312.244 Mints/sec, 1248.98Mbytes/sec
Time to third write access of data: 0.319223
Access rate of data: 313.261 Mints/sec, 1253.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 2.19345e-05
Time to fill with data: 0.418508
Fill rate with data: 238.944 Mints/sec, 955.777Mbytes/sec
Time to second write access of data: 0.320419
Access rate of data: 312.092 Mints/sec, 1248.37Mbytes/sec
Time to third write access of data: 0.319752
Access rate of data: 312.742 Mints/sec, 1250.97Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 3.19481e-05
Time to fill with data: 0.410054
Fill rate with data: 243.87 Mints/sec, 975.481Mbytes/sec
Time to second write access of data: 0.320244
Access rate of data: 312.262 Mints/sec, 1249.05Mbytes/sec
Time to third write access of data: 0.319546
Access rate of data: 312.944 Mints/sec, 1251.78Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 3.19481e-05
Time to fill with data: 0.409491
Fill rate with data: 244.206 Mints/sec, 976.822Mbytes/sec
Time to second write access of data: 0.318501
Access rate of data: 313.971 Mints/sec, 1255.88Mbytes/sec
Time to third write access of data: 0.320052
Access rate of data: 312.449 Mints/sec, 1249.8Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 2.5034e-05
Time to fill with data: 0.409922
Fill rate with data: 243.949 Mints/sec, 975.795Mbytes/sec
Time to second write access of data: 0.320583
Access rate of data: 311.932 Mints/sec, 1247.73Mbytes/sec
Time to third write access of data: 0.319478
Access rate of data: 313.011 Mints/sec, 1252.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 2.69413e-05
Time to fill with data: 0.41104
Fill rate with data: 243.285 Mints/sec, 973.141Mbytes/sec
Time to second write access of data: 0.320389
Access rate of data: 312.121 Mints/sec, 1248.48Mbytes/sec
Time to third write access of data: 0.319762
Access rate of data: 312.733 Mints/sec, 1250.93Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 2.59876e-05
Time to fill with data: 0.412612
Fill rate with data: 242.358 Mints/sec, 969.434Mbytes/sec
Time to second write access of data: 0.318304
Access rate of data: 314.165 Mints/sec, 1256.66Mbytes/sec
Time to third write access of data: 0.319453
Access rate of data: 313.035 Mints/sec, 1252.14Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.98023e-05
Time to fill with data: 0.412428
Fill rate with data: 242.467 Mints/sec, 969.866Mbytes/sec
Time to second write access of data: 0.318467
Access rate of data: 314.004 Mints/sec, 1256.02Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 2.69413e-05
Time to fill with data: 0.410515
Fill rate with data: 243.597 Mints/sec, 974.386Mbytes/sec
Time to second write access of data: 0.31832
Access rate of data: 314.149 Mints/sec, 1256.6Mbytes/sec
Time to third write access of data: 0.319569
Access rate of data: 312.921 Mints/sec, 1251.69Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 2.28882e-05
Time to fill with data: 0.412385
Fill rate with data: 242.492 Mints/sec, 969.967Mbytes/sec
Time to second write access of data: 0.318929
Access rate of data: 313.549 Mints/sec, 1254.2Mbytes/sec
Time to third write access of data: 0.31949
Access rate of data: 312.999 Mints/sec, 1252Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 2.90871e-05
Time to fill with data: 0.41235
Fill rate with data: 242.512 Mints/sec, 970.05Mbytes/sec
Time to second write access of data: 0.340456
Access rate of data: 293.724 Mints/sec, 1174.89Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes

With THP, allocation is faster, but the second and third write passes run at the same speed:

$cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$./second
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.365043
Fill rate with data: 273.94 Mints/sec, 1095.76Mbytes/sec
Time to second write access of data: 0.320503
Access rate of data: 312.01 Mints/sec, 1248.04Mbytes/sec
Time to third write access of data: 0.319442
Access rate of data: 313.046 Mints/sec, 1252.18Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
...
Iteration 14
Time to malloc: 2.7895e-05
Time to fill with data: 0.409294
Fill rate with data: 244.323 Mints/sec, 977.293Mbytes/sec
Time to second write access of data: 0.318422
Access rate of data: 314.049 Mints/sec, 1256.19Mbytes/sec
Time to third write access of data: 0.322098
Access rate of data: 310.465 Mints/sec, 1241.86Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
