最快的Matlab文件读取?

我的MATLAB程序正在读一个大约7m行长的文件,浪费太多的时间在I / O。我知道每行格式化为两个整数,但我不知道他们占用了多少字符。 str2num是死亡慢,我应该使用什么matlab函数?

抓住:我必须在每行一个操作,而不存储整个文件存储器,所以没有读取整个矩阵的命令在表上。

fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);    
    %do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);
问题陈述

这是一个共同的斗争,没有什么像测试来回答。这里是我的假设:

>一个格式正确的ASCII文件,包含两列数字。没有标题,没有不一致的行等。
>该方法必须扩展以读取太大而不能包含在内存中的文件(虽然我的耐心有限,因此我的测试文件只有500,000行)。
>实际操作(OP调用“用nums做的东西”)必须一次执行一行,不能被矢量化。

讨论

考虑到这一点,答案和意见似乎在三个方面是令人鼓舞的效率:

>以较大批量读取文件
>更有效地执行字符串到数字转换(通过批处理或使用更好的功能)
>使实际处理更有效率(我已经通过上面的规则3排除)。

结果

我放在一起一个快速脚本来测试6个变体对这些主题的摄取速度(和结果的一致性)。结果是:

>初始代码。 68.23秒。 582582检查
>使用sscanf,每行一次。 27.20秒。 582582检查
>大批量使用fscanf。 8.93秒。 582582检查
>大批量使用文本扫描。 8.79秒。 582582检查
>将大批量读入内存,然后读入sscanf。 8.15秒。 582582检查
>在单行上使用java单行文件读取器和sscanf。 63.56秒。 582582检查
>使用java单项标记扫描器。 81.19秒。 582582检查
>完全批处理操作(不符合)。 1.02秒。 508680检查(违反规则3)

概要

在str2num调用中的低效率消耗了原始时间的一半以上(68→27秒),这可以通过切换sscanf来消除。

可以通过对文件读取和字符串到数字转换使用较大批次来减少约另外2/3的剩余时间(27-> 8秒)。

如果我们愿意违反原来的第三条规则,则可以通过切换到全数字处理来减少另外7/8的时间。然而,一些算法不适合这一点,所以我们把它孤独。 (不是“检查”值不匹配最后一个条目。)

最后,在这个响应中,与我的前一个编辑的直接矛盾,通过切换可用的缓存的Java,单线读取器没有节省。事实上,该解决方案比使用本机读取器的可比较的单行结果慢2-3倍。 (63比27秒)。

以下包括上述所有解决方案的示例代码。

示例代码

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
    d = randi(1e6, 1e5,2);
    fprintf(fid, '%d, %d \n',d);
end
fclose(fid);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
    for ix = 1:size(scannedData{1},1)
        nums = [scannedData{1}(ix) scannedData{2}(ix)];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
    nums = [scanner.nextInt() scanner.nextInt()];
    CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    CHECK = round((CHECK + mean(scannedData(:)) ) /2);

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);

(原答案)

要扩展Ben的观点…如果你一行一行地读这些文件,你的瓶颈总是文件I / O。

我理解有时你​​不能将整个文件放入内存。我通常读入大量的字符(1e5,1e6或大约,取决于您的系统的内存)。然后我读取额外的单个字符(或回退单个字符)获得一个圆的行数,然后运行你的字符串解析(例如sscanf)。

然后,如果你想要你可以一次处理一行一个结果的大矩阵,然后重复该过程,直到你读取文件的结尾。

这有点乏味,但不是那么难。我通常看到单线读取器的速度提高90%以上。

(可怕的想法使用Java批量行读取器在耻辱中删除)

http://stackoverflow.com/questions/9440592/fastest-matlab-file-reading

本站文章除注明转载外,均为本站原创或编译
转载请明显位置注明出处:最快的Matlab文件读取?