perl - Benchmarking utf8 file read - explanation of the differences


I have this code:

    #!/usr/bin/env perl

    use 5.016;
    use warnings;
    use autodie;
    use Path::Tiny;
    use Encode;
    use Benchmark qw(:all);

    my $cnt = 10_000;
    my $utf = 'utf8.txt';

    my $res = timethese($cnt, {
        # strict UTF-8 decoding at the IO-layer level
        'open-utf-8' => sub {
            open my $fhu, '<:encoding(UTF-8)', $utf;
            my $stru = do { local $/; <$fhu> };
            close $fhu;
        },
        # relaxed utf8 flag at the IO-layer level
        'open-utf8' => sub {
            open my $fhu, '<:utf8', $utf;
            my $stru = do { local $/; <$fhu> };
            close $fhu;
        },
        # read raw octets, then decode explicitly (relaxed)
        'decode-utf8' => sub {
            open my $fhu, '<', $utf;
            my $stru = decode('utf8', do { local $/; <$fhu> });
            close $fhu;
        },
        # read raw octets, then decode explicitly (strict)
        'decode-utf-8' => sub {
            open my $fhu, '<', $utf;
            my $stru = decode('UTF-8', do { local $/; <$fhu> });
            close $fhu;
        },
        'ptiny' => sub {
            my $stru = path($utf)->slurp_utf8;
        },
    });
    cmpthese $res;

The utf8.txt file (approx. 175 kB) contains 1000 lines of UTF-8 encoded and ASCII characters, like:

9áäčďéěíĺľňóôöőŕřšťúůüűýž ÁÄČĎÉĚÍĹĽŇÓÔÖŐŔŘŠŤÚŮÜŰÝŽ aáäbcčdďeéěfghiíjkľĺmnňoóôöőpqrŕřsštťuúůüűvwxyýzž 

Running the above on my notebook gives:

    Benchmark: timing 10000 iterations of decode-utf-8, decode-utf8, open-utf-8, open-utf8, ptiny...
    decode-utf-8: 47 wallclock secs (46.83 usr +  0.87 sys = 47.70 CPU) @ 209.64/s (n=10000)
     decode-utf8: 48 wallclock secs (46.62 usr +  0.90 sys = 47.52 CPU) @ 210.44/s (n=10000)
      open-utf-8: 60 wallclock secs (57.82 usr +  1.20 sys = 59.02 CPU) @ 169.43/s (n=10000)
       open-utf8:  7 wallclock secs ( 6.57 usr +  0.70 sys =  7.27 CPU) @ 1375.52/s (n=10000)
           ptiny:  7 wallclock secs ( 5.98 usr +  0.52 sys =  6.50 CPU) @ 1538.46/s (n=10000)

                   Rate  open-utf-8 decode-utf-8 decode-utf8   open-utf8       ptiny
    open-utf-8    169/s          --         -19%        -19%        -88%        -89%
    decode-utf-8  210/s         24%           --         -0%        -85%        -86%
    decode-utf8   210/s         24%           0%          --        -85%        -86%
    open-utf8    1376/s        712%         556%        554%          --        -11%
    ptiny        1538/s        808%         634%        631%         12%          --

This surprises me. Questions:

  • First: is something wrong in the above code?

If it is OK,

  • Why is there such a huge difference between the explicit UTF-8 and the relaxed utf8 at the IO-layer level (<:utf8 vs. <:encoding(UTF-8))? And,
  • why is the difference not as big between decode('UTF-8', ...) and decode('utf8', ...)?
  • Why is the lazy IO-layer level decode (<:utf8) much, much faster than the explicit lazy decode('utf8', ...)?
  • And what is the "danger" of using the relaxed (fast) 'utf8' vs. the exact (slow) 'UTF-8'?

  • And finally, not a question: I must check the Path::Tiny code to see how it manages to be the fastest...

Env:

  • Perl v5.22.0 - perlbrew (threaded)
  • OS X - Darwin Kernel Version 14.4.0 (Yosemite)
  • The notebook is old - MacBook Pro (13-inch, Mid 2010) - Core 2 Duo, 2.4 GHz, 8 GB, slow HDD

:utf8

The PerlIO :utf8 layer is a pseudo layer: it is a flag on the PerlIO handle that each op has to detect. The behavior varies depending on the op used:

read(), sysread() and recv():

The implementation performs no validation of the UTF8 sequences. It only checks the prefix octet of each UTF8 sequence in order to count the number of UTF8 sequences read.
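To illustrate what "only checks the prefix octet" means, here is a minimal sketch in plain Perl. It is not perl's actual C implementation; count_sequences is a hypothetical helper that counts every octet outside the continuation range 0x80..0xBF as the start of a sequence, and it never looks at whether the sequence is complete.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: count UTF-8 sequences by looking only at lead octets.
# Continuation octets are 0x80..0xBF; every other octet starts a sequence.
sub count_sequences {
    my ($octets) = @_;
    return ($octets =~ tr/\x80-\xBF//c);
}

my $good = encode('UTF-8', "\x{E1}\x{10D}\x{2603}");  # 3 code points, 7 octets
my $bad  = "caf\xE9";                                  # Latin-1 octets, not UTF-8

my $good_count = count_sequences($good);
my $bad_count  = count_sequences($bad);

printf "well-formed input: %d sequences\n", $good_count;
printf "ill-formed input:  %d sequences (nothing was validated)\n", $bad_count;
```

For well-formed input the count matches the number of code points. For ill-formed input the count is still produced without any complaint, which is the point the paragraph above makes.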

readline():

The implementation validates the read octets if the warnings category 'utf8' is in effect, and issues a warning if the read octets contain ill-formed UTF8. The validation procedure is the same as the one used in utf8::decode().

The ':utf8' flag/layer should never be used for reading unless you are willing to accept ill-formed UTF-X, which can lead to security issues or segmentation faults.
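A small sketch of that risk, assuming an in-memory filehandle behaves like a file on disk for this purpose: Latin-1 octets are slurped once through :utf8 and once through :encoding(UTF-8). slurp_with is a hypothetical helper, and utf8::valid() is used only to inspect whether the resulting string is internally consistent.

```perl
use strict;
use warnings;

# Latin-1 octets, not UTF-8: \xE9 looks like the start of a 3-octet
# sequence, but no continuation octets follow it.
my $octets = "caf\xE9 au lait\n";

# Hypothetical helper: slurp $octets through a layer, collecting warnings.
sub slurp_with {
    my ($layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, "<$layer", \$octets or die "open failed: $!";
    my $string = do { local $/; <$fh> };
    close $fh;
    return ($string, scalar @warnings);
}

my ($lax,    $lax_warnings)    = slurp_with(':utf8');
my ($strict, $strict_warnings) = slurp_with(':encoding(UTF-8)');

printf ":utf8            valid=%d warnings=%d\n",
    (utf8::valid($lax) ? 1 : 0), $lax_warnings;
printf ":encoding(UTF-8) valid=%d warnings=%d\n",
    (utf8::valid($strict) ? 1 : 0), $strict_warnings;
```

With :utf8 the ill-formed octets end up inside a string that is flagged as UTF8, and the only protection is a warning. The :encoding(UTF-8) layer hands back a well-formed string.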

:encoding

The PerlIO :encoding layer, provided by PerlIO::encoding, implements an incremental decoder framework for subclasses of Encode::Encoding. The implementation calls out to the Perl/XS subclass by invoking a method for each incremental decode. The buffers are copied between the layer and the subclass.
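The general shape of such an incremental decode can be sketched in plain Perl. This is not the PerlIO::encoding source, only an illustration of the same idea using the buffered-decode pattern from the Encode documentation: octets are read in small chunks, the decode method of an Encode::Encoding object is invoked once per chunk, and an incomplete trailing sequence is carried over to the next call. The chunk size of 3 is an arbitrary choice that forces sequences to be split across reads.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $octets = encode('UTF-8', "\x{E1}\x{10D}\x{2603}");  # 7 octets, 3 code points
my $codec  = find_encoding('UTF-8');                     # an Encode::Encoding object

open my $fh, '<:raw', \$octets or die "open failed: $!";

my ($buffer, $string, $calls) = ('', '', 0);

# Append up to 3 new octets to whatever was left over from the last round.
while (read($fh, $buffer, 3, length $buffer)) {
    # FB_QUIET stops at the first octet it cannot decode and leaves the
    # unprocessed tail (here: an incomplete sequence) in $buffer.
    $string .= $codec->decode($buffer, Encode::FB_QUIET);
    $calls++;
}
close $fh;

printf "decoded %d code points in %d decode calls, %d octets left over\n",
    length $string, $calls, length $buffer;
```

Every round involves a method call and buffer copying, which is the per-chunk overhead the paragraph above describes.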

UTF8 vs UTF-8

The utf8 encoding form is a superset of the UTF-8 encoding form specified by the Unicode Consortium. The utf8 encoding form accepts encoded code points that are ill-formed in the UTF-8 encoding form, such as surrogates and code points above U+10FFFF. Non-characters should be avoided, though Unicode has recently changed its mind. The utf8 encoding should not be used for interchange; it is Perl's internal encoding. Use the UTF-8 encoding form instead.
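The difference can be observed with Encode alone. The sketch below decodes the octets of U+110000, one past the last Unicode code point, with both encoding names; the choice of that particular code point is just an example, and FB_CROAK is used so that a rejection shows up as an exception.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# How Encode resolves the two names.
my $strict_name = find_encoding('UTF-8')->name;  # the strict Unicode form
my $lax_name    = find_encoding('utf8')->name;   # Perl's relaxed form

# U+110000 in Perl's extended encoding; ill-formed in UTF-8.
my $octets = "\xF4\x90\x80\x80";

# FB_CROAK dies on ill-formed input; LEAVE_SRC keeps $octets intact.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

my $lax_result    = eval { decode('utf8',  $octets, $check) };
my $strict_result = eval { decode('UTF-8', $octets, $check) };
my $strict_error  = $@;

printf "%-12s accepted: %s\n", $lax_name,    defined $lax_result    ? 'yes' : 'no';
printf "%-12s accepted: %s\n", $strict_name, defined $strict_result ? 'yes' : 'no';
```

The relaxed form returns a string containing a code point that no other UTF-8 consumer is required to accept, which is why it is unsuitable for interchange.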

Benchmark of slurping a UTF-8 encoded file

Modules used in the benchmark:

PerlIO::encoding, PerlIO::utf8_strict, Encode and Unicode::UTF8.

The following code is available on gist.github.com.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark     qw[];
    use Config        qw[%Config];
    use IO::Dir       qw[];
    use IO::File      qw[SEEK_SET];

    use Encode              qw[];
    use Unicode::UTF8       qw[];
    use PerlIO::encoding    qw[];
    use PerlIO::utf8_strict qw[];

    # https://github.com/chansen/p5-unicode-utf8/tree/master/benchmarks/data
    my $dir  = 'benchmarks/data';
    my @docs = do {
        my $d = IO::Dir->new($dir)
          or die qq/Could not open directory '$dir': $!/;
        sort grep { /^[a-z]{2}\.txt/ } $d->read;
    };

    printf "perl:                %s (%s %s)\n", $], @Config{qw[osname osvers]};
    printf "Encode:              %s\n", Encode->VERSION;
    printf "Unicode::UTF8:       %s\n", Unicode::UTF8->VERSION;
    printf "PerlIO::encoding:    %s\n", PerlIO::encoding->VERSION;
    printf "PerlIO::utf8_strict: %s\n", PerlIO::utf8_strict->VERSION;

    foreach my $doc (@docs) {

        # Slurp the raw octets once; every benchmark reads from memory.
        my $octets = do {
            open my $fh, '<:raw', "$dir/$doc" or die $!;
            local $/; <$fh>;
        };

        my $string = Unicode::UTF8::decode_utf8($octets);

        # Code point ranges that encode to 1, 2, 3 and 4 octets in UTF-8.
        my @ranges = (
            [    0x00,     0x7F, qr/[\x{00}-\x{7F}]/        ],
            [    0x80,    0x7FF, qr/[\x{80}-\x{7FF}]/       ],
            [   0x800,   0xFFFF, qr/[\x{800}-\x{FFFF}]/     ],
            [ 0x10000, 0x10FFFF, qr/[\x{10000}-\x{10FFFF}]/ ],
        );

        my @out;
        foreach my $r (@ranges) {
            my ($start, $end, $regexp) = @$r;
            my $count = () = $string =~ m/$regexp/g;
            push @out, sprintf "U+%.4X..U+%.4X: %d", $start, $end, $count
              if $count;
        }

        printf "\n\n%s: Size: %d Code points: %d (%s)\n",
          $doc, length $octets, length $string, join ' ', @out;

        open my $fh_raw, '<:raw', \$octets
          or die qq/Could not open a :raw fh: '$!'/;
        open my $fh_encoding, '<:encoding(UTF-8)', \$octets
          or die qq/Could not open a :encoding fh: '$!'/;
        open my $fh_utf8_strict, '<:utf8_strict', \$octets
          or die qq/Could not open a :utf8_strict fh: '$!'/;

        # Run each variant for at least 10 CPU seconds.
        Benchmark::cmpthese( -10, {
            ':encoding(UTF-8)' => sub {
                my $data = do { local $/; <$fh_encoding> };
                seek($fh_encoding, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            ':utf8_strict' => sub {
                my $data = do { local $/; <$fh_utf8_strict> };
                seek($fh_utf8_strict, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            'Encode' => sub {
                my $data = Encode::decode('UTF-8', do { local $/; scalar <$fh_raw> }, Encode::FB_CROAK|Encode::LEAVE_SRC);
                seek($fh_raw, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            'Unicode::UTF8' => sub {
                my $data = Unicode::UTF8::decode_utf8(do { local $/; scalar <$fh_raw> });
                seek($fh_raw, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
        });
    }

Results:

    $ perl benchmarks/slurp.pl
    perl:                5.023001 (darwin 14.4.0)
    Encode:              2.75
    Unicode::UTF8:       0.60
    PerlIO::encoding:    0.21
    PerlIO::utf8_strict: 0.006


    ar.txt: Size: 25918 Code points: 14308 (U+0000..U+007F: 2698 U+0080..U+07FF: 11610)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  3058/s               --        -19%         -73%          -87%
    Encode            3754/s              23%          --         -67%          -84%
    :utf8_strict     11361/s             272%        203%           --          -52%
    Unicode::UTF8    23620/s             672%        529%         108%            --


    el.txt: Size: 103974 Code points: 58748 (U+0000..U+007F: 13560 U+0080..U+07FF: 45150 U+0800..U+FFFF: 38)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  780/s               --         -19%         -73%          -86%
    Encode            958/s              23%           --         -66%          -83%
    :utf8_strict     2855/s             266%         198%           --          -48%
    Unicode::UTF8    5498/s             605%         474%          93%            --


    en.txt: Size: 82171 Code points: 82055 (U+0000..U+007F: 81988 U+0080..U+07FF: 18 U+0800..U+FFFF: 49)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  1111/s               --        -16%         -90%          -96%
    Encode            1327/s              19%          --         -88%          -95%
    :utf8_strict     11446/s             931%        763%           --          -60%
    Unicode::UTF8    28635/s            2478%       2058%         150%            --


    ja.txt: Size: 180109 Code points: 64655 (U+0000..U+007F: 6913 U+0080..U+07FF: 30 U+0800..U+FFFF: 57712)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  553/s               --         -27%         -72%          -91%
    Encode            757/s              37%           --         -61%          -87%
    :utf8_strict     1960/s             254%         159%           --          -67%
    Unicode::UTF8    5915/s             970%         682%         202%            --


    lv.txt: Size: 138397 Code points: 127160 (U+0000..U+007F: 117031 U+0080..U+07FF: 9021 U+0800..U+FFFF: 1108)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  605/s               --         -19%         -80%          -91%
    Encode            746/s              23%           --         -75%          -88%
    :utf8_strict     3043/s             403%         308%           --          -53%
    Unicode::UTF8    6453/s             967%         765%         112%            --


    ru.txt: Size: 151633 Code points: 85266 (U+0000..U+007F: 19263 U+0080..U+07FF: 65639 U+0800..U+FFFF: 364)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  542/s               --         -19%         -73%          -86%
    Encode            673/s              24%           --         -66%          -83%
    :utf8_strict     2001/s             269%         197%           --          -50%
    Unicode::UTF8    4010/s             640%         496%         100%            --


    sv.txt: Size: 96449 Code points: 92894 (U+0000..U+007F: 89510 U+0080..U+07FF: 3213 U+0800..U+FFFF: 171)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   923/s               --        -17%         -85%          -93%
    Encode            1109/s              20%          --         -82%          -92%
    :utf8_strict      5998/s             550%        441%           --          -56%
    Unicode::UTF8    13604/s            1374%       1127%         127%            --


    zh.txt: Size: 62891 Code points: 24519 (U+0000..U+007F: 5317 U+0080..U+07FF: 32 U+0800..U+FFFF: 19170)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  1630/s               --        -23%         -75%          -87%
    Encode            2104/s              29%          --         -68%          -83%
    :utf8_strict      6549/s             302%        211%           --          -48%
    Unicode::UTF8    12630/s             675%        500%          93%            --
