perl - Benchmarking utf8 file read - explanation of the differences
I have this code:

    #!/usr/bin/env perl

    use 5.016;
    use warnings;
    use autodie;
    use Path::Tiny;
    use Encode;
    use Benchmark qw(:all);

    my $cnt = 10_000;
    my $utf = 'utf8.txt';

    my $res = timethese($cnt, {
        # strict decoding in the IO layer
        'open-utf-8' => sub {
            open my $fhu, '<:encoding(UTF-8)', $utf;
            my $stru = do { local $/; <$fhu> };
            close $fhu;
        },
        # relaxed :utf8 flag on the handle
        'open-utf8' => sub {
            open my $fhu, '<:utf8', $utf;
            my $stru = do { local $/; <$fhu> };
            close $fhu;
        },
        # read raw octets, then decode explicitly (relaxed)
        'decode-utf8' => sub {
            open my $fhu, '<', $utf;
            my $stru = decode('utf8', do { local $/; <$fhu> });
            close $fhu;
        },
        # read raw octets, then decode explicitly (strict)
        'decode-utf-8' => sub {
            open my $fhu, '<', $utf;
            my $stru = decode('UTF-8', do { local $/; <$fhu> });
            close $fhu;
        },
        'ptiny' => sub {
            my $stru = path($utf)->slurp_utf8;
        },
    });

    cmpthese $res;
The file utf8.txt (approx. 175 KB) contains 1000 lines of UTF-8 encoded/ASCII characters, like:
9áäčďéěíĺľňóôöőŕřšťúůüűýž ÁÄČĎÉĚÍĹĽŇÓÔÖŐŔŘŠŤÚŮÜŰÝŽ aáäbcčdďeéěfghiíjkľĺmnňoóôöőpqrŕřsštťuúůüűvwxyýzž
Running the above on my notebook gives:
    Benchmark: timing 10000 iterations of decode-utf-8, decode-utf8, open-utf-8, open-utf8, ptiny...
    decode-utf-8: 47 wallclock secs (46.83 usr +  0.87 sys = 47.70 CPU) @  209.64/s (n=10000)
     decode-utf8: 48 wallclock secs (46.62 usr +  0.90 sys = 47.52 CPU) @  210.44/s (n=10000)
      open-utf-8: 60 wallclock secs (57.82 usr +  1.20 sys = 59.02 CPU) @  169.43/s (n=10000)
       open-utf8:  7 wallclock secs ( 6.57 usr +  0.70 sys =  7.27 CPU) @ 1375.52/s (n=10000)
           ptiny:  7 wallclock secs ( 5.98 usr +  0.52 sys =  6.50 CPU) @ 1538.46/s (n=10000)

                   Rate open-utf-8 decode-utf-8 decode-utf8 open-utf8 ptiny
    open-utf-8    169/s         --         -19%        -19%      -88%  -89%
    decode-utf-8  210/s        24%           --         -0%      -85%  -86%
    decode-utf8   210/s        24%           0%          --      -85%  -86%
    open-utf8    1376/s       712%         556%        554%        --  -11%
    ptiny        1538/s       808%         634%        631%       12%    --
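For readers unfamiliar with the comparison chart, the percentages can be reproduced from the rates in the timing lines. The sketch below assumes each cell is the row's rate relative to the column's rate, rounded to a whole percent; the helper name relative_percent is made up for illustration and is not part of Benchmark.

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the timing lines above.
my %rate = (
    'open-utf-8'   => 169.43,
    'decode-utf-8' => 209.64,
    'open-utf8'    => 1375.52,
    'ptiny'        => 1538.46,
);

# Hypothetical helper: how much faster (or slower) $row is than $col,
# as a whole-number percentage.
sub relative_percent {
    my ($row, $col) = @_;
    return sprintf '%.0f%%', 100 * ($rate{$row} / $rate{$col} - 1);
}

for my $pair ([ 'open-utf8', 'open-utf-8' ],
              [ 'open-utf-8', 'open-utf8' ],
              [ 'ptiny', 'open-utf8' ]) {
    printf "%-12s vs %-12s %s\n", @$pair, relative_percent(@$pair);
}
```

So "712%" means that open-utf8 completed roughly eight times as many iterations per second as open-utf-8, not seven.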
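For me the results are a bit surprising, so some questions:
- First: is something wrong in the above code?

If it is OK:
- Why is there such a huge difference between the explicit <:encoding(UTF-8) and the relaxed <:utf8 at the IO-layer level?
- Why is the difference not big at all between decode('UTF-8', ...) and decode('utf8', ...)?
- Why is the lazy, IO-layer level decode (<:utf8) so much faster than the explicit decode(...)?
- What is the "danger" of using the relaxed (fast) 'utf8' vs the exact (slow) 'UTF-8'?

And finally, not a question: I must check the Path::Tiny code to see how it manages to be the fastest...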
- Perl v5.22.0, perlbrew (threaded)
- OS X, Darwin Kernel Version 14.4.0 (Yosemite)
- old notebook: MacBook Pro (13-inch, Mid 2010), Core 2 Duo, 2.4 GHz, 8 GB, slow HDD
The PerlIO :utf8 layer is a pseudo layer: it is a flag on the PerlIO handle that the ops detect. The behavior varies depending on the op used:

read(), sysread() and recv():
The implementation performs no validation of the UTF8 sequences. It only checks the prefix octet of each UTF8 sequence in order to count the number of UTF8 sequences read.

readline():
The implementation validates the read octets if the warnings category 'utf8' is in effect, and issues a warning if the read octets contain ill-formed UTF8. The validation procedure used is the same as the one used in utf8::decode().

The ':utf8' flag/layer should never be used for reading unless you are willing to accept ill-formed UTF-X, which can lead to security issues or segmentation faults.
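If the input has to be validated, one option is to read the raw octets and decode them strictly with Encode, so that ill-formed input raises an error. This is only a sketch under that assumption; the helper name slurp_utf8_checked is invented here, and it accepts either a file name or a reference to a scalar holding octets.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: slurp $source (a file name, or a reference to a
# scalar holding octets) and decode it as strict UTF-8.
sub slurp_utf8_checked {
    my ($source) = @_;
    open my $fh, '<:raw', $source
        or die "Could not open '$source': $!";
    my $octets = do { local $/; <$fh> };
    close $fh;
    # FB_CROAK makes decode() die on ill-formed input instead of
    # substituting replacement characters.
    return Encode::decode('UTF-8', $octets, Encode::FB_CROAK);
}

my $good = "\xC5\xBE";    # U+017E encoded as UTF-8
my $bad  = "a\xFFb";      # 0xFF never occurs in well-formed UTF-8

printf "good input: %d character(s)\n", length slurp_utf8_checked(\$good);
printf "bad input:  %s\n",
    eval { slurp_utf8_checked(\$bad); 1 } ? 'accepted' : 'rejected';
```

The cost is the one the benchmark shows: decoding through Encode is noticeably slower than setting the :utf8 flag.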
The PerlIO :encoding layer is provided by PerlIO::encoding, which implements an incremental decoder framework for subclasses of Encode::Encoding. The implementation calls out to a Perl/XS subclass by invoking a method for each incremental decode. Buffers are copied between the layer and the subclass.
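The difference between a flag and a real layer can be made visible by asking PerlIO which layers a handle carries. The sketch below uses in-memory handles so that it needs no file; the exact list printed depends on the Perl build and on how the handle was opened, so no output is shown.

```perl
use strict;
use warnings;

my $octets = "plain ascii\n";

open my $fh_flag, '<:utf8', \$octets
    or die "Could not open :utf8 handle: $!";
open my $fh_layer, '<:encoding(UTF-8)', \$octets
    or die "Could not open :encoding handle: $!";

# PerlIO::get_layers is provided by the core; no module needs loading.
my @flag_layers = PerlIO::get_layers($fh_flag);
my @real_layers = PerlIO::get_layers($fh_layer);

print "':utf8'            => @flag_layers\n";
print "':encoding(UTF-8)' => @real_layers\n";
```

With :encoding an extra "encoding(...)" entry shows up in the stack, which is the layer that calls out to Encode for every chunk; with :utf8 only the "utf8" flag is reported.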
utf8 vs UTF-8
The utf8 encoding form is a superset of the UTF-8 encoding form specified by the Unicode Consortium. The utf8 encoding form accepts encoded code points that are ill-formed in the UTF-8 encoding form, such as surrogates and code points above U+10FFFF. Non-characters should be avoided, even though Unicode recently changed its mind. The utf8 encoding should not be used for interchange; it is Perl's internal encoding. Use the UTF-8 encoding form instead.
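The same distinction is visible in Encode, where the two spellings resolve to different encoding objects. The following sketch feeds both of them the three octets that would encode the surrogate U+D800; the choice of test input is mine, and the output is left for the reader to run rather than quoted.

```perl
use strict;
use warnings;
use Encode ();

# Encode resolves the two spellings to two different encoding objects.
printf "%-6s => %s\n", $_, Encode::find_encoding($_)->name
    for qw(utf8 UTF-8);

# The surrogate U+D800 in the generalized encoding form;
# ill-formed in UTF-8 as specified by Unicode.
my $surrogate = "\xED\xA0\x80";

for my $name (qw(utf8 UTF-8)) {
    my $octets = $surrogate;    # copy: decode() with FB_CROAK may modify its input
    my $ok = eval { Encode::decode($name, $octets, Encode::FB_CROAK); 1 };
    printf "decode('%s') %s the surrogate\n", $name, $ok ? 'accepts' : 'rejects';
}
```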
Benchmark of slurping a UTF-8 encoded file
Modules used in the benchmark:
PerlIO::encoding, PerlIO::utf8_strict, Encode and Unicode::UTF8.
The following code is available on
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark           qw[];
    use Config              qw[%Config];
    use IO::Dir             qw[];
    use IO::File            qw[SEEK_SET];
    use Encode              qw[];
    use Unicode::UTF8       qw[];
    use PerlIO::encoding    qw[];
    use PerlIO::utf8_strict qw[];

    # Directory with the sample documents (xx.txt)
    my $dir  = 'benchmarks/data';
    my @docs = do {
        my $d = IO::Dir->new($dir)
          or die qq/Could not open directory '$dir': $!/;
        sort grep { /^[a-z]{2}\.txt/ } $d->read;
    };

    printf "perl:                %s (%s %s)\n", $], @Config{qw[osname osvers]};
    printf "Encode:              %s\n", Encode->VERSION;
    printf "Unicode::UTF8:       %s\n", Unicode::UTF8->VERSION;
    printf "PerlIO::encoding:    %s\n", PerlIO::encoding->VERSION;
    printf "PerlIO::utf8_strict: %s\n", PerlIO::utf8_strict->VERSION;

    foreach my $doc (@docs) {

        # Slurp the document as raw octets
        my $octets = do {
            open my $fh, '<:raw', "$dir/$doc" or die $!;
            local $/;
            <$fh>;
        };

        my $string = Unicode::UTF8::decode_utf8($octets);

        # Code point ranges by encoded length (1 to 4 octets)
        my @ranges = (
            [ 0x00,    0x7F,     qr/[\x{00}-\x{7F}]/        ],
            [ 0x80,    0x7FF,    qr/[\x{80}-\x{7FF}]/       ],
            [ 0x800,   0xFFFF,   qr/[\x{800}-\x{FFFF}]/     ],
            [ 0x10000, 0x10FFFF, qr/[\x{10000}-\x{10FFFF}]/ ],
        );

        my @out;
        foreach my $r (@ranges) {
            my ($start, $end, $regexp) = @$r;
            my $count = () = $string =~ m/$regexp/g;
            push @out, sprintf "U+%.4X..U+%.4X: %d", $start, $end, $count
              if $count;
        }

        printf "\n\n%s: Size: %d Code points: %d (%s)\n",
            $doc, length $octets, length $string, join ' ', @out;

        # In-memory handles over the same octets, one per strategy
        open my $fh_raw, '<:raw', \$octets
          or die qq/Could not open :raw fh: '$!'/;

        open my $fh_encoding, '<:encoding(UTF-8)', \$octets
          or die qq/Could not open :encoding fh: '$!'/;

        open my $fh_utf8_strict, '<:utf8_strict', \$octets
          or die qq/Could not open :utf8_strict fh: '$!'/;

        Benchmark::cmpthese( -10, {
            ':encoding(UTF-8)' => sub {
                my $data = do { local $/; <$fh_encoding> };
                seek($fh_encoding, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            ':utf8_strict' => sub {
                my $data = do { local $/; <$fh_utf8_strict> };
                seek($fh_utf8_strict, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            'Encode' => sub {
                my $data = Encode::decode('UTF-8', do { local $/; scalar <$fh_raw> },
                                          Encode::FB_CROAK|Encode::LEAVE_SRC);
                seek($fh_raw, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            'Unicode::UTF8' => sub {
                my $data = Unicode::UTF8::decode_utf8(do { local $/; scalar <$fh_raw> });
                seek($fh_raw, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
        });
    }
    $ perl benchmarks/
    perl:                5.023001 (darwin 14.4.0)
    Encode:              2.75
    Unicode::UTF8:       0.60
    PerlIO::encoding:    0.21
    PerlIO::utf8_strict: 0.006

    ar.txt: Size: 25918 Code points: 14308 (U+0000..U+007F: 2698 U+0080..U+07FF: 11610)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  3058/s               --   -19%         -73%          -87%
    Encode            3754/s              23%     --         -67%          -84%
    :utf8_strict     11361/s             272%   203%           --          -52%
    Unicode::UTF8    23620/s             672%   529%         108%            --

    el.txt: Size: 103974 Code points: 58748 (U+0000..U+007F: 13560 U+0080..U+07FF: 45150 U+0800..U+FFFF: 38)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   780/s               --   -19%         -73%          -86%
    Encode             958/s              23%     --         -66%          -83%
    :utf8_strict      2855/s             266%   198%           --          -48%
    Unicode::UTF8     5498/s             605%   474%          93%            --

    en.txt: Size: 82171 Code points: 82055 (U+0000..U+007F: 81988 U+0080..U+07FF: 18 U+0800..U+FFFF: 49)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  1111/s               --   -16%         -90%          -96%
    Encode            1327/s              19%     --         -88%          -95%
    :utf8_strict     11446/s             931%   763%           --          -60%
    Unicode::UTF8    28635/s            2478%  2058%         150%            --

    ja.txt: Size: 180109 Code points: 64655 (U+0000..U+007F: 6913 U+0080..U+07FF: 30 U+0800..U+FFFF: 57712)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   553/s               --   -27%         -72%          -91%
    Encode             757/s              37%     --         -61%          -87%
    :utf8_strict      1960/s             254%   159%           --          -67%
    Unicode::UTF8     5915/s             970%   682%         202%            --

    lv.txt: Size: 138397 Code points: 127160 (U+0000..U+007F: 117031 U+0080..U+07FF: 9021 U+0800..U+FFFF: 1108)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   605/s               --   -19%         -80%          -91%
    Encode             746/s              23%     --         -75%          -88%
    :utf8_strict      3043/s             403%   308%           --          -53%
    Unicode::UTF8     6453/s             967%   765%         112%            --

    ru.txt: Size: 151633 Code points: 85266 (U+0000..U+007F: 19263 U+0080..U+07FF: 65639 U+0800..U+FFFF: 364)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   542/s               --   -19%         -73%          -86%
    Encode             673/s              24%     --         -66%          -83%
    :utf8_strict      2001/s             269%   197%           --          -50%
    Unicode::UTF8     4010/s             640%   496%         100%            --

    sv.txt: Size: 96449 Code points: 92894 (U+0000..U+007F: 89510 U+0080..U+07FF: 3213 U+0800..U+FFFF: 171)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   923/s               --   -17%         -85%          -93%
    Encode            1109/s              20%     --         -82%          -92%
    :utf8_strict      5998/s             550%   441%           --          -56%
    Unicode::UTF8    13604/s            1374%  1127%         127%            --

    zh.txt: Size: 62891 Code points: 24519 (U+0000..U+007F: 5317 U+0080..U+07FF: 32 U+0800..U+FFFF: 19170)
                        Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  1630/s               --   -23%         -75%          -87%
    Encode            2104/s              29%     --         -68%          -83%
    :utf8_strict      6549/s             302%   211%           --          -48%
    Unicode::UTF8    12630/s             675%   500%          93%            --