Perl - Benchmarking utf8 file read - explanation of the differences
I have this code:

    #!/usr/bin/env perl
    use 5.016;
    use warnings;
    use autodie;
    use Path::Tiny;
    use Encode;
    use Benchmark qw(:all);

    my $cnt = 10_000;
    my $utf = 'utf8.txt';

    my $res = timethese($cnt, {
        # strict decoding in the IO layer
        'open-utf-8' => sub {
            open my $fhu, '<:encoding(utf-8)', $utf;
            my $stru = do { local $/; <$fhu> };
            close $fhu;
        },
        # relaxed :utf8 flag on the handle
        'open-utf8' => sub {
            open my $fhu, '<:utf8', $utf;
            my $stru = do { local $/; <$fhu> };
            close $fhu;
        },
        # raw read, then relaxed decode
        'decode-utf8' => sub {
            open my $fhu, '<', $utf;
            my $stru = decode('utf8', do { local $/; <$fhu> });
            close $fhu;
        },
        # raw read, then strict decode
        'decode-utf-8' => sub {
            open my $fhu, '<', $utf;
            my $stru = decode('utf-8', do { local $/; <$fhu> });
            close $fhu;
        },
        'ptiny' => sub {
            my $stru = path($utf)->slurp_utf8;
        },
    });
    cmpthese $res;
The utf8.txt file (approx. 175 KB) contains 1000 lines of UTF-8 encoded and ASCII characters, like:
9áäčďéěíĺľňóôöőŕřšťúůüűýž ÁÄČĎÉĚÍĹĽŇÓÔÖŐŔŘŠŤÚŮÜŰÝŽ aáäbcčdďeéěfghiíjkľĺmnňoóôöőpqrŕřsštťuúůüűvwxyýzž
Running the above on my notebook gives:

    Benchmark: timing 10000 iterations of decode-utf-8, decode-utf8, open-utf-8, open-utf8, ptiny...
    decode-utf-8: 47 wallclock secs (46.83 usr +  0.87 sys = 47.70 CPU) @ 209.64/s (n=10000)
     decode-utf8: 48 wallclock secs (46.62 usr +  0.90 sys = 47.52 CPU) @ 210.44/s (n=10000)
      open-utf-8: 60 wallclock secs (57.82 usr +  1.20 sys = 59.02 CPU) @ 169.43/s (n=10000)
       open-utf8:  7 wallclock secs ( 6.57 usr +  0.70 sys =  7.27 CPU) @ 1375.52/s (n=10000)
           ptiny:  7 wallclock secs ( 5.98 usr +  0.52 sys =  6.50 CPU) @ 1538.46/s (n=10000)

                   Rate open-utf-8 decode-utf-8 decode-utf8 open-utf8 ptiny
    open-utf-8    169/s         --         -19%        -19%      -88%  -89%
    decode-utf-8  210/s        24%           --         -0%      -85%  -86%
    decode-utf8   210/s        24%           0%          --      -85%  -86%
    open-utf8    1376/s       712%         556%        554%        --  -11%
    ptiny        1538/s       808%         634%        631%       12%    --
The results surprised me, so my questions are:

- First: is anything wrong with the code above?

If it is OK:

- Why is there such a huge difference between the strict utf-8 and the relaxed utf8 at the IO-layer level (<:utf8 vs. <:encoding(utf-8))?
- Why, in contrast, is the difference so small between decode('utf-8', ...) and decode('utf8', ...)?
- Why is the relaxed decoding at the IO-layer level (<:utf8) so much faster than the explicit relaxed decode('utf8', ...)?
- What is the "danger" of using the relaxed (fast) 'utf8' versus the exact (slow) 'utf-8'?

And finally, not a question: I must check the Path::Tiny code to see how it manages to be the fastest...
Env:
- perl v5.22.0 - perlbrew (threaded)
- OS X - Darwin Kernel Version 14.4.0 (Yosemite)
- old notebook - MacBook Pro (13-inch, Mid 2010) - Core 2 Duo, 2.4 GHz, 8 GB, slow HDD
:utf8
The PerlIO :utf8 layer is a pseudo-layer: it is only a flag on the PerlIO handle that the reading op detects. Its behavior varies depending on the op used:
read(), sysread(), recv():
The implementation performs no validation of the UTF-8 sequences. It only checks the prefix octet of each UTF-8 sequence in order to count the number of sequences read.
readline():
The implementation validates the octets read only if the warnings category 'utf8' is in effect, and issues a warning if they contain ill-formed UTF-8. The validation procedure is the same one used by utf8::decode().
The ':utf8' flag/layer should never be used for reading unless you are willing to accept ill-formed UTF-X, which can lead to security issues or segmentation faults.
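One way to see that :utf8 is a flag rather than a decoding layer is to inspect a handle with PerlIO::get_layers. The following is a minimal sketch, assuming an in-memory filehandle and a made-up sample string; the helper name layers_for is hypothetical, and the base layer names it prints depend on the platform and on how the handle was opened.

```perl
use strict;
use warnings;

# Hypothetical sample: "é" encoded as UTF-8 octets, followed by a newline.
my $octets = "\xC3\xA9\n";

# Return the layer names PerlIO reports for a handle opened with $mode.
sub layers_for {
    my ($mode) = @_;
    open my $fh, $mode, \$octets or die "open failed: $!";
    my @layers = PerlIO::get_layers($fh);
    close $fh;
    return @layers;
}

my @lax    = layers_for('<:utf8');
my @strict = layers_for('<:encoding(UTF-8)');

print "<:utf8            => @lax\n";
print "<:encoding(UTF-8) => @strict\n";
```

With <:utf8 no encoding(...) entry should appear in the list, only the utf8 flag on top of the base layer, whereas <:encoding(UTF-8) adds a real decoding layer underneath the flag.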
:encoding
The PerlIO :encoding layer is provided by PerlIO::encoding and implements an incremental decoder framework for subclasses of Encode::Encoding. The implementation calls out to a Perl/XS subclass by invoking a method for each incremental decode, and buffers are copied between the layer and the subclass.
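The two strict approaches from the question can be put side by side. This is a minimal sketch, not the benchmark itself: the sample text and the helper names slurp_via_layer and slurp_then_decode are made up for illustration, and an in-memory filehandle stands in for a real file. Both helpers should return the same characters; what differs is where the decoding happens (incrementally inside the layer, or in a single Encode call on the whole buffer).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample data: two lines of text, stored as UTF-8 octets.
my $octets = Encode::encode('UTF-8', "\x{e1}\x{e4}\x{10d}\n\x{17e}\x{fd}\n");

# Decode inside the I/O layer (PerlIO::encoding does the work while reading).
sub slurp_via_layer {
    my ($bytes) = @_;
    open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    return $text;
}

# Read raw octets, then decode the whole buffer with a single Encode call.
sub slurp_then_decode {
    my ($bytes) = @_;
    open my $fh, '<:raw', \$bytes or die "open failed: $!";
    my $raw = do { local $/; <$fh> };
    close $fh;
    return Encode::decode('UTF-8', $raw, Encode::FB_CROAK);
}

my $from_layer  = slurp_via_layer($octets);
my $from_decode = slurp_then_decode($octets);

printf "same text: %s, characters: %d, octets: %d\n",
    ($from_layer eq $from_decode ? 'yes' : 'no'),
    length $from_layer, length $octets;
```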
utf8 vs UTF-8
The utf8 encoding form is a superset of the UTF-8 encoding form specified by the Unicode Consortium. The utf8 encoding form accepts encoded code points that are ill-formed in the UTF-8 encoding form, such as surrogates and code points above U+10FFFF. Non-characters should be avoided, though Unicode has recently changed its mind on this. The utf8 encoding should not be used for interchange; it is Perl's internal encoding. Use the UTF-8 encoding form instead.
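With Encode, the two names resolve to different encoding objects, which can be checked directly. The sketch below assumes only the core Encode module; the helper decodes_ok is hypothetical, and the surrogate octets are a hand-picked example of input that is ill-formed in UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Names are case-insensitive, but the hyphen matters:
# "UTF-8" is the strict encoding, "utf8" is Perl's lax one.
my $strict_name = Encode::find_encoding('UTF-8')->name;
my $lax_name    = Encode::find_encoding('utf8')->name;
print "UTF-8 => $strict_name\n";
print "utf8  => $lax_name\n";

# Hypothetical helper: true if $octets decode without croaking.
sub decodes_ok {
    my ($encoding, $octets) = @_;
    my $copy = $octets;    # FB_CROAK may modify its argument
    no warnings 'utf8';
    return eval { Encode::decode($encoding, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $surrogate = "\xED\xA0\x80";    # U+D800 written as three octets
my $stray     = "\x80";            # lone continuation octet

for my $name ('UTF-8', 'utf8') {
    printf "%-5s surrogate: %s, stray 0x80: %s\n", $name,
        (decodes_ok($name, $surrogate) ? 'accepted' : 'rejected'),
        (decodes_ok($name, $stray)     ? 'accepted' : 'rejected');
}
```

Passing a check value such as Encode::FB_CROAK is what turns ill-formed input into an exception; without it, Encode substitutes replacement characters instead of failing.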
Benchmark of slurping a UTF-8 encoded file
Modules used in the benchmark:
PerlIO::encoding, PerlIO::utf8_strict, Encode and Unicode::UTF8.
The following code is available on gist.github.com.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark           qw[];
    use Config              qw[%Config];
    use IO::Dir             qw[];
    use IO::File            qw[SEEK_SET];
    use Encode              qw[];
    use Unicode::UTF8       qw[];
    use PerlIO::encoding    qw[];
    use PerlIO::utf8_strict qw[];

    # https://github.com/chansen/p5-unicode-utf8/tree/master/benchmarks/data
    my $dir  = 'benchmarks/data';
    my @docs = do {
        my $d = IO::Dir->new($dir)
          or die qq/could not open directory '$dir': $!/;
        sort grep { /^[a-z]{2}\.txt/ } $d->read;
    };

    printf "perl: %s (%s %s)\n", $], @Config{qw[osname osvers]};
    printf "encode: %s\n", Encode->VERSION;
    printf "unicode::utf8: %s\n", Unicode::UTF8->VERSION;
    printf "perlio::encoding: %s\n", PerlIO::encoding->VERSION;
    printf "perlio::utf8_strict: %s\n", PerlIO::utf8_strict->VERSION;

    foreach my $doc (@docs) {
        # Slurp the raw octets once; all benchmarks read from this in-memory copy.
        my $octets = do {
            open my $fh, '<:raw', "$dir/$doc" or die $!;
            local $/;
            <$fh>;
        };

        my $string = Unicode::UTF8::decode_utf8($octets);

        # Count code points per encoded length (1 to 4 octets).
        my @ranges = (
            [ 0x00,    0x7f,     qr/[\x{00}-\x{7f}]/ ],
            [ 0x80,    0x7ff,    qr/[\x{80}-\x{7ff}]/ ],
            [ 0x800,   0xffff,   qr/[\x{800}-\x{ffff}]/ ],
            [ 0x10000, 0x10ffff, qr/[\x{10000}-\x{10ffff}]/ ],
        );

        my @out;
        foreach my $r (@ranges) {
            my ($start, $end, $regexp) = @$r;
            my $count = () = $string =~ m/$regexp/g;
            push @out, sprintf "u+%.4x..u+%.4x: %d", $start, $end, $count
              if $count;
        }

        printf "\n\n%s: size: %d code points: %d (%s)\n",
          $doc, length $octets, length $string, join ' ', @out;

        open my $fh_raw, '<:raw', \$octets
          or die qq/could not open :raw fh: '$!'/;
        open my $fh_encoding, '<:encoding(utf-8)', \$octets
          or die qq/could not open :encoding fh: '$!'/;
        open my $fh_utf8_strict, '<:utf8_strict', \$octets
          or die qq/could not open :utf8_strict fh: '$!'/;

        # Each case slurps the whole buffer, then rewinds for the next iteration.
        Benchmark::cmpthese( -10, {
            ':encoding(utf-8)' => sub {
                my $data = do { local $/; <$fh_encoding> };
                seek($fh_encoding, 0, SEEK_SET)
                  or die qq/could not rewind fh: '$!'/;
            },
            ':utf8_strict' => sub {
                my $data = do { local $/; <$fh_utf8_strict> };
                seek($fh_utf8_strict, 0, SEEK_SET)
                  or die qq/could not rewind fh: '$!'/;
            },
            'encode' => sub {
                my $data = Encode::decode('utf-8', do { local $/; scalar <$fh_raw> },
                                          Encode::FB_CROAK|Encode::LEAVE_SRC);
                seek($fh_raw, 0, SEEK_SET)
                  or die qq/could not rewind fh: '$!'/;
            },
            'unicode::utf8' => sub {
                my $data = Unicode::UTF8::decode_utf8(do { local $/; scalar <$fh_raw> });
                seek($fh_raw, 0, SEEK_SET)
                  or die qq/could not rewind fh: '$!'/;
            },
        });
    }
Results:

    $ perl benchmarks/slurp.pl
    perl: 5.023001 (darwin 14.4.0)
    encode: 2.75
    unicode::utf8: 0.60
    perlio::encoding: 0.21
    perlio::utf8_strict: 0.006

    ar.txt: size: 25918 code points: 14308 (u+0000..u+007f: 2698 u+0080..u+07ff: 11610)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)  3058/s               --   -19%         -73%          -87%
    encode            3754/s              23%     --         -67%          -84%
    :utf8_strict     11361/s             272%   203%           --          -52%
    unicode::utf8    23620/s             672%   529%         108%            --

    el.txt: size: 103974 code points: 58748 (u+0000..u+007f: 13560 u+0080..u+07ff: 45150 u+0800..u+ffff: 38)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)   780/s               --   -19%         -73%          -86%
    encode             958/s              23%     --         -66%          -83%
    :utf8_strict      2855/s             266%   198%           --          -48%
    unicode::utf8     5498/s             605%   474%          93%            --

    en.txt: size: 82171 code points: 82055 (u+0000..u+007f: 81988 u+0080..u+07ff: 18 u+0800..u+ffff: 49)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)  1111/s               --   -16%         -90%          -96%
    encode            1327/s              19%     --         -88%          -95%
    :utf8_strict     11446/s             931%   763%           --          -60%
    unicode::utf8    28635/s            2478%  2058%         150%            --

    ja.txt: size: 180109 code points: 64655 (u+0000..u+007f: 6913 u+0080..u+07ff: 30 u+0800..u+ffff: 57712)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)   553/s               --   -27%         -72%          -91%
    encode             757/s              37%     --         -61%          -87%
    :utf8_strict      1960/s             254%   159%           --          -67%
    unicode::utf8     5915/s             970%   682%         202%            --

    lv.txt: size: 138397 code points: 127160 (u+0000..u+007f: 117031 u+0080..u+07ff: 9021 u+0800..u+ffff: 1108)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)   605/s               --   -19%         -80%          -91%
    encode             746/s              23%     --         -75%          -88%
    :utf8_strict      3043/s             403%   308%           --          -53%
    unicode::utf8     6453/s             967%   765%         112%            --

    ru.txt: size: 151633 code points: 85266 (u+0000..u+007f: 19263 u+0080..u+07ff: 65639 u+0800..u+ffff: 364)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)   542/s               --   -19%         -73%          -86%
    encode             673/s              24%     --         -66%          -83%
    :utf8_strict      2001/s             269%   197%           --          -50%
    unicode::utf8     4010/s             640%   496%         100%            --

    sv.txt: size: 96449 code points: 92894 (u+0000..u+007f: 89510 u+0080..u+07ff: 3213 u+0800..u+ffff: 171)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)   923/s               --   -17%         -85%          -93%
    encode            1109/s              20%     --         -82%          -92%
    :utf8_strict      5998/s             550%   441%           --          -56%
    unicode::utf8    13604/s            1374%  1127%         127%            --

    zh.txt: size: 62891 code points: 24519 (u+0000..u+007f: 5317 u+0080..u+07ff: 32 u+0800..u+ffff: 19170)
                        Rate :encoding(utf-8) encode :utf8_strict unicode::utf8
    :encoding(utf-8)  1630/s               --   -23%         -75%          -87%
    encode            2104/s              29%     --         -68%          -83%
    :utf8_strict      6549/s             302%   211%           --          -48%
    unicode::utf8    12630/s             675%   500%          93%            --