[Perldl] Mysterious slow down from repeated inner calls

Jim Magnuson james.magnuson at uconn.edu
Mon Feb 27 05:55:07 HST 2012


Okay, I'm very much looking forward to using Chris Marshall's improvements.
In the meantime, though, I seem to have found the culprit in the slowing.

In the code pasted below, I've gotten rid of any adjustment to the topX
list. Instead, I just save every similarity. This is a long list, but not a
problem. Then I get the total, max, and topX at the end. I'm sure this will
not be as efficient as Chris Marshall's code, but the interesting thing is
that in the old code, the call to inner interacted with the small number of
list operations I was using. When inner was replaced with some non-PDL
calculation, time-per-item remained more or less constant. When inner was
there, time-per-item increased steadily.
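
One way to check whether time-per-item could grow with a word's position independently of inner itself is to count how many pairs a given loop shape actually visits. The sketch below is pure Perl with no PDL. `pairs_visited` and its arguments are hypothetical names, and the two branches model an inner loop that tests `$at1 == $at2` before the `$at2++` increment versus one that compares and increments in a single step. It is a counting exercise only, not a timing measurement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many inner-loop iterations reach the similarity computation
# for the word at position $at1 in a list of $n words.
#   $skip_before_increment = 1: "next" is tested before "$at2++"
#   $skip_before_increment = 0: compare the current index, then increment
sub pairs_visited {
    my ($n, $at1, $skip_before_increment) = @_;
    my $visited = 0;
    my $at2     = 0;
    for (1 .. $n) {
        if ($skip_before_increment) {
            next if $at1 == $at2;    # once true, $at2 never advances again
            $at2++;
        } else {
            next if $at1 == $at2++;  # compare, then always advance
        }
        $visited++;                  # stands in for the call to inner()
    }
    return $visited;
}

for my $at1 (0, 100, 1000) {
    printf "word %4d: skip-before-increment visits %5d pairs, compare-then-increment visits %5d\n",
        $at1, pairs_visited(30000, $at1, 1), pairs_visited(30000, $at1, 0);
}
```

Under these assumptions the first loop shape does an amount of work proportional to the word's position in the list, while the second does the same amount of work for every word.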

Thanks very, very much,

jim

#!/usr/bin/perl -s
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw ( time ) ;
$|=1;
$top = 20;

$realStart = time();

while(<>){
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = norm(pdl(@data));
    # EXAMPLE LINE
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

}

@kernelKeys = sort( keys %kernel );
printf STDERR "# read %d words in %.2f seconds\n",
  scalar(@kernelKeys), time()-$realStart;


$startAll = time();

$at1 = 0;
printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";

foreach $w1 (@kernelKeys) {
  $startWord = time();
  @allSims = ();
  $at2 = 0;
  foreach $w2 (@kernelKeys) {
    # skip identical item, but not homophones (compare first, then increment)
    next if($at1 == $at2++);
    push @allSims, inner($kernel{$w1},$kernel{$w2});
#    $sim = inner($kernel{$w1},$kernel{$w2});
#    $totalsim+=$sim;
#    if($sim > $maxsim){      $maxsim = $sim;    }
#    # keep the top 20
#    if($#topX < $top){
#      push @topX, $sim;
#    } else {
#      @topX = sort { $a <=> $b } @topX;
#      if($sim > $topX[0]){ $topX[0] = $sim;      }
#    }
  }
  $at1++;

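  $allSim = qsort(pdl(@allSims));
  $nSims = nelem($allSim);  # number of similarities saved for this word
  $now = time();
  # qsort is ascending, so the last $top elements are the $top largest
  printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
    sum($allSim)/$nSims, max($allSim),
    sum($allSim->(($nSims - $top):($nSims - 1)))/$top,
    $now - $startWord;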
  unless($at1 % 25) {
    $elapsed = $now - $startAll;
    $thisWord = $now - $startWord;
    $perWord = $elapsed / $at1;  # $at1 words have been processed so far
    $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600;
    printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHoursToGo %.6f\n",
      $thisWord, $elapsed, $perWord, $hoursRemaining;
  }

}
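
For reference, the end-of-word summary step (sort once, then take the mean, the max, and the mean of the top few values) can be tried in isolation. This is a minimal sketch on made-up similarity values, not the thread's data. It uses the string form of `slice` rather than NiceSlice syntax, and `sumover`/`maximum` followed by `sclr` so that plain Perl numbers reach printf. The variable names are illustrative.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical similarity values for one word (not taken from the real data).
my @allSims = (0.10, 0.80, 0.35, 0.50, 0.20, 0.65);
my $top     = 3;   # the script above uses 20

# Sort ascending so the largest values sit at the end of the piddle.
my $sorted = qsort(pdl(@allSims));

# Reduce to plain Perl scalars for printing.
my $mean = $sorted->sumover->sclr / $sorted->nelem;
my $max  = $sorted->maximum->sclr;

# A negative start index counts from the end, so this slice
# holds exactly the last $top elements.
my $topMean = $sorted->slice("-$top:-1")->sumover->sclr / $top;

printf "mean %.6f\tmax %.6f\tmeanTop%d %.6f\n", $mean, $max, $top, $topMean;
```

Counting the slice bounds from the end of the sorted piddle keeps the number of averaged elements tied to `$top` rather than to the number of words read.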


On Mon, Feb 27, 2012 at 4:04 PM, Chris Marshall <devel.chm.01 at gmail.com> wrote:

> To get the best performance, you'll need to use what we
> call vectorized PDL operations.  Here is an example of a
> PDL-idiomatic way to do some of your computation:
>
> # use rcols to read data directly into a pdl and perl array
> ($inword, $grid) = rcols 'jmtest.data',0,[], { perlcols=>[0] };
>
> # rearrange dimensions since rcols puts columns in dim(0)
> $kern = $grid->mv(1,0)->norm;
>
> # number of records is now length of dim(1)
> $nrecs = $kern->dim(1);
>
> # calculate all inner products at the same time
> $sim = inner($kern(,(0)),$kern);
>
> # calculate the top 20 values
> $topXind = zeros(long,20);
>
> # don't forget to skip diagonal elements
> $sim->(1:-1)->maximum_n_ind($topXind);
>
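> # use slicing to get the max elements; the indices returned above are
> # relative to the (1:-1) slice, so index into that same slice
> $topX = $sim->(1:-1)->index($topXind);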
>
> # how much wt in top 20?
> print $topX->sum . "\n";
>
>
> Cheers,
> Chris
>
> On Mon, Feb 27, 2012 at 8:46 AM, Jim Magnuson <james.magnuson at uconn.edu>
> wrote:
> > Yes, a typo -- I changed the variable name for the example code and lost
> > the $ somehow...
> >
> > Currently trying to get timing tests going as suggested by chm...
> >
> > thanks,
> >
> > jim
> >
> >
> > On Mon, Feb 27, 2012 at 2:39 PM, Clifford Sobchuk
> > <clifford.sobchuk at ericsson.com> wrote:
> >>
> >> I don't know if this is a typo or not, but in the code for inner loop
> >> you have the following:
> >> >      $sim = inner($kernel{$w1},$kernel{w2});
> >> Where $kernel{w2} should be $kernel{$w2}.
> >>
> >>
> >>
> >>
> >> CLIFF SOBCHUK
> >>
> >> -----Original Message-----
> >> From: chm [mailto:devel.chm.01 at gmail.com]
> >> Sent: Monday, February 27, 2012 5:46 AM
> >> To: Jim Magnuson
> >> Cc: perldl
> >> Subject: Re: [Perldl] Mysterious slow down from repeated inner calls
> >>
> >> I don't know of any reason why inner() would slow down---have you tried
> >> using NYTProf or some such tool to track time in inner and number of
> >> calls to inner?  One oddity is that the first loop appears to skip all
> >> calls to inner which would be *very* fast.  Maybe something is going on
> >> with the loop structure?
> >>
> >> --Chris
> >>
> >> On 2/27/2012 2:50 AM, Jim Magnuson wrote:
> >> > Hello,
> >> >
> >> > I have a set of about 30,000 words, and I am using string kernels as a
> >> > metric of word similarity. The goal is to see whether different
> >> > kernels are better at predicting how quickly human subjects are able
> >> > to process words. I have calculated the string kernels for each word.
> >> > So now I have a file with 30,000 lines. The first field in each line
> >> > is a word, and this is followed by a 676-element vector representing
> >> > the kernel representation.
> >> >
> >> > Once I read this in, I need to step through and calculate the
> >> > similarity of each word to every other word using vector cosine, as
> >> > well as track the highest similarity value (excluding the word
> >> > itself), and the set of X-most similar items (there are reasons to
> >> > believe these are good predictors of human performance).
> >> >
> >> > Here's the problem: when I start running the code below, it is very
> >> > fast. It takes 5 msecs to process the first word (that is, to do the
> >> > necessary 30,000 cosines), but by the time it reaches the 100th it is
> >> > taking 37 msecs, and by the 1,000th it is taking 398 msecs -- with
> >> > 29,000 to go, and constant slowing...
> >> >
> >> > Memory use by perl stays constant, and I cannot figure out what would
> >> > make the program slow down so much. I posted a query at Perl Monks and
> >> > I got advice about how to speed up each step (the first word used to
> >> > take 38 msecs), and they pointed out that it is indeed the call to
> >> > inner that is the culprit (replace it with a non-pdl calculation, and
> >> > the slowing goes away). They suggested I should look for advice from
> >> > PDL experts.
> >> >
> >> > So if anyone can give me pointers as to what is slowing things down
> >> > and whether there is a way to avoid it, I would be most grateful.
> >> > Apologies in advance for any offensively inefficient/awkward use of
> >> > PDL!
> >> >
> >> > Thanks!
> >> >
> >> > jim
> >> > #!/usr/bin/perl -s
> >> > use PDL;
> >> > use Time::HiRes qw ( time ) ;
> >> > $|=1;
> >> > $top = 20;
> >> >
> >> > while(<>){
> >> >      chomp;
> >> >      ($wrd, @data) = split;
> >> >      $kernel{$wrd} = norm(pdl(@data));
> >> >      # EXAMPLE LINE
> >> >      # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
> >> > 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >> >
> >> > }
> >> > $nrecs = keys %kernel;
> >> > @kernelKeys = sort( keys %kernel );
> >> >
> >> > $startAll = time();
> >> >
> >> > $at1 = 0;
> >> > foreach $w1 (@kernelKeys) {
> >> >    $totalsim = $maxsim = 0;
> >> >    $startWord = time();
> >> >    @topX = ();
> >> >    $at2 = 0;
> >> >    foreach $w2 (@kernelKeys) {
> >> >      next if($at1 == $at2); # skip identical item, but not homophones
> >> >      $at2++;
> >> >      $sim = inner($kernel{$w1},$kernel{w2});
> >> >      $totalsim+=$sim;
> >> >      if($sim > $maxsim){      $maxsim = $sim;    }
> >> >      # keep the top 20
> >> >      if($#topX < $top){
> >> >        push @topX, $sim;
> >> >      } else {
> >> >        @topX = sort { $a <=> $b } @topX;
> >> >        if($sim > $topX[0]){ $topX[0] = $sim;      }
> >> >      }
> >> >    }
> >> >    $at1++;
> >> >    $topXtotal = sum(pdl(@topX));
> >> >    printf "$at1\t$w1\t$totalsim\t$maxsim\t$topXtotal\n";
> >> >    unless($at1 % 10){
> >> >      $now = time();
> >> >      $elapsed = $now - $startAll;
> >> >      $thisWord = $now - $startWord;
> >> >      $perWord = $elapsed / $at1;
> >> >      $hoursRemaining = (($nrecs - $at1) * $perWord)/3600;
> >> >      printf STDERR "#$at1\t$w1\t$totalsim\t$maxsim\t$topXtotal\t";
> >> >      printf STDERR "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
> >> >        $elapsed, $thisWord, $perWord, $hoursRemaining;
> >> >    }
> >> > }
> >> >
> >> >
> >> >
> >> >
> >> > _______________________________________________
> >> > Perldl mailing list
> >> > Perldl at jach.hawaii.edu
> >> > http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
> >>
> >>
> >
> >
>