[Perldl] Mysterious slow down from repeated inner calls

Jim Magnuson james.magnuson at uconn.edu
Sun Feb 26 21:50:08 HST 2012


Hello,

I have a set of about 30,000 words, and I am using string kernels as a
metric of word similarity. The goal is to see whether different kernels are
better at predicting how quickly human subjects are able to process
words. I have calculated the string kernels for each word. So now I have a
file with 30,000 lines. The first field in each line is a word, and this is
followed by a 676-element vector representing the kernel representation.

Once I read this in, I need to step through and calculate the similarity of
each word to every other word using vector cosine, as well as track the
highest similarity value (excluding the word itself), and the set of X-most
similar items (there are reasons to believe these are good predictors of
human performance).

Here's the problem: when I start running the code below, it is very fast.
It takes 5 msecs to process the first word (that is, to do the necessary
30,000 cosines), but by the time it reaches the 100th it is taking 37
msecs, and by the 1,000th it is taking 398 msecs -- with 29,000 to go, and
constant slowing...

Memory use by perl stays constant, and I cannot figure out what would make
the program slow down so much. I posted a query at Perl Monks and I got
advice about how to speed up each step (the first word used to take 38
msecs), and they pointed out that it is indeed the call to inner that is
the culprit (replace it with a non-pdl calculation, and the slowing goes
away). They suggested I should look for advice from PDL experts.

So if anyone can give me pointers as to what is slowing things down and
whether there is a way to avoid it, I would be most grateful. Apologies in
advance for any offensively inefficient/awkward use of PDL!

Thanks!

jim
#!/usr/bin/perl -s
use PDL;
use Time::HiRes qw ( time ) ;
$|=1;
$top = 20;

while(<>){
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = norm(pdl(@data));
    # EXAMPLE LINE
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

}
$nrecs = keys %kernel;
@kernelKeys = sort( keys %kernel );

$startAll = time();

$at1 = 0;
foreach $w1 (@kernelKeys) {
  $totalsim = $maxsim = 0;
  $startWord = time();
  @topX = ();
  $at2 = 0;
  foreach $w2 (@kernelKeys) {
    next if($at1 == $at2); # skip identical item, but not homophones
    $at2++;
    $sim = inner($kernel{$w1},$kernel{w2});
    $totalsim+=$sim;
    if($sim > $maxsim){      $maxsim = $sim;    }
    # keep the top 20
    if($#topX < $top){
      push @topX, $sim;
    } else {
      @topX = sort { $a <=> $b } @topX;
      if($sim > $topX[0]){ $topX[0] = $sim;      }
    }
  }
  $at1++;
  $topXtotal = sum(pdl(@topX));
  printf "$at1\t$w1\t$totalsim\t$maxsim\t$topXtotal\n";
  unless($at1 % 10){
    $now = time();
    $elapsed = $now - $startAll;
    $thisWord = $now - $startWord;
    $perWord = $elapsed / $at1;
    $hoursRemaining = (($nrecs - $at1) * $perWord)/3600;
    printf STDERR "#$at1\t$w1\t$totalsim\t$maxsim\t$topXtotal\t";
    printf STDERR "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO
%.3f\n",
      $elapsed, $thisWord, $perWord, $hoursRemaining;
  }
}
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.jach.hawaii.edu/pipermail/perldl/attachments/20120227/07bf36be/attachment.html>


More information about the Perldl mailing list