Path: news.mathworks.com!not-for-mail
From: "James Tursa" <aclassyguywithaknotac@hotmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Speeding up sum of squares
Date: Thu, 8 May 2008 04:32:02 +0000 (UTC)
Organization: Boeing
Lines: 131
Message-ID: <fvtvo2$n92$1@fred.mathworks.com>
References: <fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com> <fvq0qr$7tv$1@fred.mathworks.com> <1v2124dbkuukhhprui7a8ss4vhmtgn3jcg@4ax.com> <fvq421$p13$1@fred.mathworks.com> <fvtkg4$foh$1@fred.mathworks.com> <fvtllj$n9b$1@fred.mathworks.com>
Reply-To: "James Tursa" <aclassyguywithaknotac@hotmail.com>
NNTP-Posting-Host: webapp-05-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1210221122 23842 172.30.248.35 (8 May 2008 04:32:02 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Thu, 8 May 2008 04:32:02 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 756104
Xref: news.mathworks.com comp.soft-sys.matlab:467293


"John D'Errico" <woodchips@rochester.rr.com> wrote in
message <fvtllj$n9b$1@fred.mathworks.com>...
> "Tim Davis" <davis@cise.ufl.edu> wrote in message 
> <fvtkg4$foh$1@fred.mathworks.com>...
> 
> > I'd hazard a guess that it's not a BLAS performance
> > difference.  There really isn't a lot of benefit in
> > performance of the level-1 BLAS as compared with plain code,
> > given modern optimizing compilers.
> > 
> > It's probably because MATLAB is computing z=x.^2, and
> > storing z as a new vector, internally.  Then it sums it up.
> >  The creation of y takes more memory traffic (8*n bytes
> > written then read back in).  I would be quite surprised that
> > MATLAB is smart enough not to form the vector z,
> > but when it does x'*x it knows not to do that.
> 
> Of course. This makes much sense.
> 
> John

I had thought of the large intermediate array overhead also,
but to confirm this was the only thing going on I made the
following test runs with a 10,000,000 size double array:

(1) Matrix Multiply, BLAS calls in background

>> tic;x'*x;toc
Elapsed time is 0.028451 seconds.

(2) Element-by-element multiply, large intermediate array
first, sum function probably has BLAS calls

>> tic;sum(x.*x);toc
Elapsed time is 0.202210 seconds.

(3) Norm function, BLAS calls in background?

>> tic;norm(x)^2;toc
Elapsed time is 0.088459 seconds.

(4) Mex routine, straightforward

>> tic;sumsquares(x);toc
Elapsed time is 0.165403 seconds.

(5) Mex routine, form large intermediate array first

>> tic;sumsquares2(x);toc
Elapsed time is 0.336897 seconds.

I find two things interesting about these results.

First, the (4) Mex routine (see below) which did a
straightforward sum of squares without forming a large
intermediate array first was still *much* slower than (1),
indicating that the BLAS calls indeed seem to have a
significant effect on this calculation as John suggested.
Maybe the only difference is keeping the intermediate
results in registers. I don't know, and I didn't try to
force this in my mex routine.

Second, the difference between (1) and (2), about 0.1738
seconds, was very close to the difference between (4) and
(5), about 0.1715 seconds.  This seems to independently
confirm the time difference is the cost of this intermediate
large array overhead. Probably the sum function uses BLAS
calls as well. This seems to confirm that the difference
between (1) and (2) is due mainly to the large intermediate
array cost as Tim suggested.

One of the worries I would have with (3) is the potential
loss of precision doing the sqrt and then the square again.

James Tursa

 ------------------------------------------

(4) Mex routine, straightforward

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
mxArray *prhs[])
{
    double sumsquares = 0.0;
    double *dp;
    int i, n;

    dp = mxGetPr( prhs[0] );
    n = mxGetNumberOfElements( prhs[0] );
    for( i=0; i<n; i++ )
    {
        sumsquares += (*dp) * (*dp);
        dp++;
    }
    plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
    *mxGetPr( plhs[0] ) = sumsquares;
}

 ------------------------------------------

(5) Mex routine, form large intermediate array first

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const
mxArray *prhs[])
{
    mxArray *mx;
    double sumsquares = 0.0;
    double *dp, *dx;
    int i, n;

    dp = mxGetPr( prhs[0] );
    n = mxGetNumberOfElements( prhs[0] );
    mx = mxCreateDoubleMatrix( n, 1, mxREAL );
    dx = mxGetPr( mx );
    for( i=0; i<n; i++ )
    {
        *dx++ = (*dp) * (*dp);
        dp++;
    }
    dx = mxGetPr( mx );
    for( i=0; i<n; i++ )
    {
        sumsquares += *dx++;
    }
    plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );
    *mxGetPr( plhs[0] ) = sumsquares;
    mxDestroyArray( mx );
}