<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821</link>
    <title>MATLAB Central Newsreader - Speeding up sum of squares</title>
    <description>Feed for thread: Speeding up sum of squares</description>
    <language>en-us</language>
    <copyright>&amp;copy;1994-2010 by MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Tue, 06 May 2008 16:11:51 -0400</pubDate>
      <title>Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430601</link>
      <author>richardstartz@comcast.net</author>
      <description>I have a problem where most of the computational time is spent&lt;br&gt;
computing&lt;br&gt;
&lt;br&gt;
sum(x.^2)&lt;br&gt;
&lt;br&gt;
where x is a vector of length between 100 and 10,000.&lt;br&gt;
&lt;br&gt;
It seems I can speed up computation some by using &lt;br&gt;
&lt;br&gt;
x'*x&lt;br&gt;
&lt;br&gt;
instead.&lt;br&gt;
&lt;br&gt;
Any thoughts or suggestions on further speed improvement?&lt;br&gt;
Thanks.&lt;br&gt;
-Dick Startz&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 06 May 2008 16:26:04 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430604</link>
      <author>John D'Errico</author>
      <description>richardstartz@comcast.net wrote in message &lt;br&gt;
&amp;lt;fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com&amp;gt;...&lt;br&gt;
&amp;gt; I have a problem where most of the computational time is spent&lt;br&gt;
&amp;gt; computing&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; sum(x.^2)&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; where x is a vector of length between 100 and 10,000.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; It seems I can speed up computation some by using &lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; x'*x&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; instead.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Any thoughts or suggestions on further speed improvement?&lt;br&gt;
&amp;gt; Thanks.&lt;br&gt;
&amp;gt; -Dick Startz&lt;br&gt;
&lt;br&gt;
As always on these problems, its the little things&lt;br&gt;
that sneak up behind you.&lt;br&gt;
&lt;br&gt;
Is x real? Is x a column vector or a row vector?&lt;br&gt;
&lt;br&gt;
If you are confident that you know the answers&lt;br&gt;
to these questions always, then yes, you can&lt;br&gt;
save some time by using the blas to do this dot&lt;br&gt;
product.&lt;br&gt;
&lt;br&gt;
x = rand(5000,1);&lt;br&gt;
&lt;br&gt;
timeit(@() sum(x.*x))&lt;br&gt;
ans =&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;0.00045792&lt;br&gt;
&lt;br&gt;
timeit(@() dot(x,x))&lt;br&gt;
ans =&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;0.00057047&lt;br&gt;
&lt;br&gt;
timeit(@() x'*x)&lt;br&gt;
ans =&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;7.4318e-05&lt;br&gt;
&lt;br&gt;
John&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 06 May 2008 16:29:04 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430606</link>
      <author>Bruno Luong</author>
      <description>richardstartz@comcast.net wrote in message&lt;br&gt;
&amp;lt;fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Any thoughts or suggestions on further speed improvement?&lt;br&gt;
&lt;br&gt;
norm(x)^2&lt;br&gt;
&lt;br&gt;
Bruno&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 06 May 2008 16:52:33 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430613</link>
      <author>richardstartz@comcast.net</author>
      <description>On Tue, 6 May 2008 16:26:04 +0000 (UTC), &quot;John D'Errico&quot;&lt;br&gt;
&amp;lt;woodchips@rochester.rr.com&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
&amp;gt;richardstartz@comcast.net wrote in message &lt;br&gt;
&amp;gt;&amp;lt;fm012450tjat7jgv5hqhtiokid3ketdsgj@4ax.com&amp;gt;...&lt;br&gt;
&amp;gt;&amp;gt; I have a problem where most of the computational time is spent&lt;br&gt;
&amp;gt;&amp;gt; computing&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; sum(x.^2)&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; where x is a vector of length between 100 and 10,000.&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; It seems I can speed up computation some by using &lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; x'*x&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; instead.&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; Any thoughts or suggestions on further speed improvement?&lt;br&gt;
&amp;gt;&amp;gt; Thanks.&lt;br&gt;
&amp;gt;&amp;gt; -Dick Startz&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;As always on these problems, its the little things&lt;br&gt;
&amp;gt;that sneak up behind you.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;Is x real? Is x a column vector or a row vector?&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;If you are confident that you know the answers&lt;br&gt;
&amp;gt;to these questions always, then yes, you can&lt;br&gt;
&amp;gt;save some time by using the blas to do this dot&lt;br&gt;
&amp;gt;product.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;x = rand(5000,1);&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;timeit(@() sum(x.*x))&lt;br&gt;
&amp;gt;ans =&lt;br&gt;
&amp;gt;   0.00045792&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;timeit(@() dot(x,x))&lt;br&gt;
&amp;gt;ans =&lt;br&gt;
&amp;gt;   0.00057047&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;timeit(@() x'*x)&lt;br&gt;
&amp;gt;ans =&lt;br&gt;
&amp;gt;   7.4318e-05&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;John&lt;br&gt;
&lt;br&gt;
Thank you, John.&lt;br&gt;
&lt;br&gt;
I have real column vectors, so it looks like your last suggestion will&lt;br&gt;
speed things up my nearly an order of magnitude. That would be really&lt;br&gt;
great.&lt;br&gt;
&lt;br&gt;
If you have a minute, could you explain what it is that @() x'*x does&lt;br&gt;
and why it makes such a difference, or point me to the right stuff to&lt;br&gt;
read?&lt;br&gt;
&lt;br&gt;
-Dick Startz&lt;br&gt;
&lt;br&gt;
PS If it matters, I'm running 2007B on a dual processor Windows&lt;br&gt;
machine.&lt;br&gt;
&lt;br&gt;
PPS What's timeit()?&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 06 May 2008 17:21:05 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430621</link>
      <author>John D'Errico</author>
      <description>richardstartz@comcast.net wrote in message &lt;br&gt;
&amp;lt;1v2124dbkuukhhprui7a8ss4vhmtgn3jcg@4ax.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
&amp;gt; Thank you, John.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I have real column vectors, so it looks like your last suggestion will&lt;br&gt;
&amp;gt; speed things up my nearly an order of magnitude. That would be really&lt;br&gt;
&amp;gt; great.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; If you have a minute, could you explain what it is that @() x'*x does&lt;br&gt;
&amp;gt; and why it makes such a difference, or point me to the right stuff to&lt;br&gt;
&amp;gt; read?&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; -Dick Startz&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; PS If it matters, I'm running 2007B on a dual processor Windows&lt;br&gt;
&amp;gt; machine.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; PPS What's timeit()?&lt;br&gt;
&lt;br&gt;
timeit is a function from the file exchange,&lt;br&gt;
written by Steve Eddins. Its a better way to&lt;br&gt;
time code fragments than using tic and toc.&lt;br&gt;
&lt;br&gt;
The product x'*x uses blas routines to speed&lt;br&gt;
up the inner product. They are clearly much&lt;br&gt;
more highly optimized than is the code&lt;br&gt;
matlab produces for sum(x.^2). Bruno&lt;br&gt;
pointed out that norm(x)^2 also is fairly fast.&lt;br&gt;
I assume that it too is well optimized.&lt;br&gt;
&lt;br&gt;
John&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 06 May 2008 17:32:45 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430624</link>
      <author>richardstartz@comcast.net</author>
      <description>On Tue, 6 May 2008 17:21:05 +0000 (UTC), &quot;John D'Errico&quot;&lt;br&gt;
&amp;lt;woodchips@rochester.rr.com&amp;gt; wrote:&lt;br&gt;
&lt;br&gt;
&amp;gt;richardstartz@comcast.net wrote in message &lt;br&gt;
&amp;gt;&amp;lt;1v2124dbkuukhhprui7a8ss4vhmtgn3jcg@4ax.com&amp;gt;...&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;&amp;gt; Thank you, John.&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; I have real column vectors, so it looks like your last suggestion will&lt;br&gt;
&amp;gt;&amp;gt; speed things up my nearly an order of magnitude. That would be really&lt;br&gt;
&amp;gt;&amp;gt; great.&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; If you have a minute, could you explain what it is that @() x'*x does&lt;br&gt;
&amp;gt;&amp;gt; and why it makes such a difference, or point me to the right stuff to&lt;br&gt;
&amp;gt;&amp;gt; read?&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; -Dick Startz&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; PS If it matters, I'm running 2007B on a dual processor Windows&lt;br&gt;
&amp;gt;&amp;gt; machine.&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; PPS What's timeit()?&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;timeit is a function from the file exchange,&lt;br&gt;
&amp;gt;written by Steve Eddins. Its a better way to&lt;br&gt;
&amp;gt;time code fragments than using tic and toc.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;The product x'*x uses blas routines to speed&lt;br&gt;
&amp;gt;up the inner product. They are clearly much&lt;br&gt;
&amp;gt;more highly optimized than is the code&lt;br&gt;
&amp;gt;matlab produces for sum(x.^2). Bruno&lt;br&gt;
&amp;gt;pointed out that norm(x)^2 also is fairly fast.&lt;br&gt;
&amp;gt;I assume that it too is well optimized.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;John&lt;br&gt;
&amp;gt;&lt;br&gt;
Thanks again. I got it now.&lt;br&gt;
-Dick Startz&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 01:20:04 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430933</link>
      <author>Tim Davis</author>
      <description>&quot;John D'Errico&quot; &amp;lt;woodchips@rochester.rr.com&amp;gt; wrote in&lt;br&gt;
message &amp;lt;fvq421$p13$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
&amp;gt; The product x'*x uses blas routines to speed&lt;br&gt;
&amp;gt; up the inner product. They are clearly much&lt;br&gt;
&amp;gt; more highly optimized than is the code&lt;br&gt;
&amp;gt; matlab produces for sum(x.^2). Bruno&lt;br&gt;
&amp;gt; pointed out that norm(x)^2 also is fairly fast.&lt;br&gt;
&amp;gt; I assume that it too is well optimized.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; John&lt;br&gt;
&lt;br&gt;
I'd hazard a guess that it's not a BLAS performance&lt;br&gt;
difference.  There really isn't a lot of benefit in&lt;br&gt;
performance of the level-1 BLAS as compared with plain code,&lt;br&gt;
given modern optimizing compilers.&lt;br&gt;
&lt;br&gt;
It's probably because MATLAB is computing z=x.^2, and&lt;br&gt;
storing z as a new vector, internally.  Then it sums it up.&lt;br&gt;
&amp;nbsp;The creation of y takes more memory traffic (8*n bytes&lt;br&gt;
written then read back in).  I would be quite surprised that&lt;br&gt;
MATLAB is smart enough not to form the vector z,&lt;br&gt;
but when it does x'*x it knows not to do that.&lt;br&gt;
&lt;br&gt;
Just a guess, but it's backed up with this experiment:&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; x = rand (1e6,1);&lt;br&gt;
&amp;gt;&amp;gt; tic;y=x'*x;toc&lt;br&gt;
Elapsed time is 0.004829 seconds.&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;y=sum(x.^2);toc&lt;br&gt;
Elapsed time is 0.017520 seconds.&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;z=x.^2;y=sum(z);toc&lt;br&gt;
Elapsed time is 0.018008 seconds.&lt;br&gt;
&lt;br&gt;
And, try this:&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; clear&lt;br&gt;
&amp;gt;&amp;gt; x=rand(200e6,1);&lt;br&gt;
&amp;gt;&amp;gt; tic;y=x'*x;toc&lt;br&gt;
Elapsed time is 0.538544 seconds.&lt;br&gt;
&lt;br&gt;
when the above command was working, the memory usage of&lt;br&gt;
MATLAB didn't go up (I was watching it).&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; &lt;br&gt;
&amp;gt;&amp;gt; tic;y=sum(x.^2);toc&lt;br&gt;
??? Out of memory. Type HELP MEMORY for your options.&lt;br&gt;
&lt;br&gt;
It's cool what you can learn about how MATLAB works&lt;br&gt;
internally just by looking at error messages.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 01:40:03 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430934</link>
      <author>John D'Errico</author>
      <description>&quot;Tim Davis&quot; &amp;lt;davis@cise.ufl.edu&amp;gt; wrote in message &lt;br&gt;
&amp;lt;fvtkg4$foh$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&lt;br&gt;
&amp;gt; I'd hazard a guess that it's not a BLAS performance&lt;br&gt;
&amp;gt; difference.  There really isn't a lot of benefit in&lt;br&gt;
&amp;gt; performance of the level-1 BLAS as compared with plain code,&lt;br&gt;
&amp;gt; given modern optimizing compilers.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; It's probably because MATLAB is computing z=x.^2, and&lt;br&gt;
&amp;gt; storing z as a new vector, internally.  Then it sums it up.&lt;br&gt;
&amp;gt;  The creation of y takes more memory traffic (8*n bytes&lt;br&gt;
&amp;gt; written then read back in).  I would be quite surprised that&lt;br&gt;
&amp;gt; MATLAB is smart enough not to form the vector z,&lt;br&gt;
&amp;gt; but when it does x'*x it knows not to do that.&lt;br&gt;
&lt;br&gt;
Of course. This makes much sense.&lt;br&gt;
&lt;br&gt;
John&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 04:32:02 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430947</link>
      <author>James Tursa</author>
      <description>&quot;John D'Errico&quot; &amp;lt;woodchips@rochester.rr.com&amp;gt; wrote in&lt;br&gt;
message &amp;lt;fvtllj$n9b$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &quot;Tim Davis&quot; &amp;lt;davis@cise.ufl.edu&amp;gt; wrote in message &lt;br&gt;
&amp;gt; &amp;lt;fvtkg4$foh$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; I'd hazard a guess that it's not a BLAS performance&lt;br&gt;
&amp;gt; &amp;gt; difference.  There really isn't a lot of benefit in&lt;br&gt;
&amp;gt; &amp;gt; performance of the level-1 BLAS as compared with plain code,&lt;br&gt;
&amp;gt; &amp;gt; given modern optimizing compilers.&lt;br&gt;
&amp;gt; &amp;gt; &lt;br&gt;
&amp;gt; &amp;gt; It's probably because MATLAB is computing z=x.^2, and&lt;br&gt;
&amp;gt; &amp;gt; storing z as a new vector, internally.  Then it sums it up.&lt;br&gt;
&amp;gt; &amp;gt;  The creation of y takes more memory traffic (8*n bytes&lt;br&gt;
&amp;gt; &amp;gt; written then read back in).  I would be quite surprised that&lt;br&gt;
&amp;gt; &amp;gt; MATLAB is smart enough not to form the vector z,&lt;br&gt;
&amp;gt; &amp;gt; but when it does x'*x it knows not to do that.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Of course. This makes much sense.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; John&lt;br&gt;
&lt;br&gt;
I had thought of the large intermediate array overhead also,&lt;br&gt;
but to confirm this was the only thing going on I made the&lt;br&gt;
following test runs with a 10,000,000 size double array:&lt;br&gt;
&lt;br&gt;
(1) Matrix Multiply, BLAS calls in background&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;x'*x;toc&lt;br&gt;
Elapsed time is 0.028451 seconds.&lt;br&gt;
&lt;br&gt;
(2) Element-by-element multiply, large intermediate array&lt;br&gt;
first, sum function probably has BLAS calls&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;sum(x.*x);toc&lt;br&gt;
Elapsed time is 0.202210 seconds.&lt;br&gt;
&lt;br&gt;
(3) Norm function, BLAS calls in background?&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;norm(x)^2;toc&lt;br&gt;
Elapsed time is 0.088459 seconds.&lt;br&gt;
&lt;br&gt;
(4) Mex routine, straightforward&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;sumsquares(x);toc&lt;br&gt;
Elapsed time is 0.165403 seconds.&lt;br&gt;
&lt;br&gt;
(5) Mex routine, form large intermediate array first&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;sumsquares2(x);toc&lt;br&gt;
Elapsed time is 0.336897 seconds.&lt;br&gt;
&lt;br&gt;
I find two things interesting about these results.&lt;br&gt;
&lt;br&gt;
First, the (4) Mex routine (see below) which did a&lt;br&gt;
straightforward sum of squares without forming a large&lt;br&gt;
intermediate array first was still *much* slower than (1),&lt;br&gt;
indicating that the BLAS calls indeed seem to have a&lt;br&gt;
significant effect on this calculation as John suggested.&lt;br&gt;
Maybe the only difference is keeping the intermediate&lt;br&gt;
results in registers. I don't know, and I didn't try to&lt;br&gt;
force this in my mex routine.&lt;br&gt;
&lt;br&gt;
Second, the difference between (1) and (2), about 0.1738&lt;br&gt;
seconds, was very close to the difference between (4) and&lt;br&gt;
(5), about 0.1715 seconds.  This seems to independently&lt;br&gt;
confirm the time difference is the cost of this intermediate&lt;br&gt;
large array overhead. Probably the sum function uses BLAS&lt;br&gt;
calls as well. This seems to confirm that the difference&lt;br&gt;
between (1) and (2) is due mainly to the large intermediate&lt;br&gt;
array cost as Tim suggested.&lt;br&gt;
&lt;br&gt;
One of the worries I would have with (3) is the potential&lt;br&gt;
loss of precision doing the sqrt and then the square again.&lt;br&gt;
&lt;br&gt;
James Tursa&lt;br&gt;
&lt;br&gt;
&amp;nbsp;------------------------------------------&lt;br&gt;
&lt;br&gt;
(4) Mex routine, straightforward&lt;br&gt;
&lt;br&gt;
#include &quot;mex.h&quot;&lt;br&gt;
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const&lt;br&gt;
mxArray *prhs[])&lt;br&gt;
{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double sumsquares = 0.0;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double *dp;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;int i, n;&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp = mxGetPr( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;n = mxGetNumberOfElements( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for( i=0; i&amp;lt;n; i++ )&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;sumsquares += (*dp) * (*dp);&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp++;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;*mxGetPr( plhs[0] ) = sumsquares;&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
&amp;nbsp;------------------------------------------&lt;br&gt;
&lt;br&gt;
(5) Mex routine, form large intermediate array first&lt;br&gt;
&lt;br&gt;
#include &quot;mex.h&quot;&lt;br&gt;
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const&lt;br&gt;
mxArray *prhs[])&lt;br&gt;
{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mxArray *mx;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double sumsquares = 0.0;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double *dp, *dx;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;int i, n;&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp = mxGetPr( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;n = mxGetNumberOfElements( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mx = mxCreateDoubleMatrix( n, 1, mxREAL );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dx = mxGetPr( mx );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for( i=0; i&amp;lt;n; i++ )&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;*dx++ = (*dp) * (*dp);&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp++;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dx = mxGetPr( mx );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for( i=0; i&amp;lt;n; i++ )&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;sumsquares += *dx++;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;*mxGetPr( plhs[0] ) = sumsquares;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mxDestroyArray( mx );&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 04:48:03 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#430953</link>
      <author>James Tursa</author>
      <description>&quot;James Tursa&quot; &amp;lt;aclassyguywithaknotac@hotmail.com&amp;gt; wrote in&lt;br&gt;
message &amp;lt;fvtvo2$n92$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; I made the&lt;br&gt;
&amp;gt; following test runs with a 10,000,000 size double array:&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; (1) Matrix Multiply, BLAS calls in background&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; &amp;gt;&amp;gt; tic;x'*x;toc&lt;br&gt;
&amp;gt; Elapsed time is 0.028451 seconds.&lt;br&gt;
&amp;gt; &lt;br&gt;
&lt;br&gt;
Since it wasn't too difficult I went ahead and forced the&lt;br&gt;
intermediate values to be register variables using Visual&lt;br&gt;
C++ 8.0 (actually the C compiler that comes with it of&lt;br&gt;
course). The results are:&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;sumsquares3(x);toc&lt;br&gt;
Elapsed time is 0.029241 seconds.&lt;br&gt;
&lt;br&gt;
This result is almost identical to the x'*x method. So maybe&lt;br&gt;
the only trick behind the scenes here to get the speed&lt;br&gt;
improvement is to use register variables for the sum and&lt;br&gt;
loop indexes.&lt;br&gt;
&lt;br&gt;
James Tursa&lt;br&gt;
&lt;br&gt;
---------------------------------------------------&lt;br&gt;
&lt;br&gt;
(6) Mex routine, straightforward using register variables&lt;br&gt;
&lt;br&gt;
#include &quot;mex.h&quot;&lt;br&gt;
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const&lt;br&gt;
mxArray *prhs[])&lt;br&gt;
{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;register double sumsquares = 0.0;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double *dp;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;register int i, n;&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp = mxGetPr( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;n = mxGetNumberOfElements( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for( i=0; i&amp;lt;n; i++ )&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;sumsquares += (*dp) * (*dp);&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp++;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;*mxGetPr( plhs[0] ) = sumsquares;&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 11:58:03 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#431021</link>
      <author>Tim Davis</author>
      <description>It's likely to use loop unrolling as well, such as:&lt;br&gt;
&lt;br&gt;
#include &quot;mex.h&quot;&lt;br&gt;
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const&lt;br&gt;
mxArray *prhs[])&lt;br&gt;
{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double w, x, y, z;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;register double sumsquares = 0.0;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;double *dp;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;register int i, n;&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp = mxGetPr( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;n = mxGetNumberOfElements( prhs[0] );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for( i=0; i&amp;lt;n; i += 4 )&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;w = dp [i] ;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = dp [i+1] ;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;y = dp [i+2] ;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;z = dp [i+3] ;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;sumsquares += w * w + x * x + y * y + z * z ;&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;*mxGetPr( plhs[0] ) = sumsquares;&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
which only works if n is divisible by 4 (I didn't include&lt;br&gt;
the clean up).&lt;br&gt;
&lt;br&gt;
On my linux laptop using gcc (MATLAB 7.6, gcc 4.2.3):&lt;br&gt;
&lt;br&gt;
the code above:&lt;br&gt;
&amp;gt;&amp;gt; tic;f=s3(x);toc&lt;br&gt;
Elapsed time is 0.524779 seconds.&lt;br&gt;
&lt;br&gt;
&amp;gt;&amp;gt; tic;g=x'*x;toc&lt;br&gt;
Elapsed time is 0.543561 seconds.&lt;br&gt;
&lt;br&gt;
your code:&lt;br&gt;
&amp;gt;&amp;gt; tic;e=s1(x);toc&lt;br&gt;
Elapsed time is 0.888125 seconds.&lt;br&gt;
&lt;br&gt;
Loop unrolling is a common trick.&lt;br&gt;
&lt;br&gt;
Don't forget that when you're timing things you need to run&lt;br&gt;
the code at least twice; the first time, the code itself&lt;br&gt;
gets loaded into memory.&lt;br&gt;
&lt;br&gt;
gcc is smart enough not to need &quot;register&quot; statements.  It&lt;br&gt;
ignores them.  Visual C++ is probably not as smart.&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 14:49:05 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#431062</link>
      <author>Andy Robb</author>
      <description>&quot;Tim Davis&quot; &amp;lt;davis@cise.ufl.edu&amp;gt; wrote in message&lt;br&gt;
&amp;lt;fvupsb$4fo$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; It's likely to use loop unrolling as well, such as:&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; #include &quot;mex.h&quot;&lt;br&gt;
&amp;gt; void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const&lt;br&gt;
&amp;gt; mxArray *prhs[])&lt;br&gt;
&amp;gt; {&lt;br&gt;
&amp;gt;     double w, x, y, z;&lt;br&gt;
&amp;gt;     register double sumsquares = 0.0;&lt;br&gt;
&amp;gt;     double *dp;&lt;br&gt;
&amp;gt;     register int i, n;&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt;     dp = mxGetPr( prhs[0] );&lt;br&gt;
&amp;gt;     n = mxGetNumberOfElements( prhs[0] );&lt;br&gt;
&amp;gt;     for( i=0; i&amp;lt;n; i += 4 )&lt;br&gt;
&amp;gt;     {&lt;br&gt;
&amp;gt;         w = dp [i] ;&lt;br&gt;
&amp;gt;         x = dp [i+1] ;&lt;br&gt;
&amp;gt;         y = dp [i+2] ;&lt;br&gt;
&amp;gt;         z = dp [i+3] ;&lt;br&gt;
&amp;gt;         sumsquares += w * w + x * x + y * y + z * z ;&lt;br&gt;
&amp;gt;     }&lt;br&gt;
&amp;gt;     plhs[0] = mxCreateDoubleMatrix( 1, 1, mxREAL );&lt;br&gt;
&amp;gt;     *mxGetPr( plhs[0] ) = sumsquares;&lt;br&gt;
&amp;gt; }&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; which only works if n is divisible by 4 (I didn't include&lt;br&gt;
&amp;gt; the clean up).&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; On my linux laptop using gcc (MATLAB 7.6, gcc 4.2.3):&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; the code above:&lt;br&gt;
&amp;gt; &amp;gt;&amp;gt; tic;f=s3(x);toc&lt;br&gt;
&amp;gt; Elapsed time is 0.524779 seconds.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; &amp;gt;&amp;gt; tic;g=x'*x;toc&lt;br&gt;
&amp;gt; Elapsed time is 0.543561 seconds.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; your code:&lt;br&gt;
&amp;gt; &amp;gt;&amp;gt; tic;e=s1(x);toc&lt;br&gt;
&amp;gt; Elapsed time is 0.888125 seconds.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Loop unrolling is a common trick.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Don't forget that when you're timing things you need to run&lt;br&gt;
&amp;gt; the code at least twice; the first time, the code itself&lt;br&gt;
&amp;gt; gets loaded into memory.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; gcc is smart enough not to need &quot;register&quot; statements.  It&lt;br&gt;
&amp;gt; ignores them.  Visual C++ is probably not as smart.&lt;br&gt;
&lt;br&gt;
With gcc you might want to try -march=athlon64 -mfpmath=sse&lt;br&gt;
as this seems better able to use the pipeline (use could use&lt;br&gt;
-march=pentium4 instead of athlon64).&lt;br&gt;
&lt;br&gt;
Also, bear in mind that once you overflow your cache, main&lt;br&gt;
memory can slow your code down by a factor of 6. Multiply on&lt;br&gt;
modern processors is much faster than main memory and&lt;br&gt;
probably even faster than primary cache. This might go some&lt;br&gt;
way towards explaining why an intermediate array is so slow.&lt;br&gt;
&lt;br&gt;
You might also like to try some of the following:&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;/* comparison to zero can halve loop overhead */&lt;br&gt;
&amp;nbsp;&amp;nbsp;while ((n -= 4) &amp;gt;= 0) {&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* don't bother copying the values - just use them */&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;sumsquares += dp[0]*dp[0] + dp[1]*dp[1] + dp[2]*dp[2] +&lt;br&gt;
dp[3]*dp[3];&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dp += 4;&lt;br&gt;
&amp;nbsp;&amp;nbsp;}&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;/* finish up with trailing values */&lt;br&gt;
&amp;nbsp;&amp;nbsp;for (n = n &amp; 3; --n &amp;gt;= 0;) sumsquares += dp[n] * dp[n];&lt;br&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 08 May 2008 15:30:55 -0400</pubDate>
      <title>Re: Speeding up sum of squares</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/168821#431073</link>
      <author>James Tursa</author>
      <description>&quot;Tim Davis&quot; &amp;lt;davis@cise.ufl.edu&amp;gt; wrote in message&lt;br&gt;
&amp;lt;fvupsb$4fo$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Don't forget that when you're timing things you need to run&lt;br&gt;
&amp;gt; the code at least twice; the first time, the code itself&lt;br&gt;
&amp;gt; gets loaded into memory.&lt;br&gt;
&amp;gt; &lt;br&gt;
&lt;br&gt;
Yes, I did that. I posted typical times after several runs.&lt;br&gt;
&lt;br&gt;
&amp;gt; gcc is smart enough not to need &quot;register&quot; statements.  It&lt;br&gt;
&amp;gt; ignores them.  Visual C++ is probably not as smart.&lt;br&gt;
&lt;br&gt;
Good point. I should have worded it better. The &quot;register&quot;&lt;br&gt;
keyword doesn't actually force the compiler to do anything&lt;br&gt;
special ... it is just a compiler request. For instance, the&lt;br&gt;
built in lcc ignores it and gives the same result with or&lt;br&gt;
without the register keyword.&lt;br&gt;
&lt;br&gt;
James Tursa&lt;br&gt;
&lt;br&gt;
</description>
    </item>
  </channel>
</rss>
