Tuesday, January 15, 2008

Microsoft Parallel Extensions

I've installed Microsoft Parallel Extension 1.0.2873.22603 on my machine recently. I want to try its capabilities, so I've created a small test application.

In general, there are different levels of parallelism: instruction, task, etc. In the example below, I'll try iteration parallelism (a kind of imperative operation parallelism). The .NET Framework should automatically divide the work to run on multicore and multiprocessor systems.

The example

The main idea behind the example is iterating in a nested for loop. The inner loop will increase an integer variable. It will generate processor load and it will take a while to compute the result. The application with basic settings will generate 10,000,000,000 (100,000*100,000) loops.

My machine runs a Windows XP operating system with Physical Address Extension, runs on Intel Core2 CPU T7200 @2.00 GHz and has 1 GB of RAM.

The source

using System;
using System.Threading;

namespace PFXTest1
{
    class Program
    {
        /// <summary>
        /// Iteration start number
        /// </summary>
        const int from = 0;
        /// <summary>
        /// Iteration end number
        /// </summary>
        const int to = 100000;
        /// <summary>
        /// Progress bar scaling
        /// </summary>
        const int progressbar = to / 40;

        /// <summary>
        /// Show progress bar
        /// </summary>
        const bool progress = true;

        /// <summary>
        /// Task start time
        /// </summary>
        static DateTime timerStart;
        /// <summary>
        /// Task finish time
        /// </summary>
        static DateTime timerFinish;
        
        /// <summary>
        /// Main function
        /// </summary>
        /// <param name="args">Application args</param>
        static void Main(string[] args)
        {
            Console.WriteLine("PARALLEL LIBRARY (PFX) TEST");
            Console.WriteLine("Artur Herczeg, http://techies.teamlupus.hu");
            Console.WriteLine();
            
            NormalRun();
            ParallelRunInner();
            ParallelRunOuter();
            ParallelRunBoth();

            Console.ReadKey();
        }

        /// <summary>
        /// Non-parallel code
        /// </summary>
        private static void NormalRun()
        {
            if (progress)
            {
                Console.Write("Normal progress:         [");
                int oldcursor = Console.CursorLeft;
                Console.CursorLeft += to / progressbar;
                Console.Write("]");
                Console.CursorLeft = oldcursor;
            }

            timerStart = DateTime.Now;

            int count = int.MinValue;
            for (int i = from; i < to; i++)
            {
                for (int j = from; j < to; j++)
                {
                    count++;
                }

                if (progress && i % progressbar == 0)
                {
                    Console.Write("=");
                }
            }

            timerFinish = DateTime.Now;

            if (progress)
            {
                Console.WriteLine("]");
            }
            Console.WriteLine("Normal runtime: {0}", timerFinish - timerStart);
        }

        /// <summary>
        /// Inner and outer for loops are both parallel
        /// </summary>
        private static void ParallelRunBoth()
        {
            if (progress)
            {
                Console.Write("Parallel both progress:  [");
                int oldcursor = Console.CursorLeft;
                Console.CursorLeft += to / progressbar;
                Console.Write("]");
                Console.CursorLeft = oldcursor;
            }

            timerStart = DateTime.Now;

            int count = int.MinValue;
            Parallel.For(from, to, i =>
            {
                Parallel.For(from, to, j =>
                {
                    count++;
                });

                if (progress && i % progressbar == 0)
                {
                    Console.Write("=");
                }
            });

            timerFinish = DateTime.Now;

            if (progress)
            {
                Console.WriteLine("]");
            }
            Console.WriteLine("Parallel both runtime: {0}", timerFinish - timerStart);
        }

        /// <summary>
        /// Outer for loop is parallel
        /// </summary>
        private static void ParallelRunOuter()
        {
            if (progress)
            {
                Console.Write("Parallel outer progress: [");
                int oldcursor = Console.CursorLeft;
                Console.CursorLeft += to / progressbar;
                Console.Write("]");
                Console.CursorLeft = oldcursor;
            }

            timerStart = DateTime.Now;

            int count = int.MinValue;
            Parallel.For(from, to, i =>
            {
                for (int j = from; j < to; j++)
                {
                    count++;
                }

                if (progress && i % progressbar == 0)
                {
                    Console.Write("=");
                }
            });

            timerFinish = DateTime.Now;

            if (progress)
            {
                Console.WriteLine("]");
            }
            Console.WriteLine("Parallel outer runtime: {0}", timerFinish - timerStart);
        }

        /// <summary>
        /// Inner for loop is parallel
        /// </summary>
        private static void ParallelRunInner()
        {
            if (progress)
            {
                Console.Write("Parallel inner progress: [");
                int oldcursor = Console.CursorLeft;
                Console.CursorLeft += to / progressbar;
                Console.Write("]");
                Console.CursorLeft = oldcursor;
            }

            timerStart = DateTime.Now;

            int count = int.MinValue;
            for (int i = from; i < to; i++)
            {
                Parallel.For(from, to, j =>
                {
                    count++;
                });

                if (progress && i % progressbar == 0)
                {
                    Console.Write("=");
                }
            }

            timerFinish = DateTime.Now;

            if (progress)
            {
                Console.WriteLine("]");
            }
            Console.WriteLine("Parallel inner runtime: {0}", timerFinish - timerStart);
        }
    }
}

The results

PARALLEL LIBRARY (PFX) TEST
Artur Herczeg, http://techies.teamlupus.hu

Normal progress:         [========================================]
Normal runtime: 00:00:38.1875000
Parallel inner progress: [========================================]
Parallel inner runtime: 00:04:20.3125000
Parallel outer progress: [========================================]
Parallel outer runtime: 00:00:16.2187500
Parallel both progress:  [========================================]
Parallel both runtime: 00:05:12.2187500

Chart: Parallel.For performance

First impressions

There is just one way we can improve code execution speed. We can get better performance only with outer for parallelism. In this case, the inner loop can run on different processors - in my case, on different core. The parallel code runs twice faster than normal code because I've two cores.

If the inner for loop or both for loops are parallel, the code runs for 6-8 times longer. We can see here if we make a bad decision, parallel execution is much more worse than normal code execution.

Deeper look

I've recorded a performance monitor session on the test code.

CPU usage

See Parallel.For CPU usage graph.

The normal execution utilizes just one core, so the CPU utilization is 50%. The parallel inner and outer code takes up to 80%. The test code containing both inner and outer parallelism takes up 100% CPU time.

JIT

See Parallel.For JIT graph.

There is a huge JIT (Just-In-Time compile) activity when we start running a new parallel code. It's not surprising because native images are not generated for parallel assemblies (for System.Threading).

Classes and assemblies

See Parallel.For classes and assemblies graph.

Parallel functionality uses about 40 classes and two assemblies for runtime. No change in application domain number, so the parallel codes run in threads.

Threads

See Parallel.For thread graph.

The number of physical threads increase from 10 to 50 after starting parallel functions. This means the parallel optimizer considers the number of CPUs and the code, and chooses the possibly optimal thread number. It reminds me of the ThreadPool mechanism.

GC

See Parallel.For GC graph.

The sum of bytes in all heaps is about 2 MB after starting parallel code execution. This is almost zero in case of normal code.

Remoting

See Parallel.For remoting graph.

There is no remoting activity (there is just one application domain).

Marshalling

See Parallel.For marshalling graph.

There are regular native calls in parallel library. It will have its impact on portability (will it be possible?).