Make Your Go Programs Lightning Fast With Profiling

Go is often used for writing distributed systems, advanced data stores, and microservices. Performance is key in these domains. 

In this tutorial, you'll learn how to profile your programs to make them lightning fast (utilize the CPU better) or feather light (use less memory). I will cover CPU and memory profiling, using the pprof (the Go profiler), visualizing the profiles, and even flame graphs.

Profiling is measuring the performance of your program in various dimensions. Go comes with great support for profiling and can profile the following dimensions out of the box:

  • a sampling of CPU time per function AND instruction
  • a sampling of all heap allocations
  • stack traces of all current goroutines
  • stack traces that led to the creation of new OS threads
  • stack traces that led to blocking on synchronization primitives
  • stack traces of holders of contended mutexes

You can even create custom profiles if you want. Go profiling involves creating a profile file and then analyzing it using the pprof go tool.

How to Create Profile Files

There are several ways to create a profile file.

Using "go test" to Generate Profile Files

The easiest way is to use go test. It has several flags that allow you to create profile files. Here is how to generate both a CPU profile file and a memory profile file for the test in the current directory: go test -cpuprofile -memprofile -bench .

Download Live Profile Data From a Long-Running Service

If you want to profile a long-running web service, you can use the built-in HTTP interface for providing profile data. Add somewhere the following import statement:

import _ "net/http/pprof"

Now, you can download live profile data from the /debug/pprof/ URL. More info is available in the net/http/pprof package documentation.

Profiling in Code

You can also add direct profiling into your code for complete control. First you need to import runtime/pprof. CPU profiling is controlled by two calls:

  • pprof.StartCPUProfile()
  • pprof.StopCPUProfile()

Memory profiling is done by calling runtime.GC() followed by pprof.WriteHeapProfile().

All the profiling functions accept a file handle that you are responsible for opening and closing appropriately.

The Sample Program

To see the profiler in action, I'll use a program that solves Project Euler's Problem 8. The problem is: given a 1,000-digit number, find the 13 adjacent digits within this number that have the largest product. 

Here is a trivial solution that iterates over all the sequences of 13 digits, and for each such sequence it multiplies all 13 digits and returns the result. The largest result is stored and finally returned:

package trivial


import (




func calcProduct(series string) int64 {

    digits := make([]int64, len(series))

    for i, c := range series {

        digits[i] = int64(c) - 48



    product := int64(1)

    for i := 0; i < len(digits); i++ {

        product *= digits[i]



    return product



func FindLargestProduct(text string) int64 {

    text = strings.Replace(text, "\n", "", -1)

    largestProduct := int64(0)

    for i := 0; i < len(text); i++ {

        end := i + 13

        if end > len(text) {

            end = len(text)


        series := text[i:end]

        result := calcProduct(series)

        if result > largestProduct {

            largestProduct = result




    return largestProduct


Later, after profiling, we will see some ways to improve the performance with another solution.

CPU Profiling

Let's profile the CPU of our program. I'll use the go test method using this test:

import (




const text = `























func TestFindLargestProduct(t *testing.T) {

    for i := 0; i < 100000; i++ {

        res := FindLargestProduct(text)

        expected := int64(23514624000)

        if res != expected {





Note that I run the test 100,000 times because the go profiler is a sampling profiler that needs the code to actually spend some significant time (several milliseconds cumulative) on each line of code. Here is the command to prepare the profile:

go test -cpuprofile -bench .

ok _/ 13.243s

It took a little over 13 seconds (for 100,000 iterations). Now, to view the profile, use the pprof go tool to get into the interactive prompt. There are many commands and options. The most basic command is topN; with the -cum option it shows the top N functions that took the most cumulative time to execute (so a function that takes very little time to execute, but is called many times, can be at the top). This is usually what I start with.

> go tool pprof

Type: cpu

Time: Oct 23, 2017 at 8:05am (PDT)

Duration: 13.22s, Total samples = 13.10s (99.06%)

Entering interactive mode (type "help" for commands)

(pprof) top5 -cum

Showing nodes accounting for 1.23s, 9.39% of 13.10s total

Dropped 76 nodes (cum <= 0.07s)

Showing top 5 nodes out of 53

 flat  flat%   sum%   cum   cum%

0.07s  0.53%  0.53%   10.64s 81.22%  FindLargestProduct

0      0%     0.53%   10.64s 81.22%  TestFindLargestProduct

0      0%     0.53%   10.64s 81.22%  testing.tRunner

1.07s  8.17%  8.70%   10.54s 80.46%  trivial.calcProduct

0.09s  0.69%  9.39%   9.47s  72.29%  runtime.makeslice

Let's understand the output. Each row represents a function. I elided the path to each function due to space constraints, but it will show in the real output as the last column. 

Flat means the time (or percentage) spent within the function, and Cum stands for cumulative—the time spent within the function and all the functions it calls. In this case, testing.tRunner actually calls TestFindLargestProduct(), which calls FindLargestProduct(), but since virtually no time is spent there, the sampling profiler counts their flat time as 0.

Memory Profiling

Memory profiling is similar, except that you create a memory profile:

go test -memprofile -bench .




ok      _/<br>

You can analyze your memory usage using the same tool.

Using pprof to Optimize Your Program's Speed

Let's see what we can do to solve the problem faster. Looking at the profile, we see that calcProduct() takes 8.17% of the flat runtime, but makeSlice(), which is called from calcProduct(), is taking 72% (cumulative because it calls other functions). This gives a pretty good indication of what we need to optimize. What does the code do? For each sequence of 13 adjacent numbers, it allocates a slice:

func calcProduct(series string) int64 {

    digits := make([]int64, len(series))


That's almost 1,000 times per run, and we run 100,000 times. Memory allocations are slow. In this case, there is really no need to allocate a new slice every time. Actually, there is no need to allocate any slice at all. We can just scan the input array. 

The following code snippet shows how to compute the running product by simply dividing by the first digit of the previous sequence and multiplying by the cur digit. 

if cur == 1 {

    currProduct /= old




if old == 1 {

    currProduct *= cur

} else {

    currProduct = currProduct / old * cur



if currProduct > largestProduct {

    largestProduct = currProduct


Here is a short list of some of the algorithmic optimizations:

  • Computing a running product. Suppose we computed the product at index N... N+13 and call it P(N). Now we need to compute the product at index N+1..N+13. P(N+1) equals to P(N) except that the first number at index N is gone and we need to take into account the new number at index N+14T. This can be done by dividing the previous product by its first number and multiplying by the new number. 
  • Not computing any sequence of 13 numbers that contain 0 (the product will always be zero).
  • Avoiding division or multiplication by 1.

The complete program is here. There is some thorny logic to work around the zeros, but other than that it is pretty straightforward. The main thing is that we only allocate one array of 1000 bytes at the beginning, and we pass it by pointer (so no copy) to the  findLargestProductInSeries() function with a range of indexes.

package scan



func findLargestProductInSeries(digits *[1000]byte,

                                start, end int) int64 {

    if (end - start)  < 13 {

        return -1



    largestProduct := int64((*digits)[start])

    for i := 1; i < 13 ; i++ {

        d := int64((*digits)[start + i])

        if d == 1 {



        largestProduct *= d



    currProduct := largestProduct

    for ii := start + 13; ii < end; ii++ {

        old := int64((*digits)[ii-13])

        cur := int64((*digits)[ii])

        if old == cur {




        if cur == 1 {

            currProduct /= old




        if old == 1 {

            currProduct *= cur

        } else {

            currProduct = currProduct / old * cur



        if currProduct > largestProduct {

            largestProduct = currProduct




    return largestProduct




func FindLargestProduct(text string) int64 {

    var digits [1000]byte

    digIndex := 0

    for _, c := range text {

        if c == 10 {



        digits[digIndex] = byte(c) - 48




    start := -1

    end := -1

    findStart := true

    var largestProduct int64

    for ii := 0; ii < len(digits) - 13; ii++ {

        if findStart {

            if digits[ii] == 0 {


            } else {

                start = ii

                findStart = false




        if digits[ii] == 0 {

            end = ii

            result := findLargestProductInSeries(&digits,



            if result > largestProduct {

                largestProduct = result


            findStart = true




    return largestProduct


The test is the same. Let's see how we did with the profile:

> go test -cpuprofile -bench .


ok      _/   0.816s

Right off the bat, we can see that the runtime dropped from more than 13 seconds to less than one second. That's pretty good. Time to peek inside. Let's use just top10, which sorts by flat time.

(pprof) top10

Showing nodes accounting for 560ms, 100% of 560ms total

flat  flat%   sum%     cum   cum%

290ms 51.79%  51.79%   290ms 51.79%  findLargestProductInSeries

250ms 44.64%  96.43%   540ms 96.43%  FindLargestProduct

20ms  3.57%   100%     20ms  3.57%   runtime.usleep

0     0%      100%     540ms 96.43%  TestFindLargestProduct

0     0%      100%     20ms  3.57%   runtime.mstart

0     0%      100%     20ms  3.57%   runtime.mstart1

0     0%      100%     20ms  3.57%   runtime.sysmon

0     0%      100%     540ms 96.43%  testing.tRunner

This is great. Pretty much the entire runtime is spent inside our code. No memory allocations at all. We can dive deeper and look at the statement level with the list command:

(pprof) list FindLargestProduct

Total: 560ms

ROUTINE ======================== scan.FindLargestProduct

250ms      540ms (flat, cum) 96.43% of Total

 .          .     44:

 .          .     45:

 .          .     46:func FindLargestProduct(t string) int64 {

 .          .     47:    var digits [1000]byte

 .          .     48:  digIndex := 0

70ms       70ms   49:  for _, c := range text {

 .          .     50:    if c == 10 {

 .          .     51:      continue

 .          .     52:    }

 .          .     53:    digits[digIndex] = byte(c) - 48

10ms       10ms   54:    digIndex++

 .          .     55:  }

 .          .     56:

 .          .     57:  start := -1

 .          .     58:  end := -1

 .          .     59:  findStart := true

 .          .     60:  var largestProduct int64

 .          .     61:  for ii := 0; ii < len(digits)-13; ii++ {

10ms       10ms   62:    if findStart {

 .          .     63:      if digits[ii] == 0 {

 .          .     64:        continue

 .          .     65:      } else {

 .          .     66:        start = ii

 .          .     67:        findStart = false

 .          .     68:      }

 .          .     69:    }

 .          .     70:

70ms       70ms   71:    if digits[ii] == 0 {

 .          .     72:      end = ii

20ms      310ms   73:      result := f(&digits,start,end)

70ms       70ms   74:      if result > largestProduct {

 .          .     75:        largestProduct = result

 .          .     76:      }

 .          .     77:      findStart = true

 .          .     78:    }

 .          .     79:  }

This is pretty amazing. You get a statement by statement timing of all the important points. Note that the call on line 73 to function f() is actually a call to findLargestProductInSeries(), which I renamed in the profile due to space limitations. This call takes 20 ms. Maybe, by embedding the function code in place, we can save the function call (including allocating stack and copying arguments) and save those 20 ms. There may be other worthwhile optimizations that this view can help pinpoint.


Looking at these text profiles can be difficult for large programs. Go gives you a lot of visualization options. You'll need to install Graphviz for the next section.

The pprof tool can generate output in many formats. One of the easiest ways (svg output) is simply to type 'web' from the pprof interactive prompt, and your browser will display a nice graph with the hot path marked in pink.




Flame Graphs

The built-in graphs are nice and helpful, but with large programs, even these graphs can be difficult to explore. One of the most popular tools for visualizing performance results is the flame graph. The pprof tool doesn't support it out of the box yet, but you can play with flame graphs already using Uber's go-torch tool. There is ongoing work to add built-in support for flame graphs to pprof.


Go is a systems programming language that is used to build high-performance distributed systems and data stores. Go comes with excellent support that keeps getting better for profiling your programs, analyzing their performance, and visualizing the results. 

There is a lot of emphasis by the Go team and the community on improving the tooling around performance. The full source code with three different algorithms can be found on GitHub.