Practical statistics for shooters

6/30/2017

How many shots do you need to fire? How do you know if a result was just random luck? Have you given up on testing because it's too hard to make an improvement?

All of these questions have answers. Last week I built a statistics calculator, but there were no instructions. This article concludes the series on statistics with an explanation of my iterative load development philosophy, and how to practically apply simple statistics to make sense of your test results.

Four random 5-shot groups from the same population

If you haven't read my previous articles, this one will be very helpful to read first.

Most of the time, we shoot groups of 3 to 5 shots each. Intuitively, we know that one group doesn't mean much, so we look at the big picture to understand if there is a conclusion to be drawn. We observe many groups at the same time, and try to get a feeling for whether they seem 'good' or 'bad'.

We can't put a number on a feeling. Our brains are wired to make simple, emotional, yes or no decisions. In a split second, we evaluate what we see, form an opinion, and then resist changing it. This is great for human survival (and makes for lively forum threads), but not so useful for tuning a rifle. To get the most from your efforts, you must train yourself to think objectively.

Process of elimination

The key to finding a good load before you burn out your barrel (or your patience) is iterative testing. Try something, make a statistical conclusion from that test, and move on to try something else. If you can measure powder and seat bullets at the range, you can work through the entire process in one day. The cost of portable reloading gear will pay for itself almost immediately.

The first step is to identify what doesn't work. If you fire 5 shots and the group is big, you can pretty confidently rule out this load. The ES, SD, and your intuition would all agree - better steer clear of this one.

Why? It's only 5 shots! The reasoning is that the relation between a sample and its population is asymmetric. It's more likely that a bad load will produce a small group than it is for a good load to produce a large group. We exploit this and it's why iteration works.

The 90% confidence interval for a 5-shot group runs from -35% to +137% of the measured value. This diagram shows how a large 5-shot group is very likely to be bad, so you can rule it out. However, a small 5-shot group could be good or average. It's not as easy to prove a load as it is to disprove it.

5-shot 90% confidence intervals for a large and small group.

We must speak in terms of probabilities. No test result is definite. If you have one large 5-shot group, it is within the realm of possibility that those 5 shots were extremely unlucky and the best load for your rifle is hidden within it. However, statistically, the chance of that is low, and you would have a greater return on investment trying something else.

If you've tried everything you can think of, and all you have is a paper target full of big groups, then it's time to take a second look at everything else. Are you shooting well? Is your scope tightly mounted? Is there something wrong with your barrel? Try a different bullet / primer / powder. Go back to the basics. It only takes one problem, and it's better to find it sooner than later.

To be clear, this process is not contrary to ladder testing, OCW, or any other load development technique. All data is just data, and it should be analyzed statistically. The objective with iterative testing is to only keep testing a load that has not yet already been ruled out. This is a simple matter of efficiency.

Go ahead and fire some groups, and use the ES to judge them. Try whatever comes to mind. There's nothing wrong with this. You are just looking for promising leads. When you think you've found something good, fire a couple more of the same load and see if your fortune repeats itself. Then proceed with verification.

Measuring a result with confidence

Once you happen across a load that appears promising, don't stop there. This is a mistake we make all too often. If you don't consider statistical confidence, you are setting yourself up for a disaster, or worse, a year's worth of mediocre performance that could so easily be avoided. Trust me... I speak from experience on both counts, and it's a hard lesson.

This is what NOT to do.

This is where we break out the statistics. The objective is to put a number on the chance that this load will work again in the future. For that, we calculate an SD and a confidence interval, which you can read about in my previous article. The more shots in the sample, the smaller the confidence interval, and the more likely the sample SD is to match the true SD of the load.

SD vs ES is an age-old debate. You may be surprised at my position on this. I'll use whatever method gets the job done. ES, SD, and confidence intervals are tools in the toolbox and value comes from knowing which to use when.

Here's a rule of thumb:

Use extreme spread to rule out a bad load with 5 shots.
Use standard deviation to prove a good load with confidence.

How many shots do you need? Well, you need to choose a confidence level that you are happy with. If you choose 85% confidence, then you are accepting that there's an 85% chance that the true performance of the rifle is within the interval we are about to calculate. It could be worse, or it could be better. The confidence level you are comfortable with is a personal trade-off between accepting some risk that your results are not accurate vs. investing more time and money to keep testing.

Select a confidence level that matches your risk tolerance.

Measuring group size

For group dispersion, we are interested in measuring the distance of each and every shot from the natural center of the load. You can do this at home, with a ruler. The natural center is not your point of aim, or the center of each group, but the center of an imagined overlay group that includes all shots at that load. The distances from this point to each shot, in MOA, are your data points. Maybe someday there'll be an app for that (wink wink).

Find the distance of each point from natural center.
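As a rough sketch of the measurement described above, the snippet below pools every shot fired with one load, takes the centroid as the natural center, and lists each shot's distance from it. The coordinates are made-up values in MOA and the function name is only illustrative. It also assumes the targets have already been lined up on their points of aim so that all shots share one coordinate frame.

```python
import math

def radial_distances(shots):
    """Return (center, distances) for a list of (x, y) shot positions in MOA."""
    n = len(shots)
    center_x = sum(x for x, _ in shots) / n
    center_y = sum(y for _, y in shots) / n
    # Distance of every shot from the centroid of the whole overlay.
    distances = [math.hypot(x - center_x, y - center_y) for x, y in shots]
    return (center_x, center_y), distances

# Hypothetical overlay of two 3-shot groups fired with the same load (MOA).
overlay = [(0.10, 0.20), (-0.15, 0.05), (0.05, -0.25),
           (0.20, -0.10), (-0.10, 0.15), (-0.10, -0.05)]

center, distances = radial_distances(overlay)
print("natural center:", tuple(round(c, 3) for c in center))
print("radial distances:", [round(d, 3) for d in distances])
```

The list of distances is the data set for the load, regardless of which group each shot was fired in.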

In practice, I don't recommend measuring each and every bullet hole of every group you fire to the millimeter. This is what I did with Damon Cali's data, because I had 35 essentially random groups to analyze and I could not comprehend it otherwise. If you test iteratively, you will narrow in on a good load, and you may not need such a thorough process at the verification stage.

What I do recommend is visualizing overlay groups. As I demonstrated through this experience, an overlay group allows you to view all the data without the bias of which shots were in which group. If you can construct these quickly, it can save you from heading in the wrong direction. Understand that groups are normally distributed, and that you should expect a tight cluster in the center. Good overlays will look a lot better than bad overlays, and it's easier to tell the difference by eye with 20+ shots.

Groups in overlay are more circular than they appear.

With an overlay group, you can estimate the SD as 1/4 of the extreme spread, as long as the group appears to be roughly circular and normally distributed. If you have a flier way outside the group, this would skew the ES more so than the SD so it should be weighted less. While crude, this will allow you to calculate confidence intervals.
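The ES/4 shortcut can be sketched in a few lines. This is only the rule of thumb from the paragraph above, not a proper SD calculation, and the shot coordinates and function names are hypothetical.

```python
import itertools
import math

def extreme_spread(shots):
    """Largest center-to-center distance between any two shots."""
    return max(math.dist(a, b) for a, b in itertools.combinations(shots, 2))

def estimate_sd_from_es(shots):
    """Crude SD estimate: one quarter of the extreme spread (rule of thumb)."""
    return extreme_spread(shots) / 4.0

# Hypothetical overlay group, coordinates in MOA.
overlay = [(0.0, 0.0), (0.3, 0.4), (0.1, 0.1), (0.2, 0.3)]

print("ES:", round(extreme_spread(overlay), 3))
print("estimated SD:", round(estimate_sd_from_es(overlay), 3))
```

If one shot is an obvious flier, it may be worth running the estimate with and without it to see how much the ES depends on that single hole.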

Once you have an SD and a number of shots, even if it's rough, you have everything you need to calculate a confidence interval. This interval, at your given confidence level, tells you the range of SD that this load is actually within. The true performance is a single number; you just can't know it exactly. If you fired more shots, you would estimate it more accurately. How many shots you need depends on how small you would like that confidence interval to be.

Based on this sample, the true SD of the load itself is probably between 0.12 and 0.21.
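For readers who want to check an interval by hand, here is a minimal sketch of the chi-square confidence interval for an SD. The lower bound follows the formula Adam gives in the comments below; the upper bound is the usual counterpart and is an assumption about what the calculator does. The chi-square inverse is a simple series-plus-bisection routine written for this example, not the calculator's own code.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))

def chi2_inv(p, df):
    """Invert the CDF by bisection (slow but simple)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sd_confidence_interval(sample_sd, n, confidence):
    """Return (lower, upper) bounds on the true SD for a sample of n shots."""
    alpha = 1.0 - confidence
    df = n - 1
    lower = math.sqrt(df * sample_sd ** 2 / chi2_inv(1.0 - alpha / 2.0, df))
    upper = math.sqrt(df * sample_sd ** 2 / chi2_inv(alpha / 2.0, df))
    return lower, upper

# With a sample SD of 1.0 the bounds read directly as ratios of the measured SD.
low5, high5 = sd_confidence_interval(1.0, 5, 0.90)
print(f"5 shots, 90%: {low5 - 1:+.0%} to {high5 - 1:+.0%}")

# The case discussed in the comments: SD 6.48, 20 shots, 90% confidence.
low20, high20 = sd_confidence_interval(6.48, 20, 0.90)
print(f"20 shots, 90%: {low20:.2f} to {high20:.2f}")
```

Run with 5 shots at 90% confidence, this should land close to the -35% and +137% figures quoted earlier for a 5-shot group.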

Measuring velocity variation

Proper statistical measurement is much more important for shot velocity. At 600 yards and beyond, velocity variation dominates, and reducing your velocity SD will have a significant impact on your performance. A tight group is nice to have, but at long range, it pales in comparison to your velocity variation.

Dispersion is constant, but velocity effect grows with distance

Minor changes in your load can have an impact on your velocity SD. Every time I go to the range, I record my velocities, and I may notice the SD has been creeping up over the past few months. This prompts me to think about what may be different now - whether it's a new lot of powder, the cases are getting old, or a temperature shift in the seasons. Managing your SD requires maintenance, and it's time well spent.

The Two-Box Chrono was designed specifically for this purpose. It reduces random error to insignificant levels. Less error means a lower SD, and a smaller confidence interval. You'll notice more consistent SD measurements day to day, detect changes in your SD sooner, and be able to observe the effect of changing something with fewer shots fired (as I will show below).

To calculate the SD of a sample, simply plug the numbers from the chronograph into the calculator. The sample SD is the actual SD of this group, while the confidence interval is the likely range of SD of your load itself (given only this sample as input).
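If you would rather compute the sample SD yourself, Python's standard library is enough. The velocities below are hypothetical readings, used only to show the arithmetic.

```python
import statistics

# Hypothetical chronograph readings for one string, in fps.
velocities = [2801, 2795, 2810, 2798, 2806]

average = statistics.mean(velocities)
sample_sd = statistics.stdev(velocities)        # sample SD (n - 1 in the denominator)
extreme_spread = max(velocities) - min(velocities)

print(f"average {average} fps, SD {sample_sd:.2f} fps, ES {extreme_spread} fps")
```

Note that `statistics.stdev` is the sample SD; `statistics.pstdev` divides by n instead and will read slightly lower on short strings.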

If you have more data, you can shrink your confidence interval and get a better estimate of your true load SD. The most important takeaway is to understand that the SD has a confidence interval in the first place. Just because you fire 20 shots and measure an SD doesn't mean you will get the same SD next time.

Getting more data for free

Suppose you fire ten 5-shot groups, all at different powder charges and seating depths. 3 of the groups are good, so you repeat them. Now you have 65 shots on paper, but only 5 or 10 shots of any one scenario to analyze.

With only 10 shots at a given load, you can't really draw a statistical conclusion, because the confidence intervals are too large. However, you have fired 65 shots, and if they could all be considered together, that would be plenty of data to give you some confidence.

The problem is that the shots were not all fired with the same load. However, if we apply some assumptions, we can go ahead and group them, and take advantage of the combined data.

Many different powder charges, grouped by regions of seating depths.

As I mentioned in an earlier post, I operate with a working assumption that seating depth will not affect velocity SD, and powder charge will not affect group size, within reasonable limits. It's just a theory, but it hasn't let me down yet.

With this assumption, you can combine all the data at one powder charge, regardless of seating depth, into a measurement of velocity SD. You can also consider all the groups at one seating depth as the same group, and overlay them visually. Remember that the assumption is just a theory, but go ahead and take advantage of the free statistics if you think it helps you iterate towards the best load.

This working model also allows you to focus your efforts. If your groups are large, try changing seating depth, not powder charge. If your velocity SD is large, try changing powder charge, not seating depth. If you try 10 different powder charges all at the same seating depth, you may end up with 10 bad groups and all that data is obsolete as soon as you realize seating depth was the problem. This is what happened to Damon Cali, and also what happened to me.

There's another trick to combining data. As you increase powder charge, velocity will increase predictably. For me, it's about 50 fps / grain. If you fire groups at slightly different powder charges, you can combine them into a single sample as long as you adjust the data accordingly.

Left: groups as fired. Right: same data grouped into 0.5 grain buckets. Showing 80% confidence intervals.

For example, to combine a group fired at 44.0 grains with one fired at 44.2, I might add 5 fps to the first group and subtract 5 fps from the second. This would provide twice as many data points representing 44.1, and allow higher confidence on that measurement.
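The adjustment above is a straight-line shift, sketched below. The 50 fps per grain slope is the author's own figure for his rifle, and the velocities and the function name are made up for illustration; the sketch assumes velocity is linear in charge over the small range being combined.

```python
FPS_PER_GRAIN = 50.0  # the author's figure; measure your own

def normalize_velocities(groups, target_charge, fps_per_grain=FPS_PER_GRAIN):
    """Shift each group's velocities to what they would read at target_charge.

    groups maps powder charge (grains) to a list of velocities (fps).
    Returns one pooled list of adjusted velocities.
    """
    pooled = []
    for charge, velocities in groups.items():
        shift = fps_per_grain * (target_charge - charge)
        pooled.extend(v + shift for v in velocities)
    return pooled

# Hypothetical groups at 44.0 and 44.2 grains, combined to represent 44.1.
groups = {44.0: [2790.0, 2796.0, 2785.0], 44.2: [2803.0, 2799.0, 2808.0]}
pooled = normalize_velocities(groups, target_charge=44.1)
print([round(v, 1) for v in pooled])
```

The pooled list can then be treated as a single six-shot sample when calculating the SD.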

Making statistical improvements

Up to this point, we have focused on measuring individual samples. Measuring improvements, on the other hand, is about comparing two samples. We have two samples, and we need to know if they are statistically different. More precisely, we need to know how likely it is that the two samples came from different populations.

If you fire two groups, they will be different. Always. The question is how different. Are they different enough? Would that difference be repeatable? Is it enough to warrant a change in the load?

With statistical testing, we can ask questions like:

Is 45.0 grains better than 44.0?
Is 30 jump better than 10 jump?
Should I neck size or full length size?
Does this primer produce a smaller SD than that primer?
Is this new lot of powder hotter than the old one?

To compare the averages of two samples, we use the T-test. To compare the variation between two samples, we use the F-test. These are magical formulas that are quite complicated, and I only understand enough to build a calculator that gives the right answer.

To compare two samples, you need two things:

Lots of data in each sample.
A large relative difference between them.

You need a lot of data. Beg, borrow, and steal data from other groups. For example, if you shoot a ladder at 10 different charges, you may only have 5 shots at each charge, and you can make no statistical comparisons considering each group as independent. However, if you combine data, you may be able to say that everything from 44 to 45 is better than from 45 to 46.

Here's an example. Suppose I fire 10-shot groups at 44 and 45 grains. I measure an SD of 6 for one and 8 for the other. Does this represent a significant improvement?

Answer: No. There's only a 59.6% chance these groups are from different populations, so we've learned very little. Two random 10-shot groups with SDs of 6 and 8 would occur fairly often even from the same load. The confidence intervals overlap. We need more data.

Now we can ask the question, how many shots do we need to fire to prove such a difference with 90% confidence?

Answer: 35 shots for each group. If you relax your confidence level to 75%, you can get away with 18 shots per group, but you are gambling. It's a question of return on investment. How much is more confidence worth to you? You might get lucky, or you might find yourself back at square one next week.
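The article does not show the calculator's internals, so the following is only a sketch of how such numbers could be produced. It assumes a two-sided F-test on the ratio of the two variances and reports one minus the p-value as the "chance the groups are from different populations". The F distribution is evaluated with a plain Simpson's-rule integral so that nothing beyond the standard library is needed; all function names are hypothetical.

```python
import math

def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta I_x(a, b) by Simpson's rule (needs a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def density(t):
        if t == 0.0:
            return math.exp(log_norm) if a == 1.0 else 0.0
        return math.exp(log_norm + (a - 1.0) * math.log(t)
                        + (b - 1.0) * math.log(1.0 - t))

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * density(i * h)
    return total * h / 3.0

def f_test_confidence(sd_a, n_a, sd_b, n_b):
    """1 - (two-sided F-test p-value) for two sample SDs and shot counts."""
    if n_a < 3 or n_b < 3:
        raise ValueError("this sketch needs at least 3 shots per group")
    f = (sd_a / sd_b) ** 2
    d1, d2 = n_a - 1, n_b - 1
    # F CDF expressed through the incomplete beta function.
    cdf = beta_cdf(d1 * f / (d1 * f + d2), d1 / 2.0, d2 / 2.0)
    return 1.0 - 2.0 * min(cdf, 1.0 - cdf)

def shots_needed(sd_a, sd_b, confidence, max_n=2000):
    """Smallest equal group size at which the two SDs reach the confidence level."""
    for n in range(3, max_n + 1):
        if f_test_confidence(sd_a, n, sd_b, n) >= confidence:
            return n
    return None

conf_10 = f_test_confidence(6, 10, 8, 10)
n_90 = shots_needed(6, 8, 0.90)
n_75 = shots_needed(6, 8, 0.75)
print(f"10 shots each: {conf_10:.1%}")   # article reports 59.6%
print(f"shots per group for 90%: {n_90}")  # article reports 35
print(f"shots per group for 75%: {n_75}")  # article reports 18
```

The same two functions can be pointed at any other pair of SDs to see how quickly the required shot count grows as the difference shrinks.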

Using a chronograph to improve velocity SD

Suppose I've been shooting all year and my elevation is pretty good, but maybe it could be better. Maybe with another day at the range, I could tweak the powder charge, or try small primers. My SD has been around 7 all year, so as a goal, I hope to measure a 16% improvement in SD, from 7 to 6.

I'd like to have 90% confidence in the result. After all, I plan to shoot about 800 rounds of this load in July and August alone (3 provincial matches plus the Nationals and Worlds for F-class). It might take a full day to perform this test, so I'll aim to only have to do this once.

Is it worth my time to try?

We can predict what it would require to find this result, even before going to the range. Let's make sure we bring enough ammo so that it's even possible to achieve this. Otherwise it's a doomed exercise.

So the question to ask the stats calculator: if I fire two groups, and one measured an SD of 6, and the other 7, how many shots must those groups have to be 90% confident the difference is real?

The answer: 116 shots. From each group. Well that's not realistic. It kind of puts things in perspective when you look at it that way. I don't really feel like shooting 232 shots just on a hunch that I can make an improvement.

So therein lies the problem facing many long range shooters. You find a decent load quickly, but making an improvement is statistically very difficult. You can try to get lucky again, but it's a cycle of trial and error.

Now let's consider the chronograph. Random error increases the SD of both groups, decreasing the relative difference between them, and therefore requiring more data to make a comparison. With a more precise chrono, maybe we can ease the pain.

Let's say I was using a chronograph with an inherent SD of 3.5 fps, which is a reasonable number based on Applied Ballistics' testing of the Magnetospeed, Chrony, and others. Most shots would be within +/- 7 fps.

Table of results from Applied Ballistics published testing on chronographs

The random error SD of 3.5 actually means my observed SD of 6 would have to come from a load with a true SD of 4.87, because independent errors combine as the square root of the sum of squares: sqrt(6^2 - 3.5^2) = 4.87. The ammo is always a little better than the chronograph says it is, because the chronograph can only add error. The same error would also increase a true SD of 6.06 to 7. That means, to observe the same 16% improvement at the chrono (7 to 6), my ammo would actually have to improve by 24% (6.06 to 4.87)!

If I had a perfect chrono with no random error, I would need to fire only 59 shots of each group to see the 16% improvement I am hoping for, not 116. That's still a lot, but it's half the shooting.

Now let's say we used the Two-Box Chrono with an error SD of 0.5 fps. That same load with true performance of 4.87 would be observed as 4.90, and a load at 6.06 is observed as 6.08. The number of shots required in this case would be 60. Just one more shot.
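The chronograph arithmetic in this section can be checked with a short sketch. It assumes the chronograph's random error is independent of the ammunition's variation, so the variances add; the article's numbers are consistent with that. The function names are illustrative only.

```python
import math

def true_sd(observed_sd, chrono_error_sd):
    """SD of the ammunition alone, given the observed SD and the chrono's error SD."""
    return math.sqrt(observed_sd ** 2 - chrono_error_sd ** 2)

def observed_sd(ammo_sd, chrono_error_sd):
    """SD the chronograph would report for ammunition with the given true SD."""
    return math.sqrt(ammo_sd ** 2 + chrono_error_sd ** 2)

# Chronograph with 3.5 fps of random error: what load produces a reading of 6 or 7?
print(round(true_sd(6.0, 3.5), 2), round(true_sd(7.0, 3.5), 2))

# Chronograph with 0.5 fps of random error: what would those loads read?
print(round(observed_sd(4.87, 0.5), 2), round(observed_sd(6.06, 0.5), 2))
```

Because the errors add as squares, a small error SD almost disappears next to the ammunition's own SD, which is why the 0.5 fps case barely moves the readings.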

It's hard to make an improvement in a good load. The relative differences are small and a lot of shooting is required. Fine tuning is possible, but only with a good chronograph and an understanding of the statistics that are controlling your fate.

Play with the calculator to get a feel for what difference in SD with what number of shots will give you a positive result at the confidence level you are comfortable with. You'll see there are realistic scenarios that produce 50% confidence or less, where you'd be better off flipping a coin.

In conclusion...

When I learned how to use the F-test I knew it was the key to making sense of the madness, to avoid going to the range and coming home with a non-result. It completely changed my perspective on how to approach and plan tests.

Any test where variance is measured and compared, where confidence is not considered, could be very misleading. About 99.9% of information you find online falls into this category. Now you know how to make sense of whether it's really meaningful.

I used to load 10 or 20 shots of two, three or four different scenarios to compare. The results seemed significant at the time, and I thought I was learning things about reloading that I couldn't find answers to elsewhere. I've punched thousands of holes in paper and filled two books with test results.

Now I know that all I was doing was learning how little I knew about what I was doing.

Now, I go to the range with a plan that has a reasonable chance of success. I test iteratively, looking for quick clear answers most of the time, but knowing that easy answers are just suggestions. I never say anything for sure until I'm ready to test it properly. I limit my extensive testing to when I have a very specific goal in mind.

Keep it simple. I only care about my group size and SD for one rifle, for one summer, with one powder, one bullet, and one primer. My reloading procedure has been basically fixed for 2 years. I focus my energy on charge and jump and keep everything else constant. Otherwise... the madness will return.

Wind reading and shooting strategy are also important. They are something I focus on at different times. Load development is homework. The more you put into it, the easier it is to follow the wind (because the rifle is more accurate), the easier it is to learn the flags, and of course the fewer points you lose to elevation.

Please feel free to post any questions in the comments. Good shooting!

9 Comments

Ben Winget

7/3/2017 08:35:00 am

Check out the On Target software. It makes it very easy to measure your groups and gives you an average-to-center measurement, along with the group size and the horizontal and vertical measurements.
https://ontargetshooting.com/

Brian Woolley

10/28/2018 06:13:10 pm

I was interested in your comments under the heading "Measuring Velocity Variation" and the graph Effect of Error on POI.
I think a factor often overlooked is the variation of BC between individual bullets due to very minor differences in shape and dimensions. Using LabRadar's multi-distance velocity data it is possible to back-calculate an individual bullet's BC. Using 22 rimfire target ammo I have been surprised at the variation in BC. This may overshadow the initial velocity variation at long range.
I agree with your comments about SD of velocity having a confidence level.
I did 4 details all with the same ammo.
Detail No   Std Dev   Spread   Ave Velocity
1           8.98      48.6     1093
2           5.54      23.5     1092
3           7.1       32.1     1093
4           6.35      25.3     1093

As you can see, the average velocity over 25-30 rounds per detail was very consistent, but I was surprised by the variation in the Std Dev.
Thanks for your articles. Regards
Brian

Ron Reese

5/18/2019 01:59:35 pm

I realize this has been posted a while, but I plugged your SD=6.48, your 0.1, and sample size of 20 into Excel and my confidence limits for SD calculated to 4.10 - 8.86, rather than the 5.14 - 8.88 you show. The 8.88 is close enough, but could one of us have done something wrong to get 4.10 vs 5.14 for the low SD?

Great post BTW!

Adam

5/18/2019 11:23:44 pm

I'm not sure, but here's the equation, feel free to check and see if it matches your Excel function. It's been a long time since I did this.

sd_lower = Math.sqrt((sample.count-1)*sample.stdev*sample.stdev/j$.chisquare.inv(1-alpha/2, sample.count-1));

Ron Reese

5/19/2019 07:04:05 pm

Thanks Adam.

I know exactly what, "It's been a long time..." means. I get asked questions on my YouTube channel about a piece of gear, etc. I posted a long time ago, and I can't always remember what was what. :)

Kinda funny that we calculate the confidence two different ways and the upper level agrees to the third decimal and the lower turns out different???

Not sure it's worth pursuing, if I did, the next question would be which way is more accurate, so I'll go by Excel simply because it does the "dirty" work for me. :)

Scott Strehle

11/16/2019 01:21:52 pm

Thank you, Adam, for your information. I have been reading through your papers here and am learning with each one. I was wondering if you had ever come back to testing the accuracy of the LabRadar chronograph. It wasn't listed in the chronographs tested and I was wondering if there was any new insight. I bought one last year and have been working on loads with it. In an effort to try and get my ES and SD numbers down I recently ordered an AutoTrickler and am patiently awaiting its arrival to see if that is my problem. Or am I just chasing my tail with poor chronograph data? Thanks, Scott

Cal Zant

5/25/2020 04:29:33 pm

Hey, Adam. My name is Cal Zant, and I'm the author of PrecisionRifleBlog.com. I just wanted to say how well written and helpful this was. I have read a dozen of these kinds of articles that try to make statistics approachable and applicable to the average shooter (which I 100% agree is a worthy endeavor), and I have to say that you have done the best job BY FAR! The calculator is insanely helpful, your examples are clear, and I appreciate your pragmatic approach and how you didn't get derailed or distracted by the really technical aspects of this. Really well done.

Thanks,
Cal

Maya W

11/29/2020 11:31:58 pm

Thanks for writing this

Os Tek

12/15/2020 11:41:21 am

Has anybody tried using experiment design, "Latin Square" ?

Great work, will make me go back to my statistics book from a very long time ago.

Stay safe,
OT

Who am I?

Adam MacDonald: Canadian FTR shooter, inventor, problem solver.

With this blog I will share my experiences with load development, shooting strategy, and development of new products.
