Tuesday, January 15, 2013

Combinations, and the issue of repetition

(note - I'll try my best to keep the explanations easy to understand and will provide visuals where needed, but this post will end up being pretty technical nonetheless.  I also divorce myself from any responsibility for resulting boredom on your part from here on out.)

A combination is a set of objects that can be identified by its constituents.  An easy way to think about combinations is simply by thinking of words: the word "PARALLEL" is a combination of letters, those letters being 1 P, 2 As, 1 R, 3 Ls, and 1 E.  Any anagram of "PARALLEL" (such as "ALL PEARL") still contains those exact letters and numbers of those respective letters; it's a different order – a different permutation – but not a different combination, where order does not matter.

In my Intro to Probability statistics course I'm taking right now, we learned how to calculate the number of combinations of size k that you can make from n number of objects, when all of those objects are distinct (such as, 7-letter 'words' from the letters A, B, C, D, E, F, G, H, I, J, and K).  To do this we need to know first how many permutations we can form: how many different orders of the same choices can be made?  That way we can then remove all of the choices that have identical letters.

In our example we have 11 letters, and we want to choose 7 of them.  I have 11 options to start, for the first letter in my word.  Once I make that choice, I now have 10 letters left to choose from; I would then have 9, then 8, etc.  And then I'm all out of letter space in my word; my total number of possible outcomes – of possible permutations – is equal to 11*10*9*8*7*6*5; the equation for permutations is n factorial (n! = n*(n–1)*(n–2)*...*2*1) divided by (nk) factorial:


To find the number of combinations, we remove the number of repeats: the number of ways that these 7-letter words can be arranged.  This is equal to 7 factorial (k!), and so more generally:


This is an easy equation to use in the case where the n objects we're choosing from are all distinct. However, it becomes much more complicated when there is repetition in the objects.

Instead of the letters A through K, maybe we have A A A A A B B B B C C (5 As, 4 Bs, 2 Cs).  We didn't cover such a scenario in class, since the answer depends on concepts we haven't yet learned about (and I'm of the opinion we likely won't learn this answer at all).  In fact, this problem is really bad to approach from the framework that it can be treated as an entire combination problem.

Categories of letters

Let's rephrase this question: what we're essentially trying to figure out is how many ways can we add As, Bs, and Cs together to get a 7 letter word, when we restrict the number of those letters that are available to us?  Well, let's first look at what equation this is represented by if we don't restrict the letters:


We have some number of As, some number of Bs, and some number of Cs to add up to k, which is still 7.  If there are no restrictions on the number of each letter we can choose, aside from all of them having to be less than or equal to k (since we don't want 8+ letter words), we can rephrase this question again: how many ways can these 7 letters in our word be distributed amongst these three categories of letters? This question is much easier to create a visual for:

• • • | • • | • • 

The above is an example of a solution to the problem: there are three 'zones' defined by the vertical lines that represent our three categories of letters.  In this case, we have 3 As, 2 Bs, and 2 Cs.  These are also solutions:

• | • • • • | • •
• • • • • | • • |
• • • | • | • • •
• • • • • • • | |

And so on.  What we're doing is finding all of the ways we can distribute 7 dots (our 7 letters), and 2 category 'spacers' (the lines; equal to our number of categories minus 1).  We have 9 objects total, but need to remove repeats:
- the dots are all equivalent, since the disregard for order in our resulting word means the first letter's spot is no different than the second spot (so divide by 7!);
- the lines are all the same, perhaps more intuitively (so divide by 2!).

This is our resulting equation:



Where, again, 7 is the number of letters in our word, 3 is the number of different letter categories we have to choose from (As, Bs, and Cs), and 1 is taken out since we don't need three spaces to identify 3 categories.  This is a really nice result; but, we're not done.

Limiting the sizes of our categories

The above equation we derived based on the assumption that the A category could be as large or as small as it wanted, and same with the B and C categories.  Our initial question, however, had a maximum of 5 As, 4 Bs, and 2 Cs.  Instead of allowing each category to get as large as the size of the word (7), they must be reduced.  To solve this, we have to consider this from yet again another perspective, and then we'll tie that new concept back in.

Inclusion/Exclusion

This concept refers to the process by which we can narrow down sections of sample spaces to exclude regions we don't want to look at.  Venn Diagrams are excellent visual representations of sample spaces:


Here, we have our sample space S which contains all of the possible outcomes to a set of events.  Sub-space A is a certain collection of outcomes with defining characteristics; sub-space B is an outcome set with its own characteristics, and so is sub-space C.  Perhaps these sub-spaces overlap, as they all do in this particular Venn Diagram.

In our question, we want to figure out all of the words that contain at most 5 As, at most 4 Bs, and at most 2 Cs.  So long as such a word satisfies all of those conditions, then we want to include it.

This parallels saying that so long as an event satisfies both A and B in our Venn Diagram above, we want to include it - so we want only the section AB.  Or, maybe we want everything that satisfies A or B or C.

We cannot do this with simple addition of all of the sections, because AC for example is contained already in both A and C; we'd be double-counting.  We would want to subtract out intersecting regions (and in some cases add them) to remove the double/triple-counting.

Let's consider our regions again, and we'll start our sum by adding A.  Here's how many times we've added each region so far:


Now let's add B:


We've double-counted AB and ABC.  Before we fix that, let's add in C:


We've double-counted AB, BC, and AC, and have triple-counted ABC.

Let's go to the first level of intersections first: AB, BC, AC.  We'll correct these by subtracting out each of these regions:


We've corrected the first level of intersections, but the ABC region hasn't been fully counted now.  We can simply add it back in at this point, because there is no region that is a sub-space of it.  In total what we've done to find our [R]egion is added the first sub-spaces, subtracted the first level of intersections, and then added back in the next layer of intersections.  This can be expressed in an equation and also generalized:


If we had more levels of intersections, we would continue alternating addition/subtraction.

We can use the concept of sub-spaces to identify regions that correspond to words that we either do or don't want.  In this case, it would help to visualize the set of outcomes that, for instance, have too many of each variable.  Such a sample space would look like this:


Now our word of 7 letters can't have more than 5 As, can't have more than 4 Bs, and can't have more than 2 Cs.  It must satisfy all of these conditions; we want the region completely outside of all of these shaded regions, equal to S minus the shaded region.

If we call:
P1 = combinations where A > 5 (red region)
P2 = combinations where B > 4 (yellow region)
P3 = combinations where C > 2 (blue region)

we can apply our above formula (in fact, the particular example in the previous diagrams) to come up with an equation for this calculation:


where S is the combination total where the As, Bs, and Cs can be as large as they want (36).

Calculating combinations for P1, P2, and P3

We want to know how many combinations of words have more than 5 As, more than 4 Bs, and more than 2 Cs.  Let's first start listing those words that have more than 5 As; I'll put the first 6 As as the first 6 letters, since again order does not matter:

AAAAAAC
AAAAAAA
AAAAAAB

That's actually all there can be; since the only restricting condition for P1 is that A must be greater than 5, the number of combinations depends on the remaining letters (in this case, there's only one other letter left in the word).  We can cut off the leading As:

AAAAAAC
AAAAAAA
AAAAAAB

What we've effectively done is removed a selection of letters available for choosing.  Going back for a second and considering the combination:


Since we've removed 6 of the 7, our new combination is:


And indeed, there are 3 combinations from above.

We can apply this to P2 and P3:

P1 = C(3, 1) = 3
P2 = C(4, 2) = 6
P3 = C(6, 4) = 15

If we want to solve for, say, P2andP3, we simply do the same thing we did previously by eliminating the first 5 Bs and first 3 Cs.  There turns out to be no intersection between any of P1, P2, and P3:

P1andP2 = C(–2, –4) = undefined; 0
P1andP3 = C(0, –2) = undefined; 0
P2andP3 = C(1, –1) = undefined; 0

P1andP2andP3 = C(–5, 0) = undefined; 0

So our equation ought to solve this for us now:


Is this correct?  Let's actually list the combinations:

AAAAABB, AAAAABC, AAAAACC,
AAAABBB, AAAABBC, AAAABCC,
AAABBBB, AAABBBC, AAABBCC,
AABBBBC, AABBBCC, ABBBBCC

That's 12.

No comments:

Post a Comment