The Statisticator: March 2013

Friday, March 29, 2013

Life and death of Internet fads

For the first time today I heard of the new internet fad "Hadouken", which consists in impersonnating Street Fighter's Ken's role while launching his lightning ball:

Meanwhile, your friends pretend to be hit with the ball:

This actually got me thinking of internet fads. Remember "Planking"? And of course the more recent "Harlem Shake" which came and left rather quickly? What about Psy's crazy horseriding "Gangnam style" dance, which also was the only thing people could talk about for a while?

How quickly do they appear? stay? disappear?

A quick Google Trends search reveals the following query volumes for each of the fads:

I find it quite interesting to compare the fad duration and fad amplitude (max value of the fad: even though the values are normalized, maxs can be compared between fads). Planking was the one of the earlier fads, its popularity was less intense at any given point in time but it lasted almost half a year due to lack of real competition (tens if not hundreds of fads attempted to become the new planking, from owling to batmanning without much success).

But fads now appear to come and go at much faster rates with higher intensities. Do people search more to keep updated on what the latest one is?

It will be interesting to look at the life cycle of "Hadouken" in a few weeks (provided people get the spelling right, I'm not even sure there is an official one!), I expect it will have a higher peak than "Harlem Shake" but an even shorter life.

I will also ry to gather some more data for other fads and see if there is a way to come up with some quick and dirty stats on these...

Monday, March 25, 2013

iTunes Randomness Part 7: Putting it all together!!!

(This is the 7th (and final!!) part of a series of posts on iTunes randomness. The introduction and a link to all parts can be found here.)

The final formula

The long-awaited formula is finally within our reach as all pieces are put back together, taking i=N-1 in the last equation in part 6:

As a check, we can verify that the sum of the previous probabilities indeed adds up to one over all k.

Before proceeding, let us recall the main definition of the binomial coefficients.These appear in the expansion of (1+X)^N:

By differentiating the above expression d times (d ≤N) with respect to X, we obtain:

Setting X=-1 and multiplying both sides by (-1)^d yields, for all d ≤N:

As any polynom of degree N or less can be written as a linear combination of polynoms of the type X, X(X-1), ..., X(X-1)...(X-d), the following result holds for any polynom P of degree N or less:

We now return to our original equation where we will use the previous relationship, substituting P(k) with k^(N-1) which is of degree less than N.

This check having been performed, we now return to our primary calculation of interest consisting in the average number of plays required to hear each song at least once.

Unfortunately, we have been unable to further simplify the expression, and the last row is our best closed formula at this time.

The expected number of total plays is E(N + k)=N + E(k):

The next figure compares the theoretical expression with our earlier simulations from part 1.

The expectation we derived can also help compute the average number of times each song will have been played in a successful playlist, and results are shown in the next figure:

The open question that still needs to be tackled is whether this curve reaches an upper bound or continues to increase. With an infinite playlist, will we expect to listen to each song an infinite number of times?

Until then, time for celebration!

Saturday, March 23, 2013

iTunes Randomness Part 6: Regressing the Tangled Equations

(This is part 6 of a series of posts on iTunes randomness. The introduction and a link to all parts can be found here.)

Regression

Thanks to the heuristic of the decomposition in integer powers from the previous section, we can guide our favorite statistical software R as to what linear regression to use.

Let's provide each of the sequences as our dependent variable, and all the integer powers as independent variables to determine, for each N, the alpha_j coefficients from the equation in the previous part:

For N=5, the regression would look something like:

Estimating the coefficients is then straightforward with most softwares, and even by hand considering the solution is most likely exact and thus only requires solving a system of 4 equations with 4 unknowns (any 4 rows of the equation above).

The underlying function generating the sequence for N=5 is:

And so, based on the equation derived in part 3 we obtain:

We now are capable of identifying the coefficients of s_1^k (N), but can we find a pattern in those in order to finally get a general formula for P(k|N), valid for all k and all N?

The next table summarizes the alpha_j coefficients of each of the powers in the equation mentioned at the top of this post:

Let us rewrite the values as fractions and momentarily drop the signs that seem to be in perfect alternance:

Let us take a closer look at this triangular matrix, diagonal by diagonal.

First diagonal: 1, 1, 3/2, 8/3, 125/24, 54/5

The 24 in the denominator hints at factorials yet again. With this in mind, the first terms of the sequence can be re-written as:

We can then view 16 as 4^2 and 125 as 5^3 hinting at k^(k-2). The first diagonal is then k^(k-2) / (k-1)!.

Second diagonal: 1, 2, 9/2, 32/3, 625/24

The 24 hints again at factorials. The sequence can be re-written as:

Encouraged by the formula for the first diagonal, the second diagonal can be written as k^(k-1) / (k-1)!.

Third diagonal: 1/2, 2, 27/4, 64/3

Based on the previous two formulas can we generalize? We would have hoped for k^k / (k-1)! for this new sequence but unfortunately the formula doesn't work. It is just off by a half: k^k / (2(k-1)!).

Third diagonal: 1/6, 4/3, 27/4

We have now become experts and the sequence is k^k / (6(k-1)!).

i-th row, j-th column:

Based on the previous findings, we can generalize the element on the i-th row and j-th column to be:

Adding the sign back in:

Is it me or does it feel like the end is right around the corner?

Friday, March 22, 2013

iTunes Randomness Part 5: The Tangled Equations

(This is part 5 of a series of posts on iTunes randomness. The introduction and a link to all parts can be found here.)

The tangled equations

Let's revisit our decomposition of 90:
1-1-1; 1-1-2; 1-1-3; 1-2-2; 1-2-3; 1-3-3; 2-2-2; 2-2-3; 2-3-3; 3-3-3.

Thinking back of our elevator metaphor, each of these numbers indicates time spent roaming around a given floor (listening to a previously heard song) and the number actually indicates on which floor the wandering is taking place. So 1-3-3 means you stayed an extra time period on level 1, and two on level 3.

All the paths are then the ways of wasting three time periods on three different floors. BUT, and this is a very important caveat, the numbers have to be in increasing order: 1-3-2 would not be an acceptable path. Indeed, the only ways of taking 3 items with replacement among 1, 2 and 3 with replacement and with numbers increasing are:

1-1-1, 1-1-2, 1-1-3, 1-2-2, 1-2-3, 1-3-3, 2-2-2, 2-2-3, 2-3-3, 3-3-3

Does this help us? Yes, because there is now another hidden recursion in the process!

We were here studying k=3, but what did the breakout look like at k=2? Let us list all the ways of taking 2 items with replacement among 1, 2 and 3 with replacement and with increasing numbers:

1-1, 1-2, 1-3, 2-2, 2-3, 3-3

It can be verified that the sum product here yields 25 which is the value in the second table of the sequences (part 3) for k=2 and N=4.

Based on this sequence for k=2, how can the sequence for k=3 be constructed?
If we look at all sequences starting with 1, then all k=2 sequences listed in red can be added to the right of the 1:

1-1-1, 1-1-2, 1-1-3, 1-2-2, 1-2-3, 1-3-3

If we now look at all sequences starting with 2, only the sequences for k=2 listed in red starting with 2 or 3 can be added as we cannot go back from 2 to 1, yielding:

2-2-2, 2-2-3, 2-3-3

Finally there is only one sequence listed in red starting with 3 that can be added to a 3:

3-3-3

This illustrates how the 10 sequences for k=3 listed at the top of this post can be generated from the k=2 sequences listed in red.

But remember that more than the sequences themselves it is their sum-product that we are interetsed in. Denoting by s_N^i (k) the sum product of all sequences starting with value i at time k for values ranging between 1 and N, then:

This looks a little confusing at first so let's illustrate the process with N=4 in the following figure:

Each column is a new "time point", and each row represents an s_N^i (k). Each arrow is color-coded according to the multiplication involved (times 1 for green, times 2 for orange and times 3 for blue). The last row is the sum of the s_N^i (k) at a given time-point, and – oh miracle! – the sum corresponds to the mysterious sequence observed in the table of our sequences in part 3 for N=4. Another fun fact to notice is that the sum is also the first row shifted by one, which makes sense as:

So now can we actually obtain the full explicit formula for the sequence itself. It involves some work as we unwind the tangled equations, working backwards from s_3^3 (k) to s_3^2 (k) to s_3^1 (k).

The calculations are a little lengthier for $s_3^2(k)$:

And finally:

No easy task, but we finally have our general formula generating the sequence for N=3:

The tangled equations can be used again for N=4:

More generally, we can start working our way backwards for the general case:

However, this has not allowed me to fully grasp the general formula for our sequence of interest s_N^1 (k) = S_N (k) for any N, which is the remaining piece of our puzzle.

An important observation can nonetheless be made. It seems as if the formula for s_1^k (N) is composed of all the integers 1 through N-1 to the power k+1:

The alpha_j coefficients also seem to alternate between positive and negative values.

Now what?

Thursday, March 21, 2013

iTunes Randomness Part 4: One Way Elevator

(This is part 4 of a series of posts on iTunes randomness. The introduction and a link to all parts can be found here.)

The “One-way elevator

Another wall.
At this point we revisited our earlier recursion, which retrospectively seemed to provide the best chances of moving forward.

To understand what really goes on in the recursion, let us illustrate the process using the computation P(k=3|N=4) as an example.
We have 4 unique songs, and want the fourth elusive one to be heard on the seventh play. Graphically, we can let us denote by (p,u) the point corresponding to u unique songs having been heard after p songs were played. Any path going from the point with coordinates (0,0) to that of coordinates (7, 4) using blue and orange arrows is a successful song orders of length 7:

Blue arrows represent that a previously unheard song is listened to, while an orange arrow represents the fact that the newly played song has already been heard.

A few notes on the paths:

The first arrow is necessarily blue
The last arrow is also necessarily blue
From start to finish we need to use 4 blue arrows (listening to each song at least once), and 3 orange arrows to account for the three repeats

A few notes on the transition probabilities:

If u unique songs have been heard, then the probability of hearing one of them again (orange arrows) is u/N, and the probability of hearing a new one (blue arrows) is (N-u)/N
Not all paths are created equal: following the path starting with three blue arrows, we end up with an overall probability of (4 * 3 * 2 * 3 * 3 * 3 * 1) / 4^7, whereas the other extreme path yields (4 * 1 * 1 * 1 * 3 * 2 * 1) / 4^7

So how does this help us compute P(k=3|N=7)? The first thing to notice of course is confirmation of the N^(N + k) denominator: each step will have an 1/N component.
As for the numerator, we saw that depending on the paths the value could be quite different, but because we need to use four blue arrows (to go from 0 to 1 unique songs, from 1 to 2, 2 to 3 and finally 3 to 4) and that each of these transitions has constant value no matter when they occur we will systematically have a 4.3.2.1 as part of our numerator (which is the N! we had also observed, namely in the section on the sequences, part 3).

But the numerator is a product of 7 integers, we know four of them, what can the other three be?

Let's list all possible paths from the graph, noting only the contribution of the orange arrows, starting with the paths staying at the lowest "levels" first:

1-1-1; 1-1-2; 1-1-3; 1-2-2; 1-2-3; 1-3-3; 2-2-2; 2-2-3; 2-3-3; 3-3-3.

Let's do a quick check first.

If we compute the sum of the product of all those terms, we should get the value in thes second table in part 3 for k=3 and N=4. Indeed, in both cases we obtain the value of 90. Previously, 90 appeared as the 4-th term of a mysterious sequence of 1 - 6 - 25 - 90, now we get a glimpse idea as to how it was constructed.
But does the sum product get us any closer to finding the general formula of the sequences?

It takes a little time to put to light, but there is actually a very precise way in which the 90 is decomposed, a method which we will exploit in the section of the tangled equations.

This section was named the one-way elevator without any concrete explanation, but perhaps after having illustrated how the graph at the beginning of the section helped us decompose the 90 value the terminology might make a little more sense. Recall, that the number of unique songs heard can only increase, and can only do so in increments of 1. The graph can then be viewed as a building where you start at the main entrance and need to make your way to your friend on the rooftop. The elevators can go up one floor at a time, but never down. You can dilly-dally as much as you want on each floor, but eventually need to continue your ascent. This is one way of visualizing the process!

Wednesday, March 20, 2013

iTunes Randomness Part 3: Hidden Recursions

(This is part 3 of a series of posts on iTunes randomness. The introduction and a link to all parts can be found here.)

Recursion

From the past sections, it clearly appears as if the naïve approaches, despite potentially leading to the correct answer, would require much effort and are not the ideal path to proceed. Let us try to put off counting all possible permutations to later if possible.

Another way to think of the problem is to actually stop and think at what happens whenever a song is played. Only two things can happen: it is either a new song or one already heard.

More formally, denoting by Q(u, p) the probability of having heard u unique songs after p plays:

We can initialize by setting Q(1,1)=1 and Q(i,j)=0 if i>j.

When faced with such recursions, it is often helpful to expand back a few terms and try to see if a certain tendency or pattern emerges. You can try it out for yourself here, things get very messy very quickly with no clear trend in sight.

A little too early to shout victory, but at least we can write a nice recursive R function to quickly compute the P(k|N) probabilities, and no longer have to rely on simulations.

How are our P(k|N) related to the above Q(u,p) probabilities? Recall that P(k|N) is the probability of having listened to all N songs at least once after exactly p=N+k plays and not a song sooner! This means that after N+k-1 plays, we have heard all songs except one at least once, and on the next play we will hear the final elusive song (event that occurs with probability 1/N). Putting the pieces together:

First small victory!

The hidden sequences

Now that we have a way of getting exact probability values, is there any way we could get a glimpse of what the solution looks like?

For a given number of cards (2, 3, 4, ...) we have computed the P(k|N) probabilities. The numbers were quite ugly, but what really interests us is the numerator as the denominator will always be N^(N+k) which is the size of the universe for playing a random sequence of N songs N+k times (and this indeed what we saw when computing P(0|N), P(1|N) and P(k|2) in our first approaches.

Here's are the values corresponding to the numerators:

A quick glance will show that each row is a multiple of N!. Simplifying to make the core structure more apparent yields the following table:

We have now narrowed down our original problem to understanding how each of these sequences is generated. All we need now that is to get the general formula of the terms in previous table, denoting it by S_N(k). We will then have:

The second row (for N=3) is easily identifiable as 2^(k+1)-1. The second column can be recognized as N(N-1)/2. Other than that, the underlying formula generating the other columns and rows is rather elusive. Again, please feel free to try for yourself!

For the remainder of the paper, we will refer to S_N(k) as the sequence for N. The sequence for N=4 therefore starts by 1, 6, 25, 90,...

My wife actually found another very nice relationship in the table. Denoting by En^k the elements in the table ((N-1)-th row and (k+1)-th column), we observe that:

See the 301 value? It's 3 * 90 + 31.

A very nice relationship, but unfortunately very difficult to exploit for our purpose!