2017/2018 Intermediate Microeconomics (Business course at UTS)

Proposed solutions for the tutorials.



Some simple probability formulas with examples

A known relationship that is usually given axiomatically:

P(B|A) = \frac{P(AB)}{P(A)}

Rearranging gives the multiplication rule of probability:

P(AB) = P(A)P(B|A) = P(B)P(A|B)

Now observe a cool set up that is handy to keep in mind for proving the law of total probability and Bayes’ theorem.

Imagine that B happens with one and only one of n mutually exclusive events A_1, A_2,..., A_n, i.e.:

 B = \sum\limits_{i = 1}^n {B{A_i}}

By the addition rule:

P(B) = \sum\limits_{i = 1}^n {P(B{A_i})}.

Now by the multiplication rule:

P(B) = \sum\limits_{i = 1}^n {P({A_i})P(B|{A_i})}.

This is the law of total probability.

From the same set-up, imagine that we want to find the probability of event A_i if B is known to have happened. By the multiplication rule:

P(A_i B) = P(B)P(A_i|B) = P(A_i)P(B|A_i)

Dropping P(A_i B) and dividing the rest through by P(B) we get:

P({A_i}|B) = \frac{P({A_i})P(B|{A_i})}{P(B)}

Applying the law of total probability to the denominator, we have Bayes’ theorem:

P({A_i}|B) = \frac{P({A_i})P(B|{A_i})}{\sum\limits_{j = 1}^n {P({A_j})P(B|{A_j})}}
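As a sanity check, here is a minimal sketch in Python. The two-state example is hypothetical, chosen only for illustration: A_1 = “the urn is majority blue”, A_2 = “the urn is majority red”, and B = “a blue ball is drawn”.

```python
# Hypothetical two-urn example: A1 = "majority blue", A2 = "majority red".
priors = {"A1": 0.5, "A2": 0.5}          # P(A_i)
likelihood = {"A1": 2 / 3, "A2": 1 / 3}  # P(B | A_i), B = "a blue ball is drawn"

# Law of total probability: P(B) = sum_i P(A_i) P(B | A_i)
p_b = sum(priors[a] * likelihood[a] for a in priors)

# Bayes' theorem: P(A_i | B) = P(A_i) P(B | A_i) / P(B)
posterior = {a: priors[a] * likelihood[a] / p_b for a in priors}

print(p_b)        # ≈ 0.5
print(posterior)  # A1 ≈ 2/3, A2 ≈ 1/3
```

Observing a blue ball raises the probability of the majority-blue urn from 1/2 to 2/3, exactly as the formula dictates.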

A bunch of examples:

Problem: P_t(k) is a known probability of receiving k phone calls during a time interval t, with k=0,1,2,.... Assuming that the numbers of calls received during two adjacent time periods are independent, find the probability of receiving s calls during a time interval equal to 2t.

Solution: Let A_{b,b+t}^k be the event of receiving k calls in the interval from b to b+t. Then clearly

A_{0,2t}^s = A_{0,t}^0A_{t,2t}^s + ... + A_{0,t}^sA_{t,2t}^0

which means that the event A_{0,2t}^s can be seen as a sum of s+1 mutually exclusive events such that in the first interval of duration t the number of calls received is i and in the second interval of the same duration it is s-i (i=0,1,2,...,s). By the rule of addition

P(A_{0,2t}^s) = \sum\limits_{i = 0}^s {P(A_{0,t}^iA_{t,2t}^{s - i})}.

By the rule of multiplication and the independence of the two intervals

P(A_{0,t}^iA_{t,2t}^{s - i}) = P(A_{0,t}^i)P(A_{t,2t}^{s - i})

If we change the notation so that

{P_{2t}}(s) = P(A_{0,2t}^s), \quad {P_t}(i) = P(A_{0,t}^i), \quad {P_t}(s - i) = P(A_{t,2t}^{s - i}),

then

{P_{2t}}(s) = \sum\limits_{i = 0}^s {{P_t}(i) \cdot {P_t}(s - i)}.

It is known that under quite general conditions

{P_t}(k) = \frac{(at)^k}{k!}\exp \{ - at\} \quad (k = 0,1,2,...)

(Recall that the Poisson distribution is an appropriate model if the following assumptions hold. (a) k is the number of times an event occurs in an interval, and k can take values 0,1,2,.... (b) The occurrence of one event does not affect the probability that a second event will occur; that is, events occur independently. (c) The rate at which events occur is constant: it cannot be higher in some intervals and lower in others (that is kind of a lot to take on faith, really). (d) Two events cannot occur at exactly the same instant; instead, in each very small sub-interval exactly one event either occurs or does not occur. (e) The probability of an event in a small sub-interval is proportional to the length of the sub-interval. Alternatively, instead of those assumptions: the actual probability distribution is binomial and the number of trials is sufficiently bigger than the number of successes one is asking about (the binomial distribution approaches the Poisson).)

Parametrisation then gives

{P_{2t}}(s) = \sum\limits_{i = 0}^s {\frac{(at)^s}{i!(s - i)!}\exp \{ - 2at\}} = (at)^s \exp \{ - 2at\} \sum\limits_{i = 0}^s {\frac{1}{i!(s - i)!}}

Note that

\sum\limits_{i = 0}^s {\frac{1}{i!(s - i)!}} = \frac{1}{s!}\sum\limits_{i = 0}^s {\frac{s!}{i!(s - i)!}} = \frac{(1 + 1)^s}{s!} = \frac{2^s}{s!}


{P_{2t}}(s) = \frac{(2at)^s \exp \{ - 2at\}}{s!} \quad (s = 0,1,2,...)

The key point: if the Poisson formula holds for an interval of length t, then for 2t we get the same formula with parameter 2at, as above. The same holds for any multiple of t.
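The convolution identity above is easy to verify numerically; here is a quick sketch in Python (the values of a, t and s are arbitrary):

```python
from math import exp, factorial

def p_t(k, a, t):
    """Poisson probability of k calls in an interval of length t with rate a."""
    return (a * t) ** k / factorial(k) * exp(-a * t)

a, t, s = 1.5, 2.0, 4

# Convolution over the two adjacent intervals of length t
conv = sum(p_t(i, a, t) * p_t(s - i, a, t) for i in range(s + 1))

# Closed form: the same Poisson formula with parameter 2at
direct = p_t(s, a, 2 * t)

print(abs(conv - direct) < 1e-12)  # True
```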

A simple fact about sets

Out of n elementary events one can get

\sum_{m=1}^{n} C_{n}^{m} = 2^n - 1

possible outcomes, where C_{n}^{m} is the number of events that contain m elementary events. Take the set

\{ a,b,c\}

with its size, n=3, as the only characteristic. Then its power set

\{ \{ a\} ,\{ b\} ,\{ c\} ,\{ a,b\} ,\{ a,c\} ,\{ b,c\} ,\{ a,b,c\} ,\emptyset \}

contains {2^3} = 8 elements: 3 events with one element each, C_{3}^{1}; then 3 events with two elements, C_{3}^{2}; finally, 1 event with all the elements, C_{3}^{3}. The empty set is the impossible event.
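The counting is easy to reproduce with a throwaway Python sketch using only the standard library:

```python
from itertools import combinations
from math import comb

s = ["a", "b", "c"]
n = len(s)

# All non-empty subsets: choose m elements for m = 1..n
subsets = [set(c) for m in range(1, n + 1) for c in combinations(s, m)]

print(len(subsets))                           # 7 == 2**3 - 1
print([comb(n, m) for m in range(1, n + 1)])  # [3, 3, 1]
```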

I personally think that this simple fact is amazing, but some would say it is kind of boring. Here is an interesting question for those.

A pack of 36 cards is randomly split into two equal halves. What is the probability that the halves have equal amounts of black and red cards?

This is just another set with 36 elements of two types.

p = \frac{C_{18}^9 \times C_{18}^9}{C_{36}^{18}} = \frac{(18!)^4}{36!\,(9!)^4}

The denominator indicates all possible equally likely ways the pack can be split.

Instead of computing that manually one can use Stirling’s asymptotic equality

n!\ \approx \sqrt {2\pi n} \cdot {n^n}{e^{ - n}}


18!\ \approx {18^{18}}{e^{ - 18}}\sqrt {2\pi \cdot 18}

9!\ \approx {9^9}{e^{ - 9}}\sqrt {2\pi \cdot 9}

36!\ \approx {36^{36}} \cdot {e^{ - 36}}\sqrt {2\pi \cdot 36}

Which means

p \approx \frac{{{{(\sqrt {2\pi \cdot 18} \cdot {{18}^{18}} \cdot {e^{ - 18}})}^4}}}{{\sqrt {2\pi \cdot 36} \cdot {{36}^{36}} \cdot {e^{ - 36}}{{(\sqrt {2\pi \cdot 9} \cdot {9^9} \cdot {e^{ - 9}})}^4}}}

Simple algebra yields

p \approx \frac{2}{{\sqrt {18\pi } }} \approx \frac{4}{{15}} \approx 0.26

The result fascinates me. The graph visualizes data from a real experiment where a pack is split equally 100 times and \mu is the cumulative count of splits in which exactly 9 red cards are observed in one of the halves. What is crazy is that we were able to see the results of this experiment without doing any experiments, by simply reasoning mathematically about things.
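Anyone can redo the experiment in silico. Below is a quick Monte Carlo sketch in Python (the number of trials and the seed are arbitrary) that compares the simulated frequency with the exact combinatorial answer and the Stirling approximation:

```python
import random
from math import comb, pi, sqrt

random.seed(42)
deck = ["red"] * 18 + ["black"] * 18

trials = 100_000
hits = 0
for _ in range(trials):
    random.shuffle(deck)
    if deck[:18].count("red") == 9:  # first half has exactly 9 red cards
        hits += 1

exact = comb(18, 9) ** 2 / comb(36, 18)  # combinatorial answer, ≈ 0.2605
approx = 2 / sqrt(18 * pi)               # Stirling approximation, ≈ 0.266

print(exact, approx, hits / trials)      # all close to 0.26
```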

More on this topic: Gnedenko (1988).

Distribution of an ordered pile of rubble

Imagine a pile of rubble (X) where the separate elements of the pile are stones (x_i). By picking n stones we form a sample that we can sort by weight. The sequence x_1,x_2,...,x_n becomes x_{(1)},x_{(2)},...,x_{(m)},...,x_{(n)}, where (m) is called the “rank”.

Pretend that we do the following. Upon picking a sample and sorting it, we put the stones into n drawers and mark each drawer by rank. Now repeat the procedure again and again (picking a sample, sorting, and putting stones into drawers). After several repetitions we find that drawer #1 contains the lightest stones, whereas drawer #n the heaviest. An interesting observation: by repeating the procedure indefinitely we would be able to put the whole parenting set (the whole pile, or the whole range of the parenting distribution) into drawers, and later do the opposite — take all the stones from all drawers and mix them to get back the parenting set. (The fact that the distributions (and moments) of stones of a particular rank and the parenting distribution are related is probably the most thought-provoking part.)

Now let us consider the drawers. Obviously, the weights of stones in a given drawer (in a rank) are not all the same. Furthermore, they are random and governed by some distribution. In other words, they are, in turn, a random variable, called an order statistic. Let us label this random variable X_{(m)}, where m is the rank. Thus a sorted sample looks like this


Its elements X_{(m)} (sets of elements (stones) x from the general set X (pile) with rank m (drawer)) are called the m-th order statistics.


Elements X_{(1)} and X_{(n)} are called “extreme”. If n is odd, the value with number m=\frac{(n+1)}{2} is central. If m is of order \frac{n}{2}, this statistic is called “m-central”. A curious question is how to define the “extreme” elements as n \to \infty: if n increases, then m increases as well.


Let us derive the density function of the m-th order statistic with a sample size of n. Assume that the parenting distribution F(x) and density f(x) are continuous everywhere. We will be dealing with a random variable X_{(m)} which shares the same range as the parenting distribution (if a stone comes from the pile, it won’t be bigger than the biggest stone in that pile).


The figure shows F(x), f(x) and the function of interest \varphi_n (\cdot). The index n indicates the size of the sample. The x axis has values x_{(1)},...,x_{(m)},...,x_{(n)} that belong to a particular realization of X_{(1)},X_{(2)},...,X_{(m)},...,X_{(n)}.

The probability that the m-th order statistic X_{(m)} is in the neighborhood of x_{(m)} is by definition (recall the identity dF = F(x + dx) - F(x) = \frac{F(x + dx) - F(x)}{dx} \cdot dx = f(x) \cdot dx):

dF_{n}(x_{(m)})=p[x_{(m)}<X_{(m)}<x_{(m)}+dx_{(m)}]=\varphi_n (x_{(m)})dx_{(m)}

We can express this probability in terms of the parenting distribution F(x), thus relating \varphi_n (x_{(m)}) and F(x).

(This bit was a little tricky for me; read it twice with a nap in between.) Consider the realization x_1,...,x_i,...,x_n as a sequence of trials (generated by the parenting distribution rather than the order statistics; remember that the range is common) where a “success” is observing a value X<x_{(m)} and a “failure” is observing X>x_{(m)} (if still necessary, return to the pile-and-stones metaphor). Obviously, the probability of a success is F(x_{(m)}), and of a failure 1-F(x_{(m)}). The number of successes equals m-1 and the number of failures n-m, because the m-th value x_{(m)} in a sample of size n is such that m-1 values are less than it and n-m values are greater.

Clearly, the count of successes has a binomial distribution. (Recall that the probability of getting exactly k successes in n trials is given by the pmf p(k;n,p) = P(X = k) = \binom{n}{k}{p^k}{(1 - p)^{n - k}}. In words, k successes occur with probability p^k and n-k failures with probability (1-p)^{n-k}; however, the k successes can occur anywhere among the n trials, and there are \binom{n}{k} different ways of distributing k successes in a sequence of n trials.)

The probability for the parenting distribution to take a value close to x_{(m)} is the element dF(x_{(m)})=f(x_{(m)})dx.

The probability that the sample lands around x_{(m)} in such a way that m-1 elements are to the left of it, n-m to the right, and one of the n elements (there are n ways to choose which one) is in the neighborhood of x_{(m)} is equal to:

n \cdot C_{n - 1}^{m - 1}{[F({x_{(m)}})]^{m - 1}}{[1 - F({x_{(m)}})]^{n - m}}f({x_{(m)}})dx

Note that this is exactly p[x_{(m)}<X_{(m)}<x_{(m)}+dx_{(m)}], thus:

\varphi_n (x_{(m)})dx_{(m)}=n \cdot C_{n - 1}^{m - 1}{[F({x_{(m)}})]^{m - 1}}{[1 - F({x_{(m)}})]^{n - m}}f({x_{(m)}})dx

Furthermore, if when switching from f(x) to \varphi_n (x_{(m)}) we maintain the scale of the x axis, then

\varphi_n (x_{(m)})=n \cdot C_{n - 1}^{m - 1}{[F({x_{(m)}})]^{m - 1}}{[1 - F({x_{(m)}})]^{n - m}}f({x_{(m)}})

The expression shows that the density of an order statistic depends on the parenting distribution, the rank and the sample size. Note the distributions of the extreme values, when m=1 and m=n.

The maximal (rightmost) element has the distribution function F^{n}(x) and the minimal one 1-[1-F(x)]^n. As an example, observe the order statistics for ranks m=1,2,3 with sample size n=3 for the uniform distribution on the interval [0,1]. Applying the last formula with f(x)=1 (and thus F(x)=x) we get the density of the smallest element

\varphi_3 (x_{(1)})=3(1-2x+x^2);

the middle element

\varphi_3 (x_{(2)})=6(x-x^2)

and the maximal

\varphi_3 (x_{(3)})=3x^2.
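These three densities are easy to check by simulation. Here is a short Python sketch (the number of trials and the seed are arbitrary) that estimates the means of the three order statistics; for the densities above they should be 1/4, 1/2 and 3/4:

```python
import random

random.seed(0)
trials = 100_000
sums = [0.0, 0.0, 0.0]

for _ in range(trials):
    # Draw three uniform values and sort them: one realization per drawer
    sample = sorted(random.random() for _ in range(3))
    for m in range(3):
        sums[m] += sample[m]

means = [s / trials for s in sums]
print(means)  # close to [0.25, 0.5, 0.75], i.e. E[X_(m)] = m / (n + 1)
```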

In full concordance with intuition, the density of the middle value is symmetric with respect to the parenting density, whereas the densities of the extreme values are bounded by the range of the parenting distribution and pile up towards the corresponding bound.

Note another interesting property of order statistics. Summing the densities \varphi_3 (x_{(1)}), \varphi_3 (x_{(2)}), \varphi_3 (x_{(3)}) and dividing the result by their number:

\frac{1}{3}\sum\limits_{m = 1}^3 {\varphi _3}({x_{(m)}}) = \frac{1}{3}(3 - 6x + 3x^2 + 6x - 6x^2 + 3x^2) = 1 = f(x)

on the interval [0,1]

The normalized sum of the order statistics turned out to equal the parenting density f(x). It means that the parenting distribution is a combination of the order statistics X_{(m)}, just as mentioned above: after sorting the general set by ranks we could mix the drawers back together to get the general set.

Further reading: Efimov (1980); Arnold & Balakrishnan (2008).

Why so many long lines at terrible restaurants…

“It must be a good restaurant since the line is so long.” Hm… you likely just failed to update your beliefs in a rational way.

Imagine you are in a classroom and there is an urn with three balls in front of everyone. You don’t see the colours of the balls, but you do know it is equally likely to be majority blue (2 blue, 1 red) or majority red (1 blue, 2 red). Since you don’t know which urn exactly is there (the true state of the world) you need some evidence before making a guess. Now every person in the class, one by one, comes and picks one ball from the urn and, without showing it, announces his choice. Believe it or not, this is your restaurant-choice situation.

The two possibilities for the urn are an analogue of whether the restaurant is good or bad. A person that comes to make a choice has several pieces of information to combine. Taking one ball from the urn is as if you had read some reviews of the restaurant beforehand. The information is not perfect: the reviews could be biased or not representative of your taste. However, you also observe the choices of the people before you. You do not know their private signals (what ball they picked from the urn, i.e. what their conclusion was after studying the reviews), but you do know their choices.

Claiming that the restaurant must be good because the line is long would be true only if all the people that came sequentially followed only their private signals. Then, when your time comes to make a choice, the line indicates independent draws of balls from the urn. If the true state of the world were that the urn is majority blue, many more people would say so.

The thing is that those draws are clearly not independent. At some point a person whose private signal says the urn is majority blue might see too many people choosing majority red, abandon his private signal and follow the crowd. So when it is your turn to make a choice and you observe a line (i.e. heaps of people announcing their choice), it does not necessarily mean that the restaurant is good. Put differently, you do not account for the correlation between public beliefs (the belief based on the observed choices, before seeing your private signal) and private signals.
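The mechanism can be sketched in code. Below is a deliberately simplified Python model (not the full Bayesian treatment): each person naively counts all earlier public choices as if they were independent signals, adds their own private signal, and picks the majority. Once the public lead reaches two, every subsequent person’s own signal can no longer flip the sign, so everyone follows the crowd — the cascade:

```python
import random

def cascade(n_people, p_correct=2 / 3, true_state="blue"):
    """Each person gets a private signal (correct with probability p_correct),
    naively counts all earlier public choices as extra signals, and announces
    the colour with the higher total; ties are broken by one's own signal."""
    choices = []
    for _ in range(n_people):
        signal = true_state if random.random() < p_correct else (
            "red" if true_state == "blue" else "blue")
        score = choices.count("blue") - choices.count("red")
        score += 1 if signal == "blue" else -1
        if score > 0:
            choices.append("blue")
        elif score < 0:
            choices.append("red")
        else:
            choices.append(signal)
    return choices

random.seed(1)
print(cascade(15))  # once one colour leads by 2, everyone herds on it
```

Note that the herd can lock onto the wrong colour: if the first two draws happen to be misleading, every later choice repeats the mistake no matter what the private signals say.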

Well, that is herding. And here is a presentation about it…

If that stuff sounded crazy awesome, then read this and, at the very very end, this.

It is obviously not about restaurants at all; it could be the choice of major for a college degree. Is being a doctor a good choice or not? There is no way to know for sure; you just have to combine your private signal with the public belief. If you don’t have a strong private signal, it will be overwhelmed by the public belief and you will just follow the crowd. It could also explain why in Russia or Germany, at certain times, aaalll people would put out Nazi flags or hang Stalin’s portrait on the wall at home and in the office. Or pretty much anything that involves guessing the state of the world by combining information from your own guess and the choices of others.

Practical advice on non-parametric density estimation.

Always start from the histogram: any non-parametric density estimation method is essentially a fancier version of a histogram.

Compare the problem of choosing an optimal bin size in a histogram with the choice of the bandwidth h in a kernel estimator.

The number of bins is too small. Important features, such as the mode, of this distribution are not revealed.
Optimal number of bins (optimal according to Sturges’ rule, but the rule is beside the point).
The number of bins is too large. The distribution is overfitted.

The point of the exercise is to reveal all the features of the data; that is what is important to keep in mind.

The bandwidth h is too large. Local features of this distribution are not revealed.
The bandwidth h is selected by a rule of thumb called the normal reference bandwidth.
The bandwidth h is too small. The distribution is overfitted.



While a histogram takes an average within a bin, kernel estimation naturally extends this idea and takes a fancier version of an average around a given point. How much information around the point to use is governed by the bandwidth. Conceptually, a bandwidth and a bin are identical.
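To make the analogy concrete, here is a minimal hand-rolled Gaussian kernel estimator in Python (pure standard library; the data and grid are arbitrary). The bandwidth h plays exactly the role of the bin width, and the rule-of-thumb value below is the normal reference bandwidth:

```python
from math import exp, pi, sqrt
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(500)]

def kde(x, data, h):
    """Gaussian KDE: a smooth average of bumps of width h around each point,
    the analogue of counting points in a histogram bin of width h."""
    return sum(exp(-0.5 * ((x - d) / h) ** 2) for d in data) / (
        len(data) * h * sqrt(2 * pi))

n = len(data)
mean = sum(data) / n
sd = sqrt(sum((d - mean) ** 2 for d in data) / n)
h = 1.06 * sd * n ** (-1 / 5)  # normal reference (rule-of-thumb) bandwidth

grid = [-4 + 0.1 * i for i in range(81)]
total = sum(kde(x, data, h) * 0.1 for x in grid)  # Riemann sum of the density
print(round(total, 2))  # close to 1: the estimate is a proper density
```

Shrinking h toward zero reproduces the overfitted spiky picture, and inflating it reproduces the oversmoothed one, just as with too many or too few bins.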


And now take a look at a perfect application of the idea in

Nissanov, Zoya, and Maria Grazia Pittau. “Measuring changes in the Russian middle class between 1992 and 2008: a nonparametric distributional analysis.” Empirical Economics 50.2 (2016): 503-530.

Comparison between income distributions in the period 1992–2008. Authors’ calculation on weighted household income data from RLMS. Kernel density estimates are obtained using adaptive bandwidth.

Going back to the advice: keep in mind that you are doing it to reveal features of the data, and it has to be strictly more informative than a histogram; otherwise the computational costs are not justified.

Spatial competition… and what science is really about.

Check my presentation on an empirical model of firm entry with endogenous product-type choices. (here)

A normal reaction to the presentation’s topic should be “whaat? why would anyone want to do this stuff for a living?”. It is a great question; I don’t have an answer to it. It is indeed viciously technical and deadly boring.

But I do have something really cool to share. Back home I was driving my 15-year-old niece to a museum and failed to find a humanly understandable combination of words to explain what science is. So now you get to check this combination of words; I think it is a really good fit…

A human eye is able to capture quite a limited portion of the light spectrum (the visible spectrum). We are unable to travel in time or reach most of the planets in the galaxy. Yet there is no need to be able to physically see the whole light spectrum to actually “see” it. And you do not need to be able to travel in time to “see” the past, just like you do not need to be able to travel to another planet to “see” that planet. Here is a cool angle on it: the information integration theory of consciousness, an exceptionally creative idea that, if appreciated properly, will blow your mind.

Human bodies have an enormous number of systems, like no other living being. We feel temperature and objects, we see and hear, we feel emotions like fear, shame, happiness etc. Our brain integrates all of this information from all the systems into a sense of reality. Put differently, the reality as seen by a person is but the aggregated sensations from a set of systems which continuously register information. Think about the feeling of pain. Pain is your body’s language: if your body needs attention from you, it sends a signal. However, the signal has only one dimension; it is kind of like a baby’s cry. A baby can only change the intensity of its cry, and it is your job to give that cry an interpretation. Your brain does the same. (To be more precise, you do it yourself but unconsciously; it is one of those automatic processes, kind of like intuition.) Consciousness, the capacity to separate yourself from other things, is just another trick of your brain. Instead of giving you the raw information from the systems that systematically aggregate information, it gives you an interpretation. Instead of overwhelming you with tonnes of sensations, the brain gives you their meaning. Reality is the brain’s interpretation of the aggregated information from a number of systems that supply raw data.

Holy bologna!! But is that not what science is? Yes, indeed. Science is nothing but a natural extension of a process that your body does almost automatically: aggregating information from systems that continuously register it and assigning meaning to it. (There is also the thesis that mathematics is nothing but common sense, quite dense at times. I’ll see if I can make this post compact and readable enough; if I do, I’ll give you that idea as well.)

It is also interesting to look at people’s temperaments. The system integrator (our brain, our consciousness) assigns different weights to the different systems from which it gets information. That is why we sometimes observe people who are always scared or calm, sympathetic or cold. Of course, there are other things that define character, or a predilection for specific kinds of decisions, such as upbringing and genetics, yet the system integrator has the last word.

Ok. Your brain has the capacity to integrate information from systems that systematically aggregate information and to assign meaning; one product of this process is consciousness, or a sense of reality. But the systems do not have to be physiological; they do not necessarily have to be attached to your brain through a common nervous system. It just has to be something that contains information. Let’s go back to the very beginning of this post. Yes, indeed, people see a quite narrow spectrum of the light wave; however, there are devices which can capture those waves. Cameras, for example, continuously aggregate information that could never have been captured if we limited ourselves to physiological systems. For your brain, information captured by a camera has the same value as information captured by your eyes. The only difference is that your brain has to readjust itself to be able to aggregate information from it. That is why, in the beginning, when you look at some figure which contains information you will be confused, but with time you realign the integration process. In other words, you become able to incorporate this new information and combine it with information from other systems. When you do mathematics it is very important at some point to stop and think about the meaning of the equations that you have. You have to integrate this information with other information that your brain has and assign meaning to it. That is, in fact, a process of co-integration of information from different sources. And it is very costly for your brain to do, which is why it is so annoying. Another example from the beginning is our incapacity to travel across time. Well, the physical world, unfortunately, has this dimension which only goes one way, and the speed of the going cannot normally be changed. But all of us have some videotapes from the past.
Imagine that there is a probe that is able to capture some information from the past and keep it (pictures, videotapes, documentary movies). Such a system even allows us to travel through time: for our brain this is identical to travelling to the past ourselves. You just have to put in some effort to integrate the information from the new systems. People who study history or work on documentary movies immerse themselves in systems that continuously register information from the past, and their brains are trained well enough to easily incorporate this knowledge and assign a meaning to it. Another example: to get information about faraway planets one does not have to physically travel there; astronomical spectroscopy allows us to systematically capture information about the planets, and then you can realign this knowledge so that your brain incorporates and integrates it into a perception of reality just as it would do from your eyes. And the final example is statistical work. If you have some data sets you can do some statistics to draw conclusions. Most often, to do statistical work a person has to merge two data sets. Those two different data sets are nothing but systems that continuously register information about some object (it is other people that put down a number; in theory, instead of a number they could have used words, but then we are back to the crying-baby case, and the signal is not rich enough). They look at the same object, and what people do is combine this knowledge to assign some meaning to it.

The point is that our brain is capable of aggregating information from many more systems than our physiological limits dictate.

In some sense, our brain is a prisoner of our physiological systems. So one way to say it: science is setting your brain free. Seeing and thinking are the same thing when your eyes are closed. Put differently, the things that we physically see or feel here are just a little fraction of what we potentially can see if we allow our brain to aggregate information, and assign meanings, from much wider systems that continuously register information. The sense of reality, consciousness, is a computational shortcut, because otherwise your brain would be overwhelmed with information.

In fact, any meaning is a computational shortcut that only your brain requires. Objective reality exists as an enormous, mostly meaningless set of data. Life exists only because it can; asking for the meaning of life is the most idiotic question of all. Meaning itself is senseless; it is nothing but a trick of your brain to aggregate information more easily. (It sounds really weird… hm… I should probably wrap up this one; better do another post.)

P.S. To survive, people developed a capacity to form groups very quickly (morality) and to make decisions under uncertainty very quickly. A sense of reality, or consciousness, is sort of a “sufficient statistic”: for the decision at hand (to survive) we can form one parameter, a meaning, that contains all the useful information from the data that surround us. It economizes on computational requirements and minimizes the risk of a mistake (sometimes the cost of a mistake is your life).