Distribution of an ordered pile of rubble

Imagine a pile of rubble (X) whose separate elements are stones (x_i). By picking n stones we form a sample that we can sort by weight. The sequence x_1,x_2,...,x_n becomes x_{(1)},x_{(2)},...,x_{(m)},...,x_{(n)}, where (m) is called the “rank”.

Pretend that we do the following. Upon picking a sample and sorting it, we put the stones into n drawers and mark each drawer by rank. Now repeat the procedure again and again (picking a sample, sorting, and putting the stones into drawers). After several repetitions we find that drawer #1 contains the lightest stones, whereas drawer #n contains the heaviest. An interesting observation is that by repeating the procedure indefinitely we would put the whole parent set (the whole pile, or the whole range of the parent distribution) into drawers, and later we could do the opposite: take all the stones from all the drawers and mix them to recover the parent set. (The fact that the distributions (and moments) of stones of a particular rank and the parent distribution are related is probably the most thought-provoking part.)

Now let us consider the drawers. Obviously, the weights of the stones in a given drawer (in a given rank) are not all the same. Furthermore, they are random and governed by some distribution. In other words, the content of a drawer is, in turn, a random variable, called an order statistic. Let us label this random variable X_{(m)}, where m is the rank. Thus a sorted sample looks like this:

X_{(1)},X_{(2)},...,X_{(m)},...,X_{(n)}

Its elements X_{(m)} (the set of elements (stones) x from the general set X (the pile) with rank m (the drawer)) are called the m-th order statistic.
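To make the drawer metaphor concrete, here is a minimal simulation sketch (Python with numpy assumed; the exponential parent distribution, the sample size and the number of repetitions are arbitrary choices): repeatedly draw a sample, sort it, and drop each sorted stone into the drawer labelled by its rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5              # sample size = number of drawers
repetitions = 10_000

# drawers[m] collects every stone that ever received rank m + 1
drawers = [[] for _ in range(n)]

for _ in range(repetitions):
    sample = rng.exponential(scale=1.0, size=n)    # "picking n stones" from the pile
    for m, stone in enumerate(np.sort(sample)):    # sorting by weight
        drawers[m].append(stone)                   # drawer number m + 1

# drawer #1 holds the lightest stones, drawer #n the heaviest
for m, drawer in enumerate(drawers, start=1):
    print(f"drawer {m}: mean weight {np.mean(drawer):.3f}")
```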

//////////////

The elements X_{(1)} and X_{(n)} are called “extreme”. If n is odd, the value with the number m=\frac{n+1}{2} is the central one. If m is of order \frac{n}{2}, this statistic is called “m-central”. A curious question is how to define the “extreme” elements when n \to \infty: if n increases, then m increases as well.

//////////////

Let us derive the density function of the m-th order statistic for a sample of size n. Assume that the parent distribution F(x) and density f(x) are continuous everywhere. We will be dealing with a random variable X_{(m)} which shares its range with the parent distribution (if a stone comes from the pile, it won’t be bigger than the biggest stone in that pile).


The figure shows F(x), f(x) and the function of interest \varphi_n (\cdot). The index n indicates the size of the sample. On the x axis are the values x_{(1)},...,x_{(m)},...,x_{(n)} that belong to a particular realization of X_{(1)},X_{(2)},...,X_{(m)},...,X_{(n)}.

The probability that the m-th order statistic X_{(m)} lies in the neighborhood of x_{(m)} is by definition (recall the identity dF = F(x + dx) - F(x) = \frac{F(x + dx) - F(x)}{dx} \cdot dx = f(x)\,dx):

dF_{n}(x_{(m)}) = P[x_{(m)} < X_{(m)} < x_{(m)}+dx_{(m)}] = \varphi_n (x_{(m)})\,dx_{(m)}

We can express this probability in terms of the parent distribution F(x), thus relating \varphi_n (x_{(m)}) and F(x).

(This bit was a little tricky for me; read it twice with a nap in between.) Consider the realization x_1,...,x_i,...,x_n as a sequence of trials (a sequence generated by the parent distribution, rather than the order statistics; remember that the range is common), where a “success” is observing a value X<x_{(m)} and a “failure” is observing X>x_{(m)} (if still necessary, return to the pile-and-stones metaphor). Obviously, the probability of a success is F(x_{(m)}), and of a failure 1-F(x_{(m)}). The number of successes equals m-1 and the number of failures equals n-m, because the m-th value x_{(m)} in a sample of size n is such that m-1 values are smaller and n-m values are larger than it.

Clearly, the process of counting successes follows a binomial distribution. (Recall that the probability of getting exactly k successes in n trials is given by the pmf p(k;n,p) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}. In words, k successes occur with probability p^k and n-k failures occur with probability (1-p)^{n-k}. However, the k successes can occur anywhere among the n trials, and there are \binom{n}{k} different ways of distributing k successes in a sequence of n trials. A little more about it.)
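A tiny numeric illustration of the pmf (a sketch assuming scipy is available; the numbers n, k, p are arbitrary):

```python
from math import comb
from scipy.stats import binom

n, k, p = 10, 3, 0.4
manual = comb(n, k) * p**k * (1 - p)**(n - k)     # the formula above
print(manual, binom.pmf(k, n, p))                 # both are approximately 0.215
```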

The probability for the parent distribution to take a value close to x_{(m)} is the probability element dF(x_{(m)})=f(x_{(m)})\,dx.

The probability that the sample falls around x_{(m)} in such a way that m-1 elements lie to the left of it, n-m lie to the right, and one observation of the random variable X lies in its neighborhood is therefore (there are n ways to choose which observation lands in the neighborhood, and C_{n-1}^{m-1} ways to choose which of the remaining n-1 observations fall below it):

n\,C_{n-1}^{m-1}\,[F(x_{(m)})]^{m-1}\,[1-F(x_{(m)})]^{n-m}\,f(x_{(m)})\,dx

Note that this is exactly P[x_{(m)}<X_{(m)}<x_{(m)}+dx_{(m)}], thus:

\varphi_n (x_{(m)})\,dx_{(m)} = n\,C_{n-1}^{m-1}\,[F(x_{(m)})]^{m-1}\,[1-F(x_{(m)})]^{n-m}\,f(x_{(m)})\,dx

Furthermore, if in switching from f(x) to \varphi_n (x_{(m)}) we maintain the scale of the x axis, then dx_{(m)}=dx and

\varphi_n (x_{(m)}) = n\,C_{n-1}^{m-1}\,[F(x_{(m)})]^{m-1}\,[1-F(x_{(m)})]^{n-m}\,f(x_{(m)}) = \frac{n!}{(m-1)!\,(n-m)!}\,[F(x_{(m)})]^{m-1}\,[1-F(x_{(m)})]^{n-m}\,f(x_{(m)})

The expression shows that the density of an order statistic depends on the parent distribution, the rank, and the sample size. Note the distributions of the extreme values, when m=1 and m=n.

The rightmost (maximum) element has the distribution function F^{n}(x) and the leftmost (minimum) 1-[1-F(x)]^n. As an example, consider the order statistics of ranks m=1,2,3 for a sample of size n=3 from the uniform distribution on the interval [0,1]. Applying the last formula with f(x)=1 (and thus F(x)=x) we get the density of the smallest element

\varphi_3 (x_{(1)})=3(1-2x+x^2);

the middle element

\varphi_3 (x_{(2)})=6(x-x^2)

and the maximal

\varphi_3 (x_{(3)})=3x^2.

In full accordance with intuition, the density of the middle value is symmetric with respect to the parent distribution, whereas the densities of the extreme values are confined to the range of the parent distribution and pile up toward the corresponding endpoint.
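Here is a minimal check of these three densities by simulation (a sketch in Python with numpy and matplotlib assumed): sort many samples of size 3 from the uniform distribution and compare the histogram of each rank with the formulas above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# each row is a sorted sample: x_(1), x_(2), x_(3)
samples = np.sort(rng.uniform(size=(100_000, 3)), axis=1)

x = np.linspace(0, 1, 200)
# analytic densities from the formulas above (3(1-x)^2 = 3(1-2x+x^2))
densities = [3 * (1 - x) ** 2, 6 * (x - x ** 2), 3 * x ** 2]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for m, (ax, phi) in enumerate(zip(axes, densities)):
    ax.hist(samples[:, m], bins=50, density=True, alpha=0.5)   # simulated rank m + 1
    ax.plot(x, phi)                                            # analytic density
    ax.set_title(f"m = {m + 1}, n = 3")
plt.show()
```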

Note another interesting property of order statistics. Summing the densities \varphi_3 (x_{(1)}), \varphi_3 (x_{(2)}), \varphi_3 (x_{(3)}) and dividing the result by their number gives:

\frac{1}{3}\sum\limits_{m = 1}^{3} \varphi_3(x_{(m)}) = \frac{1}{3}(3 - 6x + 3x^2 + 6x - 6x^2 + 3x^2) = 1 = f(x)

on the interval [0,1]

The normalized sum of the order-statistic densities turned out to equal the parent density f(x). It means that the parent distribution is a mixture of the order statistics X_{(m)}. This holds for any n, not just n=3: by the binomial theorem the rank densities sum to n\,f(x). It is just what was mentioned above: after sorting the general set into ranks, we can mix the drawers back together and recover the general set.
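A quick numeric check of this averaging property, using the density formula derived above (a sketch assuming numpy and scipy; the standard normal parent and n = 7 are arbitrary choices):

```python
import numpy as np
from math import comb
from scipy.stats import norm

def phi(x, m, n):
    """Density of the m-th order statistic in a sample of size n (parent: standard normal)."""
    c = n * comb(n - 1, m - 1)                 # n * C_{n-1}^{m-1} = n!/((m-1)!(n-m)!)
    F, f = norm.cdf(x), norm.pdf(x)
    return c * F ** (m - 1) * (1 - F) ** (n - m) * f

n = 7
x = np.linspace(-3, 3, 5)
avg = sum(phi(x, m, n) for m in range(1, n + 1)) / n
print(np.allclose(avg, norm.pdf(x)))           # True: the average equals the parent density
```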

Further reading: Efimov (1980); Arnold & Balakrishnan (2008).


Math is the extension of common sense

What makes math? Isn’t it just common sense?

Yes. Mathematics is common sense. On some basic level, this is clear. How can you explain to someone why adding seven things to five things yields the same result as adding five things to seven? You can’t: that fact is baked into our way of thinking about combining things together. Mathematicians like to give names to the phenomena our common sense describes: instead of saying, “This thing added to that thing is the same thing as that thing added to this thing,” we say, “Addition is commutative.” Or, because we like our symbols, we write:

For any choice of a and b, a + b = b + a.

Despite the official-looking formula, we are talking about a fact instinctively understood by every child.

Multiplication is a slightly different story. The formula looks pretty similar:

For any choice of a and b, a × b = b × a.

The mind, presented with this statement, does not say “no duh” quite as instantly as it does for addition. Is it “common sense” that two sets of six things amount to the same as six sets of two?

Maybe not; but it can become common sense. Eight groups of six are the same as six groups of eight. Not because it is a rule I’ve been told, but because it could not be any other way.

We tend to teach mathematics as a long list of rules. You learn them in order and you have to obey them, because if you don’t obey them you get a C-. This is not mathematics. Mathematics is the study of things that come out a certain way because there is no other way they could possibly be.

Now let’s be fair: not everything in mathematics can be made as perfectly transparent to our intuition as addition and multiplication. You can’t do calculus by common sense. But calculus is still derived from our common sense—Newton took our physical intuition about objects moving in straight lines, formalized it, and then built on top of that formal structure a universal mathematical description of motion. Once you have Newton’s theory in hand, you can apply it to problems that would make your head spin if you had no equations to help you. In the same way, we have built-in mental systems for assessing the likelihood of an uncertain outcome. But those systems are pretty weak and unreliable, especially when it comes to events of extreme rarity. That’s when we shore up our intuition with a few sturdy, well-placed theorems and techniques, and make out of it a mathematical theory of probability.

The specialized language in which mathematicians converse with each other is a magnificent tool for conveying complex ideas precisely and swiftly. But its foreignness can create among outsiders the impression of a sphere of thought totally alien to ordinary thinking. That’s exactly wrong.

Math is like an atomic-powered prosthesis that you attach to your common sense, vastly multiplying its reach and strength. Despite the power of mathematics, and despite its sometimes forbidding notation and abstraction, the actual mental work involved is little different from the way we think about more down-to-earth problems. I find it helpful to keep in mind an image of Iron Man punching a hole through a brick wall. On the one hand, the actual wall-breaking force is being supplied, not by Tony Stark’s muscles, but by a series of exquisitely synchronized servomechanisms powered by a compact beta particle generator. On the other hand, from Tony Stark’s point of view, what he is doing is punching a wall, exactly as he would without the armor. Only much, much harder.

To paraphrase Clausewitz: Mathematics is the extension of common sense by other means.

Without the rigorous structure that math provides, common sense can lead you astray. That’s what happened to the officers who wanted to armor the parts of the planes that were already strong enough. But formal mathematics without common sense—without the constant interplay between abstract reasoning and our intuitions about quantity, time, space, motion, behavior, and uncertainty—would just be a sterile exercise in rule-following and bookkeeping. In other words, math would actually be what the peevish calculus student believes it to be.

A passage from Ellenberg’s book “How Not To Be Wrong…”. Kinda liked it.

Science is art

Some people have an ear for music. Hm… can one have an eye for a movie, an arm for a musical instrument… or a wrist for theory? Is it really about a particular organ, or is it about the brain’s capacity to “listen” and “sing”? Does the brain of an artist work any differently from that of a scientist?

A scientific paper does not build bridges or irrigate fields; its intention is not to change the world. Its sole purpose, rather, is to be beautiful: a beautiful proof of a theorem, an insightful outlook on a phenomenon, or a clever econometric identification strategy. Conceptually and behaviorally it is indistinguishable from a poem or a painting: an idea packed into a collection of symbols. However, to unpack the symbols and understand the true meaning of the idea one needs training (one needs to go through a specific type of experience).

A collection of sensations creates an idea, something that is born and dies in the confines of one’s mind. Yet one can pick a metaphor to mimic the idea. The surface of a metaphor is a collection of symbols, but the creator’s hope is to communicate the idea. A metaphor can be mathematical, visual, or acoustic. The amazing minds who lived centuries ago also chose to be artists. The world of today offers millions of activities to choose from, but centuries ago the menu was limited: agriculture, army, church, or making selfies for the nobles (for an arbitrarily chosen collection of people who manage the resources, again, in an arbitrary way). The Michelangelos of today don’t do art, they do science, because they are not limited in their choice. Art is a degenerate form of science.

Using metaphors comes naturally to the human mind, and the mathematical family of metaphors has a lot of advantages. One could say “that guy looks like a hockey player”. If that is true, then by studying a hockey player we can understand “that guy”. A political talk show is a great demonstration of the limitations of verbal reasoning: speakers change definitions, make contradictory statements and, worst of all, talk too much. The (Occam’s) principle of parsimony is hard-wired into mathematics. Verbosity in mathematics looks silly, but it is considered an embellishment in verbal reasoning. Reasoning needs to be compact to overcome the brain’s processing limitations.

Some will be able to understand the idea from the symbols, but some will never go beyond memorizing the symbols. Religion is most often misunderstood as a collection of silly symbols. It is a sad outcome when sinners are fooled into believing that by observing the signs of an idea they adhere to the idea itself. Or theomachists, who always fight the signs of religion, being oblivious to the concepts of faith, peace, will, patience and love (all religions exhort us to be, in the language of the n-player prisoner’s dilemma game, unconditional contributors).

A true creator perpetuates the beauty of his mind by picking metaphors that live longer than his body. He packs a dense collection of processed, unobserved sensations into something that lives on.

A conjecture on mating

What is dating and why do we even need it? Here is my theory. I have not cross-referenced it with the existing sciency literature, so it could lack originality or could be just nuts (it’s really just some random thoughts). The theory naturally follows from several observations, so I start with those. Medical science has a good understanding of what a perfect, textbook human body looks like. In reality, a perfect body does not exist. It is just an idea that is useful for understanding what is right and what is wrong with a patient. A deviation from this conceptual body can help in classification. It is noteworthy that there has been considerable change in classifying deviations into right and wrong: many conditions that were previously classified as requiring treatment are today left to run their own course.

Nature never creates perfect bodies, because it is not sure what a perfect body looks like. The process of creating a human by nature can be understood as follows: design a perfect, textbook body and then introduce disturbances into some system of the body. The disturbances are known as mutations, and the whole process is known as evolution. Put differently, nature randomizes human bodies and then the environment trims the randomization that was not useful. It is conceptually indistinguishable from how a programmer develops the code of a program. There is a core functionality, and over time the programmer introduces features and sees if they make the program better. The key difference is that the programmer controls the “trimming” process. He knows very well which features came through testing and which require further testing because they are promising but the initial tests were not very successful. Well, here is some amazingly awesome news: there is a conceptual analogue of a programmer. A woman. Nature is agnostic about which features become successful and which do not. Let’s start with the opposite case: if there were no women, then the progress of medical science (with its motto “no pain is great”) would generate this:

Ok, that might be obvious. So, having women improves the sorting process and trims unsuccessful mutations. But how exactly? This is the best bit. The process of a woman picking a man has exactly the same characteristics as a patient picking a doctor or a firm picking a worker. When you come to see a doctor, you would like him to know medical stuff better than you do. When you come to see a surgeon, you would like him to make the right choices during surgery, when you are asleep and unavailable for consultation. The problem is that when you see a doctor, you see a head, two legs, and two arms. These observables are not very useful for inferring the unobserved characteristics of a person that actually matter to you. That is why you use potentially useless and silly observables as proxies for the unobservables that matter.

Let me start with “a flip of a coin in a vacuum”. Imagine all people have perfect, textbook bodies; they are exactly the same. Then people can form a group to achieve economies of scale (to hunt elephants, or to produce iPhones, or to make healthy, fat, well-nourished kids) with anyone. Then one does not need friends, family, or anyone really. There is no need to designate anyone as special. If you feel like having a beer or sex, you just talk to the person next to you, ask if he/she doesn’t mind, and just do it. The same happens with kids: you have kids, and if you need anyone to babysit, you just give the kid to the next person on the street. One does not even have to go home to the same place every night; just crash in the closest bed. This is the benchmark.

Now imagine nature intentionally introduces noise into every person: sort of introducing random features, and then the environment needs to test the features by killing the versions that are no good. Now people are different and they possess characteristics that could be useful in the current environment or could be useless. Now it matters who is in your group. To form groups quickly our brain can classify people into bad (immoral) and good (moral) persons. There are even whole institutions that people created to facilitate the sorting: reputation and even… church. The church is kinda like education, it helps to send signals about types. A religious person is an unconditional contributor (speaking the language of the public-goods provision game). Religious people are usually intense, so for them it is a computational shortcut (this requires extra commentary, don’t worry about it at this point; in short, people usually spend tons of time sorting others into bad and good and trying to come out as good themselves. That’s what people’s brains are hardwired to do. Some of us decide not to spend too much time strategizing and simply contribute all the time (hard-working people, like productive scientists), but they expect others to contribute when it is crucial for them).

A family is a special case of a group. A man possesses some properties that are unobserved; thus a woman chooses observables as proxies. A good physique is good, but it is not a sufficient indicator of skills. Money is better. Both already serve as better proxies; they convey more information. Those observables are more likely to indicate the strong providing properties of a person. A woman also wants a man to be responsive to incentives; thus non-cognitive skills also matter. A woman wants someone who has good social skills. This approach refines the sorting process and makes it very intelligent: you could be narcoleptic, so you would probably lose a fight with a crocodile and wouldn’t survive a day in a forest, but you can still do fine as a scientist. In this manner the narcoleptic gene persists in the population even though it manifests in the really weird behavior of passing out randomly during the day. It is not designated for trimming; on the contrary, it is designated as a potential feature that might over the years become part of a “perfect”, textbook body.

It could be shown that if the world consisted of identical people, a woman would interact with whoever is closest; thus the total amount of time women interact with men (in aggregate) is:

S=M \times T

S is given; it could be that interaction is needed due to a physical property of the environment (a group is necessary because there are many dinosaurs, or food is scarce, so several people need to search for it to get a healthy, fat kid). For the reasoning at hand it is taken as given. M is the number of men in the social fragment. If all men are the same, then a woman is indifferent, and the whole stock of men is used because the time per man (T) is high. If the environment is too risky, women are more cautious and socialize less and less per man; thus, for a given stock of men in a given environment, more and more men are designated for trimming.

I think that this conjecture naturally follows the idea advocated by many famous social scientists (e.g., Hayek, Friedman). People have evolved to construct social structures, and those structures work with fantastic efficiency. States, markets, mating: all of these social structures existed way before scientists had any say in them. People should interfere with them as little as possible, and any interference has to be very gentle. A woman has to be free in her choices, because if she is not, then tons of terrible men are not trimmed.

Some interesting manifestations of it: a ban on divorce produces massive suicide rates and violence; a ban on abortions produces massive criminalization.

If you still haven’t lost interest in this topic, don’t forget these articles: 1, 2.

Game theory is ridiculous

Game theory is ridiculous. The first acquaintance with the main “solution concepts” usually produces the question “wtf?!” in a person with good common sense.

Good economics approximates the essentials with assumptions to overcome the limitations of verbal reasoning. Assumptions in game theory mostly exist to confuse readers without really saying anything that matters.

I believe those are not assumptions but conventions, and the only question that a person with good common sense should be asking is: “why do so many individually silly things, when they come together, tell so many astonishingly amazing stories?!”

Why such long lines at terrible restaurants…

It must be a good restaurant since the line is so long. Hm… you likely just failed to update your beliefs in a rational way.

Imagine you are in a classroom and there is an urn with three balls in front of everyone. You don’t see the colours of the balls, but you do know that, with equal probability, the urn is either majority blue (2 blue, 1 red) or majority red (1 blue, 2 red). Since you don’t know which urn is actually there (the true state of the world), you need some evidence before making a guess. Now every person in the class, one by one, comes and picks one ball from the urn and, without showing it, announces his guess about the urn. Believe it or not, this is your restaurant-choice situation.

The two possibilities for the urn are an analogue of whether the restaurant is good or bad. A person who comes to make a choice has several pieces of information to combine. Taking one ball from the urn is the same as having read some reviews of the restaurant beforehand. The information is not perfect: the reviews could be biased or not representative of your taste. However, you have also observed the choices of the people before you. You do not know their private signals (what ball they picked from the urn, i.e., what their conclusion was after studying the restaurant reviews), but you do know their choices.

Claiming that the restaurant must be good because the line is long would be valid only if all the people who came sequentially followed only their private signals. Then, when your turn comes to make a choice, the line would represent independent draws of balls from the urn. If the true state of the world were that the urn is majority blue, many more people would say so.

The thing is that those announcements are clearly not independent. At some point a person whose private signal says the urn is majority blue may see too many people choosing majority red, abandon his private signal, and follow the crowd. So when it is your turn to make a choice and you observe a line (i.e., heaps of people announcing their choice), it does not necessarily mean that the restaurant is good. Put differently, the naive reading does not account for the correlation between public beliefs (the belief based on the observed choices, before seeing your private signal) and private signals.
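A minimal simulation sketch of the urn story (Python with numpy assumed; the tie-breaking rule, that an indifferent person follows his own ball, is my assumption in the spirit of the standard information-cascade model):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_cascade(n_agents=30, majority_blue=True):
    """Sequential guesses in the two-urn model: P(blue ball | majority blue) = 2/3."""
    k = 0          # net number of revealed "blue" signals; public log-odds = k * log 2
    guesses = []
    for _ in range(n_agents):
        ball_is_blue = rng.random() < (2 / 3 if majority_blue else 1 / 3)  # private signal
        posterior = k + (1 if ball_is_blue else -1)   # posterior log-odds in units of log 2
        if posterior > 0:
            guess_blue = True
        elif posterior < 0:
            guess_blue = False
        else:
            guess_blue = ball_is_blue                 # indifferent: follow the private signal
        guesses.append("B" if guess_blue else "R")
        # While |k| <= 1 the guess still depends on the private signal, so observers can
        # invert it and update the public belief; afterwards a cascade has started and
        # the announcement carries no new information.
        if abs(k) <= 1:
            k += 1 if guess_blue else -1
    return "".join(guesses)

# a couple of early "R" guesses can lock everyone into "R" even when the urn is majority blue
print(run_cascade())
```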

Well, that is herding. And here is a presentation about it…

If that stuff sounded crazy awesome, then read this, and at the very, very end, this.

It is obviously not about restaurants at all; it could be the choice of a major for a college degree. Is being a doctor a good choice or not? There is no way to know for sure; you just have to combine your private signal with the public belief. If you don’t have a strong private belief, it will be overwhelmed by the public belief and you will just follow the crowd. It could also explain why in Russia or Germany, during good times, aaalll people would put out Nazi flags or hang Stalin’s portrait on the wall at home and in the office. Or pretty much anything that involves guessing the state of the world by combining information from your own guess and the choices of others.

Practical advice on non-parametric density estimation

Always start from the histogram; any non-parametric density estimation method is essentially a fancier version of a histogram.

Compare the problem of choosing an optimal bin size in a histogram with the choice of the bandwidth h in a kernel estimator.

[Figures: three histograms of the same data]
- The number of bins is too small: important features of the distribution, such as the mode, are not revealed.
- The number of bins is optimal (optimal according to Sturges’ rule, but the rule is beside the point).
- The number of bins is too large: the distribution is overfitted.

The point of the exercise is to reveal all the features of the data; that is what is important to keep in mind.

[Figures: kernel density estimates of the same data]
- The bandwidth h is too large: local features of the distribution are not revealed.
- The bandwidth h is selected by a rule of thumb called the normal reference bandwidth.
- The bandwidth h is too small: the distribution is overfitted.

 

 

While a histogram takes an average within a bin, kernel estimation naturally extends this idea and takes a fancier (weighted) average around a given point. How much information around the point is used is governed by the bandwidth. Conceptually, a bandwidth and a bin width are identical.
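A minimal sketch of the comparison (Python with numpy, scipy and matplotlib assumed; the bimodal toy data and the bandwidth factors are arbitrary choices, and scipy’s default Scott’s-rule factor stands in for the rule-of-thumb bandwidth):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# bimodal toy data: the "local features" a too-wide bin/bandwidth would hide
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])
grid = np.linspace(-4, 4, 400)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3.5))
ax1.hist(data, bins=30, density=True)              # always start from the histogram
ax1.set_title("histogram, 30 bins")

for factor, label in [(0.2, "h too small"), (1.0, "default (Scott) h"), (5.0, "h too large")]:
    kde = gaussian_kde(data)
    kde.set_bandwidth(kde.factor * factor)         # rescale the default bandwidth factor
    ax2.plot(grid, kde(grid), label=label)
ax2.legend()
ax2.set_title("kernel estimates")
plt.show()
```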

 

And now take a look at a perfect application of the idea in

Nissanov, Zoya, and Maria Grazia Pittau. “Measuring changes in the Russian middle class between 1992 and 2008: a nonparametric distributional analysis.” Empirical Economics 50.2 (2016): 503-530.

[Figure: Comparison between income distributions in the period 1992–2008. Authors’ calculations on weighted household income data from RLMS. Kernel density estimates are obtained using an adaptive bandwidth.]

Going back to the advice: keep in mind that you are doing this to reveal the features of the data, and the kernel estimate has to be strictly more informative than a histogram; otherwise the computational costs are not justified.