3 - Distribution¶

  • A set of values and their corresponding probabilities
    • e.g. the distribution of how many times each word appears in the dictionary

PMF - Probability Mass Function

  • In Python, a key:value mapping (value → probability)

It is basically a pandas Series
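The key:value idea can be sketched without the library, as a plain dict mapping each outcome to its probability (a hypothetical minimal PMF, not the empiricaldist implementation):

```python
# A minimal PMF as a plain dict: outcome -> probability
coin = {'heads': 1/2, 'tails': 1/2}

# Probabilities must sum to 1
assert abs(sum(coin.values()) - 1.0) < 1e-12

# A missing outcome has probability 0, just like Pmf returns below
print(coin.get('edge', 0))  # 0
```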

In [1]:
from empiricaldist import Pmf
coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
print(coin)
heads    0.5
tails    0.5
dtype: float64

A d6 die:

In [3]:
die = Pmf.from_seq([1,2,3,4,5,6])
die
Out[3]:
probs
1 0.166667
2 0.166667
3 0.166667
4 0.166667
5 0.166667
6 0.166667

Letters in a word

In [8]:
letters = Pmf.from_seq(list('Mississippi'))
letters
Out[8]:
probs
M 0.090909
i 0.363636
p 0.181818
s 0.363636
In [12]:
print(f"{letters('s')=}")
print(f"{letters('x')=}")
letters('s')=np.float64(0.36363636363636365)
letters('x')=0

Modeling the cookie problem:¶

Bowl 1 has 3/4 vanilla cookies and 1/4 chocolate; Bowl 2 has 1/2 of each. A bowl is chosen at random and a vanilla cookie is drawn. What is the probability it came from Bowl 1?

In general:

$$ p(A \mid B) = \frac{p(B \mid A) \cdot p(A)}{p(B)} $$

In this problem:

$$ p(B_1 \mid V) = \frac{p(V \mid B_1) \cdot p(B_1)}{p(V)} $$

Substituting:

$$ p(B_1 \mid V) = \frac{\frac{3}{4} \cdot \frac{1}{2} }{\frac{5}{8}} = \frac{3}{5} $$
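As a quick arithmetic check of the substitution above, in plain Python (variable names are my own, values from the formula):

```python
# Values from the cookie problem
p_b1 = 1/2          # prior p(B1)
p_v_given_b1 = 3/4  # likelihood p(V | B1)
p_v = 5/8           # total probability p(V) = 1/2 * 3/4 + 1/2 * 1/2

# Bayes' theorem
p_b1_given_v = p_v_given_b1 * p_b1 / p_v
print(p_b1_given_v)  # 0.6
```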

In [14]:
prior = Pmf.from_seq(['Bowl_1', 'Bowl_2'])
prior
Out[14]:
probs
Bowl_1 0.5
Bowl_2 0.5
In [15]:
likelihood_vanilla = [3/4, 1/2]
likelihood_chocolate = [1/4, 1/2]
posterior = prior * likelihood_vanilla
posterior
Out[15]:
probs
Bowl_1 0.375
Bowl_2 0.250
In [17]:
posterior.normalize()
Out[17]:
np.float64(1.0)
In [21]:
posterior
Out[21]:
probs
Bowl_1 0.6
Bowl_2 0.4
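likelihood_chocolate was defined above but never used; for contrast, here is a sketch of the same update for a chocolate cookie, in plain Python (no empiricaldist needed):

```python
# Update after drawing a chocolate cookie instead of vanilla
prior = {'Bowl_1': 1/2, 'Bowl_2': 1/2}
likelihood_chocolate = {'Bowl_1': 1/4, 'Bowl_2': 1/2}

unnorm = {b: prior[b] * likelihood_chocolate[b] for b in prior}
total = sum(unnorm.values())  # p(chocolate) = 3/8
posterior = {b: p / total for b, p in unnorm.items()}
print(posterior)  # Bowl_1 -> 1/3, Bowl_2 -> 2/3
```

A chocolate cookie shifts the probability toward Bowl 2, the bowl with more chocolate.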

Same problem, now with 101 bowls¶

  • Bowl 0 has 0% vanilla
  • Bowl 1 has 1% vanilla
  • etc...
  • Bowl 100 has 100% vanilla
In [33]:
prior_101 = Pmf.from_seq([f'Bowl_{n:03}' for n in range(0,101)])

Assuming a uniform prior (the bowls are equally likely to be selected):

In [34]:
prior_101
Out[34]:
probs
Bowl_000 0.009901
Bowl_001 0.009901
Bowl_002 0.009901
Bowl_003 0.009901
Bowl_004 0.009901
... ...
Bowl_096 0.009901
Bowl_097 0.009901
Bowl_098 0.009901
Bowl_099 0.009901
Bowl_100 0.009901

101 rows × 1 columns

In [83]:
likelihood_vanilla_101 = [n/100 for n in range(0,101)]
likelihood_chocolate_101 = [1 - n/100 for n in range(0,101)]
posterior_101 = prior_101 * likelihood_vanilla_101
posterior_101.normalize()
posterior_101
Out[83]:
probs
Bowl_000 0.000000
Bowl_001 0.000198
Bowl_002 0.000396
Bowl_003 0.000594
Bowl_004 0.000792
... ...
Bowl_096 0.019010
Bowl_097 0.019208
Bowl_098 0.019406
Bowl_099 0.019604
Bowl_100 0.019802

101 rows × 1 columns

In [84]:
posterior_101.sum() == 1
Out[84]:
np.True_
In [85]:
posterior_101.max_prob()
Out[85]:
'Bowl_100'
In [86]:
from matplotlib import pyplot as plt

plt.plot(range(0, 101), prior_101.values, label='Prior')
plt.plot(range(0, 101), posterior_101.values, label='Posterior')
plt.xlabel('Bowl number')
plt.ylabel('Probability')
plt.xlim(0, 100)
plt.ylim(0, 0.02)
plt.legend(['Prior', 'Posterior'])
Out[86]:
<matplotlib.legend.Legend at 0x7f6111cd6c60>

Now let's run the same problem through a second round.

We put the cookie back, draw again, and get another vanilla cookie.

How do we update our probabilities?

In [87]:
posterior_101_2 = posterior_101 * likelihood_vanilla_101
posterior_101_2.normalize()
Out[87]:
np.float64(0.6699999999999999)
In [88]:
posterior_101_2
Out[88]:
probs
Bowl_000 0.000000
Bowl_001 0.000003
Bowl_002 0.000012
Bowl_003 0.000027
Bowl_004 0.000047
... ...
Bowl_096 0.027238
Bowl_097 0.027808
Bowl_098 0.028385
Bowl_099 0.028967
Bowl_100 0.029555

101 rows × 1 columns

In [89]:
from matplotlib import pyplot as plt

plt.plot(range(0, 101), prior_101.values, label='Prior')
plt.plot(range(0, 101), posterior_101.values, label='Posterior')
plt.plot(range(0, 101), posterior_101_2.values, label='Posterior 2')

plt.xlabel('Bowl number')
plt.ylabel('Probability')
plt.xlim(0, 100)
plt.ylim(0, 0.03)
plt.legend(['Prior', 'Posterior', "Posterior 2"])
Out[89]:
<matplotlib.legend.Legend at 0x7f6111aafe60>

Bowl 100 becomes even more likely.

What if we now draw a chocolate cookie?

Let's update our probabilities.

In [91]:
posterior_101_3 = posterior_101_2 * likelihood_chocolate_101
posterior_101_3.normalize()
Out[91]:
np.float64(0.2462686567164179)
In [93]:
from matplotlib import pyplot as plt

plt.plot(range(0, 101), prior_101.values, label='Prior')
plt.plot(range(0, 101), posterior_101.values, label='Posterior')
plt.plot(range(0, 101), posterior_101_2.values, label='Posterior 2')
plt.plot(range(0, 101), posterior_101_3.values, label='Posterior 3')

plt.xlabel('Bowl number')
plt.ylabel('Probability')
plt.xlim(0, 100)
plt.ylim(0, 0.03)
plt.legend(['Prior', 'Posterior', "Posterior 2", "Posterior 3"])
Out[93]:
<matplotlib.legend.Legend at 0x7f6111bb31a0>
In [101]:
posterior_101_3.max_prob()
Out[101]:
'Bowl_067'

Bowl 67 matches the proportion of vanilla we have observed so far: 2 vanilla out of 3 draws ≈ 67%
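The three updates can also be replayed from scratch; since Bayesian updates commute, applying the likelihoods in any order from the uniform prior gives the same answer. A plain-Python sketch (no empiricaldist; `update` is my own helper):

```python
def update(probs, likelihood):
    """Multiply by the likelihood, then renormalize."""
    unnorm = [p * lk for p, lk in zip(probs, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

prior = [1/101] * 101
lk_vanilla = [n / 100 for n in range(101)]
lk_chocolate = [1 - n / 100 for n in range(101)]

# Same data as above: vanilla, vanilla, chocolate
post = update(update(update(prior, lk_vanilla), lk_vanilla), lk_chocolate)
print(max(range(101), key=lambda n: post[n]))  # 67
```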

The Dice Problem¶

If I have a d6, a d8, and a d12, pick one at random, and roll a 1, what is the probability that I rolled the d6?

In this problem:

$$ p(d_6 \mid 1) = \frac{p(1 \mid d_6) \cdot p(d_6)}{p(1)} $$

Substituting:

$$ p(d_6 \mid 1) = \frac{\frac{1}{6} \cdot \frac{1}{3}}{\frac{1}{3} \cdot \frac{1}{6} + \frac{1}{3} \cdot \frac{1}{8} + \frac{1}{3} \cdot \frac{1}{12}} = \frac{4}{9} \thickapprox 44\% $$

In [107]:
prior = Pmf.from_seq([6,8,12])
likelihood = [1/6, 1/8, 1/12]
posterior = prior * likelihood
posterior.normalize()
posterior
Out[107]:
probs
6 0.444444
8 0.333333
12 0.222222

What if I now roll a 7?

Note that it is impossible for a d6 to roll a 7, so the likelihood function changes:

In [110]:
likelihood_2 = [0, 1/8, 1/12]
posterior_2 = posterior * likelihood_2
posterior_2.normalize()
posterior_2
Out[110]:
probs
6 0.000000
8 0.692308
12 0.307692
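A sanity check in plain Python: two sequential updates are equivalent to a single update with the product of the two likelihoods.

```python
prior = [1/3, 1/3, 1/3]        # d6, d8, d12, chosen uniformly
lk_roll_1 = [1/6, 1/8, 1/12]   # p(roll a 1 | die)
lk_roll_7 = [0, 1/8, 1/12]     # p(roll a 7 | die); impossible on a d6

# One update with the combined likelihood of both observations
combined = [a * b for a, b in zip(lk_roll_1, lk_roll_7)]
unnorm = [p * lk for p, lk in zip(prior, combined)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]
print(posterior)  # [0.0, 9/13 = 0.692..., 4/13 = 0.307...]
```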

Exercise 3-3:

Suppose I have two sock drawers. One contains equal numbers of black and white socks. The other contains equal numbers of red, green, and blue socks. Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair. What is the probability that the socks are white?

For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.

In this problem:

$$ p(White \mid Pair) = \frac{p(Pair \mid White) \cdot p(White)}{p(Pair)} $$

The denominator is the total probability of drawing a matching pair:

$$ p(Pair) = \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{3} = \frac{5}{12} $$

Substituting:

$$ p(White \mid Pair) = \frac{\frac{1}{2} \cdot \frac{1}{4}}{\frac{5}{12}} = \frac{3}{10} $$

In [ ]:
from empiricaldist import Pmf
prior = Pmf({
    'White': 1/2 * 1/2, # P(White | D1) * P(D1)
    'Black': 1/2 * 1/2,
    'Red': 1/3 * 1/2,
    'Green': 1/3 * 1/2,
    'Blue': 1/3 * 1/2,
})
likelihood_pair = [1/2, 1/2, 1/3, 1/3, 1/3] # p(pair | color of first sock)
posterior = prior * likelihood_pair
posterior.normalize()
posterior
Out[ ]:
probs
White 0.300000
Black 0.300000
Red 0.133333
Green 0.133333
Blue 0.133333
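The analytic answer can be cross-checked with a quick Monte Carlo simulation, under the same idealization the exercise states (drawers so large that the two draws are effectively independent):

```python
import random

random.seed(1)
matches = 0
white_matches = 0
for _ in range(200_000):
    # Pick a drawer at random, then draw two socks with replacement
    colors = ['black', 'white'] if random.random() < 0.5 else ['red', 'green', 'blue']
    s1, s2 = random.choice(colors), random.choice(colors)
    if s1 == s2:
        matches += 1
        white_matches += s1 == 'white'
print(white_matches / matches)  # close to 3/10
```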