3 - Distributions¶
- A set of values and their corresponding probabilities
- E.g. the distribution of how many times each word appears in the dictionary
PMF - Probability Mass Function
- In Python: key:value pairs
It is basically a pandas Series
from empiricaldist import Pmf
coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
print(coin)
heads    0.5
tails    0.5
dtype: float64
A d6 die:
die = Pmf.from_seq([1,2,3,4,5,6])
die
|  | probs |
|---|---|
| 1 | 0.166667 |
| 2 | 0.166667 |
| 3 | 0.166667 |
| 4 | 0.166667 |
| 5 | 0.166667 |
| 6 | 0.166667 |
Letters in a word:
letters = Pmf.from_seq(list('Mississippi'))
letters
|  | probs |
|---|---|
| M | 0.090909 |
| i | 0.363636 |
| p | 0.181818 |
| s | 0.363636 |
print(f"{letters('s')=}")
print(f"{letters('x')=}")
letters('s')=np.float64(0.36363636363636365)
letters('x')=0
Modeling the cookie problem¶
In general:
$$ p(A \mid B) = \frac{p(B \mid A) \cdot p(A)}{p(B)} $$
In this problem:
$$ p(B_1 \mid V) = \frac{p(V \mid B_1) \cdot p(B_1)}{p(V)} $$
Substituting:
$$ p(B_1 \mid V) = \frac{\frac{3}{4} \cdot \frac{1}{2} }{\frac{5}{8}} = \frac{3}{5} $$
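As a sanity check (not part of the book's code), the same arithmetic with exact fractions from the standard library:

```python
from fractions import Fraction

p_b1 = Fraction(1, 2)          # prior: both bowls equally likely
p_v_given_b1 = Fraction(3, 4)  # vanilla fraction in bowl 1
p_v_given_b2 = Fraction(1, 2)  # vanilla fraction in bowl 2

# total probability of drawing vanilla, summed over both bowls
p_v = p_v_given_b1 * p_b1 + p_v_given_b2 * (1 - p_b1)
p_b1_given_v = p_v_given_b1 * p_b1 / p_v
print(p_v, p_b1_given_v)  # 5/8 3/5
```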
prior = Pmf.from_seq(['Bowl_1', 'Bowl_2'])
prior
|  | probs |
|---|---|
| Bowl_1 | 0.5 |
| Bowl_2 | 0.5 |
likelihood_vanilla = [3/4, 1/2]
likelihood_chocolate = [1/4, 1/2]
posterior = prior * likelihood_vanilla
posterior
|  | probs |
|---|---|
| Bowl_1 | 0.375 |
| Bowl_2 | 0.250 |
posterior.normalize()
np.float64(0.625)
posterior
|  | probs |
|---|---|
| Bowl_1 | 0.6 |
| Bowl_2 | 0.4 |
Same problem, now with 101 bowls¶
- Bowl 0 has 0% vanilla
- Bowl 1 has 1% vanilla
- etc...
- Bowl 100 has 100% vanilla
prior_101 = Pmf.from_seq([f'Bowl_{n:03}' for n in range(0,101)])
Assuming a uniform prior (each bowl is equally likely to be selected):
prior_101
|  | probs |
|---|---|
| Bowl_000 | 0.009901 |
| Bowl_001 | 0.009901 |
| Bowl_002 | 0.009901 |
| Bowl_003 | 0.009901 |
| Bowl_004 | 0.009901 |
| ... | ... |
| Bowl_096 | 0.009901 |
| Bowl_097 | 0.009901 |
| Bowl_098 | 0.009901 |
| Bowl_099 | 0.009901 |
| Bowl_100 | 0.009901 |
101 rows × 1 columns
likelihood_vanilla_101 = [n/100 for n in range(0,101)]
likelihood_chocolate_101 = [1 - n/100 for n in range(0,101)]
posterior_101 = prior_101 * likelihood_vanilla_101
posterior_101.normalize()
posterior_101
|  | probs |
|---|---|
| Bowl_000 | 0.000000 |
| Bowl_001 | 0.000198 |
| Bowl_002 | 0.000396 |
| Bowl_003 | 0.000594 |
| Bowl_004 | 0.000792 |
| ... | ... |
| Bowl_096 | 0.019010 |
| Bowl_097 | 0.019208 |
| Bowl_098 | 0.019406 |
| Bowl_099 | 0.019604 |
| Bowl_100 | 0.019802 |
101 rows × 1 columns
posterior_101.sum() == 1
np.True_
posterior_101.max_prob()
'Bowl_100'
from matplotlib import pyplot as plt
plt.plot(range(0, 101), prior_101.values, label='Prior')
plt.plot(range(0, 101), posterior_101.values, label='Posterior')
plt.xlabel('Bowl number')
plt.ylabel('Probability')
plt.xlim(0, 100)
plt.ylim(0, 0.02)
plt.legend(['Prior', 'Posterior'])
<matplotlib.legend.Legend at 0x7f6111cd6c60>
Now let's do the same problem, but with a second round.
We put the cookie back and again draw a vanilla cookie.
How do we update our probabilities?
posterior_101_2 = posterior_101 * likelihood_vanilla_101
posterior_101_2.normalize()
np.float64(0.6699999999999999)
posterior_101_2
|  | probs |
|---|---|
| Bowl_000 | 0.000000 |
| Bowl_001 | 0.000003 |
| Bowl_002 | 0.000012 |
| Bowl_003 | 0.000027 |
| Bowl_004 | 0.000047 |
| ... | ... |
| Bowl_096 | 0.027238 |
| Bowl_097 | 0.027808 |
| Bowl_098 | 0.028385 |
| Bowl_099 | 0.028967 |
| Bowl_100 | 0.029555 |
101 rows × 1 columns
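Since each update just multiplies in another likelihood factor before normalizing, two vanilla draws are equivalent to one update with the likelihood squared. A quick check in plain Python, assuming the same uniform prior over bowls 0–100:

```python
# likelihood of vanilla for bowls 0..100
likelihood_vanilla = [n / 100 for n in range(101)]

# after two vanilla draws, the posterior is proportional to likelihood**2
unnorm = [p ** 2 for p in likelihood_vanilla]
total = sum(unnorm)
posterior_two = [p / total for p in unnorm]
print(round(posterior_two[100], 6))  # 0.029555, matching Bowl_100 above
```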
from matplotlib import pyplot as plt
plt.plot(range(0, 101), prior_101.values, label='Prior')
plt.plot(range(0, 101), posterior_101.values, label='Posterior')
plt.plot(range(0, 101), posterior_101_2.values, label='Posterior 2')
plt.xlabel('Bowl number')
plt.ylabel('Probability')
plt.xlim(0, 100)
plt.ylim(0, 0.03)
plt.legend(['Prior', 'Posterior', "Posterior 2"])
<matplotlib.legend.Legend at 0x7f6111aafe60>
Bowl 100 becomes even more likely.
What if we now draw a chocolate cookie?
Let's update our probabilities.
posterior_101_3 = posterior_101_2 * likelihood_chocolate_101
posterior_101_3.normalize()
np.float64(0.2462686567164179)
from matplotlib import pyplot as plt
plt.plot(range(0, 101), prior_101.values, label='Prior')
plt.plot(range(0, 101), posterior_101.values, label='Posterior')
plt.plot(range(0, 101), posterior_101_2.values, label='Posterior 2')
plt.plot(range(0, 101), posterior_101_3.values, label='Posterior 3')
plt.xlabel('Bowl number')
plt.ylabel('Probability')
plt.xlim(0, 100)
plt.ylim(0, 0.03)
plt.legend(['Prior', 'Posterior', "Posterior 2", "Posterior 3"])
<matplotlib.legend.Legend at 0x7f6111bb31a0>
posterior_101_3.max_prob()
'Bowl_067'
Bowl 67 matches the proportion of vanilla observed so far: 2 vanilla in 3 draws ≈ 67%
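We can confirm that 67 maximizes the unnormalized posterior, which after two vanilla draws and one chocolate draw (with a uniform prior) is proportional to (n/100)² · (1 − n/100):

```python
# unnormalized posterior after vanilla, vanilla, chocolate
scores = [(n / 100) ** 2 * (1 - n / 100) for n in range(101)]
best_bowl = max(range(101), key=scores.__getitem__)
print(best_bowl)  # 67
```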
The dice problem¶
If I have a d6, a d8, and a d12, choose one at random, and roll a 1, what is the probability that I rolled the d6?
In this problem:
$$ p(d_6 \mid 1) = \frac{p(1 \mid d_6) \cdot p(d_6)}{p(1)} $$
Substituting:
$$ p(d_6 \mid 1) = \frac{\frac{1}{6} \cdot \frac{1}{3}}{\frac{1}{3} \cdot \frac{1}{6} + \frac{1}{3} \cdot \frac{1}{8} + \frac{1}{3} \cdot \frac{1}{12}} = \frac{4}{9} \thickapprox 44\% $$
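The same computation with exact fractions (a sanity check, not from the book):

```python
from fractions import Fraction

dice = {6: Fraction(1, 6), 8: Fraction(1, 8), 12: Fraction(1, 12)}  # P(1 | die)
prior = Fraction(1, 3)  # each die equally likely a priori

# total probability of rolling a 1, summed over the three dice
p_one = sum(prior * like for like in dice.values())
p_d6 = prior * dice[6] / p_one
print(p_one, p_d6)  # 1/8 4/9
```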
prior = Pmf.from_seq([6,8,12])
likelihood = [1/6, 1/8, 1/12]
posterior = prior * likelihood
posterior.normalize()
posterior
|  | probs |
|---|---|
| 6 | 0.444444 |
| 8 | 0.333333 |
| 12 | 0.222222 |
What if I now roll a 7?
Since a d6 cannot roll a 7, the likelihood function changes:
likelihood_2 = [0, 1/8, 1/12]
posterior_2 = posterior * likelihood_2
posterior_2.normalize()
posterior_2
|  | probs |
|---|---|
| 6 | 0.000000 |
| 8 | 0.692308 |
| 12 | 0.307692 |
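The second-round posterior can also be verified exactly, starting from the first posterior (4/9, 3/9, 2/9):

```python
from fractions import Fraction

posterior_1 = [Fraction(4, 9), Fraction(3, 9), Fraction(2, 9)]  # after rolling a 1
likelihood_7 = [Fraction(0), Fraction(1, 8), Fraction(1, 12)]   # P(7 | d6, d8, d12)

unnorm = [p * l for p, l in zip(posterior_1, likelihood_7)]
total = sum(unnorm)
posterior_2 = [p / total for p in unnorm]
print(posterior_2)  # [Fraction(0, 1), Fraction(9, 13), Fraction(4, 13)]
```

9/13 ≈ 0.6923 and 4/13 ≈ 0.3077, matching the table above.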
Exercise 3-3:
Suppose I have two sock drawers. One contains equal numbers of black and white socks. The other contains equal numbers of red, green, and blue socks. Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair. What is the probability that the socks are white?
For simplicity, let’s assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions
In this problem:
$$ p(White \mid Pair) = \frac{p(Pair \mid White) \cdot p(White)}{p(Pair)} $$
Substituting, with $p(Pair) = 2 \cdot \frac{1}{4} \cdot \frac{1}{2} + 3 \cdot \frac{1}{6} \cdot \frac{1}{3} = \frac{5}{12}$:
$$ p(White \mid Pair) = \frac{\frac{1}{2} \cdot \frac{1}{4}}{\frac{5}{12}} = \frac{3}{10} $$
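The same numbers with exact fractions (a check on the arithmetic, not from the book):

```python
from fractions import Fraction

half = Fraction(1, 2)
# P(first sock is white) = P(drawer 1) * P(white | drawer 1)
p_white = half * half
# P(matching pair): drawer 1 has 2 colors (match prob 1/2 each),
# drawer 2 has 3 colors (match prob 1/3 each)
p_pair = 2 * Fraction(1, 4) * half + 3 * Fraction(1, 6) * Fraction(1, 3)
# Bayes: P(white | pair) = P(pair | white) * P(white) / P(pair)
p_white_given_pair = half * p_white / p_pair
print(p_pair, p_white_given_pair)  # 5/12 3/10
```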
from empiricaldist import Pmf
prior = Pmf({
    'White': 1/2 * 1/2,  # P(White | D1) * P(D1); white socks exist only in drawer 1
'Black': 1/2 * 1/2,
'Red': 1/3 * 1/2,
'Green': 1/3 * 1/2,
'Blue': 1/3 * 1/2,
})
likelihood_pair = [1/2, 1/2, 1/3, 1/3, 1/3]
posterior = prior * likelihood_pair
posterior.normalize()
posterior
|  | probs |
|---|---|
| White | 0.300000 |
| Black | 0.300000 |
| Red | 0.133333 |
| Green | 0.133333 |
| Blue | 0.133333 |