
Data and Statistical Notes



These notes on sampling and statistics are based on videos from Coursera.org.


TYPES OF VARIABLES.

1. Numerical - Take numerical values; arithmetic operations on them make sense.

1. Continuous - Can take any value, e.g. height. (Rounding off height may make it look discrete, but it is not.)

2. Discrete - Takes one of a specific set of values, e.g. the number of cars a household has.

2. Categorical - Take on distinct categories; the values may be numerical, but arithmetic on them makes no sense.

1. Ordinal - Levels have an inherent ordering (e.g. a customer service rating of 1, 2, 3, 4, 5).

2. Regular - Levels have no inherent ordering (e.g. are you a morning person or an afternoon person?).

When two variables show some connection with one another, they are called associated variables.

There are two types of association:

1. Positive

2. Negative

First, always find out what type of variable you are dealing with.

STUDIES.

1. Observational - Collect data in a way that does not interfere with how the data arise. Can only establish an association.

1.1. Retrospective - Uses data from the past.

1.2. Prospective - Data is collected throughout the study.

2. Experiment - Randomly assign subjects to treatments. Can establish causal connections.

An extraneous variable that affects both the explanatory and the response variable, and that makes it seem like there is a relationship between them, is called a CONFOUNDING VARIABLE.

CORRELATION DOES NOT IMPLY CAUSATION.

SAMPLING AND SOURCES OF BIAS.

Cons of a census.

1. Lots of resources

2. Some individuals may be hard to locate or measure, and these people may be different from the rest of the population.

3. Populations rarely stand still; they change constantly.

To taste soup you take a spoonful; deciding that the spoonful is not salty enough is exploratory analysis.

Types of Biases:

1. Convenience sample - Only people who are easily available are included in the study.

2. Non-response - If only a few, non-random people from the randomly sampled group respond, the result is not representative.

3. Voluntary response - Contains only people who volunteer to respond, usually because they have a strong opinion on the issue, and is therefore also not representative.

Difference between voluntary response and non-response: in voluntary response the sampling is not random, whereas in non-response the sampling is random (but the responses are not).

SAMPLING METHODS

1. Simple Random Sampling

Randomly select cases from the population, with each case equally likely to be selected. Like drawing names from a hat.

2. Stratified Sampling

Divide the population into homogeneous strata, then randomly sample from within each stratum.

3. Cluster Sampling

Divide the population into clusters, randomly sample a few of the clusters, and then randomly sample from within those clusters. Unlike strata, the clusters may not be homogeneous, but each cluster is similar to the others, so we can get away with sampling only a few of them.
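
A minimal sketch of the three sampling methods in Python, assuming a hypothetical population stored in a pandas DataFrame with a "neighbourhood" column (all names and numbers here are invented for illustration):

    import pandas as pd

    # Hypothetical population: 1,000 people spread over 10 neighbourhoods
    population = pd.DataFrame({
        "person_id": range(1000),
        "neighbourhood": [i % 10 for i in range(1000)],
    })

    # 1. Simple random sample: every case equally likely to be selected
    srs = population.sample(n=100, random_state=1)

    # 2. Stratified sample: randomly sample within every stratum
    stratified = (population.groupby("neighbourhood", group_keys=False)
                            .apply(lambda g: g.sample(n=10, random_state=1)))

    # 3. Cluster sample: randomly pick a few clusters, then sample within them
    chosen = pd.Series(population["neighbourhood"].unique()).sample(n=3, random_state=1)
    cluster = (population[population["neighbourhood"].isin(chosen)]
               .groupby("neighbourhood", group_keys=False)
               .apply(lambda g: g.sample(n=10, random_state=1)))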

EXPERIMENTAL DESIGN.

1. Control

Compare treatment of interest to a control group.

2. Randomize

Randomly assign subjects to treatments.

3. Replicate

Collect a sufficiently large sample or replicate the entire study

4. Block

Block for variables known or suspected to affect the outcome.

Difference between explanatory variable and blocking variable:

Explanatory variables (factors) are conditions which we can impose on experimental units.

Blocking variables are the characteristics that the experimental units come with, that we would like to control.

Blocking is like stratifying:

Blocking happens during random assignment.

Stratifying happens during random sampling.
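
A small sketch of random assignment, first completely at random and then within blocks (the subject IDs and block labels are made up for illustration):

    import random

    subjects = [f"subject_{i}" for i in range(12)]
    random.seed(1)

    # Completely randomized design: shuffle, then split into treatment/control
    shuffled = subjects[:]
    random.shuffle(shuffled)
    treatment, control = shuffled[:6], shuffled[6:]

    # Blocked design: randomize separately within each block, so both groups
    # end up balanced on the blocking variable
    blocks = {"block_A": subjects[:6], "block_B": subjects[6:]}
    assignment = {}
    for label, members in blocks.items():
        members = members[:]
        random.shuffle(members)
        half = len(members) // 2
        for s in members[:half]:
            assignment[s] = "treatment"
        for s in members[half:]:
            assignment[s] = "control"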

Few new terms

1. Placebo

Fake treatment often used as the control group for medical studies.

2. Placebo Effect

Showing change despite being on placebo.

3. Blinding

Experimental units don't know which group they're in.

4. Double Blind

Both the experimental units and the researchers don't know the group assignments.

VISUALISING NUMERICAL DATA

1. Scatter Plot

The explanatory variable usually goes on the x-axis and the response variable on the y-axis.

Things to bear in mind when evaluating the relationship between two variables:

1.1. Direction

Positive or negative

1.2. Shape

Linear or some other form

1.3. Strength

Strong (indicated by little scatter) or weak (indicated by lots of scatter).

1.4. Any potential outliers.

Investigate these points to make sure they are not data entry errors.

A naïve approach would be to ignore (exclude) the outliers, but these can sometimes be very interesting cases; handling them with careful consideration of the research question and other associated variables is important.

2. Histograms

2.1. Provide a view of the data density.

2.2. Help identify the shape of the distribution.

The width of the bin in the histogram can alter the story that the histogram is conveying.

3. Dot Plot

4. Box Plot

5. Intensity Map
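
A rough matplotlib sketch of a scatter plot and of the same variable shown with two different bin widths (the simulated data are just placeholders):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)               # explanatory variable
    y = 2 * x + rng.normal(size=200)       # response, positively associated with x
    skewed = rng.exponential(size=200)     # a right-skewed variable

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))

    # Scatter plot: explanatory on the x-axis, response on the y-axis
    axes[0].scatter(x, y, alpha=0.5)
    axes[0].set_xlabel("explanatory")
    axes[0].set_ylabel("response")

    # Same data, two bin widths -- the story the histogram tells can change
    axes[1].hist(skewed, bins=5)
    axes[2].hist(skewed, bins=40)
    plt.show()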

MEASURES OF CENTER:

1. Mean

Arithmetic average

2. Median

50th Percentile

3. Mode

Most frequent observation

If these measurements are calculated from a sample they are known as sample statistics.
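
A quick sketch computing the three measures of center on a small made-up sample:

    import statistics

    sample = [2, 3, 3, 4, 5, 5, 5, 9]

    mean = statistics.mean(sample)      # arithmetic average -> 4.5
    median = statistics.median(sample)  # 50th percentile    -> 4.5
    mode = statistics.mode(sample)      # most frequent      -> 5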

MEASURE OF SPREAD

1. Range: max − min

2. Variance

Why do we square the difference?

To get rid of the negatives, so that positive and negative deviations don't cancel each other out.

Large deviations are weighted more heavily than small deviations.

3. Standard Deviation

4. Inter-Quartile Range
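
A sketch of the measures of spread on the same made-up sample as above, with the sample variance written out to show the squaring of deviations:

    import statistics

    sample = [2, 3, 3, 4, 5, 5, 5, 9]
    m = statistics.mean(sample)                        # 4.5

    data_range = max(sample) - min(sample)             # 9 - 2 = 7
    # Squaring removes the signs so deviations don't cancel, and weights large
    # deviations more heavily; dividing by n - 1 gives the sample variance.
    variance = sum((x - m) ** 2 for x in sample) / (len(sample) - 1)
    std_dev = variance ** 0.5                          # same as statistics.stdev(sample)
    q1, q2, q3 = statistics.quantiles(sample, n=4)     # quartiles
    iqr = q3 - q1                                      # inter-quartile range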

ROBUST STATISTICS

We define robust statistics as measures on which extreme observations have little effect.
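
A tiny illustration of robustness with the same made-up sample: adding one extreme observation pulls the mean a long way, while the median barely moves.

    import statistics

    data = [2, 3, 3, 4, 5, 5, 5, 9]
    with_outlier = data + [100]

    statistics.mean(data), statistics.mean(with_outlier)      # 4.5 -> about 15.1
    statistics.median(data), statistics.median(with_outlier)  # 4.5 -> 5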

TRANSFORMING DATA

A transformation is rescaling the data using a function.

When data are very strongly skewed we sometimes transform them so they are easier to model.

Methods

1. Log (natural) Transformation (most common)

To make the relationship between variables more linear and hence easier to model with simple methods.

2. Other Transformations

Goals of Transformation.

To see data structure differently.

To reduce skew and assist modeling.

To straighten a non-linear relationship in a scatterplot.
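
A minimal sketch of a natural log transformation on simulated right-skewed data (the "incomes" variable is invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    incomes = rng.lognormal(mean=10, sigma=1, size=1000)  # strongly right-skewed

    log_incomes = np.log(incomes)   # natural log transformation
    # The transformed values are roughly symmetric (here, normal by construction),
    # which makes them easier to model with simple linear methods.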

EXPLORING CATEGORICAL VARIABLES

1. Frequency table and bar plot

2. Pie Chart

Less helpful than bar plots.

3. Contingency Table.

4. Relative frequencies.

5. Segmented bar plot.

6. Relative frequency segmented bar plot.

7. Mosaic plot.

8. Side-by-side box plots.
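
A brief pandas sketch of a frequency table, bar plot, contingency table, and relative frequencies (the survey data are made up):

    import pandas as pd

    survey = pd.DataFrame({
        "person_type": ["morning", "evening", "morning", "morning", "evening", "evening"],
        "owns_car":    ["yes", "no", "yes", "no", "no", "yes"],
    })

    freq = survey["person_type"].value_counts()        # frequency table
    freq.plot(kind="bar")                               # bar plot (needs matplotlib)
    contingency = pd.crosstab(survey["person_type"], survey["owns_car"])
    rel_freq = pd.crosstab(survey["person_type"], survey["owns_car"],
                           normalize="index")           # relative frequencies by row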

INTRODUCTION TO INFERENCE

PROBABILITY AND DISTRIBUTIONS

Random Process

In a random process we know what outcomes could happen, but we don't know which particular outcome will happen.

1. Frequentist interpretation

The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times

2. Bayesian interpretation

A Bayesian interprets probability as a subjective degree of belief.

Largely popularized by revolutionary advances in computational technology and methods during the last twenty years.

Law of large numbers

The law of large numbers states that as more observations are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome.
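
A short simulation of the law of large numbers with a fair coin (probability of heads assumed to be 0.5):

    import random

    random.seed(1)
    heads = 0
    for n in range(1, 100_001):
        heads += random.random() < 0.5   # simulate one fair coin flip
        if n in (10, 100, 1_000, 10_000, 100_000):
            print(n, heads / n)          # proportion of heads drifts toward 0.5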

DISJOINT EVENTS & GENERAL ADDITION RULE

Disjoint or mutually exclusive events cannot happen at the same time.

1. Union of disjoint events

For any two events A and B, P(A or B) = P(A) + P(B) − P(A & B).

For disjoint events, however, P(A & B) = 0, so P(A or B) = P(A) + P(B).
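
A small worked check of the general addition rule, using one roll of a fair die as the example (not from the notes):

    from fractions import Fraction

    # A = "even", B = "at least 4"; not disjoint, they share {4, 6}
    outcomes = {1, 2, 3, 4, 5, 6}
    A = {2, 4, 6}
    B = {4, 5, 6}

    def p(event):
        return Fraction(len(event), len(outcomes))

    p_A_or_B = p(A) + p(B) - p(A & B)   # 1/2 + 1/2 - 1/3 = 2/3
    assert p_A_or_B == p(A | B)         # matches counting the union {2, 4, 5, 6}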

2. Sample space

A sample space is a collection of all possible outcomes of a trial.

3. Probability distribution

A probability distribution lists all possible outcomes in sample space and the probabilities with which they occur.

3.1. The events must be disjoint

3.2. Each probability must be between 0 & 1

3.3. The probabilities must total 1.

4. Complementary events.

Complementary events are two mutually exclusive events whose probabilities add up to 1.

DISJOINT vs. COMPLEMENTARY

The probabilities of disjoint events do not necessarily add up to 1.

The probabilities of complementary events always add up to 1.

Therefore, complementary events are necessarily disjoint, but the converse is not true.

INDEPENDENT EVENTS

Two processes are said to be independent if knowing the outcome of one provides no useful information about the outcome of the other.

Checking for independence

If P(A | B) = P(A), then A and B are independent.
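
A tiny check of this condition on an invented two-way table, comparing P(A | B) with P(A):

    # Invented contingency table: rows = gender, columns = answered "yes"/"no"
    counts = {("male", "yes"): 40, ("male", "no"): 60,
              ("female", "yes"): 80, ("female", "no"): 120}

    total = sum(counts.values())                                          # 300
    p_yes = (counts[("male", "yes")] + counts[("female", "yes")]) / total # P(yes) = 0.4
    p_yes_given_male = counts[("male", "yes")] / (counts[("male", "yes")]
                                                  + counts[("male", "no")])  # 0.4

    # P(yes | male) equals P(yes), so in this made-up table the answer is
    # independent of gender.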

DETERMINING DEPENDENCE BASED ON SAMPLE DATA

If the observed difference is large, there is stronger evidence that the difference is real.

If the sample size is large even a small difference can provide strong evidence of a real difference.

RULE FOR INDEPENDENT EVENTS

If A and B are independent, P(A & B) = P(A) × P(B).
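
For instance, with two independent fair coin flips (a standard example, not from the notes):

    from fractions import Fraction

    # A = "first flip is heads", B = "second flip is heads"
    p_A = Fraction(1, 2)
    p_B = Fraction(1, 2)

    p_A_and_B = p_A * p_B   # 1/4, matching the single outcome HH out of {HH, HT, TH, TT}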