PPDAC: Research Methods, Sampling, and Data Analysis
PPDAC: Problem, Plan, Data, Analysis, Conclusion
Unit 1: Problem
A clear statement of what is to be learned through the definition of a research question. For example:
- What is the population of interest? (the collection of units to be studied)
  - Concrete: identifiable units
  - Hypothetical: constantly changing
  - Hypothetical: doesn’t exist
- What are the characteristics of interest for each unit? (explanatory or response variables)
- What is the goal of the research?
  - Descriptive: describe a characteristic of the population
  - Causative: determine how changes in explanatory variables affect the response
  - Predictive: predict an outcome based on the characteristics of a unit
Unit 2: Plan
Includes data collection and analysis to address the research question:
- What are the sampling frame/sampling strategies?
Sampling: the process of selecting a subset of units from which data will be collected.
Goals: make the sample representative of the population; use random chance (probability sampling) to avoid bias.
Selection narrows from the population of interest, to the sampling frame, to the sample (largest to smallest).
Undercoverage bias: some groups in the population are left out of the sampling frame.
Types of sampling:
- SRS (Simple Random Sampling): no restrictions or prior knowledge needed; units are chosen by random chance, so each unit has the same chance of selection.
- Stratified: subdivide the population into 2+ strata based on a variable, then take some units from each stratum (either a constant or a non-constant proportion from each).
- Systematic: from a list or rows, select units at a constant interval k, where N = total number of units, n = number wanted in the sample, and k = N/n.
- Cluster: divide the population into 2+ clusters based on a variable, choose some clusters by chance, then take all units from the chosen clusters; a unit's chance of selection equals its cluster's chance.
- Multistage: the sample is selected in 2+ successive stages (for example, combining two of the other sampling methods). Bias introduced in earlier stages cannot be corrected later.
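The first four sampling types above can be sketched in base R. This is a minimal illustration, not a prescribed procedure: the data frame `frame`, its `stratum` and `cluster` columns, the sample size, and the 20% sampling fraction are all hypothetical, and the seed is fixed only so the sketch is reproducible.

```r
# Hypothetical sampling frame: 100 units, each with a stratum and a cluster label.
set.seed(1)  # fixed seed only so the sketch is reproducible
frame <- data.frame(
  id      = 1:100,
  stratum = rep(c("A", "B"), times = c(60, 40)),
  cluster = rep(1:10, each = 10)
)
N <- nrow(frame)  # total number of units in the frame
n <- 20           # number of units wanted in the sample

# Simple random sample: every unit has the same chance of selection.
srs <- frame[sample(N, size = n), ]

# Stratified sample (constant proportion): draw 20% of the units in each stratum.
by_stratum <- split(frame, frame$stratum)
stratified <- do.call(rbind, lapply(by_stratum, function(s) {
  s[sample(nrow(s), size = round(0.2 * nrow(s))), ]
}))

# Systematic sample: random start in 1..k, then every k-th unit, with k = N / n.
k <- N / n
start <- sample(k, size = 1)
systematic <- frame[seq(from = start, by = k, length.out = n), ]

# Cluster sample: choose 2 clusters by chance, then keep all units in them.
chosen <- sample(unique(frame$cluster), size = 2)
cluster_sample <- frame[frame$cluster %in% chosen, ]
```

Here N/n divides evenly (k = 5); when it does not, k is usually rounded, which this sketch does not handle.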
Problematic sampling:
- Non-response: some selected individuals don’t respond.
- Voluntary response: individuals self-select (volunteer), which leads to self-selection bias.
- Convenience sampling: units are chosen because of proximity or accessibility, which leads to sampling bias.
What will be measured for the response variable? How will you deal with explanatory variables? What statistical procedures may you use?
Unit 3: Confounding variables
Presence of extra variables whose effect can’t be separated from that of factors of interest.
- Observational: Measuring variables as they naturally occur.
- Experimental: Impose a ‘treatment’ related to explanatory variable to change response.
Control: Accounting for variation in potential explanatory variables to isolate the impact of factors of interest.
4 Methods:
- Limit variation in the variables; select a narrower sampling frame, hold other variables constant.
- Distribute variation across all treatments; subdivide units into homogeneous groups based on the variable of concern before treatment, ensuring groups are similarly composed.
- Use comparison groups; this allows a change to be attributed to the factor of interest rather than to natural change.
- Collect data on variables of concern: cofactors – additional variables for which data is collected for comparison/explanatory purposes, not the factors of interest. Note – doesn’t prevent confounding but helps identify it.
Unit 4: Randomization
Applies to experiments, using random chance to assign units to treatments or treatment order. Variation still exists.
Replication: having more than one individual in the treatment groups. Example: sleepeze had 30 replicates.
Blocking: forming groups (blocks) of units from the sample that share a characteristic. Units from each block are then assigned to treatments so that every treatment covers the range of the blocking variable. As a result, the treatment groups have a similar composition with respect to the variable by which you block.
Randomized block design:
- First, units are divided into blocks based on pre-existing characteristics.
- Then, we randomly assign units from blocks to treatments.
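The two steps of a randomized block design can be sketched in base R as follows. The data frame `units`, the blocking variable `age_group`, and the treatment labels are hypothetical and chosen only for illustration.

```r
set.seed(2)  # fixed seed only so the sketch is reproducible

# Step 1: hypothetical sample of 12 units, divided into blocks by a
# pre-existing characteristic (age_group).
units <- data.frame(
  id        = 1:12,
  age_group = rep(c("young", "middle", "older"), each = 4)
)
treatments <- c("control", "treated")

# Step 2: within each block, shuffle a balanced set of treatment labels.
units$treatment <- NA_character_
for (b in unique(units$age_group)) {
  in_block <- units$age_group == b
  labels <- rep(treatments, length.out = sum(in_block))
  units$treatment[in_block] <- sample(labels)
}

# Each treatment receives the same number of units from every block.
table(units$age_group, units$treatment)
```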
Matched Pairs Design:
- First, pairs of units from the sample are matched based on similarity across existing characteristics.
- Second, the units within each pair are randomly assigned to different treatments.
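A matched pairs assignment might look like the sketch below. Matching by sorting on a single `baseline` score is an assumption made for illustration; the data frame `units` and the treatment labels "A" and "B" are hypothetical.

```r
set.seed(3)  # fixed seed only so the sketch is reproducible

# Hypothetical sample: 8 units with a baseline score used for matching.
units <- data.frame(id = 1:8, baseline = c(52, 47, 61, 49, 60, 53, 46, 58))

# First: sort by the matching characteristic and pair up neighbouring units.
units <- units[order(units$baseline), ]
units$pair <- rep(1:4, each = 2)

# Second: within each pair, randomly decide which unit gets which treatment.
units$treatment <- unlist(lapply(1:4, function(p) sample(c("A", "B"))))

# Every pair contributes exactly one unit to each treatment.
table(units$pair, units$treatment)
```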
Repeated measures design (pairing in data): each unit in the sample is assigned to all treatments.
Study Types:
- Survey: Observational study that collects information about variables of interest from a sample.
- Cohort studies: Examine the emergence of a specific condition over time in a homogeneous group of individuals; typically prospective and longitudinal, used to connect exposure factors to an outcome, and may yield the incidence rate of the outcome.
- Case-control study: A sample of cases with the outcome of interest is selected and compared against a sample of ‘controls’ known not to have the condition. Cases and controls can be matched; useful for rare conditions to ensure sufficient observations.
Unit 5: Types of variables
- Quantitative:
  - Ratio – zero means none
  - Interval – distance between consecutive values is constant, but zero is arbitrary
  - Discrete – can only take on specific values
  - Continuous – can take on any value within a range
- Categorical:
  - Ordinal – values can be ordered
  - Nominal – values are names with no inherent order
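One way these variable types might be represented in R is sketched below. The variable names and values are hypothetical; the mapping (integer for discrete counts, numeric for continuous measurements, unordered and ordered factors for nominal and ordinal data) is a common convention rather than a rule.

```r
# Hypothetical measurements on five units.
seed_count <- c(3L, 0L, 7L, 2L, 5L)           # quantitative, discrete (counts)
height_cm  <- c(12.5, 9.8, 11.1, 10.4, 13.0)  # quantitative, continuous
site       <- factor(c("north", "south", "north", "east", "south"))  # nominal
strength   <- factor(c("weak", "strong", "medium", "weak", "strong"),
                     levels = c("weak", "medium", "strong"),
                     ordered = TRUE)           # ordinal

is.ordered(site)      # FALSE: nominal values have no order
is.ordered(strength)  # TRUE: ordinal values can be ordered

# Order comparisons are meaningful only for the ordered factor.
weaker_than_strong <- strength < "strong"
```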
Number of samples: in the context of inference, the term ‘sample’ is generally interpreted as ‘comparison group’; think of the study design.
Types of samples – many statistical procedures require understanding the structure of the data collected.
Stats Lab: Basics of R Syntax & Data Exploration
R Syntax Basics:
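Function Call: t.test(x=statstudy, mu=100, alternative="two.sided")
Arguments: x, mu, alternative (text values must be in quotes).
Order: If no = is used, arguments are matched by position, so follow the function's argument order; naming arguments with = allows any order.
Errors/Warnings: Use help() to look up a function's arguments and usage when diagnosing an error.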
Common Errors: Missing objects (specify directory) or unexpected text.
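The argument-matching rules above can be checked with a short sketch. The values in `statstudy` are hypothetical stand-ins, since the notes do not give the data.

```r
# Hypothetical data standing in for the statstudy object.
statstudy <- c(98, 104, 101, 95, 110, 99, 103, 97)

# Named arguments: the order does not matter.
res_named <- t.test(alternative = "two.sided", mu = 100, x = statstudy)

# The unnamed first argument is matched by position to x.
res_positional <- t.test(statstudy, mu = 100, alternative = "two.sided")

# Both calls run the same test.
same_result <- isTRUE(all.equal(res_named$statistic, res_positional$statistic))

# The help page lists the argument names and their order:
# help(t.test)
```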
Lab 2: Data Collection & Exploration
Process: Collect, monitor, explore.
Quality Check: Review patterns, trends, outliers, missing values.
Data Terms:
- Tidy Data: Columns = variables, rows = observations, cells = values.
- Metadata: Info on data (variable descriptions, collection, quality checks).
R Naming Rules:
Case Sensitive: Start names with a letter; no symbols or spaces (use _ instead).
Importing CSV Files
File Path: file.choose() or manually assign the path.
Import: d <- read.csv(file=path) to save the data as an object.
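A self-contained sketch of the import step follows. It writes a small hypothetical CSV to a temporary file so that it runs without user input; the column names and values are invented for illustration.

```r
# Write a small hypothetical CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("plant_id,height_cm,site",
             "1,12.5,north",
             "2,9.8,south",
             "3,11.1,north"), path)

# In an interactive session the path could come from a file picker instead:
# path <- file.choose()

# Read the file and save the result as an object.
d <- read.csv(file = path)
```

Assigning the path to an object first keeps the import line the same whether the path was typed manually or chosen interactively.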
Viewing Data
View: View(d) for spreadsheet view (capital V in base R).
Head: head(d) to preview the first 6 rows.
Structure: str(d) shows the object type, the number of rows and columns, and variable types (int, chr, num).
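These viewing functions can be tried on a small data frame. The data frame `d` below is a hypothetical stand-in for an imported file.

```r
# Hypothetical data frame standing in for an imported file.
d <- data.frame(
  plant_id  = 1:8,
  height_cm = c(12.5, 9.8, 11.1, 10.4, 13.0, 8.7, 9.9, 12.2),
  site      = rep(c("north", "south"), times = 4),
  stringsAsFactors = FALSE  # keep site as character (chr) in all R versions
)

first_rows <- head(d)   # first 6 rows by default
head(d, n = 3)          # a different number of rows can be requested
str(d)                  # reports rows, columns, and each column's type

# View(d) opens the spreadsheet-style viewer; it is meant for interactive sessions.
```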
Data Types & Structures
Types: logical (T/F), integer, numeric, character, complex.
Structures:
- Vector: 1D, same type (e.g., c(1,2,6)).
- Matrix: 2D, same type.
- Dataframe: 2D, mixed types.
- Factors: For categorical data (e.g., factor(x=strength, levels=c("weak", "medium", "strong"))).
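The structures listed above can be built and inspected as in the sketch below. The object names and the contents of `strength` are hypothetical.

```r
v  <- c(1, 2, 6)                                     # vector: 1D, one type
m  <- matrix(1:6, nrow = 2, ncol = 3)                # matrix: 2D, one type
df <- data.frame(id = 1:3, name = c("a", "b", "c"))  # data frame: 2D, mixed types

# Mixing types in a vector coerces every element to a single type.
mixed <- c(1, "two", TRUE)
class(mixed)  # "character"

# Factor with explicitly specified level order; strength is hypothetical data.
strength <- c("medium", "weak", "strong", "weak")
f <- factor(x = strength, levels = c("weak", "medium", "strong"))
levels(f)       # "weak" "medium" "strong"
as.integer(f)   # 2 1 3 1: position of each value in the level order
```

Without the levels argument, factor() would order the levels alphabetically, which is why the order is given explicitly here.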
Subsetting
Dataframe Columns: dataframe$column or attach(dataframe).
Vector Math: vector * 10; add to dataframe with cbind().
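A short sketch of column access, vector math, and cbind() follows. The data frame `d` and its columns are hypothetical.

```r
# Hypothetical data frame.
d <- data.frame(plant_id = 1:4, height_cm = c(12.5, 9.8, 11.1, 10.4))

# Extract one column as a vector with $.
heights <- d$height_cm

# Vector math is applied element by element.
height_mm <- heights * 10

# Add the new vector to the data frame as a named column.
d2 <- cbind(d, height_mm = height_mm)

# An equivalent alternative assigns the column directly:
# d$height_mm <- d$height_cm * 10
```

Using $ keeps it explicit which data frame a column comes from, whereas attach() makes the column names available on their own and can mask other objects with the same names.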
R Markdown Basics
Install Packages: install.packages("rmarkdown"), install.packages("tinytex").
Import Data: Use a code chunk in R Markdown; avoid View().
Chunk Options:
- eval=FALSE: Shows code, does not run it.
- include=FALSE: Runs code, shows no code or output.
- echo=FALSE: Runs code, shows output, hides code.
Markdown Formatting
Symbols: Use $latex_code$.
Lists: Bullet (*) and numbered (1. text) lists.
Tables: Use | for columns, - for rows, + for corners.