Data Visualization: An Overview and Practical Guide
Data Visualization Overview
Important Factors When Designing a Visualization
- Goal/Purpose: This is usually determined before starting.
- Data (What?): What data will be used?
- Audience (Who?): Who is the visualization for?
- Types of Visualization (How?): How will the data be visualized?
Tools for Visualization
- Tableau: Extremely user-friendly, similar to PowerPoint.
- D3.js: JavaScript library for manipulating SVG objects, used for programming visualizations.
- Python: Jupyter Notebook and Matplotlib are popular tools.
- R: R Markdown/R Notebook are commonly used.
- Processing: Programming language and environment for the visual arts (built on Java, with a JavaScript port, p5.js).
Visualization Overview
Why Do We Use Visualization?
- Easy to Understand: No technical knowledge required.
- Big Picture: Shows the overall picture of a problem or situation.
- Cognitive Load: Helps ease the cognitive load of processing data.
Numbers and Statistics vs. Visualization
While numbers and statistics such as the median and standard deviation provide exact, aggregated information, they can require significant cognitive effort to interpret. Visualization bridges that gap by presenting the data in a more easily digestible form.
Why is Visualization Important Now?
- New Technologies: Make visualization accessible to a wider audience.
- Big Data: Facilitates knowledge discovery from large datasets.
Historical Examples of Visualization
- John Snow (1854): Mapped cholera outbreak in London, leading to the discovery of its source.
- Anscombe’s Quartet: Four datasets with nearly identical summary statistics but very different shapes when graphed, highlighting the importance of visualization.
- Selective Attention: The basketball-and-gorilla video shows that people concentrating on one task can miss an obvious event, so visuals need to direct attention deliberately.
- The Door Study: Illustrates change blindness and the limitations of our cognitive abilities.
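The claim about Anscombe’s Quartet can be checked numerically. The sketch below hard-codes the quartet's published values and computes the mean, variance, and Pearson correlation of each dataset; the helper names `pearson` and `summarize` are illustrative, not from the original notes.

```python
from statistics import mean, variance

# Anscombe's quartet. Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics that are nearly identical across the quartet."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows are almost indistinguishable, yet a scatter plot of each dataset looks completely different, which is the point of the example.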
The Power of Visual Aids
Visual aids free up mental capacity and allow our visual system, a parallel pattern recognition machine, to process information more efficiently.
History of Data Visualization
- Laocoön and His Sons (~25 BC): Ancient sculpture that tells a story visually.
- Stonehenge: Early example of data visualization.
- Tokens and Ledgers: Ancient forms of data recording.
- Maps: Existed even before 10,000 BC.
- First Data Maps: 17th century.
- Trade Winds and Monsoons (1686): Map by Edmond Halley, who is also known for Halley’s Comet.
- John Snow (Cholera Outbreak): Mapped the outbreak to identify its source.
- J.H. Lambert (Soil Temperature): Early example of data visualization in science.
- Chronophotography: Captured successive phases of motion in a single image, like stick figures walking.
- Small Multiples: Showed multiple views of the data side by side on the same scale, like frame-by-frame images of a horse in motion.
- “Best Statistical Graphic Ever Drawn”: Charles Joseph Minard’s Carte Figurative, depicting Napoleon’s Russian Campaign (1812).
The Dark Side of Visualization
Misleading Visualizations
Graphs and charts can be misleading due to:
- Errors: Simply wrong information.
- Correlation vs. Causation: Correlation does not imply causation.
- Scales: An axis does not always need to start at 0, but a truncated axis can exaggerate differences; context determines which is appropriate.
- Context: Omitting or distorting context can manipulate perception.
- Perception of Lightness: A sharp light-dark transition at an edge can change how light the adjacent regions appear.
- Color Perception: The same gray looks lighter or darker depending on its surroundings, so black-and-white encodings can be misread.
Psychophysics and Visual Encoding
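Stevens’ Power Law: Relates stimulus intensity to perceived sensation as sensation = k * intensity^a. Different stimuli have different exponents (a), so the same physical change is perceived differently depending on the encoding.
Accuracy in Perception: Position on a common scale and length are judged more accurately than angle, area, and volume.
Color Hue and Saturation: Least accurate visual encodings for quantitative values. Pie charts are hard to read for a similar reason: they rely on angle and area rather than position or length.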
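As a rough illustration of the power law named above, the sketch below evaluates sensation = k * intensity^a for two encodings. The exponent values are assumptions chosen for illustration (about 1.0 for length, about 0.7 for area), and `perceived_magnitude` is a hypothetical helper name.

```python
def perceived_magnitude(intensity, exponent, k=1.0):
    """Power law of perception: sensation = k * intensity ** exponent."""
    return k * intensity ** exponent


# Illustrative exponents only (assumed for this sketch, not taken from the notes).
EXPONENTS = {"length": 1.0, "area": 0.7}

for encoding, a in EXPONENTS.items():
    ratio = perceived_magnitude(2.0, a) / perceived_magnitude(1.0, a)
    print(f"{encoding}: doubling the stimulus is perceived as x{ratio:.2f}")
```

With an exponent below 1, a doubled area is perceived as less than double, which is one way to see why area is a less accurate encoding than length.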
Color Perception in Visualization
RGB (Red, Green, Blue): Additive model based on emitted light, popular for digital displays.
CMYK (Cyan, Magenta, Yellow, Key/Black): Subtractive model based on light absorbed by ink, popular for printing.
HSL (Hue, Saturation, Lightness) and HSV (Hue, Saturation, Value): Alternative color models used in computers.
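The relationship between these color models can be sketched with Python's standard `colorsys` module, which converts RGB to HLS and HSV. The `rgb_to_cmyk` function is a hypothetical helper using the common naive formula; it ignores color management, so real print workflows would likely differ.

```python
import colorsys


def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion for r, g, b in [0, 1] (no color profiles)."""
    k = 1.0 - max(r, g, b)
    if k == 1.0:  # pure black: avoid dividing by zero
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k


orange = (1.0, 0.5, 0.0)
# Note: colorsys returns hue, lightness, saturation in that order (HLS, not HSL).
print("HLS :", colorsys.rgb_to_hls(*orange))
print("HSV :", colorsys.rgb_to_hsv(*orange))
print("CMYK:", rgb_to_cmyk(*orange))
```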
Considerations for Using Color
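- Context: Use color to differentiate elements from one another and from the background.
- Artifacts: Abrupt changes in a color scale can create visual artifacts that are not in the data.
- Prints: Colors shift between screen and print, so check the printed output.
- Bad Combinations: Avoid color combinations that clash or are hard to read.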
Color for Categorical Data
- Visible: All colors should be visible.
- Distinguishable: Avoid confusing color combinations.
- Not Too Many: Limit the number of categories and colors.
- Contrast: Ensure sufficient contrast for readability.
- Colorblindness: Consider colorblind users.
- Grayscale: Test in grayscale.
- Tip: Print the graph to check for visibility and readability.
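The contrast and grayscale checks in this list can be approximated in code. The sketch below assumes 8-bit sRGB colors and uses the relative-luminance and contrast-ratio formulas popularized by the WCAG accessibility guidelines; the function names and the example colors are illustrative.

```python
def _linearize(channel):
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Approximate perceived brightness of an (r, g, b) color, from 0 to 1."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black vs. white)."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


red, green = (255, 0, 0), (0, 255, 0)
print("red vs. green:", round(contrast_ratio(red, green), 2))
print("black vs. white:", round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))
```

Two colors with very different hues can still have a low contrast ratio, which is why a palette that looks fine on screen may become hard to read in grayscale.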
Visualization Maxims
What is Visualized Should Be Visible
- Accurate Representation: Does the visualization accurately represent the data?
- Accessibility: Ensure accessibility for users with disabilities.
- Robustness: Consistent appearance across different media.
- Accurate Visual Encodings: Prefer position on a common scale and length, which are perceived most accurately.
Pre-Attentive Processing (PP)
Pre-attentive processing is the ability of the human visual system to rapidly identify basic visual properties, such as color, size, and orientation, without focused attention. Exploiting PP can lead to easier-to-read and more powerful visualizations.
Design Principles
Maximize Data-Ink Ratio (Minimalism)
Data-Ink Ratio: Proportion of a graphic’s ink devoted to non-redundant data information (data ink divided by total ink). Remove unnecessary ink (chart junk).
Sparklines
Simple visualizations that represent trends in data, often used within text.
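A text-only sparkline is easy to sketch with Unicode block characters. The function below is a minimal illustration, assuming the values should be scaled linearly between the minimum and maximum of the series; `sparkline` and `BARS` are names chosen for this example.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map each value to one of eight block characters, scaled to the data range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero for a constant series
    return "".join(
        BARS[round((v - lo) / span * (len(BARS) - 1))] for v in values
    )


print("Weekly visits:", sparkline([3, 5, 4, 8, 12, 9, 6]))
```

Because the result is a plain string, it can sit inline in a sentence or a table cell, which matches how sparklines are typically used.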
Visualizations for Numeric Data
1-D Data
- 1-D Scatter Plot or Strip Chart: Can be problematic with overlapping data points.
- 1-D “Jittered” Scatterplot: Adds noise to reduce overlap, but sacrifices accuracy.
- Beeswarm Plot: Places data points close together without overlapping, but also sacrifices accuracy.
- Using Alpha: Semi-transparent points look darker where they overlap, indicating density, but heavily overlapped regions still saturate.
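The jittering idea can be sketched without any plotting library. This version assumes the noise is added only on the axis that carries no data, so the plotted values themselves stay exact; jittering the data axis instead would trade accuracy for visibility. The function name, the `width` default, and the sample values are illustrative.

```python
import random


def jitter_positions(values, width=0.2, seed=0):
    """Pair each value with a random offset in [-width, width] on the free axis."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(v, rng.uniform(-width, width)) for v in values]


# Hypothetical measurements with many ties, which would overlap in a strip chart.
data = [5, 5, 5, 6, 6, 7, 7, 7, 7, 9]
for value, offset in jitter_positions(data):
    print(f"value={value}  offset={offset:+.3f}")
```

With Matplotlib, the pairs could be drawn with `plt.scatter`, passing the values on one axis, the offsets on the other, and an `alpha` below 1 to combine jitter with transparency.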
Summarization vs. Aggregation
- Summarization: Represents data with summary statistics, like box plots.
- Aggregation: Represents data with aggregated values, like histograms.
Box Plot
Shows the median, quartiles, and outliers. Weakness: It cannot show how the data are distributed within each quartile.
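The numbers behind a box plot can be computed directly. The sketch below assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles count as outliers (the notes do not specify a rule), and it uses the "inclusive" quartile method from Python's `statistics` module; other tools may interpolate quartiles differently.

```python
from statistics import quantiles


def box_plot_stats(data, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    values = sorted(data)
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical sample with one extreme value.
print(box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```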
Tidy Data
Data formatted with each variable in its own column, each observation in its own row, and each type of observational unit in its own table.
Signs of Untidy Data
- Column headings used as values.
- Multiple tables combined into one.
- Aggregate data in rows.
- Blank rows or columns.
- Missing data due to poor structure.
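The first sign above, column headings used as values, is usually fixed by reshaping the table from wide to long. The sketch below does this in plain Python; the `melt` helper, the column names, and the sample numbers are all hypothetical and only illustrate the idea.

```python
def melt(rows, id_column, var_name, value_name):
    """Turn wide rows (one column per variable value) into tidy long rows."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                var_name: column,      # the old heading becomes a value
                value_name: value,
            })
    return tidy


# Untidy: the years are stored as column headings (made-up numbers).
wide = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]

for record in melt(wide, "country", "year", "cases"):
    print(record)
```

After reshaping, each row is one observation (a country in a year) and each column is one variable, which is the layout most plotting tools expect.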
Histograms
Visualize aggregated data as frequencies per bin. The area of each bar represents frequency; its height represents frequency density.
Choosing Bins for Histograms
- Square Root: k = sqrt(n), rounded up to a whole number of bins
- Sturges’ Formula: k = ceil(log2(n)) + 1
- Freedman-Diaconis’ Choice: bin width = 2 * IQR(x) / n^(1/3)
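The three rules can be compared on the same data. The sketch below assumes the results are rounded up to whole bins and that the IQR is computed with the "inclusive" method of `statistics.quantiles`; the function names are illustrative.

```python
import math
from statistics import quantiles


def sqrt_bins(n):
    """Square-root rule: number of bins for n observations."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n):
    """Sturges' formula: number of bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def freedman_diaconis_width(data):
    """Freedman-Diaconis rule: bin width from the IQR and sample size."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(data) ** (1 / 3)


def freedman_diaconis_bins(data):
    """Number of bins implied by the Freedman-Diaconis width."""
    return math.ceil((max(data) - min(data)) / freedman_diaconis_width(data))


data = list(range(1, 101))  # placeholder sample: the integers 1..100
print("square root      :", sqrt_bins(len(data)))
print("Sturges          :", sturges_bins(len(data)))
print("Freedman-Diaconis:", freedman_diaconis_bins(data))
```

The first two rules depend only on the sample size, while Freedman-Diaconis also looks at the spread of the data, so the three can disagree noticeably on the same sample.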
High-Dimensional Data
1-D Visualization
- Strip Chart, Beeswarm, Box Plot, Histogram: Can be extended to higher dimensions.
2-D Visualization
- Multiple 1-D Plots: If one dimension is categorical.
- 2-D Box Plot (Bag Plot): Difficult to interpret.
- 2-D Histogram: Volume represents frequency, but can be hard to see.
- Heat Map: Color represents frequency, can be extended to higher dimensions.
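The counts behind a 2-D histogram or heat map come from binning on both axes. The sketch below assumes equal-width bins over a known range and the same number of bins per axis; `histogram_2d` and the sample points are illustrative.

```python
def histogram_2d(points, bins, x_range, y_range):
    """Count points per grid cell; counts[row][col] = (y bin, x bin)."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the grid
        # Clamp so that values on the upper edge fall into the last bin.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[row][col] += 1
    return counts


# Hypothetical points in the unit square.
points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (1.0, 1.0), (0.8, 0.2)]
for row in histogram_2d(points, bins=2, x_range=(0, 1), y_range=(0, 1)):
    print(row)
```

A heat map then maps each count to a color instead of a bar height, which avoids the occlusion problem of a 3-D histogram.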
Visualizing 4-D, 5-D, and More
- Direct Visualization: Difficult and often inaccurate for high dimensions.
- Unfolding High-D Space into 2-D: Parallel coordinates and radar charts.
- Dimension Reduction: Scatterplot matrix, feature reduction.
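For parallel coordinates, each dimension gets its own vertical axis and each record becomes one line across those axes. The sketch below computes the line heights, assuming every axis is min-max scaled to [0, 1]; the function name and the sample records are hypothetical.

```python
def parallel_coordinates(records):
    """Scale each dimension to [0, 1]; return one polyline of (axis, height) per record."""
    columns = list(zip(*records))  # column-wise view of the data
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1 for col in columns]  # guard constant columns
    return [
        [(axis, (value - lows[axis]) / spans[axis]) for axis, value in enumerate(record)]
        for record in records
    ]


# Hypothetical 4-D records.
records = [(1, 10, 100, 5), (2, 20, 300, 5), (3, 30, 200, 5)]
for line in parallel_coordinates(records):
    print(line)
```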
Tableau
A data visualization tool with features for creating interactive dashboards and stories.
Tableau Story
A sequence of visualizations that work together to convey information.
Data Requirements for Tableau
- Data starts from column 1.
- Headers in row 1.
- No row/column totals.
- Each column contains data of the same type.
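Some of these requirements can be checked before loading a file. The sketch below is plain Python and does not use any Tableau API; it assumes the table is a list of equal-length rows with the headers in the first row, and `check_table` is a hypothetical helper.

```python
def check_table(rows):
    """Return a list of problems found in a table (row 0 holds the headers)."""
    problems = []
    header, body = rows[0], rows[1:]
    for col, name in enumerate(header):
        if not isinstance(name, str) or not name.strip():
            problems.append(f"column {col + 1}: missing header")
        # Collect the value types in this column, ignoring empty cells.
        types = {type(row[col]).__name__ for row in body if row[col] is not None}
        if len(types) > 1:
            problems.append(f"column {col + 1}: mixed types {sorted(types)}")
    return problems


# Hypothetical table: the second column mixes numbers and text.
table = [
    ["region", "sales"],
    ["North", 120],
    ["South", "n/a"],
]
print(check_table(table))
```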
Tableau Data Types
- Dimensions: Categorical (qualitative) values used to group and label data.
- Measures: Numeric (quantitative) values that can be aggregated.
Tableau Story Types
- Change Over Time
- Drill Down
- Zoom Out
- Contrast
- Intersections
- Factors
- Outliers