Data Representation and Probability

8A Data

Data can be either numerical (values that are numbers) or categorical (values that can be grouped according to a particular type).

Numerical data can either be discrete (can only be particular numerical values) or continuous (can be any numerical value in a range).

Categorical data is non-numerical data that can be grouped into ‘categories’.

Data can be represented as a graph or a table.

Common types of graphs include:

8B Tables

A tally is a tool used for counting as results are gathered. Each item counted is recorded as a vertical line, and every fifth line is drawn as a cross through the previous four, so the marks form groups of five. For example, 4 is |||| and 7 is one crossed group of five followed by ||.

A frequency table shows how often each value occurs, recorded in a frequency column. A tally column is often included as well, for use while the data is being gathered.

The items can be individual values or intervals of values.
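As an illustration, a frequency table can be built by counting how often each value appears. The sketch below uses made-up data (the `siblings` list is hypothetical and not taken from this chapter) and prints a tally column next to the frequency column. For simplicity, the tally is printed as plain vertical lines without the cross through each group of five.

```python
from collections import Counter

# Hypothetical survey results: number of siblings reported by 12 students.
siblings = [1, 0, 2, 1, 3, 1, 0, 2, 1, 4, 2, 1]

# Counter records how many times each value occurs.
frequency = Counter(siblings)

print("Value  Tally    Frequency")
for value in sorted(frequency):
    count = frequency[value]
    # One vertical line per item counted (no crossed groups of five here).
    print(f"{value:<6} {'|' * count:<8} {count}")
```

The frequencies always add up to the number of data values, which is a useful check on the table.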

8C Histograms

A histogram is a graphical representation of a frequency table. It can be used when the items are numerical.

  • The vertical axis (y-axis) is used to represent the frequency of each item.
  • Columns are placed next to one another with no gaps in between.
  • A half-column-width space is sometimes placed between the vertical axis and the first column of the histogram if the first vertical bar does not start at zero.

8D Mean, Median, Mode and Outliers

The mean (sometimes called the average) of a set of numbers is given by

Mean = (sum of all the values) ÷ (total number of values)

For example:

7 + 8 + 1 + 10 + 2 + 1 + 6 = 35

Mean = 35 ÷ 7 = 5
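The same calculation can be checked with a few lines of Python. This is only a sketch of the formula above; the variable names are chosen for illustration.

```python
# Values from the worked example above.
values = [7, 8, 1, 10, 2, 1, 6]

total = sum(values)   # sum of all the values
count = len(values)   # total number of values
mean = total / count

print(f"Mean = {total} ÷ {count} = {mean}")
```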

The median is the middle value if the values are in order (ascending or descending). If there are two middle values then the average of them is taken, by adding them together and dividing by 2.

The mode is the most common value, i.e. the one with the highest frequency. There can be more than one mode.

For example, for the values 1, 1, 2, 6, 7, 8, 10, the median is 6 (the middle value) and the mode is 1.

An outlier is a data point that is significantly smaller or larger than the rest of the data.

The median and mode are generally unaffected by outliers whereas the mean can be affected significantly by an outlier.
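The effect of an outlier can be seen by recalculating the three measures after one value is changed. The sketch below uses Python's standard `statistics` module on the values 1, 1, 2, 6, 7, 8, 10. The second list, in which 10 is replaced by 100, is an assumption made purely for illustration.

```python
from statistics import mean, median, multimode

values = [1, 1, 2, 6, 7, 8, 10]
# Assumption for illustration: the largest value is replaced by an outlier of 100.
with_outlier = [1, 1, 2, 6, 7, 8, 100]

for label, data in [("original", values), ("with outlier", with_outlier)]:
    print(
        label,
        "| mean:", round(mean(data), 2),
        "| median:", median(data),
        "| modes:", multimode(data),  # a list, since there can be more than one mode
    )
```

In this example the median and mode are the same for both lists, while the mean of the second list is more than three times the mean of the first.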

8E Range and Interquartile Range

The range of a set of data is given by:

Range = highest number – lowest number

The interquartile range (or IQR) is found by the following procedure.

  1. Sort the data into ascending order.
  2. If there is an odd number of values, remove the middle one.
  3. Split the data into two equal size groups.
  4. The median of the lower half is called the lower quartile.
  5. The median of the upper half is called the upper quartile.

IQR = upper quartile – lower quartile

The range and the interquartile range are measures of spread; they summarise the amount of spread in a set of numerical data. An outlier affects the range but generally has little or no effect on the IQR.
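The procedure above can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the sketch assumes the data set has at least two values. Other textbooks and software sometimes use slightly different quartile rules, so results may not match a calculator or spreadsheet exactly.

```python
from statistics import median

def quartiles(data):
    """Return (lower quartile, upper quartile) using the halving procedure."""
    ordered = sorted(data)                      # step 1: ascending order
    half = len(ordered) // 2                    # size of each half
    lower_half = ordered[:half]                 # steps 2-3: the middle value is
    upper_half = ordered[len(ordered) - half:]  # left out when the count is odd
    return median(lower_half), median(upper_half)  # steps 4-5

values = [1, 1, 2, 6, 7, 8, 10]
data_range = max(values) - min(values)
lower_q, upper_q = quartiles(values)
iqr = upper_q - lower_q

print("Range =", data_range)
print("Lower quartile =", lower_q, "Upper quartile =", upper_q)
print("IQR =", iqr)
```

For these seven values the middle value 6 is removed, leaving the halves 1, 1, 2 and 7, 8, 10.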

8F Surveys

A survey can be conducted to obtain information about a large group by using a smaller sample. A survey conducted on an entire population is called a census.

The accuracy of the survey’s conclusion can be affected by:

  • the sample size (number of participants or items considered)
  • whether the sample is representative of the larger group, or biased, which can result in a sample mean significantly different from the population mean
  • whether there were any measurement errors, which could lead to outliers – values that are noticeably different from the other values.

Data represented as a histogram can be seen as symmetric, skewed or bi-modal.

If a data distribution is symmetric, the mean and the median are approximately equal.

8G Experiments and Probability

An experiment or trial could be flipping a coin, rolling a die or spinning a spinner.

An outcome is a possible result of the experiment, like rolling a 5 or a coin showing heads.

An event is either a single outcome (e.g. rolling a 3) or a collection of outcomes (e.g. rolling a 3, 4 or 5).

The probability of an event is a number between 0 and 1 that represents the chance that the event occurs. If all the outcomes are equally likely:

Pr(event) = number of outcomes where the event occurs / total number of outcomes

Probabilities are often written as fractions, but can also be written as decimals or percentages.

The sample space is the set of possible outcomes of an experiment. For example, the sample space for the roll of a die is {1, 2, 3, 4, 5, 6}.

The complement of some event E is written E′ (or not E). E′ is the event that E does not occur. For example, the complement of ‘rolling the number 3’ is ‘rolling a number other than 3’.

For any event, either it or its complement will occur. That is, Pr(E) + Pr(E′) = 1.
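As a sketch, the probability formula and the complement rule can be checked for a fair six-sided die by treating events as sets of outcomes. The helper name `pr` is chosen for illustration, and the code assumes all outcomes are equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def pr(event):
    """Probability of an event (a set of outcomes), assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

rolling_3 = {3}
not_rolling_3 = sample_space - rolling_3  # the complement of 'rolling the number 3'

print("Pr(3) =", pr(rolling_3))
print("Pr(not 3) =", pr(not_rolling_3))
print("Sum =", pr(rolling_3) + pr(not_rolling_3))
```

Using `Fraction` keeps the answers as exact fractions rather than rounded decimals.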

The following language is also commonly used in probability:

  • ‘at least’, for example, ‘at least 3’ means 3, 4, 5, …
  • ‘at most’, for example, ‘at most 7’ means …, 5, 6, 7
  • ‘or’, for example, ‘rolling an even or a 5’ means rolling a 2, 4, 5, 6
  • ‘and’, for example, ‘rolling an even and a prime’ means rolling a 2
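These phrases correspond to simple set operations. The sketch below applies them to the faces of a six-sided die; the set names are chosen for illustration, and 'at most 4' is used in place of 'at most 7' so that the example fits on a die.

```python
faces = {1, 2, 3, 4, 5, 6}
even = {n for n in faces if n % 2 == 0}
prime = {2, 3, 5}

at_least_3 = {n for n in faces if n >= 3}  # 3 or more
at_most_4 = {n for n in faces if n <= 4}   # 4 or fewer
even_or_5 = even | {5}                     # 'or' joins the two sets of outcomes
even_and_prime = even & prime              # 'and' keeps only outcomes in both sets

print(sorted(at_least_3))
print(sorted(at_most_4))
print(sorted(even_or_5))
print(sorted(even_and_prime))
```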

8H Independent Events

When an experiment consists of two independent events (for example, rolling two dice), the combined outcomes can be listed in a table.

The probability is still given by:

Pr(event) = number of outcomes where the event occurs / total number of possible outcomes
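As an illustration, the sketch below lists the outcomes for rolling two dice as a table of totals and then counts the outcomes for one event. The choice of two dice and of the event 'the total is 7' is an assumption made for this example.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Every pair (first die, second die); each pair is equally likely.
outcomes = list(product(faces, repeat=2))

# Table of totals: rows are the first die, columns are the second die.
print("   " + " ".join(f"{b:2}" for b in faces))
for a in faces:
    print(f"{a:2} " + " ".join(f"{a + b:2}" for b in faces))

favourable = [pair for pair in outcomes if sum(pair) == 7]
probability = Fraction(len(favourable), len(outcomes))
print("Pr(total is 7) =", probability)
```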

8I Tree Diagrams

A tree diagram can be used to list the outcomes of experiments that involve two or more steps.

At this stage, we will only consider tree diagrams for which each branch corresponds to an equally likely outcome.
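A tree diagram can be imitated in code by following every branch at each step and recording the path taken. The sketch below does this for two flips of a coin; the function name `leaves` and the event 'at least one head' are chosen for illustration. It assumes each branch is equally likely.

```python
from fractions import Fraction

def leaves(steps, branches="HT", path=""):
    """List the outcomes at the ends of a tree with the given number of steps."""
    if steps == 0:
        return [path]  # reached the end of a branch
    result = []
    for branch in branches:
        # Follow this branch, then continue with the remaining steps.
        result.extend(leaves(steps - 1, branches, path + branch))
    return result

outcomes = leaves(2)
at_least_one_head = [o for o in outcomes if "H" in o]
probability = Fraction(len(at_least_one_head), len(outcomes))

print(outcomes)
print("Pr(at least one head) =", probability)
```

Calling `leaves(3)` would give the eight outcomes for three flips in the same way.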

8J Two-Way Tables

A two-way table lists the number of outcomes or people in different categories, with the final row and column being the total of the other entries in that row or column. For example:

A two-way table can be used to find probabilities. For example: Pr(like Maths) = 33/100

Pr(like Maths and not English) = 5/100 = 1/20
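The sketch below shows how such probabilities follow from the counts in a two-way table. It uses only the three numbers that appear in the examples above (a total of 100, with 33 who like Maths and 5 who like Maths but not English). The remaining entry in the 'like Maths' row is worked out from the row total, and the variable names are chosen for illustration.

```python
from fractions import Fraction

total = 100            # grand total (the denominator in the examples)
like_maths = 33        # row total: like Maths
maths_not_english = 5  # entry: like Maths and not English

# A row total is the sum of the entries in that row, so the other entry is:
maths_and_english = like_maths - maths_not_english

pr_maths = Fraction(like_maths, total)
pr_maths_not_english = Fraction(maths_not_english, total)
pr_maths_and_english = Fraction(maths_and_english, total)

print("Pr(like Maths) =", pr_maths)
print("Pr(like Maths and not English) =", pr_maths_not_english)
print("Pr(like Maths and English) =", pr_maths_and_english)
```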