Internet of Things (IoT), Embedded Systems, and Machine Learning Fundamentals

Posted on Jan 28, 2025 in Archaeology

Chapter 1: Internet of Things (IoT)

IoT = Thing + Computational Intelligence + Connection to the Internet -> can produce a lot of data

Uniquely identifiable embedded systems (Things) connected with the internet infrastructure

Node devices (things) collect data from sensors -> data sent to the internet through a gateway

Things and gateways are considered as “edges”

Parts: Microcontroller, Expansion board and battery, Sensors and actuators

Chapter 2: Embedded Systems

Embedded systems are special-purpose, dedicated computing systems embedded within a larger system (e.g., vehicles). They commonly take the form of SoCs (System on a Chip)

A sensor reads in analog data that is converted to digital form using an Analog to Digital Converter (ADC)

An actuator receives a digital signal that is converted to electrical or electromechanical signals through a Digital to Analog Converter (DAC). The controller contains a processor

The controller is connected to the Internet (directly or via a Gateway)

The controller has many General-Purpose Input/Output (GPIO) pins. They are memory mapped and controlled by software, so we can set their voltages from code.

Interrupts are signals that temporarily halt the normal execution of a program, diverting the processor’s attention to a different piece of code known as an Interrupt Service Routine (ISR), also commonly known as an interrupt handler. The sequence is:
1. An event raises an interrupt.
2. The processor automatically executes the code that handles the event.
3. After the event is handled, the processor resumes the work it was performing prior to the event.
The ESP32 has 32 interrupts per core (two cores).

Five Interrupt Trigger Modes -> Low, Rising, Falling, High, Change

Guidelines for writing an ISR:
1. Make the ISR as short as possible. Common trick: set a flag or update a value in the ISR and let the main loop do the time-consuming part.
2. Do not use delay() in the ISR.
3. Do not use Serial read/write in the ISR.
4. If a variable is shared between the ISR and other code, declare it as volatile.
While an ISR is executing, all other interrupts have to wait.

Timers are important in real-time systems: They allow us to read sensors or write to actuators at precise times. They ensure that time-sensitive algorithms run at the correct timing. Timers are often needed to switch between tasks in a multi-tasking OS

Sensors:
Digital input: switches. These need debouncing, because the microcontroller samples fast enough to register every bounce of the mechanical contacts.
Analog sensor: provides a DC voltage that is proportional to the physical parameter being measured. Examples:
• A temperature sensor uses the fact that as temperature increases, the voltage across a diode changes at a known rate.
• A flex sensor provides a change in resistance for a change in sensor flexure.

Computers can only handle digital signals. The conversion between analog and digital signals is needed. The ADC performs analog to digital conversion while the DAC performs the opposite.

A continuous time signal is defined over all instances of time

A discrete signal or discrete time signal is a sequence or series of signal values defined in discrete points of time

A digital signal is a discrete signal for which not only the time but also the amplitude has been made discrete

An analog-digital converter (ADC) samples and quantizes an analog signal to a digital signal
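
As a rough illustration of quantization, the sketch below maps a voltage to an integer code. The function name adc_quantize, the 3.3 V reference, and the 12-bit default are assumptions chosen for illustration; this is an idealized model, not an ESP32 driver call.

```python
def adc_quantize(voltage, vref=3.3, bits=12):
    """Map an analog voltage in [0, vref] to an integer code (idealized ADC)."""
    levels = 2 ** bits                      # 4096 levels for 12 bits
    clamped = min(max(voltage, 0.0), vref)  # inputs outside the range saturate
    code = int(clamped / vref * levels)     # quantize by truncating to a level
    return min(code, levels - 1)            # full scale maps to the top code


# Sampling: the signal is only looked at for a few discrete values.
codes = [adc_quantize(v) for v in (0.0, 1.65, 3.3)]
```

With 12 bits, each step corresponds to roughly vref / 4096 volts, so any change smaller than that is lost in the conversion.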

The ESP32 has four 64-bit timers!

The ESP32 has two 12-bit SAR (Successive Approximation Register) ADCs

The ESP32 supports 2 sets of (UART) serial communication (in two ways).

Baud rate refers to the number of signal changes or symbol changes (also called “modulation events”) that occur per second in a communication channel.

Pulse Width Modulation (PWM) is a technique used to encode information or control the amount of power delivered to a load by varying the duty cycle of a digital signal. It is widely used in electronics and communication systems for controlling the power delivered to electrical devices such as LEDs.

PWM is also a method to create an analog signal from a digital processor:
▪ The main idea is to keep the signal high for an amount of time proportional to the amplitude of the required analog signal during each period of the digital pulse.
▪ Most microcontrollers can generate PWM signals using a timer.

PWM works by rapidly switching the output between on and off states, varying the duration of the “on” time relative to the “off” time to control the average power delivered to a load.
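
A minimal sketch of the duty-cycle idea, assuming an ideal signal that switches between 0 V and a 3.3 V high level. The function names and the 8-slot period are hypothetical choices for illustration.

```python
def pwm_average(duty_cycle, v_high=3.3):
    """Average voltage of an ideal PWM signal switching between 0 V and v_high."""
    return duty_cycle * v_high


def pwm_period(duty_cycle, steps=8):
    """One PWM period as a list of 1 (on) and 0 (off) time slots."""
    on_steps = round(duty_cycle * steps)
    return [1] * on_steps + [0] * (steps - on_steps)


quarter = pwm_period(0.25)       # on for 2 of the 8 slots
half_power = pwm_average(0.5)    # about half of v_high
```

Raising the duty cycle lengthens the run of 1s in each period, which raises the average voltage seen by the load.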

Chapter 3: Communications and Serial Interfaces

How to coordinate asynchronous serial communication?

We need a bit-level protocol. Both sides must agree on the baud rate, data length, # of parity bits, and the # of stop bits. This defines our packet
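
The packet described above can be sketched as a list of bits. This assumes the common UART framing convention (low start bit, data sent least significant bit first, optional parity, high stop bits); the function names are hypothetical.

```python
def uart_frame(byte, data_bits=8, parity="even", stop_bits=1):
    """Return the bits of one asynchronous serial frame, in transmission order."""
    data = [(byte >> i) & 1 for i in range(data_bits)]  # LSB is sent first
    frame = [0] + data                                   # start bit is low
    if parity in ("even", "odd"):
        ones = sum(data)
        # Even parity makes the total number of 1s even; odd parity makes it odd.
        parity_bit = ones % 2 if parity == "even" else (ones + 1) % 2
        frame.append(parity_bit)
    frame += [1] * stop_bits                             # stop bits are high
    return frame


def frame_time_seconds(frame, baud_rate):
    """Time on the wire, assuming one bit per symbol."""
    return len(frame) / baud_rate


frame = uart_frame(0x41)   # the letter "A"
```

If the receiver assumed a different data length, parity setting, or stop-bit count, it would misread where this frame ends, which is why both sides must agree in advance.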

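Synchronous communication requires a shared clock signal between sender and receiver. Asynchronous communication does not, so both sides must agree on the timing in advance.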

Inter-Integrated Circuit or I2C (also known as Two-Wire Interface, TWI): suited to cases where one sensor is shared by two or more microcontrollers, or where one microcontroller connects to many sensors but cannot use more than 2 wires. It is a bus protocol, which allows more flexible connection patterns.

I2C protocol uses two wires: 1. Serial Clock (SCL) 2. Serial Data (SDA)

The Serial Clock establishes the same clock among connected devices. Bits will be sent / read according to this global clock.

The Serial Data line is used to transfer bits between the master and slave devices. Each bit of data, whether it’s being sent from the master to the slave or from the slave to the master, is transmitted over the SDA line.

SDA and SCL Working Together:

The SCL line provides the clock signal that tells the devices when to read or write data.
The SDA line carries the actual bits of data that are being transmitted or received. It changes between high (1) and low (0) states to represent each bit of data.

The I2C protocol enables multiple masters and slaves to communicate over a two-wire bus using SCL (clock) and SDA (data). Each slave has a unique address, and the master selects which slave to communicate with by sending its address. Multiple masters can share the bus through an arbitration process, where the bus is monitored to avoid conflicts. Slaves can also pause communication using clock stretching if they need more time to process data. The protocol uses start/stop conditions to manage communication and ACK/NACK signals for confirming data receipt.
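
Slave selection can be illustrated with the first byte a master sends after a start condition. This sketch assumes standard 7-bit addressing, where the address is followed by a read/write bit; the function name and the address 0x48 are hypothetical.

```python
def i2c_address_byte(address, read):
    """First byte after a START: 7-bit slave address followed by the R/W bit."""
    if not 0 <= address <= 0x7F:
        raise ValueError("I2C addresses in this sketch are 7 bits wide")
    return (address << 1) | (1 if read else 0)   # R/W bit: 1 = read, 0 = write


write_byte = i2c_address_byte(0x48, read=False)
read_byte = i2c_address_byte(0x48, read=True)
```

Every slave on the bus sees this byte, but only the one whose address matches responds with an ACK.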

The ESP32 also supports SPI (Serial Peripheral Interface), which is also known as a four-wire protocol. It is synchronous and allows for full-duplex communication, but it requires a lot of pins:
SCLK: Serial Clock
MOSI: Master Out Slave In
MISO: Master In Slave Out
CS/SS: Chip/Slave Select

SPI allows communication to be sent in both directions simultaneously. This is known as full duplex. I2C is only half-duplex.

Benefits of Internet Protocol (IP): Ubiquity (used by all OS), Longevity, Standards-based, Scalability, Reliability, Manageability (many tools that support IP)

Real-Time Software Architecture

A “software architecture” is the structure of a software solution. It determines whether a piece of software can meet the constraints of a problem, and whether it is correct. Not all real-time systems require an operating system.

Some alternatives: ▪ Round-robin ▪ Round-robin with interrupts ▪ Function queue scheduling ▪ Timed loops

Round Robin

Round Robin architecture works particularly well when:
▪ There are few devices to service
▪ There is no long, complicated processing associated with the input
▪ There are no tight time requirements
However, it fails when any device requires attention in less time than it takes for the CPU to go around the loop.
This architecture is also fragile: every single device added to the loop may render the performance unacceptable.

Round Robin with Interrupts

In this architecture:
▪ When a sensor has data to send to the CPU, it interrupts the CPU
▪ The interrupt handler reads the device and sets a flag
▪ The main loop polls these flags and takes action if any of the flags is set

This architecture is an improvement over simple round-robin: devices get attended to immediately, so data is unlikely to be lost. However, it may still take a while for this data to actually be processed.

Function Queue Scheduling

This is similar to Round Robin with Interrupts: interrupt handlers read the data from the device, BUT instead of setting a flag, each handler inserts into a queue a pointer to the function that will process this data. The main loop executes functions from the queue in First In First Out order.
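
A minimal simulation of function queue scheduling is sketched below. Real interrupt handlers run on hardware events; here the hypothetical isr() function is simply called by hand, and the device names and raw values are made up for illustration.

```python
from collections import deque

function_queue = deque()   # filled by interrupt handlers, drained by the main loop
log = []


def process_temperature(raw):
    log.append(("temperature", raw))


def process_button(raw):
    log.append(("button", raw))


def isr(handler, raw):
    """Simulated interrupt handler: read the device, then queue the work."""
    function_queue.append((handler, raw))


def main_loop_once():
    """Run queued functions in First In First Out order."""
    while function_queue:
        handler, raw = function_queue.popleft()
        handler(raw)


isr(process_temperature, 512)
isr(process_button, 1)
main_loop_once()
```

The handlers stay short because they only enqueue work; the slow processing happens later in the main loop.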

Timed Loops

Like round-robin:
▪ A while loop repeatedly calls functions to handle processing
▪ Each function is called in a fixed order
Differences:
▪ Functions are called at fixed intervals
▪ The main loop enforces the intervals by checking elapsed cycles
This is useful for routines that must be called at fixed times.

Real-Time Operating Systems (RTOS)

A real-time operating system (RTOS) is an operating system (OS) for real-time applications that processes data and events that have critically defined time constraints. It places an emphasis on very precise timing and a high degree of reliability. Scheduling is priority-based and preemptive.

At any one time, at most one process can execute, so the OS needs to constantly switch processes between states:
Running: The process is being executed on the CPU.
Ready: The process is ready to run but not currently running. A “scheduling algorithm” is used to pick the next process for running.
Blocked: The process is waiting for “something” to happen, so it is not ready to run yet.

Process Scheduling Scheduler (the lowest layer): 1. Maintains the list of processes ready to run 2. Selects (schedules) which process to run next
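
The two scheduler duties can be sketched with a priority queue. This is a simplified model that assumes larger numbers mean higher priority and ignores preemption; the class, method, and process names are hypothetical.

```python
import heapq


class Scheduler:
    """Keeps the ready list and picks the highest-priority ready process."""

    def __init__(self):
        self._ready = []      # heap of (negated priority, arrival order, name)
        self._counter = 0
        self.state = {}       # name -> "ready" | "running" | "blocked"

    def make_ready(self, name, priority):
        heapq.heappush(self._ready, (-priority, self._counter, name))
        self._counter += 1
        self.state[name] = "ready"

    def schedule(self):
        """Select the next process to run, or None if nothing is ready."""
        if not self._ready:
            return None
        _, _, name = heapq.heappop(self._ready)
        self.state[name] = "running"
        return name

    def block(self, name):
        self.state[name] = "blocked"


s = Scheduler()
s.make_ready("logger", priority=1)
s.make_ready("motor_control", priority=5)
first = s.schedule()
s.block(first)
second = s.schedule()
```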

Chapter 4: Statistical Methods

4 major methods: • Regression. • Naïve Bayes Classification. • Support Vector Machines. • Decision Trees

Statistical Methods vs Deep Learning

• Statistical methods have a better theoretical basis.

• Statistical methods in general are faster to build and train.

• Statistical methods usually have fewer hyperparameters to adjust. Statistical methods also give you a good idea of whether the data is actually “learnable”

• Much faster prototyping time, much simpler, though may provide relatively poor results

Linear Regression

• Given a training dataset of pairs (x, y) where x (the observations, e.g., sensing data) is a vector and y (the label) is a continuous scalar. We assume there is a linear relationship between x and y.

• The continuous label is what differentiates regression from classification, where y is discrete, like a binary label: spam or not spam.

• Learn a function (either linear, i.e. a line, or non-linear, i.e. a curve) that best fits the training data.

Goal of Linear Regression: Find the linear function that best fits: minimizing the distance between the line and the data points.

We must use covariance to first determine if there is a relationship between x and y.

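y = ax + b + e, where e is random noise in the observations.

Correlation coefficient: r = cov(x, y) / (standard dev(x) * standard dev(y)). Dividing the covariance by the two standard deviations normalizes it to the range [-1, 1], which makes the strength of the relationship comparable across datasets.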
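
The quantities above can be computed in a few lines. This sketch assumes a single input variable and uses the standard least-squares result that the slope is cov(x, y) / var(x); the function names and the sample data are made up for illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (sqrt(covariance(xs, xs)) * sqrt(covariance(ys, ys)))


def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    a = covariance(xs, ys) / covariance(xs, xs)
    b = mean(ys) - a * mean(xs)
    return a, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]      # generated from y = 2x + 1 with no noise
a, b = fit_line(xs, ys)
r = correlation(xs, ys)
```

Because the sample data has no noise, the fit recovers the slope and intercept and the correlation is at its maximum.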

Naïve Bayes Classifier

Used for classification. Classification problem:
• Given a set of data X and a set of classes C, we want to compute the likelihood that some $x \in X$ belongs to some class $c_k \in C$.
• In particular, we want to find some $c_{max}$ such that $P(c_{max} \mid x) = \max_{c_k \in C} P(c_k \mid x)$; then we can claim that x belongs to class $c_{max}$.
For continuous features, a Gaussian probability function is used to model the likelihood of x given each class.

Multinomial Naïve Bayes Frequencies

In multinomial Naive Bayes, we consider an instance x to consist of a vector of frequencies (x1, x2, …, xn) where each xi is the frequency of some event i occurring.

• E.g. in a document classification problem, xi is the number of times word i appears.

Issue 1: Raw frequencies in document classification face some problems: • Bias towards longer documents. • Bias towards “connectors” like “the” because they occur much more frequently

Issue 2: Zero frequencies • If xi=0 in class ck, then pik = 0 and p(x|ck) becomes 0. • Laplace smoothing: Add 1 to every xi so that no pik is 0.
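
Laplace smoothing can be sketched as follows. The function name and the word counts are hypothetical; the denominator adds one extra count per event so that the smoothed probabilities still sum to 1.

```python
def smoothed_probabilities(counts):
    """Laplace smoothing: add 1 to every count before normalizing."""
    total = sum(counts.values()) + len(counts)   # one extra count per event
    return {event: (n + 1) / total for event, n in counts.items()}


# Made-up word counts for one class; "meeting" never appeared in it.
spam_counts = {"free": 3, "meeting": 0}
probs = smoothed_probabilities(spam_counts)
```

Without smoothing, the zero count for "meeting" would force the whole product p(x|ck) to 0 for any document containing that word.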

Bernoulli Naïve Bayes Boolean Features

Sometimes, rather than maintaining a count of the number of times event i occurs in class ck, we are only interested in whether it occurs at all. •In such cases we use a Bernoulli distribution instead of multinomial.

Support Vector Machines (SVM)

Separation Hyperplanes

A hyperplane in a d-dimensional space is a flat (d-1)-dimensional surface defined by its normal vector w.

• In 2D a hyperplane is a straight line, in 3D it is a standard plane.

• The normal w is orthogonal to every point on a plane through the origin, i.e. for any point p on the plane, wTp is 0.

• The hyperplane separates the d-dimensional space into two halves.

• Hyperplanes defined this way always pass through the origin. This restricts their ability to partition the space.

• To fix this we add a scalar bias b, so that the hyperplane becomes wTp + b = 0.

Support Vector Machines Assumptions

An SVM places two parallel margin hyperplanes, one on each side of the separating hyperplane. These two hyperplanes must be as far apart as possible:

• This means that they will be determined by the points in two different classes (1 and -1) that are closest together.

• These points are called “support vectors”

A loss function is used if the data is not linearly separable.

Parameters to Choose When Using Linear SVMS

Loss function:

• Hinge Loss: Loss function used for linearly separable classes.

• Logistic: Produces probability estimates of the classifications.

• Perceptron: Loss function used by Perceptrons

• Modified Huber: More tolerance to outliers (see later).
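
As one concrete case from the list above, the hinge loss can be sketched as below, using the common definition max(0, 1 - y(wTx + b)) with labels 1 and -1. The weights and sample points are made up for illustration.

```python
def decision_value(w, x, b):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, x, y):
    """Zero once the point is on the correct side with a margin of at least 1."""
    return max(0.0, 1.0 - y * decision_value(w, x, b))


w, b = [1.0, -1.0], 0.0
far_correct = hinge_loss(w, b, [3.0, 0.0], 1)     # well outside the margin
inside_margin = hinge_loss(w, b, [0.5, 0.0], 1)   # correct side, but too close
wrong_side = hinge_loss(w, b, [1.0, 0.0], -1)     # misclassified
```

Points inside the margin are penalized even when they are classified correctly, which is what pushes the margin to be wide.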

Training Rate (alpha):

Controls how fast the SVM learns. Higher values can lead to poorer results, lower values can lead to slow training.

Regularization: This controls how we penalize misclassified information:

• Higher penalty: Possibly more overfitting – SVM can only understand and correctly classify training data and nothing else.

• Lower penalty: Underfitting – SVM performs poorly even on training data

For points that are not linearly separable, SVMs can apply a “kernel trick”:

• Choose a kernel function that maps the points to a higher dimension.

• If properly chosen the higher dimension points should become linearly separable.
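
The idea of mapping to a higher dimension can be shown with an explicit feature map. Note the assumption: a real kernel function computes inner products in the higher-dimensional space without building it, whereas this sketch builds the mapped points directly to make the effect visible. The data and threshold are made up.

```python
def lift(x):
    """Map a 1-D point to 2-D by adding its square as a second coordinate."""
    return (x, x * x)


points = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, 1, 1]     # outer points vs. inner points


def classify(x, threshold=2.5):
    """In the lifted space the line x2 = threshold separates the classes."""
    _, x2 = lift(x)
    return 1 if x2 > threshold else -1


predicted = [classify(p) for p in points]
```

On the original number line no single threshold separates the inner points from the outer ones, but after lifting a straight line does.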

Decision Trees

How to construct a decision tree?

Choosing a good split: use a probabilistic notion of uncertainty to decide splits. Choose the split that allows for better classification.

Entropy: quantifies the uncertainty inherent in data
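
Entropy and its use for choosing a split can be sketched as follows. Scoring a split by information gain (the drop in entropy) is an assumption here, since the notes do not name the exact criterion; the function names and labels are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def information_gain(parent, left, right):
    """Drop in entropy achieved by splitting parent into left and right."""
    n = len(parent)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - after


parent = ["a", "a", "b", "b"]
perfect_split = information_gain(parent, ["a", "a"], ["b", "b"])
useless_split = information_gain(parent, ["a", "b"], ["a", "b"])
```

A split that produces pure groups removes all the uncertainty, while a split that leaves each group as mixed as the parent gains nothing.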

Chapter 5: Neural Networks

Two classes of learning laws:

• Unsupervised learning:
  • Tons of data are thrown at the NN.
  • The NN does its own inference on the structure and relationships in the data.

• Supervised learning:
  • We give the NN the data and the correct labels (or values we want to produce) from the data.
  • The NN then optimizes its weights to mimic the generator function for the data, based on what we tell it.

Unsupervised Learning

In unsupervised learning, the neural network is given data samples, but no descriptions of what the data samples mean. The job of the neural network is then to automatically learn relationships and structure between the data samples

The Self-Organizing Map (SOM)

It is based on the following principle: given a sample vector vk, the neuron (which we will call a “centroid”) that best matches vk (the best matching unit, or BMU) and its nearest neighbors are adjusted to look like vk. This partitions the vector space into a “tessellated Voronoi surface.”

The Kohonen SOM Algorithm

The learning rate a(t) is similarly decayed. The end-result: • Neurons (sometimes called “centroids”) and their nearest neighbors are adapted to look like the sample data that are closest to them. • Likewise new sample data coming in will automatically be “classified” as the centroid they are closest to: • This means that they will be “classified” together with other data that “looks” like them.
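
A single update step can be sketched as below. This is a simplification under stated assumptions: the neurons sit on a 1-D chain, the neighbors receive the same update as the best matching unit, and the learning rate and radius are held fixed rather than decayed. All names and values are hypothetical.

```python
def som_step(neurons, sample, rate=0.5, radius=1):
    """One Kohonen update on a 1-D chain of neurons (lists of floats)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Best matching unit: the neuron closest to the sample.
    bmu = min(range(len(neurons)), key=lambda i: dist2(neurons[i], sample))
    # Move the BMU and its chain neighbors part of the way towards the sample.
    for i in range(max(0, bmu - radius), min(len(neurons), bmu + radius + 1)):
        neurons[i] = [w + rate * (s - w) for w, s in zip(neurons[i], sample)]
    return bmu


neurons = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
bmu = som_step(neurons, sample=[2.0, 2.0])
```

Classifying a new sample afterwards is just the first half of this step: find the neuron it is closest to.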

Supervised Learning

We will look at two related supervised learning algorithms:

• Linear Perceptrons: Simple, good for classifying linearly separable data.

• Multi-layer Perceptrons: More complex, good for classifying non-linearly separable data

Perceptrons

A single Perceptron (McCulloch – Pitts) neuron consists of:

• A set of inputs x.

• A set of “weights” w.

• A unit that does a dot product wTx.

• An activation function that outputs 1 when wTx+b > 0, and -1 otherwise (other functions are possible, e.g. the identity function id(x)=x).

• The bias b is normally modeled as a unit input, with the weight w0 becoming the bias.

How Perceptrons Learn

Input: The perceptron takes in multiple input features (e.g., x1, x2, … xn), each associated with a weight (w1, w2, … wn). These inputs can represent various features in the data.
Weighted Sum: The perceptron calculates a weighted sum of the inputs: $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$.
Activation Function: The weighted sum is passed through an activation function (often a step function). If the result is above a certain threshold, the output is 1 (positive class); otherwise, it’s 0 (negative class).
Prediction vs Target: The perceptron’s output is compared to the target label (the actual class). If the prediction is incorrect, the perceptron adjusts its weights.

Learning Steps:

Initialize Weights and Bias: Start with random weights and bias.
Calculate Output: For each input in the training set, calculate the output using the perceptron’s current weights.
Update Weights: If the output is incorrect (prediction differs from the target), update the weights
Repeat: The process continues for multiple epochs (iterations through the entire training data) until the perceptron reaches a state where it makes correct predictions for most of the training data.
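
The learning steps above can be sketched on the AND function, which is linearly separable. The update used here is the classic perceptron rule (add rate × error × input to each weight), which is an assumption since the notes do not spell out the formula. Weights start at zero instead of random values so the run is repeatable.

```python
def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias   # weighted sum
    return 1 if z > 0 else 0                              # step activation


def train_perceptron(samples, epochs=20, rate=1.0):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)    # 0 when correct
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
predictions = [predict(weights, bias, x) for x, _ in AND]
```

When a prediction is already correct the error is 0, so the weights are left unchanged for that sample.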

XOR table:

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

Perceptrons can’t solve the XOR problem because XOR is not linearly separable: it is not possible to draw a single straight line that separates the samples into two classes. A single-layer perceptron can only solve problems where the data can be separated by a straight line (a hyperplane in higher dimensions).

Multi-Layer Perceptrons

If we add a second Perceptron layer, we introduce a second plane, and the two planes together can neatly separate the points.
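
The effect of a second layer can be checked with a small hand-wired network. The weights below are picked by hand for illustration (they are not learned): one hidden perceptron computes OR, another computes AND, and the output perceptron fires for OR but not AND.

```python
def step(z):
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    """Two hidden perceptrons (two lines) combined by an output perceptron."""
    h_or = step(x1 + x2 - 0.5)       # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only when both inputs are 1
    return step(h_or - h_and - 0.5)  # OR but not AND


outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```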

Learning Steps for Multi-Layer Perceptron (MLP):

Feedforward Pass:

The input data is passed through the network layer by layer.
Each neuron computes a weighted sum of its inputs (from the previous layer), adds a bias, and applies an activation function to produce the output.
The activation function is usually non-linear (e.g., ReLU, sigmoid, or tanh), allowing the network to model non-linear relationships.

Calculate Loss:

After the forward pass, the network makes a prediction.
The loss function (e.g., mean squared error for regression or cross-entropy for classification) is used to compute the error between the predicted output and the actual target output (label).

Backpropagation:

Backpropagation is the algorithm used to update the weights in an MLP. It computes the gradient of the loss function with respect to each weight by applying the chain rule of calculus.
The goal is to minimize the loss, so the gradients tell the network how to adjust the weights to reduce the error.

Steps in backpropagation:

Output Layer: Compute the gradient of the loss with respect to the output of the network (using the derivative of the activation function at the output layer).
Hidden Layers: Propagate the error backward through the network, adjusting the weights of each neuron using the gradients of the loss with respect to the hidden layer activations.

Weight Update (Gradient Descent):

After calculating the gradients for each weight, the weights are updated using gradient descent or a variant (e.g., Stochastic Gradient Descent (SGD), Adam).

Repeat for All Training Examples:

The forward pass, backpropagation, and weight update steps are repeated for each example in the training data.
This can be done in mini-batches (small subsets of the training data), or for the entire dataset (batch gradient descent).

Iterate Through Epochs:

The process is repeated for multiple epochs (complete passes through the entire training dataset) until the model converges (i.e., the loss stops decreasing significantly or the desired performance is achieved).
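
The full loop (forward pass, loss, backpropagation, weight update) can be sketched on the smallest possible network. This assumes one tanh hidden neuron, a linear output, a squared-error loss, and a single made-up training example; all names and starting values are hypothetical.

```python
from math import tanh


def forward(p, x):
    h = tanh(p["w1"] * x + p["b1"])       # hidden layer (one neuron)
    y = p["w2"] * h + p["b2"]             # linear output layer
    return h, y


def loss(p, x, t):
    return 0.5 * (forward(p, x)[1] - t) ** 2


def gradients(p, x, t):
    """Backpropagation: apply the chain rule from the output back to the input."""
    h, y = forward(p, x)
    dy = y - t                            # dLoss/dy
    dz = dy * p["w2"] * (1 - h * h)       # through tanh: d tanh(z)/dz = 1 - h^2
    return {"w2": dy * h, "b2": dy, "w1": dz * x, "b1": dz}


def train(p, x, t, rate=0.1, steps=200):
    for _ in range(steps):
        g = gradients(p, x, t)
        p = {k: p[k] - rate * g[k] for k in p}   # gradient descent update
    return p


params = {"w1": 0.5, "b1": 0.0, "w2": -0.3, "b2": 0.1}
trained = train(params, x=1.0, t=0.8)
```

Each step moves every weight a small distance against its gradient, so the loss on the example shrinks over the iterations.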

Non-Linearity: The addition of a hidden layer only partially solves the problem of classifying data points:
• What if the data points occur in a circle? No straight hyperplane can solve this.
Solution: Use non-linear activation functions, which give non-linear decision boundaries.
• Popular ones: sigmoid for (0, 1) outputs, tanh for (-1, 1) outputs, and ReLU.

The weights that are being adjusted by our optimization algorithms are called “parameters”, and neural networks are known as “parameterized models”. There is another dimension of neural networks that must be “optimized” as well to get good results, called “hyperparameters”. This is related to the design of the neural network.

Important hyperparameters include:

• Architecture (Dense networks, Long-Short-Term Memories, Convolutional Neural Networks, Autoencoders, Generative-Adversarial Networks, etc)

• # of input nodes (usually constrained by the problem itself)

• Input and output encoding.

• # of hidden layers.

• Size of each hidden layer.

• Loss functions

• Transfer functions.

• Optimization functions and their parameters (learning rate, momentum, etc.)

• Dropouts and Regularizers

Activation Functions: An activation function (often also called a “transfer function”) transforms the dot product of the weights and inputs into the neuron’s output. Examples: ReLU, sigmoid, softmax.

Loss Functions: A loss function measures the error between the function learned by the NN and the actual target value (MSE, Mean Absolute Error, Binary Cross Entropy)

Optimizers In NNs: Optimizers are algorithms that derive the best set of parameter values that minimize the loss function (such as stochastic gradient descent)

Overfitting

Overfitting occurs when a machine learning model learns the training data too well, to the point that it captures noise and random fluctuations in the data rather than the underlying patterns or relationships. As a result, the model performs extremely well on the training data but poorly on unseen data (test or validation data).

All data can be thought of as consisting of two things:

• Some sort of generator process f(x). The input x could simply be time, or an actual input.

• Noise: When we build a model, we want it to learn only the generator process f(x).

• The noise is random and unique to the training data. If we learn the noise, then our model cannot generalize well – “overfitting”.

Two choices to avoid learning the noise:

• Have a simple model (i.e. fewer parameters).

• Have more training data so that the network can “average out” the noise.

Catch:

• A simple model may not learn f(x) well – underfitting.

• The amount of training data needed grows exponentially with the model complexity – the “curse of dimensionality”.

Marked by: • Extremely low training loss. • Increasing validation loss.
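
The two markers can be checked programmatically. This sketch looks for the point where validation loss starts rising while training loss keeps falling; the function name, the patience value, and the loss curves are all made up for illustration.

```python
def overfitting_epoch(train_loss, val_loss, patience=2):
    """Index of the last epoch before validation loss rose `patience` times
    in a row while training loss kept falling; None if that never happens."""
    rises = 0
    for epoch in range(1, len(val_loss)):
        train_falling = train_loss[epoch] < train_loss[epoch - 1]
        val_rising = val_loss[epoch] > val_loss[epoch - 1]
        rises = rises + 1 if (train_falling and val_rising) else 0
        if rises >= patience:
            return epoch - patience
    return None


# Hypothetical per-epoch losses, not measured results.
train_loss = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_loss = [0.95, 0.70, 0.55, 0.50, 0.58, 0.66]
stop_at = overfitting_epoch(train_loss, val_loss)
```

Stopping training at the returned epoch keeps the weights from the point where the model generalized best on this made-up curve.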

Dealing with Overfitting

Dropout Layers:

• A fixed percentage of neurons in the layer are dropped from training for one or more epochs.

• They are then put back in, and another percentage of neurons are dropped.

• This reduces the number of parameters being trained at any one time.
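
Dropping a fixed percentage of neurons can be sketched by zeroing a random subset of a layer's outputs. The function name, rate, and activation values are hypothetical, and the rescaling of surviving activations that many libraries apply is left out to keep the sketch short.

```python
import random


def dropout(activations, rate, rng):
    """Zero a fixed fraction of activations, chosen at random."""
    n_drop = round(rate * len(activations))
    dropped = set(rng.sample(range(len(activations)), n_drop))
    return [0.0 if i in dropped else a for i, a in enumerate(activations)]


rng = random.Random(0)                     # seeded so the sketch is repeatable
layer_output = [0.2, 0.9, 0.4, 0.7, 0.1, 0.6, 0.3, 0.8]
first_pass = dropout(layer_output, rate=0.25, rng=rng)
second_pass = dropout(layer_output, rate=0.25, rng=rng)   # a new subset is drawn
```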

Noise Layers:

Noise layers add random noise to the outputs of the previous layer.

• E.g. add Gaussian noise to change the data to look like “new data” – augmentation

Regularizers:

A regularizer is a penalty that is applied to the loss function of a layer

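• E.g. in regression our loss function may look like this (squared error loss): $L(\beta) = \sum_i (y_i - \beta^T x_i)^2$

• Our optimization algorithm will find values of the parameters (β) that minimize this loss function.

• A regularizer adds a penalty on the size of the parameters to this loss, so minimizing it forces a reduction (a simplification) in the parameters.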
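
A minimal sketch of a regularized loss, assuming a linear model and an L2 penalty (the sum of squared parameters scaled by a hypothetical strength lam). The notes do not say which penalty is meant, so L2 is an illustrative choice, and the data is made up.

```python
def squared_error(beta, xs, ys):
    """Sum of squared errors for the linear model y = beta . x."""
    return sum((y - sum(b * xi for b, xi in zip(beta, x))) ** 2
               for x, y in zip(xs, ys))


def regularized_loss(beta, xs, ys, lam=0.1):
    """Squared error plus an L2 penalty that grows with the size of beta."""
    return squared_error(beta, xs, ys) + lam * sum(b * b for b in beta)


xs = [(1.0, 0.0), (0.0, 1.0)]
ys = [2.0, 0.0]
plain = squared_error((2.0, 0.0), xs, ys)
penalized = regularized_loss((2.0, 0.0), xs, ys, lam=0.5)
```

Even a perfect fit now has a non-zero loss, so the optimizer is pushed towards smaller parameter values unless the data justifies large ones.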

Chapter 6: Deep Learning

“Deep” learning refers to machine learning using a neural network with at least 2 hidden layers.

RNN: Modeling Sequences with Neural Networks

▪ Consider a sequence of inputs

• Input vectors

▪ Produce one or more outputs

▪ This can be done with Recurrent Neural Networks (RNN) to capture temporal dependencies between the inputs.

Key Idea: maintain a hidden state 𝒉 (the hidden layer’s output) over time.

RNN can have an arbitrarily long input
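
The hidden-state idea can be sketched with a scalar RNN. The weights here are arbitrary illustrative values rather than trained ones, and the function name is hypothetical.

```python
from math import tanh


def rnn_forward(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar RNN: the same weights are reused at every time step."""
    h = 0.0                       # initial hidden state
    states = []
    for x in xs:                  # works for a sequence of any length
        h = tanh(w_x * x + w_h * h + b)   # new state depends on input and old state
        states.append(h)
    return states


short = rnn_forward([1.0, 0.0])
long_run = rnn_forward([1.0] * 7)
```

In the two-step run the second input is 0, yet the second state is non-zero because it still carries information from the first input.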

RNN Summary

Training is carried out using standard gradient descent. Recurrent neural networks are good at learning from patterns in past data:

• Predicting the next word in a sentence.

• Predicting stock performance based on past data.

• Trajectory prediction for a robot (e.g. learning how to pick up an object).

RNNs are poor at memorization: traditional RNNs have trouble capturing long-term dependencies because of their recurrent network structure.

To solve this problem we introduce a special RNN called a “Long Short Term Memory” or LSTM Neural Network

LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that is designed to learn long-term dependencies and overcome the vanishing gradient problem, which standard RNNs struggle with. It does this through a more complex cell structure that includes gates to control the flow of information.

LSTMs are designed to remember important information over long sequences and forget irrelevant details.

They achieve this using three gates (forget, input, and output), which control what to discard, what to store, and what to output.

The cell state acts as a long-term memory, while the hidden state is the short-term memory that gets passed to the next time step.

Each “gate” appears to be a single node, but they are actually a collection of nodes.
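
One step of an LSTM cell can be sketched with scalars. This follows the commonly used LSTM equations, with each gate reduced to a single node for readability; the parameter layout and names are hypothetical.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))      # how much of the old cell state to keep
    i = sigmoid(pre("input"))       # how much new information to write
    o = sigmoid(pre("output"))      # how much of the cell state to expose
    g = tanh(pre("candidate"))      # candidate values to write
    c = f * c_prev + i * g          # cell state: long-term memory
    h = o * tanh(c)                 # hidden state: short-term memory
    return h, c


zeros = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zeros)

# A large forget bias and a very negative input bias make the cell hold its memory.
remember = dict(zeros, forget=(0.0, 0.0, 20.0), input=(0.0, 0.0, -20.0))
_, c_kept = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=remember)
```

Because the cell state is updated by scaling and adding rather than by repeated squashing, information can survive across many steps when the forget gate stays near 1.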

Autoencoders

Idea: We train a neural network to take an input x, and generate the same output x.

Consists of two parts: Encoder and Decoder.

1. Encoder: the “analysis” net which computes the hidden representation of the input data. 2. Decoder: the synthesis net which recomposes the data from the hidden representation.

Encoder: compress the input data into a lower dimension

However, AEs are very bad at this: the recovered data will be highly “lossy”, and the AE can only reconstruct data of the type it was trained on.

Useful for more efficient use of resources (e.g., GPU memory, communication) etc.

Decoder: It is a dictionary that is highly specialized to generate the trained data. We can detect anomalies by measuring reconstruction loss:

• When the AE sees “typical” data it was trained on, it can reconstruct the input with minimal error.

• When the data becomes atypical, reconstruction error rises. We can flag anomalies when this error exceeds a threshold
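
Thresholding the reconstruction error can be sketched as below. No autoencoder is trained here: the reconstructions are made-up stand-ins for what a trained model might output, and the function names and threshold are hypothetical.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input and the autoencoder's output."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def flag_anomalies(inputs, reconstructions, threshold):
    """True for every sample whose reconstruction error exceeds the threshold."""
    return [reconstruction_error(x, r) > threshold
            for x, r in zip(inputs, reconstructions)]


# Hypothetical inputs and the outputs a trained autoencoder might produce.
inputs = [[1.0, 2.0, 3.0], [1.0, 2.0, 9.0]]
reconstructions = [[1.1, 1.9, 3.0], [1.0, 2.1, 3.2]]
flags = flag_anomalies(inputs, reconstructions, threshold=0.5)
```

In practice the threshold would be chosen from the errors observed on typical data.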

Convolutional Neural Networks

Basic Idea: 1. pick a convolution matrix 𝐾

2. slide this over an image and compute “inner product” of 𝐾 and the corresponding field of the image

Can be used to smooth or sharpen an image
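
The sliding inner product can be sketched in plain Python. This assumes no padding and a stride of 1, and, like most CNN implementations, it does not flip the kernel. The sharpening kernel is a commonly used example and the image values are made up.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and take inner products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
image = [[1, 1, 1, 1],
         [1, 5, 5, 1],
         [1, 5, 5, 1],
         [1, 1, 1, 1]]
result = convolve2d(image, sharpen)
```

The bright centre values are pushed further from their darker neighbors, which is what sharpening means here. Note that the output is smaller than the input because the kernel cannot hang over the edges.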

CNNs automatically and adaptively learn spatial hierarchies of features through convolutional layers and are widely used in tasks such as image classification, object detection, and segmentation.

Input Layer:

This is the layer that receives the raw data, typically an image represented as a matrix of pixel values (e.g., an RGB image has three channels: red, green, and blue).
The input might be of shape (height, width, depth), where depth refers to the number of channels (e.g., 3 for RGB images).

Convolutional Layer (Conv Layer):

The core building block of a CNN.
A convolutional layer applies a set of filters (or kernels) to the input data. Each filter slides over the input image and, at each position, multiplies its values element-wise with the underlying pixels and sums the results (the convolution) to extract features like edges, textures, and shapes.
Each filter produces a feature map (activation map) that highlights certain characteristics of the input image.

ReLU Layer (Rectified Linear Unit):

After convolution, the CNN applies a non-linear activation function like ReLU to the feature maps. ReLU replaces all negative pixel values in the feature map with zero, introducing non-linearity into the network.
Formula: $f(x) = \max(0, x)$.
This helps the network learn more complex patterns.

Pooling Layer (Downsampling):

A pooling layer reduces the dimensionality of the feature maps, making the network more efficient and reducing the risk of overfitting.
The most common type is max pooling, where the maximum value from a small patch of the feature map is selected. For example, a 2×2 max-pooling layer selects the highest value from each 2×2 grid of the feature map.
Pooling reduces the spatial dimensions of the feature maps while preserving the important features.
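
ReLU followed by 2×2 max pooling can be sketched as below, assuming non-overlapping patches and a feature map with even dimensions. The feature map values are made up.

```python
def relu(feature_map):
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 patch."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]


feature_map = [[-1, 2, 0, -3],
               [4, -5, 1, 1],
               [0, 0, -2, -2],
               [-1, 3, -4, 6]]
pooled = max_pool_2x2(relu(feature_map))
```

The 4×4 map shrinks to 2×2, halving each spatial dimension while keeping the strongest response in each patch.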

Fully Connected Layer (FC Layer):

After several convolutional and pooling layers, the high-level reasoning in the neural network is done via fully connected layers.
The output of the last pooling layer is flattened into a vector, which is then passed through one or more fully connected layers, similar to a traditional neural network.
In these layers, every node (neuron) is connected to every node in the previous layer, and the network learns to combine the features extracted by the convolutional layers to predict the output.

Softmax Layer / Output Layer:

For classification tasks, the output layer is often a softmax layer that converts the final scores (logits) from the fully connected layer into probabilities. The softmax function ensures that the output values sum to 1 and represent the predicted probabilities for each class.
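
A minimal softmax sketch follows. Subtracting the largest logit before exponentiating is a common numerical-stability trick and does not change the result; the logits are made up.

```python
from math import exp


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                               # subtract the max for stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```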

How a CNN Works (Step-by-Step):

Input Layer:

The network receives an input image (e.g., 32×32 RGB image, 3 channels for color).

Convolutional Layer:

A filter or kernel (e.g., 3×3 or 5×5) slides across the image, multiplying element-wise with the input at each position and summing the products, producing a feature map.
Multiple filters are applied to extract different features (e.g., edges, corners, textures).

Activation Function (ReLU):

The result from the convolutional operation is passed through the ReLU activation function to introduce non-linearity.
This step helps the model learn complex patterns.

Pooling Layer:

A pooling layer (often max pooling) is applied to the feature map to reduce its spatial dimensions, typically halving its size, and retaining the most important features.

Multiple Conv + Pool Layers:

The convolution, ReLU, and pooling operations are typically repeated several times to build deeper representations of the data.
As you go deeper, the network learns more abstract features (low-level features like edges in early layers, higher-level features like object parts in later layers).

Flattening:

After the final pooling layer, the 2D feature maps are flattened into a 1D vector to prepare for fully connected layers.

Fully Connected Layer:

The flattened vector is passed through one or more fully connected layers where the features are combined to make a prediction.

Output Layer (Softmax):

For classification tasks, the final layer is typically a softmax layer, which produces a probability distribution over the classes. The class with the highest probability is the predicted class.

Generative Adversarial Network

Motivation:

We often don’t have enough training data.

We want to build a neural network that will produce “fake” data from the real data

A GAN consists of two models:

• A “Generator” network that learns how to “counterfeit” the data. • Takes a vector of random noise as input and generates an image

• A “Discriminator” network that learns how to differentiate between the real and the fake data. Takes in an image, and classifies whether it’s real (label 1) or fake (label 0)

How Do They Work?

• Training the Generator: • Weights for the Discriminator are frozen.

• The GAN is now trained on the assumption that the fake data is real.

• End result: The Generator adjusts its weights to try to maximize the “realness” of the fake data. • The process repeats.

• So training alternates between:

• Maximizing the Discriminator’s ability to tell fake data from real.

• Maximizing the quality of the fake data to minimize the Discriminator’s ability to differentiate.

This is why it is called an “adversarial network”:

• The two networks try to fight each other.

Initial Setup: Create the Generator, Discriminator, and GAN networks. Load the data for training.

GAN Alternating Training:

Discriminator Training:
- Freeze the Generator’s weights.
- Generate fake data and combine it with randomly chosen real data.
- Label real data as 1 and fake data as 0.
- Train the Discriminator to correctly classify real vs. fake data.
Generator Training:
- Freeze the Discriminator’s weights.
- Generate fake data using random inputs (noise).
- Train the GAN to make the Generator improve at producing realistic data by trying to convince the Discriminator that the fake data is real (optimizing the Generator to get a “1” from the Discriminator).
- The Discriminator is not updated during this step.
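
The difference between the two phases lies mostly in how the training batches are labelled, which the sketch below shows. No networks are trained here: toy_generator is a hypothetical stand-in for a real Generator, and the data values are made up.

```python
import random


def discriminator_batch(real_samples, generator, rng, n_fake):
    """Real data labelled 1 and fake data labelled 0 (Generator frozen)."""
    fake = [generator(rng.random()) for _ in range(n_fake)]
    return [(x, 1) for x in real_samples] + [(x, 0) for x in fake]


def generator_batch(generator, rng, n_fake):
    """Fake data labelled 1: the Generator is trained to earn a 1
    from the frozen Discriminator."""
    return [(generator(rng.random()), 1) for _ in range(n_fake)]


toy_generator = lambda noise: 2.0 * noise      # hypothetical stand-in for a network
rng = random.Random(0)
d_batch = discriminator_batch([0.9, 1.1, 1.0], toy_generator, rng, n_fake=3)
g_batch = generator_batch(toy_generator, rng, n_fake=2)
```

In the Generator phase the fake samples carry the label 1 on purpose, so the error signal tells the Generator how to look more real.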

Summary of How GANs Work:

A Generator creates synthetic data from random noise.
A Discriminator classifies data as real or fake.
The generator tries to fool the discriminator, while the discriminator tries to improve its ability to detect fakes.
Over time, the generator gets better at creating realistic data, and the discriminator gets better at identifying it, until the generator’s output becomes indistinguishable from real data.