Distributed Systems: Security, Fault Tolerance, and Recovery
Security Issues in Distributed Systems
Distributed systems face two primary security challenges:
Secure Communication
Ensuring authentication, message integrity, and confidentiality requires secure channels to prevent eavesdropping, tampering, and message forgery.
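As a minimal sketch of the integrity and authentication parts of a secure channel, the example below tags each message with an HMAC computed from a shared secret. The key, the message, and the function names sign and verify are illustrative assumptions; a real channel would typically rely on an established protocol such as TLS, and an HMAC by itself does not provide confidentiality.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> bytes:
    """Return an HMAC-SHA256 tag binding the message to the shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


# Hypothetical pre-shared key and message, for illustration only.
shared_key = secrets.token_bytes(32)
message = b"transfer 10 units to account 42"
tag = sign(shared_key, message)
```

A receiver holding the same key accepts the message only if verify returns True; a tampered message or a tag forged without the key fails the check.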
Authorization
Verifying that an authenticated client is permitted to perform the operations it requests.
Threat Models
A threat model identifies security risks and vulnerabilities in a system.
Steps in Threat Modeling
- Identify Security Objectives: Confidentiality, integrity, availability.
- Decompose the Application: Analyze system components, data flow, and entry points.
- Identify and Rank Threats: List possible threats and rank them by likelihood and severity of impact.
- Develop Countermeasures: Propose fixes such as encryption or secure protocols.
- Document Findings: Create a comprehensive report for stakeholders.
Benefits: Vulnerabilities are identified early, when fixing them costs less time and money.
Challenges: The process is time-consuming and requires skilled personnel.
Authentication and Authorization
Authentication: Verifies the identity of users (e.g., password, biometrics).
Authorization: Ensures the user has appropriate access rights after authentication.
Techniques
- Authentication: Passwords, 2FA, biometrics.
- Authorization: Role-Based Access Control (RBAC), which grants permissions to roles rather than to individual users.
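The authorization step under RBAC can be sketched as a lookup from user to roles and from roles to permissions. The role names, user names, and permission strings below are hypothetical, and the sketch assumes the user has already been authenticated.

```python
# Hypothetical role and user tables, for illustration only.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

USER_ROLES = {
    "alice": {"admin"},
    "bob": {"viewer"},
}


def is_authorized(user: str, permission: str) -> bool:
    """Return True if any role assigned to the user grants the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

Unknown users have no roles and are therefore denied by default, which is usually the safer choice.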
Encryption and Decryption
Encryption: Converts plaintext into ciphertext using a key, to ensure confidentiality.
Decryption: Converts ciphertext back to plaintext using the corresponding key.
Techniques
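- Symmetric Key Encryption (e.g., AES): The same secret key encrypts and decrypts.
- Public Key Encryption (e.g., RSA): A public key encrypts and the matching private key decrypts.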
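To make the public key idea concrete, here is a toy textbook-RSA sketch with deliberately tiny primes. The primes, the exponent, and the function names are illustrative assumptions; the code omits padding and uses key sizes that are trivially breakable, so it shows only the arithmetic and is not usable encryption.

```python
# Toy textbook RSA: illustration only, never use for real data.
p, q = 61, 53                 # tiny primes, assumed for the example
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)       # Euler's totient of n
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # private exponent (requires Python 3.8+)


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < n with the public key (e, n)."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Decrypt with the private key (d, n)."""
    return pow(c, d, n)
```

Anyone may know (e, n) and encrypt, but only the holder of d can decrypt. In practice, a vetted cryptographic library would be used instead of hand-written arithmetic.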
Fault Tolerance in Distributed Systems
Fault tolerance ensures the system continues operating correctly despite failures.
Types of Faults
- Transient Faults: Occur once and then disappear (e.g., a momentary network glitch).
- Intermittent Faults: Appear, vanish, and reappear unpredictably (e.g., unstable hardware such as a loose connector).
- Permanent Faults: Persist until the faulty component is repaired or replaced (e.g., a burned-out chip).
Failure Models
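- Crash Failures: A process halts and stays halted, having worked correctly until that point.
- Omission Failures: A process fails to receive a request or to send a response.
- Timing Failures: A response arrives outside the allowed time interval.
- Response Failures: The response is incorrect, either a wrong value or an incorrect state transition.
- Byzantine (Arbitrary) Failures: A process produces arbitrary responses at arbitrary times, whether through malice or random faults.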
Fault Tolerance Techniques
Redundancy
- Information Redundancy: Error correction codes (e.g., Hamming codes).
- Time Redundancy: Repeat an operation if it fails (e.g., retry an aborted request); most useful for transient and intermittent faults.
- Physical Redundancy: Extra hardware or software for backup.
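Information redundancy can be illustrated with the Hamming(7,4) code, which adds three parity bits to four data bits so that any single flipped bit can be located and corrected. The function names below are assumptions for this sketch, and bits are represented as lists of 0 and 1.

```python
def hamming_encode(data):
    """Encode 4 data bits as a 7-bit Hamming(7,4) codeword.

    Codeword layout by position 1..7: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The code corrects one error per codeword; two or more flipped bits in the same codeword are miscorrected, so stronger codes are needed on noisier channels.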
Replication
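Replication uses multiple identical components (e.g., servers, processes) so that the service survives the failure of some of them.
- Primary-Backup: One primary handles requests and forwards updates to its backups; a backup takes over if the primary fails.
- Active Replication: Every replica processes every request, in the same order, so any replica can answer.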
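A primary-backup scheme can be sketched in memory as follows. The class names Replica and PrimaryBackupGroup, the alive flag, and the rule "first live replica is the primary" are assumptions made for this example; a real system also needs failure detection, ordered update delivery, and state transfer for recovering replicas.

```python
class Replica:
    """One copy of the data; 'alive' is flipped to simulate a crash."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

    def apply(self, key, value):
        self.store[key] = value


class PrimaryBackupGroup:
    """The first live replica acts as primary; the rest are backups."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def primary(self):
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError("no live replica")

    def write(self, key, value):
        primary = self.primary()
        primary.apply(key, value)
        # Forward the update to every live backup before acknowledging.
        for replica in self.replicas:
            if replica is not primary and replica.alive:
                replica.apply(key, value)

    def read(self, key):
        return self.primary().store.get(key)
```

Because every write reaches the backups before it is acknowledged, a backup that is promoted after a crash already holds all acknowledged data.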
Consensus Algorithms
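- Paxos: A value is chosen once a majority of acceptors accept it; because any two majorities overlap, conflicting values cannot both be chosen.
- Raft: Elects a leader that replicates a log to followers; an entry is committed once a majority of servers have stored it.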
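The majority rule behind Paxos can be sketched with a simplified single-decree version. This is an in-memory, synchronous illustration under strong assumptions: no message loss, no concurrent proposers, and proposal numbers that are positive integers chosen by the caller. The names Acceptor and propose are hypothetical, and the sketch is not a complete Paxos implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0           # highest proposal number promised so far
        self.accepted_n = 0         # number of the accepted proposal, 0 if none
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, 0, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases; return the chosen value, or None if no majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0]]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered proposal already accepted.
    _, prev_n, prev_value = max(promises, key=lambda r: r[1])
    if prev_n > 0:
        value = prev_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

Once a value has been chosen, a later proposer with a higher number is forced to adopt it rather than its own value, which is what keeps the decision stable.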
Recovery Mechanisms
- Backward Recovery: Restores a previous correct state using checkpoints.
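- Forward Recovery: Moves the system to a new correct state from which it can continue, without rolling back to a previous state.
Example: Retransmitting a lost packet is backward recovery, because the sender returns to the state it was in before the packet was sent. Reconstructing a lost packet from other received packets (erasure correction) is forward recovery.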
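Backward recovery with checkpoints can be sketched as saving a copy of the state and restoring it when an error is detected. The class CheckpointedState, the account example, and the "balance must not go negative" invariant are assumptions for illustration; real systems write checkpoints to stable storage and must coordinate them across processes.

```python
import copy


class CheckpointedState:
    """Holds mutable state plus one saved checkpoint of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Record the current state as the last known correct state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def apply_transfer(account, amount):
    """Mutate the state first, then detect the error after the fact."""
    account.state["balance"] -= amount
    if account.state["balance"] < 0:
        raise ValueError("invariant violated: negative balance")


account = CheckpointedState({"balance": 100})
account.checkpoint()
try:
    apply_transfer(account, 150)   # leaves the state corrupted
except ValueError:
    account.rollback()             # return to the checkpointed state
```

After the rollback the balance is back to its checkpointed value, at the cost of losing any work done since the checkpoint.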
Distributed Mutual Exclusion
Ensures only one process accesses a critical section (CS) at a time.
Approaches
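- Token-Based: A unique token circulates among processes, and only the holder may enter the CS (e.g., token ring, Suzuki-Kasami).
- Non-Token-Based: Processes exchange timestamped messages to decide who enters next (e.g., Lamport, Ricart-Agrawala).
- Quorum-Based: A process needs permission from only a subset (quorum) of nodes, and any two quorums overlap (e.g., Maekawa).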
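The token-based approach can be sketched as a simulation of a token travelling once around a logical ring. The function name token_ring and its parameters are assumptions for this example; the sketch ignores token loss and process crashes, which real token-based algorithms must handle.

```python
def token_ring(num_processes, wants_cs, start=0):
    """Pass the token once around the ring and return the CS entry order.

    num_processes: size of the ring, with processes numbered from 0
    wants_cs: set of process ids currently waiting for the critical section
    start: id of the process that holds the token initially
    """
    pending = set(wants_cs)
    order = []
    holder = start
    for _ in range(num_processes):
        if holder in pending:
            order.append(holder)       # only the token holder enters the CS
            pending.remove(holder)
        holder = (holder + 1) % num_processes  # pass the token to the neighbor
    return order
```

Since only one token exists, at most one process is in the CS at any time, and every waiting process gets its turn within one lap of the ring.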