NoSQL Databases: Types, ACID vs BASE, and Use Cases

Posted on Dec 17, 2024 in Technology

NoSQL Databases: A Deep Dive

NoSQL, short for “Not Only SQL”, marks a shift from traditional relational database management systems (RDBMS) towards more flexible data storage and retrieval. NoSQL databases offer scalable, schema-less architectures, ideal for handling unstructured or semi-structured data. They provide solutions tailored to specific data and workload needs, making them invaluable for modern applications, from real-time analytics to large-scale web applications.

Key Advantages of NoSQL

Scale: NoSQL databases scale out by distributing data across multiple servers.
Schema-less: They are more flexible with data models, accommodating semi-structured or unstructured data.
Performance: By avoiding joins and complex transactions, and using distributed architectures, NoSQL databases achieve high performance for specific workloads.
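
To make "scaling out" concrete, here is a minimal sketch of hash-based partitioning, one common way to spread keys across servers. The function names (shard_for, distribute) and the choice of four shards are illustrative assumptions, not the scheme of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard they hash to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user:{i}" for i in range(1000)]
shards = distribute(keys, 4)
```

Because the hash is deterministic, any node can compute where a key lives without consulting a central directory. Real systems usually add something like consistent hashing so that changing the number of shards does not move most keys.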

Types of NoSQL Databases

Document-based, column-based, key-value stores, and graph databases are common types.

NoSQL: BASE Properties

Before comparing NoSQL and RDBMS, it’s important to understand BASE, which stands for “Basically Available, Soft state, Eventually consistent”.

Basically Available: The system remains available for reads and writes, even when some nodes are unavailable, prioritizing availability over immediate consistency.
Soft state: The system’s state may change over time, even without input, due to the eventual consistency model.
Eventually consistent: The system will become consistent over time, with all replicas converging to the same value after some time.
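
The three properties above can be illustrated with a toy simulation. This is only a sketch: the Replica class, the version numbers, and the sync function are hypothetical stand-ins for the replication machinery a real database provides.

```python
class Replica:
    """A toy replica that keeps the highest-versioned value per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(replicas):
    """Anti-entropy pass: every replica adopts the newest version of each key."""
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in merged.items():
            replica.write(key, value, version)


a, b = Replica(), Replica()
a.write("profile:42", "old bio", version=1)
sync([a, b])

a.write("profile:42", "new bio", version=2)  # accepted by replica a only
stale = b.read("profile:42")                 # replica b has not seen it yet
sync([a, b])
fresh = b.read("profile:42")                 # replicas have converged
```

Between the second write and the second sync, replica b still answers reads (basically available) but with the older value; its state then changes without any new client input (soft state) until both replicas agree (eventually consistent).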

Differences: ACID vs. BASE

Prioritization

ACID: Prioritizes data consistency, so a transaction either commits in full or not at all and reads see committed, up-to-date data. In partition-tolerant distributed systems, this often means sacrificing some availability.
BASE: Prioritizes availability, allowing the system to function even when some nodes are unavailable, potentially returning older data.

Use Cases

ACID: Ideal for systems where data integrity is critical, like banking systems.
BASE: Useful for systems that can tolerate some inconsistencies, like social media platforms.

System Behavior During Failures

ACID: Transactions might fail or be rolled back to ensure consistency, potentially reducing availability.
BASE: The system continues to operate, accepting read and write operations, even if some data is outdated.
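
The ACID side of this behavior can be seen with Python's built-in sqlite3 module. In this sketch the accounts table, the CHECK constraint, and the transfer function are assumptions chosen for illustration; the point is that a failing step rolls back the whole transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically; return False if the transaction was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
after_failed = dict(conn.execute("SELECT id, balance FROM accounts"))

succeeded = transfer(conn, "alice", "bob", 30)
after_success = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The oversized transfer credits bob first, then violates the constraint on alice, and the rollback undoes the credit as well. The request fails, but the data never shows a half-finished transfer.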

Complexity

ACID: Complexity is often handled within the database system.
BASE: Requires the application to handle inconsistencies and be aware of the eventual consistency model.

NoSQL vs RDBMS

NoSQL: Schema-less or dynamic schema, allowing new fields without altering existing records. No standard query language. Follows BASE properties.
RDBMS: Fixed schema, schema modifications can be time-consuming. Uses SQL. Follows ACID properties.

MongoDB: A Document-Based NoSQL Database

MongoDB stores data in BSON (Binary JSON) format, supporting embedded documents and arrays, offering more flexibility than traditional table-based relational databases. It is schema-less, allowing each document within a collection to have different fields.
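
As a rough picture of what "schema-less" means, the sketch below models a collection as a list of plain Python dictionaries and filters it with a dotted field path. It does not use a MongoDB driver; the users data and the find helper are hypothetical and only mimic the idea of querying nested fields.

```python
_MISSING = object()  # sentinel for "field not present in this document"

# Documents in one collection need not share the same fields.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Lin", "tags": ["admin", "ops"]},
    {"_id": 3, "name": "Sam", "address": {"city": "Oslo", "zip": "0150"}},
]


def find(collection, field_path, value):
    """Return documents whose (possibly nested) field equals value."""
    results = []
    for doc in collection:
        current = doc
        for part in field_path.split("."):
            if isinstance(current, dict) and part in current:
                current = current[part]
            else:
                current = _MISSING
                break
        if current is not _MISSING and current == value:
            results.append(doc)
    return results


in_oslo = find(users, "address.city", "Oslo")
named_ada = find(users, "name", "Ada")
```

Documents that lack the queried field are simply skipped, which is the practical consequence of not enforcing a fixed schema.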

Secondary Indexes

In DynamoDB, secondary indexes are used to query data on non-primary-key attributes by defining an alternate key. They can be global or local.
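
Conceptually, a secondary index is a second lookup structure keyed on a different attribute. The following sketch uses in-memory dictionaries; the table contents and the build_index helper are illustrative assumptions, not how any database stores its indexes.

```python
from collections import defaultdict

# Primary key -> item
table = {
    "order#1": {"customer": "ada", "status": "shipped"},
    "order#2": {"customer": "lin", "status": "pending"},
    "order#3": {"customer": "ada", "status": "pending"},
}


def build_index(table, attribute):
    """Map each value of a non-key attribute to the primary keys that have it."""
    index = defaultdict(list)
    for primary_key, item in table.items():
        if attribute in item:
            index[item[attribute]].append(primary_key)
    return index


by_customer = build_index(table, "customer")
ada_orders = [table[pk] for pk in by_customer["ada"]]
```

Without the index, finding all of one customer's orders would mean scanning every item; with it, the lookup goes straight to the matching primary keys. The cost is that the index must be kept up to date on every write.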

Provisioned Read/Write Throughput

Provisioned throughput is the maximum capacity an application can consume from a table or index; that capacity is divided evenly among partitions. Throughput is measured in read capacity units (RCU) and write capacity units (WCU).
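
A back-of-the-envelope estimate can help when choosing provisioned capacity. The sketch below assumes the commonly documented sizing rules: one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. The function names are hypothetical, and the current AWS documentation should be checked before relying on the numbers.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate RCUs: items are billed in 4 KB steps per read."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    # Eventually consistent reads are assumed to cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate WCUs: items are billed in 1 KB steps per write."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


strong = read_capacity_units(6 * 1024, reads_per_second=10)
eventual = read_capacity_units(6 * 1024, reads_per_second=10, strongly_consistent=False)
writes = write_capacity_units(1536, writes_per_second=5)
```

Note how a 6 KB item rounds up to two 4 KB steps per read, so item size matters as much as request rate.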

On-Demand R/W Throughput

Pay-per-request pricing for read and write requests. DynamoDB adapts rapidly to accommodate the workload, ideal for new tables with unknown workloads or unpredictable traffic.

DynamoDB API Operations

Control operations: Create and manage DynamoDB tables.
Data operations: Create, read, update, and delete actions on data.
Batch operations: Get and write batches of items.
Transaction operations: Make coordinated changes to multiple items.

Batch Operations

BatchGetItem: Read up to 16 MB of data, up to 100 items from multiple tables.
BatchWriteItem: Write up to 16 MB of data, up to 25 PUT or DELETE requests to multiple tables.
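
Because BatchWriteItem accepts at most 25 requests per call, client code usually has to split larger workloads into chunks. The sketch below only builds the chunks; the chunk helper and the sample items are assumptions, and no AWS call is made.

```python
def chunk(items, size=25):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 60 put requests in the shape used by the low-level DynamoDB API.
requests = [{"PutRequest": {"Item": {"pk": {"S": f"item#{i}"}}}} for i in range(60)]
batches = list(chunk(requests))
batch_sizes = [len(batch) for batch in batches]
```

In a real client, each batch would be sent in its own call, and any requests the service reports back as unprocessed would need to be retried.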

Transactional Operations

TransactWriteItems: Includes one or more PutItem, UpdateItem, and DeleteItem operations across multiple tables.
TransactGetItems: Includes one or more GetItem operations across multiple tables.
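
The all-or-nothing behavior of a multi-item transaction can be sketched in memory. Everything here is hypothetical (the transact_write function, the action dictionaries, the ConditionFailed exception); it imitates the idea of coordinated changes across tables rather than the DynamoDB API itself.

```python
import copy


class ConditionFailed(Exception):
    """Raised when a condition attached to an action does not hold."""


def transact_write(tables, actions):
    """Apply every action or none: work on a copy, swap it in only on success."""
    staged = copy.deepcopy(tables)
    for action in actions:
        table = staged[action["table"]]
        key = action["key"]
        condition = action.get("condition")
        if condition is not None and not condition(table.get(key)):
            raise ConditionFailed(f"condition failed for {key!r}")
        if action["op"] == "put":
            table[key] = action["item"]
        elif action["op"] == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown op {action['op']!r}")
    tables.clear()
    tables.update(staged)


def place_order(tables, order_id):
    """Create an order and decrement stock together, only if stock remains."""
    stock = tables["inventory"]["sku1"]["stock"]
    transact_write(tables, [
        {"table": "orders", "op": "put", "key": order_id, "item": {"sku": "sku1"}},
        {"table": "inventory", "op": "put", "key": "sku1",
         "item": {"stock": stock - 1},
         "condition": lambda item: item is not None and item["stock"] > 0},
    ])


tables = {"inventory": {"sku1": {"stock": 1}}, "orders": {}}
place_order(tables, "o1")          # succeeds, stock drops to 0
try:
    place_order(tables, "o2")      # condition fails, nothing is written
    second_ok = True
except ConditionFailed:
    second_ok = False
```

The second order is staged before the stock check fails, yet it never appears in the orders table, which is the guarantee a transaction adds over a plain batch.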

Physical Denormalization

The intentional process of combining tables or adding redundant data to improve performance, trading off some data redundancy for gains in query performance.
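
A small example of the trade-off, using made-up customers and orders data: copying the customer name into each order removes the lookup at read time, at the price of storing the name more than once.

```python
customers = {"c1": {"name": "Ada"}, "c2": {"name": "Lin"}}
orders = [
    {"order_id": "o1", "customer_id": "c1", "total": 40},
    {"order_id": "o2", "customer_id": "c2", "total": 15},
    {"order_id": "o3", "customer_id": "c1", "total": 25},
]


def denormalize(orders, customers):
    """Return new order records with the customer's name copied in."""
    return [
        dict(order, customer_name=customers[order["customer_id"]]["name"])
        for order in orders
    ]


orders_denorm = denormalize(orders, customers)
names = [order["customer_name"] for order in orders_denorm]
```

Reads of orders_denorm no longer need the customers table, but if a customer changes their name, every copied value has to be updated too.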

Distributed Architecture

Refers to the division of labor across multiple computers or processors, enabling scalability, fault tolerance, reduced latency, and concurrent processing.

Machine Learning in Databases

Types

Query Optimization: Machine learning models predict the most efficient order for performing joins and select the best indexes.
Cache Management: Algorithms optimize data caching based on usage patterns.
Data Compression: Machine learning recognizes patterns for efficient compression, choosing algorithms dynamically.
Data Sharding: Machine learning identifies query patterns to allocate shards, balancing load and optimizing performance.
Data Partitioning: Machine learning predicts loads and suggests optimal partitioning schemes.

Big Data and Data Science

Big data refers to data sets so large or complex that traditional data-processing software cannot handle them. Data science is an interdisciplinary field focused on extracting knowledge from data.

Key Enablers

Increased storage capacities
Increased processing power
Availability of data

Challenges

Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy.

Characteristics of Big Data

Volume: How much data (Terabyte to Zettabyte).
Variety: Types of data (Structured to unstructured).
Velocity: Speed of new data generation (Batch to streaming).
Veracity: Accuracy and relevance.

Types of Big Data

Structured: Organized with fixed size, easily stored in relational databases.
Unstructured: No organized row-column format, difficult to analyze.
Natural Language: Requires specific data science techniques and linguistics.
Machine-generated: Automatically created by computers, requiring scalable tools.
Graph-based: Focuses on relationships between objects.
Audio, image, and video streamed data.

Data Science Process

Typically consists of six steps that aim to extract knowledge from data. An agile, iterative project model is an alternative to working through the steps in strict sequence.

Research Goal
Retrieving Data
Data Preparation
Data Exploration
Data Modeling
Presentation and Automation

Amazon RDS and VPC

Amazon Relational Database Service (RDS)

A managed relational database service that supports multiple database engines and runs inside a VPC. Its features include automated backups, read replicas, Multi-AZ deployments, parameter and option groups, security features, monitoring, maintenance windows, data migration, and scalability.

Amazon Virtual Private Cloud (VPC)

Allows users to launch AWS resources in a logically isolated virtual network, providing isolation, customization, fine-grained access control, ease of use, integration with AWS services, and high availability.

Key Features of VPC

Security Groups: Act as a virtual firewall for instances.
VPC Peering: Connects one VPC with another.
NAT Gateways/Instances: Allow outbound traffic from private subnets.

MongoDB Special Types

ObjectId: A 12-byte unique identifier, used by default for a document's _id field.
ISODate: Stores date and time.
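Timestamp: A 64-bit value used internally by MongoDB (for example, in replication); it pairs a seconds value with an incrementing counter and is not intended for storing application dates.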
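
An ObjectId embeds its creation time, which can be recovered without a driver. The sketch below assumes the documented 12-byte layout (a 4-byte timestamp in seconds, a 5-byte random value, and a 3-byte counter); parse_object_id is a hypothetical helper, and the sample id is a widely used example value.

```python
from datetime import datetime, timezone


def parse_object_id(hex_string):
    """Split a 24-character hex ObjectId into its three components."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[:4], "big")
    return {
        "created_at": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "random": raw[4:9].hex(),
        "counter": int.from_bytes(raw[9:], "big"),
    }


parts = parse_object_id("507f1f77bcf86cd799439011")
created = parts["created_at"].isoformat()
```

Since the timestamp comes first, ObjectIds generated later sort after earlier ones to within one-second resolution, which is why sorting by _id roughly approximates sorting by creation time.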

MongoDB Atlas

A fully managed cloud database service, simplifying MongoDB deployments with built-in best practices for setup, security, and scaling.

Read/Write Throughput

DynamoDB maintains multiple copies of data for durability. Eventually consistent reads might return slightly stale data, while strongly consistent reads return the most up-to-date data.