NoSQL Databases: Types, ACID vs BASE, and Use Cases
NoSQL Databases: A Deep Dive
NoSQL, short for “Not Only SQL”, marks a shift from traditional relational database management systems (RDBMS) towards more flexible data storage and retrieval. NoSQL databases offer scalable, schema-less architectures, ideal for handling unstructured or semi-structured data. They provide solutions tailored to specific data and workload needs, making them invaluable for modern applications, from real-time analytics to large-scale web applications.
Key Advantages of NoSQL
- Scale: NoSQL databases scale out by distributing data across multiple servers.
- Schema-less: They are more flexible with data models, accommodating semi-structured or unstructured data.
- Performance: By avoiding joins and complex transactions, and using distributed architectures, NoSQL databases achieve high performance for specific workloads.
Types of NoSQL Databases
Document-based, column-based, key-value stores, and graph databases are common types.
NoSQL: BASE Properties
Before comparing NoSQL and RDBMS, it’s important to understand BASE, which stands for “Basically Available, Soft state, Eventually consistent”.
- Basically Available: The system remains available for reads and writes, even when some nodes are unavailable, prioritizing availability over immediate consistency.
- Soft state: The system’s state may change over time, even without input, due to the eventual consistency model.
- Eventually consistent: The system will become consistent over time, with all replicas converging to the same value after some time.
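The three properties can be illustrated with a toy simulation. This is an in-memory sketch, not a real database: the `Replica` and `Cluster` classes and their methods are hypothetical names invented for illustration. A write is acknowledged after reaching one replica, and the others catch up only when `sync()` runs.

```python
class Replica:
    """One copy of the data; a plain dict stands in for storage."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Hypothetical cluster that replicates writes asynchronously."""
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = []  # writes not yet applied to every replica

    def write(self, key, value):
        # Acknowledge after updating a single replica (basically available).
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Any replica may answer, so the result can be stale.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Background replication: state changes without new client input
        # (soft state) until all replicas agree (eventual consistency).
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


cluster = Cluster(["a", "b", "c"])
cluster.write("likes", 10)
stale = cluster.read("likes", replica_index=2)  # replica "c" has not caught up
cluster.sync()
fresh = cluster.read("likes", replica_index=2)  # replicas have converged
print(stale, fresh)
```

Between the write and the sync, a client reading from the third replica sees no value at all; after the sync every replica returns the same one.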
Differences: ACID vs. BASE
Prioritization
- ACID: Prioritizes consistency. Committed transactions always leave the data in a valid state, and in distributed deployments reads typically reflect the most recent committed write, often at the cost of some availability during network partitions.
- BASE: Prioritizes availability, allowing the system to function even when some nodes are unavailable, potentially returning older data.
Use Cases
- ACID: Ideal for systems where data integrity is critical, like banking systems.
- BASE: Useful for systems that can tolerate some inconsistencies, like social media platforms.
System Behavior During Failures
- ACID: Transactions might fail or be rolled back to ensure consistency, potentially reducing availability.
- BASE: The system continues to operate, accepting read and write operations, even if some data is outdated.
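The ACID side of this behavior can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` function, and the balances are illustrative assumptions; the point is only that a failing step undoes the whole transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically; any constraint error undoes both updates."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice's account
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to the destination account is applied first and then rolled back when the debit violates the `CHECK` constraint, so the request fails but the data stays consistent.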
Complexity
- ACID: Complexity is often handled within the database system.
- BASE: Requires the application to handle inconsistencies and be aware of the eventual consistency model.
NoSQL vs RDBMS
- NoSQL: Schema-less or dynamic schema, allowing new fields without altering existing records. No standard query language. Typically follows BASE properties, although some systems also offer ACID transactions.
- RDBMS: Fixed schema; schema modifications can be time-consuming. Uses SQL. Follows ACID properties.
MongoDB: A Document-Based NoSQL Database
MongoDB stores data in BSON (Binary JSON) format, supporting embedded documents and arrays, offering more flexibility than traditional table-based relational databases. It is schema-less, allowing each document within a collection to have different fields.
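As a rough illustration of this document model, the sketch below uses plain Python dicts and the standard `json` module as a stand-in for BSON documents. It does not use a MongoDB driver, and the "users" collection and its field names are hypothetical.

```python
import json

# Two documents from the same hypothetical "users" collection.
# They share no fixed schema: the second has fields the first lacks.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {
        "_id": 2,
        "name": "Lin",
        "address": {"city": "Oslo", "zip": "0150"},  # embedded document
        "tags": ["admin", "beta"],                   # array
    },
]

# Reading an optional nested field has to allow for its absence.
cities = [u.get("address", {}).get("city") for u in users]
print(cities)
print(json.dumps(users[1], indent=2))
```

The flexibility has a cost on the read side: code that consumes the documents must tolerate missing fields, as the `cities` lookup does.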
Secondary Indexes
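In Amazon DynamoDB, secondary indexes let you query data by attributes other than the table's primary key by defining an alternate key. They can be global (a different partition key, spanning all partitions) or local (the same partition key as the table, with a different sort key).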
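Conceptually, a secondary index is a second lookup structure keyed on a different attribute. The sketch below models that idea with in-memory dicts; it is not how any particular database implements indexes, and the `orders` table and `build_index` helper are hypothetical.

```python
from collections import defaultdict

# Table keyed by its primary key (order id); field names are hypothetical.
orders = {
    "o1": {"customer": "ada", "total": 30},
    "o2": {"customer": "lin", "total": 12},
    "o3": {"customer": "ada", "total": 7},
}


def build_index(table, attribute):
    """Map each value of a non-key attribute to the primary keys holding it."""
    index = defaultdict(list)
    for pk, item in table.items():
        index[item[attribute]].append(pk)
    return index


by_customer = build_index(orders, "customer")

# Query by the alternate key without scanning the whole table.
ada_orders = [orders[pk] for pk in by_customer["ada"]]
print(ada_orders)
```

The index has to be kept up to date on every write, which is why indexes speed up reads at the expense of extra write work and storage.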
Provisioned Read/Write Throughput
Provisioned throughput is the maximum capacity an application can consume from a DynamoDB table or index; it is divided evenly among partitions. It is measured in read capacity units (RCU) and write capacity units (WCU).
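A rough capacity estimate can be worked out in a few lines. The sketch below assumes DynamoDB's standard sizing rules: one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. The function names are hypothetical, and transactional requests, which cost more, are not modeled.

```python
import math

RCU_ITEM_BYTES = 4 * 1024  # item size covered by one RCU
WCU_ITEM_BYTES = 1 * 1024  # item size covered by one WCU


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate RCUs; eventually consistent reads cost half as much."""
    units_per_read = math.ceil(item_size_bytes / RCU_ITEM_BYTES)
    total = units_per_read * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate WCUs; item size is rounded up to the next 1 KB."""
    return math.ceil(item_size_bytes / WCU_ITEM_BYTES) * writes_per_second


# A 6 KB item read 10 times and written 5 times per second.
item_size = 6 * 1024
print(read_capacity_units(item_size, 10))         # strongly consistent
print(read_capacity_units(item_size, 10, False))  # eventually consistent
print(write_capacity_units(item_size, 5))
```

Because sizes round up per item, a 6 KB item costs as much to read as an 8 KB one, which is one reason to keep frequently accessed items small.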
On-Demand R/W Throughput
Pay-per-request pricing for read and write requests. DynamoDB adapts rapidly to accommodate the workload, which makes this mode a good fit for new tables with unknown workloads or unpredictable traffic.
DynamoDB API Operations
- Control operations: Create and manage DynamoDB tables.
- Data operations: Create, read, update, and delete actions on data.
- Batch operations: Get and write batches of items.
- Transaction operations: Make coordinated changes to multiple items.
Batch Operations
- BatchGetItem: Read up to 16 MB of data, up to 100 items from multiple tables.
- BatchWriteItem: Write up to 16 MB of data, up to 25 PUT or DELETE requests to multiple tables.
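Because of the per-call limits above, a large set of writes has to be split into several batches on the client side. The following sketch shows only that splitting step; it makes no network calls, and the `chunked` helper and the item contents are hypothetical.

```python
MAX_WRITE_REQUESTS = 25  # per-call limit for BatchWriteItem stated above


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 60 put requests in the low-level request shape; keys are made up.
put_requests = [
    {"PutRequest": {"Item": {"pk": {"S": f"user#{i}"}}}}
    for i in range(60)
]

batches = list(chunked(put_requests, MAX_WRITE_REQUESTS))
sizes = [len(batch) for batch in batches]
print(sizes)
```

In a real client, each batch would be sent as one BatchWriteItem call, and any items the service reports as unprocessed would need to be retried.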
Transactional Operations
- TransactWriteItems: Groups one or more Put, Update, Delete, and ConditionCheck actions, across one or more tables, into a single all-or-nothing operation.
- TransactGetItems: Groups one or more Get actions, across one or more tables, into a single atomic read.
Physical Denormalization
The intentional process of combining tables or adding redundant data to improve performance, trading off some data redundancy for gains in query performance.
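The trade-off can be seen in a small in-memory sketch. The `customers` and `orders` structures and the `order_with_name` helper are hypothetical; plain dicts stand in for tables.

```python
# Normalized layout: two "tables", joined at read time.
customers = {1: {"name": "Ada"}, 2: {"name": "Lin"}}
orders = [
    {"order_id": "o1", "customer_id": 1, "total": 30},
    {"order_id": "o2", "customer_id": 2, "total": 12},
]


def order_with_name(order):
    """Read path for the normalized layout: one extra lookup per order."""
    name = customers[order["customer_id"]]["name"]
    return {**order, "customer_name": name}


# Denormalized layout: the customer name is copied into each order.
orders_denormalized = [order_with_name(o) for o in orders]

# Reads no longer need the customers table...
names = [o["customer_name"] for o in orders_denormalized]
print(names)

# ...but a rename must now be applied to every redundant copy.
customers[1]["name"] = "Ada L."
for o in orders_denormalized:
    if o["customer_id"] == 1:
        o["customer_name"] = "Ada L."
```

Reads become a single lookup, while updates to the duplicated attribute turn into multi-record writes that the application must keep in sync.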
Distributed Architecture
Refers to the division of labor across multiple computers or processors, enabling scalability, fault tolerance, reduced latency, and concurrent processing.
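One common way to divide data across machines is to hash each key and use the result to pick a node. The sketch below uses simple modulo placement with hypothetical node names; it illustrates the idea only and is not the scheme of any specific database.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical server names


def node_for(key, nodes=NODES):
    """Pick a node from a stable hash of the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Spread 1000 keys over the nodes and count how many land on each.
placement = {}
for i in range(1000):
    key = f"user:{i}"
    placement.setdefault(node_for(key), []).append(key)

counts = {node: len(keys) for node, keys in placement.items()}
print(counts)
```

Modulo placement moves most keys whenever the node count changes, so production systems often use consistent hashing or a similar scheme instead.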
Machine Learning in Databases
Types
- Query Optimization: Machine learning models predict the most efficient order for performing joins and select the best indexes.
- Cache Management: Algorithms optimize data caching based on usage patterns.
- Data Compression: Machine learning recognizes patterns for efficient compression, choosing algorithms dynamically.
- Data Sharding: Machine learning identifies query patterns to allocate shards, balancing load and optimizing performance.
- Data Partitioning: Machine learning predicts loads and suggests optimal partitioning schemes.
Big Data and Data Science
Big data refers to large, complex data sets that traditional software cannot handle. Data science is an interdisciplinary field focused on extracting knowledge from data.
Key Enablers
- Increased storage capacities
- Increased processing power
- Availability of data
Challenges
Include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy.
Characteristics of Big Data
- Volume: How much data (Terabyte to Zettabyte).
- Variety: Types of data (Structured to unstructured).
- Velocity: Speed of new data generation (Batch to streaming).
- Veracity: Accuracy and relevance.
Types of Big Data
- Structured: Organized into fixed fields, easily stored in relational databases.
- Unstructured: No organized row-column format, difficult to analyze.
- Natural Language: Requires specific data science techniques and linguistics.
- Machine-generated: Automatically created by computers, requiring scalable tools.
- Graph-based: Focuses on relationships between objects.
- Streaming: Audio, image, and video data.
Data Science Process
Typically consists of six steps that aim to extract knowledge from data. An agile project model, which works in iterations, is an alternative to running the steps strictly in sequence.
- Research Goal
- Retrieving Data
- Data Preparation
- Data Exploration
- Data Modeling
- Presentation and Automation
Amazon RDS and VPC
Amazon Relational Database Service (RDS)
A managed relational database service that supports multiple database engines and runs inside a VPC. It includes automated backups, read replicas, Multi-AZ deployments, parameter and option groups, security features, monitoring, maintenance windows, data migration support, and scalability.
Amazon Virtual Private Cloud (VPC)
Allows users to launch AWS resources in a logically isolated virtual network, providing isolation, customization, fine-grained access control, ease of use, integration with AWS services, and high availability.
Key Features of VPC
- Security Groups: Act as a virtual firewall for instances.
- VPC Peering: Connects one VPC with another.
- NAT Gateways/Instances: Allow outbound traffic from private subnets.
MongoDB Special Types
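- ObjectId: A 12-byte unique identifier, used by default for a document's _id field.
- ISODate: Stores date and time; this is the type to use for application dates.
- Timestamp: A 64-bit value used internally by MongoDB (for example, in replication), combining seconds since the Unix epoch with an incrementing counter. It is not intended for storing application dates.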
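An ObjectId is 12 bytes long, and its first four bytes hold the creation time in seconds since the Unix epoch. The sketch below decodes that time using only the standard library. The `objectid_created_at` helper is a hypothetical name, and the hex string is just a sample value.

```python
from datetime import datetime, timezone


def objectid_created_at(hex_id):
    """Decode the creation time stored in the first 4 bytes of an ObjectId."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[:4], "big")  # big-endian Unix time
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_created_at("507f1f77bcf86cd799439011")
print(created.isoformat())
```

Since the timestamp comes first, ObjectIds sort roughly by creation time, which is why sorting on _id approximates insertion order.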
MongoDB Atlas
A fully managed cloud database service, simplifying MongoDB deployments with built-in best practices for setup, security, and scaling.
Read/Write Throughput
DynamoDB maintains multiple copies of data for durability. Eventually consistent reads might return slightly stale data but consume half the read capacity of strongly consistent reads, which return the most up-to-date data.