NoSQL and MongoDB: Scalable Database Solutions

Posted on Dec 16, 2024 in Computers

Week 7: NoSQL

What is Big Data?

Big Data

Scale – Dimensions

Workload – Number of concurrent sessions and operations
Operation mix (create, read, update, delete)
Read – query mix
Generally, each system use case represents a distinct workload
Data Sets – Volume, Velocity, Variety
Number of records
Record size
Record structure (e.g., sparse records)
Homogeneity/heterogeneity of structure/schema
Elasticity
Runtime peaks and valleys – how frequently, how quickly, how much

Scalability

“Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.”

Scale out: adding more nodes to a system
Scale up: adding more resources to a single node
Not considering temporal aspects
how fast, how often and what granularity scaling actions can be performed

Characterizing Scalability

Scalability = Dependability at Scale
Criteria is often “cost increases linearly as <x> increases”
Scalability is a composite quality that includes:
- Throughput
- Latency
- Availability
- Consistency
- Security
- and more
Need to define this for each system
The challenge is that the qualities are not independent – requirements must account for tradeoffs as system scales
Quality Attribute Scenarios are one useful tool
Scalability = Dependability at Scale
Criteria is often “cost increases linearly as <x> increases”
Scalability is a composite quality that includes:
- Throughput
- Latency
- Availability
- Consistency
- Security
- and more
Need to define this for each system
The challenge is that the qualities are not independent – requirements must account for tradeoffs as system scales
Quality Attribute Scenarios are one useful tool

“Vertical” Scaling

Scale up – “big iron”
Monolithic compute resource
Shared disk
More load = bigger processor and disk
Partitioned databases on disk
Optimizes the data placement on separate files or disks
Monolithic compute resource
Separate database instances
Partition database across database engine instance
Functional partitioning common (e.g., customers, orders, stock)
More compute, more license costs

“Horizontal” Scaling

Spread data and workload across large clusters of commodity-class hardware
For example, data sharding
Horizontal data partitioning
Many possible partitioning schemes, e.g. value based (region, customer ID) or arbitrary (hashing)
Issues:
- Evenly distributing read load, write load, and data volume
- Handling shard failures

Sharding

Sharding is horizontal scaling that divides the data set and distributes the data over multiple servers, or shards.

Each shard is an independent database, and collectively, the shards make up a single logical database.

Horizontal Scaling Distributes Data

Distributed systems theory is hard but well-established
Lamport’s “Time, clocks and ordering of events” (1978), “Byzantine generals” (1982), and “Part-time parliament” (1990)
Gray’s “Notes on database operating systems” (1978)
Lynch’s “Distributed algorithms” (1996, 906 pages)
Implementing the theory is hard, but possible
Google’s “Paxos made live” (2007)
Introduces fundamental tradeoff among “CAP” qualities
Consistency, Availability, Partition tolerance (see Brewer)
“When Partition occurs, tradeoff Availability against Consistency, Else tradeoff Latency against Consistency” (PACELC, see Abadi)

NoSQL – Horizontally-Scalable Database Technology

NoSQL – Originally implied “No SQL”, but stance has moderated over time
Designed to scale horizontally and provide high performance for a particular type of problem
Most originated to solve a particular system problem/use case
Later were generalized (somewhat) and many are available as open-source packages

NoSQL Databases – Common Characteristics

No relational schema
De-normalized data models (application manages redundancy)
No join queries (use an aggregation framework like MapReduce instead)
Dynamic schema – no enforcement by database, all enforcement in application
Simple physical data models – less mismatch between programming language constructs and database constructs
Built-in, automatic distribution
- sharding (partitioning across nodes)
- replication (copying across nodes)
Application controls durability and consistency at the operation level
Coordinate write and read settings to achieve desired quality

Four Types of NoSQL

Type 1: Document Store

Key-value store where “value” is a document (XML, JSON, BSON, …)
Examples – MongoDB, Couchbase

“id”: “1” “Name”: “John” “Employer”: “SEI” ; “id”: “2” “Name”: “Ian” “Employer”: “SEI” “Previous”: “PNNL”

Type 2: Key-Value Store

Simple hash table – associates unique key with a piece of un-typed data
Examples – Dynamo, riak, Redis, Oracle NoSQL

“key”: “1” value { “Name”: “John” “Employer”: “SEI”}; “key”: “2” value {“Name”: “Ian” “Employer”: “SEI” “Previous”: “PNNL”}

Type 3: Column Store

Rows store multiple key-value pairs, indexed and sorted by key
Examples – Cassandra, Accumulo, HBase, Amazon SimpleDB

“row”: “1” , “Name” “Employer”
“John” “SEI”

“row”: “2” “Name” “Employer” “Previous”
“Ian” “SEI” “PNNL”

Type 4: Graph Store

Records represent nodes and edges/links – properties attached to each store the data
Example – Neo4J

XoAAAAASUVORK5CYII=

++tQ1pkbRBxAAAAAElFTkSuQmCC

wvCF3vGUwxahgAAAABJRU5ErkJggg==

AeUMSJEXeFV8AAAAAElFTkSuQmCC

MongoDB

MongoDB is an open source project mainly driven by the company 10gen Inc
Features:
- Document Data Model and a relatively rich query language,
- Data partitioning by sharding,
- Leader – worker mode of replication,
- No data versioning
Prominent users:
- SourceForge.net,
- Foursquare,
- New York Times

Data Model

A MongoDB hosts a number of databases
A database is a physical data container of a set of collections
A collection contains a set of documents
A document contains a set of key-value (aka field_name/field_value) pairs {field1: value1,…, fieldn: valuen}

MongoDB Documents

The value can be:
- Any of the BSON (Binary JSON) data types,
- Other documents,
- Arrays, and
- Arrays of documents
Field names are strings
The field {name}_id is reserved for the document’s primary key that is indexed
- Unique in the collection
- Immutable, and
- Any type except an array

MongoDB Documents

The basic unit of data
Analogous to a JSON object ;
Analogous to structures in programming languages that associate keys with values
dictionaries, hashes, maps, and associative arrays

A Document Example

{
class_id: ObjectId (“509980df3”),
course: {code: “SWEN432”,

title: “Advanced DB”},

year: 2014,
students: [“Matt”, “Jack”,…, “Lingshu”],
no_of_st: 11
}

The document contains values of varying types
The primary key is class_id and it is of the ObjectId type,
The field course is a subdocument, and
students is an array of strings

MongoDB Documents

Documents have a dynamic schema:

Documents in the same collection do not need to have the same structure
Common fields belonging to different documents may have different data types
However, it is a common practice that documents of a collection have a similar structure

MongoDB Collections

Collections are analogous to a table in relational databases.

CRUD Operations

The acronym CRUD stands for:

Create (insert),
Read,
Update, and
Delete
MongoDB CRUD operations target a specific collection
In addition to the basic queries, MongoDb also supports aggregation

Write Operations

There are three classes of write operations in MongoDB:
- Insert that adds a new document to a collection,
- Update that modifies an existing document, and
- Remove that deletes an existing document from a collection.
- The update and remove operations allow specifying criteria or conditions that identify documents (to be modified, or removed)

Insert Operation

The following operation inserts a new document into the collection courses:
db.courses.insert (
{code: “SWEN432”,
title: “Advanced Database Design and
Implementation”,
trimester: 1,
year: 2014}
)

The inserted document is going to have five fields: code, title, trimester, year, and _id
MongoDB adds the field _id automatically and populate it with a unique ObjectId (unless specified by the application)

Update and Delete Operations

The following operation updates a document that satisfies the update criteria {code: “SWEN432”}:
db.courses.update (
{code: “SWEN432”},
{set: {year: 2015}}
)

The following operation deletes multiple documents that satisfy the remove criteria

{year: {$lt:
2014}}:
db.courses.remove (
{year: {$lt: 2014}}
)

Read Operation

MongoDB provides a db.collection.find() method
The method accepts:
- Query criteria (conditions),
- Projection list,
- Modifiers (sort, limit, skip), and
- Returns a cursor to the matching documents

Index

Indexes are special data structures that store a small portion of the collection’s data set in an easy to traverse form. Ascending: -1; Descending: 1

AAAAAElFTkSuQmCC

uVxM4eph998AAAAASUVORK5CYII=

bQTGwAAAAAElFTkSuQmCC

Projections

Queries return all fields in all matching documents by default
Projections are defined in the second argument of find() (the first is the query selector)
Projections may either specify:
- A list of fields to return (designated by {<field_name>: 1}), or
- A list of fields to exclude (designated by {<field_name>: 0}) in the result,
Only the exclusion of _id field can be mixed with fields to return
Examples:
db.courses.find({no_of_st: { $gt: 8 }}

{title: 1, _id: 0})

db.courses.find({no_of_st: { $gt: 8 }} {title: 0})

Data Modeling

The key challenge in data modeling is balancing:
- The needs of the application,
- The performance characteristics of the database engine, and
- The data retrieval patterns
The key decision in structuring a document is how to represent relationships between data:
- Embedding or
- Referencing

Embedded Relationships

By embedding, related data is stored using sub- documents within a document
This way related data is going to be stored in the same place
Embedded relationship schemes are also said to be denormalized
Recommended for use in case of:
- “Contains” relationship between entities,
- One-to-one, or one-to-many relationships between entities

wMkGfU3PekFPQAAAABJRU5ErkJggg==

HTTz8hcB20aVuO9u3aIe5FUbl8GWbOnoOViSQ6dO6Kbbv3RDxSgPqVNZg7ew5W1tagvEN7bNu9O7TWmDF9Bkw6jZ7bbgelHCxdvgxLK1agJlEHLxpB165dEfeobFdVVaOiaiVKW7VCeXk5WrVqhVgshv8HYe4OINkBbFYAAAAASUVORK5CYII=

yHH5tdmnbflkiTGGkuhUGB4eBhjDVEYEQQB1VqVMAxbqrk2T+8TUnjOlpfWNJl3GlqbR0SGIVoIVC3BZZY0ECShr0UaZYIwzrA2JS6FJMIRZlB0PtozNgk2UKTC6xiXZYROQNZjjM2wkUaUIjIBLstQKYRGgrG4SGOEIZWOhlbEJqbNKQr4KD1jQUchUkqMMRhj+P8BruJvEZrKVUAAAAAASUVORK5CYII=

Relationships by Referencing

Implementing relationships by referencing is also called the normalized data model:
- Subdocuments are implemented like individual documents, and
- Relationships are implemented by linking subdocuments from the “parent” document using primary keys of the subdocuments

Relationships by Referencing

Recommended to use:
- When gain in read performance does not compensate losses because of data duplication caused by embedding,
- To represent a many-to-many relationship, and
- To build large hierarchical data sets
Disadvantage:
- A query asking for data from the parent and children documents has to be implemented using a sequence of queries

Sharding

In MongoDB, data sharding is a synonym for horizontal data partitioning
Sharding is performed on the level of a collection
A collection is divided into sub collections called shards
Each shard is an independent database stored on a cluster of servers that is also called shard
All shards make up a logical database
Example:
- Assume a collection contains 1 TB of data
- It can be divided into four shards having 256 GB each

Sharded Cluster

Sharded clusters implement sharding.
Shards are MongoDB instances that hold a subset of a collection’s data. Each shard is either a single mongod instance or a replica set.
Config Servers are mongod instances that hold metadata about the cluster. The metadata maps chunks to shards.
Router is a mongod instance that routes reads and writes from apps to the shards. Applications do not access the shard directly.

Shard Key

Sharding partitions a collection’s data by the shard key
A shard key is either an indexed field or an indexed composite field that exists in every document in the collection
MongoDB divides the shard key value range into chunks and distributes the chunks evenly across the shards
Dividing shard key values into chunks is done either by:
- Range based partitioning or
- Hash based partitioning

Range Based Sharding

Range based sharding:
- The space of shard key values is divided into non overlapping ranges called chuks
- Each chunk is stored on a shard cluster
- Documents having close shard key values are likely to be stored on the same shard
- Favors range queries by shard key
- May result in an uneven distribution of data and the work load

n0Gh+b7vbADIGkVhkv5oAa0dDrBihEVhEsp5bEN9GMytEDFprUAAAAASUVORK5CYII=

LlEwURUKLgAAAABJRU5ErkJggg==

B3167AvuMR+EAAAAAElFTkSuQmCC

Hash Based Sharding

Hash based sharding:
- The space of hashes of shard key values is divided into non overlapping chuks
- Documents having close shard key values are likely to be stored on different shards
- Favors an even distribution of data and the work load

Balanced Data Distribution

The addition of new data or servers can result in data distribution imbalances
A shard can contain significantly more chunks than another, or
Size of a chunk may become significantly greater then another
MongoDB uses two background processes to keep a cluster balanced:
- Splitting and
- Balancer

Splitting

When a chunk grows beyond a specified size, MongoDB:

Splits the chunk into two halves,
Does not migrate any data (the shard contains more chunks now)

Balancing

When the distribution of chunks in a sharded collection in a cluster becomes uneven, a balancer that runs in all routers:

Migrates whole chunks from the shard with the largest number of chunks to the shard with the least number of chunks,
The origin shard keeps the chunk and answers queries until the chunk has been successfully transferred to the destination shard
Only then, MongoDB deletes the chunk from the origin shard

x+PwcffDD169dPuDMSEdldYg2o4g0pzdM8zdO87Z1Xnv8HcMsqEKfPJuMAAAAASUVORK5CYII=

Ukr6+voyMDDwUG8YAACsLfcdfAAAAKwtD34yKAAAAF1N8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFRK8AEAAFTqP++v5GrQ2chCAAAAAElFTkSuQmCC

D3WerY9qpNftAAAAAElFTkSuQmCC

Adding and Removing Shards

When a new shard joins a cluster it creates imbalance, since the new shard has no chunks
A balancer starts migrating data immediately
When a shard needs to be removed from a cluster
- A balancer migrates all of its chunks to the other shards,
- Updates the meta data on config servers appropriately, and
- Only then, the shard can be removed

Replication

In MongoDB, replication is implemented on the level of a shard
A replica set contains replicas of a shard
A replica set is a cluster of MongoDB servers
There is a primary daemon process called mongod running on each server,
mongod handles data requests, manages data format, and performs background management operations
Comment on the term daemon:
- In multitasking computer programming systems, a daemon] is a program that runs as a background process, rather than being under the direct control of an interactive user
- Traditionally daemon names end with the letter d:

Leader and Worker

One mongod – the primary (or leader):
- Accepts all write request from clients,
- Updates its data set, and
- Records update operations in its operation log
All other servers of the replica set – secondaries:
- Receive operations from the primary’s operation logs Apply operations on their data sets
If clients read from the primary, the replica set provides a strict consistency
If clients are allowed to read from secondaries, the replica set provides an eventual consistency

Distributed Queries

Applications issue operations to one of mongos instances of a sharded cluster
Read operations are most efficient when a query includes the collections shard key
Otherwise the mongos must direct the query to all shards in the cluster (scatter gather query) and that might be inefficient
By default, MongoDB always reads data from a replica set’s primary
That behavior can be modified by changing the preference mode:
- To balance the work load,
- To allow reads during failover, but
- Eventual consistency can be guaranteed, only

Write Operations on Sharded Clusters

For sharded collections in a sharded cluster, the mongos directs write operations from applications to shards that are responsible for the portion of the data set using the sharding key value

The mongos gets needed metadata information from the config database residing on config servers

Write Operations on Replica Sets

In replica sets, all write operations go to the set’s primary
The primary applies the write operations and then records the operations on its operation log (oplog)
Oplog is a reproducible sequence of operations to the data set
Secondary members of the set continuously replicate the oplog by applying operations to themselves in an asynchronous process

Failover

Replica set members send heartbeats (pings) to each other every two seconds
If a heartbeat does not return within 10 seconds, the other members mark the not responding node inaccessible
If a secondary becomes inaccessible, the replica set can continue to function without it
If the primary becomes inaccessible, the remaining members of the replica set have to elect a new primary by voting
The replica set can not accept any write request if there is no primary
The first secondary that gets majority of votes becomes a new primary and the replica set resumes the normal mode of operation.

Leader Election

Problem Description In a cloud-based system that implements horizontal scaling, multiple instances of the same task could be running simultaneously with each instance servicing a different user. If these instances write to a shared resource, it may be necessary to coordinate their actions to prevent each instance from blindly overwriting the changes made by the others If the tasks are performing individual elements of a complex calculation in parallel, the results will need to be aggregated when they all complete.

Leader Election

Solution Selecting the task instance with the lowest-ranked instance or process ID. Racing to obtain a shared distributed mutex. The first task instance that acquires the mutex is the leader. However, the system must ensure that, if the leader terminates or becomes disconnected from the rest of the system, the mutex is released to allow another task instance to become the leader. Implementing one of the common leader election algorithms such as the Bully Algorithm or the Ring Algorithm. These algorithms assume that each candidate participating in the election has a unique ID, and that they can communicate with the other candidates in a reliable manner.

Example Token Ring Election Algorithm

1. We start with 6 processes, connected in a logical ring. Process 6 is the leader, as it has the highest number.

2. Process 6 fails.

3. Process 3 notices that Process 6 does not respond. So it starts an election, sending a message containing its id to the next node in the ring.

4. Process 5 passes the message on, adding its own id to the message.

5. Process 0 passes the message on, adding its own id to the message.

6. Process 1 passes the message on, adding its own id to the message.

7. Process 4 passes the message on, adding its own id to the message.

8. When Process 3 receives the message back, it knows the message has gone around the ring, as its own id is in the list. Picking the highest id in the list, it starts the coordinator message “5 is the leader” around the ring.

9. Process 5 passes on the coordinator message.

10. Process 0 passes on the coordinator message.

11. Process 1 passes on the coordinator message.

12. Process 4 passes on the coordinator message.

13. Process 3 receives the coordinator message, and stops it.

Summary

MongoDB is a popular free source CDBMS with a number of prominent users
Data model: document oriented
A database is a container for a number of document collections
Documents of a collection may have but don’t have to have the same structure
A document is a set of key-value pairs represented as a JSON structure
The main issue with data modeling is how to represent relationships between entities
- Embedded relationships
- Relationships by references
Data partitioning: sharding (horizontal partitioning) by
a shard key, contained in all documents of a collection
Data replication: performed on the level of a shard on the primary leader-worker approach
Leader is an instance of MongoDB that performs all writes and propagates them to workers
If all reads also go to leader – strict consistency, otherwise – eventual consistency
Leader is a single point of failure
If leader fails, a worker is elected as a new leader
Query: query selectors, projections, sorting, limit,…
aggregation operations is supported

NoSQL and MongoDB: Scalable Database Solutions

Week 7: NoSQL

What is Big Data?

Scale – Dimensions

Scalability

Characterizing Scalability

“Vertical” Scaling

“Horizontal” Scaling

Sharding

Horizontal Scaling Distributes Data

NoSQL – Horizontally-Scalable Database Technology

NoSQL Databases – Common Characteristics

Four Types of NoSQL

Type 1: Document Store

Type 2: Key-Value Store

Type 3: Column Store

Type 4: Graph Store

MongoDB

Data Model

MongoDB Documents

MongoDB Documents

A Document Example

MongoDB Documents

MongoDB Collections

CRUD Operations

Write Operations

Insert Operation

Update and Delete Operations

Read Operation

Index

Projections

Data Modeling

Embedded Relationships

Relationships by Referencing

Relationships by Referencing

Sharding

Sharded Cluster

Shard Key

Range Based Sharding

Hash Based Sharding

Balanced Data Distribution

Splitting

Balancing

Adding and Removing Shards

Replication

Leader and Worker

Distributed Queries

Write Operations on Sharded Clusters

Write Operations on Replica Sets

Failover

Leader Election

Leader Election

Example Token Ring Election Algorithm

Summary

Recent Notes

Subjects

Publicidad