Data Storage Technologies: Selection and Design

Chapter 1: Selecting Appropriate Storage Technologies

Basics of a Data Pipeline

A data pipeline typically follows these stages:

  • Ingest
  • Store
  • Process & Analyze
  • Explore & Visualize

Ingestion Modes

Data can be ingested in several ways:

  • Application Data
  • Streaming
  • Batch

Storage Considerations

  • Data Access Patterns: Match the storage service to how the data will be read and written. For example, transactional (OLTP) workloads fit Cloud SQL or Cloud Datastore, while machine learning models and training data are typically served from Cloud Storage.
  • Access Controls: For instance, BigQuery supports dataset-level access controls. A workaround for table- or view-level control is to create authorized views in one dataset that reference tables in another dataset with more restrictive access (see the sketch after this list).
  • Storage Time (Duration): Long-lived, infrequently accessed data can go to Cloud Storage Nearline or Coldline, or to BigQuery for longer-term analytical storage. Frequently accessed data can reside in OLTP or OLAP systems.
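
A minimal sketch of the authorized-view workaround described above, using the google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical; the pattern is to create the view in a broadly readable dataset and then add that view to the access entries of the restricted dataset.

  from google.cloud import bigquery

  client = bigquery.Client(project="my-project")  # hypothetical project

  # Create a view in a broadly readable dataset that selects from a restricted one.
  view = bigquery.Table("my-project.shared_views.order_totals")
  view.view_query = """
      SELECT customer_id, SUM(amount) AS total_spend
      FROM `my-project.restricted.orders`
      GROUP BY customer_id
  """
  view = client.create_table(view)

  # Authorize the view so queries against it can read the restricted dataset.
  restricted = client.get_dataset("my-project.restricted")
  entries = list(restricted.access_entries)
  entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
  restricted.access_entries = entries
  client.update_dataset(restricted, ["access_entries"])

Readers of the shared_views dataset can then query the view without being granted access to the underlying restricted tables.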

Process & Analyze

  • Data Transformations: Normalization, standardization, and cleansing can be performed using Cloud Dataflow (see the pipeline sketch after this list).
  • Data Analysis: Regression, mean, mode, standard deviation, and histograms can be calculated using Cloud Dataflow, BigQuery, or ML Services.
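
As a sketch of the transformation step above, the following Apache Beam pipeline (the SDK that Cloud Dataflow runs) cleanses newline-delimited JSON records. The bucket paths and field names are hypothetical; as written the pipeline runs locally on the DirectRunner, and the same code can be submitted to Cloud Dataflow by supplying DataflowRunner pipeline options.

  import json

  import apache_beam as beam

  def cleanse(line):
      """Parse one JSON record and normalize the email field."""
      record = json.loads(line)
      record["email"] = record.get("email", "").strip().lower()
      return record

  with beam.Pipeline() as pipeline:
      (
          pipeline
          | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events.json")
          | "Cleanse" >> beam.Map(cleanse)
          | "DropIncomplete" >> beam.Filter(lambda r: r.get("customer_id"))
          | "Serialize" >> beam.Map(json.dumps)
          | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/events")
      )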

Explore & Visualize

  • Cloud Datalab (Jupyter Service)
  • Looker

Key Characteristics of Big Data

Big Data is often characterized by the “3 V’s”:

  • Volume
  • Variety
  • Velocity

Data Considerations

  • Data Access Patterns
  • Security Requirements

Types of Data

  • Structured: Suitable for SQL databases (transactional or analytical). Examples include Cloud SQL, Spanner, and BigQuery.
  • Semi-Structured: Suitable for NoSQL or document databases (fully indexed or row key access). Examples include Cloud Datastore, Firestore, and Bigtable.
  • Unstructured: Best suited for blob storage, such as Google Cloud Storage (GCS).
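
For the unstructured case, a minimal sketch of writing a blob to Cloud Storage with the google-cloud-storage Python client; the project, bucket, and object names are hypothetical.

  from google.cloud import storage

  client = storage.Client(project="my-project")  # hypothetical project
  bucket = client.bucket("my-raw-media")         # hypothetical bucket
  blob = bucket.blob("images/product-1234.jpg")

  # Cloud Storage treats the upload as an opaque object; no schema is imposed.
  blob.upload_from_filename("product-1234.jpg")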

Database Decision Tree

(Figure: database decision tree.)

Schema Design

  • Relational:
    • OLAP (Denormalized)
    • OLTP (3NF)
  • NoSQL:
    • Key-Value
    • Document
    • Graph
    • Wide Column
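
To make the wide-column model concrete, here is a sketch of writing one row to Bigtable with the google-cloud-bigtable Python client. The project, instance, table, column family, and row-key scheme are all hypothetical; the point is that data is addressed by row key and grouped into column families.

  from google.cloud import bigtable

  client = bigtable.Client(project="my-project")   # hypothetical project
  instance = client.instance("sensors-instance")   # hypothetical instance
  table = instance.table("sensor-readings")        # hypothetical table

  # The row key encodes device and timestamp so related readings sort together.
  row = table.direct_row(b"thermostat-1#2024-01-01T00:00:00Z")
  row.set_cell("metrics", "temp_c", b"21.5")   # column family "metrics"
  row.set_cell("metrics", "humidity", b"40")
  row.commit()

Designing the row key around the dominant access pattern is the central schema decision for a wide-column store.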

Exam Essentials

Understand the four stages of the data lifecycle: ingest, storage, process and analyze, and explore and visualize. Ingestion is the process of bringing application data, streaming data, and batch data into the cloud. The storage stage focuses on persisting data to an appropriate storage system. Processing and analyzing is about transforming data into a form suitable for analysis. Exploring and visualizing focuses on testing hypotheses and drawing insights from data.

Understand the characteristics of streaming data. Streaming data is a set of data sent in small messages that are transmitted continuously from the data source. Streaming data may be telemetry data, which is generated at regular intervals, or event data, which is generated in response to a particular event. Stream ingestion services need to deal with potentially late and missing data. Streaming data is often ingested using Cloud Pub/Sub.
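
A minimal sketch of streaming ingestion with the google-cloud-pubsub Python client; the project, topic, and message fields are hypothetical.

  import json

  from google.cloud import pubsub_v1

  publisher = pubsub_v1.PublisherClient()
  topic_path = publisher.topic_path("my-project", "sensor-events")  # hypothetical names

  # Each reading is published as a small message; attributes ride alongside the payload.
  payload = json.dumps({"device_id": "thermostat-1", "temp_c": 21.5}).encode("utf-8")
  future = publisher.publish(topic_path, payload, source="telemetry")
  print(future.result())  # blocks until Pub/Sub acknowledges and returns the message ID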

Understand the characteristics of batch data. Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another. Both batch and streaming data can be transformed and processed using Cloud Dataflow.
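
As a sketch of batch ingestion, the following loads an exported CSV file from Cloud Storage into BigQuery with the Python client; the bucket, dataset, and table names are hypothetical.

  from google.cloud import bigquery

  client = bigquery.Client(project="my-project")  # hypothetical project

  job_config = bigquery.LoadJobConfig(
      source_format=bigquery.SourceFormat.CSV,
      skip_leading_rows=1,   # skip the header row
      autodetect=True,       # infer the schema from the file
  )

  # The file is loaded in bulk; the load job itself runs server-side in BigQuery.
  load_job = client.load_table_from_uri(
      "gs://my-bucket/exports/orders.csv",
      "my-project.staging.orders",
      job_config=job_config,
  )
  load_job.result()  # wait for the batch load to complete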

Know the technical factors to consider when choosing a data store. These factors include the volume and velocity of data, the type of structure of the data, access control requirements, and data access patterns.

Know the three levels of structure of data. These levels are structured, semi-structured, and unstructured. Structured data has a fixed schema, such as a relational database table. Semi-structured data has a schema that can vary; the schema is stored with the data. Unstructured data has no fixed schema or structure that determines how it is stored.

Know which Google Cloud storage services are used with the different structure types. Structured data is stored in Cloud SQL and Cloud Spanner if it is used with a transaction processing system; BigQuery is used for analytical applications of structured data. Semi-structured data is stored in Cloud Datastore if data access requires full indexing; otherwise, it can be stored in Bigtable. Unstructured data is stored in Cloud Storage.
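
To illustrate the analytical use of structured data in BigQuery, a minimal query sketch with the Python client; the project, dataset, and column names are hypothetical.

  from google.cloud import bigquery

  client = bigquery.Client(project="my-project")  # hypothetical project

  query = """
      SELECT country,
             AVG(order_total) AS avg_order,
             STDDEV(order_total) AS stddev_order
      FROM `my-project.sales.orders`
      GROUP BY country
  """

  # Run the query and iterate over the result rows.
  for row in client.query(query).result():
      print(row.country, row.avg_order, row.stddev_order)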

Know the difference between relational and NoSQL databases. Relational databases are used for structured data, whereas NoSQL databases are used for semi-structured data. The four types of NoSQL databases are key-value, document, wide-column, and graph databases.