Navigating the World of Big Data
Structured, Semi-structured, and Unstructured Data
Structured data, like an Excel spreadsheet, is organized in rows and columns. Unstructured data, such as images, lacks this tabular structure. Semi-structured data, including JSON and XML, falls in between. It doesn’t strictly adhere to the tabular format but uses tags or markers to separate elements and enforce hierarchies. JSON, or JavaScript Object Notation, is a common example, using attribute-value pairs.
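As a minimal sketch of what semi-structured means in practice, the snippet below parses a small JSON document with Python's standard json module. The record and its field names are made up for illustration. Note how the keys label each value and how nesting expresses hierarchy, without every record needing the same columns.

```python
import json

# Hypothetical record: the field names are illustrative only.
raw = """
{
  "name": "Ada",
  "email": "ada@example.com",
  "orders": [
    {"id": 1, "items": ["keyboard", "mouse"]},
    {"id": 2, "items": ["monitor"]}
  ]
}
"""

record = json.loads(raw)  # parse JSON text into nested dicts and lists

# Keys act as the markers that label each value; nesting gives the hierarchy.
item_count = sum(len(order["items"]) for order in record["orders"])
print(record["name"], "ordered", item_count, "items")
```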
REST (Representational State Transfer) is an architectural style for web APIs, typically used over HTTP, and its responses are often formatted as JSON. SOAP (Simple Object Access Protocol) is a messaging protocol that uses XML.
NoSQL Databases
NoSQL databases, short for “Not Only SQL,” provide alternatives to traditional relational databases. Examples include MongoDB, Cassandra, and Amazon DynamoDB. Some can model relational data, but their main strengths are flexible data models and horizontal scalability.
Characteristics of NoSQL Databases
- Schema Flexibility: NoSQL databases don’t require a fixed schema, allowing for easier data model evolution.
- High Performance and Scalability: They are designed to run on distributed clusters, which supports high availability and throughput as data grows.
- Lack of Standardization: Querying methods can vary between NoSQL databases.
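- Eventual Consistency: Many NoSQL databases favor eventual consistency, where replicas may briefly disagree after a write before converging on the same value. This might not be suitable for applications in which every read must reflect the latest write.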
Types of NoSQL Databases
- Key-Value Stores: Data is stored as key-value pairs, offering simplicity and high performance. Use cases include user profiles, session information, and shopping cart contents.
- Document Stores: These databases store data in document formats like JSON or XML, suitable for content management systems, web analytics, and product catalogs.
- Graph Databases: They represent data as nodes (entities) and edges (relationships), ideal for recommendation systems and social networks.
- Wide-Column Stores: These databases store rows whose columns are grouped into column families, and different rows can hold different columns. They are often used in OLAP processes and, for specific workloads, can offer better performance and scalability than relational databases.
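The four data models above can be roughly pictured with plain Python structures. This is only a sketch of the shapes involved, not how any particular database stores data internally. All names and values are hypothetical.

```python
# Key-value store: an opaque value looked up by a single key.
sessions = {"session:42": "user=ada;cart=3"}

# Document store: each record is a self-describing, nested document.
products = [
    {"sku": "kb-1", "name": "Keyboard", "tags": ["usb", "mechanical"]},
    {"sku": "mn-7", "name": "Monitor", "specs": {"inches": 27}},  # fields may differ
]

# Graph: nodes (the keys) plus edges (who follows whom).
follows = {"ada": {"grace"}, "grace": {"linus"}, "linus": set()}

# Wide-column: row key -> column family -> column -> value.
readings = {
    "sensor-1": {
        "meta": {"site": "north"},
        "temps": {"09:00": 18.5, "10:00": 19.1},
    },
}


def friends_of_friends(graph, user):
    """Return accounts followed by the people `user` follows,
    excluding the user and anyone they already follow."""
    direct = graph[user]
    indirect = set()
    for friend in direct:
        indirect |= graph[friend]
    return indirect - direct - {user}


print(friends_of_friends(follows, "ada"))
```

The friends_of_friends helper hints at why graph models suit recommendations: the query is a short walk along edges rather than a chain of table joins.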
The Era of Big Data
The digital age, starting around 2002, brought an explosion of data. Big data is characterized by its volume (how much data there is), velocity (how quickly it arrives), and variety (how many forms it takes).
Database Systems
- Distributed Systems: Data is spread across multiple locations, improving access speed and fault tolerance.
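- Centralized Systems: All data lives in a single location, which simplifies management but can create a bottleneck and a single point of failure.
- Data Sharding: A database is split into partitions (shards) that are distributed across servers, trading better performance for greater management complexity.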
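One common way to decide which shard holds a record is to hash its key. The sketch below assumes four in-memory shards and a hypothetical user-ID key format. Real systems add replication and rebalancing, which are omitted here.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to keep the mapping stable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dict stands in for a separate database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    return shards[shard_for(key)].get(key)


for user_id in ("user:1", "user:2", "user:3", "user:4", "user:5"):
    put(user_id, {"id": user_id})

print([len(s) for s in shards])  # how many records landed on each shard
```

A plain modulo scheme like this moves most keys when the shard count changes, which is part of the management complexity mentioned above.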
Big Data Processing
Big data processing means extracting insights from large datasets. Distributed data processing spreads the work across multiple nodes that run in parallel, offering scalability and fault tolerance. However, it can be complex to set up and manage.
Measures of Scalability
- Size: The system’s capacity to handle growing data volumes.
- Geographical: The ability to expand to new physical locations.
- Administrative: The ability to stay manageable as more organizations, users, and administrators share the system.
Types of Scaling
- Vertical Scaling: Increasing the resources of a single instance (e.g., CPU, RAM).
- Horizontal Scaling: Adding more instances to distribute the workload.
Data Processing at Scale
- Acquisition: Gathering data from various sources.
- Processing and Partitioning: Preparing data for analysis.
- Analysis: Applying machine learning or data mining techniques.
- Service: Delivering insights to analysts and scientists.
MapReduce for Batch and Stream Data
MapReduce is a programming model for processing large datasets in parallel. It consists of two phases:
- Map: The input is split into chunks, and each chunk is transformed into key-value pairs.
- Reduce: The key-value pairs are grouped by key, and each group is aggregated into a result.
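The classic illustration of this model is a word count. The single-process sketch below shows the data flow only. In a real framework the map and reduce calls would run on different machines, and the grouping step in between (usually called the shuffle) would move data across the network. The function names are illustrative and not part of any framework's API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


lines = ["big data big insights", "data at scale"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many workers without changing the result.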
Apache Hadoop and Spark
Hadoop is an open-source framework for distributed storage and processing of large datasets. It reads and writes intermediate results on disk, which makes it well suited to batch processing. Apache Spark, on the other hand, keeps working data in memory (RAM) where possible, which makes it faster and a good fit for stream data such as online activity and sensor readings.
Hadoop Components
- HDFS (Hadoop Distributed File System): For storing large datasets.
- MapReduce: For processing data in parallel.
- YARN (Yet Another Resource Negotiator): For managing cluster resources and job scheduling.
- Hadoop Common/Core: Provides common libraries and utilities.
Spark Components
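- Spark Core: The underlying execution engine that the other components build on, handling task scheduling, memory management, and fault recovery.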
- Spark Streaming: For processing live data streams.
- Spark SQL: For querying data using SQL-like syntax.
- MLlib: A library for machine learning algorithms.
- GraphX: APIs for graph analytics.
Data Warehouses, Data Lakes, and Data Lakehouses
- Data Warehouses: Centralized repositories for structured, processed data, optimized for reporting and analysis.
- Data Lakes: Cost-effective storage for all types of data, ideal for machine learning and data discovery.
- Data Lakehouses: Combine the two approaches, pairing the low-cost, flexible storage of data lakes with the data management features of data warehouses.
Databricks Lakehouse
Databricks provides a platform for building data lakehouses on Delta Lake, an open-source storage layer. Delta Lake adds ACID transactions, schema enforcement, and unified stream and batch processing on top of data lake storage.
Big Data Architectures
Kappa Architecture treats all data as a stream and handles it in a single processing layer, rather than maintaining separate batch and streaming layers. It’s suitable for applications requiring real-time processing, such as IoT systems and monitoring systems.
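A toy sketch of the single-layer idea follows. It assumes an append-only event log, here just a Python list standing in for a durable stream, and one processing function. The same function serves live events and, when results need to be rebuilt, a replay of the log. The sensor names and readings are hypothetical.

```python
log = []  # append-only event log, a stand-in for a durable stream


def process(state, event):
    """The one processing path: track the highest reading per sensor."""
    sensor, value = event
    state[sensor] = max(value, state.get(sensor, value))
    return state


def ingest(state, event):
    """Record the event in the log, then process it as it arrives."""
    log.append(event)
    return process(state, event)


# Live processing: events are handled one at a time as they arrive.
live_state = {}
for event in [("s1", 20.0), ("s2", 18.5), ("s1", 22.5)]:
    ingest(live_state, event)

# Reprocessing: replay the stored log through the same function,
# with no separate batch code path to maintain.
replayed_state = {}
for event in log:
    process(replayed_state, event)

print(live_state, replayed_state)
```

The point of the design is that there is only one piece of processing logic to write, test, and keep consistent.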
Massively Parallel Processing (MPP)
MPP systems use multiple processors to work on different parts of a large dataset simultaneously, enabling faster query processing and analysis.
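The pattern behind an MPP query can be sketched as scatter, compute locally, then combine. The example below simulates it inside one Python process, with threads standing in for the separate processors. It illustrates the data flow only and will not show the speedup of a real MPP system. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Scatter: split rows into n slices, one per worker."""
    return [rows[i::n] for i in range(n)]


def local_aggregate(rows):
    """Each worker computes a partial (sum, count) over its own slice."""
    return sum(rows), len(rows)


def parallel_average(rows, workers=4):
    """Gather: combine the partial results into one final answer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_aggregate, partition(rows, workers)))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
print(parallel_average(values))  # 50.5
```

Each worker returns a sum and a count rather than its own average, because partial averages cannot be combined correctly when the slices differ in size.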