Data Warehousing, Data Mining, and Decision Support Systems: A Comprehensive Guide
Chapter 1: Introduction to Decision Support Systems (DSS)
Inputs, Outputs, and Measures of Success
Decision Support Systems (DSS) utilize resources (inputs) to achieve specific goals (outputs). The measure of success for a DSS is determined by the ratio of outputs to inputs. Managerial roles play a crucial part in this process, encompassing interpersonal, informational, and decisional aspects.
Managerial Roles
- Interpersonal: Figurehead, leader, liaison
- Informational: Monitor, disseminator, spokesperson
- Decisional: Entrepreneur, disturbance handler, resource allocator, negotiator
DSS and Business Intelligence (BI)
DSS are interactive computer systems that aid decision-making by providing data and models. Modern BI focuses on actionable intelligence derived from unstructured problems, while traditional Management Information Systems (MIS) primarily deal with historical reporting. The complexity of business problems often renders trial-and-error approaches impractical. DSS can incorporate qualitative elements into formal decision models, providing both decisions and rationale for employees.
Types of Decision Models
- Normative models: Examine all possible alternatives to identify the optimal solution (prescriptive).
- Descriptive models: Report on past performance (e.g., simulation).
- Predictive models: Forecast future outcomes.
Chapter 2: Decision-Making Principles
Decision-making prioritizes effectiveness, with efficiency often being a secondary benefit. The principle of choice guides the acceptability of solution approaches, while suptimization simplifies modeling by grouping viable options. Satisficing involves setting a performance threshold and selecting an alternative that meets it. Sensitivity analysis assesses the robustness of alternatives to minor changes, while what-if analysis explores the impact of significant changes. Ad hoc decision-making addresses unexpected, non-recurring problems.
Chapter 3: Data Warehousing
Data Warehouse vs. Data Mart
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile repository of data. Data marts, on the other hand, cater to specific departments or business units. They can be dependent (created directly from the data warehouse) or independent (designed for a specific strategic business unit).
Data Warehouse Process
The data warehouse process involves several stages: sourcing data, extraction and transformation, loading, comprehensive database creation, and metadata management. Key considerations include the choice of DBMS, parallel processing capabilities, migration tools, and retrieval/analysis tools.
Data Warehouse Architectures
- Independent data marts: Separate data marts for different departments.
- Bus architecture: A central data warehouse with dependent data marts.
- Hub and spoke: A central data warehouse with independent data marts.
- Centralized: A single, centralized data warehouse.
- Federated: Integrated data warehouses with middleware.
ETL and Data Integration
ETL (extraction, transformation, load) encompasses data integration operations such as data access, data federation (integration of business views), and change capture. Enterprise application integration (EAI) pushes data into the warehouse (e.g., CSV), while enterprise information integration (EII) pulls data (e.g., cloud-based).
OLAP vs. OLTP
OLAP (online analytical processing) utilizes a cube data structure for multidimensional analysis, enabling operations like slicing, dicing, drill-down, roll-up, and pivoting. OLTP (online transaction processing) handles transaction processing from systems like ERP, CRM, and SCM.
Active Data Warehousing and In-Database Processing
Active data warehousing facilitates high-speed data transfer. In-database processing technology, which differs from in-memory storage technology, allows for data processing within the database itself.
Chapter 5: Data Mining
Data Mining Process and Robustness
Data mining is an iterative process that involves discovering patterns in data. Robustness refers to the ability of a data mining model to make accurate predictions despite noise or errors in the data.
Classification and Artificial Neural Networks (ANNs)
Classification techniques create rules to distinguish between predefined classes. ANNs, inspired by the human brain, are a powerful tool for classification and other data mining tasks.
Data Security and Cost
Maintaining data security is crucial, especially for sensitive information. Data storage costs have decreased significantly, making data mining more accessible.
Applications of Data Mining
Data mining has numerous applications, including fraud detection, sales analysis, and customer segmentation.
Open Source vs. Commercial Data Mining Tools
Open-source data mining tools are gaining popularity compared to commercial options.
Data Preprocessing
Data preprocessing addresses issues like incomplete, noisy, and inconsistent data. It involves cleaning (handling missing values, outliers, smoothing, and correcting), integrating, transforming (normalization and aggregation), reducing, and discretizing data.
Handling Noise
Noise refers to random error or variance in data. Techniques for handling noise include clustering, inspection, regression smoothing, and binning.
Data Reduction
Data reduction techniques, such as dimensionality reduction and sampling, aim to decrease data size and improve prediction accuracy. Data types include continuous (real or integer) and categorical (nominal or ordinal).
Concept Hierarchy
Concept hierarchies organize data by replacing low-level concepts with higher-level ones (e.g., street vs. country).
Data Mining Definitions
Data mining is a process involving iterative steps, experimentation, search, and inference. It aims to discover valid, novel, and potentially useful patterns in data. Data types include nominal, ordinal, interval, and ratio.
Pattern Types
Data mining uncovers various patterns, including association, prediction, clustering (segmentation), and sequential (time-series) patterns.
Data Mining vs. Statistics
Data mining utilizes all available data, while statistics typically relies on sample data. Popular data mining methodologies include CRISP-DM and SEMMA.