Riassunto Prima Parte
Di cosa parla
- Data Warehouse Design Approaches:
- Top-Down (Inmon): Views a Data Warehouse as a subject-oriented, time-variant, non-volatile, and integrated repository for enterprise data, validated and stored in a normalized database. Stores "atomic" information. Advantages include easy data mart creation, but it can be inflexible to changing departmental needs and has high implementation costs.
- Bottom-Up (Kimball): Describes a Data Warehouse as a copy of transaction data specifically architected for query and analysis, often using a star schema. Data marts are created first for particular business processes. This approach is business-driven. Advantages include quick document generation and extendability to new business units.
- Conceptual & Logical Design:
- Facts: Descriptions of relevant events for the company.
- Measures: Aggregatable streams (e.g., sum, min, max), levels (not additive over time), and units.
- Aggregate Operators: Categorized as distributive (sum, min, max), non-distributive (average), and olistic (median).
- Factless Fact Schema: Used for events not characterized by measures, typically to count occurrences.
- Time: Can be a snapshot, a direct relationship to a time dimension, or mapped simultaneously.
- Data Volume: Estimation is crucial, depending on cardinality and sparsity.
- Star vs. Snowflake Schema: Star schema requires more memory but offers easier joins; Snowflake is better for materialized views.
- Multiple Edges: Can be resolved using bridge tables or by pushing down a boolean list into the fact table.
- Degenerate Dimensions: Dimensions with a single attribute, easily implementable within the fact table.
- Junk Dimensions: Can contain several degenerate dimensions, feasible for small cardinality.
- Data Warehouse: Data Analysis (OLAP):
- Online Analytical Processing (OLAP) is an approach to answer multi-dimensional analytical (MDA) queries swiftly.
- Controlled Query Environment: Encompasses complex queries and ad-hoc analysis for predefined reports, requiring ad-hoc code development.
- Ad Hoc Environment: Arbitrary OLAP queries can be designed on demand by users. An OLAP session allows successive refinements of the same query, useful when predefined queries are insufficient.
- OLAP Operators: Roll up, drill down, slice and dice, pivot, and sort. SQL extensions allow new aggregate functions based on computation windows and rank calculations.
- ETL Process (Extraction, Transformation, Loading):
- Data preprocessing before being loaded into the Data Warehouse.
- Extraction: Depends on how operational data is collected (historical, partly historical, or transient). Incremental extraction captures data modifications (e.g., via ad hoc applications, triggers, or timestamps).
- Data Cleaning: Addresses issues like duplicates, missing data, and unexpected values using data dictionaries, approximate fusion, and prevention techniques.
- Transformation: Converts operational data into warehouse format (formatting).
- Physical Design:
- Dependent on the workload; involves selecting suitable data structures to support the most frequent queries.
- Structures: Bitmap indexes, join indexes, bitmapped join indexes, B+trees, and materialized views.
- Indexing:
- B+tree: Provides direct access based on key field values, efficient for range queries, but susceptible to updates and deletes.
- Bitmap Index: Stores relationships as a boolean matrix, very efficient for boolean predicates but inappropriate for continuous attributes.
- Join Index: Precomputes joins between two or more tables (star index for multiple tables), efficient for column-based joins but can require significant storage space.
- Hash Structure: Guarantees direct and efficient access to data based on a key field via a hash function, very efficient for equality predicates, but inefficient for range queries.