Riassunti VERIFICATO

Riassunto Prima Parte

Politecnico di Torino data science and engineering 2021
16 visualizzazioni
35 download
Nessun voto ancora
Condividi: WhatsApp Telegram
Anteprima pagina 1 — Riassunto Prima Parte

Di cosa parla

  • Data Warehouse Design Approaches:
    • Top-Down (Inmon): Views a Data Warehouse as a subject-oriented, time-variant, non-volatile, and integrated repository for enterprise data, validated and stored in a normalized database. Stores "atomic" information. Advantages include easy data mart creation, but it can be inflexible to changing departmental needs and has high implementation costs.
    • Bottom-Up (Kimball): Describes a Data Warehouse as a copy of transaction data specifically architected for query and analysis, often using a star schema. Data marts are created first for particular business processes. This approach is business-driven. Advantages include quick document generation and extendability to new business units.
  • Conceptual & Logical Design:
    • Facts: Descriptions of relevant events for the company.
    • Measures: Aggregatable streams (e.g., sum, min, max), levels (not additive over time), and units.
    • Aggregate Operators: Categorized as distributive (sum, min, max), non-distributive (average), and olistic (median).
    • Factless Fact Schema: Used for events not characterized by measures, typically to count occurrences.
    • Time: Can be a snapshot, a direct relationship to a time dimension, or mapped simultaneously.
    • Data Volume: Estimation is crucial, depending on cardinality and sparsity.
    • Star vs. Snowflake Schema: Star schema requires more memory but offers easier joins; Snowflake is better for materialized views.
    • Multiple Edges: Can be resolved using bridge tables or by pushing down a boolean list into the fact table.
    • Degenerate Dimensions: Dimensions with a single attribute, easily implementable within the fact table.
    • Junk Dimensions: Can contain several degenerate dimensions, feasible for small cardinality.
  • Data Warehouse: Data Analysis (OLAP):
    • Online Analytical Processing (OLAP) is an approach to answer multi-dimensional analytical (MDA) queries swiftly.
    • Controlled Query Environment: Encompasses complex queries and ad-hoc analysis for predefined reports, requiring ad-hoc code development.
    • Ad Hoc Environment: Arbitrary OLAP queries can be designed on demand by users. An OLAP session allows successive refinements of the same query, useful when predefined queries are insufficient.
    • OLAP Operators: Roll up, drill down, slice and dice, pivot, and sort. SQL extensions allow new aggregate functions based on computation windows and rank calculations.
  • ETL Process (Extraction, Transformation, Loading):
    • Data preprocessing before being loaded into the Data Warehouse.
    • Extraction: Depends on how operational data is collected (historical, partly historical, or transient). Incremental extraction captures data modifications (e.g., via ad hoc applications, triggers, or timestamps).
    • Data Cleaning: Addresses issues like duplicates, missing data, and unexpected values using data dictionaries, approximate fusion, and prevention techniques.
    • Transformation: Converts operational data into warehouse format (formatting).
  • Physical Design:
    • Dependent on the workload; involves selecting suitable data structures to support the most frequent queries.
    • Structures: Bitmap indexes, join indexes, bitmapped join indexes, B+trees, and materialized views.
    • Indexing:
      • B+tree: Provides direct access based on key field values, efficient for range queries, but susceptible to updates and deletes.
      • Bitmap Index: Stores relationships as a boolean matrix, very efficient for boolean predicates but inappropriate for continuous attributes.
      • Join Index: Precomputes joins between two or more tables (star index for multiple tables), efficient for column-based joins but can require significant storage space.
      • Hash Structure: Guarantees direct and efficient access to data based on a key field via a hash function, very efficient for equality predicates, but inefficient for range queries.

Altri appunti di Data management and visualization

Condividi questi appunti

WhatsApp Telegram