Riassunti VERIFICATO

Riassunto Prima Parte

Politecnico di Torino data science and engineering 2021

16 visualizzazioni

35 download

Nessun voto ancora

Anteprima pagina 1 — Riassunto Prima Parte

Di cosa parla

Data Warehouse Design Approaches:
- Top-Down (Inmon): Views a Data Warehouse as a subject-oriented, time-variant, non-volatile, and integrated repository for enterprise data, validated and stored in a normalized database. Stores "atomic" information. Advantages include easy data mart creation, but it can be inflexible to changing departmental needs and has high implementation costs.
- Bottom-Up (Kimball): Describes a Data Warehouse as a copy of transaction data specifically architected for query and analysis, often using a star schema. Data marts are created first for particular business processes. This approach is business-driven. Advantages include quick document generation and extendability to new business units.
Conceptual & Logical Design:
- Facts: Descriptions of relevant events for the company.
- Measures: Aggregatable streams (e.g., sum, min, max), levels (not additive over time), and units.
- Aggregate Operators: Categorized as distributive (sum, min, max), non-distributive (average), and olistic (median).
- Factless Fact Schema: Used for events not characterized by measures, typically to count occurrences.
- Time: Can be a snapshot, a direct relationship to a time dimension, or mapped simultaneously.
- Data Volume: Estimation is crucial, depending on cardinality and sparsity.
- Star vs. Snowflake Schema: Star schema requires more memory but offers easier joins; Snowflake is better for materialized views.
- Multiple Edges: Can be resolved using bridge tables or by pushing down a boolean list into the fact table.
- Degenerate Dimensions: Dimensions with a single attribute, easily implementable within the fact table.
- Junk Dimensions: Can contain several degenerate dimensions, feasible for small cardinality.
Data Warehouse: Data Analysis (OLAP):
- Online Analytical Processing (OLAP) is an approach to answer multi-dimensional analytical (MDA) queries swiftly.
- Controlled Query Environment: Encompasses complex queries and ad-hoc analysis for predefined reports, requiring ad-hoc code development.
- Ad Hoc Environment: Arbitrary OLAP queries can be designed on demand by users. An OLAP session allows successive refinements of the same query, useful when predefined queries are insufficient.
- OLAP Operators: Roll up, drill down, slice and dice, pivot, and sort. SQL extensions allow new aggregate functions based on computation windows and rank calculations.
ETL Process (Extraction, Transformation, Loading):
- Data preprocessing before being loaded into the Data Warehouse.
- Extraction: Depends on how operational data is collected (historical, partly historical, or transient). Incremental extraction captures data modifications (e.g., via ad hoc applications, triggers, or timestamps).
- Data Cleaning: Addresses issues like duplicates, missing data, and unexpected values using data dictionaries, approximate fusion, and prevention techniques.
- Transformation: Converts operational data into warehouse format (formatting).
Physical Design:
- Dependent on the workload; involves selecting suitable data structures to support the most frequent queries.
- Structures: Bitmap indexes, join indexes, bitmapped join indexes, B+trees, and materialized views.
- Indexing:
  - B+tree: Provides direct access based on key field values, efficient for range queries, but susceptible to updates and deletes.
  - Bitmap Index: Stores relationships as a boolean matrix, very efficient for boolean predicates but inappropriate for continuous attributes.
  - Join Index: Precomputes joins between two or more tables (star index for multiple tables), efficient for column-based joins but can require significant storage space.
  - Hash Structure: Guarantees direct and efficient access to data based on a key field via a hash function, very efficient for equality predicates, but inefficient for range queries.

Vedi tutto il file Scarica

Altri appunti di Data management and visualization

Riassunto Seconda Parte Riassunto Terza Parte MongoDB for Data Management Report Regression Report Python intro

Vedi tutti gli appunti di data science and engineering

Riassunto Prima Parte

Di cosa parla

Altri appunti di Data management and visualization

Ottieni i primi crediti!

Carica i tuoi file

Unisciti ai gruppi di studio

Invita i tuoi colleghi

Accidenti, ancora non abbiamo il tuo corso di laurea!

Consiglia ai tuoi amici

Riassunto Prima Parte

Di cosa parla

Altri appunti di Data management and visualization

Condividi questi appunti