Information Extraction
What it covers
- Information Extraction (IE) Overview:
- IE focuses on extracting structured information from unstructured or partially structured text.
- It is a crucial step towards higher-level, semantic representation of knowledge, moving beyond classical Information Retrieval (IR).
- Many NLP and IE tasks involve `Sequence Labeling`, which assigns a label describing a specific property to each element of a text sequence.
- Sequence Labeling:
- This involves identifying properties such as Part-of-Speech (PoS) tags and syntactic roles, determining whether words identify specific information types (e.g., person, location, brand), inferring other properties (e.g., units of measure), linking related pieces of text, and linking text to elements of a knowledge base or ontology.
- Documents are processed as sequences of tokens, where the order and role of each token are vital for the analysis outcome.
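To make the token-sequence view concrete, here is a minimal sketch that decodes per-token labels into entity spans. It assumes the common BIO tagging convention (`B-` begins an entity, `I-` continues it, `O` is outside any entity), which these notes do not name explicitly; the sentence, the tags, and the function name `bio_to_spans` are made up for illustration.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags to (start, end_exclusive, label) token spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # B- always opens a span; a stray I- with no open span is
        # treated leniently as an opening tag too (a design choice).
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

# Illustrative sentence and labels (not taken from a real dataset).
tokens = ["Andrea", "Esuli", "works", "at", "ISTI-CNR", "in", "Pisa"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]

for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Because the label of each token depends on its neighbours (an `I-PER` only makes sense after a `B-PER` or another `I-PER`), the order of tokens matters for the result, which is the point made above.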
- Key IE Tasks:
- Named Entity Recognition (NER): Identifies text segments that refer to elements belonging to predefined categories (e.g., Persons, Organizations, Locations, Temporal expressions, Quantities). NER typically involves entity spotting, classification, and identification (linking).
- Relation Extraction: Focuses on identifying relationships between recognized entities, for example inferring a "Role" relation with value "researcher" between "Andrea Esuli" and "ISTI-CNR".
- NER Approaches:
- Rule-based NER: Leverages lexicons (dictionaries, gazetteers, ontologies) and hand-crafted rules or patterns to match text context and recognize named entities. ANNIE is a prominent example of such a system.
- Machine Learning (ML) based NER:
- This approach translates the extraction problem into a word classification task.
- Each word is represented by features capturing its morphological, syntactic, and semantic properties, along with its context.
- Binary word classifiers are typically learned for each type of recognized entity.
- The output can be a single-label or multi-label classification, depending on whether annotations can overlap.
- Methods in use include traditional ML algorithms, neural networks (recurrent and attention-based models), and graphical models such as Conditional Random Fields (CRFs).
- CRFs (Conditional Random Fields): These are probabilistic graphical models that explicitly model dependencies among problem variables, including dependencies among labels. Rather than labeling each word independently, they determine the optimal labeling of a piece of text by maximizing the joint probability of the entire label sequence given the text.
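The idea of choosing the best labeling for the whole sequence, rather than word by word, can be sketched with Viterbi decoding over a linear chain. This is only an illustration under stated assumptions: the emission and transition scores below are hand-set, hypothetical numbers, whereas a trained CRF would learn such weights from data, and the function name `viterbi` is our own. Since the normalization constant is the same for every candidate labeling, maximizing the total score also maximizes the sequence probability.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the label sequence with the highest total score.

    emissions[i][y]   : score of label y at position i
    transitions[p][y] : score of moving from label p to label y
    """
    labels = list(emissions[0])
    # best[i][y] = best score of any path ending in label y at position i
    best = [dict(emissions[0])]
    back = []
    for scores in emissions[1:]:
        prev = best[-1]
        cur, ptr = {}, {}
        for y in labels:
            p_best = max(labels, key=lambda p: prev[p] + transitions[p][y])
            cur[y] = prev[p_best] + transitions[p_best][y] + scores[y]
            ptr[y] = p_best
        best.append(cur)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# Hypothetical scores for the two tokens "Andrea", "Esuli".
emissions = [
    {"O": 1.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 1.5, "I-PER": 1.0},
]
transitions = {
    "O":     {"O": 0.0, "B-PER": 0.0,  "I-PER": -5.0},  # O -> I-PER is penalized
    "B-PER": {"O": 0.0, "B-PER": -2.0, "I-PER": 1.0},
    "I-PER": {"O": 0.0, "B-PER": 0.0,  "I-PER": 0.5},
}

greedy = [max(e, key=e.get) for e in emissions]  # each token labeled on its own
joint = viterbi(emissions, transitions)          # labels chosen jointly
print("independent:", greedy)
print("joint:      ", joint)
```

With these numbers, labeling each token on its own yields two separate `B-PER` tags, while the joint decoding prefers `B-PER` followed by `I-PER`, because the dependency between adjacent labels is part of the score.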
- Evaluation of IE:
- Annotation accuracy is measured by matching predicted annotations against the true (gold standard) annotations.
- Strict Match: Requires an exact match between predicted and true annotations.
- Lenient Match: Allows for partial matches, such as starting at the same word or just overlapping.
- Word-level Evaluation: Compares annotations word by word. It is more graded than lenient match: it agrees with strict match when annotations are perfect, while still giving partial credit to partially correct annotations.
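The three matching criteria can be compared on a toy case. The sketch below makes several assumptions that are not in the notes: annotations are represented as `(start, end_exclusive, label)` token spans, lenient match is implemented in its "any overlap with the same label" variant, and F1 is used as the summary score. All names and data are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)

def strict_matches(gold: List[Span], pred: List[Span]) -> int:
    """Count predicted spans identical to a gold span."""
    return len(set(gold) & set(pred))

def lenient_matches(gold: List[Span], pred: List[Span]) -> int:
    """Count predicted spans that overlap a gold span with the same label."""
    return sum(
        any(p[2] == g[2] and p[0] < g[1] and g[0] < p[1] for g in gold)
        for p in pred
    )

def word_pairs(spans: List[Span]) -> Set[Tuple[int, str]]:
    """Expand spans into (token index, label) pairs."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}

def f1(n_correct: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall (0.0 when nothing is correct)."""
    if n_correct == 0:
        return 0.0
    precision, recall = n_correct / n_pred, n_correct / n_gold
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (4, 5, "ORG")]
pred = [(0, 1, "PER"), (4, 5, "ORG")]  # first span misses its second token

strict = f1(strict_matches(gold, pred), len(pred), len(gold))
lenient = f1(lenient_matches(gold, pred), len(pred), len(gold))
gold_words, pred_words = word_pairs(gold), word_pairs(pred)
word = f1(len(gold_words & pred_words), len(pred_words), len(gold_words))

print(f"strict={strict:.2f} lenient={lenient:.2f} word={word:.2f}")
# prints: strict=0.50 lenient=1.00 word=0.80
```

The truncated `PER` span counts as a complete miss under strict match and as a complete hit under lenient match, while the word-level score falls in between. Note that this simple lenient count would credit several predictions overlapping the same gold span; a stricter variant would enforce one-to-one matching.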
- Wikification (Entity Linking):
- This task links relevant parts of a text to corresponding Wikipedia entities.
- It involves matching text segments to various "surface forms" associated with entities.
- The Wikipedia link graph is utilized to resolve ambiguities and filter out spurious matches when multiple entities are assignable. Examples include Dexter and TAGME.
- Opinion Extraction:
- Aims to establish relations between identified entities and their associated subjective expressions (evaluations, sentiments).
- Requires domain-dependent entity recognition (potentially including attributes) and subjectivity recognition (which can then be followed by polarity detection).
- Parse trees can be valuable in connecting entities with their respective evaluations.
- Tools and Platforms:
- GATE (General Architecture for Text Engineering): A comprehensive text processing tool with support for human annotation and entity extraction.
- MPQA Opinion Corpus: A dataset comprising news articles with manual annotations for opinions and various private states (e.g., beliefs, emotions, sentiments, speculations).
- INCEpTION: An annotation platform that facilitates defining custom annotation schemas, multi-annotator collaboration (with agreement measurement), and interactive training of machine learning models for assisted automatic annotation (human-in-the-loop).
- spaCy: An advanced NLP library that allows for updating existing models with new training data and training new models for user-defined entities. It supports custom data structures for training and conversion from other formats like BILUO.
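For reference, BILUO is a token tagging scheme: `B` marks the beginning of a multi-token entity, `I` a token inside it, `L` its last token, `U` a single-token entity (unit), and `O` a token outside any entity. The sketch below encodes token spans as BILUO tags in plain Python so that it runs without any dependency. It does not use spaCy; spaCy ships its own conversion helpers in `spacy.training`, and the function name `spans_to_biluo` and the example data here are illustrative assumptions.

```python
from typing import List, Tuple

def spans_to_biluo(n_tokens: int,
                   spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode non-overlapping (start, end_exclusive, label) spans as BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"        # single-token entity
        else:
            tags[start] = f"B-{label}"        # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # inner tokens, if any
            tags[end - 1] = f"L-{label}"      # last token
    return tags

# Illustrative sentence and entity spans.
tokens = ["Andrea", "Esuli", "works", "at", "ISTI-CNR"]
spans = [(0, 2, "PER"), (4, 5, "ORG")]

biluo = spans_to_biluo(len(tokens), spans)
for token, tag in zip(tokens, biluo):
    print(f"{token}\t{tag}")
```

Compared with simpler schemes, the explicit `L` and `U` tags make entity boundaries visible on every token, at the cost of a larger tag set.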