This tool scrapes RSS feeds, analyzes article content with machine learning, and stores the results in a SQLite database, focusing on news quality, political leaning, logical consistency, and depth of content. It is complemented by an extended graphical user interface (GUI) that not only lets you explore the database but also provides aggregated analyses of selected articles. In this post I explain how the code works and why it is structured the way it is, provide documentation and the complete code (including the GUI), and expand on the possible use cases with relevant links.


What is "ContentScan"?

"ContentScan" is a Python script I wrote to fetch RSS feeds from news sources, extract the article texts, and analyze them with Transformer models from Hugging Face. The results are stored in a SQLite database, scoring criteria such as sentiment, toxicity, readability, and depth of content. The new GUI adds a user-friendly interface with search, a detail view, visualizations, and aggregated evaluations across multiple articles. The beta version is flexible and open to extensions.


Purpose

I wrote "ContentScan" to evaluate news content systematically and make it accessible. In times of fake news, political bias, and clickbait it offers:

  • Quality assessment: estimating journalistic quality.
  • Political classification: detecting ideological leanings.
  • Logical consistency: checking the structure of the argumentation.
  • Depth of content: analyzing substance.
  • Data exploration: a GUI with search and aggregated analyses.

The GUI makes the data usable for people without programming skills and enables deeper insights through statistical evaluations.


Code Logic

The code consists of two parts: the scraping/analysis script and the extended GUI.

Scraping and Analysis Script

  1. Configuration: a CONFIG dictionary with anonymized feeds: ["https://example1.com/rss", "https://example2.org/feed"].
  2. Database initialization (init_database): creates a SQLite database with more than 60 fields.
  3. Article extraction (fetch_article_content): uses BeautifulSoup.
  4. Text analysis (batch_process_texts): uses Transformer models.
  5. Database storage (save_entries_to_db): persists the data.
  6. Main function (process_feeds): coordinates the process with feedparser (see the sketch below this list).
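
To make step 6 concrete, here is a minimal sketch of how feedparser exposes a feed's entries; the URL is a placeholder, and the accessors mirror those used in process_feeds:

python

import feedparser

# Placeholder URL for illustration; use one of the feeds from CONFIG["feeds"].
feed = feedparser.parse("https://example1.com/rss")
print(feed.feed.get("title", "Unbekannt"), "-", len(feed.entries), "entries")

for entry in feed.entries[:3]:
    # The same fields the script reads: title, summary, link, published/pubDate.
    print(entry.get("title", "Kein Titel"))
    print(entry.get("link", "Kein Link"))
    print(entry.get("published", entry.get("pubDate", "Kein Veröffentlichungsdatum")))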

Extended GUI Script

  1. Database connection (connect_db): error handling via messagebox.
  2. Search function (search_articles): search with pagination (50 results per page).
  3. Detail view (get_article_details): retrieves all stored data for one article.
  4. Score interpretation (interpret_scores): translates scores into readable descriptions.
  5. GUI class (NewsAnalyzerApp):
    • Search area: input field, column selection, pagination buttons.
    • Results: a Treeview with dynamic columns (e.g. Categories).
    • Details: a text field and a chart with bar labels and an export option.
    • Evaluations (analyze_selected_articles): computes average scores for the selected articles and shows them in a new window with text and a chart (a small aggregation sketch follows after this list).
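
The aggregation behind analyze_selected_articles averages the scores in Python; the following sketch shows a roughly equivalent query that lets SQLite compute the averages directly (the article IDs are placeholders):

python

import sqlite3

# Placeholder IDs of the selected articles; in the GUI they come from the Treeview selection.
article_ids = [1, 2, 3]
placeholders = ",".join("?" for _ in article_ids)

conn = sqlite3.connect("rss_feeds.db")
cursor = conn.execute(
    f"SELECT AVG(sentiment_score), AVG(toxicity_score), AVG(article_sentiment_score) "
    f"FROM feed_entries WHERE id IN ({placeholders})",
    article_ids
)
print(cursor.fetchone())  # one row containing the average scores
conn.close()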

Documentation

Requirements

  • Python version: 3.8 or higher.
  • Dependencies: install them with pip install transformers feedparser requests beautifulsoup4 matplotlib (sqlite3 and tkinter ship with Python and are not installed via pip; a quick import check follows after this list).
  • Write permissions: for the database and the output folder.
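
A quick sanity check that the environment is complete (this assumes a standard CPython build that includes Tk support):

python

# If any of these imports fails, install the missing dependency first.
import sqlite3, tkinter, transformers, feedparser, requests, bs4, matplotlib
print("All dependencies are available.")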

Installation

  1. Save the scraping script as contentscan.py and the GUI as news_analyzer.py.
  2. Install the dependencies.
  3. Run contentscan.py first, then news_analyzer.py:

bash

python contentscan.py
python news_analyzer.py

Configuration

  • Feeds: adjust CONFIG["feeds"] in contentscan.py (see the example after this list).
  • Models: change the model paths in batch_process_texts.
  • Database: rss_feeds.db by default.
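
For reference, this is the CONFIG block from the top of contentscan.py, with short comments added for orientation; the feed URLs are placeholders and should be replaced with the feeds you want to monitor:

python

CONFIG = {
    "feeds": ["https://example1.com/rss", "https://example2.org/feed"],  # RSS feeds to scrape
    "model_dir": "Modelle",    # local cache directory for the Hugging Face models
    "batch_size": 16,          # batch size for the Transformer pipelines
    "output_dir": "Output",    # raw feed XML files are written here
    "max_length": 512          # token limit per text passed to the models
}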

Output

  • Scraping: logs, rss_feeds.db, and XML files in the output directory.
  • GUI: a window with search, results, details, charts, and evaluations.

Error Handling

  • Logs and messagebox dialogs warn about problems.

The Complete Code

Scraping Script (contentscan.py)

python

import json
import logging
import os
import sqlite3
from datetime import datetime
import feedparser
import requests
from bs4 import BeautifulSoup
from transformers import AutoTokenizer, pipeline

# Logging configuration for ContentScan
logging.basicConfig(level=logging.INFO, format='%(asctime)s - ContentScan - %(message)s')

CONFIG = {
    "feeds": ["https://example1.com/rss", "https://example2.org/feed"],
    "model_dir": "Modelle",
    "batch_size": 16,
    "output_dir": "Output",
    "max_length": 512
}

def load_json(file_path, default_data):
    if not os.path.exists(file_path):
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(default_data, f, ensure_ascii=False, indent=4)
        logging.info(f"Erstelle Standard-{file_path}: {file_path}")
    with open(file_path, 'r', encoding='utf-8') as f:
        return json.load(f)

def init_database(db_name="rss_feeds.db"):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
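    # Note: the table is dropped and recreated on every run, so previous analysis results are discarded.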
    cursor.execute("DROP TABLE IF EXISTS feed_entries")
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS feed_entries (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            portal TEXT, title TEXT, summary TEXT, link TEXT UNIQUE, pub_date TEXT, guid TEXT UNIQUE,
            categories TEXT, media_urls TEXT, thumbnail TEXT, premium TEXT, fetch_date TEXT,
            sentiment TEXT, sentiment_score REAL, keyword_score INTEGER, category_score INTEGER,
            readability_label TEXT, readability_score REAL, emotion TEXT, emotion_score REAL,
            topic TEXT, clickbait_label TEXT, clickbait_score INTEGER, content_depth_label TEXT,
            content_depth_score INTEGER, recency_label TEXT, recency_score REAL, time_of_day TEXT,
            media_label TEXT, media_score INTEGER, premium_label TEXT, premium_score INTEGER,
            link_quality_label TEXT, link_quality_score INTEGER, toxicity_label TEXT, toxicity_score REAL,
            misinfo_label TEXT, misinfo_score INTEGER, summary_relevance_label TEXT, summary_relevance_score REAL,
            article_content TEXT, article_sentiment TEXT, article_sentiment_score REAL,
            article_emotion TEXT, article_emotion_score REAL, article_toxicity_label TEXT, article_toxicity_score REAL,
            article_keyword_score INTEGER, article_category_score INTEGER, article_readability_label TEXT,
            article_readability_score REAL, article_topic TEXT, article_clickbait_label TEXT,
            article_clickbait_score INTEGER, article_content_depth_label TEXT, article_content_depth_score INTEGER,
            article_recency_label TEXT, article_recency_score REAL, article_time_of_day TEXT,
            article_media_label TEXT, article_media_score INTEGER, article_premium_label TEXT,
            article_premium_score INTEGER, article_link_quality_label TEXT, article_link_quality_score INTEGER,
            article_misinfo_label TEXT, article_misinfo_score INTEGER, article_summary_relevance_label TEXT,
            article_summary_relevance_score REAL
        )
    ''')
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_link ON feed_entries (link)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_guid ON feed_entries (guid)")
    conn.commit()
    logging.info("Datenbank erfolgreich initialisiert.")
    return conn

def get_pipeline(model_path, task="text-classification"):
    if not os.path.exists(model_path):
        logging.info(f"Lade {task}-Modell '{model_path.split('/')[-1]}' herunter...")
        pipe = pipeline(task, model=model_path.split('/')[-1], device=-1)
        pipe.model.save_pretrained(model_path)
        pipe.tokenizer.save_pretrained(model_path)
        logging.info(f"{task.capitalize()}-Modell in {model_path} bereit.")
    else:
        logging.info(f"Aktualisiere {task}-Modell '{model_path}'...")
        pipe = pipeline(task, model=model_path, tokenizer=model_path, device=-1)
        logging.info(f"{task.capitalize()}-Modell in {model_path} bereit.")
    return pipe

def batch_process_texts(texts, model_path, task="text-classification"):
    try:
        pipe = get_pipeline(model_path, task)
        tokenizer = AutoTokenizer.from_pretrained(model_path)
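        # Pre-truncate very long texts at the character level; the tokenizer then truncates to max_length tokens.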
        pre_truncated_texts = [text[:2000] if len(text) > 2000 else text for text in texts]
        truncated_texts = [
            tokenizer.decode(tokenizer.encode(text, max_length=CONFIG["max_length"], truncation=True, padding=True)) 
            if text.strip() else "neutral" 
            for text in pre_truncated_texts
        ]
        results = pipe(truncated_texts, batch_size=CONFIG["batch_size"], max_length=CONFIG["max_length"], truncation=True, padding=True)
        logging.info(f"Batch-Verarbeitung für {model_path}: {len(results)} Ergebnisse")
        processed_results = []
        for r in results:
            if isinstance(r, list) and r:
                label = max(r, key=lambda x: x['score'])['label'].lower()
                score = max(r, key=lambda x: x['score'])['score']
            elif isinstance(r, dict):
                label = r.get('label', 'unbekannt').lower()
                score = r.get('score', 0.0)
            else:
                label, score = "unbekannt", 0.0
            processed_results.append((label, score))
        return processed_results
    except Exception as e:
        logging.error(f"Batch-Fehler bei {model_path}: {e}")
        return [("unbekannt", 0.0)] * len(texts)

def get_keyword_score(text): return 0
def get_category_score(categories): return 0
def get_readability_score(text): return "mittel", 0.5
def get_topic(text): return "unbekannt"
def is_clickbait(title): return "nein", 0
def get_content_depth(text): return "mittel", 0
def get_recency_score(pub_date): return "aktuell", 1.0
def get_time_of_day(pub_date): return "morgen"
def get_media_score(entry): return "vorhanden", 1
def get_premium_status(entry): return "nein", 0
def get_link_quality(link): return "gut", 1
def check_misinfo(text): return "nein", 0
def get_summary_relevance(title, summary): return "relevant", 0.9

def fetch_article_content(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        possible_content_classes = [
            'body-text', 'article-body', 'content-body', 'post-content', 'entry-content', 'article__content',
            'article-main_ArticleMain__body__item__NmRTO'
        ]
        article_body = None
        for content_class in possible_content_classes:
            article_body = soup.find('div', class_=content_class)
            if article_body:
                logging.info(f"Hauptinhalt gefunden mit Klasse '{content_class}' für {url}")
                break
        if article_body:
            paragraphs = article_body.find_all('p')
        else:
            article_tag = soup.find('article')
            if article_tag:
                logging.info(f"Fallback auf <article>-Tag für {url}")
                paragraphs = article_tag.find_all('p')
            else:
                logging.warning(f"Keine bekannte Content-Klasse gefunden für {url}. Fallback auf alle <p>-Tags.")
                paragraphs = soup.find_all('p')
        unwanted_texts = [
            "Gefällt Ihnen dieser Artikel? Unterstützen Sie unabhängigen Journalismus mit einem flexiblen Betrag",
            "Jetzt abonnieren", "Folgen Sie uns auf", "Lesen Sie auch", "Newsletter abonnieren",
            "Kommentare", "Teilen Sie Ihre Meinung", "Registrieren Sie sich", "Unterstützen Sie uns",
            "Flexibler Betrag", "Unabhängiger Journalismus"
        ]
        content_parts = []
        for p in paragraphs:
            text = p.get_text().strip()
            if text and not any(unwanted.lower() in text.lower() for unwanted in unwanted_texts):
                content_parts.append(text)
            elif text and any(unwanted.lower() in text.lower() for unwanted in unwanted_texts):
                logging.debug(f"Absatz gefiltert wegen unerwünschtem Text: {text[:100]}...")
        content = ' '.join(content_parts)
        if content:
            logging.info(f"Artikelinhalt erfolgreich abgerufen von {url} (Länge: {len(content)} Zeichen)")
            return content
        else:
            logging.warning(f"Kein relevanter Inhalt gefunden für {url}")
            return "Kein Inhalt verfügbar"
    except Exception as e:
        logging.error(f"Fehler beim Abrufen des Artikelinhalts von {url}: {e}")
        return "Kein Inhalt verfügbar"

def save_entries_to_db(conn, entries_data, model_dir=CONFIG["model_dir"]):
    cursor = conn.cursor()
    values_list = []
    logging.info(f"Anzahl der zu speichernden Einträge: {len(entries_data)}")
    for (entry, portal, article_content), sentiment, emotion, toxicity, article_sentiment, article_emotion, article_toxicity in entries_data:
        title = entry.get('title', 'Kein Titel')
        summary = entry.get('summary', 'Keine Beschreibung')
        link = entry.get('link', 'Kein Link')
        pub_date = entry.get('published', entry.get('pubDate', 'Kein Veröffentlichungsdatum'))
        guid = entry.get('guid', link)
        categories_list = entry.get('categories', []) or [cat.get('term') for cat in entry.get('tags', [])]
        categories = ', '.join(categories_list)
        media_urls = ', '.join([media.get('url', '') for media in entry.get('media_content', [])]) if 'media_content' in entry else ''
        thumbnail = entry.media_thumbnail[0].get('url') if 'media_thumbnail' in entry and entry.media_thumbnail else ''
        premium = entry.get('bild_premium', 'nein') if 'bild_premium' in entry else 'nein'
        fetch_date = datetime.now().isoformat()
        text_to_analyze = f"{title} {summary}" if summary != 'Keine Beschreibung' else title
        sentiment_label, sentiment_score = sentiment if sentiment else ("unbekannt", 0.0)
        emotion_label, emotion_score = emotion if emotion else ("unbekannt", 0.0)
        toxicity_label, toxicity_score = toxicity if toxicity else ("unbekannt", 0.0)
        keyword_score = get_keyword_score(text_to_analyze)
        category_score = get_category_score(categories_list)
        readability_label, readability_score = get_readability_score(text_to_analyze)
        topic = get_topic(text_to_analyze)
        clickbait_label, clickbait_score = is_clickbait(title)
        content_depth_label, content_depth_score = get_content_depth(text_to_analyze)
        recency_label, recency_score = get_recency_score(pub_date)
        time_of_day = get_time_of_day(pub_date)
        media_label, media_score = get_media_score(entry)
        premium_label, premium_score = get_premium_status(entry)
        link_quality_label, link_quality_score = get_link_quality(link)
        misinfo_label, misinfo_score = check_misinfo(text_to_analyze)
        summary_relevance_label, summary_relevance_score = get_summary_relevance(title, summary)
        article_sentiment_label, article_sentiment_score = article_sentiment if article_sentiment else ("unbekannt", 0.0)
        article_emotion_label, article_emotion_score = article_emotion if article_emotion else ("unbekannt", 0.0)
        article_toxicity_label, article_toxicity_score = article_toxicity if article_toxicity else ("unbekannt", 0.0)
        article_keyword_score = get_keyword_score(article_content)
        article_category_score = get_category_score(categories_list)
        article_readability_label, article_readability_score = get_readability_score(article_content)
        article_topic = get_topic(article_content)
        article_clickbait_label, article_clickbait_score = is_clickbait(title)
        article_content_depth_label, article_content_depth_score = get_content_depth(article_content)
        article_recency_label, article_recency_score = get_recency_score(pub_date)
        article_time_of_day = get_time_of_day(pub_date)
        article_media_label, article_media_score = get_media_score(entry)
        article_premium_label, article_premium_score = get_premium_status(entry)
        article_link_quality_label, article_link_quality_score = get_link_quality(link)
        article_misinfo_label, article_misinfo_score = check_misinfo(article_content)
        article_summary_relevance_label, article_summary_relevance_score = get_summary_relevance(title, article_content)
        entry_values = (
            portal, title, summary, link, pub_date, guid, categories, media_urls, thumbnail, premium, fetch_date,
            sentiment_label, sentiment_score, keyword_score, category_score,
            readability_label, readability_score, emotion_label, emotion_score,
            topic, clickbait_label, clickbait_score, content_depth_label, content_depth_score,
            recency_label, recency_score, time_of_day, media_label, media_score, premium_label, premium_score,
            link_quality_label, link_quality_score, toxicity_label, toxicity_score, misinfo_label, misinfo_score,
            summary_relevance_label, summary_relevance_score,
            article_content, article_sentiment_label, article_sentiment_score,
            article_emotion_label, article_emotion_score, article_toxicity_label, article_toxicity_score,
            article_keyword_score, article_category_score, article_readability_label, article_readability_score,
            article_topic, article_clickbait_label, article_clickbait_score, article_content_depth_label,
            article_content_depth_score, article_recency_label, article_recency_score, article_time_of_day,
            article_media_label, article_media_score, article_premium_label, article_premium_score,
            article_link_quality_label, article_link_quality_score, article_misinfo_label, article_misinfo_score,
            article_summary_relevance_label, article_summary_relevance_score
        )
        values_list.append(entry_values)
    try:
        if values_list:
            columns = (
                "portal, title, summary, link, pub_date, guid, categories, media_urls, thumbnail, premium, fetch_date, "
                "sentiment, sentiment_score, keyword_score, category_score, "
                "readability_label, readability_score, emotion, emotion_score, "
                "topic, clickbait_label, clickbait_score, content_depth_label, content_depth_score, "
                "recency_label, recency_score, time_of_day, media_label, media_score, premium_label, premium_score, "
                "link_quality_label, link_quality_score, toxicity_label, toxicity_score, misinfo_label, misinfo_score, "
                "summary_relevance_label, summary_relevance_score, "
                "article_content, article_sentiment, article_sentiment_score, "
                "article_emotion, article_emotion_score, article_toxicity_label, article_toxicity_score, "
                "article_keyword_score, article_category_score, article_readability_label, article_readability_score, "
                "article_topic, article_clickbait_label, article_clickbait_score, article_content_depth_label, "
                "article_content_depth_score, article_recency_label, article_recency_score, article_time_of_day, "
                "article_media_label, article_media_score, article_premium_label, article_premium_score, "
                "article_link_quality_label, article_link_quality_score, article_misinfo_label, article_misinfo_score, "
                "article_summary_relevance_label, article_summary_relevance_score"
            )
            placeholders = ','.join(['?' for _ in range(68)])
            query = f"INSERT OR IGNORE INTO feed_entries ({columns}) VALUES ({placeholders})"
            cursor.executemany(query, values_list)
            conn.commit()
            cursor.execute("SELECT COUNT(*) FROM feed_entries")
            row_count = cursor.fetchone()[0]
            logging.info(f"{len(values_list)} Einträge vorbereitet, {row_count} Einträge in der Datenbank.")
        else:
            logging.warning("Keine Einträge zum Speichern vorhanden.")
    except Exception as e:
        logging.error(f"Fehler beim Speichern in die Datenbank: {e}")
        conn.rollback()

def process_feeds():
    logging.info("ContentScan Beta-Version wird gestartet...")
    load_json('config.json', CONFIG)
    load_json('keywords.json', {"keywords": []})
    load_json('topics.json', {"topics": []})
    load_json('clickbait.json', {"clickbait_phrases": []})
    logging.info("OpenCL verfügbar. Nutze OpenCL (Fallback auf CPU für Transformer).")
    conn = init_database()
    feeds = CONFIG["feeds"]
    logging.info(f"Gelesene Feeds: {feeds}")
    logging.info(f"Verarbeite {len(feeds)} Feeds")
    for feed_url in feeds:
        logging.info(f"Verarbeite URL: {feed_url}")
        feed = feedparser.parse(feed_url)
        portal = feed_url.split('/')[2].replace('www.', '')
        output_file = f"{portal}_{feed_url.split('/')[-1]}.xml"
        with open(os.path.join(CONFIG["output_dir"], output_file), 'w', encoding='utf-8') as f:
            f.write(requests.get(feed_url).text)
        logging.info(f"Feed gespeichert als: {output_file}")
        logging.info(f"Feed: {feed.feed.get('title', 'Unbekannt')} ({len(feed.entries)} Einträge)")
        entries_data = []
        for entry in feed.entries:
            article_content = fetch_article_content(entry.get('link', ''))
            entries_data.append(((entry, portal, article_content), None, None, None, None, None, None))
        texts = [f"{entry.get('title', 'Kein Titel')} {entry.get('summary', 'Keine Beschreibung')}" 
                if entry.get('summary', 'Keine Beschreibung') != 'Keine Beschreibung' 
                else entry.get('title', 'Kein Titel') 
                for entry, _, _ in [data[0] for data in entries_data]]
        article_texts = [article_content for _, _, article_content in [data[0] for data in entries_data]]
        sentiment_results = batch_process_texts(texts, os.path.join(CONFIG["model_dir"], "twitter-xlm-roberta-base-sentiment"))
        emotion_results = batch_process_texts(texts, os.path.join(CONFIG["model_dir"], "emotion-english-distilroberta-base"))
        toxicity_results = batch_process_texts(texts, os.path.join(CONFIG["model_dir"], "multilingual-toxic-xlm-roberta"))
        article_sentiment_results = batch_process_texts(article_texts, os.path.join(CONFIG["model_dir"], "twitter-xlm-roberta-base-sentiment"))
        article_emotion_results = batch_process_texts(article_texts, os.path.join(CONFIG["model_dir"], "emotion-english-distilroberta-base"))
        article_toxicity_results = batch_process_texts(article_texts, os.path.join(CONFIG["model_dir"], "multilingual-toxic-xlm-roberta"))
        for i, data in enumerate(entries_data):
            entries_data[i] = (data[0], sentiment_results[i], emotion_results[i], toxicity_results[i],
                            article_sentiment_results[i], article_emotion_results[i], article_toxicity_results[i])
        save_entries_to_db(conn, entries_data)
    conn.close()
    logging.info("ContentScan Beta-Version abgeschlossen.")

if __name__ == "__main__":
    os.makedirs(CONFIG["model_dir"], exist_ok=True)
    os.makedirs(CONFIG["output_dir"], exist_ok=True)
    process_feeds()

GUI Script (news_analyzer.py)

python

import sqlite3
import tkinter as tk
from tkinter import ttk, messagebox
import matplotlib.pyplot as plt
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
from datetime import datetime
import os

def connect_db(db_name="rss_feeds.db"):
    try:
        return sqlite3.connect(db_name)
    except sqlite3.Error as e:
        messagebox.showerror("Datenbankfehler", f"Fehler beim Verbinden zur Datenbank: {e}")
        return None

def search_articles(search_term, column="title", limit=50, offset=0):
    conn = connect_db()
    if not conn:
        return []
    cursor = conn.cursor()
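    # 'column' is interpolated into the SQL string, so it must come from a fixed set of column names (the GUI's OptionMenu restricts it); only the search term is bound as a parameter.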
    query = f"SELECT id, portal, title, pub_date, article_content FROM feed_entries WHERE {column} LIKE ? LIMIT ? OFFSET ?"
    cursor.execute(query, (f"%{search_term}%", limit, offset))
    results = cursor.fetchall()
    conn.close()
    return results

def get_total_results(search_term, column="title"):
    conn = connect_db()
    if not conn:
        return 0
    cursor = conn.cursor()
    query = f"SELECT COUNT(*) FROM feed_entries WHERE {column} LIKE ?"
    cursor.execute(query, (f"%{search_term}%",))
    total = cursor.fetchone()[0]
    conn.close()
    return total

def get_article_details(article_id):
    conn = connect_db()
    if not conn:
        return None
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM feed_entries WHERE id = ?", (article_id,))
    result = cursor.fetchone()
    conn.close()
    return result

def interpret_scores(scores_dict):
    interpretations = {}
    sentiment_score = scores_dict.get('sentiment_score', 0.0)
    sentiment_label = scores_dict.get('sentiment', 'unbekannt')
    if sentiment_score > 0.7:
        interpretations['sentiment'] = f"Stark {sentiment_label} (Score: {sentiment_score:.2f})"
    elif sentiment_score > 0.4:
        interpretations['sentiment'] = f"Moderat {sentiment_label} (Score: {sentiment_score:.2f})"
    else:
        interpretations['sentiment'] = f"Schwach {sentiment_label} (Score: {sentiment_score:.2f})"

    emotion_score = scores_dict.get('emotion_score', 0.0)
    emotion_label = scores_dict.get('emotion', 'unbekannt')
    if emotion_score > 0.7:
        interpretations['emotion'] = f"Starke {emotion_label}-Emotion (Score: {emotion_score:.2f})"
    elif emotion_score > 0.4:
        interpretations['emotion'] = f"Moderate {emotion_label}-Emotion (Score: {emotion_score:.2f})"
    else:
        interpretations['emotion'] = f"Schwache {emotion_label}-Emotion (Score: {emotion_score:.2f})"

    toxicity_score = scores_dict.get('toxicity_score', 0.0)
    if toxicity_score > 0.7:
        interpretations['toxicity'] = f"Hohe Toxizität (Score: {toxicity_score:.2f})"
    elif toxicity_score > 0.3:
        interpretations['toxicity'] = f"Moderate Toxizität (Score: {toxicity_score:.2f})"
    else:
        interpretations['toxicity'] = f"Niedrige Toxizität (Score: {toxicity_score:.2f})"

    article_sentiment_score = scores_dict.get('article_sentiment_score', 0.0)
    article_sentiment_label = scores_dict.get('article_sentiment', 'unbekannt')
    if article_sentiment_score > 0.7:
        interpretations['article_sentiment'] = f"Stark {article_sentiment_label} (Score: {article_sentiment_score:.2f})"
    elif article_sentiment_score > 0.4:
        interpretations['article_sentiment'] = f"Moderat {article_sentiment_label} (Score: {article_sentiment_score:.2f})"
    else:
        interpretations['article_sentiment'] = f"Schwach {article_sentiment_label} (Score: {article_sentiment_score:.2f})"

    article_emotion_score = scores_dict.get('article_emotion_score', 0.0)
    article_emotion_label = scores_dict.get('article_emotion', 'unbekannt')
    if article_emotion_score > 0.7:
        interpretations['article_emotion'] = f"Starke {article_emotion_label}-Emotion (Score: {article_emotion_score:.2f})"
    elif article_emotion_score > 0.4:
        interpretations['article_emotion'] = f"Moderate {article_emotion_label}-Emotion (Score: {article_emotion_score:.2f})"
    else:
        interpretations['article_emotion'] = f"Schwache {article_emotion_label}-Emotion (Score: {article_emotion_score:.2f})"

    article_toxicity_score = scores_dict.get('article_toxicity_score', 0.0)
    if article_toxicity_score > 0.7:
        interpretations['article_toxicity'] = f"Hohe Toxizität (Score: {article_toxicity_score:.2f})"
    elif article_toxicity_score > 0.3:
        interpretations['article_toxicity'] = f"Moderate Toxizität (Score: {article_toxicity_score:.2f})"
    else:
        interpretations['article_toxicity'] = f"Niedrige Toxizität (Score: {article_toxicity_score:.2f})"

    return interpretations

class NewsAnalyzerApp:
    def __init__(self, root):
        self.root = root
        self.root.title("News Analyzer")
        self.root.geometry("1200x800")

        style = ttk.Style()
        style.configure("TLabel", font=("Helvetica", 10))
        style.configure("TButton", font=("Helvetica", 10))

        self.search_frame = ttk.Frame(root, padding="10")
        self.search_frame.pack(fill="x")

        ttk.Label(self.search_frame, text="Suche nach:").grid(row=0, column=0, padx=5, pady=5)
        self.search_entry = ttk.Entry(self.search_frame, width=50)
        self.search_entry.grid(row=0, column=1, padx=5, pady=5)

        self.search_column = tk.StringVar(value="title")
        ttk.OptionMenu(self.search_frame, self.search_column, "title", "title", "portal", "categories", "pub_date").grid(row=0, column=2, padx=5, pady=5)
        ttk.Button(self.search_frame, text="Suchen", command=self.search).grid(row=0, column=3, padx=5, pady=5)

        self.page = 0
        self.limit = 50
        self.pagination_frame = ttk.Frame(self.search_frame)
        self.pagination_frame.grid(row=0, column=4, padx=5, pady=5)
        ttk.Button(self.pagination_frame, text="Vorherige", command=self.prev_page).grid(row=0, column=0, padx=5)
        self.page_label = ttk.Label(self.pagination_frame, text="Seite 1")
        self.page_label.grid(row=0, column=1, padx=5)
        ttk.Button(self.pagination_frame, text="Nächste", command=self.next_page).grid(row=0, column=2, padx=5)

        self.results_frame = ttk.Frame(root, padding="10")
        self.results_frame.pack(fill="both", expand=True)

        self.results_tree = ttk.Treeview(self.results_frame, columns=("ID", "Portal", "Title", "Date", "Categories"), show="headings", height=12)
        self.results_tree.heading("ID", text="ID")
        self.results_tree.heading("Portal", text="Portal")
        self.results_tree.heading("Title", text="Titel")
        self.results_tree.heading("Date", text="Datum")
        self.results_tree.heading("Categories", text="Kategorien")
        self.results_tree.column("ID", width=50)
        self.results_tree.column("Portal", width=150)
        self.results_tree.column("Title", width=400)
        self.results_tree.column("Date", width=150)
        self.results_tree.column("Categories", width=200)
        self.results_tree.pack(fill="both", expand=True)
        self.results_tree.bind("<<TreeviewSelect>>", self.show_details)

        self.details_frame = ttk.Frame(root, padding="10")
        self.details_frame.pack(fill="both", expand=True)

        self.details_text = tk.Text(self.details_frame, height=12, width=80, wrap="word")
        self.details_text.pack(side="left", fill="both", expand=True)

        self.canvas_frame = ttk.Frame(self.details_frame)
        self.canvas_frame.pack(side="right", fill="both", expand=True)

        self.analysis_frame = ttk.Frame(root, padding="10")
        self.analysis_frame.pack(fill="x")
        ttk.Button(self.analysis_frame, text="Auswertung der ausgewählten Artikel", command=self.analyze_selected_articles).pack()

    def search(self):
        search_term = self.search_entry.get()
        column = self.search_column.get()
        if not search_term:
            messagebox.showwarning("Eingabefehler", "Bitte geben Sie einen Suchbegriff ein.")
            return

        results = search_articles(search_term, column, self.limit, self.page * self.limit)
        total = get_total_results(search_term, column)
        max_pages = (total - 1) // self.limit + 1

        for item in self.results_tree.get_children():
            self.results_tree.delete(item)
        for result in results:
            article_id = result[0]
            details = get_article_details(article_id)
            categories = details[7] if details else "N/A"
            self.results_tree.insert("", "end", values=(result[0], result[1], result[2], result[3], categories))

        self.page_label.config(text=f"Seite {self.page + 1} von {max_pages}")

    def prev_page(self):
        if self.page > 0:
            self.page -= 1
            self.search()

    def next_page(self):
        search_term = self.search_entry.get()
        column = self.search_column.get()
        total = get_total_results(search_term, column)
        max_pages = (total - 1) // self.limit + 1
        if self.page < max_pages - 1:
            self.page += 1
            self.search()

    def show_details(self, event):
        selected_item = self.results_tree.selection()
        if not selected_item:
            return
        
        article_id = self.results_tree.item(selected_item, "values")[0]
        article = get_article_details(article_id)
        if not article:
            messagebox.showerror("Fehler", "Artikel konnte nicht geladen werden.")
            return
        
        columns = [
            "id", "portal", "title", "summary", "link", "pub_date", "guid", "categories", "media_urls", "thumbnail", "premium", "fetch_date",
            "sentiment", "sentiment_score", "keyword_score", "category_score", "readability_label", "readability_score",
            "emotion", "emotion_score", "topic", "clickbait_label", "clickbait_score", "content_depth_label", "content_depth_score",
            "recency_label", "recency_score", "time_of_day", "media_label", "media_score", "premium_label", "premium_score",
            "link_quality_label", "link_quality_score", "toxicity_label", "toxicity_score", "misinfo_label", "misinfo_score",
            "summary_relevance_label", "summary_relevance_score", "article_content", "article_sentiment", "article_sentiment_score",
            "article_emotion", "article_emotion_score", "article_toxicity_label", "article_toxicity_score", "article_keyword_score",
            "article_category_score", "article_readability_label", "article_readability_score", "article_topic", "article_clickbait_label",
            "article_clickbait_score", "article_content_depth_label", "article_content_depth_score", "article_recency_label",
            "article_recency_score", "article_time_of_day", "article_media_label", "article_media_score", "article_premium_label",
            "article_premium_score", "article_link_quality_label", "article_link_quality_score", "article_misinfo_label",
            "article_misinfo_score", "article_summary_relevance_label", "article_summary_relevance_score"
        ]
        
        details = f"Titel: {article[2]}\nPortal: {article[1]}\nVeröffentlichungsdatum: {article[5]}\nLink: {article[4]}\n\n"
        article_content = article[40] if article[40] is not None else "Kein Inhalt verfügbar"  # index 40 = article_content column
        if not isinstance(article_content, str):
            article_content = str(article_content)
        details += "Artikelinhalt (Auszug):\n" + (article_content[:200] + "..." if len(article_content) > 200 else article_content) + "\n\n"
        
        scores_dict = {col: article[i] for i, col in enumerate(columns) if "score" in col or col in ["sentiment", "emotion", "toxicity", "article_sentiment", "article_emotion"]}
        interpretations = interpret_scores(scores_dict)
        
        details += "Bewertungen:\n"
        for key, value in interpretations.items():
            details += f"{key.replace('_', ' ').title()}: {value}\n"
        
        self.details_text.delete("1.0", tk.END)
        self.details_text.insert(tk.END, details)
        self.plot_scores(scores_dict)

    def plot_scores(self, scores_dict):
        for widget in self.canvas_frame.winfo_children():
            widget.destroy()

        labels = ["Sentiment", "Emotion", "Toxicity", "Art. Sentiment", "Art. Emotion", "Art. Toxicity"]
        scores = [
            scores_dict.get('sentiment_score', 0.0),
            scores_dict.get('emotion_score', 0.0),
            scores_dict.get('toxicity_score', 0.0),
            scores_dict.get('article_sentiment_score', 0.0),
            scores_dict.get('article_emotion_score', 0.0),
            scores_dict.get('article_toxicity_score', 0.0)
        ]

        fig, ax = plt.subplots(figsize=(6, 4))
        bars = ax.bar(labels, scores, color=['#4CAF50', '#2196F3', '#F44336', '#8BC34A', '#03A9F4', '#E91E63'])
        ax.set_ylim(0, 1)
        ax.set_title("Bewertungs-Scores")
        ax.set_ylabel("Score (0-1)")
        plt.xticks(rotation=45)
        for bar in bars:
            yval = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2, yval + 0.02, f"{yval:.2f}", ha='center', va='bottom')

        canvas = FigureCanvasTkAgg(fig, master=self.canvas_frame)
        canvas.draw()
        canvas.get_tk_widget().pack(fill="both", expand=True)
        ttk.Button(self.canvas_frame, text="Diagramm speichern", command=lambda: fig.savefig("scores_plot.png")).pack()

    def analyze_selected_articles(self):
        selected_items = self.results_tree.selection()
        if not selected_items:
            messagebox.showwarning("Auswahlfehler", "Bitte wählen Sie mindestens einen Artikel aus.")
            return

        article_ids = [self.results_tree.item(item, "values")[0] for item in selected_items]
        conn = connect_db()
        if not conn:
            return
        cursor = conn.cursor()

        scores = {
            "sentiment_score": [],
            "emotion_score": [],
            "toxicity_score": [],
            "article_sentiment_score": [],
            "article_emotion_score": [],
            "article_toxicity_score": []
        }
        
        for article_id in article_ids:
            cursor.execute("SELECT sentiment_score, emotion_score, toxicity_score, article_sentiment_score, article_emotion_score, article_toxicity_score FROM feed_entries WHERE id = ?", (article_id,))
            result = cursor.fetchone()
            if result:
                scores["sentiment_score"].append(result[0] or 0.0)
                scores["emotion_score"].append(result[1] or 0.0)
                scores["toxicity_score"].append(result[2] or 0.0)
                scores["article_sentiment_score"].append(result[3] or 0.0)
                scores["article_emotion_score"].append(result[4] or 0.0)
                scores["article_toxicity_score"].append(result[5] or 0.0)

        conn.close()

        analysis_text = "Auswertung der ausgewählten Artikel:\n\n"
        for key, values in scores.items():
            if values:
                avg = sum(values) / len(values)
                analysis_text += f"Durchschnittlicher {key.replace('_', ' ').title()}: {avg:.2f}\n"
            else:
                analysis_text += f"Durchschnittlicher {key.replace('_', ' ').title()}: Keine Daten\n"

        analysis_window = tk.Toplevel(self.root)
        analysis_window.title("Auswertung")
        analysis_window.geometry("400x300")
        text = tk.Text(analysis_window, height=15, width=50, wrap="word")
        text.insert(tk.END, analysis_text)
        text.pack(fill="both", expand=True)

        fig, ax = plt.subplots(figsize=(6, 4))
        labels = [key.replace('_score', '').replace('_', ' ').title() for key in scores.keys()]
        avg_scores = [sum(values) / len(values) if values else 0.0 for values in scores.values()]
        bars = ax.bar(labels, avg_scores, color=['#4CAF50', '#2196F3', '#F44336', '#8BC34A', '#03A9F4', '#E91E63'])
        ax.set_ylim(0, 1)
        ax.set_title("Durchschnittliche Bewertungs-Scores")
        ax.set_ylabel("Score (0-1)")
        plt.xticks(rotation=45)
        for bar in bars:
            yval = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2, yval + 0.02, f"{yval:.2f}", ha='center', va='bottom')

        canvas = FigureCanvasTkAgg(fig, master=analysis_window)
        canvas.draw()
        canvas.get_tk_widget().pack(fill="both", expand=True)

if __name__ == "__main__":
    if not os.path.exists("rss_feeds.db"):
        messagebox.showerror("Datenbankfehler", "Die Datenbank 'rss_feeds.db' wurde nicht gefunden. Bitte führen Sie zuerst contentscan.py aus.")
    else:
        root = tk.Tk()
        app = NewsAnalyzerApp(root)
        root.mainloop()

Extended Use Cases with Links

  1. Quality assessment: estimating journalistic quality with aggregated scores (Poynter Institute).
  2. Political classification: bias detection via average scores (AllSides).
  3. Logical consistency: checking arguments with NLP models (FactCheck.org).
  4. Depth of content: analyzing substance (The Conversation).
  5. Media monitoring: tracking trends with aggregated data (Google News Lab).
  6. Toxicity screening: identifying problematic content (Snopes).
  7. Data journalism: a data basis for reporting (DataJournalism.com).
  8. Clickbait detection: filtering sensationalist headlines (Clickbait Detector).

Future Developments


Conclusion

"ContentScan" is my contribution to automated news analysis, complemented by a powerful GUI with aggregated evaluations. It assesses quality, political leaning, logic, and depth – ideal for media researchers, journalists, and data analysts. It is still a beta version, and I invite you to try it out and give feedback!

