{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "", "# Semantic Bridge Analysis", "## A TACC Computational Cookbook", "", "\ud83d\udcda Computational Cookbook Series", "", "This notebook is part of the TACC Computational Cookbook series - reproducible workflows for scientific computing on High Performance Computing (HPC) systems.", "", "", "**Developed by:** Texas Advanced Computing Center (TACC) ", "**Institution:** The University of Texas at Austin ", "**Contact:** [tacc-help@tacc.utexas.edu](mailto:tacc-help@tacc.utexas.edu)", "", "---", "", "### \ud83c\udfaf What This Cookbook Does", "", "This workflow enables researchers and decision-makers to bridge the gap between qualitative stakeholder narratives and quantitative scientific analysis. Using natural language processing and machine learning, the cookbook:", "", "1. **Discovers themes** in community narratives, stakeholder interviews, and planning documents", "2. **Identifies decision components** including goals, objectives, variables, and constraints ", "3. **Maps concepts** to established scientific disciplines and domains", "4. **Links terminology** to standardized scientific variables and data sources", "5. **Creates visualizations** showing connections between human perspectives and scientific frameworks", "", "**Use cases include:**", "- Environmental planning and climate adaptation", "- Infrastructure decision support", "- Community-driven research", "- Participatory modeling and stakeholder engagement", "- Interdisciplinary problem framing", "", "---", "", "### \ud83d\udccb Prerequisites", "", "**TACC Account & Allocation:**", "- Active TACC user account ([register here](https://accounts.tacc.utexas.edu/register))", "- Allocation on TACC systems (or use startup allocation)", "- Familiarity with Jupyter notebooks", "", "**Input Data:**", "- Text documents describing your problem/situation (.txt, .json, .docx formats)", "- Examples: interview transcripts, meeting notes, stakeholder reports, community narratives", "", "**Knowledge:**", "- Basic Python programming", "- Understanding of your domain/problem area", "- Ability to customize scientific vocabularies for your field", "", "---", "", "### \ud83d\ude80 Quick Start", "", "1. **Prepare your data:** Place text documents in `data/transcripts/` directory", "2. 
**Run Setup (Step 1):** Install required packages ", "3. **Load Data (Step 2):** The notebook will process your documents", "4. **Analyze (Steps 3-6):** Follow the workflow to discover topics, map to science, extract components", "5. **Review Outputs:** Check the `outputs/` folder for results and visualizations", "", "---", "", "### \ud83d\udcca Expected Outputs", "", "- `*_topic_mappings.csv`: Topics linked to scientific domains", "- `*_components.csv`: Extracted decision components ", "- `*_svo_mappings.csv`: Semantic links to scientific variables", "- `*_network.html`: Interactive science backbone visualization", "- `*_report.md`: Comprehensive analysis summary", "", "---", "", "### \ud83d\udd27 Customization", "", "This cookbook is designed to be adapted to different domains:", "", "- **Modify `science_backbone`** (Step 4): Add your discipline-specific domains", "- **Expand `svo_vocabulary`** (Step 6): Include variables relevant to your field ", "- **Adjust `n_topics`** (Step 3): Match complexity of your document corpus", "- **Update decision patterns** (Step 5): Customize for your decision framework", "", "---", "", "### \ud83d\udcd6 Learn More", "", "- [TACC Documentation](https://docs.tacc.utexas.edu)", "- [TACC Training](https://learn.tacc.utexas.edu)", "- [Science Gateways Community Institute](https://sciencegateways.org/)", "", "---", "", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Setup\n", "\n", "We will load the tools needed for text analysis and visualization.\n", "\n", "This setup includes: \n", "- Import verification:\n", " Checks each package before installing\n", "- spaCy model check:\n", " Verifies if the model is already downloaded\n", "- Selective installation:\n", " Only installs what's missing\n", "- Clear feedback:\n", " Shows which packages are already available vs. 
need installation\n", "- Import name mapping:\n", " Handles cases where package name \u2260 import name (like scikit-learn vs sklearn)\n", "- Efficient for reuse:\n", " Won't waste time reinstalling on subsequent runs\n", "\n", "This is perfect for TACC computational cookbooks where the notebook might be run multiple times by different users or in environments with varying pre-installed packages." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \u2713 pandas already installed\n", " \u2713 numpy already installed\n", " \u2713 nltk already installed\n", " \u2713 spacy already installed\n", " \u2713 scikit-learn already installed\n", " \u2713 networkx already installed\n", " \u2713 plotly already installed\n", " \u2713 python-docx already installed\n", " \u2713 pillow already installed\n", "\n", "\u2713 All packages already installed!\n", "\u2713 spaCy model already available!\n", "\n", "\u2713 Setup complete and verified!\n" ] } ], "source": [ "# Cell 1: Installation with verification checks\n", "\n", "# Install required packages for TACC Jupyter environment\n", "import sys\n", "import subprocess\n", "import importlib\n", "import os\n", "from pathlib import Path\n", "\n", "# Add user's local bin to PATH (needed for TACC)\n", "user_bin = Path.home() / '.local' / 'bin'\n", "if str(user_bin) not in os.environ['PATH']:\n", " os.environ['PATH'] = f\"{user_bin}:{os.environ['PATH']}\"\n", " print(f'\u2713 Added {user_bin} to PATH')\n", "\n", "def check_package_installed(package_name, import_name=None):\n", " \"\"\"Check if a package is already installed\"\"\"\n", " if import_name is None:\n", " import_name = package_name\n", " try:\n", " importlib.import_module(import_name)\n", " return True\n", " except ImportError:\n", " return False\n", "\n", "def check_spacy_model(model_name='en_core_web_sm'):\n", " \"\"\"Check if spaCy model is already downloaded\"\"\"\n", " try:\n", " import 
spacy\n", " spacy.load(model_name)\n", " return True\n", " except (ImportError, OSError):\n", " return False\n", "\n", "def install_packages():\n", " \"\"\"Install required packages if not already available\"\"\"\n", " packages = {\n", " 'pandas': 'pandas',\n", " 'numpy': 'numpy',\n", " 'nltk': 'nltk',\n", " 'spacy': 'spacy',\n", " 'scikit-learn': 'sklearn',\n", " 'networkx': 'networkx',\n", " 'plotly': 'plotly',\n", " 'python-docx': 'docx',\n", " 'pillow': 'PIL'\n", " }\n", " \n", " missing_packages = []\n", " for package, import_name in packages.items():\n", " if not check_package_installed(package, import_name):\n", " missing_packages.append(package)\n", " print(f' - {package} needs installation')\n", " else:\n", " print(f' \u2713 {package} already installed')\n", " \n", " if missing_packages:\n", " print(f'\\nInstalling {len(missing_packages)} package(s)...')\n", " subprocess.check_call([\n", " sys.executable, '-m', 'pip', 'install', \n", " '--quiet', '--user', '--no-warn-script-location'\n", " ] + missing_packages)\n", " print('\u2713 Package installation complete!')\n", " else:\n", " print('\\n\u2713 All packages already installed!')\n", " \n", " if not check_spacy_model('en_core_web_sm'):\n", " print('\\nDownloading spaCy model...')\n", " subprocess.check_call([\n", " sys.executable, '-m', 'spacy', 'download', \n", " 'en_core_web_sm', '--quiet'\n", " ])\n", " print('\u2713 spaCy model downloaded!')\n", " else:\n", " print('\u2713 spaCy model already available!')\n", " \n", " print('\\n\u2713 Setup complete and verified!')\n", "\n", "install_packages()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2139 pytesseract not available - image OCR disabled\n", " (This is fine if you only use .txt, .json, or .docx files)\n", "\u2713 Libraries loaded successfully!\n" ] } ], "source": [ "## Cell 2: Import verification, document handling, and library loading\n", "\n", "# Be sure 
to run Cell 2 each time you restart the kernel\n", "\n", "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import json\n", "from pathlib import Path\n", "from collections import Counter, defaultdict\n", "import re\n", "\n", "# NLP\n", "import nltk\n", "from nltk.tokenize import sent_tokenize, word_tokenize\n", "from nltk.corpus import stopwords\n", "import spacy\n", "\n", "# Machine Learning\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.decomposition import LatentDirichletAllocation\n", "\n", "# Network analysis\n", "import networkx as nx\n", "\n", "# Visualization\n", "import plotly.graph_objects as go\n", "import plotly.express as px\n", "\n", "# Document handling\n", "from docx import Document\n", "from PIL import Image\n", "\n", "# Optional: OCR support for images\n", "try:\n", " import pytesseract\n", " OCR_AVAILABLE = True\n", "except ImportError:\n", " OCR_AVAILABLE = False\n", " print('\u2139 pytesseract not available - image OCR disabled')\n", " print(' (This is fine if you only use .txt, .json, or .docx files)')\n", "\n", "# Download NLTK data (each resource lives under a different path)\n", "nltk_resources = {\n", " 'punkt': 'tokenizers/punkt',\n", " 'stopwords': 'corpora/stopwords',\n", " 'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger'\n", "}\n", "for pkg, path in nltk_resources.items():\n", " try:\n", " nltk.data.find(path)\n", " except LookupError:\n", " nltk.download(pkg, quiet=True)\n", "\n", "# Load spaCy\n", "try:\n", " nlp = spacy.load('en_core_web_sm')\n", " print('\u2713 Libraries loaded successfully!')\n", "except OSError:\n", " print('\u26a0 Run the previous cell to install the spaCy model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Prepare Input Data (this is your \"corpus\")\n", "\n", "This notebook works with documents in multiple formats that describe a problem or situation. 
For example:\n", "- Interview transcripts\n", "- Meeting notes\n", "- Stakeholder reports\n", "- Grey literature reports\n", "- Community narratives\n", "\n", "**Supported formats:**\n", "- `.txt` - Plain text files\n", "- `.json` - JSON with text content (field: \"text\" or \"content\")\n", "- `.docx` - Microsoft Word documents\n", "- `.png` / `.jpg` / `.jpeg` - Images (OCR extraction)\n", "\n", "**Setup:** Place your files in `data/transcripts/` folder\n", "\n", "For this demo, the next cell creates sample documents. You can use these samples if you do not have your own test data." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Created directory: data/transcripts\n", " Place your .txt transcript files here, or run the next cell for demo data\n" ] } ], "source": [ "# Create data directory structure\n", "from pathlib import Path\n", "\n", "data_dir = Path('data/transcripts')\n", "data_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "print(f'\u2713 Created directory: {data_dir}')\n", "print(f' Place your .txt transcript files here, or run the next cell for demo data')" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Created: interview_001.txt\n", "\u2713 Created: interview_002.txt\n", "\u2713 Created: meeting_notes_001.txt\n", "\u2713 Created: stakeholder_report.txt\n", "\n", "\u2713 Created 4 expanded sample documents in data/transcripts\n", " Total words: 851\n" ] } ], "source": [ "# Sample transcript documents for demonstration\n", "\n", "# Skip this if you have real data\n", "\n", "# Create expanded sample transcript documents for 
demonstration\n", "\n", "sample_transcripts = {\n", " 'interview_001.txt': \"\"\"\n", " Interview with Community Resident - Sarah Martinez\n", " Date: March 15, 2024\n", " \n", " We've been experiencing significant flooding in our neighborhood during heavy rains. \n", " The storm drains seem inadequate, and water pools on Main Street for hours. \n", " Several basements have flooded in the past year. \n", " \n", " Our primary goal is to protect our homes and preserve property values in this neighborhood. \n", " We need to maintain safe access to schools and emergency services even during storm events.\n", " \n", " The main objective should be to minimize flood damage to residential properties and reduce \n", " the frequency of street closures. We're looking at different investment options for \n", " stormwater management, but we have a budget constraint of about $2 million from the city \n", " council allocation.\n", " \n", " I think we need better drainage infrastructure as our decision variable. The implementation \n", " strategy could include both green and gray infrastructure. We cannot exceed the current \n", " budget without additional grant funding.\n", " \n", " We should use flood depth as a key indicator of success, measuring the water depth on \n", " Main Street during storm events. Another important metric would be the number of properties \n", " with basement flooding per year.\n", " \"\"\",\n", " \n", " 'interview_002.txt': \"\"\"\n", " Interview with Local Business Owner - James Chen\n", " Date: March 18, 2024\n", " \n", " The flooding issue is directly related to new development upstream. Since they built \n", " the shopping center, our area gets much more runoff during storms.\n", " \n", " Our goal is to protect local businesses from flood damage while maintaining economic \n", " vitality downtown. 
We aim to preserve the historic character of our business district.\n", " \n", " The key objective here is to maximize stormwater retention upstream and minimize runoff \n", " reaching our downtown area. We need green infrastructure like retention ponds and \n", " permeable pavement as part of our strategy.\n", " \n", " The investment decision should consider both short-term fixes and long-term solutions. \n", " We face a major constraint - the shopping center owner won't participate unless required \n", " by regulation. We also have a time limit since hurricane season starts in June.\n", " \n", " I'd suggest tracking business interruption days as an indicator of improvement. We should \n", " measure economic damage in dollars per storm event. The depth of flooding in parking areas \n", " would be another useful metric to monitor progress.\n", " \"\"\",\n", " \n", " 'meeting_notes_001.txt': \"\"\"\n", " Community Stakeholder Meeting - Flood Mitigation Planning\n", " Date: March 22, 2024\n", " Attendees: 45 residents, city council members, county planning staff\n", " \n", " Meeting Summary:\n", " \n", " Community members reported increased flooding frequency over the past five years. \n", " Main concerns include inadequate drainage, upstream development impacts, and aging infrastructure.\n", " \n", " GOALS IDENTIFIED:\n", " - Protect residential and commercial properties from flood damage\n", " - Maintain neighborhood livability and safety during storm events \n", " - Preserve environmental quality of local waterways\n", " - Aim to restore pre-development runoff conditions\n", " \n", " OBJECTIVES DISCUSSED:\n", " - Minimize flood damage costs to the community\n", " - Reduce flood depth on critical roadways by 50%\n", " - Maximize green infrastructure implementation where feasible\n", " \n", " DECISION VARIABLES:\n", " - Infrastructure investment levels (ranging from $1M to $5M)\n", " - Strategy selection: gray infrastructure vs. green infrastructure vs. 
hybrid\n", " - Implementation timeline: phased over 3 years vs. comprehensive approach\n", " \n", " CONSTRAINTS IDENTIFIED:\n", " - Budget limit of $2.5 million from city general fund\n", " - Cannot disrupt traffic on Main Street for more than 2 weeks\n", " - Must comply with historic district design guidelines\n", " - Limited right-of-way for new infrastructure\n", " \n", " PERFORMANCE INDICATORS:\n", " - Measure flood depth at 5 key monitoring locations\n", " - Track number of flood events per year exceeding 6 inches\n", " - Calculate economic damage per storm event\n", " - Monitor basement flooding frequency as a key metric\n", " - Assess stormwater quality indicators (pollutant levels)\n", " \n", " Proposed solutions include comprehensive stormwater management systems, coordination \n", " with county planning on upstream development, and establishment of maintenance protocols.\n", " \n", " Next steps: Form technical committee to evaluate decision alternatives using \n", " multi-criteria decision analysis framework.\n", " \"\"\",\n", " \n", " 'stakeholder_report.txt': \"\"\"\n", " Stormwater Infrastructure Assessment Report\n", " Prepared by: City Engineering Department\n", " Date: April 1, 2024\n", " \n", " EXECUTIVE SUMMARY\n", " \n", " This report evaluates stormwater management alternatives for the downtown district \n", " experiencing chronic flooding issues.\n", " \n", " PROJECT GOALS:\n", " The overarching goal is to protect the community from flood hazards while preserving \n", " environmental resources. We aim to maintain infrastructure resilience under future \n", " climate conditions.\n", " \n", " SPECIFIC OBJECTIVES:\n", " 1. Minimize annual flood damage costs to less than $500,000\n", " 2. Reduce peak flood depths by 40% during 10-year storm events\n", " 3. 
Maximize community co-benefits (recreation, green space, water quality)\n", " \n", " DECISION FRAMEWORK:\n", " \n", " The primary decision variable is the selection of infrastructure investment strategy \n", " from three alternatives:\n", " \n", " Alternative A: Traditional gray infrastructure ($3.2M investment)\n", " Alternative B: Green infrastructure approach ($2.8M investment) \n", " Alternative C: Hybrid strategy ($3.5M investment)\n", " \n", " Implementation decisions also include phasing schedules and maintenance strategies.\n", " \n", " CONSTRAINTS:\n", " - Cannot exceed $3.5 million budget constraint\n", " - Must complete implementation within 24-month time limit\n", " - Cannot impact historic building foundations\n", " - Limited to existing public right-of-way areas\n", " \n", " PERFORMANCE METRICS:\n", " \n", " Key indicators for evaluating alternatives:\n", " - Maximum flood depth at critical intersections (target: <6 inches)\n", " - Frequency of road closures (metric: closures per year)\n", " - Economic damage per storm event (measured in dollars)\n", " - Stormwater volume captured (measured in acre-feet)\n", " - Cost-effectiveness indicator (damage reduced per dollar invested)\n", " \n", " RECOMMENDATIONS:\n", " \n", " Our objective is to reduce flood risk while maximizing return on investment. The decision \n", " should minimize lifecycle costs while achieving flood depth reduction goals. 
We aim to \n", " preserve flexibility for future adaptations as climate conditions change.\n", " \"\"\"\n", "}\n", "\n", "# Write files to directory\n", "for filename, content in sample_transcripts.items():\n", " filepath = data_dir / filename\n", " filepath.write_text(content.strip())\n", " print(f'\u2713 Created: {filename}')\n", "\n", "print(f'\\n\u2713 Created {len(sample_transcripts)} expanded sample documents in {data_dir}')\n", "print(f' Total words: {sum(len(content.split()) for content in sample_transcripts.values())}')" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Loaded 4 transcript files:\n", " - interview_001.txt\n", " - interview_002.txt\n", " - meeting_notes_001.txt\n", " - stakeholder_report.txt\n" ] } ], "source": [ "# Load all transcript files from the data directory\n", "transcripts = {}\n", "\n", "for filepath in sorted(data_dir.glob('*.txt')):\n", " transcripts[filepath.name] = filepath.read_text()\n", " \n", "print(f'\u2713 Loaded {len(transcripts)} transcript files:')\n", "for filename in transcripts.keys():\n", " print(f' - {filename}')" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "============================================================\n", "File: interview_001.txt\n", "============================================================\n", "Interview with Community Resident - Sarah Martinez\n", " Date: March 15, 2024\n", " \n", " We've been experiencing significant flooding in our neighborhood during heavy rains. 
\n", " The storm drains seem ina...\n", "\n", "============================================================\n", "File: interview_002.txt\n", "============================================================\n", "Interview with Local Business Owner - James Chen\n", " Date: March 18, 2024\n", " \n", " The flooding issue is directly related to new development upstream. Since they built \n", " the shopping center, our ar...\n", "\n", "============================================================\n", "File: meeting_notes_001.txt\n", "============================================================\n", "Community Stakeholder Meeting - Flood Mitigation Planning\n", " Date: March 22, 2024\n", " Attendees: 45 residents, city council members, county planning staff\n", " \n", " Meeting Summary:\n", " \n", " Community...\n", "\n", "============================================================\n", "File: stakeholder_report.txt\n", "============================================================\n", "Stormwater Infrastructure Assessment Report\n", " Prepared by: City Engineering Department\n", " Date: April 1, 2024\n", " \n", " EXECUTIVE SUMMARY\n", " \n", " This report evaluates stormwater management alterna...\n" ] } ], "source": [ "# Display preview of loaded transcripts\n", "for filename, content in transcripts.items():\n", " print(f'\\n{\"=\"*60}')\n", " print(f'File: {filename}')\n", " print(f'{\"=\"*60}')\n", " print(content[:200] + '...' 
if len(content) > 200 else content)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Transcript Statistics:\n", "\n", " File Characters Words Sentences\n", " interview_001.txt 1281 182 12\n", " interview_002.txt 1185 169 12\n", " meeting_notes_001.txt 2021 251 4\n", "stakeholder_report.txt 2122 249 10\n" ] } ], "source": [ "# Calculate basic statistics for each transcript\n", "import pandas as pd\n", "\n", "stats = []\n", "for filename, content in transcripts.items():\n", " stats.append({\n", " 'File': filename,\n", " 'Characters': len(content),\n", " 'Words': len(content.split()),\n", " 'Sentences': len(sent_tokenize(content))\n", " })\n", "\n", "stats_df = pd.DataFrame(stats)\n", "print('\u2713 Transcript Statistics:\\n')\n", "print(stats_df.to_string(index=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Discover Topics\n", "\n", "**What is Topic Modeling?**\n", "Topic modeling automatically identifies themes in your documents. \n", "It groups words that appear together frequently into topics.\n", "\n", "Example: If \"flooding\", \"water\", and \"drainage\" appear together, the topic might be about stormwater management.\n", "\n", "**How it works:**\n", "1. Break text into words\n", "2. Find patterns of co-occurring words\n", "3. Group related words into topics\n", "4. 
Assign topics to documents" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Analysis Parameters:\n", " \u2022 Number of topics: 5\n", " \u2022 Vocabulary size: 100\n", " \u2022 Keywords per topic: 6\n" ] } ], "source": [ "# Set analysis parameters\n", "n_topics = 5 # Change this number to discover more or fewer topics\n", "max_vocabulary = 100 # Maximum number of terms to consider\n", "top_words_display = 6 # Number of keywords to show per topic\n", "\n", "print(f'Analysis Parameters:')\n", "print(f' \u2022 Number of topics: {n_topics}')\n", "print(f' \u2022 Vocabulary size: {max_vocabulary}')\n", "print(f' \u2022 Keywords per topic: {top_words_display}')" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Text preprocessing complete\n", "\n", "Example: interview with community resident sarah martinez date march we ve been experiencing significant flooding in our neighborhood during heavy rains the st...\n" ] } ], "source": [ "# Preprocess text\n", "def preprocess_text(text):\n", " text = text.lower()\n", " text = re.sub(r'[^a-z\\s]', ' ', text)\n", " text = ' '.join(text.split())\n", " return text\n", "\n", "processed_docs = [preprocess_text(text) for text in transcripts.values()]\n", "doc_names = list(transcripts.keys())\n", "\n", "print('\u2713 Text preprocessing complete')\n", "print(f'\\nExample: {processed_docs[0][:150]}...')" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Discovering topics...\n", "\n", "\u2713 Vocabulary: 100 terms\n", "\u2713 Discovered 5 topics\n", "\n", "Topics Discovered:\n", "============================================================\n", "\n", "Topic 1: goal, constraint, goal protect\n", " Full keywords: goal, constraint, goal protect, objective, indicator, measure\n", 
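The four steps above can be tried end-to-end on a toy corpus before running them on real transcripts. This is a minimal sketch; the three one-line documents and the choice of 2 topics are made up for illustration and are independent of the cells in this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

toy_docs = [
    'flooding water drainage street flooding drainage',
    'budget funding grant council budget',
    'drainage water runoff storm water',
]

vec = CountVectorizer()            # steps 1-2: split into words, count co-occurrences
X = vec.fit_transform(toy_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # steps 3-4: group words into topics, assign to docs

# Each row is one document's mixture over the 2 topics and sums to 1
```

The same pattern, with TF-IDF features and more topics, is what the topic-discovery cell below applies to the loaded transcripts.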
"\n", "Topic 2: flood, infrastructure, main\n", " Full keywords: flood, infrastructure, main, storm, street, investment\n", "\n", "Topic 3: business, retention, downtown\n", " Full keywords: business, retention, downtown, local, runoff, upstream\n", "\n", "Topic 4: goal, constraint, goal protect\n", " Full keywords: goal, constraint, goal protect, objective, indicator, measure\n", "\n", "Topic 5: goal, constraint, goal protect\n", " Full keywords: goal, constraint, goal protect, objective, indicator, measure\n", "\n", "============================================================\n" ] } ], "source": [ "# Topic Discovery\n", "# Extract topics using Latent Dirichlet Allocation (LDA)\n", "print('Discovering topics...\\n')\n", "\n", "# Create document-term matrix\n", "vectorizer = TfidfVectorizer(\n", " max_features=max_vocabulary,\n", " stop_words='english',\n", " ngram_range=(1, 2)\n", ")\n", "doc_term_matrix = vectorizer.fit_transform(processed_docs)\n", "feature_names = vectorizer.get_feature_names_out()\n", "\n", "print(f'\u2713 Vocabulary: {len(feature_names)} terms')\n", "\n", "# Discover topics\n", "lda_model = LatentDirichletAllocation(\n", " n_components=n_topics,\n", " random_state=42,\n", " max_iter=20\n", ")\n", "doc_topic_dist = lda_model.fit_transform(doc_term_matrix)\n", "\n", "print(f'\u2713 Discovered {n_topics} topics\\n')\n", "\n", "# Extract and store topic information\n", "print('Topics Discovered:')\n", "print('=' * 60)\n", "\n", "topics_info = {}\n", "for idx, topic in enumerate(lda_model.components_):\n", " top_indices = topic.argsort()[-8:][::-1]\n", " top_words = [feature_names[i] for i in top_indices]\n", " \n", " # Create topic label with keywords\n", " topic_label = f\"Topic {idx + 1}: {', '.join(top_words[:3])}\"\n", " topics_info[f'Topic {idx + 1}'] = {\n", " 'label': topic_label,\n", " 'keywords': top_words\n", " }\n", " \n", " print(f'\\n{topic_label}')\n", " print(f' Full keywords: {\", \".join(top_words[:top_words_display])}')\n", 
"\n", "print('\\n' + '=' * 60)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating enhanced visualization...\n", "\n" ] }, { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hoverinfo": "text", "hovertext": [ "Topic 1: goal, constraint, goal protect<br>Document: interview_001<br>Proportion: 2.6%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 1: goal, constraint, goal protect<br>Document: interview_002<br>Proportion: 3.0%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 1: goal, constraint, goal protect<br>Document: meeting_notes_001<br>Proportion: 2.2%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 1: goal, constraint, goal protect<br>Document: stakeholder_report<br>Proportion: 2.4%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure" ], "name": "Topic 1: goal, constraint, goal protect", "text": [ "2.6%", "3.0%", "2.2%", "2.4%" ], "textposition": "auto", "type": "bar", "x": [ "interview_001", "interview_002", "meeting_notes_001", "stakeholder_report" ], "y": { "bdata": "+JntkDuMmj9m3FLC+4ieP9zqfftntpY/P6TEbMPmmD8=", "dtype": "f8" } }, { "hoverinfo": "text", "hovertext": [ "Topic 2: flood, infrastructure, main<br>Document: interview_001<br>Proportion: 89.6%<br>Keywords: flood, infrastructure, main, storm, street, investment", "Topic 2: flood, infrastructure, main<br>Document: interview_002<br>Proportion: 3.1%<br>Keywords: flood, infrastructure, main, storm, street, investment", "Topic 2: flood, infrastructure, main<br>Document: meeting_notes_001<br>Proportion: 91.1%<br>Keywords: flood, infrastructure, main, storm, street, investment", "Topic 2: flood, infrastructure, main<br>Document: stakeholder_report<br>Proportion: 90.3%<br>Keywords: flood, infrastructure, main, storm, street, investment" ], "name": "Topic 2: flood, infrastructure, main", "text": [ "89.6%", "3.1%", "91.1%", "90.3%" ], "textposition": "auto", "type": "bar", "x": [ "interview_001", "interview_002", "meeting_notes_001", "stakeholder_report" ], "y": { "bdata": "w/DSzpKt7D+dENTgjO2fP8Rb2yMvJ+0/yCiMsoDh7D8=", "dtype": "f8" } }, { "hoverinfo": "text", "hovertext": [ "Topic 3: business, retention, downtown<br>Document: interview_001<br>Proportion: 2.6%<br>Keywords: business, retention, downtown, local, runoff, upstream", "Topic 3: business, retention, downtown<br>Document: interview_002<br>Proportion: 87.9%<br>Keywords: business, retention, downtown, local, runoff, upstream", "Topic 3: business, retention, downtown<br>Document: meeting_notes_001<br>Proportion: 2.2%<br>Keywords: business, retention, downtown, local, runoff, upstream", "Topic 3: business, retention, downtown<br>Document: stakeholder_report<br>Proportion: 2.5%<br>Keywords: business, retention, downtown, local, runoff, upstream" ], "name": "Topic 3: business, retention, downtown", "text": [ "2.6%", "87.9%", "2.2%", "2.5%" ], "textposition": "auto", "type": "bar", "x": [ "interview_001", "interview_002", "meeting_notes_001", "stakeholder_report" ], "y": { "bdata": "SQrZcvOomj+SmsH+uyPsP0/FGpLj9pY/V/osaJ8bmT8=", "dtype": "f8" } }, { "hoverinfo": "text", "hovertext": [ "Topic 4: goal, constraint, goal protect<br>Document: interview_001<br>Proportion: 2.6%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 4: goal, constraint, goal protect<br>Document: interview_002<br>Proportion: 3.0%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 4: goal, constraint, goal protect<br>Document: meeting_notes_001<br>Proportion: 2.2%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 4: goal, constraint, goal protect<br>Document: stakeholder_report<br>Proportion: 2.4%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure" ], "name": "Topic 4: goal, constraint, goal protect", "text": [ "2.6%", "3.0%", "2.2%", "2.4%" ], "textposition": "auto", "type": "bar", "x": [ "interview_001", "interview_002", "meeting_notes_001", "stakeholder_report" ], "y": { "bdata": "v8vtkDuMmj8iAVPC+4ieP9v0fftntpY/Rq/EbMPmmD8=", "dtype": "f8" } }, { "hoverinfo": "text", "hovertext": [ "Topic 5: goal, constraint, goal protect<br>Document: interview_001<br>Proportion: 2.6%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 5: goal, constraint, goal protect<br>Document: interview_002<br>Proportion: 3.0%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 5: goal, constraint, goal protect<br>Document: meeting_notes_001<br>Proportion: 2.2%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure", "Topic 5: goal, constraint, goal protect<br>Document: stakeholder_report<br>Proportion: 2.4%<br>Keywords: goal, constraint, goal protect, objective, indicator, measure" ], "name": "Topic 5: goal, constraint, goal protect", "text": [ "2.6%", "3.0%", "2.2%", "2.4%" ], "textposition": "auto", "type": "bar", "x": [ "interview_001", "interview_002", "meeting_notes_001", "stakeholder_report" ], "y": { "bdata": "gnftkDuMmj+bv1LC+4ieP4LifftntpY/CJnEbMPmmD8=", "dtype": "f8" } } ], "layout": { "barmode": "stack", "height": 500, "legend": { "orientation": "v", "title": { "text": "Topics (hover for details)" }, "x": 1.02, "xanchor": "left", "y": 1, "yanchor": "top" }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, 
"#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { 
"outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, 
"#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", 
"title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Topic Distribution Across Documents (5 Topics)" }, "xaxis": { "title": { "text": "Document" } }, "yaxis": { "title": { "text": "Topic Proportion" } } } }, "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Visualization complete!\n", "\n", "\ud83d\udca1 Tip: To analyze with different number of topics, change n_topics in Cell 1 and re-run cells 3-4\n" ] } ], "source": [ "# Visualize Topics\n", "# Visualize topic distribution with labeled topics\n", "print('Creating enhanced visualization...\\n')\n", "\n", "# Prepare data with topic labels\n", "topic_df = pd.DataFrame(\n", " doc_topic_dist,\n", " columns=[f'Topic {i+1}' for i in range(n_topics)],\n", " index=[name.replace('.txt', '').replace('.json', '').replace('.docx', '') \n", " for name in doc_names]\n", ")\n", "\n", "fig = go.Figure()\n", "\n", "# Add bars with topic labels and hover information\n", "for topic_num in topic_df.columns:\n", " topic_info = topics_info[topic_num]\n", " \n", " # Create hover text with full keywords\n", " hover_text = [\n", " f\"{topic_info['label']}
\" +\n", " f\"Document: {doc}
\" +\n", " f\"Proportion: {val:.1%}
\" +\n", " f\"Keywords: {', '.join(topic_info['keywords'][:6])}\"\n", " for doc, val in zip(topic_df.index, topic_df[topic_num])\n", " ]\n", " \n", " fig.add_trace(go.Bar(\n", " name=topic_info['label'],\n", " x=topic_df.index,\n", " y=topic_df[topic_num],\n", " text=[f'{val:.1%}' for val in topic_df[topic_num]],\n", " textposition='auto',\n", " hovertext=hover_text,\n", " hoverinfo='text'\n", " ))\n", "\n", "fig.update_layout(\n", " title=f'Topic Distribution Across Documents ({n_topics} Topics)',\n", " xaxis_title='Document',\n", " yaxis_title='Topic Proportion',\n", " barmode='stack',\n", " height=500,\n", " legend=dict(\n", " title=\"Topics (hover for details)\",\n", " orientation=\"v\",\n", " yanchor=\"top\",\n", " y=1,\n", " xanchor=\"left\",\n", " x=1.02\n", " )\n", ")\n", "\n", "fig.show()\n", "print('\u2713 Visualization complete!')\n", "print(f'\\n\ud83d\udca1 Tip: To analyze with different number of topics, change n_topics in Cell 1 and re-run cells 3-4')" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Topic Summary:\n", "================================================================================\n", " Topic Top Keywords Avg Coverage\n", "Topic 1 goal, constraint, goal protect, objective, indicator, measure 2.6%\n", "Topic 2 flood, infrastructure, main, storm, street, investment 68.5%\n", "Topic 3 business, retention, downtown, local, runoff, upstream 23.8%\n", "Topic 4 goal, constraint, goal protect, objective, indicator, measure 2.6%\n", "Topic 5 goal, constraint, goal protect, objective, indicator, measure 2.6%\n", "================================================================================\n" ] } ], "source": [ "# Create summary table of topics\n", "print('\\nTopic Summary:')\n", "print('=' * 80)\n", "\n", "summary_data = []\n", "for topic_num, info in topics_info.items():\n", " summary_data.append({\n", " 'Topic': topic_num,\n", " 'Top Keywords': 
', '.join(info['keywords'][:top_words_display]),\n", " 'Avg Coverage': f\"{doc_topic_dist[:, int(topic_num.split()[1])-1].mean():.1%}\"\n", " })\n", "\n", "summary_df = pd.DataFrame(summary_data)\n", "print(summary_df.to_string(index=False))\n", "print('=' * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Map to Science Backbone\n", "\n", "**Logical flow:** setup \u2192 define backbone \u2192 map topics \u2192 visualize network\n", "\n", "**What is the Science Backbone?**\n", "\n", "Science is organized into domains, like a family tree:\n", "\n", "- Major domains: Environmental Science, Social Science, etc.\n", "- Subdisciplines: Hydrology, Economics, etc.\n", "- Specific topics: Flood modeling, cost-benefit analysis, etc.\n", "\n", "We map each topic to relevant scientific domains to show which fields are needed." ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Output directory: outputs\n", "\u2713 Case study name: community_analysis\n" ] } ], "source": [ "# Set up output directory and case study name\n", "OUTPUT_DIR = Path('outputs')\n", "OUTPUT_DIR.mkdir(exist_ok=True)\n", "\n", "CASE_STUDY_NAME = 'community_analysis' # Change this to match your case study\n", "\n", "print(f'\u2713 Output directory: {OUTPUT_DIR}')\n", "print(f'\u2713 Case study name: {CASE_STUDY_NAME}')" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Science Backbone Structure:\n", "============================================================\n", "\n", "Environmental Science:\n", " \u2022 Climate Science\n", " \u2022 Hydrology\n", " \u2022 Ecology\n", " \u2022 Coastal Systems\n", "\n", "Physical Science:\n", " \u2022 Oceanography\n", " \u2022 Atmospheric Science\n", " \u2022 Geology\n", " \u2022 Geophysics\n", "\n", "Social Science:\n", " \u2022 Economics\n", " \u2022 Urban Planning\n", " \u2022 Policy 
Analysis\n", " \u2022 Community Development\n", "\n", "Engineering:\n", " \u2022 Civil Engineering\n", " \u2022 Infrastructure Design\n", " \u2022 Water Resources\n", " \u2022 Risk Management\n", "\n", "Data Science:\n", " \u2022 Statistical Modeling\n", " \u2022 GIS Analysis\n", " \u2022 Machine Learning\n", " \u2022 Scenario Planning\n", "\n", "============================================================\n" ] } ], "source": [ "# Define science backbone structure\n", "science_backbone = {\n", " 'Environmental Science': [\n", " 'Climate Science',\n", " 'Hydrology',\n", " 'Ecology',\n", " 'Coastal Systems'\n", " ],\n", " 'Physical Science': [\n", " 'Oceanography',\n", " 'Atmospheric Science',\n", " 'Geology',\n", " 'Geophysics'\n", " ],\n", " 'Social Science': [\n", " 'Economics',\n", " 'Urban Planning',\n", " 'Policy Analysis',\n", " 'Community Development'\n", " ],\n", " 'Engineering': [\n", " 'Civil Engineering',\n", " 'Infrastructure Design',\n", " 'Water Resources',\n", " 'Risk Management'\n", " ],\n", " 'Data Science': [\n", " 'Statistical Modeling',\n", " 'GIS Analysis',\n", " 'Machine Learning',\n", " 'Scenario Planning'\n", " ]\n", "}\n", "\n", "print('Science Backbone Structure:')\n", "print('=' * 60)\n", "for domain, subdisciplines in science_backbone.items():\n", " print(f'\\n{domain}:')\n", " for sub in subdisciplines:\n", " print(f' \u2022 {sub}')\n", "print('\\n' + '=' * 60)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\ud83d\udd17 Mapping topics to domains...\n", "\n", "Topic-to-Domain Mappings:\n", "============================================================\n", "\n", "Topic 1\n", " Keywords: goal, constraint, goal protect, objective, indicator\n", " Primary Domain: General\n", "\n", "Topic 2\n", " Keywords: flood, infrastructure, main, storm, street\n", " Primary Domain: Social Science\n", " Secondary Domain: Engineering\n", "\n", "Topic 3\n", " Keywords: 
business, retention, downtown, local, runoff\n", " Primary Domain: Environmental Science\n", " Secondary Domain: Social Science\n", "\n", "Topic 4\n", " Keywords: goal, constraint, goal protect, objective, indicator\n", " Primary Domain: General\n", "\n", "Topic 5\n", " Keywords: goal, constraint, goal protect, objective, indicator\n", " Primary Domain: General\n", "\n", "============================================================\n", "\n", "\u2713 Saved: outputs/community_analysis_topic_mappings.csv\n" ] } ], "source": [ "# Map topics to domains\n", "def map_topics_to_domains(topics_info, science_backbone):\n", " \"\"\"Map discovered topics to scientific domains based on keyword matching\"\"\"\n", " domain_keywords = {\n", " 'Environmental Science': ['water', 'flooding', 'climate', 'coastal', 'wells', 'aquifer'],\n", " 'Physical Science': ['surge', 'level', 'rise', 'ocean', 'groundwater', 'subsidence'],\n", " 'Social Science': ['community', 'people', 'planning', 'economic', 'residents'],\n", " 'Engineering': ['infrastructure', 'drainage', 'systems', 'facilities', 'monitoring'],\n", " 'Data Science': ['model', 'data', 'scenarios', 'analysis', 'indicators']\n", " }\n", " \n", " mappings = []\n", " for topic_id, topic_data in topics_info.items():\n", " # Get topic keywords as a set\n", " topic_keywords = set(' '.join(topic_data['keywords']).lower().split())\n", " domain_scores = {}\n", " \n", " # Score each domain based on keyword matches\n", " for domain, keywords in domain_keywords.items():\n", " matches = topic_keywords.intersection(set(keywords))\n", " if matches:\n", " domain_scores[domain] = len(matches)\n", " \n", " # Get top 2 matching domains\n", " relevant = sorted(domain_scores.items(), key=lambda x: x[1], reverse=True)[:2]\n", " \n", " mappings.append({\n", " 'topic': topic_id,\n", " 'keywords': ', '.join(topic_data['keywords'][:5]),\n", " 'primary_domain': relevant[0][0] if relevant else 'General',\n", " 'secondary_domain': relevant[1][0] if 
len(relevant) > 1 else None\n", " })\n", " \n", " return mappings\n", "\n", "print('\ud83d\udd17 Mapping topics to domains...\\n')\n", "topic_mappings = map_topics_to_domains(topics_info, science_backbone)\n", "\n", "print('Topic-to-Domain Mappings:')\n", "print('=' * 60)\n", "for mapping in topic_mappings:\n", " print(f'\\n{mapping[\"topic\"]}')\n", " print(f' Keywords: {mapping[\"keywords\"]}')\n", " print(f' Primary Domain: {mapping[\"primary_domain\"]}')\n", " if mapping['secondary_domain']:\n", " print(f' Secondary Domain: {mapping[\"secondary_domain\"]}')\n", "print('\\n' + '=' * 60)\n", "\n", "# Save mappings to CSV\n", "mapping_df = pd.DataFrame(topic_mappings)\n", "mapping_path = OUTPUT_DIR / f'{CASE_STUDY_NAME}_topic_mappings.csv'\n", "mapping_df.to_csv(mapping_path, index=False)\n", "print(f'\\n\u2713 Saved: {mapping_path}')" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating network...\n", "\n" ] }, { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hoverinfo": "none", "line": { "color": "#888", "width": 0.5 }, "mode": "lines", "type": "scatter", "x": [ -0.08862721800110031, -0.6609213412026099, null, -0.08862721800110031, 0.9305162298150339, null, -0.08862721800110031, -0.4887569873610387, null, -0.08862721800110031, -0.6331115487727773, null, -0.08862721800110031, -0.8948556372237245, null, 0.928291683451839, 0.5997619837892582, null, 0.928291683451839, 0.5475233434607764, null, 0.928291683451839, -0.6953646455573029, null, 0.928291683451839, 0.17199317522877047, null, -0.5429061289579697, -0.9212504893830854, null, -0.5429061289579697, 0.33358632810126565, null, -0.5429061289579697, 0.7501395949868217, null, -0.5429061289579697, -0.4638537487878906, null, -0.5429061289579697, 0.35019858631750167, null, -0.7091529880641608, 0.888831282548372, null, -0.7091529880641608, -0.28821082662545916, null, 
-0.7091529880641608, 0.9082649132479864, null, -0.7091529880641608, -0.957977873238819, null, 0.4315108494268777, -0.8488548929126516, null, 0.4315108494268777, -0.23730545745976148, null, 0.4315108494268777, 0.6920204438087642, null, 0.4315108494268777, 0.17331578790032182, null, 1, 0.7072008898998369, null, 0.7072008898998369, -0.967305800525968, null, 0.7072008898998369, -0.014699507909107309, null ], "y": [ 0.8091188742964339, 0.8210775907681885, null, 0.8091188742964339, -0.3791037381522579, null, 0.8091188742964339, -0.8635221082349869, null, 0.8091188742964339, 0.602547517602519, null, 0.8091188742964339, -0.45741290168915905, null, -0.026398802886959005, -0.6481674971622888, null, -0.026398802886959005, -0.8231373303659012, null, -0.026398802886959005, -0.5941089546132075, null, -0.026398802886959005, 0.9125190735422782, null, -0.6421244404201198, 0.11356649515691372, null, -0.6421244404201198, -0.9647033470145588, null, -0.6421244404201198, -0.5863520919438714, null, -0.6421244404201198, 0.9050786042751627, null, -0.6421244404201198, 0.9205260001244188, null, 0.2657690371169924, 0.5401980420459978, null, 0.2657690371169924, -0.9583586203306582, null, 0.2657690371169924, -0.1947454673232831, null, 0.2657690371169924, -0.06817238836298029, null, 0.686966706317194, 0.5918869167687147, null, 0.686966706317194, 0.868037226410428, null, 0.686966706317194, 0.37089056241488355, null, 0.686966706317194, -0.9532631252973314, null, 0.22098387810852502, 0.7103931935224619, null, 0.7103931935224619, -0.24658181127274792, null, 0.7103931935224619, -0.9334070934008023, null ] }, { "marker": { "color": "#FF6B6B", "size": 20 }, "mode": "markers+text", "name": "Domain", "text": [ "Environmental Science", "Physical Science", "Social Science", "Engineering", "Data Science" ], "textposition": "top center", "type": "scatter", "x": [ -0.08862721800110031, 0.928291683451839, -0.5429061289579697, -0.7091529880641608, 0.4315108494268777 ], "y": [ 0.8091188742964339, 
-0.026398802886959005, -0.6421244404201198, 0.2657690371169924, 0.686966706317194 ] }, { "marker": { "color": "#4ECDC4", "size": 15 }, "mode": "markers+text", "name": "Subdiscipline", "text": [ "Climate Science", "Hydrology", "Ecology", "Coastal Systems", "Oceanography", "Atmospheric Science", "Geology", "Geophysics", "Economics", "Urban Planning", "Policy Analysis", "Community Development", "Civil Engineering", "Infrastructure Design", "Water Resources", "Risk Management", "Statistical Modeling", "GIS Analysis", "Machine Learning", "Scenario Planning" ], "textposition": "top center", "type": "scatter", "x": [ -0.6609213412026099, 0.9305162298150339, -0.4887569873610387, -0.6331115487727773, 0.5997619837892582, 0.5475233434607764, -0.6953646455573029, 0.17199317522877047, -0.9212504893830854, 0.33358632810126565, 0.7501395949868217, -0.4638537487878906, 0.888831282548372, -0.28821082662545916, 0.9082649132479864, -0.957977873238819, -0.8488548929126516, -0.23730545745976148, 0.6920204438087642, 0.17331578790032182 ], "y": [ 0.8210775907681885, -0.3791037381522579, -0.8635221082349869, 0.602547517602519, -0.6481674971622888, -0.8231373303659012, -0.5941089546132075, 0.9125190735422782, 0.11356649515691372, -0.9647033470145588, -0.5863520919438714, 0.9050786042751627, 0.5401980420459978, -0.9583586203306582, -0.1947454673232831, -0.06817238836298029, 0.5918869167687147, 0.868037226410428, 0.37089056241488355, -0.9532631252973314 ] }, { "marker": { "color": "#FFE66D", "size": 12 }, "mode": "markers+text", "name": "Topic", "text": [ "Topic 1", "Topic 2", "Topic 3", "Topic 4", "Topic 5" ], "textposition": "top center", "type": "scatter", "x": [ 1, 0.35019858631750167, -0.8948556372237245, -0.967305800525968, -0.014699507909107309 ], "y": [ 0.22098387810852502, 0.9205260001244188, -0.45741290168915905, -0.24658181127274792, -0.9334070934008023 ] } ], "layout": { "height": 600, "hovermode": "closest", "margin": { "b": 20, "l": 5, "r": 5, "t": 40 }, "showlegend": true, 
"template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 
0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, 
"ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", 
"#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Science Backbone Network with Topics (community_analysis)" }, "xaxis": { "showgrid": false, "showticklabels": false, "zeroline": false }, "yaxis": { "showgrid": false, "showticklabels": false, "zeroline": false } } }, "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\u2713 Saved: outputs/community_analysis_network.html\n" ] } ], "source": [ "# Create network visualization\n", "print('Creating network...\\n')\n", "\n", "G = nx.Graph()\n", "\n", "# Add domain nodes\n", "for domain in science_backbone.keys():\n", " G.add_node(domain, node_type='domain')\n", "\n", "# Add subdiscipline nodes and connect to domains\n", "for domain, subs in science_backbone.items():\n", " for sub in subs:\n", " G.add_node(sub, node_type='subdiscipline')\n", " G.add_edge(domain, sub)\n", "\n", "# Add topic nodes and connect to primary domains\n", "for mapping in topic_mappings:\n", " topic = mapping['topic']\n", " G.add_node(topic, node_type='topic')\n", " G.add_edge(mapping['primary_domain'], topic)\n", "\n", "# Calculate layout\n", "pos = nx.spring_layout(G, k=2, iterations=50, seed=42)\n", "\n", "# Create edge trace\n", "edge_trace = go.Scatter(\n", " x=[], \n", " y=[],\n", " line=dict(width=0.5, color='#888'),\n", " hoverinfo='none',\n", " mode='lines'\n", ")\n", "\n", "for edge in G.edges():\n", " x0, y0 = pos[edge[0]]\n", " x1, y1 = pos[edge[1]]\n", " edge_trace['x'] += (x0, x1, None)\n", " edge_trace['y'] += (y0, y1, None)\n", "\n", "# Create node traces by type\n", "colors = {\n", " 'domain': '#FF6B6B', \n", " 'subdiscipline': '#4ECDC4', \n", " 'topic': '#FFE66D'\n", "}\n", "sizes = {\n", " 'domain': 20, \n", " 'subdiscipline': 15, \n", " 'topic': 12\n", "}\n", "\n", "node_traces = []\n", "for ntype in ['domain', 'subdiscipline', 'topic']:\n", " nodes = [n for n, d in G.nodes(data=True) if d.get('node_type') == ntype]\n", " trace = go.Scatter(\n", " x=[pos[n][0] for n in nodes],\n", " y=[pos[n][1] for n in nodes],\n", " mode='markers+text',\n", " text=nodes,\n", " textposition='top center',\n", " marker=dict(size=sizes[ntype], color=colors[ntype]),\n", " name=ntype.title()\n", " )\n", " node_traces.append(trace)\n", "\n", "# 
Create figure\n", "fig = go.Figure(\n", " data=[edge_trace] + node_traces,\n", " layout=go.Layout(\n", " title=f'Science Backbone Network with Topics ({CASE_STUDY_NAME})',\n", " showlegend=True,\n", " hovermode='closest',\n", " margin=dict(b=20, l=5, r=5, t=40),\n", " xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),\n", " yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),\n", " height=600\n", " )\n", ")\n", "\n", "fig.show()\n", "\n", "# Save network as HTML\n", "network_path = OUTPUT_DIR / f'{CASE_STUDY_NAME}_network.html'\n", "fig.write_html(str(network_path))\n", "print(f'\\n\u2713 Saved: {network_path}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 5: Extract Decision Components\n", "\n", "Every decision problem has key parts:\n", "- Goals: What are we trying to achieve?\n", "- Objectives: Specific, measurable aims\n", "- Decision Variables: What can we control?\n", "- Constraints: Limits that any feasible solution must respect; they define the bounds of feasible and satisficing options\n", "- Indicators: How do we measure success?" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\ud83c\udfaf Extracting decision components...\n", "\n", "Decision Components:\n", "============================================================\n", "\n", "GOALS (10 found):\n", " 1. Our primary goal\n", " Source: interview_001.txt\n", " 2. our homes\n", " Source: interview_001.txt\n", " 3. property values\n", " Source: interview_001.txt\n", " 4. this neighborhood\n", " Source: interview_001.txt\n", " 5. safe access\n", " Source: interview_001.txt\n", "\n", "OBJECTIVES (10 found):\n", " 1. The main objective\n", " Source: interview_001.txt\n", " 2. flood damage\n", " Source: interview_001.txt\n", " 3. residential properties\n", " Source: interview_001.txt\n", " 4. the frequency\n", " Source: interview_001.txt\n", " 5. 
street closures\n", " Source: interview_001.txt\n", "\n", "VARIABLES (10 found):\n", " 1. different investment options\n", " Source: interview_001.txt\n", " 2. stormwater management\n", " Source: interview_001.txt\n", " 3. a budget constraint\n", " Source: interview_001.txt\n", " 4. the city \n", " council allocation\n", " Source: interview_001.txt\n", " 5. better drainage infrastructure\n", " Source: interview_001.txt\n", "\n", "CONSTRAINTS (10 found):\n", " 1. different investment options\n", " Source: interview_001.txt\n", " 2. stormwater management\n", " Source: interview_001.txt\n", " 3. a budget constraint\n", " Source: interview_001.txt\n", " 4. the city \n", " council allocation\n", " Source: interview_001.txt\n", " 5. the current \n", " budget\n", " Source: interview_001.txt\n", "\n", "INDICATORS (10 found):\n", " 1. The main objective\n", " Source: interview_001.txt\n", " 2. flood damage\n", " Source: interview_001.txt\n", " 3. residential properties\n", " Source: interview_001.txt\n", " 4. the frequency\n", " Source: interview_001.txt\n", " 5. 
street closures\n", " Source: interview_001.txt\n", "\n", "============================================================\n" ] } ], "source": [ "# Extract decision components from transcripts\n", "def extract_decision_components(documents):\n", " \"\"\"Extract decision-making components from text using NLP\"\"\"\n", " components = {\n", " 'goals': [],\n", " 'objectives': [],\n", " 'variables': [],\n", " 'constraints': [],\n", " 'indicators': []\n", " }\n", " \n", " # Keyword patterns for each component type\n", " patterns = {\n", " 'goals': ['goal', 'aim', 'protect', 'maintain', 'preserve'],\n", " 'objectives': ['objective', 'minimize', 'maximize', 'reduce'],\n", " 'variables': ['investment', 'decision', 'strategy', 'implementation'],\n", " 'constraints': ['constraint', 'limit', 'budget', 'cannot'],\n", " 'indicators': ['indicator', 'measure', 'metric', 'depth', 'damage']\n", " }\n", " \n", " # Process each document\n", " for doc_name, text in documents.items():\n", " doc = nlp(text)\n", " for sent in doc.sents:\n", " sent_text = sent.text.lower()\n", " \n", " # Check each component type\n", " for comp_type, keywords in patterns.items():\n", " if any(kw in sent_text for kw in keywords):\n", " # Extract noun chunks as potential components\n", " for chunk in sent.noun_chunks:\n", " if len(chunk.text.split()) >= 2:\n", " components[comp_type].append({\n", " 'text': chunk.text,\n", " 'source': doc_name,\n", " 'context': sent.text[:100]\n", " })\n", " \n", " # Remove duplicates and limit to top 10 per type\n", " for comp_type in components:\n", " seen = set()\n", " unique = []\n", " for item in components[comp_type]:\n", " if item['text'] not in seen:\n", " seen.add(item['text'])\n", " unique.append(item)\n", " components[comp_type] = unique[:10]\n", " \n", " return components\n", "\n", "print('\ud83c\udfaf Extracting decision components...\\n')\n", "decision_components = extract_decision_components(transcripts)\n", "\n", "print('Decision Components:')\n", "print('=' * 
60)\n", "for comp_type, items in decision_components.items():\n", " print(f'\\n{comp_type.upper()} ({len(items)} found):')\n", " for i, item in enumerate(items[:5], 1):\n", " print(f' {i}. {item[\"text\"]}')\n", " print(f' Source: {item[\"source\"]}')\n", "print('\\n' + '=' * 60)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "marker": { "color": [ "#FF6B6B", "#4ECDC4", "#45B7D1", "#FFA07A", "#98D8C8" ] }, "text": [ "10", "10", "10", "10", "10" ], "textposition": "auto", "type": "bar", "x": [ "goals", "objectives", "variables", "constraints", "indicators" ], "y": [ 10, 10, 10, 10, 10 ] } ], "layout": { "height": 400, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], 
"type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 
0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" 
], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": 
"#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Decision Components Identified" }, "xaxis": { "title": { "text": "Component Type" } }, "yaxis": { "title": { "text": "Count" } } } }, "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Visualization complete!\n" ] } ], "source": [ "# Visualize component distribution\n", "component_counts = {k: len(v) for k, v in decision_components.items()}\n", "\n", "fig = go.Figure([\n", " go.Bar(\n", " x=list(component_counts.keys()),\n", " y=list(component_counts.values()),\n", " text=list(component_counts.values()),\n", " textposition='auto',\n", " marker=dict(\n", " color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#98D8C8']\n", " )\n", " )\n", "])\n", "\n", "fig.update_layout(\n", " title='Decision Components Identified',\n", " xaxis_title='Component Type',\n", " yaxis_title='Count',\n", " height=400\n", ")\n", "\n", "fig.show()\n", "\n", "print('\u2713 Visualization complete!')" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Saved 50 components to: outputs/community_analysis_components.csv\n", "\n", "Component breakdown:\n", " \u2022 goals: 10\n", " \u2022 objectives: 10\n", " \u2022 variables: 10\n", " \u2022 constraints: 10\n", " \u2022 indicators: 10\n" ] } ], "source": [ "# Save all components to CSV\n", "all_components = []\n", "for comp_type, items in decision_components.items():\n", " for item in items:\n", " all_components.append({\n", " 'component_type': comp_type,\n", " 'text': item['text'],\n", " 'source': item['source'],\n", " 'context': item['context']\n", " })\n", "\n", "components_df = pd.DataFrame(all_components)\n", "comp_path = OUTPUT_DIR / f'{CASE_STUDY_NAME}_components.csv'\n", "components_df.to_csv(comp_path, index=False)\n", "\n", "print(f'\u2713 Saved {len(all_components)} components to: {comp_path}')\n", "print(f'\\nComponent breakdown:')\n", "for comp_type, count in component_counts.items():\n", " print(f' \u2022 {comp_type}: {count}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Step 6: Link to 
Scientific Variables\n", "\n", "**What are Scientific Variable Objects (SVOs)?**\n", "\n", "An SVO is a standardized way to describe a measurable quantity:\n", "- Variable Name: What we measure (e.g., \"Water Level\")\n", "- Units: How we measure it (e.g., \"meters\")\n", "- Data Source: Where we get data (e.g., \"USGS gauges\")\n", "- Standard Name: Scientific terminology\n", "\n", "This creates a translation table between everyday language and scientific variables." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scientific Variable Object (SVO) Vocabulary:\n", "============================================================\n", "\n", "water_level:\n", " Standard: surface_water_elevation\n", " Units: meters\n", " Domain: Hydrology\n", "\n", "precipitation:\n", " Standard: rainfall_rate\n", " Units: mm/hour\n", " Domain: Climate Science\n", "\n", "groundwater_level:\n", " Standard: depth_to_groundwater\n", " Units: meters below surface\n", " Domain: Hydrology\n", "\n", "sea_level:\n", " Standard: sea_surface_height\n", " Units: meters\n", " Domain: Oceanography\n", "\n", "population_at_risk:\n", " Standard: exposed_population\n", " Units: count\n", " Domain: Social Science\n", "\n", "economic_damage:\n", " Standard: flood_damage_cost\n", " Units: USD\n", " Domain: Economics\n", "\n", "infrastructure_vulnerability:\n", " Standard: critical_infrastructure_exposure\n", " Units: index\n", " Domain: Engineering\n", "\n", "============================================================\n" ] } ], "source": [ "# Define Scientific Variable Object (SVO) vocabulary\n", "# Customize this vocabulary for your specific domain!\n", "\n", "svo_vocabulary = {\n", " 'water_level': {\n", " 'standard_name': 'surface_water_elevation',\n", " 'units': 'meters',\n", " 'data_source': 'USGS stream gauges',\n", " 'keywords': ['water', 'level', 'flood', 'depth'],\n", " 'domain': 'Hydrology'\n", " },\n", " 
'precipitation': {\n", " 'standard_name': 'rainfall_rate',\n", " 'units': 'mm/hour',\n", " 'data_source': 'NOAA precipitation network',\n", " 'keywords': ['rain', 'rainfall', 'precipitation', 'storm'],\n", " 'domain': 'Climate Science'\n", " },\n", " 'groundwater_level': {\n", " 'standard_name': 'depth_to_groundwater',\n", " 'units': 'meters below surface',\n", " 'data_source': 'USGS groundwater monitoring',\n", " 'keywords': ['groundwater', 'aquifer', 'wells'],\n", " 'domain': 'Hydrology'\n", " },\n", " 'sea_level': {\n", " 'standard_name': 'sea_surface_height',\n", " 'units': 'meters',\n", " 'data_source': 'NOAA tide gauges',\n", " 'keywords': ['sea level', 'ocean', 'tide', 'surge'],\n", " 'domain': 'Oceanography'\n", " },\n", " 'population_at_risk': {\n", " 'standard_name': 'exposed_population',\n", " 'units': 'count',\n", " 'data_source': 'Census data',\n", " 'keywords': ['people', 'population', 'residents'],\n", " 'domain': 'Social Science'\n", " },\n", " 'economic_damage': {\n", " 'standard_name': 'flood_damage_cost',\n", " 'units': 'USD',\n", " 'data_source': 'HAZUS assessments',\n", " 'keywords': ['damage', 'cost', 'economic', 'loss'],\n", " 'domain': 'Economics'\n", " },\n", " 'infrastructure_vulnerability': {\n", " 'standard_name': 'critical_infrastructure_exposure',\n", " 'units': 'index',\n", " 'data_source': 'Infrastructure inventories',\n", " 'keywords': ['infrastructure', 'facilities', 'buildings'],\n", " 'domain': 'Engineering'\n", " }\n", "}\n", "\n", "print('Scientific Variable Object (SVO) Vocabulary:')\n", "print('=' * 60)\n", "for svo_name, svo_info in svo_vocabulary.items():\n", " print(f'\\n{svo_name}:')\n", " print(f' Standard: {svo_info[\"standard_name\"]}')\n", " print(f' Units: {svo_info[\"units\"]}')\n", " print(f' Domain: {svo_info[\"domain\"]}')\n", "print('\\n' + '=' * 60)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\ud83d\udd17 Creating 
semantic links...\n", "\n", "\u2713 Created 11 unique semantic links\n", "\n", "Sample Links (Natural Language \u2192 Scientific Variables):\n", "============================================================\n", "\n", "1. \"water\" \u2192\n", " Variable: water_level\n", " Standard: surface_water_elevation\n", " Units: meters\n", " Domain: Hydrology\n", "\n", "2. \"flood\" \u2192\n", " Variable: water_level\n", " Standard: surface_water_elevation\n", " Units: meters\n", " Domain: Hydrology\n", "\n", "3. \"depth\" \u2192\n", " Variable: water_level\n", " Standard: surface_water_elevation\n", " Units: meters\n", " Domain: Hydrology\n", "\n", "4. \"rain\" \u2192\n", " Variable: precipitation\n", " Standard: rainfall_rate\n", " Units: mm/hour\n", " Domain: Climate Science\n", "\n", "5. \"storm\" \u2192\n", " Variable: precipitation\n", " Standard: rainfall_rate\n", " Units: mm/hour\n", " Domain: Climate Science\n", "\n", "6. \"damage\" \u2192\n", " Variable: economic_damage\n", " Standard: flood_damage_cost\n", " Units: USD\n", " Domain: Economics\n", "\n", "============================================================\n", "\n", "\u2713 Saved: outputs/community_analysis_svo_mappings.csv\n" ] } ], "source": [ "# Create semantic links between natural language and scientific variables\n", "def create_svo_mappings(documents, svo_vocabulary):\n", " \"\"\"Map natural language terms to standardized scientific variables\"\"\"\n", " mappings = []\n", " \n", " for doc_name, text in documents.items():\n", " text_lower = text.lower()\n", " \n", " # Check each scientific variable\n", " for svo_name, svo_info in svo_vocabulary.items():\n", " # Look for keyword matches\n", " for keyword in svo_info['keywords']:\n", " if keyword in text_lower:\n", " # Find context sentence\n", " sentences = sent_tokenize(text)\n", " context = ''\n", " for sent in sentences:\n", " if keyword in sent.lower():\n", " context = sent\n", " break\n", " \n", " mappings.append({\n", " 'natural_language_term': 
keyword,\n", " 'scientific_variable': svo_name,\n", " 'standard_name': svo_info['standard_name'],\n", " 'units': svo_info['units'],\n", " 'domain': svo_info['domain'],\n", " 'data_source': svo_info['data_source'],\n", " 'source_document': doc_name,\n", " 'context': context[:150]\n", " })\n", " \n", " return mappings\n", "\n", "print('\ud83d\udd17 Creating semantic links...\\n')\n", "svo_mappings = create_svo_mappings(transcripts, svo_vocabulary)\n", "\n", "# Remove duplicates\n", "unique_mappings = []\n", "seen = set()\n", "for mapping in svo_mappings:\n", " key = (mapping['natural_language_term'], mapping['scientific_variable'])\n", " if key not in seen:\n", " seen.add(key)\n", " unique_mappings.append(mapping)\n", "\n", "print(f'\u2713 Created {len(unique_mappings)} unique semantic links\\n')\n", "\n", "print('Sample Links (Natural Language \u2192 Scientific Variables):')\n", "print('=' * 60)\n", "for i, mapping in enumerate(unique_mappings[:6], 1):\n", " print(f'\\n{i}. \"{mapping[\"natural_language_term\"]}\" \u2192')\n", " print(f' Variable: {mapping[\"scientific_variable\"]}')\n", " print(f' Standard: {mapping[\"standard_name\"]}')\n", " print(f' Units: {mapping[\"units\"]}')\n", " print(f' Domain: {mapping[\"domain\"]}')\n", "print('\\n' + '=' * 60)\n", "\n", "# Save mappings to CSV\n", "svo_df = pd.DataFrame(unique_mappings)\n", "svo_path = OUTPUT_DIR / f'{CASE_STUDY_NAME}_svo_mappings.csv'\n", "svo_df.to_csv(svo_path, index=False)\n", "print(f'\\n\u2713 Saved: {svo_path}')" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "hovertemplate": "labels=%{label}
value=%{value}
parent=%{parent}", "labels": [ "water_level", "water_level", "water_level", "precipitation", "precipitation", "economic_damage", "infrastructure_vulnerability", "economic_damage", "water_level", "population_at_risk", "economic_damage", "Hydrology", "Economics", "Climate Science", "Engineering", "Social Science" ], "name": "", "parents": [ "Hydrology", "Hydrology", "Hydrology", "Climate Science", "Climate Science", "Economics", "Engineering", "Economics", "Hydrology", "Social Science", "Economics", "", "", "", "", "" ], "type": "sunburst", "values": { "bdata": "AQEBAQEBAQEBAQEEAwIBAQ==", "dtype": "i1" } } ], "layout": { "height": 500, "legend": { "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": 
"contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { 
"outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 
0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": 
"" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Scientific Variables by Domain" } } }, "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u2713 Visualization complete!\n", "\u2713 Saved: outputs/community_analysis_svo_sunburst.html\n", "\n", "\ud83d\udcca Summary:\n", " Total semantic links: 11\n", " Domains covered: 5\n", "\n", " Links by domain:\n", " \u2022 Hydrology: 4\n", " \u2022 Economics: 3\n", " \u2022 Climate Science: 2\n", " \u2022 Engineering: 1\n", " \u2022 Social Science: 1\n" ] } ], "source": [ "# Visualize scientific variable coverage by domain\n", "domain_counts = svo_df['domain'].value_counts().to_dict()\n", "\n", "# Prepare data for sunburst chart\n", "sunburst_data = []\n", "\n", "# Add individual variables\n", "for mapping in unique_mappings:\n", " sunburst_data.append({\n", " 'labels': mapping['scientific_variable'],\n", " 'parents': mapping['domain'],\n", " 'values': 1\n", " })\n", "\n", "# Add domain totals\n", "for domain in domain_counts.keys():\n", " sunburst_data.append({\n", " 'labels': domain,\n", " 'parents': '',\n", " 'values': domain_counts[domain]\n", " })\n", "\n", "df_sunburst = pd.DataFrame(sunburst_data)\n", "\n", "# Create sunburst visualization\n", "fig = px.sunburst(\n", " df_sunburst,\n", " names='labels',\n", " parents='parents',\n", " values='values',\n", " title='Scientific Variables by Domain',\n", " height=500\n", ")\n", "\n", "fig.show()\n", "\n", "# Save visualization\n", "sunburst_path = OUTPUT_DIR / f'{CASE_STUDY_NAME}_svo_sunburst.html'\n", "fig.write_html(str(sunburst_path))\n", "\n", "print('\u2713 Visualization complete!')\n", "print(f'\u2713 Saved: {sunburst_path}')\n", "\n", "# Print summary statistics\n", "print(f'\\n\ud83d\udcca Summary:')\n", "print(f' Total semantic links: {len(unique_mappings)}')\n", "print(f' Domains covered: {len(domain_counts)}')\n", "print(f'\\n Links by domain:')\n", "for domain, count in sorted(domain_counts.items(), key=lambda x: x[1], reverse=True):\n", " print(f' \u2022 {domain}: {count}')" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "The **\"Scientific Variables by Domain\"** sunburst visualization shows the semantic bridge between stakeholder narratives and scientific concepts. Here's what it reveals:\n", "\n", "## What the Figure Shows:\n", "\n", "**Inner Ring (Center):** Scientific domains like Hydrology, Climate Science, Social Science, Economics, Engineering, and Oceanography\n", "\n", "**Outer Ring:** Individual scientific variables nested within each domain, such as:\n", "- `water_level` and `groundwater_level` under Hydrology\n", "- `precipitation` under Climate Science \n", "- `economic_damage` under Economics\n", "- `infrastructure_vulnerability` under Engineering\n", "\n", "**Segment Size:** Proportional to how frequently each variable/domain appears in the stakeholder transcripts - larger segments indicate terms that stakeholders mentioned more often\n", "\n", "## What It Tells You:\n", "\n", "1. **Problem Scope:** Which scientific disciplines are relevant to the community's concerns - is it primarily a hydrological problem? Or does it span multiple domains?\n", "\n", "2. **Data Needs:** What types of scientific data and monitoring would support this decision-making process (e.g., if Hydrology dominates, you need stream gauges and groundwater monitoring)\n", "\n", "3. **Interdisciplinary Nature:** How many different domains are involved, indicating the need for cross-disciplinary collaboration\n", "\n", "4. 
**Stakeholder Priorities:** Which scientific concepts align with what the community actually cares about - their natural language about \"flooding\" and \"damage\" maps to specific measurable variables\n", "\n", "## For TACC Users:\n", "\n", "This visualization helps you identify:\n", "- Which computational models you need (hydrological, economic, infrastructure assessment)\n", "- What datasets to access from TACC resources or external sources\n", "- Which scientific experts should be engaged\n", "- How to structure your decision support system to match stakeholder mental models\n", "\n", "It's essentially showing **which science is needed to answer the questions your community is asking**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 7: Generate Summary Report\n", "Complete the semantic bridge analysis!" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "SEMANTIC BRIDGE ANALYSIS COMPLETE\n", "======================================================================\n", "\n", " Analysis Statistics:\n", " \u2022 Documents Analyzed: 4\n", " \u2022 Topics Identified: 5\n", " \u2022 Scientific Domains: 3\n", " \u2022 Decision Components: 50\n", " \u2022 Scientific Variables: 5\n", "\n", "\ud83d\udcc1 Output Files Generated:\n", " \u2022 community_analysis_topic_mappings.csv\n", " \u2514\u2500 Topic-to-domain mappings\n", " \u2022 community_analysis_components.csv\n", " \u2514\u2500 Decision components\n", " \u2022 community_analysis_svo_mappings.csv\n", " \u2514\u2500 SVO semantic links\n", " \u2022 community_analysis_network.html\n", " \u2514\u2500 Interactive network visualization\n", " \u2022 community_analysis_svo_sunburst.html\n", " \u2514\u2500 Domain coverage visualization\n", " \u2022 community_analysis_report.md\n", " \u2514\u2500 Comprehensive analysis report\n", "\n", 
"======================================================================\n", "\u2713 All files saved to outputs/ folder\n", "======================================================================\n", "\n", "Next Actions:\n", "\n", " For Your Own Project:\n", " 1. Replace sample documents with your own files\n", " 2. Customize science_backbone and svo_vocabulary \n", " 3. Adjust n_topics parameter as needed\n", " 4. Re-run all cells\n", "\n", " For TACC Integration:\n", " - Upload outputs to CKAN data portal\n", " - Link to computational models via APIs\n", " - Scale analysis using HPC resources\n", "\n" ] } ], "source": [ "# Generate comprehensive analysis report\n", "from datetime import datetime\n", "\n", "summary = f'''# Semantic Bridge Analysis Report\n", "\n", "## {CASE_STUDY_NAME}\n", "\n", "**Generated:** {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}\n", "\n", "---\n", "\n", "## Analysis Summary\n", "\n", "- **Documents Analyzed:** {len(transcripts)}\n", "- **Topics Identified:** {n_topics}\n", "- **Scientific Domains:** {len(set(m[\"primary_domain\"] for m in topic_mappings))}\n", "- **Decision Components:** {sum(len(v) for v in decision_components.values())}\n", "- **Scientific Variables:** {len(set(m[\"scientific_variable\"] for m in unique_mappings))}\n", "\n", "---\n", "\n", "## Key Findings\n", "\n", "### Topics Discovered\n", "\n", "'''\n", "\n", "for mapping in topic_mappings:\n", " summary += f'''**{mapping[\"topic\"]}**\n", "- Keywords: {mapping[\"keywords\"]}\n", "- Primary Domain: {mapping[\"primary_domain\"]}\n", "'''\n", " if mapping[\"secondary_domain\"]:\n", " summary += f'- Secondary Domain: {mapping[\"secondary_domain\"]}\\n'\n", " summary += '\\n'\n", "\n", "summary += '''---\n", "\n", "### Scientific Domains Engaged\n", "\n", "'''\n", "\n", "domain_summary = svo_df.groupby('domain').size().sort_values(ascending=False)\n", "for domain, count in domain_summary.items():\n", " summary += f'- **{domain}:** {count} variables\\n'\n", "\n", 
"summary += '''\n", "---\n", "\n", "### Decision Components Extracted\n", "\n", "'''\n", "\n", "for comp_type, items in decision_components.items():\n", " if len(items) > 0:\n", " summary += f'\\n**{comp_type.title()}** ({len(items)}): '\n", " summary += ', '.join([item['text'] for item in items[:3]])\n", " if len(items) > 3:\n", " summary += f' ... (+{len(items)-3} more)'\n", " summary += '\\n'\n", "\n", "summary += '''\n", "---\n", "\n", "## Outputs Generated\n", "\n", "The following files have been created in the `outputs/` directory:\n", "\n", "1. **Topic Mappings:** Links discovered topics to scientific domains\n", "2. **Decision Components:** Extracted goals, objectives, variables, constraints, and indicators\n", "3. **SVO Mappings:** Semantic links between natural language and scientific variables\n", "4. **Network Visualization:** Interactive visualization of the science backbone\n", "5. **Domain Coverage:** Sunburst visualization of scientific variables by domain\n", "6. **Analysis Report:** This comprehensive summary document\n", "\n", "---\n", "\n", "## Next Steps\n", "\n", "1. **Validate results** with domain experts and stakeholders\n", "2. **Refine vocabularies** (science_backbone and svo_vocabulary) based on feedback\n", "3. **Integrate** with computational models and decision support systems\n", "4. **Iterate** the analysis with additional documents or refined parameters\n", "5. **Deploy** as part of a larger decision pathways framework\n", "\n", "---\n", "\n", "## How to Customize for Your Project\n", "\n", "### For Your Own Data:\n", "1. Replace sample documents with your `.txt`, `.json`, `.docx` files in `data/transcripts/`\n", "2. Adjust `n_topics` parameter in Step 3 to match your data complexity\n", "3. Re-run all cells from Step 2 onward\n", "\n", "### For Your Domain:\n", "1. Customize `science_backbone` structure (Step 4) for your field\n", "2. Expand `svo_vocabulary` (Step 6) with domain-specific variables\n", "3. 
Modify decision component patterns (Step 5) if needed\n", "\n", "### For Advanced Use:\n", "- Integrate with TACC HPC resources for larger document sets\n", "- Connect to live data sources (APIs, databases)\n", "- Export to decision support platforms (e.g., CKAN portals)\n", "\n", "'''\n", "\n", "# Save report\n", "report_path = OUTPUT_DIR / f'{CASE_STUDY_NAME}_report.md'\n", "with open(report_path, 'w') as f:\n", " f.write(summary)\n", "\n", "print('=' * 70)\n", "print('SEMANTIC BRIDGE ANALYSIS COMPLETE')\n", "print('=' * 70)\n", "\n", "print(f'\\n Analysis Statistics:')\n", "print(f' \u2022 Documents Analyzed: {len(transcripts)}')\n", "print(f' \u2022 Topics Identified: {n_topics}')\n", "print(f' \u2022 Scientific Domains: {len(set(m[\"primary_domain\"] for m in topic_mappings))}')\n", "print(f' \u2022 Decision Components: {sum(len(v) for v in decision_components.values())}')\n", "print(f' \u2022 Scientific Variables: {len(set(m[\"scientific_variable\"] for m in unique_mappings))}')\n", "\n", "print(f'\\n\ud83d\udcc1 Output Files Generated:')\n", "output_files = [\n", " (mapping_path.name, 'Topic-to-domain mappings'),\n", " (comp_path.name, 'Decision components'),\n", " (svo_path.name, 'SVO semantic links'),\n", " (network_path.name, 'Interactive network visualization'),\n", " (sunburst_path.name, 'Domain coverage visualization'),\n", " (report_path.name, 'Comprehensive analysis report')\n", "]\n", "\n", "for filename, description in output_files:\n", " print(f' \u2022 {filename}')\n", " print(f' \u2514\u2500 {description}')\n", "\n", "print('\\n' + '=' * 70)\n", "print('\u2713 All files saved to outputs/ folder')\n", "print('=' * 70)\n", "\n", "print('''\n", "Next Actions:\n", "\n", " For Your Own Project:\n", " 1. Replace sample documents with your own files\n", " 2. Customize science_backbone and svo_vocabulary \n", " 3. Adjust n_topics parameter as needed\n", " 4. 
Re-run all cells\n", "\n", " For TACC Integration:\n", " - Upload outputs to CKAN data portal\n", " - Link to computational models via APIs\n", " - Scale analysis using HPC resources\n", "''')" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\ud83d\udcc4 REPORT PREVIEW\n", "======================================================================\n", "\n", "First 1000 characters of generated report:\n", "\n", "# Semantic Bridge Analysis Report\n", "\n", "## community_analysis\n", "\n", "**Generated:** 2025-11-12 22:22\n", "\n", "---\n", "\n", "## Analysis Summary\n", "\n", "- **Documents Analyzed:** 4\n", "- **Topics Identified:** 5\n", "- **Scientific Domains:** 3\n", "- **Decision Components:** 50\n", "- **Scientific Variables:** 5\n", "\n", "---\n", "\n", "## Key Findings\n", "\n", "### Topics Discovered\n", "\n", "**Topic 1**\n", "- Keywords: goal, constraint, goal protect, objective, indicator\n", "- Primary Domain: General\n", "\n", "**Topic 2**\n", "- Keywords: flood, infrastructure, main, storm, street\n", "- Primary Domain: Social Science\n", "- Secondary Domain: Engineering\n", "\n", "**Topic 3**\n", "- Keywords: business, retention, downtown, local, runoff\n", "- Primary Domain: Environmental Science\n", "- Secondary Domain: Social Science\n", "\n", "**Topic 4**\n", "- Keywords: goal, constraint, goal protect, objective, indicator\n", "- Primary Domain: General\n", "\n", "**Topic 5**\n", "- Keywords: goal, constraint, goal protect, objective, indicator\n", "- Primary Domain: General\n", "\n", "---\n", "\n", "### Scientific Domains Engaged\n", "\n", "- **Hydrology:** 4 variables\n", "- **Economics:** 3 variables\n", "- **Clima\n", "\n", "...\n", "\n", "\n", "\u2713 Full report available at: outputs/community_analysis_report.md\n", "\n", " View report: open outputs/community_analysis_report.md\n" ] } ], "source": [ "# Display report preview\n", 
"print('\\n\ud83d\udcc4 REPORT PREVIEW')\n", "print('=' * 70)\n", "print('\\nFirst 1000 characters of generated report:\\n')\n", "print(summary[:1000])\n", "print('\\n...\\n')\n", "print(f'\\n\u2713 Full report available at: {report_path}')\n", "print(f'\\n View report: open {report_path}')" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\ud83d\udccb QUICK REFERENCE SUMMARY\n", "======================================================================\n", " Analysis Component Count Output File\n", " Input Documents 4 N/A\n", " Topics Discovered 5 community_analysis_topic_mappings.csv\n", " Scientific Domains 3 community_analysis_network.html\n", " Decision Components 50 community_analysis_components.csv\n", "Scientific Variables 5 community_analysis_svo_mappings.csv\n", " Semantic Links 11 community_analysis_svo_mappings.csv\n", "======================================================================\n", "\n", "\u2713 Saved quick reference: outputs/community_analysis_summary_table.csv\n" ] } ], "source": [ "# Create quick reference summary table\n", "quick_ref = pd.DataFrame({\n", " 'Analysis Component': [\n", " 'Input Documents',\n", " 'Topics Discovered',\n", " 'Scientific Domains',\n", " 'Decision Components',\n", " 'Scientific Variables',\n", " 'Semantic Links'\n", " ],\n", " 'Count': [\n", " len(transcripts),\n", " n_topics,\n", " len(set(m[\"primary_domain\"] for m in topic_mappings)),\n", " sum(len(v) for v in decision_components.values()),\n", " len(set(m[\"scientific_variable\"] for m in unique_mappings)),\n", " len(unique_mappings)\n", " ],\n", " 'Output File': [\n", " 'N/A',\n", " mapping_path.name,\n", " network_path.name,\n", " comp_path.name,\n", " svo_path.name,\n", " svo_path.name\n", " ]\n", "})\n", "\n", "print('\\n\ud83d\udccb QUICK REFERENCE SUMMARY')\n", "print('=' * 70)\n", "print(quick_ref.to_string(index=False))\n", "print('=' * 70)\n", "\n", "# Save quick reference\n", "quick_ref_path = 
OUTPUT_DIR / f'{CASE_STUDY_NAME}_summary_table.csv'\n", "quick_ref.to_csv(quick_ref_path, index=False)\n", "print(f'\\n\u2713 Saved quick reference: {quick_ref_path}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have completed the semantic bridge analysis!\n", "\n", "### What You Learned\n", "- How to extract topics from text automatically\n", "- How to map narratives to scientific domains\n", "- How to identify decision components\n", "- How to link language to measurable variables\n", "\n", "### Using This for Your Work\n", "1. Replace sample data with your own documents\n", "2. Customize domains in science_backbone\n", "3. Expand SVOs in svo_vocabulary\n", "4. Adjust topics with n_topics parameter\n", "5. Re-run and validate with stakeholders\n", "\n", "### Output Files\n", "All results are saved in the outputs/ folder:\n", "- CSV files for further analysis\n", "- HTML visualizations for sharing\n", "- Markdown report for documentation\n", "\n", "### For More Information\n", "- See the User Guide for detailed customization\n", "- Check the Quick Reference for common tasks\n", "- Share results with your team for validation\n", "\n", "This tool helps honor lived experience while connecting it to scientific and engineering analysis capabilities." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 4 }