",
"\ud83d\udcda Computational Cookbook Series ",
"This notebook is part of the TACC Computational Cookbook series - reproducible workflows for scientific computing on High Performance Computing (HPC) systems.",
"
",
"",
"**Developed by:** Texas Advanced Computing Center (TACC) ",
"**Institution:** The University of Texas at Austin ",
"**Contact:** [tacc-help@tacc.utexas.edu](mailto:tacc-help@tacc.utexas.edu)",
"",
"---",
"",
"### \ud83c\udfaf What This Cookbook Does",
"",
"This workflow enables researchers and decision-makers to bridge the gap between qualitative stakeholder narratives and quantitative scientific analysis. Using natural language processing and machine learning, the cookbook:",
"",
"1. **Discovers themes** in community narratives, stakeholder interviews, and planning documents",
"2. **Identifies decision components** including goals, objectives, variables, and constraints ",
"3. **Maps concepts** to established scientific disciplines and domains",
"4. **Links terminology** to standardized scientific variables and data sources",
"5. **Creates visualizations** showing connections between human perspectives and scientific frameworks",
"",
"**Use cases include:**",
"- Environmental planning and climate adaptation",
"- Infrastructure decision support",
"- Community-driven research",
"- Participatory modeling and stakeholder engagement",
"- Interdisciplinary problem framing",
"",
"---",
"",
"### \ud83d\udccb Prerequisites",
"",
"**TACC Account & Allocation:**",
"- Active TACC user account ([register here](https://accounts.tacc.utexas.edu/register))",
"- Allocation on TACC systems (or use startup allocation)",
"- Familiarity with Jupyter notebooks",
"",
"**Input Data:**",
"- Text documents describing your problem/situation (.txt, .json, .docx formats)",
"- Examples: interview transcripts, meeting notes, stakeholder reports, community narratives",
"",
"**Knowledge:**",
"- Basic Python programming",
"- Understanding of your domain/problem area",
"- Ability to customize scientific vocabularies for your field",
"",
"---",
"",
"### \ud83d\ude80 Quick Start",
"",
"1. **Prepare your data:** Place text documents in `data/transcripts/` directory",
"2. **Run Setup (Step 1):** Install required packages ",
"3. **Load Data (Step 2):** The notebook will process your documents",
"4. **Analyze (Steps 3-6):** Follow the workflow to discover topics, map to science, extract components",
"5. **Review Outputs:** Check the `outputs/` folder for results and visualizations",
"",
"---",
"",
"### \ud83d\udcca Expected Outputs",
"",
"- `*_topic_mappings.csv`: Topics linked to scientific domains",
"- `*_components.csv`: Extracted decision components ",
"- `*_svo_mappings.csv`: Semantic links to scientific variables",
"- `*_network.html`: Interactive science backbone visualization",
"- `*_report.md`: Comprehensive analysis summary",
"",
"---",
"",
"### \ud83d\udd27 Customization",
"",
"This cookbook is designed to be adapted to different domains:",
"",
"- **Modify `science_backbone`** (Step 4): Add your discipline-specific domains",
"- **Expand `svo_vocabulary`** (Step 6): Include variables relevant to your field ",
"- **Adjust `n_topics`** (Step 3): Match complexity of your document corpus",
"- **Update decision patterns** (Step 5): Customize for your decision framework",
"",
"---",
"",
"### \ud83d\udcd6 Learn More",
"",
"- [TACC Documentation](https://docs.tacc.utexas.edu)",
"- [TACC Training](https://learn.tacc.utexas.edu)",
"- [Science Gateways Community Institute](https://sciencegateways.org/)",
"",
"---",
"",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## STEP 1: Setup\n",
"\n",
"We will load the tools needed for text analysis and visualization\n",
"\n",
"This setup includes: \n",
"- Import verification:\n",
" Checks each package before installing spaCy\n",
" model check: Verifies if the model is already downloaded\n",
"- Selective installation:\n",
" Only installs what's missing\n",
"- Clear feedback:\n",
" Shows which packages are already available vs. need installation\n",
"- Import name mapping:\n",
" Handles cases where package name \u2260 import name (like scikit-learn vs sklearn)\n",
"- Efficient for reuse:\n",
" Won't waste time reinstalling on subsequent runs\n",
"\n",
"This is perfect for TACC computational cookbooks where the notebook might be run multiple times by different users or in environments with varying pre-installed packages."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \u2713 pandas already installed\n",
" \u2713 numpy already installed\n",
" \u2713 nltk already installed\n",
" \u2713 spacy already installed\n",
" \u2713 scikit-learn already installed\n",
" \u2713 networkx already installed\n",
" \u2713 plotly already installed\n",
" \u2713 python-docx already installed\n",
" \u2713 pillow already installed\n",
"\n",
"\u2713 All packages already installed!\n",
"\u2713 spaCy model already available!\n",
"\n",
"\u2713 Setup complete and verified!\n"
]
}
],
"source": [
"# Cell 1: Installation with verification checks\n",
"\n",
"# Install required packages for TACC Jupyter environment\n",
"import sys\n",
"import subprocess\n",
"import importlib\n",
"import os\n",
"from pathlib import Path\n",
"\n",
"# Add user's local bin to PATH (needed for TACC)\n",
"user_bin = Path.home() / '.local' / 'bin'\n",
"if str(user_bin) not in os.environ['PATH']:\n",
" os.environ['PATH'] = f\"{user_bin}:{os.environ['PATH']}\"\n",
" print(f'\u2713 Added {user_bin} to PATH')\n",
"\n",
"def check_package_installed(package_name, import_name=None):\n",
" \"\"\"Check if a package is already installed\"\"\"\n",
" if import_name is None:\n",
" import_name = package_name\n",
" try:\n",
" importlib.import_module(import_name)\n",
" return True\n",
" except ImportError:\n",
" return False\n",
"\n",
"def check_spacy_model(model_name='en_core_web_sm'):\n",
" \"\"\"Check if spaCy model is already downloaded\"\"\"\n",
" try:\n",
" import spacy\n",
" spacy.load(model_name)\n",
" return True\n",
" except (ImportError, OSError):\n",
" return False\n",
"\n",
"def install_packages():\n",
" \"\"\"Install required packages if not already available\"\"\"\n",
" packages = {\n",
" 'pandas': 'pandas',\n",
" 'numpy': 'numpy',\n",
" 'nltk': 'nltk',\n",
" 'spacy': 'spacy',\n",
" 'scikit-learn': 'sklearn',\n",
" 'networkx': 'networkx',\n",
" 'plotly': 'plotly',\n",
" 'python-docx': 'docx',\n",
" 'pillow': 'PIL'\n",
" }\n",
" \n",
" missing_packages = []\n",
" for package, import_name in packages.items():\n",
" if not check_package_installed(package, import_name):\n",
" missing_packages.append(package)\n",
" print(f' - {package} needs installation')\n",
" else:\n",
" print(f' \u2713 {package} already installed')\n",
" \n",
" if missing_packages:\n",
" print(f'\\nInstalling {len(missing_packages)} package(s)...')\n",
" subprocess.check_call([\n",
" sys.executable, '-m', 'pip', 'install', \n",
" '--quiet', '--user', '--no-warn-script-location'\n",
" ] + missing_packages)\n",
" print('\u2713 Package installation complete!')\n",
" else:\n",
" print('\\n\u2713 All packages already installed!')\n",
" \n",
" if not check_spacy_model('en_core_web_sm'):\n",
" print('\\nDownloading spaCy model...')\n",
" subprocess.check_call([\n",
" sys.executable, '-m', 'spacy', 'download', \n",
" 'en_core_web_sm', '--quiet'\n",
" ])\n",
" print('\u2713 spaCy model downloaded!')\n",
" else:\n",
" print('\u2713 spaCy model already available!')\n",
" \n",
" print('\\n\u2713 Setup complete and verified!')\n",
"\n",
"install_packages()"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u2139 pytesseract not available - image OCR disabled\n",
" (This is fine if you only use .txt, .json, or .docx files)\n",
"\u2713 Libraries loaded successfully!\n"
]
}
],
"source": [
"## Cell 2: Import verification, document handling, and library loading\n",
"\n",
"# Be sure to run Cell 2 each time you restart the kernel\n",
"\n",
"# Import libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"from pathlib import Path\n",
"from collections import Counter, defaultdict\n",
"import re\n",
"\n",
"# NLP\n",
"import nltk\n",
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"from nltk.corpus import stopwords\n",
"import spacy\n",
"\n",
"# Machine Learning\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"\n",
"# Network analysis\n",
"import networkx as nx\n",
"\n",
"# Visualization\n",
"import plotly.graph_objects as go\n",
"import plotly.express as px\n",
"\n",
"# Document handling\n",
"from docx import Document\n",
"from PIL import Image\n",
"\n",
"# Optional: OCR support for images\n",
"try:\n",
" import pytesseract\n",
" OCR_AVAILABLE = True\n",
"except ImportError:\n",
" OCR_AVAILABLE = False\n",
" print('\u2139 pytesseract not available - image OCR disabled')\n",
" print(' (This is fine if you only use .txt, .json, or .docx files)')\n",
"\n",
"# Download NLTK data\n",
"for pkg in ['punkt', 'stopwords', 'averaged_perceptron_tagger']:\n",
" try:\n",
" nltk.data.find(f'tokenizers/{pkg}')\n",
" except LookupError:\n",
" nltk.download(pkg, quiet=True)\n",
"\n",
"# Load spaCy\n",
"try:\n",
" nlp = spacy.load('en_core_web_sm')\n",
" print('\u2713 Libraries loaded successfully!')\n",
"except OSError:\n",
" print('\u26a0 Run the previous cell to install spaCy model')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Prepare Input Data (this is your \"Corpora\")\n",
"\n",
"This notebook works with multiple document formats describing a problem, situation, or descriptive collections of documents. For example:\n",
"- Interview transcripts\n",
"- Meeting notes\n",
"- Stakeholder reports\n",
"- Grey Literature reports\n",
"- Community narratives\n",
"\n",
"**Supported formats:**\n",
"- `.txt` - Plain text files\n",
"- `.json` - JSON with text content (field: \"text\" or \"content\")\n",
"- `.docx` - Microsoft Word documents\n",
"- `.png` / `.jpg` / `.jpeg` - Images (OCR extraction)\n",
"\n",
"**Setup:** Place your files in `data/transcripts/` folder\n",
"\n",
"For this demo, sample documents in various formats have been created. You can use these samples if you do not have test datasets."
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u2713 Created directory: data/transcripts\n",
" Place your .txt transcript files here, or run the next cell for demo data\n"
]
}
],
"source": [
"# Create data directory structure\n",
"from pathlib import Path\n",
"\n",
"data_dir = Path('data/transcripts')\n",
"data_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f'\u2713 Created directory: {data_dir}')\n",
"print(f' Place your .txt transcript files here, or run the next cell for demo data')"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u2713 Created: interview_001.txt\n",
"\u2713 Created: interview_002.txt\n",
"\u2713 Created: meeting_notes_001.txt\n",
"\u2713 Created: stakeholder_report.txt\n",
"\n",
"\u2713 Created 4 expanded sample documents in data/transcripts\n",
" Total words: 851\n",
"\u2713 Created: interview_001.txt\n",
"\u2713 Created: interview_002.txt\n",
"\u2713 Created: meeting_notes_001.txt\n",
"\u2713 Created: stakeholder_report.txt\n",
"\n",
"\u2713 Created 4 sample documents in data/transcripts\n"
]
}
],
"source": [
"# Sample transcript documents for demonstration\n",
"\n",
"# Skip this if you have real data\n",
"\n",
"# Create expanded sample transcript documents for demonstration\n",
"\n",
"sample_transcripts = {\n",
" 'interview_001.txt': \"\"\"\n",
" Interview with Community Resident - Sarah Martinez\n",
" Date: March 15, 2024\n",
" \n",
" We've been experiencing significant flooding in our neighborhood during heavy rains. \n",
" The storm drains seem inadequate, and water pools on Main Street for hours. \n",
" Several basements have flooded in the past year. \n",
" \n",
" Our primary goal is to protect our homes and preserve property values in this neighborhood. \n",
" We need to maintain safe access to schools and emergency services even during storm events.\n",
" \n",
" The main objective should be to minimize flood damage to residential properties and reduce \n",
" the frequency of street closures. We're looking at different investment options for \n",
" stormwater management, but we have a budget constraint of about $2 million from the city \n",
" council allocation.\n",
" \n",
" I think we need better drainage infrastructure as our decision variable. The implementation \n",
" strategy could include both green and gray infrastructure. We cannot exceed the current \n",
" budget without additional grant funding.\n",
" \n",
" We should use flood depth as a key indicator of success, measuring the water depth on \n",
" Main Street during storm events. Another important metric would be the number of properties \n",
" with basement flooding per year.\n",
" \"\"\",\n",
" \n",
" 'interview_002.txt': \"\"\"\n",
" Interview with Local Business Owner - James Chen\n",
" Date: March 18, 2024\n",
" \n",
" The flooding issue is directly related to new development upstream. Since they built \n",
" the shopping center, our area gets much more runoff during storms.\n",
" \n",
" Our goal is to protect local businesses from flood damage while maintaining economic \n",
" vitality downtown. We aim to preserve the historic character of our business district.\n",
" \n",
" The key objective here is to maximize stormwater retention upstream and minimize runoff \n",
" reaching our downtown area. We need green infrastructure like retention ponds and \n",
" permeable pavement as part of our strategy.\n",
" \n",
" The investment decision should consider both short-term fixes and long-term solutions. \n",
" We face a major constraint - the shopping center owner won't participate unless required \n",
" by regulation. We also have a time limit since hurricane season starts in June.\n",
" \n",
" I'd suggest tracking business interruption days as an indicator of improvement. We should \n",
" measure economic damage in dollars per storm event. The depth of flooding in parking areas \n",
" would be another useful metric to monitor progress.\n",
" \"\"\",\n",
" \n",
" 'meeting_notes_001.txt': \"\"\"\n",
" Community Stakeholder Meeting - Flood Mitigation Planning\n",
" Date: March 22, 2024\n",
" Attendees: 45 residents, city council members, county planning staff\n",
" \n",
" Meeting Summary:\n",
" \n",
" Community members reported increased flooding frequency over the past five years. \n",
" Main concerns include inadequate drainage, upstream development impacts, and aging infrastructure.\n",
" \n",
" GOALS IDENTIFIED:\n",
" - Protect residential and commercial properties from flood damage\n",
" - Maintain neighborhood livability and safety during storm events \n",
" - Preserve environmental quality of local waterways\n",
" - Aim to restore pre-development runoff conditions\n",
" \n",
" OBJECTIVES DISCUSSED:\n",
" - Minimize flood damage costs to the community\n",
" - Reduce flood depth on critical roadways by 50%\n",
" - Maximize green infrastructure implementation where feasible\n",
" \n",
" DECISION VARIABLES:\n",
" - Infrastructure investment levels (ranging from $1M to $5M)\n",
" - Strategy selection: gray infrastructure vs. green infrastructure vs. hybrid\n",
" - Implementation timeline: phased over 3 years vs. comprehensive approach\n",
" \n",
" CONSTRAINTS IDENTIFIED:\n",
" - Budget limit of $2.5 million from city general fund\n",
" - Cannot disrupt traffic on Main Street for more than 2 weeks\n",
" - Must comply with historic district design guidelines\n",
" - Limited right-of-way for new infrastructure\n",
" \n",
" PERFORMANCE INDICATORS:\n",
" - Measure flood depth at 5 key monitoring locations\n",
" - Track number of flood events per year exceeding 6 inches\n",
" - Calculate economic damage per storm event\n",
" - Monitor basement flooding frequency as a key metric\n",
" - Assess stormwater quality indicators (pollutant levels)\n",
" \n",
" Proposed solutions include comprehensive stormwater management systems, coordination \n",
" with county planning on upstream development, and establishment of maintenance protocols.\n",
" \n",
" Next steps: Form technical committee to evaluate decision alternatives using \n",
" multi-criteria decision analysis framework.\n",
" \"\"\",\n",
" \n",
" 'stakeholder_report.txt': \"\"\"\n",
" Stormwater Infrastructure Assessment Report\n",
" Prepared by: City Engineering Department\n",
" Date: April 1, 2024\n",
" \n",
" EXECUTIVE SUMMARY\n",
" \n",
" This report evaluates stormwater management alternatives for the downtown district \n",
" experiencing chronic flooding issues.\n",
" \n",
" PROJECT GOALS:\n",
" The overarching goal is to protect the community from flood hazards while preserving \n",
" environmental resources. We aim to maintain infrastructure resilience under future \n",
" climate conditions.\n",
" \n",
" SPECIFIC OBJECTIVES:\n",
" 1. Minimize annual flood damage costs to less than $500,000\n",
" 2. Reduce peak flood depths by 40% during 10-year storm events\n",
" 3. Maximize community co-benefits (recreation, green space, water quality)\n",
" \n",
" DECISION FRAMEWORK:\n",
" \n",
" The primary decision variable is the selection of infrastructure investment strategy \n",
" from three alternatives:\n",
" \n",
" Alternative A: Traditional gray infrastructure ($3.2M investment)\n",
" Alternative B: Green infrastructure approach ($2.8M investment) \n",
" Alternative C: Hybrid strategy ($3.5M investment)\n",
" \n",
" Implementation decisions also include phasing schedules and maintenance strategies.\n",
" \n",
" CONSTRAINTS:\n",
" - Cannot exceed $3.5 million budget constraint\n",
" - Must complete implementation within 24-month time limit\n",
" - Cannot impact historic building foundations\n",
" - Limited to existing public right-of-way areas\n",
" \n",
" PERFORMANCE METRICS:\n",
" \n",
" Key indicators for evaluating alternatives:\n",
" - Maximum flood depth at critical intersections (target: <6 inches)\n",
" - Frequency of road closures (metric: closures per year)\n",
" - Economic damage per storm event (measured in dollars)\n",
" - Stormwater volume captured (measure in acre-feet)\n",
" - Cost-effectiveness indicator (damage reduced per dollar invested)\n",
" \n",
" RECOMMENDATIONS:\n",
" \n",
" Our objective is to reduce flood risk while maximizing return on investment. The decision \n",
" should minimize lifecycle costs while achieving flood depth reduction goals. We aim to \n",
" preserve flexibility for future adaptations as climate conditions change.\n",
" \"\"\"\n",
"}\n",
"\n",
"# Write files to directory\n",
"for filename, content in sample_transcripts.items():\n",
" filepath = data_dir / filename\n",
" filepath.write_text(content.strip())\n",
" print(f'\u2713 Created: {filename}')\n",
"\n",
"print(f'\\n\u2713 Created {len(sample_transcripts)} expanded sample documents in {data_dir}')\n",
"print(f' Total words: {sum(len(content.split()) for content in sample_transcripts.values())}')\n",
"for filename, content in sample_transcripts.items():\n",
" filepath = data_dir / filename\n",
" filepath.write_text(content.strip())\n",
" print(f'\u2713 Created: {filename}')\n",
"\n",
"print(f'\\n\u2713 Created {len(sample_transcripts)} sample documents in {data_dir}')"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u2713 Loaded 4 transcript files:\n",
" - interview_001.txt\n",
" - interview_002.txt\n",
" - meeting_notes_001.txt\n",
" - stakeholder_report.txt\n"
]
}
],
"source": [
"# Load all transcript files from the data directory\n",
"transcripts = {}\n",
"\n",
"for filepath in sorted(data_dir.glob('*.txt')):\n",
" transcripts[filepath.name] = filepath.read_text()\n",
" \n",
"print(f'\u2713 Loaded {len(transcripts)} transcript files:')\n",
"for filename in transcripts.keys():\n",
" print(f' - {filename}')"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"============================================================\n",
"File: interview_001.txt\n",
"============================================================\n",
"Interview with Community Resident - Sarah Martinez\n",
" Date: March 15, 2024\n",
" \n",
" We've been experiencing significant flooding in our neighborhood during heavy rains. \n",
" The storm drains seem ina...\n",
"\n",
"============================================================\n",
"File: interview_002.txt\n",
"============================================================\n",
"Interview with Local Business Owner - James Chen\n",
" Date: March 18, 2024\n",
" \n",
" The flooding issue is directly related to new development upstream. Since they built \n",
" the shopping center, our ar...\n",
"\n",
"============================================================\n",
"File: meeting_notes_001.txt\n",
"============================================================\n",
"Community Stakeholder Meeting - Flood Mitigation Planning\n",
" Date: March 22, 2024\n",
" Attendees: 45 residents, city council members, county planning staff\n",
" \n",
" Meeting Summary:\n",
" \n",
" Community...\n",
"\n",
"============================================================\n",
"File: stakeholder_report.txt\n",
"============================================================\n",
"Stormwater Infrastructure Assessment Report\n",
" Prepared by: City Engineering Department\n",
" Date: April 1, 2024\n",
" \n",
" EXECUTIVE SUMMARY\n",
" \n",
" This report evaluates stormwater management alterna...\n"
]
}
],
"source": [
"# Display preview of loaded transcripts\n",
"for filename, content in transcripts.items():\n",
" print(f'\\n{\"=\"*60}')\n",
" print(f'File: {filename}')\n",
" print(f'{\"=\"*60}')\n",
" print(content[:200] + '...' if len(content) > 200 else content)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u2713 Transcript Statistics:\n",
"\n",
" File Characters Words Sentences\n",
" interview_001.txt 1281 182 12\n",
" interview_002.txt 1185 169 12\n",
" meeting_notes_001.txt 2021 251 4\n",
"stakeholder_report.txt 2122 249 10\n"
]
}
],
"source": [
"# Calculate basic statistics for each transcript\n",
"import pandas as pd\n",
"\n",
"stats = []\n",
"for filename, content in transcripts.items():\n",
" stats.append({\n",
" 'File': filename,\n",
" 'Characters': len(content),\n",
" 'Words': len(content.split()),\n",
" 'Sentences': len(sent_tokenize(content))\n",
" })\n",
"\n",
"stats_df = pd.DataFrame(stats)\n",
"print('\u2713 Transcript Statistics:\\n')\n",
"print(stats_df.to_string(index=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Step 3: Discover Topics\n",
"**What is Topic Modeling?**\n",
"Topic modeling automatically identifies themes in your documents. \n",
"It groups words that appear together frequently into topics.\n",
"\n",
"Example: If \"flooding\", \"water\", \"drainage\" appear together, the topic might be about coastal hydrology.\n",
"\n",
"**How it works:**\n",
"1. Break text into words\n",
"2. Find patterns of co-occurring words\n",
"3. Group related words into topics\n",
"4. Assign topics to documents"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Analysis Parameters:\n",
" \u2022 Number of topics: 5\n",
" \u2022 Vocabulary size: 100\n",
" \u2022 Keywords per topic: 6\n"
]
}
],
"source": [
"# Set analysis parameters\n",
"n_topics = 5 # Change this number to discover more or fewer topics\n",
"max_vocabulary = 100 # Maximum number of terms to consider\n",
"top_words_display = 6 # Number of keywords to show per topic\n",
"\n",
"print(f'Analysis Parameters:')\n",
"print(f' \u2022 Number of topics: {n_topics}')\n",
"print(f' \u2022 Vocabulary size: {max_vocabulary}')\n",
"print(f' \u2022 Keywords per topic: {top_words_display}')"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u2713 Text preprocessing complete\n",
"\n",
"Example: interview with community resident sarah martinez date march we ve been experiencing significant flooding in our neighborhood during heavy rains the st...\n"
]
}
],
"source": [
"# Preprocess text\n",
"def preprocess_text(text):\n",
" text = text.lower()\n",
" text = re.sub(r'[^a-z\\s]', ' ', text)\n",
" text = ' '.join(text.split())\n",
" return text\n",
"\n",
"processed_docs = [preprocess_text(text) for text in transcripts.values()]\n",
"doc_names = list(transcripts.keys())\n",
"\n",
"print('\u2713 Text preprocessing complete')\n",
"print(f'\\nExample: {processed_docs[0][:150]}...')"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Discovering topics...\n",
"\n",
"\u2713 Vocabulary: 100 terms\n",
"\u2713 Discovered 5 topics\n",
"\n",
"Topics Discovered:\n",
"============================================================\n",
"\n",
"Topic 1: goal, constraint, goal protect\n",
" Full keywords: goal, constraint, goal protect, objective, indicator, measure\n",
"\n",
"Topic 2: flood, infrastructure, main\n",
" Full keywords: flood, infrastructure, main, storm, street, investment\n",
"\n",
"Topic 3: business, retention, downtown\n",
" Full keywords: business, retention, downtown, local, runoff, upstream\n",
"\n",
"Topic 4: goal, constraint, goal protect\n",
" Full keywords: goal, constraint, goal protect, objective, indicator, measure\n",
"\n",
"Topic 5: goal, constraint, goal protect\n",
" Full keywords: goal, constraint, goal protect, objective, indicator, measure\n",
"\n",
"============================================================\n"
]
}
],
"source": [
"# Topic Discovery \n",
"# Extract topics using Latent Derelecht Analysis (LDA)\n",
"print('Discovering topics...\\n')\n",
"\n",
"# Create document-term matrix\n",
"vectorizer = TfidfVectorizer(\n",
" max_features=max_vocabulary,\n",
" stop_words='english',\n",
" ngram_range=(1, 2)\n",
")\n",
"doc_term_matrix = vectorizer.fit_transform(processed_docs)\n",
"feature_names = vectorizer.get_feature_names_out()\n",
"\n",
"print(f'\u2713 Vocabulary: {len(feature_names)} terms')\n",
"\n",
"# Discover topics\n",
"lda_model = LatentDirichletAllocation(\n",
" n_components=n_topics,\n",
" random_state=42,\n",
" max_iter=20\n",
")\n",
"doc_topic_dist = lda_model.fit_transform(doc_term_matrix)\n",
"\n",
"print(f'\u2713 Discovered {n_topics} topics\\n')\n",
"\n",
"# Extract and store topic information\n",
"print('Topics Discovered:')\n",
"print('=' * 60)\n",
"\n",
"topics_info = {}\n",
"for idx, topic in enumerate(lda_model.components_):\n",
" top_indices = topic.argsort()[-8:][::-1]\n",
" top_words = [feature_names[i] for i in top_indices]\n",
" \n",
" # Create topic label with keywords\n",
" topic_label = f\"Topic {idx + 1}: {', '.join(top_words[:3])}\"\n",
" topics_info[f'Topic {idx + 1}'] = {\n",
" 'label': topic_label,\n",
" 'keywords': top_words\n",
" }\n",
" \n",
" print(f'\\n{topic_label}')\n",
" print(f' Full keywords: {\", \".join(top_words[:top_words_display])}')\n",
"\n",
"print('\\n' + '=' * 60)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Creating enhanced visualization...\n",
"\n"
]
},
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"hoverinfo": "text",
"hovertext": [
"Topic 1: goal, constraint, goal protect Document: interview_001 Proportion: 2.6% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 1: goal, constraint, goal protect Document: interview_002 Proportion: 3.0% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 1: goal, constraint, goal protect Document: meeting_notes_001 Proportion: 2.2% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 1: goal, constraint, goal protect Document: stakeholder_report Proportion: 2.4% Keywords: goal, constraint, goal protect, objective, indicator, measure"
],
"name": "Topic 1: goal, constraint, goal protect",
"text": [
"2.6%",
"3.0%",
"2.2%",
"2.4%"
],
"textposition": "auto",
"type": "bar",
"x": [
"interview_001",
"interview_002",
"meeting_notes_001",
"stakeholder_report"
],
"y": {
"bdata": "+JntkDuMmj9m3FLC+4ieP9zqfftntpY/P6TEbMPmmD8=",
"dtype": "f8"
}
},
{
"hoverinfo": "text",
"hovertext": [
"Topic 2: flood, infrastructure, main Document: interview_001 Proportion: 89.6% Keywords: flood, infrastructure, main, storm, street, investment",
"Topic 2: flood, infrastructure, main Document: interview_002 Proportion: 3.1% Keywords: flood, infrastructure, main, storm, street, investment",
"Topic 2: flood, infrastructure, main Document: meeting_notes_001 Proportion: 91.1% Keywords: flood, infrastructure, main, storm, street, investment",
"Topic 2: flood, infrastructure, main Document: stakeholder_report Proportion: 90.3% Keywords: flood, infrastructure, main, storm, street, investment"
],
"name": "Topic 2: flood, infrastructure, main",
"text": [
"89.6%",
"3.1%",
"91.1%",
"90.3%"
],
"textposition": "auto",
"type": "bar",
"x": [
"interview_001",
"interview_002",
"meeting_notes_001",
"stakeholder_report"
],
"y": {
"bdata": "w/DSzpKt7D+dENTgjO2fP8Rb2yMvJ+0/yCiMsoDh7D8=",
"dtype": "f8"
}
},
{
"hoverinfo": "text",
"hovertext": [
"Topic 3: business, retention, downtown Document: interview_001 Proportion: 2.6% Keywords: business, retention, downtown, local, runoff, upstream",
"Topic 3: business, retention, downtown Document: interview_002 Proportion: 87.9% Keywords: business, retention, downtown, local, runoff, upstream",
"Topic 3: business, retention, downtown Document: meeting_notes_001 Proportion: 2.2% Keywords: business, retention, downtown, local, runoff, upstream",
"Topic 3: business, retention, downtown Document: stakeholder_report Proportion: 2.5% Keywords: business, retention, downtown, local, runoff, upstream"
],
"name": "Topic 3: business, retention, downtown",
"text": [
"2.6%",
"87.9%",
"2.2%",
"2.5%"
],
"textposition": "auto",
"type": "bar",
"x": [
"interview_001",
"interview_002",
"meeting_notes_001",
"stakeholder_report"
],
"y": {
"bdata": "SQrZcvOomj+SmsH+uyPsP0/FGpLj9pY/V/osaJ8bmT8=",
"dtype": "f8"
}
},
{
"hoverinfo": "text",
"hovertext": [
"Topic 4: goal, constraint, goal protect Document: interview_001 Proportion: 2.6% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 4: goal, constraint, goal protect Document: interview_002 Proportion: 3.0% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 4: goal, constraint, goal protect Document: meeting_notes_001 Proportion: 2.2% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 4: goal, constraint, goal protect Document: stakeholder_report Proportion: 2.4% Keywords: goal, constraint, goal protect, objective, indicator, measure"
],
"name": "Topic 4: goal, constraint, goal protect",
"text": [
"2.6%",
"3.0%",
"2.2%",
"2.4%"
],
"textposition": "auto",
"type": "bar",
"x": [
"interview_001",
"interview_002",
"meeting_notes_001",
"stakeholder_report"
],
"y": {
"bdata": "v8vtkDuMmj8iAVPC+4ieP9v0fftntpY/Rq/EbMPmmD8=",
"dtype": "f8"
}
},
{
"hoverinfo": "text",
"hovertext": [
"Topic 5: goal, constraint, goal protect Document: interview_001 Proportion: 2.6% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 5: goal, constraint, goal protect Document: interview_002 Proportion: 3.0% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 5: goal, constraint, goal protect Document: meeting_notes_001 Proportion: 2.2% Keywords: goal, constraint, goal protect, objective, indicator, measure",
"Topic 5: goal, constraint, goal protect Document: stakeholder_report Proportion: 2.4% Keywords: goal, constraint, goal protect, objective, indicator, measure"
],
"name": "Topic 5: goal, constraint, goal protect",
"text": [
"2.6%",
"3.0%",
"2.2%",
"2.4%"
],
"textposition": "auto",
"type": "bar",
"x": [
"interview_001",
"interview_002",
"meeting_notes_001",
"stakeholder_report"
],
"y": {
"bdata": "gnftkDuMmj+bv1LC+4ieP4LifftntpY/CJnEbMPmmD8=",
"dtype": "f8"
}
}
],
"layout": {
"barmode": "stack",
"height": 500,
"legend": {
"orientation": "v",
"title": {
"text": "Topics (hover for details)"
},
"x": 1.02,
"xanchor": "left",
"y": 1,
"yanchor": "top"
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermap": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermap"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"title": {
"text": "Topic Distribution Across Documents (5 Topics)"
},
"xaxis": {
"title": {
"text": "Document"
}
},
"yaxis": {
"title": {
"text": "Topic Proportion"
}
}
}
},
"text/html": [
"