# Endobest Clinical Research Dashboard - Technical Documentation
## Part 1: General Architecture & Report Generation Workflow
**Document Version:** 2.0 (Updated with Excel Export feature)
**Last Updated:** 2025-11-08
**Audience:** Developers, Technical Architects
**Language:** English
---
## Table of Contents
1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Module Structure](#module-structure)
4. [Complete Data Collection Workflow](#complete-data-collection-workflow)
5. [API Integration](#api-integration)
6. [Multithreading & Performance](#multithreading--performance)
7. [Data Processing Pipeline](#data-processing-pipeline)
8. [Execution Modes](#execution-modes)
9. [Error Handling & Resilience](#error-handling--resilience)
---
## Overview
The **Endobest Clinical Research Dashboard** is an automated data collection and processing system designed to extract, validate, and consolidate patient inclusion data from the Endobest clinical research protocol across multiple healthcare organizations.
### Key Characteristics
- **100% Externalized Configuration**: All extraction fields defined in Excel, zero code changes needed
- **Multi-Source Data Integration**: Fetches from RC (Research Clinic), GDD (Lab), and questionnaire APIs
- **High-Performance Multithreading**: 20+ concurrent workers for API parallelization
- **Comprehensive Quality Assurance**: Built-in coherence checks and regression testing
- **Thread-Safe Operations**: Dedicated HTTP clients per thread, synchronized access to shared resources
- **Automated Error Recovery**: Token refresh, automatic retry with a fixed wait between attempts
- **Audit Trail**: Detailed logging and JSON backup versioning
---
## System Architecture
### High-Level Component Diagram
```
┌─────────────────────────────────────────────────────────┐
│ Endobest Dashboard Main Process │
│ eb_dashboard.py │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Block 1-3 │ │ Block 4 │ │ Block 5-6 │ │
│ │ Config & Auth│ │ Config Load │ │ Data Extract │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Extended Fields Configuration │ │
│ │ (Excel: Mapping Sheet → JSON field mapping) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ ↓ ↓ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Block 7 │ │ Block 8 │ │ Block 9 │ │
│ │ API Calls │ │ Orchestration│ │ Quality QA │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Multithreaded Processing (ThreadPoolExecutor) │ │
│ │ - Organizations: 20 workers (parallel) │ │
│ │ - Requests/Questionnaires: 40 workers (async) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Quality Checks & Validation │ │
│ │ - Coherence Check (stats vs detail) │ │
│ │ - Non-Regression Check (config-driven) │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Export & Persistence │ │
│ │ - endobest_inclusions.json │ │
│ │ - endobest_organizations.json │ │
│ │ - Versioned backups (_old suffix) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
┌──────────────────────────────────┐
│ Utility Modules │
├──────────────────────────────────┤
│ • eb_dashboard_utils.py │
│ • eb_dashboard_quality_checks.py │
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ External APIs │
├──────────────────────────────────┤
│ • IAM (Authentication) │
│ • RC (Research Clinic) │
│ • GDD (Lab / Diagnostic Data) │
└──────────────────────────────────┘
```
---
## Module Structure
### 1. **eb_dashboard.py** (Primary Orchestrator)
**Size:** ~45 KB | **Lines:** 1,021
**Responsibility:** Main application logic, API coordination, multithreading
#### Major Blocks:
- **Block 1**: Configuration & Base Infrastructure (constants, global variables, progress bar setup)
- **Block 2**: Decorators & Resilience (retry logic, token refresh)
- **Block 3**: Authentication (IAM login, token management)
- **Block 4**: Extended Fields Configuration (Excel loading & validation)
- **Block 5**: Data Search & Extraction (questionnaire finding, field retrieval)
- **Block 6**: Custom Functions & Field Processing (business logic, calculated fields)
- **Block 7**: Business API Calls (RC, GDD endpoints)
- **Block 7b**: Organization Center Mapping (organization enrichment with center identifiers)
- **Block 8**: Processing Orchestration (patient data processing)
- **Block 9**: Main Execution (entry point, quality checks, export)
### 2. **eb_dashboard_utils.py** (Reusable Utilities)
**Size:** ~6.4 KB | **Lines:** 184
**Responsibility:** Generic utility functions shared across modules
#### Core Functions:
```python
get_httpx_client() # Thread-local HTTP client management
get_thread_position() # Progress bar positioning
get_nested_value() # JSON path navigation with wildcard support
get_config_path() # Config folder resolution (script vs PyInstaller)
get_old_filename() # Backup filename generation
```
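
To make the `get_nested_value` behavior concrete, here is a minimal sketch of path navigation with `*` wildcard support; the actual module may differ in details such as the default sentinel:

```python
from typing import Any

def get_nested_value(data: Any, path: list, default: Any = "undefined") -> Any:
    """Illustrative sketch: walk a JSON structure along `path`; a "*" segment
    fans out over every element of a list and collects the matches."""
    if not path:
        return data
    key, rest = path[0], path[1:]
    if key == "*" and isinstance(data, list):
        results = [get_nested_value(item, rest, default) for item in data]
        return [r for r in results if r != default] or default
    if isinstance(data, dict) and key in data:
        return get_nested_value(data[key], rest, default)
    if isinstance(data, list) and isinstance(key, int) and key < len(data):
        return get_nested_value(data[key], rest, default)
    return default
```

For example, the path `["record", "clinicResearchData", "*", "value"]` collects the `value` entry of every element in the `clinicResearchData` list.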
### 3. **eb_dashboard_quality_checks.py** (QA & Validation)
**Size:** ~59 KB | **Lines:** 1,266
**Responsibility:** Quality assurance, data validation, regression checking
#### Core Functions:
```python
load_regression_check_config() # Load regression rules from Excel
run_quality_checks() # Orchestrate all QA checks
coherence_check() # Verify stats vs detailed data consistency
non_regression_check() # Config-driven change validation
run_check_only_mode() # Standalone validation mode
backup_output_files() # Create versioned backups
```
### 4. **eb_dashboard_excel_export.py** (Excel Report Generation & Orchestration)
**Size:** ~38 KB | **Lines:** ~1,340 (v1.1+)
**Responsibility:** Configuration-driven Excel workbook generation with data transformation + high-level orchestration
#### Low-Level Functions (Data Processing):
```python
load_excel_export_config() # Load Excel_Workbooks and Excel_Sheets config
validate_excel_config() # Validate templates and named ranges
export_to_excel() # Main export orchestration (openpyxl + win32com)
_apply_filter() # AND-condition filtering
_apply_sort() # Multi-key sorting with datetime support
_apply_value_replacement() # Strict type matching value transformation
_handle_output_exists() # File conflict resolution (Overwrite/Increment/Backup)
_recalculate_workbook() # Formula recalculation via win32com (optional)
_process_sheet() # Sheet-specific data filling
```
#### High-Level Orchestration Functions (v1.1+):
```python
export_excel_only(sys_argv, console, ...) # Complete --excel-only mode orchestration
run_normal_mode_export(inclusions_data, organizations_data, enabled, config, ...) # Normal mode export phase
prepare_excel_export(inclusions_file, organizations_file, ...) # Prep + validate
execute_excel_export(inclusions_data, organizations_data, config, ...) # Exec + error handling
_load_json_file_internal(filename) # Safe JSON loading helper
```
**Design Pattern (v1.1+):**
- All export mechanics delegated to module (follows quality_checks pattern)
- Main script calls single function per mode: `export_excel_only()` or `run_normal_mode_export()`
- Configuration validation and error handling centralized in module
- Result: Main script focused on business logic, export details encapsulated
**Note:** See [DOCUMENTATION_13_EXCEL_EXPORT.md](DOCUMENTATION_13_EXCEL_EXPORT.md) for complete architecture and configuration details.
### 5. **eb_dashboard_constants.py** (Centralized Configuration)
**Size:** ~3.5 KB | **Lines:** 120
**Responsibility:** Single source of truth for all application constants
#### Constants Categories:
```python
# File Management
INCLUSIONS_FILE_NAME, ORGANIZATIONS_FILE_NAME, CONFIG_FOLDER_NAME, etc.
# Excel Configuration
DASHBOARD_CONFIG_FILE_NAME, ORG_CENTER_MAPPING_FILE_NAME
EXCEL_WORKBOOKS_TABLE_NAME, EXCEL_SHEETS_TABLE_NAME, etc.
# API Configuration
API_TIMEOUT, API_*_ENDPOINT (9 endpoints across Auth, RC, GDD)
DEFAULT_USER_NAME, DEFAULT_PASSWORD, IAM_URL, RC_URL, GDD_URL, RC_APP_ID
# Research Protocol
RC_ENDOBEST_PROTOCOL_ID, RC_ENDOBEST_EXCLUDED_CENTERS
# Performance & Quality
ERROR_MAX_RETRY, WAIT_BEFORE_RETRY, MAX_THREADS
EXCEL_RECALC_TIMEOUT
# Logging & UI
LOG_FILE_NAME, BAR_N_FMT_WIDTH, BAR_TOTAL_FMT_WIDTH, etc.
```
**Design Principle:** All constants are imported from this module - never duplicated or redefined in other modules. This ensures a single source of truth for all configuration values across the entire application.
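
As an illustration of this principle, a consuming module imports the values it needs rather than hardcoding them (a minimal sketch using constants listed above):

```python
# Sketch: import shared constants instead of redefining them locally,
# so a change in eb_dashboard_constants.py propagates everywhere.
import httpx

from eb_dashboard_constants import API_TIMEOUT, MAX_THREADS, RC_URL

client = httpx.Client(base_url=RC_URL, timeout=API_TIMEOUT)
worker_count = min(20, MAX_THREADS)  # never exceed the shared thread cap
```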
---
## Complete Data Collection Workflow
### Phase 1: Initialization & Authentication
```
START
[1] User Login Prompt
├─ Input: username, password (defaults available)
├─ IAM Authentication: POST /api/auth/ziwig-pro/login
├─ Get Master Token + User ID
└─ RC Token Exchange: POST /api/auth/config-token
└─ Output: access_token, refresh_token
[2] Configuration Loading
├─ Parse Excel: Endobest_Dashboard_Config.xlsx
├─ Load Inclusions_Mapping sheet → Field mapping definition
├─ Validate all field configurations
└─ Load Regression_Check sheet → Quality rules
[3] Thread Pool Configuration
├─ Main pool: ThreadPoolExecutor(user_input_threads, max=20)
├─ Async pool: ThreadPoolExecutor(40) for nested tasks
└─ Initialize per-thread HTTP clients
```
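
The following sketch condenses step [1] into working `httpx` calls, using the endpoints documented in the API Integration section below; constants are inlined and error handling is simplified for brevity:

```python
import httpx

IAM_URL = "https://api-auth.ziwig-connect.com"
RC_URL = "https://api-hcp.ziwig-connect.com"
RC_APP_ID = "602aea51-cdb2-4f73-ac99-fd84050dc393"  # clientId from the token-exchange example

def authenticate(username: str, password: str) -> dict:
    """Sketch of step [1]: IAM login, then RC token exchange."""
    with httpx.Client(timeout=30) as client:
        # [1a] IAM login -> master token + user id
        login = client.post(f"{IAM_URL}/api/auth/ziwig-pro/login",
                            json={"username": username, "password": password})
        login.raise_for_status()
        master = login.json()
        # [1b] Exchange the master token for an RC-scoped token pair
        exchange = client.post(
            f"{RC_URL}/api/auth/config-token",
            headers={"Authorization": f"Bearer {master['access_token']}"},
            json={"userId": master["userId"], "clientId": RC_APP_ID,
                  "userAgent": "Mozilla/5.0"},
        )
        exchange.raise_for_status()
        return exchange.json()  # {"access_token": ..., "refresh_token": ...}
```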
### Phase 2: Organization & Counters Retrieval
```
[4] Get All Organizations
├─ API: GET /api/inclusions/getAllOrganizations
├─ Filter: Exclude RC_ENDOBEST_EXCLUDED_CENTERS
└─ Output: List of all centers
[5] Fetch Organization Counters (Parallelized)
├─ For each organization:
│ └─ POST /api/inclusions/inclusion-statistics
│ ├─ Protocol: RC_ENDOBEST_PROTOCOL_ID
│ └─ Store: patients_count, preincluded_count, included_count, prematurely_terminated_count
├─ Execute: 20 parallel workers
└─ Output: Organizations with counters
[5b] Enrich Organizations with Center Mapping (Optional)
├─ Load mapping file: eb_org_center_mapping.xlsx (if exists)
├─ Parse sheet: Org_Center_Mapping
│ ├─ Extract: Organization_Name → Center_Name pairs
│ ├─ Validate: No duplicate organizations or centers
│ └─ Build: Normalized key mapping (case-insensitive, trimmed)
├─ For each organization:
│ ├─ Normalize organization name
│ ├─ Lookup in mapping dictionary
│ ├─ If found: Add Center_Name field (mapped value)
│ └─ If not found: Add Center_Name field (fallback to org name)
├─ Error Handling: Graceful degradation (missing file → mapping skipped with a warning)
└─ Output: Organizations with enriched Center_Name field
[6] Calculate Totals & Sort
├─ Sum all patient counts across organizations
├─ Sort organizations by patient count (descending)
└─ Display summary statistics
```
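
A minimal sketch of the parallel counter fetch in step [5], assuming a `fetch_counters(org)` helper that wraps the inclusion-statistics call and returns a dict of counters:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

def fetch_all_counters(organizations: list[dict], max_workers: int = 20) -> list[dict]:
    """Sketch of step [5]: one statistics call per organization, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_counters, org): org for org in organizations}
        for future in tqdm(as_completed(futures), total=len(futures), desc="Counters"):
            org = futures[future]
            org.update(future.result())  # patients_count, included_count, ...
    return organizations
```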
### Phase 3: Patient Inclusion Data Collection
```
[7] For Each Organization (Parallelized - 20 workers):
├─ API: POST /api/inclusions/search?limit=1000&page=1
│ └─ Retrieve up to 1000 inclusions per organization
├─ Store: inclusions_list[]
└─ For Each Patient in Inclusions (Sequential):
[8] Fetch Patient Data Sources (Parallel):
├─ THREAD 1: GET /api/records/byPatient
│ └─ Retrieve clinical record, protocol inclusions, data
├─ THREAD 2: GET /api/surveys/filter/with-answers (OPTIMIZED)
│ └─ Single call retrieves ALL questionnaires + answers for patient
├─ THREAD 3: GET /api/requests/by-tube-id/{tubeId}
│ └─ Retrieve lab test results
└─ WAIT: All parallel threads complete
[9] Process Field Mappings
├─ For each field in field mapping config:
│ ├─ Determine field source (questionnaire, record, inclusion, request)
│ ├─ Extract raw value using field_path (supports JSON path + wildcards)
│ ├─ Apply field condition (if specified)
│ ├─ Execute custom functions (if Calculated type)
│ ├─ Apply post-processing transformations:
│ │ ├─ true_if_any: Convert to boolean if value matches list
│ │ ├─ value_labels: Map value to localized text
│ │ ├─ field_template: Apply formatting template
│ │ └─ List joining: Join array values with pipe delimiter
│ └─ Store in output_inclusion[field_group][field_name]
└─ Output: Complete inclusion record with all fields
[10] Progress Update
├─ Update per-organization progress bar
└─ Update global progress bar (thread-safe)
[11] Aggregate Results
└─ Combine all inclusions from all organizations
```
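
A sketch of the per-organization search in step [7], reusing the thread-local client from `eb_dashboard_utils`; the payload mirrors the Search Inclusions endpoint documented below:

```python
def fetch_inclusions_for_org(org: dict, access_token: str) -> list[dict]:
    """Sketch of step [7]: retrieve up to 1000 inclusions for one organization."""
    response = get_httpx_client().post(
        f"{RC_URL}/api/inclusions/search",
        params={"limit": 1000, "page": 1},
        headers={"Authorization": f"Bearer {access_token}"},
        json={"protocolId": RC_ENDOBEST_PROTOCOL_ID, "center": org["id"], "keywords": ""},
    )
    response.raise_for_status()
    return response.json()["data"]
```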
### Phase 4: Quality Assurance & Validation
```
[12] Sorting
├─ Sort by: Organization Name, Inclusion Date, Patient Pseudo
└─ Output: Ordered inclusions_list[]
[13] Quality Checks Execution
├─ COHERENCE CHECK:
│ ├─ Compare organization statistics (API counters)
│ ├─ vs. actual inclusion data (detailed records)
│ ├─ Verify: total, preincluded, included, prematurely_terminated counts
│ └─ Report mismatches with severity levels
├─ NON-REGRESSION CHECK:
│ ├─ Load previous inclusions (_old file)
│ ├─ Compare current vs. previous data
│ ├─ Apply config-driven regression rules
│ ├─ Detect: new inclusions, deleted inclusions, field changes
│ ├─ Apply transition patterns and exceptions
│ └─ Report violations by severity (Warning/Critical)
└─ Result: has_coherence_critical, has_regression_critical flags
[14] Critical Issues Handling
├─ If NO critical issues:
│ └─ Continue to export
├─ If YES critical issues:
│ ├─ Display warning: ⚠ CRITICAL issues detected!
│ ├─ Prompt user: "Do you want to write results anyway?"
│ ├─ If NO → Cancel export, exit gracefully
│ └─ If YES → Continue to export (user override)
```
### Phase 5: Export & Persistence
**Phase 5 covers both JSON persistence and the optional Excel export; the Excel step runs only when workbooks are configured:**
```
[15] Backup Old Files (only if checks passed)
├─ endobest_inclusions.json → endobest_inclusions_old.json
├─ endobest_organizations.json → endobest_organizations_old.json
└─ Operation: Silent, overwrite existing backups
[16] Write JSON Output Files
├─ File 1: endobest_inclusions.json
│ ├─ Format: JSON array of inclusion objects
│ ├─ Structure: Nested by field groups
│ └─ Size: Typically 6-7 MB (for full Endobest)
├─ File 2: endobest_organizations.json
│ ├─ Format: JSON array of organization objects
│ ├─ Includes: counters, statistics
│ └─ Size: Typically 17-20 KB
└─ Both: UTF-8 encoding, 4-space indentation
[17] Excel Export (if configured)
├─ DELEGATED TO: run_normal_mode_export()
├─ (from eb_dashboard_excel_export module)
├─ Workflow:
│ ├─ Check: Is Excel export enabled?
│ │ └─ If NO → Skip to Completion (step 18)
│ │ └─ If YES → Continue
│ │
│ ├─ Load JSONs from filesystem
│ │ └─ Ensures consistency with just-written files
│ │
│ ├─ Load Excel export configuration
│ │ ├─ Sheet: Excel_Workbooks (workbook definitions)
│ │ └─ Sheet: Excel_Sheets (sheet configurations)
│ │
│ ├─ For each configured workbook:
│ │ ├─ Load template file (openpyxl)
│ │ ├─ For each sheet in workbook:
│ │ │ ├─ Load source data (Inclusions or Organizations JSON)
│ │ │ ├─ Apply filter (AND conditions)
│ │ │ ├─ Apply multi-key sort (datetime-aware)
│ │ │ ├─ Apply value replacements (strict type matching)
│ │ │ └─ Fill data into cells/named ranges
│ │ │
│ │ ├─ Handle file conflicts (Overwrite/Increment/Backup strategy)
│ │ ├─ Save workbook (openpyxl)
│ │ └─ Recalculate formulas (optional, via win32com)
│ │
│ └─ Return: status (success/failure) + error message
└─ Note: See DOCUMENTATION_13_EXCEL_EXPORT.md for data transformation details
[18] Completion & Reporting
├─ Display elapsed time
├─ Report all file locations (JSONs + Excel files if generated)
├─ Log all operations to dashboard.log
└─ EXIT
```
**Three Operating Modes:**
1. **NORMAL MODE** (full workflow)
- Collect data → Quality checks → Write JSONs → Excel export (if enabled)
2. **--excel-only MODE**
- Skip data collection + quality checks
- Load existing JSONs → Excel export
- Uses: `export_excel_only()` function from module
3. **--check-only MODE**
- Skip data collection
- Run quality checks only
- Uses: `run_check_only_mode()` function from quality_checks module
### Expected Output Structure
```json
[
    {
        "Patient_Identification": {
            "Organisation_Id": "uuid",
            "Organisation_Name": "Center Name",
            "Patient_Id": "internal_id",
            "Pseudo": "ENDO-001",
            "Patient_Name": "Doe, John",
            "Patient_Birthday": "1975-05-15",
            "Patient_Age": 49
        },
        "Inclusion": {
            "Consent_Signed": true,
            "Inclusion_Date": "15/10/2024",
            "Inclusion_Status": "incluse",
            "Inclusion_Complex": "Non",
            "isPrematurelyTerminated": false,
            "Inclusion_Status_Complete": "incluse",
            "Need_RCP": false
        },
        "Extended_Fields": {
            "Custom_Field_1": "value",
            "Custom_Field_2": 42
        },
        "Endotest": {
            "Request_Sent": true,
            "Diagnostic_Status": "Completed",
            "Request_Overall_Status": "Accepted par Ziwig Lab"
        },
        "Infos Générales": {
            "Couleurs (ex: 8/10)": "8/10",
            "Qualité de vie (ex: 43/55)": "43/55"
        }
    }
]
```
---
## API Integration
### Authentication APIs (IAM)
#### Login Endpoint
```
POST https://api-auth.ziwig-connect.com/api/auth/ziwig-pro/login

Request:
{
    "username": "user@example.com",
    "password": "password123"
}

Response:
{
    "access_token": "jwt_token_master",
    "userId": "user-uuid",
    ...
}
```
#### Token Exchange (RC-specific)
```
POST https://api-hcp.ziwig-connect.com/api/auth/config-token

Headers:
    Authorization: Bearer {master_token}

Request:
{
    "userId": "user-uuid",
    "clientId": "602aea51-cdb2-4f73-ac99-fd84050dc393",
    "userAgent": "Mozilla/5.0..."
}

Response:
{
    "access_token": "jwt_token_rc",
    "refresh_token": "refresh_token_value"
}
```
#### Token Refresh (Automatic on 401)
```
POST https://api-hcp.ziwig-connect.com/api/auth/refreshToken

Headers:
    Authorization: Bearer {current_access_token}

Request:
{
    "refresh_token": "refresh_token_value"
}

Response:
{
    "access_token": "new_jwt_token",
    "refresh_token": "new_refresh_token"
}
```
### Research Clinic APIs (RC)
#### Get All Organizations
```
GET https://api-hcp.ziwig-connect.com/api/inclusions/getAllOrganizations

Headers:
    Authorization: Bearer {access_token}

Response:
[
    {
        "id": "org-uuid",
        "name": "Center Name",
        "address": "...",
        ...
    }
]
```
#### Get Organization Statistics
```
POST https://api-hcp.ziwig-connect.com/api/inclusions/inclusion-statistics

Headers:
    Authorization: Bearer {access_token}

Request:
{
    "protocolId": "3c7bcb4d-91ed-4e9f-b93f-99d8447a276e",
    "center": "org-uuid",
    "excludedCenters": ["excluded-org-uuid-1", "excluded-org-uuid-2"]
}

Response:
{
    "statistic": {
        "totalInclusions": 145,
        "preIncluded": 23,
        "included": 110,
        "prematurelyTerminated": 12
    }
}
```
#### Search Inclusions by Organization
```
POST https://api-hcp.ziwig-connect.com/api/inclusions/search?limit=1000&page=1

Headers:
    Authorization: Bearer {access_token}

Request:
{
    "protocolId": "3c7bcb4d-91ed-4e9f-b93f-99d8447a276e",
    "center": "org-uuid",
    "keywords": ""
}

Response:
{
    "data": [
        {
            "id": "patient-uuid",
            "name": "Doe, John",
            "status": "incluse",
            ...
        }
    ]
}
```
#### Get Patient Clinical Record
```
POST https://api-hcp.ziwig-connect.com/api/records/byPatient

Headers:
    Authorization: Bearer {access_token}

Request:
{
    "center": "org-uuid",
    "patientId": "patient-uuid",
    "mode": "exchange",
    "state": "ongoing",
    "includeEndoParcour": false,
    "sourceClient": "pro_prm"
}

Response:
{
    "record": {
        "protocol_inclusions": [
            {
                "status": "incluse",
                "blockedQcmVersions": [],
                "clinicResearchData": [
                    {
                        "requestMetaData": {
                            "tubeId": "tube-uuid"
                        }
                    }
                ]
            }
        ]
    }
}
```
#### Get All Questionnaires for Patient (Optimized)
```
POST https://api-hcp.ziwig-connect.com/api/surveys/filter/with-answers

Headers:
    Authorization: Bearer {access_token}

Request:
{
    "context": "clinic_research",
    "subject": "patient-uuid",
    "blockedQcmVersions": []    (optional)
}

Response:
[
    {
        "questionnaire": {
            "id": "qcm-uuid",
            "name": "Questionnaire Name",
            "category": "Category"
        },
        "answers": {
            "question_1": "answer_value",
            "question_2": true,
            ...
        }
    }
]
```
### Lab APIs (GDD)
#### Get Request by Tube ID
```
GET https://api-lab.ziwig-connect.com/api/requests/by-tube-id/{tubeId}?isAdmin=true&organization=undefined

Headers:
    Authorization: Bearer {access_token}

Response:
{
    "id": "request-uuid",
    "status": "completed",
    "tubeId": "tube-uuid",
    "diagnostic_status": "Completed",
    "results": [
        {
            "test_name": "Test Result",
            "value": "Result Value"
        }
    ]
}
```
---
## Multithreading & Performance
### Thread Pool Architecture
```
Main Application Thread
┌─────────────────────────────────────────────────────┐
│ Phase 1: Counter Fetching │
│ ThreadPoolExecutor(max_workers=user_input) │
│ ├─ Task 1: Get counter for Org 1 │
│ ├─ Task 2: Get counter for Org 2 │
│ └─ Task N: Get counter for Org N │
│ [Sequential wait: tqdm.as_completed] │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Phase 2: Inclusion Data Collection (Nested) │
│ Outer: ThreadPoolExecutor(max_workers=user_input) │
│ ├─ For Org 1: │
│ │ └─ Inner: ThreadPoolExecutor(max_workers=40) │
│ │ ├─ Patient 1: Async request/questionnaires │
│ │ ├─ Patient 2: Async request/questionnaires │
│ │ └─ Patient N: Async request/questionnaires │
│ │ └─ [Sequential wait: as_completed] │
│ │ │
│ ├─ For Org 2: │
│ │ └─ [Similar parallel processing] │
│ │ │
│ └─ For Org N: │
│ └─ [Similar parallel processing] │
│ [Outer wait: tqdm.as_completed] │
└─────────────────────────────────────────────────────┘
```
### Performance Optimizations
#### 1. Questionnaire Batching
**Problem:** Multiple filtered API calls per patient (slow)
**Solution:** Single optimized API call retrieves all questionnaires with answers
**Impact:** 4-5x performance improvement
```python
# BEFORE (inefficient): one filtered call per questionnaire
for qcm_id in questionnaire_ids:
    answers = client.get(f"{RC_URL}/api/surveys/{qcm_id}/answers",
                         params={"subject": patient_id})

# AFTER (optimized): a single call returns all questionnaires with answers
all_answers = client.post(f"{RC_URL}/api/surveys/filter/with-answers",
                          json={"context": "clinic_research", "subject": patient_id})
```
#### 2. Thread-Local HTTP Clients
**Problem:** Shared httpx.Client causes connection conflicts
**Solution:** Each thread maintains its own client
**Implementation:**
```python
import threading
import httpx

httpx_clients: dict[int, httpx.Client] = {}  # One client per thread id

def get_httpx_client() -> httpx.Client:
    thread_id = threading.get_ident()
    if thread_id not in httpx_clients:
        httpx_clients[thread_id] = httpx.Client()
    return httpx_clients[thread_id]
```
#### 3. Nested Parallelization
**Problem:** Sequential patient processing within organization
**Solution:** Submitting request/questionnaire fetches to async pool
**Benefit:** Non-blocking I/O during main thread processing
```python
for inclusion in inclusions:
    output_inclusion = _process_inclusion_data(inclusion, organization)

# Within _process_inclusion_data():
request_future = subtasks_thread_pool.submit(get_request_by_tube_id, tube_id)
all_questionnaires = get_all_questionnaires_by_patient(patient_id, record_data)
request_data = request_future.result()  # Wait for async completion
```
#### 4. Configurable Worker Threads
**User Input:** Thread count selection (1-20 workers)
**Rationale:** Allows tuning for network bandwidth, API rate limits, system resources
### Progress Tracking
#### Multi-Level Progress Bars
```
Overall Progress [████████████░░░░░░░░░░░░] 847/1200
1/15 - Center 1 [██████████░░░░░░░░░░░░░░░] 73/95
2/15 - Center 2 [██████░░░░░░░░░░░░░░░░░░░] 42/110
3/15 - Center 3 [████░░░░░░░░░░░░░░░░░░░░░] 28/85
```
#### Thread-Safe Progress Updates
```python
with _global_pbar_lock:
    if global_pbar:
        global_pbar.update(1)  # Thread-safe update
```
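
A sketch of how the bars could be wired together; `total_patients` is illustrative (known after the counter phase), and per-organization bars would use `get_thread_position()` from the utils module for their `position` argument:

```python
import threading
from tqdm import tqdm

_global_pbar_lock = threading.Lock()
total_patients = 1200  # example value, known after the counter phase

# Overall bar at position 0; per-organization bars are created by worker
# threads at positions 1..N (via get_thread_position()).
global_pbar = tqdm(total=total_patients, desc="Overall Progress", position=0)

def on_patient_done():
    # Called from worker threads after each processed inclusion
    with _global_pbar_lock:
        global_pbar.update(1)
```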
---
## Data Processing Pipeline
### Field Extraction Logic
```
For each field in field mapping configuration:
├─ Input: field configuration from Excel
├─ Step 1: Determine Field Source
│ ├─ If source_type in [q_id, q_name, q_category]
│ │ └─ Find questionnaire in all_questionnaires dict
│ ├─ If source_type == "record"
│ │ └─ Use record_data (clinical record)
│ ├─ If source_type == "inclusion"
│ │ └─ Use inclusion_data (patient inclusion data)
│ ├─ If source_type == "request"
│ │ └─ Use request_data (lab test request)
│ └─ If source_name == "Calculated"
│ └─ Execute custom function
├─ Step 2: Extract Raw Value
│ ├─ Navigate JSON using field_path (supports * wildcard)
│ ├─ Example: ["record", "clinicResearchData", "*", "value"]
│ └─ Result: raw_value or "undefined"
├─ Step 3: Check Field Condition (optional)
│ ├─ If condition field is undefined
│ │ └─ Set final_value = "undefined"
│ ├─ If condition field is not boolean
│ │ └─ Set final_value = "$$$$ Condition Field Error"
│ ├─ If condition field is False
│ │ └─ Set final_value = "N/A"
│ └─ If condition field is True
│ └─ Continue processing
├─ Step 4: Apply Post-Processing Transformations
│ ├─ true_if_any: Convert to boolean
│ │ └─ If raw_value matches any value in true_if_any list → True
│ │ └─ Otherwise → False
│ │
│ ├─ value_labels: Map to localized text
│ │ └─ Find matching label_map entry by raw_value
│ │ └─ Replace with French text (text.fr)
│ │
│ ├─ field_template: Apply formatting
│ │ └─ Replace "$value" placeholder with formatted value
│ │ └─ Example: "$value%" → "85%"
│ │
│ └─ List joining: Flatten arrays
│ └─ Join array elements with "|" delimiter
├─ Step 5: Format Score Dictionaries
│ ├─ If value is dict with keys ['total', 'max']
│ │ └─ Format as "total/max" string
│ │ └─ Example: {"total": 8, "max": 10} → "8/10"
│ └─ Otherwise: Keep as-is
└─ Output: final_value
└─ Stored in output_inclusion[field_group][field_name]
```
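
A condensed sketch of steps 4-5 in the order described above, with `value_labels` simplified to a plain dict (the real configuration maps via label entries with localized `text.fr`):

```python
def postprocess(raw_value, field_cfg: dict):
    """Sketch of steps 4-5: post-processing transformations, then score formatting."""
    value = raw_value
    if "true_if_any" in field_cfg:                      # -> boolean
        value = value in field_cfg["true_if_any"]
    if "value_labels" in field_cfg:                     # -> localized text
        value = field_cfg["value_labels"].get(value, value)
    if isinstance(value, list):                         # list joining
        value = "|".join(str(v) for v in value)
    if "field_template" in field_cfg:                   # formatting template
        value = field_cfg["field_template"].replace("$value", str(value))
    if isinstance(value, dict) and {"total", "max"} <= value.keys():
        value = f"{value['total']}/{value['max']}"      # score dict -> "8/10"
    return value
```

For example, `postprocess({"total": 8, "max": 10}, {})` yields `"8/10"`.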
### Custom Functions for Calculated Fields
#### 1. search_in_fields_using_regex
**Purpose:** Search multiple fields for regex pattern match
**Syntax:** `["search_in_fields_using_regex", "regex_pattern", "field_1", "field_2", ...]`
**Logic:**
```
FOR each field in [field_1, field_2, ...]:
    IF field value matches regex_pattern (case-insensitive):
        RETURN True
RETURN False
```
**Example:**
```json
{
    "source_id": "search_in_fields_using_regex",
    "field_path": [".*surgery.*", "Indication", "Previous_Surgery"]
}
```
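
A Python sketch of this function, assuming a `get_value_from_inclusion(field_name)` helper that resolves already-extracted fields of the current inclusion:

```python
import re

def search_in_fields_using_regex(pattern: str, *field_names: str) -> bool:
    """Sketch: True if any listed field's value matches the pattern (case-insensitive)."""
    for name in field_names:
        value = get_value_from_inclusion(name)  # assumed lookup helper
        if value is not None and re.search(pattern, str(value), re.IGNORECASE):
            return True
    return False
```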
#### 2. extract_parentheses_content
**Purpose:** Extract text within parentheses
**Syntax:** `["extract_parentheses_content", "field_name"]`
**Logic:**
```
value = get_value_from_inclusion(field_name)
RETURN match first occurrence of (content) pattern
```
**Example:**
```
Input: "Status (Active)"
Output: "Active"
```
#### 3. append_terminated_suffix
**Purpose:** Add " - AP" suffix if patient prematurely terminated
**Syntax:** `["append_terminated_suffix", "status_field", "is_terminated_field"]`
**Logic:**
```
status = get_value_from_inclusion(status_field)
is_terminated = get_value_from_inclusion(is_terminated_field)
IF is_terminated == True:
    RETURN status + " - AP"
ELSE:
    RETURN status
```
#### 4. if_then_else
**Purpose:** Unified conditional logic with 8 operators
**Syntax:** `["if_then_else", "operator", arg1, arg2_optional, result_if_true, result_if_false]`
**Operators:**
| Operator | Args | Logic |
|----------|------|-------|
| `is_true` | field, true_val, false_val | IF field == True THEN true_val ELSE false_val |
| `is_false` | field, true_val, false_val | IF field == False THEN true_val ELSE false_val |
| `is_defined` | field, true_val, false_val | IF field is not undefined THEN true_val ELSE false_val |
| `is_undefined` | field, true_val, false_val | IF field is undefined THEN true_val ELSE false_val |
| `all_true` | [fields_list], true_val, false_val | IF all fields are True THEN true_val ELSE false_val |
| `all_defined` | [fields_list], true_val, false_val | IF all fields are defined THEN true_val ELSE false_val |
| `==` | value1, value2, true_val, false_val | IF value1 == value2 THEN true_val ELSE false_val |
| `!=` | value1, value2, true_val, false_val | IF value1 != value2 THEN true_val ELSE false_val |
**Value Resolution Rules:**
- **Boolean literals:** `true`, `false` → used directly
- **Numeric literals:** `42`, `3.14` → used directly
- **String literals:** Prefixed with `$` → `$"Active"` resolves to the literal `"Active"`
- **Field references:** No prefix → looked up from inclusion data
**Examples:**
```json
{
    "source_id": "if_then_else",
    "field_path": ["is_defined", "Patient_Id", "$\"DEFINED\"", "$\"UNDEFINED\""]
}
{
    "source_id": "if_then_else",
    "field_path": ["==", "Status", "$\"Active\"", "$\"Is Active\"", "$\"Not Active\""]
}
{
    "source_id": "if_then_else",
    "field_path": ["all_true", ["Is_Consented", "Is_Included"], true, false]
}
```
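
A sketch of the resolution rules and a few of the operators, again assuming the `get_value_from_inclusion` helper; the real implementation covers all 8 operators:

```python
def _resolve(token):
    """Sketch of the value-resolution rules described above."""
    if isinstance(token, (bool, int, float)):
        return token                            # boolean / numeric literal
    if isinstance(token, str) and token.startswith("$"):
        return token[1:].strip('"')             # $"Active" -> "Active"
    return get_value_from_inclusion(token)      # bare string -> field lookup (assumed helper)

def if_then_else(operator, *args):
    """Sketch covering three of the eight operators."""
    if operator == "is_true":
        field, true_val, false_val = args
        return _resolve(true_val) if _resolve(field) is True else _resolve(false_val)
    if operator == "==":
        v1, v2, true_val, false_val = args
        return _resolve(true_val) if _resolve(v1) == _resolve(v2) else _resolve(false_val)
    if operator == "all_true":
        fields, true_val, false_val = args
        ok = all(_resolve(name) is True for name in fields)
        return _resolve(true_val) if ok else _resolve(false_val)
    raise ValueError(f"Unsupported operator: {operator}")
```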
---
## Execution Modes
### Mode 1: Normal Mode (Full Data Collection)
```bash
python eb_dashboard.py
```
**Workflow:**
1. User login (with defaults)
2. Load configuration
3. Collect organizations & counters
4. Collect all inclusion data (parallelized)
5. Run quality checks (coherence + regression)
6. Prompt user if critical issues
7. Export JSON files
8. Display elapsed time
**Output Files:**
- `endobest_inclusions.json`
- `endobest_organizations.json`
- Backup files with `_old` suffix
- Excel files (if configured in Excel_Workbooks table)
### Mode 2: Excel-Only Mode (Fast Export) - NEW
```bash
python eb_dashboard.py --excel-only
```
**Workflow:**
1. Load existing JSON files (no API calls, no collection)
2. Load Excel export configuration
3. Generate Excel workbooks from existing data
4. Exit
**Use Case:** Regenerate Excel reports without data collection (faster iteration), test new configurations, apply new filters/sorts
**Output Files:**
- Excel files as specified in Excel_Workbooks configuration
### Mode 3: Check-Only Mode (Validation Only)
```bash
python eb_dashboard.py --check-only
```
**Workflow:**
1. Load existing JSON files (no API calls)
2. Load regression check configuration
3. Run quality checks without collecting new data
4. Report any issues
5. Exit
**Use Case:** Validate data before distribution, no fresh collection needed
### Mode 4: Check-Only Compare Mode (File Comparison)
```bash
python eb_dashboard.py --check-only file1.json file2.json
```
**Workflow:**
1. Load two specific JSON files
2. Run regression check comparing file1 vs file2
3. Skip coherence check (organizations file not needed)
4. Report differences
5. Exit
**Use Case:** Compare two snapshot versions without coherence validation
### Mode 5: Debug Mode (Detailed Output)
```bash
python eb_dashboard.py --debug
```
**Workflow:**
1. Execute as normal mode
2. Enable DEBUG_MODE in quality checks module
3. Display detailed field-by-field changes
4. Show individual inclusion comparisons
5. Verbose logging
**Use Case:** Troubleshoot regression check rules, understand data changes
---
## Organization ↔ Center Mapping
### Overview
The organization-to-center mapping feature enriches healthcare organization records with standardized center identifiers. This enables center-based reporting without requiring code modifications.
### Configuration
**File:** `eb_org_center_mapping.xlsx` (optional, in script directory)
**Sheet Name:** `Org_Center_Mapping` (case-sensitive)
**Required Columns:**
```
| Organization_Name | Center_Name |
|-------------------|-------------|
| Hospital A | HOSP-A |
| Hospital B | HOSP-B |
```
### Workflow
1. **Load Mapping** (Step [5b] of Phase 2)
- Read `eb_org_center_mapping.xlsx` if file exists
- Parse `Org_Center_Mapping` sheet
- Skip mapping if file not found (graceful degradation, with a printed warning)
2. **Validate Data**
- Check for duplicate organization names (normalized: lowercase, trimmed)
- Check for duplicate center names
- If duplicates found: abort mapping, return empty dict
3. **Build Mapping Dictionary**
- Key: normalized organization name
- Value: center name (original case preserved)
- Example: `{"hospital a": "HOSP-A"}`
4. **Apply to Organizations**
- For each organization from RC API:
- Normalize organization name (lowercase, trim)
- Lookup in mapping dictionary
- If found: Add `Center_Name` field with mapped value
- If not found: Add `Center_Name` field with fallback (org name)
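
A minimal sketch of steps 3-4 (normalization and lookup with fallback), where `mapping` is the normalized-key dict from step 3, e.g. `{"hospital a": "HOSP-A"}`:

```python
def normalize(name: str) -> str:
    return name.strip().lower()

def apply_center_mapping(organizations: list[dict], mapping: dict[str, str]) -> None:
    """Sketch of step 4: enrich each organization in place, falling back to its own name."""
    for org in organizations:
        org["Center_Name"] = mapping.get(normalize(org["name"]), org["name"])
```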
### Error Handling
| Scenario | Behavior |
|----------|----------|
| File missing | Print warning, skip mapping |
| Sheet not found | Print warning, skip mapping |
| Columns missing | Print warning, skip mapping |
| Duplicate organizations | Abort mapping, print error |
| Duplicate centers | Abort mapping, print error |
| Organization not in mapping | Use fallback (org name) |
### Output
**In `endobest_organizations.json`:**
```json
{
    "id": "org-uuid",
    "name": "Hospital A",
    "Center_Name": "HOSP-A",
    "patients_count": 45,
    ...
}
```
**In `endobest_inclusions.json` (if extended field configured):**
```json
{
    "Patient_Identification": {
        "Organisation_Name": "Hospital A",
        "Center_Name": "HOSP-A",
        ...
    }
}
```
### Example
**Input Organizations (from RC API):**
```json
[
    {"id": "org1", "name": "Hospital A"},
    {"id": "org2", "name": "Hospital B"},
    {"id": "org3", "name": "Clinic C"}
]
```
**Mapping File:**
```
Organization_Name | Center_Name
Hospital A | HOSP-A
Hospital B | HOSP-B
```
**Console Output:**
```
Mapping organizations to centers...
⚠ 1 organization(s) not mapped:
- Clinic C
```
**Result:** Clinic C uses fallback → `Center_Name = "Clinic C"`
### Features
- **Case-Insensitive Matching**: "Hospital A" matches "hospital a" in file
- **Whitespace Trimming**: " Hospital A " matches "Hospital A"
- **Graceful Degradation**: Missing file doesn't break process
- **Fallback Strategy**: Unmapped organizations use original name
- **No Code Changes**: Fully configurable via Excel file
---
## Error Handling & Resilience
### Token Management Strategy
#### 1. Automatic Token Refresh on 401
```python
@api_call_with_retry
def some_api_call():
    # If response.status_code == 401, new_token() is called
    # automatically and the request is retried.
    pass
```
#### 2. Thread-Safe Token Refresh
```python
def new_token():
    global access_token, refresh_token
    with _token_refresh_lock:  # Only one thread refreshes at a time
        # Attempt refresh up to ERROR_MAX_RETRY times
        for attempt in range(ERROR_MAX_RETRY):
            try:
                response = get_httpx_client().post(
                    f"{RC_URL}/api/auth/refreshToken",
                    headers={"Authorization": f"Bearer {access_token}"},
                    json={"refresh_token": refresh_token},
                )
                response.raise_for_status()
                tokens = response.json()
                access_token = tokens["access_token"]   # Update global tokens
                refresh_token = tokens["refresh_token"]
                return
            except httpx.HTTPError:
                sleep(WAIT_BEFORE_RETRY)
```
### Retry Mechanism
#### Configuration Constants
```python
ERROR_MAX_RETRY = 10 # Maximum retry attempts
WAIT_BEFORE_RETRY = 0.5 # Seconds between retries (no exponential backoff)
```
#### Retry Logic
```python
for attempt in range(ERROR_MAX_RETRY):
    try:
        # Make API call
        response.raise_for_status()
        return result
    except (httpx.RequestError, httpx.HTTPStatusError) as exc:
        logging.warning(f"Error (Attempt {attempt + 1}/{ERROR_MAX_RETRY}): {exc}")
        # Handle 401 (token expired)
        if isinstance(exc, httpx.HTTPStatusError) and exc.response.status_code == 401:
            logging.info("Token expired. Refreshing token.")
            new_token()
        # Wait before retry (except last attempt)
        if attempt < ERROR_MAX_RETRY - 1:
            sleep(WAIT_BEFORE_RETRY)

# If all retries fail
logging.critical(f"Persistent error after {ERROR_MAX_RETRY} attempts")
raise httpx.RequestError(message="Persistent error")
```
### Exception Handling
#### API Errors
- **httpx.RequestError:** Network errors, connection timeouts, DNS failures
- **httpx.HTTPStatusError:** HTTP status codes >= 400
- **json.JSONDecodeError:** Invalid JSON in configuration or response
#### File I/O Errors
- **FileNotFoundError:** Configuration file missing
- **IOError:** Cannot write output files
- **json.JSONDecodeError:** Corrupted JSON file loading
#### Validation Errors
- **Configuration validation:** Invalid field definitions in Excel
- **Data validation:** Incoherent statistics vs. detailed data
- **Regression check violations:** Unexpected data changes
#### Error Logging
```python
import logging

logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='dashboard.log',
    filemode='w'
)
```
**Logged Events:**
- API errors with attempt numbers
- Token refresh events
- Configuration loading status
- Quality check results
- File I/O operations
- Thread errors with stack traces
### Graceful Degradation
#### User Confirmation on Critical Issues
```
If has_coherence_critical or has_regression_critical:
    Display: "⚠ CRITICAL issues detected in quality checks!"
    Prompt: "Do you want to write the results anyway?"
    If YES:
        Continue with export (user override)
    If NO:
        Cancel export, preserve old files
        Exit gracefully
```
#### Thread Failure Handling
```python
try:
    result = future.result()
    output_inclusions.extend(result)
except Exception as exc:
    logging.critical(f"Critical error in worker: {exc}", exc_info=True)
    thread_pool.shutdown(wait=False, cancel_futures=True)
    raise  # Propagate to main handler
```
#### Main Exception Handler
```python
if __name__ == '__main__':
    try:
        main()
    except Exception as e:
        logging.critical(f"Script terminated prematurely: {e}", exc_info=True)
        print(f"Error: {e}")
    finally:
        if 'subtasks_thread_pool' in globals():
            subtasks_thread_pool.shutdown(wait=False, cancel_futures=True)
        input("Press Enter to exit...")
```
---
## Performance Metrics & Benchmarks
### Typical Execution Times
For a full Endobest dataset (1,200+ patients, 15+ organizations):
| Phase | Duration | Notes |
|-------|----------|-------|
| Login & Config | ~2-3 sec | Sequential |
| Fetch Counters (20 workers) | ~5-8 sec | Parallelized |
| Collect Inclusions (20 workers) | ~2-4 min | Includes API calls + processing |
| Quality Checks | ~10-15 sec | Loads files, compares data |
| Export to JSON | ~3-5 sec | File I/O |
| **Total** | **~2.5-5 min** | Depends on network, API performance |
### Network Optimization Impact
**With old questionnaire fetching (N filtered calls per patient):**
- 1,200 patients × 15 questionnaires = 18,000 API calls
- Estimated: 15-30 minutes
**With optimized single-call questionnaire fetching:**
- 1,200 patients × 1 call = 1,200 API calls
- Estimated: 2-5 minutes
- **Improvement: 3-6x faster**
---
## Configuration Files
### Excel Configuration File: `Endobest_Dashboard_Config.xlsx`
#### Sheet 1: Inclusions_Mapping (Field Mapping Definition)
Defines all fields to be extracted and their transformation rules.
See **DOCUMENTATION_11_FIELD_MAPPING.md** for detailed guide.
#### Sheet 2: Regression_Check (Non-Regression Rules)
Defines data validation rules for detecting unexpected changes.
See **DOCUMENTATION_12_QUALITY_CHECKS.md** for detailed guide.
---
## Summary
The **Endobest Dashboard** implements a sophisticated, production-grade data collection system with:
- **Flexible Configuration:** Zero-code field definitions via Excel
- **High Performance:** 4-5x faster via optimized API calls
- **Robust Resilience:** Automatic token refresh, retries, error recovery
- **Thread Safety:** Per-thread clients, synchronized shared state
- **Quality Assurance:** Coherence checks + config-driven regression testing
- **Comprehensive Logging:** Full audit trail in dashboard.log
- **User-Friendly:** Progress bars, interactive prompts, clear error messages
This architecture enables non-technical users to configure new data sources without code changes, while providing developers with extensible hooks for custom logic and quality validation.
---
**Document End**