[FLINK-39079][Web Frontend] Add Diagnosis Advisor to Flink Web UI #27772

Open
featzhang wants to merge 5 commits into apache:master from featzhang:feature/FLINK-ui-diagnostic-enhancement

Conversation

@featzhang
Member

Purpose

This PR adds the Diagnosis Advisor feature to the Flink Web UI, providing automated diagnostic suggestions based on job metrics analysis. Users often face difficulty diagnosing performance issues like high CPU usage, memory leaks, and backpressure. The current Flink UI provides raw metrics but lacks intelligent diagnostic suggestions.

The Diagnosis Advisor analyzes multiple metric categories and provides actionable recommendations for common performance scenarios.

Change log

Backend Changes:

  • Added DiagnosisHandler.java - REST API handler for diagnostic analysis
  • Added DiagnosisResponseBody.java - Response body with diagnostic suggestions
  • Added DiagnosisHeaders.java - REST endpoint definition

Frontend Changes:

  • Added DiagnosisService.ts - Service to fetch diagnostic suggestions
  • Added diagnosis.ts - TypeScript interface definitions
  • Added DiagnosisComponent - Angular component with HTML template and Less styles
  • Added DiagnosisDemoComponent - Demo page showcasing scenarios

New REST API:

  • GET /jobs/:jobid/diagnosis - Returns automated diagnostic suggestions
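
For orientation, a client for this endpoint might be sketched as below. This is not the PR's DiagnosisService: the response shape (`DiagnosisSuggestion`, `DiagnosisResponse` and their fields) and the helper names `diagnosisUrl` and `parseDiagnosis` are assumptions made for illustration, since the description does not list the fields of DiagnosisResponseBody.

```typescript
// Hypothetical response shape: the real DiagnosisResponseBody fields are not
// listed in this PR description.
interface DiagnosisSuggestion {
  severity: 'INFO' | 'WARNING' | 'CRITICAL';
  title: string;
  recommendation: string;
}

interface DiagnosisResponse {
  suggestions: DiagnosisSuggestion[];
}

// Builds the endpoint path named in the PR: GET /jobs/:jobid/diagnosis
function diagnosisUrl(baseUrl: string, jobId: string): string {
  const trimmed = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  return trimmed + '/jobs/' + encodeURIComponent(jobId) + '/diagnosis';
}

// Narrows an untyped JSON body; anything unexpected is treated as
// "no suggestions" instead of throwing.
function parseDiagnosis(body: unknown): DiagnosisResponse {
  if (typeof body === 'object' && body !== null) {
    const candidate = (body as { suggestions?: unknown }).suggestions;
    if (Array.isArray(candidate)) {
      return { suggestions: candidate as DiagnosisSuggestion[] };
    }
  }
  return { suggestions: [] };
}
```

Keeping URL construction and body parsing free of any HTTP client makes both easy to unit test; an Angular service could pass the result of its own GET call to `parseDiagnosis`.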

Diagnostic Rules Implemented:

  1. High CPU + High Heap Memory → Suggests GC-related issues
  2. High CPU + Normal Heap Memory → Indicates heavy computation
  3. Low CPU + High Backpressure → Points to I/O bottlenecks
  4. High GC Count → Flags performance concerns
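
The four rules above can be sketched as a small rule function. This is an illustration only, written in TypeScript instead of the handler's Java: the `JobMetrics` fields, the `THRESHOLDS` values, and the reading of "normal heap" as "not high" are all assumptions, because the description does not state the cut-offs DiagnosisHandler uses.

```typescript
// Hypothetical metric snapshot; field names are illustrative only.
interface JobMetrics {
  cpuLoad: number;           // 0..1
  heapUsedRatio: number;     // 0..1
  backpressureRatio: number; // 0..1
  gcCount: number;           // collections in the sampling window
}

// Assumed thresholds, NOT taken from the PR.
const THRESHOLDS = {
  highCpu: 0.8,
  lowCpu: 0.3,
  highHeap: 0.8,
  highBackpressure: 0.5,
  highGcCount: 100,
};

function diagnose(m: JobMetrics): string[] {
  const findings: string[] = [];
  const highCpu = m.cpuLoad >= THRESHOLDS.highCpu;
  const lowCpu = m.cpuLoad <= THRESHOLDS.lowCpu;
  const highHeap = m.heapUsedRatio >= THRESHOLDS.highHeap;

  // Rule 1: high CPU + high heap
  if (highCpu && highHeap) findings.push('Possible GC-related issue');
  // Rule 2: high CPU + normal heap (here: heap below the high mark)
  if (highCpu && !highHeap) findings.push('Heavy computation');
  // Rule 3: low CPU + high backpressure
  if (lowCpu && m.backpressureRatio >= THRESHOLDS.highBackpressure) {
    findings.push('Possible I/O bottleneck');
  }
  // Rule 4: high GC count, independent of the other rules
  if (m.gcCount >= THRESHOLDS.highGcCount) findings.push('High GC count');
  return findings;
}
```

Rules 1 and 2 are mutually exclusive by construction, while rule 4 can fire alongside any of the others, so a single snapshot may produce several findings.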

Verifying

  1. Build the Flink project: mvn clean install -DskipTests
  2. Build the web dashboard: cd flink-runtime-web/web-dashboard && npm install && npm run build
  3. Access the demo page at /diagnosis-demo to see the Diagnosis Advisor
  4. Test each diagnostic scenario to verify recommendations
  5. For integration testing, navigate to /jobs/{jobId}/diagnosis

Impact

Scope:

  • New feature addition to the Flink Web UI
  • Does not modify existing API behavior

Performance:

  • Minimal overhead as metrics are already being collected
  • Diagnostic analysis runs on-demand

Compatibility:

  • Fully backward compatible
  • Diagnostic rules are extensible

Documentation

Documentation updates needed for:

  • Web UI documentation
  • REST API reference
  • Troubleshooting guides

@flinkbot
Collaborator

flinkbot commented Mar 15, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure - re-run the last Azure build

featzhang force-pushed the feature/FLINK-ui-diagnostic-enhancement branch 2 times, most recently from 3895a67 to 46d9e73 on March 15, 2026 14:39
featzhang closed this on Mar 16, 2026
featzhang reopened this on Mar 16, 2026

This commit introduces a Top N Metrics Dashboard to the Flink Web UI,
providing visibility into resource-intensive components:

- Top N CPU Consumers: Identify tasks with highest CPU usage
- Top N Backpressure Operators: Highlight operators experiencing backpressure
- Top N GC Intensive Tasks: Show tasks with highest GC overhead

The implementation includes:
- REST API endpoint: /jobs/:jobid/metrics/top-n
- Response body with three metric categories
- Angular components for displaying metrics
- Demo page showcasing the feature

This feature helps operators quickly identify performance bottlenecks
and optimize job execution.
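
The selection behind each Top N category can be sketched as a sort and truncate. The `TaskMetric` shape and the `topN` helper are hypothetical names chosen for this sketch; the commit message does not describe how the handler ranks tasks.

```typescript
// Hypothetical per-task sample, e.g. CPU usage, backpressure ratio, or GC time.
interface TaskMetric {
  taskName: string;
  value: number;
}

// Returns the n entries with the largest value, highest first.
function topN(metrics: TaskMetric[], n: number): TaskMetric[] {
  if (n <= 0) {
    return [];
  }
  // slice() copies first so the caller's array is not reordered.
  return metrics
    .slice()
    .sort((a, b) => b.value - a.value)
    .slice(0, n);
}
```

The same helper would serve all three categories, with only the metric fed into `value` changing.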

This commit introduces the Diagnosis Advisor feature to the Flink Web UI,
providing automated diagnostic suggestions based on job metrics analysis.

The Diagnosis Advisor analyzes multiple metric categories:
- CPU usage metrics
- Memory consumption (heap usage)
- Garbage collection activity
- Backpressure ratios

It provides intelligent recommendations for common scenarios:
- High CPU + High Memory: Suggests GC-related issues
- High CPU + Normal Memory: Indicates heavy computation
- Low CPU + High Backpressure: Points to I/O bottlenecks
- Excessive GC count: Flags performance concerns

The implementation includes:
- REST API endpoint: /jobs/:jobid/diagnosis
- DiagnosisHandler with rule-based diagnostic engine
- Angular components for displaying diagnostic suggestions
- Demo page showcasing various diagnostic scenarios

This feature significantly improves the diagnostic experience for both
new and experienced users by automating the analysis of complex metric
correlations and providing actionable recommendations.

- Replace 'any' types with 'unknown' for type safety
- Fix all Prettier formatting issues
- Remove unused imports

… handlers

- Refactor DiagnosisHandler to use AbstractRestHandler instead of AbstractJobHandler
- Update metric names to match Flink 2.3 naming conventions
- Add missing DiagnosisMessageParameters class
- Fix imports and code formatting
- Rebase onto latest master branch

featzhang force-pushed the feature/FLINK-ui-diagnostic-enhancement branch from de1f2c1 to 1ab8148 on March 17, 2026 04:05
featzhang changed the title from "[FLINK-Web][Web Frontend] Add Diagnosis Advisor to Flink Web UI" to "[FLINK-39079][Web Frontend] Add Diagnosis Advisor to Flink Web UI" on Mar 20, 2026