Evaluating GPT-4.5 for Financial Reasoning and Greek Financial Tasks

Introduction

GPT-4.5 is the latest upgrade in OpenAI's GPT family, promising improved reasoning and better general performance across domains. To understand how well this general-purpose powerhouse handles domain-specific financial challenges, we evaluated GPT-4.5 on two of TheFinAI’s open leaderboards:

Financial Reasoning: A benchmark for complex numerical and textual reasoning in finance, with tasks like table-based QA, document analysis, and financial math problems.
Greek Financial Tasks: A comprehensive benchmark for financial NLP in Greek, spanning entity recognition, question answering, summarization, and classification.

The results show where GPT-4.5 is strong and where it falls short in the nuanced world of finance, and in particular how it fares against domain-specific models.

Financial Reasoning: Solid QA, but Weaker on Long Documents and Financial Math

Analysis

Solid Financial QA Performance: GPT-4.5 scores 68.94% on FinQA, showing that it handles fact retrieval and reasoning in financial contexts reasonably well. It outperforms the other OpenAI models compared here (GPT-o1, GPT-o3-mini) but still trails DeepSeek-V3 (73.2%). This points to competence in general-purpose financial QA, but not best-in-class precision.
Struggles with Long-Form Document Reasoning: On DM-Simplong, GPT-4.5 scores 59%, which matches GPT-o3-mini (59%) and is only slightly ahead of GPT-o1 (56%). This reflects difficulty handling long, complex financial documents, especially where domain-specific document structures (like regulatory filings) are involved.
Limited Financial Math Capability: On XBRL-Math, GPT-4.5 reaches 74.44%, well behind DeepSeek-R1 (86.67%) and Qwen2.5-72B-Instruct-Math (83.33%). This suggests GPT-4.5’s general numerical reasoning is solid, but it lacks awareness of finance-specific math patterns (e.g., accounting calculations and the structured numeric formats in XBRL reports).
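
To make the comparisons above easier to check, here is a minimal Python sketch that computes the percentage-point gap between GPT-4.5 and each model named in the analysis. It uses only the scores quoted in this section; the `scores` dictionary and the `gaps_vs_baseline` helper are illustrative names, not part of any leaderboard tooling.

```python
# Scores (in percent) as quoted in the analysis above; no new data is added.
scores = {
    "FinQA": {"GPT-4.5": 68.94, "DeepSeek-V3": 73.2},
    "DM-Simplong": {"GPT-4.5": 59.0, "GPT-o3-mini": 59.0, "GPT-o1": 56.0},
    "XBRL-Math": {
        "GPT-4.5": 74.44,
        "DeepSeek-R1": 86.67,
        "Qwen2.5-72B-Instruct-Math": 83.33,
    },
}


def gaps_vs_baseline(task_scores, baseline="GPT-4.5"):
    """Return {model: baseline score minus model score} in percentage points.

    Hypothetical helper for illustration. A negative value means the
    baseline trails that model; a positive value means it leads.
    """
    base = task_scores[baseline]
    return {
        model: round(base - score, 2)
        for model, score in task_scores.items()
        if model != baseline
    }


if __name__ == "__main__":
    for task, task_scores in scores.items():
        for model, gap in gaps_vs_baseline(task_scores).items():
            print(f"{task}: GPT-4.5 vs {model}: {gap:+.2f} points")
```

On these numbers, GPT-4.5 trails DeepSeek-V3 by 4.26 points on FinQA, ties GPT-o3-mini and leads GPT-o1 by 3 points on DM-Simplong, and trails DeepSeek-R1 and Qwen2.5-72B-Instruct-Math by 12.23 and 8.89 points on XBRL-Math.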

What This Tells Us

👉 GPT-4.5’s general reasoning is strong, but it struggles with the specialized reasoning and formatting rules found in financial documents, particularly in long-form documents and structured numeric data.

See more results on the Financial Reasoning Leaderboard.

Greek Financial Tasks: Strong QA, But Outperformed by Specialized Models

Analysis

Exceptional QA Performance: GPT-4.5 achieves the highest score on Greek financial question answering (74.67%), indicating excellent language understanding and fact retrieval abilities.
Struggles with Greek Numerical Entities: GPT-4.5 falls short on FinNum (21.76%), which focuses on recognizing and interpreting numerical values in Greek financial text. This is a domain-specific task where localized patterns matter more than general reasoning.
Mediocre Summarization & Classification: GPT-4.5 also underperforms on multi-class classification (MultiFin), indicating that broad multilingual training doesn’t always transfer well to Greek financial terminology.

What This Tells Us

👉 GPT-4.5’s general understanding and factual reasoning are top-tier, but it lacks the fine-grained pattern recognition and financial format awareness needed for high accuracy in Greek financial documents.
👉 Specialized models like plutus-8B, trained specifically on Greek financial texts, retain a substantial edge on numeric entity recognition and overall average performance.

See more results on Plutus-ben.
