Introduction
Job titles are linguistic riddles for computers. A single role might be called “Financial Planning and Reporting Analyst” (formal), “FP&A Wizard” (start-up flair), or “Numbers Guru – Budgets & Forecasts” (creative but useless for algorithms). When I began mapping these titles to the Lightcast Occupation Taxonomy—a structured framework of 1,800+ standardized roles—I quickly realized that job titles prioritize human appeal over machine readability.
This first post explores my journey to automate classification, the pitfalls of rule-based systems, and why even simple titles defy machines.
The Problem: Why Job Titles Break Machines
The Cost of Ambiguity
- Job seekers miss roles due to mismatched keywords.
- Recruiters struggle to benchmark salaries or skills.
- Analysts waste hours untangling titles like “Interim Senior FP&A Analyst – Insurance” (Is this finance? Insurance? Consulting?).
Introducing the Lightcast Taxonomy

The Lightcast Occupation Taxonomy (LOT) is a global standard for classifying jobs into a hierarchical framework:
- Career Areas: Broad domains like Healthcare or Technology.
- Occupation Groups: Subdomains like Data Science or Financial Planning.
- Occupations: Granular roles like Data Analyst or FP&A Accountant.
- Specialized Occupations: Niche roles like AI Ethics Engineer or Cryptocurrency Auditor.
Unlike government systems (e.g., O*NET), the LOT is updated annually and derived from real-world job postings, making it adaptable to emerging roles.
The Challenge: Job titles compress complexity into ambiguous phrases that clash with the taxonomy’s structure.
First Attempt: Rules, Regex, and Regret
I started with rule-based logic. After all, “Accountant” should map to “Accountant”, right?
Rule-Based Wins
- Simple cases: “Payroll Manager” → Finance Manager.
- Industry flags: “Hotel Accountant” → Hospitality Accountant.
Rule-Based Fails
- Overlap: “Financial Crimes Manager” vs. “Forensic Financial Data Analysis Manager” (similar but distinct).
- Ambiguity: “HR Process Specialist” could map to HR Operations or Process Optimization.
- Novelty: “AI Engineer – Ecommerce” had no taxonomy match.
The Breaking Point
My regex system assigned:
- “Steel Market Researcher” → generic “Market Researcher” (ignoring the niche industry).
- “D365 & Power Platform Analyst” → “Informatica Developer” (conflating unrelated tech stacks).
Result: Rules achieved ~65% accuracy. I needed a smarter approach.
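To make the failure mode concrete, here is a minimal sketch of a first-match-wins regex classifier. It is not the code from this project: the `RULES` list and the `classify` helper are hypothetical, and the labels are taken only from the examples above.

```python
import re
from typing import Optional

# Hypothetical rule set: (pattern, taxonomy label). First match wins,
# so more specific rules must come before generic ones.
RULES = [
    (re.compile(r"\bhotel accountant\b", re.I), "Hospitality Accountant"),
    (re.compile(r"\bpayroll manager\b", re.I), "Finance Manager"),
    (re.compile(r"\bmarket researcher\b", re.I), "Market Researcher"),
    (re.compile(r"\baccountant\b", re.I), "Accountant"),
]


def classify(title: str) -> Optional[str]:
    """Return the label of the first rule that matches, or None."""
    for pattern, label in RULES:
        if pattern.search(title):
            return label
    return None


for title in ["Payroll Manager", "Steel Market Researcher", "AI Engineer - Ecommerce"]:
    print(title, "->", classify(title))
```

The "Steel" qualifier is silently dropped because no rule mentions it, and a novel title returns `None`. Each gap needs another hand-written rule, which is why this approach stops scaling.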
Data Dive: Similarity Scores and Systemic Flaws
I analysed matches produced with Google’s BERT, each paired with a similarity score (0.58–0.74), to uncover patterns:
Original Title | Matched Taxonomy Title | Score |
---|---|---|
“FP&A Accountant – Hotel Group” | Hotel Accountant | 0.739 |
“Lead Data Scientist-Category Management” | Category Insights Manager | 0.736 |
“Microsoft D365 Analyst” | Informatica Developer | 0.689 |
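The scoring step can be illustrated with a small sketch. It deliberately swaps BERT for a toy bag-of-words vector so it runs with the standard library alone. The `embed`, `cosine`, and `best_match` names are hypothetical, and the scores it produces are not comparable to the ones in the table.

```python
import math
import re
from collections import Counter
from typing import Iterable, Tuple


def embed(title: str) -> Counter:
    """Toy stand-in for a BERT embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9&]+", title.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(title: str, taxonomy: Iterable[str]) -> Tuple[str, float]:
    """Return the taxonomy title with the highest cosine similarity."""
    query = embed(title)
    return max(((c, cosine(query, embed(c))) for c in taxonomy), key=lambda p: p[1])


taxonomy = ["Hotel Accountant", "Financial Planning and Reporting Analyst"]
print(best_match("FP&A Accountant - Hotel Group", taxonomy))
```

Even this crude version picks "Hotel Accountant", because shared surface tokens outweigh the "FP&A" signal. The blindspot comes from matching on similarity alone, not from BERT specifically.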
Case Studies: Why Matches Failed
- Industry Blindspots
  - Example: “FP&A Accountant – Hotel Group” → “Hotel Accountant” (0.739).
  - Issue: Ignored “FP&A” (strategic finance) and mapped purely on “Hotel”.
  - Root Cause: The taxonomy lacks hybrid roles like “Hospitality FP&A Accountant”.
- Tech Stack Confusion
  - Example: “Microsoft D365 Analyst” → “Informatica Developer” (0.689).
  - Issue: Conflated ERP systems (D365) with data integration tools (Informatica).
  - Root Cause: Embeddings focused on the generic word “Platform” in the full title (“Microsoft D365 & Power Platform Analyst”), ignoring domain-specific tools.
- Seniority Mismatches
  - Example: “Fraud Analytics Lead” → “Fraud Strategy Analyst” (0.697).
  - Issue: “Lead” (technical leadership) ≠ “Analyst” (individual contributor).
  - Root Cause: The taxonomy merges seniority levels under broad titles.
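A seniority mismatch of this kind can be caught with a simple post-match check. The sketch below is only an illustration: the `SENIORITY_RANK` ladder and its rank values are assumptions, not part of the Lightcast taxonomy.

```python
import re
from typing import Optional

# Hypothetical seniority ladder; the rank values are illustrative only.
SENIORITY_RANK = {
    "assistant": 0,
    "analyst": 1,
    "specialist": 1,
    "senior": 2,
    "lead": 3,
    "principal": 3,
    "manager": 4,
    "director": 5,
    "vice president": 6,
}


def seniority(title: str) -> Optional[int]:
    """Return the highest seniority rank mentioned in the title, or None."""
    lowered = title.lower()
    ranks = [
        rank
        for term, rank in SENIORITY_RANK.items()
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]
    return max(ranks) if ranks else None


def seniority_mismatch(source: str, matched: str) -> bool:
    """True when both titles carry a seniority signal and the ranks differ."""
    a, b = seniority(source), seniority(matched)
    return a is not None and b is not None and a != b


print(seniority_mismatch("Fraud Analytics Lead", "Fraud Strategy Analyst"))
```

A check like this only works on titles that state seniority explicitly, so it complements the similarity score and cannot replace it.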
Decoding the Taxonomy’s Hierarchy
Misclassifications cascade across the LOT’s four levels:
- Career Areas: A Level 1 error (e.g., Technology vs. Business Operations) invalidates downstream matches.
- Occupation Groups: “Data Science” vs. “Business Intelligence” have distinct skill requirements.
- Occupations: “Data Analyst” ≠ “Business Intelligence Analyst” in the taxonomy.
- Specialized Roles: “AI Engineer – NLP” requires explicit entries to avoid mismatches.
Key Insight: Titles like “HR Process Specialist” (workflow design) vs. “HR Operations Specialist” (execution) demand granular distinctions the taxonomy often misses.
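One way to reason about cascading errors is to model each classification as a four-level path and find the first level at which a prediction diverges. This is a sketch under assumptions: `TaxonomyPath` and `first_divergence` are hypothetical names, and the example paths are invented for illustration, not real LOT entries.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = ("career_area", "occupation_group", "occupation", "specialized_occupation")


@dataclass(frozen=True)
class TaxonomyPath:
    career_area: str
    occupation_group: str
    occupation: str
    specialized_occupation: str


def first_divergence(predicted: TaxonomyPath, expected: TaxonomyPath) -> Optional[str]:
    """Return the highest level at which the two paths differ, or None if equal."""
    for level in LEVELS:
        if getattr(predicted, level) != getattr(expected, level):
            return level
    return None


# Illustrative paths only; these are not actual taxonomy records.
expected = TaxonomyPath("Technology", "Data Science", "Data Analyst", "Data Analyst")
predicted = TaxonomyPath(
    "Technology", "Business Intelligence",
    "Business Intelligence Analyst", "Business Intelligence Analyst",
)
print(first_divergence(predicted, expected))
```

Reporting the divergence level separates near misses (wrong specialization) from Level 1 errors that invalidate everything below them.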
Building the Foundation: Data Sanity Checks
I audited 10,000 titles to identify patterns:
Pattern | Example | Solution |
---|---|---|
Role + Industry | “Commercial Accountant – Hotel” | Extract industry tags (e.g., Hospitality). |
Hybrid Roles | “Data Analyst, Microsoft Stack” | Split into [Role] + [Tech]. |
Seniority Noise | “Interim Senior FP&A Lead” | Remove stopwords (Interim). |
Cleaning Steps
- Normalization: Lowercase, remove fluff (e.g., “FTC”).
- Component Extraction: Split titles into [Role] - [Industry/Tech] - [Seniority].
- Industry Tagging: Map “Hotel” → Hospitality, “D365” → ERP Systems.
Result: Cleaner data, but ambiguity persisted (e.g., “FP&A Accountant” vs. “Financial Planning Analyst”).
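The cleaning steps can be sketched as a single parsing function. This is a minimal illustration, not the actual pipeline: the `FLUFF` and `INDUSTRY_TAGS` entries come from the examples above, while `SENIORITY_TERMS` and the `parse_title` helper are assumptions.

```python
import re
from typing import Dict, List, Union

FLUFF = {"ftc", "interim"}  # noise words from the examples above
INDUSTRY_TAGS = {"hotel": "Hospitality", "d365": "ERP Systems"}
SENIORITY_TERMS = {"junior", "senior", "lead", "principal"}  # assumed list


def parse_title(title: str) -> Dict[str, Union[str, List[str]]]:
    """Normalize a title and split it into role, tags, and seniority."""
    # Normalization: lowercase, tokenize, drop fluff.
    tokens = [t for t in re.findall(r"[a-z0-9&]+", title.lower()) if t not in FLUFF]
    # Industry/tech tagging via a lookup table.
    tags = [INDUSTRY_TAGS[t] for t in tokens if t in INDUSTRY_TAGS]
    # Component extraction: seniority terms versus the remaining role words.
    seniority = [t for t in tokens if t in SENIORITY_TERMS]
    role = [t for t in tokens if t not in INDUSTRY_TAGS and t not in SENIORITY_TERMS]
    return {"role": " ".join(role), "tags": tags, "seniority": seniority}


print(parse_title("Commercial Accountant - Hotel"))
print(parse_title("Interim Senior FP&A Lead"))
```

Lookup tables like these are easy to audit, but they share the weakness of the regex rules: every new industry or tool needs a new entry.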
Using an LLM to Flag False Positives Automatically
Here I use the DeepSeek API to enrich the classification table with a False Positive column that flags likely mismatches in role, domain, or seniority, even where the similarity score is high. The determination is subjective: it reflects a qualitative judgment of role relevance.
Job ID | Job Title | Matched Job Title | Similarity Score | False Positive (Reason) |
---|---|---|---|---|
722 | FP&A Accountant – Hotel Group | Hotel Accountant | 0.7393 | Yes (FP&A ≠ Hotel Accounting) |
1499 | Lead Data Scientist-Category Management | Category Insights Manager | 0.7367 | Yes (Data Science vs. Insights Management) |
13466 | Business Analyst, Microsoft Stack Development | SFDC Business Analyst | 0.7358 | Yes (Microsoft vs. Salesforce stack mismatch) |
80689 | Commerical Accountant | Production Accountant | 0.7338 | Yes (Commercial vs. Production accounting domains differ) |
52480 | Reward Specialist – Media | Reward Manager | 0.7333 | No (Roles align closely) |
40654 | FP&A Accountant | Financial Planning and Reporting Analyst | 0.7318 | No (FP&A aligns with Financial Planning) |
8895 | FP&A Accountant | Financial Planning and Reporting Analyst | 0.7318 | No |
166 | Business Analyst – Electronic Data Interchange | Oracle EBS Business Analyst | 0.7316 | Yes (EDI vs. Oracle EBS mismatch) |
4234 | Data Enablement Analyst | Data Operations Analyst | 0.7277 | No (Similar roles) |
40729 | Commercial FP&A Lead | Financial Planning and Analysis Leader | 0.7243 | No |
32046 | Data Analyst – Databricks | Data Warehouse Analyst | 0.7234 | Yes (Databricks (modern) vs. Data Warehouse (legacy) focus difference) |
51307 | Principal Operational Analyst Consultant | Operational Analyst | 0.7226 | Yes (Principal/Consultant vs. base role mismatch) |
14450 | Sales Planning & Financial Analyst Director | Director of Sales Finance | 0.7208 | No |
18418 | Interim Senior FP&A Analyst – Insurance | Financial Planning and Analysis Finance Manager | 0.7160 | No |
485 | Transition from Actuarial to Partnerships | Actuarial Manager | 0.7131 | Yes (Partnerships vs. Actuarial mismatch) |
71695 | HR Process Specialist | Human Resources Operations Specialist | 0.7125 | No |
77440 | 2nd Line Support Analyst – Leicester | Tier 2 Support Analyst | 0.7102 | No |
91300 | IT Support Analyst – (Netsuite) | Netsuite Administrator | 0.7102 | No |
77380 | Forensic Financial Data Analysis Manager | Financial Crimes Manager | 0.7063 | No (Related domains) |
47462 | Energy Analyst – mixed integer linear prog… | Energy Data Analyst | 0.7045 | Yes (Technical modeling vs. general data analysis) |
67551 | Portfolio Analytics Manager – Actuary | Analytics Portfolio Manager | 0.7041 | Yes (Actuarial focus vs. general analytics) |
78416 | Senior Pricing Analyst – Leading Lloyd’s Sy… | Corporate Pricing Analyst | 0.7027 | No |
57663 | Organics Lab Manager | Organic Chemistry Laboratory Assistant | 0.6990 | Yes (Manager vs. Assistant role mismatch) |
32634 | Fraud Analytics Lead- Digital/ Ebanking | Fraud Strategy Analyst | 0.6970 | No |
473 | Credit Risk- Crypto/Digital assets | Vice President/Credit Risk Manager | 0.6954 | Yes (Crypto-specific vs. generic VP role mismatch) |
69369 | Microsoft D365 & Power Platform Analyst | Informatica Powercenter Developer | 0.6897 | Yes (Microsoft vs. Informatica tools mismatch) |
85313 | AI Engineer Trading Automation – Manchester | Trading Systems Developer | 0.6853 | Yes (AI/ML vs. general systems development) |
40400 | German Speaking Team Lead – Credit Analyst | Lead Credit Analyst | 0.6799 | No |
495 | HubSpot Marketing Specialist | Marketing Promotions Specialist | 0.6797 | Yes (HubSpot platform focus vs. general promotions) |
326 | Data Specialist – Leasing Domain experience | Lease Records Analyst | 0.6789 | No |
48716 | SAP B1 Director of Operations Office based | SAP Applications Manager | 0.6784 | Yes (Director vs. Manager role mismatch) |
65608 | D365 F&O Functional Support Specialist | Dynamics AX Functional Consultant | 0.6665 | Yes (D365 F&O vs. Dynamics AX system mismatch) |
72572 | Analytics Director, Client Services – Adve… | Vice President of Analytics | 0.6658 | Yes (Director vs. VP role mismatch) |
67607 | Yardi Systems Administration Assistant Manager | Systems Administration Manager | 0.6626 | No |
67497 | Full Stack Software Engineer (With strong … | Full Stack Software Development Engineer | 0.6617 | No |
79675 | Steel market researcher | Market Researcher | 0.6596 | No |
12512 | Pricing Modeller – Insurtech | Pricing Actuary | 0.6578 | Yes (Modeling vs. Actuarial focus) |
467 | Business Analyst Process Mapping, PowerBI … | Business Process Analyst Consultant | 0.6519 | No |
136349 | Customer Journey Performance Analyst | Performance Analyst | 0.6308 | Yes (“Customer Journey” specificity ignored) |
726 | FP&A Lead Renewable Energy | Financial Planning and Analysis Finance Manager | 0.6252 | No |
227120 | FPA Analyst | Financial Systems Analyst | 0.6219 | Yes (FPA vs. Financial Systems focus mismatch) |
9580 | Data & MI Analyst – commodities trading firm | Trading Operations Analyst | 0.6077 | Yes (Data/MI vs. Operations focus) |
458 | Analyst, Hotel Group, Central London Office | Hotel Controller | 0.6065 | Yes (Analyst vs. Controller role mismatch) |
55610 | Knowledge Lawyer (Professional Support Law… | Private Client Lawyer | 0.6056 | Yes (Knowledge/PSL vs. Private Client law mismatch) |
215 | AI Engineer – Ecommerce – Manchester | Artificial Intelligence Engineer | 0.6031 | No |
33541 | MI Manager | MPS Manager | 0.5813 | Yes (MI vs. MPS acronym ambiguity) |
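The exact prompt and API call behind this table are not shown here, so the following is only a sketch of how one row could be turned into a prompt and how a "Yes (reason)" or "No" reply could be parsed back into a column value. `PROMPT_TEMPLATE`, `build_prompt`, `parse_verdict`, and the injected `complete` callable (standing in for whichever LLM client is used) are all hypothetical.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical prompt wording; the real prompt is not given in this post.
PROMPT_TEMPLATE = (
    "You are validating job title matches.\n"
    "Job title: {title}\n"
    "Matched taxonomy title: {matched}\n"
    "Similarity score: {score:.4f}\n"
    "Is this match a false positive? Answer 'Yes (reason)' or 'No (reason)'."
)


def build_prompt(title: str, matched: str, score: float) -> str:
    return PROMPT_TEMPLATE.format(title=title, matched=matched, score=score)


def parse_verdict(reply: str) -> Tuple[bool, Optional[str]]:
    """Parse 'Yes (reason)' / 'No (reason)' / 'No' into (is_false_positive, reason)."""
    match = re.match(r"\s*(yes|no)\b\s*(?:\((.*)\))?", reply, re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unparseable reply: {reply!r}")
    return match.group(1).lower() == "yes", match.group(2)


def flag_false_positive(
    title: str, matched: str, score: float, complete: Callable[[str], str]
) -> Tuple[bool, Optional[str]]:
    """Send one row to the model via `complete` and parse its verdict."""
    return parse_verdict(complete(build_prompt(title, matched, score)))
```

Passing the model call in as a function keeps the parsing logic testable without network access and makes it straightforward to swap one provider for another.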
As the table shows, and as manual validation confirmed, similarity thresholds alone are insufficient for reliable classification. The highest-scoring match (0.7393, “FP&A Accountant – Hotel Group” → “Hotel Accountant”) is a false positive, while a much lower-scoring one (0.6031, “AI Engineer – Ecommerce – Manchester” → “Artificial Intelligence Engineer”) is correct. A natural next step is to use OpenRouter to benchmark several large language models (LLMs) on this validation task and pick the one that best judges contextual alignment. Models such as GPT-4, Google Gemini, or DeepSeek could be compared on how well they interpret nuanced differences in job titles, domains, and seniority. The resulting pipeline would pair quantitative similarity scores with qualitative LLM-driven validation to reduce false positives.
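Such a benchmark could score each model's false-positive flags against a manually validated set. The sketch below assumes the flags have already been collected. The model names, flag lists, and labels in the usage example are placeholders, not results.

```python
from typing import Dict, List, Tuple


def precision_recall(predicted: List[bool], truth: List[bool]) -> Dict[str, float]:
    """Precision and recall of predicted false-positive flags against labels."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def rank_models(
    flags_by_model: Dict[str, List[bool]], truth: List[bool]
) -> List[Tuple[str, Dict[str, float]]]:
    """Sort models by precision, then recall, best first."""
    scores = {name: precision_recall(flags, truth) for name, flags in flags_by_model.items()}
    return sorted(
        scores.items(),
        key=lambda kv: (kv[1]["precision"], kv[1]["recall"]),
        reverse=True,
    )


# Placeholder data for illustration only.
truth = [True, True, False, False]
flags = {"model_a": [True, True, False, False], "model_b": [True, False, True, False]}
for name, score in rank_models(flags, truth):
    print(name, score)
```

Ranking on precision first reflects the goal stated above of reducing false positives. A different priority, such as catching every mismatch, would put recall first.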
Key Takeaways
- Job Titles ≠ Data: They’re marketing tools, not standardized labels.
- Rules Can’t Scale: Regex fails at novelty (e.g., “Transition from Actuarial to Partnerships”).
- Taxonomy Literacy Matters: Misunderstanding hierarchy leads to systemic errors.
What’s Next?
In Part 2, I’ll share how I enriched sparse data and built a context-aware parser to tackle ambiguity. Spoiler: “HR Process Specialist” ≠ “HR Operations Specialist”—and the distinction cost me two weeks of debugging!