Methodology
Application: WageMix
Purpose: A data analysis tool for HR professionals using BLS OEWS data to analyze salary percentiles and trends for individual or mixed occupations.
1. Overview
Objective
This document describes the methodology used by WageMix to process, interpret, and visualize U.S. Bureau of Labor Statistics (BLS) data for HR and compensation analysis.
The application enables HR professionals to explore percentile-based salary trends over the past five years, identify pay benchmarks, and model composite roles that span multiple occupations.
Scope
- Data Source: U.S. Bureau of Labor Statistics – Occupational Employment and Wage Statistics (OEWS)
- Time Coverage: Last five years of published OEWS data
- Metrics: Percentile-based wage estimates (P10, P25, Median, P75, P90)
- Capabilities:
  - View wage trends by occupation
  - Combine multiple occupations into a blended profile
  - Use AI to infer relevant occupations from job descriptions
Intended Audience
HR analysts, compensation specialists, and business leaders involved in workforce planning and pay benchmarking.
2. Data Source
Dataset
The application uses the Occupational Employment and Wage Statistics (OEWS) dataset provided by the U.S. Bureau of Labor Statistics (BLS).
- URL: https://www.bls.gov/oes/
- Coverage:
  - Over 800 Standard Occupational Classification (SOC) codes
  - Data available at national, state, and metropolitan statistical area levels
  - Annual updates, typically reflecting May data
Data Fields Used
- Occupation code (SOC)
- Occupation title
- Employment estimate
- Hourly and annual wage estimates at P10, P25, Median (P50), P75, and P90
- Year of publication
3. Data Ingestion and Processing
Acquisition
- Automated retrieval from official BLS sources (CSV or XLSX format).
- Version control maintained for each year’s release.
- Metadata (retrieval date, dataset version, and checksum) stored for reproducibility.
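The reproducibility metadata listed above could be captured along the following lines. This is a minimal sketch, not the application's actual code: the `ReleaseMetadata` record, its field names, and the `build_release_metadata` helper are hypothetical, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReleaseMetadata:
    """Hypothetical record stored alongside each downloaded OEWS file."""
    dataset_version: str   # release label chosen by the pipeline
    retrieval_date: date
    sha256: str            # hex digest of the raw downloaded bytes


def build_release_metadata(raw_bytes: bytes, dataset_version: str,
                           retrieval_date: date) -> ReleaseMetadata:
    # Hash the exact bytes that were downloaded so that a later
    # re-download can be compared against the stored checksum.
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return ReleaseMetadata(dataset_version, retrieval_date, digest)
```

Comparing the stored digest with the digest of a fresh download is one way to detect a silently revised file.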
Cleaning and Normalization
- SOC codes standardized to the latest schema.
- Wage data normalized into a consistent annualized format.
- Missing or suppressed fields marked with internal quality flags (see Section 7).
- Outliers reviewed using percentile validation.
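The normalization step might look like the sketch below. It assumes hourly figures are annualized with 2,080 hours (40 hours × 52 weeks), the factor BLS itself uses for OEWS annual wages. The marker strings for suppressed and top-coded cells are assumptions that should be checked against the field descriptions of each release, and `normalize_wage` is a hypothetical helper name. The returned flags are those defined in Section 7.

```python
from typing import Optional, Tuple

HOURS_PER_YEAR = 2080  # 40 hours x 52 weeks, full-time year-round convention

# Assumed marker strings; verify against the release's field descriptions.
SUPPRESSED_MARKERS = {"*", "**", ""}
TOP_CODED_MARKER = "#"


def normalize_wage(raw: str, is_hourly: bool) -> Tuple[Optional[float], str]:
    """Return (annualized wage or None, quality flag from Section 7)."""
    value = raw.strip()
    if value == TOP_CODED_MARKER:
        # Bounded value: the numeric cap is release-specific and is
        # not filled in by this sketch.
        return None, "B"
    if value in SUPPRESSED_MARKERS:
        return None, "S"
    amount = float(value.replace(",", "").replace("$", ""))
    if is_hourly:
        amount *= HOURS_PER_YEAR
    return amount, "D"
```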
Storage and Indexing
- Data stored in a structured relational or analytical database.
- Indexed by SOC code, year, and geographic level for efficient retrieval.
- Separate tables maintain metadata and suppression indicators.
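One possible layout for such a store is sketched below using SQLite from the Python standard library. The table and column names are hypothetical, and the document does not specify which database engine is used. The composite primary key leads with SOC code, year, and geographic level, so it doubles as the lookup index described above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE wage_estimate (
    soc_code    TEXT    NOT NULL,
    year        INTEGER NOT NULL,
    geo_level   TEXT    NOT NULL,  -- 'national', 'state', or 'metro'
    geo_code    TEXT    NOT NULL,
    percentile  TEXT    NOT NULL,  -- 'P10', 'P25', 'P50', 'P75', 'P90'
    annual_wage REAL,              -- NULL when suppressed
    PRIMARY KEY (soc_code, year, geo_level, geo_code, percentile)
);
CREATE TABLE quality_flag (
    soc_code    TEXT    NOT NULL,
    year        INTEGER NOT NULL,
    geo_level   TEXT    NOT NULL,
    geo_code    TEXT    NOT NULL,
    percentile  TEXT    NOT NULL,
    flag        TEXT    NOT NULL CHECK (flag IN ('D', 'B', 'E', 'S')),
    PRIMARY KEY (soc_code, year, geo_level, geo_code, percentile)
);
"""


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) wage and quality-flag tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```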
4. Data Interpretation
Percentile Definitions
- P10 (10th percentile): 10% of workers earn less than this amount.
- P25 (25th percentile): 25% earn less; represents the lower quartile.
- Median (P50): The midpoint of the wage distribution.
- P75 (75th percentile): 75% earn less; represents the upper quartile.
- P90 (90th percentile): 90% earn less; represents top earners in the occupation.
Trend Analysis
- Trends computed as year-over-year percent changes for each percentile.
- Optionally adjusted for inflation using CPI-U if configured.
- Visualization highlights movement in wage distribution over five years.
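The trend computation for a single percentile series can be sketched as follows. The function name and the choice to deflate wages to the dollars of the latest year are assumptions; the document only states that a CPI-U adjustment is optional.

```python
from typing import Dict, Optional


def yoy_percent_changes(wages_by_year: Dict[int, float],
                        cpi_by_year: Optional[Dict[int, float]] = None
                        ) -> Dict[int, float]:
    """Year-over-year percent change for one percentile series.

    If cpi_by_year is given, each wage is first expressed in the
    dollars of the latest year in the series (assumed convention).
    Changes are computed between adjacent available years.
    """
    years = sorted(wages_by_year)
    series = dict(wages_by_year)
    if cpi_by_year is not None:
        base = cpi_by_year[years[-1]]
        series = {y: w * base / cpi_by_year[y] for y, w in series.items()}
    changes = {}
    for prev, curr in zip(years, years[1:]):
        changes[curr] = (series[curr] - series[prev]) / series[prev] * 100.0
    return changes
```

A nominal series that rises exactly in step with the price index yields real changes of zero, which is a quick sanity check for the adjustment.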
5. Occupation Mixing Logic
Purpose
To allow users to analyze hybrid or cross-functional roles that are not represented by a single SOC occupation.
Method
Each occupation in a mix is assigned a weight (percentage).
The tool computes a weighted average for each percentile as follows:
\[ \text{BlendedPercentile}(p) = \sum_{i=1}^{n} w_i \times P_{i,p} \]
Where:
- \( w_i \) = weight of occupation \( i \) (weights sum to 1.0)
- \( P_{i,p} \) = percentile \( p \) (e.g., P10) of occupation \( i \)
Example
If a role is modeled as 70% Software Developer and 30% Data Scientist:
\[ \text{Median}_{\text{blend}} = 0.7 \times \text{Median}_{\text{Dev}} + 0.3 \times \text{Median}_{\text{DS}} \]
This calculation is performed for each percentile (P10–P90).
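The weighted average above translates directly into code. The sketch below is illustrative: `blend_percentiles` is a hypothetical name, and the wage figures in the usage lines are placeholders, not BLS values.

```python
from typing import Dict, Mapping

PERCENTILES = ("P10", "P25", "P50", "P75", "P90")


def blend_percentiles(mix: Mapping[str, float],
                      wages: Mapping[str, Mapping[str, float]]
                      ) -> Dict[str, float]:
    """Weighted average of each percentile across the occupations in mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return {
        p: sum(weight * wages[occ][p] for occ, weight in mix.items())
        for p in PERCENTILES
    }


# Placeholder figures for illustration only, not BLS values.
example_wages = {
    "dev": {"P10": 70000, "P25": 90000, "P50": 100000, "P75": 130000, "P90": 160000},
    "ds":  {"P10": 60000, "P25": 85000, "P50": 120000, "P75": 140000, "P90": 180000},
}
blended = blend_percentiles({"dev": 0.7, "ds": 0.3}, example_wages)
```

With these placeholder inputs the blended median is 0.7 × 100,000 + 0.3 × 120,000 = 106,000.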
Validation
- Ensures weighted percentiles preserve correct ordering (P10 < P25 < Median < P75 < P90).
- Prevents outlier influence by capping extreme percentile deviations.
6. AI-Based Job Description Parsing
Goal
To automatically infer relevant occupations from a job description text.
Process
- Text Extraction: Job description text cleaned and tokenized.
- Model Inference: NLP model (e.g., transformer-based classifier) predicts relevant SOC codes.
- Scoring: Model outputs confidence scores for each occupation.
- Weight Conversion: Confidence scores normalized into mix weights.
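The weight-conversion step can be sketched as a simple normalization. The model itself is out of scope here; the sketch assumes it returns a mapping from SOC code to a non-negative confidence score. The `min_score` cutoff and the function name are hypothetical additions, not documented behavior.

```python
from typing import Dict


def scores_to_weights(scores: Dict[str, float],
                      min_score: float = 0.0) -> Dict[str, float]:
    """Turn per-occupation confidence scores into mix weights summing to 1.0."""
    # Drop zero scores and anything below the (assumed) cutoff.
    kept = {soc: s for soc, s in scores.items() if s > 0 and s >= min_score}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no occupation met the score threshold")
    return {soc: s / total for soc, s in kept.items()}
```

For example, scores of 0.6 and 0.2 that survive the cutoff become weights of 0.75 and 0.25, which the user can then adjust as described under User Control.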
User Control
- The user can modify occupation selections and adjust mix sliders.
- Saved mixes are stored with metadata (user ID, creation date, AI source).
- Edited mixes can be reused or compared to new job descriptions.
7. Handling Data Anomalies and Suppression
BLS Suppression and Bounds
The BLS may suppress or cap wage data to maintain confidentiality or account for small sample sizes.
Typical examples:
- Replacing exact values with thresholds (e.g., “≥ $100.00/hour”)
- Omitting percentile data entirely for underrepresented occupations
Application Treatment
1. Upper-Bound Values
- Represented in the UI as “≥ $X” when published explicitly by BLS.
- When necessary, inferred using a bounded regression on the available percentiles (e.g., estimating P90 by extrapolating the slope between the median and P75).
- All inferred points are tagged with the Estimated (E) quality flag.
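One simple reading of the P90 example is a linear extrapolation in percentile rank, floored at the published bound. The sketch below illustrates that reading only; it is not the application's regression, and `estimate_p90` is a hypothetical name. Wage distributions are usually right-skewed, so a straight-line extrapolation tends to understate the top of the range, which is one reason to apply the floor.

```python
from typing import Optional, Tuple


def estimate_p90(p50: float, p75: float,
                 lower_bound: Optional[float] = None) -> Tuple[float, str]:
    """Extrapolate P90 from the median-to-P75 slope; return (value, flag)."""
    slope = (p75 - p50) / (75 - 50)       # dollars per percentile point
    estimate = p75 + slope * (90 - 75)
    if lower_bound is not None:
        # A published "≥ $X" value means the true figure is at least X.
        estimate = max(estimate, lower_bound)
    return estimate, "E"                   # E = estimated (Section 7 flag)
```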
2. Suppressed or Missing Data
- Labeled as “Insufficient Data” in tables and charts.
- Excluded from occupation mixes by default.
- If a mix includes suppressed values, the affected percentile is computed only from available data.
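Computing a blended percentile "only from available data" leaves open how the remaining weights are treated. The sketch below assumes they are rescaled to sum to 1.0; that rescaling and the `blend_available` name are assumptions, not documented behavior.

```python
from typing import Mapping, Optional


def blend_available(mix: Mapping[str, float],
                    values: Mapping[str, Optional[float]]) -> Optional[float]:
    """Blend one percentile, skipping occupations whose value is None."""
    available = {occ: w for occ, w in mix.items()
                 if values.get(occ) is not None}
    total_weight = sum(available.values())
    if total_weight == 0:
        return None  # nothing to blend: shown as "Insufficient Data"
    # Assumption: surviving weights are rescaled so they sum to 1.0 again.
    return sum(w * values[occ] for occ, w in available.items()) / total_weight
```

Under this assumption, a 50/50 mix in which one occupation's P90 is suppressed reports the other occupation's P90 unchanged for that percentile.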
3. Quality Indicators
Each data point carries a quality flag:
| Flag | Description |
|------|-------------|
| D | Direct BLS published value |
| B | Bounded (e.g., ≥ $X) |
| E | Estimated via interpolation or regression |
| S | Suppressed / Not available |
Transparency
All visualizations include tooltips or footnotes describing estimated or bounded values.
8. Data Validation and Quality Control
Cross-Checks
- Validate national aggregates against official BLS totals.
- Confirm percentile ordering (P10 < P25 < Median < P75 < P90).
- Compare year-over-year trends to detect anomalies.
Change Detection
- Identify occupations where any percentile shifts by more than 50% year over year.
- Mark these as outlier candidates for manual review.
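The ordering cross-check and the 50% change rule can both be expressed as small predicates. This is a sketch under the thresholds stated in this section; the function names are hypothetical, and the inputs are assumed to be complete (no suppressed values).

```python
from typing import Dict, List, Tuple

PERCENTILES = ("P10", "P25", "P50", "P75", "P90")


def ordering_ok(wages: Dict[str, float]) -> bool:
    """True when P10 < P25 < P50 < P75 < P90 holds for one record."""
    values = [wages[p] for p in PERCENTILES]
    return all(lo < hi for lo, hi in zip(values, values[1:]))


def outlier_candidates(prev: Dict[str, float], curr: Dict[str, float],
                       threshold_pct: float = 50.0
                       ) -> List[Tuple[str, float]]:
    """Percentiles whose year-over-year change exceeds the threshold."""
    flagged = []
    for p in PERCENTILES:
        change = (curr[p] - prev[p]) / prev[p] * 100.0
        if abs(change) > threshold_pct:
            flagged.append((p, change))
    return flagged
```

Records that fail either check would be queued for manual review rather than corrected automatically.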
Auditability
- Maintain full provenance for every record (source year, retrieval date, processing timestamp).
- Versioned data pipeline ensures historical reproducibility.
9. Output and Visualization
Charts
- Percentile band charts showing P10–P90 across five years.
- Line and area visualizations for median trend and distribution width.
- Overlay charts for blended occupations.
Interactivity
- Occupation selection and weight sliders.
- Toggle visibility for individual percentiles.
10. Limitations and Caveats
- Data Lag: OEWS data represents prior-year estimates, not real-time wages.
- Geographic Aggregation: National averages may obscure regional differences.
- Model Bias: AI mappings depend on training coverage; niche roles may be underrepresented.
- Suppression Uncertainty: Values estimated for upper-bounded data (flag E) carry additional uncertainty.
- Weighting Assumptions: Occupation mixes assume linear wage blending, which may oversimplify complex role structures.
11. Future Enhancements
- Incorporate cost-of-living normalization alongside the optional CPI-U inflation adjustment.
- Add regional filtering for metro/state-level insights.
- Introduce confidence intervals for blended percentiles.
- Expand AI model integration with O*NET skill and task taxonomy.
- Enable time-adjusted forecasting using regression or ARIMA models.
12. References