Methodology

Application: WageMix
Purpose: A data analysis tool for HR professionals using BLS OEWS data to analyze salary percentiles and trends for individual or mixed occupations.

1. Overview

Objective

This document describes the methodology used by WageMix to process, interpret, and visualize U.S. Bureau of Labor Statistics (BLS) data for HR and compensation analysis.
The application enables HR professionals to explore percentile-based salary trends over the past five years, identify pay benchmarks, and model composite roles that span multiple occupations.

Scope

Data Source: U.S. Bureau of Labor Statistics – Occupational Employment and Wage Statistics (OEWS)
Time Coverage: Last five years of published OEWS data
Metrics: Percentile-based wage estimates (P10, P25, Median, P75, P90)
Capabilities:
- View wage trends by occupation
- Combine multiple occupations into a blended profile
- Use AI to infer relevant occupations from job descriptions

Intended Audience

HR analysts, compensation specialists, and business leaders involved in workforce planning and pay benchmarking.

2. Data Source

Dataset

The application uses the Occupational Employment and Wage Statistics (OEWS) dataset provided by the U.S. Bureau of Labor Statistics (BLS).

URL: https://www.bls.gov/oes/
Coverage:
- Over 800 occupations, identified by Standard Occupational Classification (SOC) codes
- Data available at national, state, and metropolitan statistical area levels
- Annual releases, with estimates referenced to May of each year

Data Fields Used

Occupation code (SOC)
Occupation title
Employment estimate
Hourly and annual wage estimates at P10, P25, Median (P50), P75, and P90
Year of publication

3. Data Ingestion and Processing

Acquisition

Automated retrieval from official BLS sources (CSV or XLSX format).
Version control maintained for each year’s release.
Metadata (retrieval date, dataset version, and checksum) stored for reproducibility.
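As an illustration of the metadata listed above, a minimal sketch might record a retrieval timestamp and checksum for each downloaded file. The function name, the field names, and the choice of SHA-256 are assumptions made for this example, not a description of WageMix's actual implementation.

```python
import hashlib
from datetime import datetime, timezone


def build_release_metadata(payload: bytes, dataset_version: str, source_url: str) -> dict:
    """Describe one downloaded OEWS file so a later run can verify it is unchanged."""
    return {
        "dataset_version": dataset_version,  # a release label chosen by the pipeline
        "source_url": source_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }


# Usage with stand-in bytes instead of a real download.
meta = build_release_metadata(b"soc_code,p50\n", "May 2023", "https://www.bls.gov/oes/")
```

Re-hashing a stored file and comparing the result with the recorded checksum is enough to detect a silently changed or corrupted release.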

Cleaning and Normalization

SOC codes standardized to the latest schema.
Wage data normalized into a consistent annualized format.
Missing or suppressed fields marked with internal quality flags (see Section 7).
Outliers reviewed using percentile validation.
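The annualization step can be sketched as follows. BLS derives its published annual wages by multiplying the hourly wage by 2,080 hours (a full-time, year-round schedule). The sketch assumes WageMix applies the same factor when only an hourly figure is available. The function name is hypothetical.

```python
from typing import Optional

HOURS_PER_YEAR = 2080  # 40 hours x 52 weeks, the full-time year used by BLS


def annualize(hourly_wage: Optional[float]) -> Optional[float]:
    """Convert an hourly wage to an annual wage; pass missing values through."""
    if hourly_wage is None:
        return None
    return round(hourly_wage * HOURS_PER_YEAR, 2)


annual = annualize(25.0)  # 25.00 per hour -> 52000.0 per year
```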

Storage and Indexing

Data stored in a structured relational or analytical database.
Indexed by SOC code, year, and geographic level for efficient retrieval.
Separate tables maintain metadata and suppression indicators.
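One possible layout for this storage scheme is sketched below using SQLite from the Python standard library. The table and column names are assumptions for illustration, and the document does not specify which database engine WageMix uses. The composite primary key provides the lookup index on SOC code, year, and geographic level, and quality flags live in a separate table as described above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE wage_estimate (
    soc_code    TEXT    NOT NULL,
    year        INTEGER NOT NULL,
    geo_level   TEXT    NOT NULL,   -- e.g. 'national', 'state', 'metro'
    percentile  TEXT    NOT NULL,   -- 'p10', 'p25', 'p50', 'p75', 'p90'
    annual_wage REAL,               -- NULL when no value is available
    PRIMARY KEY (soc_code, year, geo_level, percentile)
);
CREATE TABLE wage_quality (
    soc_code     TEXT    NOT NULL,
    year         INTEGER NOT NULL,
    geo_level    TEXT    NOT NULL,
    percentile   TEXT    NOT NULL,
    quality_flag TEXT    NOT NULL,  -- see the flag table in Section 7
    PRIMARY KEY (soc_code, year, geo_level, percentile)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and apply the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


# Usage: the wage below is illustrative, not a BLS figure.
conn = open_store()
conn.execute(
    "INSERT INTO wage_estimate VALUES (?, ?, ?, ?, ?)",
    ("15-1252", 2023, "national", "p50", 100000.0),
)
rows = conn.execute(
    "SELECT percentile, annual_wage FROM wage_estimate "
    "WHERE soc_code = ? AND year = ? AND geo_level = ?",
    ("15-1252", 2023, "national"),
).fetchall()
```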

4. Data Interpretation

Percentile Definitions

P10 (10th percentile): 10% of workers earn less than this amount.
P25 (25th percentile): 25% earn less; represents the lower quartile.
Median (P50): The midpoint of the wage distribution.
P75 (75th percentile): 75% earn less; represents the upper quartile.
P90 (90th percentile): 90% earn less; only the top 10% of earners in the occupation earn more.

Trend Analysis

Trends computed as year-over-year percent changes for each percentile.
Optionally adjusted for inflation using CPI-U if configured.
Visualization highlights movement in wage distribution over five years.
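The trend calculation can be sketched as below. The function names are hypothetical, and any CPI index values passed in are the caller's responsibility; the sketch does not embed real CPI-U figures.

```python
from typing import Dict


def yoy_percent_changes(series: Dict[int, float]) -> Dict[int, float]:
    """Percent change from each available year to the next, keyed by the later year.

    If a year is missing from the series, the change spans the gap.
    """
    years = sorted(series)
    return {
        curr: (series[curr] - series[prev]) / series[prev] * 100.0
        for prev, curr in zip(years, years[1:])
    }


def to_real_dollars(series: Dict[int, float], cpi: Dict[int, float], base_year: int) -> Dict[int, float]:
    """Restate nominal wages in base-year dollars using a price index."""
    return {year: wage * cpi[base_year] / cpi[year] for year, wage in series.items()}


# Usage with made-up numbers: a wage rising 10% per year.
changes = yoy_percent_changes({2021: 100.0, 2022: 110.0, 2023: 121.0})
```

Applying `to_real_dollars` before `yoy_percent_changes` yields inflation-adjusted growth rates rather than nominal ones.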

5. Occupation Mixing Logic

Purpose

To allow users to analyze hybrid or cross-functional roles that are not represented by a single SOC occupation.

Method

Each occupation in a mix is assigned a weight (percentage).
The tool computes a weighted average for each percentile as follows:

\[ \text{BlendedPercentile}(p) = \sum_{i=1}^{n} w_i \times P_{i,p} \]

Where:

\( w_i \) = weight of occupation i (summing to 1.0)
\( P_{i,p} \) = percentile p (e.g., P10) of occupation i

Example

If a role is modeled as 70% Software Developer and 30% Data Scientist:

\[ \text{Median}_{\text{blend}} = 0.7 \times \text{Median}_{\text{Dev}} + 0.3 \times \text{Median}_{\text{DS}} \]

This calculation is performed for each percentile (P10–P90).
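The weighted average above can be sketched in a few lines. The wage figures are illustrative placeholders rather than BLS data, and the function and variable names are assumptions for this example.

```python
from typing import Dict, List, Tuple

PERCENTILES = ("p10", "p25", "p50", "p75", "p90")


def blend_percentiles(mix: List[Tuple[float, Dict[str, float]]]) -> Dict[str, float]:
    """Weighted average of each percentile across the occupations in a mix."""
    total_weight = sum(weight for weight, _ in mix)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total_weight}")
    return {
        p: sum(weight * wages[p] for weight, wages in mix)
        for p in PERCENTILES
    }


# Illustrative wages only, not BLS figures.
developer = {"p10": 70000.0, "p25": 90000.0, "p50": 120000.0, "p75": 150000.0, "p90": 180000.0}
data_scientist = {"p10": 60000.0, "p25": 80000.0, "p50": 110000.0, "p75": 140000.0, "p90": 170000.0}
blended = blend_percentiles([(0.7, developer), (0.3, data_scientist)])
```

With these placeholder inputs the blended median is 0.7 × 120,000 + 0.3 × 110,000 = 117,000.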

Validation

Ensures weighted percentiles preserve correct ordering (P10 < P25 < Median < P75 < P90).
Prevents outlier influence by capping extreme percentile deviations.
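The ordering check can be sketched as follows. A weighted average with non-negative weights preserves the ordering of its inputs, so in practice this check mainly catches malformed or partially estimated inputs. The function name and the `strict` option are assumptions for illustration.

```python
from typing import Dict

ORDER = ("p10", "p25", "p50", "p75", "p90")


def percentiles_in_order(wages: Dict[str, float], strict: bool = True) -> bool:
    """Return True when P10 < P25 < Median < P75 < P90 (or <= when strict is False)."""
    values = [wages[p] for p in ORDER]
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(lower < upper for lower, upper in pairs)
    return all(lower <= upper for lower, upper in pairs)


# Illustrative values only.
ok = percentiles_in_order({"p10": 40000, "p25": 50000, "p50": 60000, "p75": 75000, "p90": 95000})
bad = percentiles_in_order({"p10": 40000, "p25": 65000, "p50": 60000, "p75": 75000, "p90": 95000})
```

The non-strict mode is useful when two adjacent percentiles are both reported at the same bounded value.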

6. AI-Based Job Description Parsing

Goal

To automatically infer relevant occupations from job description text.

Process

Text Extraction: Job description text cleaned and tokenized.
Model Inference: NLP model (e.g., transformer-based classifier) predicts relevant SOC codes.
Scoring: Model outputs confidence scores for each occupation.
Weight Conversion: Confidence scores normalized into mix weights.
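The weight conversion step might look like the sketch below, which drops low-confidence predictions and rescales the rest so the weights sum to 1.0. The `min_score` cutoff and the function name are assumptions; the document does not say whether WageMix applies a threshold, and the scores shown are made up.

```python
from typing import Dict


def scores_to_weights(scores: Dict[str, float], min_score: float = 0.1) -> Dict[str, float]:
    """Normalize classifier confidence scores into mix weights that sum to 1.0."""
    kept = {soc: score for soc, score in scores.items() if score >= min_score}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no occupation met the minimum confidence score")
    return {soc: score / total for soc, score in kept.items()}


# Made-up scores keyed by SOC code.
weights = scores_to_weights({"15-1252": 0.6, "15-2051": 0.2, "11-3021": 0.05})
```

Here the third occupation falls below the cutoff, so the remaining two receive weights of roughly 0.75 and 0.25.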

User Control

The user can modify occupation selections and adjust mix sliders.
Saved mixes are stored with metadata (user ID, creation date, AI source).
Edited mixes can be reused or compared to new job descriptions.

7. Handling Data Anomalies and Suppression

BLS Suppression and Bounds

The BLS may suppress or cap wage data to maintain confidentiality or account for small sample sizes.
Typical examples:

Replacing exact values with thresholds (e.g., “≥ $100.00/hour”)
Omitting percentile data entirely for underrepresented occupations

Application Treatment

1. Upper-Bound Values

Represented in the UI as “≥ $X” when published explicitly by BLS.
When necessary, inferred using a bounded regression from available percentiles (e.g., estimate P90 using median and P75 slopes).
All inferred points are tagged with an estimated quality flag.

2. Suppressed or Missing Data

Labeled as “Insufficient Data” in tables and charts.
Excluded from occupation mixes by default.
If a mix includes suppressed values, the affected percentile is computed only from available data.
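One way to compute a percentile "only from available data" is to drop the occupations with missing values and rescale the remaining weights, as sketched below. Renormalizing the weights is an assumption of this example; the document does not state how WageMix redistributes weight. Names and wage values are illustrative.

```python
from typing import Dict, List, Optional, Tuple


def blend_available(
    mix: List[Tuple[float, Dict[str, Optional[float]]]], percentile: str
) -> Optional[float]:
    """Blend one percentile using only occupations that report it.

    Weights of the remaining occupations are rescaled to sum to 1.0.
    Returns None when no occupation in the mix has a value.
    """
    available = [
        (weight, wages[percentile])
        for weight, wages in mix
        if wages.get(percentile) is not None
    ]
    if not available:
        return None
    total_weight = sum(weight for weight, _ in available)
    return sum(weight * value for weight, value in available) / total_weight


# Illustrative mix in which the second occupation has a suppressed P90.
mix = [
    (0.5, {"p90": 100000.0}),
    (0.3, {"p90": None}),
    (0.2, {"p90": 150000.0}),
]
p90 = blend_available(mix, "p90")
```

In this example the result is (0.5 × 100,000 + 0.2 × 150,000) / 0.7, so the suppressed occupation no longer influences the blended P90.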

3. Quality Indicators

Each data point carries a quality flag:

Flag	Description
D	Direct BLS published value
B	Bounded (e.g., ≥ $X)
E	Estimated via interpolation or regression
S	Suppressed / Not available
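Assigning the D, B, and S flags at ingestion time can be sketched as follows. OEWS tables mark an unavailable wage with "*" and a wage at or above the published ceiling with "#"; the sketch assumes the raw files use those markers. The function name is hypothetical, and the ceiling is supplied by the caller because it varies by release. The E flag is not assigned here, since it applies only to values estimated later in the pipeline.

```python
from typing import Optional, Tuple


def parse_wage_cell(raw: str, upper_bound: float) -> Tuple[Optional[float], str]:
    """Turn one raw wage cell into (value, quality_flag)."""
    text = raw.strip()
    if text == "#":
        # Wage is at or above the published ceiling: keep the ceiling, flag as bounded.
        return upper_bound, "B"
    if text in {"", "*", "**"}:
        # Estimate not released.
        return None, "S"
    # Ordinary published value, possibly formatted with "$" or thousands separators.
    return float(text.replace("$", "").replace(",", "")), "D"


# Usage: 208000.0 is the annual equivalent of the 100.00/hour example ceiling.
direct = parse_wage_cell("48,060", 208000.0)
bounded = parse_wage_cell("#", 208000.0)
suppressed = parse_wage_cell("*", 208000.0)
```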

Transparency

All visualizations include tooltips or footnotes describing estimated or bounded values.

8. Data Validation and Quality Control

Cross-Checks

Validate national aggregates against official BLS totals.
Confirm percentile ordering (P10 < P25 < Median < P75 < P90).
Compare year-over-year trends to detect anomalies.

Change Detection

Identify occupations where any percentile shifts by more than 50% year over year.
Mark these as outlier candidates for manual review.
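A minimal sketch of this check is shown below, assuming the shift is measured as the year-over-year percent change in each percentile. The nested dictionary layout, the function name, and the placeholder SOC codes are all hypothetical.

```python
from typing import Dict, List, Tuple

History = Dict[str, Dict[str, Dict[int, float]]]  # soc_code -> percentile -> year -> wage


def find_outlier_candidates(
    history: History, threshold_pct: float = 50.0
) -> List[Tuple[str, str, int, float]]:
    """List (soc_code, percentile, year, change_pct) where the change exceeds the threshold."""
    candidates = []
    for soc_code, by_percentile in history.items():
        for percentile, series in by_percentile.items():
            years = sorted(series)
            for prev, curr in zip(years, years[1:]):
                change = (series[curr] - series[prev]) / series[prev] * 100.0
                if abs(change) > threshold_pct:
                    candidates.append((soc_code, percentile, curr, round(change, 1)))
    return candidates


# Made-up data: the first occupation jumps 60%, the second moves 4%.
history = {
    "00-0001": {"p50": {2022: 50000.0, 2023: 80000.0}},
    "00-0002": {"p50": {2022: 50000.0, 2023: 52000.0}},
}
flagged = find_outlier_candidates(history)
```

Using the absolute value of the change means sharp drops are flagged for review as well as sharp increases.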

Auditability

Maintain full provenance for every record (source year, retrieval date, processing timestamp).
Versioned data pipeline ensures historical reproducibility.

9. Output and Visualization

Charts

Percentile band charts showing P10–P90 across five years.
Line and area visualizations for median trend and distribution width.
Overlay charts for blended occupations.

Interactivity

Occupation selection and weight sliders.
Toggle visibility for individual percentiles.

10. Limitations and Caveats

Data Lag: OEWS data represents prior-year estimates, not real-time wages.
Geographic Aggregation: National estimates may obscure regional differences.
Model Bias: AI mappings depend on training coverage; niche roles may be underrepresented.
Suppression Uncertainty: Estimates for upper-bounded data introduce potential variance.
Weighting Assumptions: Occupation mixes assume linear wage blending, which may oversimplify complex role structures.

11. Future Enhancements

Incorporate inflation-adjusted or cost-of-living normalization.
Add regional filtering for metro/state-level insights.
Introduce confidence intervals for blended percentiles.
Expand AI model integration with O*NET skill and task taxonomy.
Enable time-adjusted forecasting using regression or ARIMA models.

12. References

U.S. Bureau of Labor Statistics. Occupational Employment and Wage Statistics (OEWS).
https://www.bls.gov/oes/
U.S. Bureau of Labor Statistics. Standard Occupational Classification (SOC) System.
https://www.bls.gov/soc/