Predict Salary of Employee based on Years of Experience
In the ever-evolving landscape of career development, understanding the nuanced relationship between professional experience and salary is a pivotal aspect for both job seekers and employers. The concept that the number of years an individual spends in a particular field directly influences their earning potential has been a subject of interest and analysis. This article delves into the realm of predictive modeling, specifically aiming to forecast salaries based on years of professional experience.
Predicting salary from years of experience serves several purposes, each of which deepens our understanding of the relationship between professional experience and compensation. The main ones are:
Informed Career Decisions: Empowering individuals with the ability to predict salaries based on years of experience aids in making informed career decisions. Job seekers can better understand the expected trajectory of their compensation as they gain more professional expertise.
Negotiation and Advancement: Armed with insights into how salary evolves with experience, employees can negotiate compensation more effectively during job offers and promotions. Understanding the correlation allows for strategic career planning.
Employer Decision-Making: Employers can utilize predictive models to make informed decisions about salary structures. This includes setting competitive salaries, aligning compensation with industry standards, and recognizing the value of experience in the workforce.
Optimized Human Resources Strategies: Human resources professionals can benefit from insights into salary prediction to develop and optimize compensation strategies within organizations. This includes talent acquisition, employee retention, and overall workforce planning.
Data-Driven Hiring Practices: For hiring managers, having a predictive model for salary based on experience provides a data-driven approach to recruitment. This can lead to fairer and more transparent hiring processes.
Career Planning and Development: Individuals can use the predictive insights for long-term career planning and development. Knowing how salaries typically evolve with experience enables professionals to set realistic goals and expectations.
Understanding Market Trends: The analysis of salary prediction models contributes to a broader understanding of market trends in compensation. This information is valuable not only for individuals and organizations but also for researchers and policymakers.
Enhanced Workforce Productivity: A clear understanding of how salaries correlate with experience can contribute to increased job satisfaction and productivity. Employees who feel fairly compensated are likely to be more engaged and motivated in their roles.
Let’s implement this problem in Python using machine learning.
Suppose you have a DataFrame containing “Years of Experience” and “Salary” for a set of employees. The goal is to understand how an employee’s experience affects salary, and here we model that relationship as linear.
For clarity, the solution is broken into steps.
Step 1 : Create the DataFrame:
We can build the model either from a dataset we create ourselves or from a CSV file with the same column names.
When we create the data manually, it looks like this:
import pandas as pd

# Create a DataFrame with employee data
data = {'years_of_experience': [2, 5, 7, 10, 3, 6, 8, 12, 4, 9,
                                1, 11, 5, 3, 6, 9, 2, 8, 10, 4],
        'salary': [50000, 70000, 85000, 105000, 55000, 72000, 88000, 115000, 60000, 95000,
                   48000, 110000, 68000, 56000, 74000, 99000, 52000, 90000, 112000, 62000]}
df = pd.DataFrame(data)
print(df)
If we instead load a CSV file into a DataFrame, it looks like this:
import pandas as pd

# Load the dataset
df = pd.read_csv('data.csv')
Replace ‘data.csv’ with the path to your file.
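When the data comes from a CSV file, it can help to confirm that the file really has the two columns the later steps rely on. The following is a minimal sketch under assumptions: the in-memory CSV text and its three sample rows are hypothetical stand-ins for a real file, and the column check is an optional safeguard rather than part of the original walkthrough.

```python
import io

import pandas as pd

# Stand-in for a real file: an in-memory CSV with the two expected columns.
# (Hypothetical sample rows; pass your own file path to read_csv instead.)
csv_text = """years_of_experience,salary
2,50000
5,70000
7,85000
"""

df = pd.read_csv(io.StringIO(csv_text))

# Fail early if the file lacks a column that the later steps depend on.
expected_columns = {"years_of_experience", "salary"}
missing = expected_columns - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing required columns: {sorted(missing)}")

# Drop rows with missing values so the model is not trained on NaNs.
df = df.dropna(subset=sorted(expected_columns))
print(df.shape)
```

With a real file, only the argument to `pd.read_csv` changes; the check and the cleanup stay the same.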
Step 2 : Import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Step 3 : Split the DataFrame into the input feature X (years_of_experience) and the target variable y (salary).
# Split the data into features (Years of Experience) and target (Salary)
X = df[['years_of_experience']]
y = df['salary']
Step 4 : Split the data into training and testing sets, using 80% for training and 20% for testing. Setting random_state=42 fixes the random seed, so the same split is produced on every run.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
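To make the effect of the split concrete, the sketch below rebuilds the article's 20-row dataset and checks the sizes of the two subsets. It also repeats the split with the same seed to illustrate reproducibility. The variable `X_test_again` is introduced here purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The same 20 rows used in the article.
df = pd.DataFrame({
    "years_of_experience": [2, 5, 7, 10, 3, 6, 8, 12, 4, 9,
                            1, 11, 5, 3, 6, 9, 2, 8, 10, 4],
    "salary": [50000, 70000, 85000, 105000, 55000, 72000, 88000, 115000, 60000, 95000,
               48000, 110000, 68000, 56000, 74000, 99000, 52000, 90000, 112000, 62000],
})
X = df[["years_of_experience"]]
y = df["salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 20 rows with test_size=0.2 gives 16 training rows and 4 test rows.
print(len(X_train), len(X_test))

# Repeating the split with the same seed selects the same test rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_test.index.equals(X_test_again.index))
```

With only four test rows, the evaluation metrics computed later rest on very little data, so they should be read as a rough indication rather than a precise estimate.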
Step 5 : Initialize a linear regression model.
# Create a Linear Regression model
model = LinearRegression()
Step 6 : Train the linear regression model on the training data.
# Train the model on the training data
model.fit(X_train, y_train)
Step 7 : Use the trained model to make predictions on the test data, then calculate the R-squared score (r2) and the Mean Squared Error (mse) to evaluate the model’s performance.
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
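To show what these two metrics measure, here is a small sketch that computes them by hand in plain Python. The four actual and predicted salaries are hypothetical values chosen for illustration, not output from the model above, and `y_hat`, `ss_res`, and `ss_tot` are names introduced here.

```python
# Hypothetical actual and predicted salaries for four test rows.
y_true = [50000, 90000, 99000, 70000]
y_hat = [51000, 92000, 98000, 69000]

n = len(y_true)
residuals = [actual - predicted for actual, predicted in zip(y_true, y_hat)]

# MSE: the average of the squared residuals (in squared salary units).
mse = sum(r ** 2 for r in residuals) / n

# RMSE: the square root of MSE, back in the same units as salary.
rmse = mse ** 0.5

# R-squared: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R2:   {r2:.4f}")
```

Because the residuals are squared, errors of only one or two thousand already produce an MSE in the millions, even when R-squared is close to 1. The RMSE is often easier to read because it is in salary units.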
Step 8 : Plot the test data points and the fitted line obtained from the linear regression model. Label the axes, add a title and legend, and show the plot.
# Plot the data points and the fitted line
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', label='Fitted Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Linear Regression')
plt.legend()
plt.show()
Step 9 : Print the Mean Squared Error and R-squared score calculated earlier, which indicate the model’s accuracy and fit to the data.
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
Let’s combine all these steps into one script:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a DataFrame with employee data
data = {'years_of_experience': [2, 5, 7, 10, 3, 6, 8, 12, 4, 9,
                                1, 11, 5, 3, 6, 9, 2, 8, 10, 4],
        'salary': [50000, 70000, 85000, 105000, 55000, 72000, 88000, 115000, 60000, 95000,
                   48000, 110000, 68000, 56000, 74000, 99000, 52000, 90000, 112000, 62000]}
df = pd.DataFrame(data)

# Split the data into features (Years of Experience) and target (Salary)
X = df[['years_of_experience']]
y = df['salary']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Plot the data points and the fitted line
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', label='Fitted Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Linear Regression')
plt.legend()
plt.show()

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
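After executing the code above, you will see the chart, followed by the two metrics printed to the console:
Mean Squared Error: 7.79
R-squared Score: 1.00
Note that MSE is expressed in squared salary units, so with salaries in the tens of thousands the value you obtain from the sample data above will typically be far larger than the figure shown here. The exact numbers depend on the dataset and the split.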
Let’s look at some insights from the model.
Positive Correlation:
The scatter plot and fitted line indicate a positive correlation between years of experience and salary. As the years of experience increase, there is a general trend of higher salaries.
Model Accuracy:
The R-squared score and Mean Squared Error (MSE) summarize how well the line fits the test data. R-squared is the proportion of the variability in salary that can be explained by years of experience, so a value close to 1 means the line accounts for nearly all of the variation. MSE is the average squared prediction error, expressed in squared salary units, so lower is better.
Prediction Capability:
The fitted line represents the model’s ability to predict salaries based on years of experience. Predictions can be made for experience values not included in the training data, allowing for estimates of employee salaries at different levels of experience.
Interpretability:
The slope of the fitted line represents the average increase in salary for each additional year of experience. This interpretable feature allows for a straightforward understanding of the relationship between the independent variable (years of experience) and the dependent variable (salary).
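The slope and intercept mentioned above can be computed directly with the closed-form least-squares formulas, which may help show where the fitted line comes from. This is a sketch under one stated assumption: it fits all 20 rows of the article's dataset rather than the 16-row training split, so the numbers will differ somewhat from the trained model's `model.coef_` and `model.intercept_` attributes.

```python
# The article's dataset, as plain Python lists.
years = [2, 5, 7, 10, 3, 6, 8, 12, 4, 9,
         1, 11, 5, 3, 6, 9, 2, 8, 10, 4]
salaries = [50000, 70000, 85000, 105000, 55000, 72000, 88000, 115000, 60000, 95000,
            48000, 110000, 68000, 56000, 74000, 99000, 52000, 90000, 112000, 62000]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(salaries) / n

# Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, salaries))
s_xx = sum((x - mean_x) ** 2 for x in years)
slope = s_xy / s_xx

# The least-squares line always passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

print(f"Each additional year adds about ${slope:,.0f} to the predicted salary")
print(f"Predicted salary at zero years of experience: ${intercept:,.0f}")
```

The intercept is the line's value at zero years of experience. Since the data starts at one year, it is an extrapolation and should be interpreted with care.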
Time to Predict the Salary based on Experience
# Read the years of experience to predict for
experience = float(input("Enter the years of experience: "))
# Wrap the value in a DataFrame with the same column name used for training
new_experience = pd.DataFrame({'years_of_experience': [experience]})
# Make a prediction using the trained model
predicted_salary = model.predict(new_experience)
# Display the prediction
print(f"Predicted Salary for {experience:g} years of experience: ${predicted_salary[0]:,.2f}")
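For predicting several values at once without interactive input, a sketch along the following lines may be useful. It makes a few assumptions: the model is trained on all 20 rows for brevity rather than on the training split, the three experience values are hypothetical, and the `extrapolated` column is an illustrative addition that flags inputs outside the range seen in the data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The article's dataset.
df = pd.DataFrame({
    "years_of_experience": [2, 5, 7, 10, 3, 6, 8, 12, 4, 9,
                            1, 11, 5, 3, 6, 9, 2, 8, 10, 4],
    "salary": [50000, 70000, 85000, 105000, 55000, 72000, 88000, 115000, 60000, 95000,
               48000, 110000, 68000, 56000, 74000, 99000, 52000, 90000, 112000, 62000],
})

# Assumption: fit on every row to keep the example short.
model = LinearRegression()
model.fit(df[["years_of_experience"]], df["salary"])

# Hypothetical experience values to predict for.
new_data = pd.DataFrame({"years_of_experience": [0.5, 7, 15]})
new_data["predicted_salary"] = model.predict(new_data[["years_of_experience"]])

# Flag values outside the observed range (1 to 12 years in this dataset),
# where the linear trend is not backed by any data.
lowest = df["years_of_experience"].min()
highest = df["years_of_experience"].max()
new_data["extrapolated"] = ~new_data["years_of_experience"].between(lowest, highest)

print(new_data)
```

Passing a DataFrame with the training column name, rather than a bare array, keeps the input consistent with how the model was fitted.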
FAQs on Salary Prediction
1. What does the code aim to achieve?
The code aims to build a linear regression model to predict employee salaries based on their years of experience.
2. Why is linear regression chosen for this scenario?
Linear regression is chosen because the data show a roughly linear relationship between the independent variable (years of experience) and the dependent variable (salary), and because the resulting model is simple and easy to interpret: a single slope and intercept describe how salary changes with experience.
3. How is the dataset split into training and testing sets, and why is it important?
The dataset is split using the train_test_split function from scikit-learn. This division is crucial to train the model on one subset and evaluate its performance on another, ensuring its ability to generalize to new, unseen data.
4. What do Mean Squared Error (MSE) and R-squared (R2) score represent in this context?
MSE measures the average squared difference between predicted and actual salaries. R-squared score represents the proportion of variability in salary explained by years of experience. Lower MSE and higher R2 indicate better model performance.
5. How is the model’s accuracy visualized in the code?
The code includes a scatter plot with the test data points and a fitted line representing the model’s predictions. This visualization allows a quick assessment of how well the model captures the relationship between years of experience and salary.
6. Can this code be applied to predict salaries for new employees?
Yes, the trained linear regression model can be used to predict salaries for new employees based on their years of experience.
7. What insights can be gained from the fitted line in the scatter plot?
The slope of the fitted line shows the average increase in salary for each additional year of experience, and the line as a whole provides a visual interpretation of the model’s predictions.
8. What are the limitations of this linear regression model?
The model assumes a linear relationship, and deviations from linearity may affect predictions. It may not capture complex relationships or consider other factors influencing salary.
9. How can this code be extended for more advanced analysis?
To extend the analysis, one could explore additional features, consider non-linear relationships, or experiment with more advanced machine learning models.
10. In what real-world scenarios could this code be applied?
The code is applicable in scenarios such as HR and workforce planning, where understanding the relationship between years of experience and salary is crucial for making informed decisions about compensation structures.
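As one possible way to explore the non-linear extension mentioned in FAQ 9, the sketch below compares the plain linear model with a degree-2 polynomial model using 5-fold cross-validation. The choice of degree 2, of five folds, and of mean squared error as the comparison metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# The article's dataset as NumPy arrays (X must be two-dimensional).
X = np.array([2, 5, 7, 10, 3, 6, 8, 12, 4, 9,
              1, 11, 5, 3, 6, 9, 2, 8, 10, 4]).reshape(-1, 1)
y = np.array([50000, 70000, 85000, 105000, 55000, 72000, 88000, 115000, 60000, 95000,
              48000, 110000, 68000, 56000, 74000, 99000, 52000, 90000, 112000, 62000])

candidates = {
    "linear": LinearRegression(),
    # Adds a squared-experience term before fitting a linear model.
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

results = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(estimator, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    results[name] = -scores.mean()
    print(f"{name}: cross-validated MSE = {results[name]:,.0f}")
```

If the quadratic model does not clearly lower the cross-validated error, the simpler linear model is usually the better choice for a dataset this small.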