Predict Score on the basis of Studied Hours
You have a DataFrame containing information about students, including their “Study Hours” and “Exam Scores.” How could you use linear regression to predict a student’s exam score based on the number of hours they studied?
To answer this question, let’s divide the solution into steps :
Step 1 : Creates the DataFrame :
import pandas as pd

# Create a DataFrame with student data
data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
You can also create the DataFrame from random values using the random module of the NumPy library, for example:
import pandas as pd
import numpy as np

# Generating random data
np.random.seed(0)
num_students = 100
# Random integer study hours from 1 to 9 (the upper bound 10 is excluded)
study_hours = np.random.randint(1, 10, num_students)
# Exam scores based on study hours, plus normally distributed noise
exam_scores = 50 + 10 * study_hours + np.random.normal(0, 5, num_students)

# Creating the DataFrame
data = {'study_hours': study_hours, 'exam_scores': exam_scores}
df = pd.DataFrame(data)

# Displaying the first few rows of the DataFrame
print(df.head())
This will create a DataFrame with two columns: “study_hours” and “exam_scores”. Now, let’s use linear regression to predict exam scores based on study hours. We can use the scikit-learn library for this purpose:
Step 2 : Imports the necessary libraries :
Imports the matplotlib.pyplot module, which is used for data visualization.
Imports necessary functions and classes from scikit-learn for data splitting, linear regression modeling, and performance evaluation.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Step 3 : Splits the DataFrame into input features X (study hours) and target variable y (exam scores).
X = df[['study_hours']]
y = df['exam_scores']
Step 4 : Splits the data into training and testing sets using 80% for training and 20% for testing. The random_state=0 ensures reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
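As a quick check on the split described above, the following sketch rebuilds the same DataFrame and prints the shapes of the resulting sets. With 20 rows and test_size=0.2, 16 rows go to training and 4 to testing. The names ending in _again are introduced here only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X = df[['study_hours']]
y = df['exam_scores']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (16, 1) (4, 1)

# Repeating the split with the same random_state selects the same rows
X_train_again, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_test.index.equals(X_test_again.index))  # True
```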
Step 5 : Initializes a linear regression model.
model = LinearRegression()
Step 6 : Trains the linear regression model using the training data.
model.fit(X_train, y_train)
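After fitting, the learned line can be read from the model’s coef_ and intercept_ attributes. The sketch below assumes the same data and split as the steps above; slope and intercept are names chosen here for readability.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X = df[['study_hours']]
y = df['exam_scores']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# With one feature, coef_ holds a single slope value
slope = model.coef_[0]
intercept = model.intercept_
print(f"exam_score = {intercept:.2f} + {slope:.2f} * study_hours")
```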
Step 7 : Uses the trained model to make predictions on the test data and calculates the R-squared score (r2) and Mean Squared Error (mse) to evaluate the model’s performance.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
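To make the two metrics less of a black box, the sketch below recomputes them by hand with NumPy and compares the results with scikit-learn’s functions. It assumes the same data and split as above; mse_manual, r2_manual, ss_res and ss_tot are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['study_hours']], df['exam_scores'], test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_true = y_test.to_numpy()

# MSE: average of the squared differences between actual and predicted scores
mse_manual = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_test, y_pred)))
print(np.isclose(r2_manual, r2_score(y_test, y_pred)))
```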
Step 8 : Plots the test data points and the fitted line obtained from the linear regression model. Labels the axes, provides a title, and shows the plot.
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', label='Fitted Line')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Linear Regression')
plt.legend()
plt.show()
Step 9 : Prints the Mean Squared Error and R-squared score calculated earlier, providing insights into the model’s accuracy and fit to the data.
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-Squared Score : {r2:.2f}")
Let’s combine all these steps and the code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a DataFrame with student data
data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)

# Split the data into features (Study Hours) and target (Exam Scores)
X = df[['study_hours']]
y = df['exam_scores']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the R-squared score and the Mean Squared Error (MSE)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Plot the data points and the fitted line
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', label='Fitted Line')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Linear Regression')
plt.legend()
plt.show()

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-Squared Score : {r2:.2f}")
You will see the below chart after executing the above code :
The code provided generates a scatter plot (plt.scatter()) representing the test data points and overlays a red line plot (plt.plot()) depicting the fitted line obtained from the linear regression model. Let’s break down what this visualization represents:
1. Scatter Plot (Blue Points):
- X-Axis: Study Hours
- Y-Axis: Exam Scores
- Each blue point on the scatter plot represents a data point from the test dataset. The x-coordinate represents the study hours, and the y-coordinate represents the corresponding exam scores.
2. Fitted Line (Red Line):
- The red line connects the model’s predicted exam scores for the study hours in the test set.
- The slope and intercept of this line are determined by the regression model during training.
- This line represents the model’s best attempt to capture the underlying relationship between study hours and exam scores. It is learned from the training data and drawn here against the test data.
Insights:
- By visually comparing the scatter plot of the test data points with the fitted red line, you can observe how well the linear regression model fits the given data.
- If the fitted line closely follows the trend of the data points, it indicates that the linear regression model has captured the underlying pattern in the relationship between study hours and exam scores.
- Any deviations between the data points and the fitted line might suggest areas where the model does not perform well. It’s essential to consider factors like outliers, noise in the data, or non-linear relationships when interpreting the fit.
- The visualization provides a clear representation of how the linear regression model predicts exam scores based on study hours, making it easier to explain the model’s behavior to stakeholders or colleagues.
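The deviations mentioned in the insights above can also be listed numerically as residuals (actual score minus predicted score). The sketch below assumes the same data and split as the earlier steps; the report table and its column names are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['study_hours']], df['exam_scores'], test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# One row per test student: a large absolute residual marks a poorly fitted point
report = pd.DataFrame({
    'study_hours': X_test['study_hours'],
    'actual': y_test,
    'predicted': y_pred,
    'residual': y_test - y_pred,
})
print(report.round(2))
```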
Time to Predict the Score on the basis of Studied Hours
# Study hours for which you want to predict the exam score
study_hours = float(input("Enter studied hours: "))

# Wrap the value in a DataFrame with the same column name used during training
new_data = pd.DataFrame({'study_hours': [study_hours]})

# Use the trained model to predict the exam score for the given study hours
predicted_marks = model.predict(new_data)
print(f"Predicted score for {study_hours} hours is: {predicted_marks[0]:.2f}")
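For a non-interactive variant, the sketch below predicts scores for several study-hour values at once. The values 20, 30 and 40 are arbitrary examples chosen here, and the model is refitted on the same split as above so that the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['study_hours']], df['exam_scores'], test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Example inputs (arbitrary); the column name matches the one used in training
new_hours = pd.DataFrame({'study_hours': [20, 30, 40]})
predicted = model.predict(new_hours)

for hours, score in zip(new_hours['study_hours'], predicted):
    print(f"Predicted score for {hours} hours is: {score:.2f}")
```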
FAQs on the Linear Regression Problem
What is Linear Regression, and how is it applied in predicting scores based on study hours?
Linear Regression is a statistical method used to model the relationship between a dependent variable (such as exam scores) and one or more independent variables (such as study hours). In predicting scores from study hours, linear regression finds the linear equation that best fits the relationship between these variables.
What are the key steps involved in performing a Linear Regression for score prediction?
The key steps include data collection, data preprocessing, splitting the data into training and testing sets, creating a linear regression model, training the model, making predictions, and evaluating the model’s performance using metrics like Mean Squared Error (MSE) or R-squared.
Why is it essential to split the data into training and testing sets when working with Linear Regression?
Splitting the data helps in training the model on one subset and testing its performance on another. This ensures that the model does not simply memorize the data but generalizes well to unseen data, providing a reliable evaluation of its predictive ability.
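One way to see why the held-out set matters is to compare the model’s R-squared on the data it was trained on with its R-squared on the test data. The sketch below assumes the same data and split as the steps above; train_r2 and test_r2 are names introduced here. With only 4 test rows, the test value is a rough estimate.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['study_hours']], df['exam_scores'], test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# score() returns R-squared for a regression model
train_r2 = model.score(X_train, y_train)  # fit quality on data the model has seen
test_r2 = model.score(X_test, y_test)     # fit quality on unseen data

print(f"Train R-squared: {train_r2:.3f}")
print(f"Test R-squared : {test_r2:.3f}")
```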
What role do coefficients play in a Linear Regression equation for score prediction?
Coefficients in a Linear Regression equation quantify the relationship between the independent and dependent variables. In the context of predicting scores from study hours, the coefficient indicates how much the predicted score changes for each additional hour studied.
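As a small illustration of this point, the sketch below checks that the predictions for two inputs one hour apart differ by exactly the fitted coefficient. For brevity the model is fitted on all 20 rows of the example data, which is an assumption made here rather than the split used earlier; 20 and 21 hours are arbitrary inputs.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)

# Fitted on the full dataset for brevity (no train/test split here)
model = LinearRegression().fit(df[['study_hours']], df['exam_scores'])

# Predict for two inputs that differ by one hour
two_points = pd.DataFrame({'study_hours': [20, 21]})
pred_20, pred_21 = model.predict(two_points)

print(f"Change per extra hour: {pred_21 - pred_20:.4f}")
print(f"Fitted coefficient   : {model.coef_[0]:.4f}")
```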
How can Linear Regression be affected by outliers in the data when predicting scores from study hours?
Outliers can significantly impact Linear Regression by skewing the regression line. They can disproportionately influence the slope and intercept of the line, leading to inaccurate predictions. Identifying and handling outliers is crucial to maintain the model’s accuracy.
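The effect can be demonstrated by adding a single made-up outlier to the example data and refitting. In the sketch below, the extra row (45 hours, score 20) is hypothetical and not part of the original dataset, and both models are fitted on all rows for brevity.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)

# Hypothetical outlier: many study hours but a very low score
outlier = pd.DataFrame({'study_hours': [45], 'exam_scores': [20]})
df_outlier = pd.concat([df, outlier], ignore_index=True)

clean_model = LinearRegression().fit(df[['study_hours']], df['exam_scores'])
outlier_model = LinearRegression().fit(df_outlier[['study_hours']], df_outlier['exam_scores'])

slope_clean = clean_model.coef_[0]
slope_outlier = outlier_model.coef_[0]

# The single outlier pulls the fitted line down at the high end, flattening the slope
print(f"Slope without outlier: {slope_clean:.3f}")
print(f"Slope with outlier   : {slope_outlier:.3f}")
```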
Is Linear Regression the only method for predicting scores based on study hours?
No. While Linear Regression is commonly used, other machine learning techniques such as Decision Trees, Random Forest, and Neural Networks can also be applied to score prediction. The choice of method depends on the complexity of the relationship and the dataset.
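As a rough comparison, the sketch below fits scikit-learn’s DecisionTreeRegressor and RandomForestRegressor on the same split and reports the test MSE of each model. The hyperparameters (n_estimators=100, random_state=0) are choices made here for illustration, and with only 4 test rows the numbers should not be read as a verdict on which method is better.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['study_hours']], df['exam_scores'], test_size=0.2, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training set and record its MSE on the test set
results = {}
for name, candidate in models.items():
    candidate.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, candidate.predict(X_test))

for name, mse in results.items():
    print(f"{name}: test MSE = {mse:.2f}")
```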
How can one interpret the results obtained from a Linear Regression model predicting scores from study hours?
Interpretation involves reading the coefficient to see how much a one-hour change in study time affects the predicted score. Additionally, model evaluation metrics like MSE or R-squared provide insights into the accuracy and goodness of fit of the model.
Are there any specific preprocessing techniques applied to study-hours data before performing Linear Regression?
Yes, preprocessing techniques like normalization or standardization can be applied to put features on a consistent scale. For ordinary least squares with a single feature, rescaling changes the units of the coefficient but not the predictions; scaling matters more when several features on different scales are combined or when regularization is used.
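To see what standardization does in this single-feature setting, the sketch below compares a plain LinearRegression with a pipeline that applies StandardScaler first. The pipeline is an illustration added here, not a step from the walkthrough above; for ordinary least squares the two models give the same predictions, and only the coefficient’s units change.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = {
    'study_hours': [12, 21, 31, 44, 15, 25, 37, 42, 27, 17, 14, 23, 33, 46, 19, 35, 39, 40, 24, 18],
    'exam_scores': [50, 70, 84, 97, 52, 73, 87, 95, 75, 58, 52, 72, 87, 99, 62, 85, 90, 92, 73, 59]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['study_hours']], df['exam_scores'], test_size=0.2, random_state=0)

plain = LinearRegression().fit(X_train, y_train)

# The scaler is fitted on the training data only, then applied before the regression
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

same_predictions = np.allclose(plain.predict(X_test), scaled.predict(X_test))
print(f"Predictions identical: {same_predictions}")
print(f"Coefficient per hour: {plain.coef_[0]:.3f}")
print(f"Coefficient per standard deviation of hours: {scaled[-1].coef_[0]:.3f}")
```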
Can Linear Regression predict scores accurately for various subjects, or does it require customization for each subject?
Linear Regression can predict scores for different subjects if there is a linear relationship between study hours and scores across subjects. However, a separate model per subject might be necessary if the relationship varies significantly between subjects.
How can one improve the accuracy of a Linear Regression model predicting scores based on study hours?
Improving accuracy can be achieved through feature engineering, handling outliers, using advanced regression techniques, and ensuring a representative dataset. Regular evaluation and refinement of the model also contribute to enhanced accuracy.