In this post, I’m going to train a machine learning model to predict mobile phone price from this Kaggle dataset.

The task is as follows: given a phone’s specifications (e.g. RAM, number of processors, camera resolution, etc.), predict the phone price range (low, medium, high, or very high).

There are several differences here compared to the approach I used in my previous post:

Loading the Modules
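
The import cell is not shown in this copy of the post. Judging only from the names used in the code further down, a plausible set of imports looks like the following. The exact grouping is an assumption on my part, not the original cell.

```python
# Assumed imports, reconstructed from the names used later in the post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             ConfusionMatrixDisplay, classification_report)
```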

Loading the Data

df = pd.read_csv('./data/train.csv')
print('Data shape:', df.shape)


Data shape: (2000, 21)
Below are the definitions of each variable, according to the Kaggle page:

Checking each variable’s type:
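
A minimal sketch of this check, run on a small made-up frame so that it is self-contained. The column names are borrowed from the dataset, the values are invented, and toy_df is a hypothetical stand-in for df. On the real data the equivalent call would be df.dtypes.

```python
import pandas as pd

# Toy stand-in for the real data: invented values, dataset-style column names
toy_df = pd.DataFrame({
    'ram': [2549, 2631, 2603],
    'clock_speed': [2.2, 0.5, 0.5],
    'blue': [0, 1, 1],
})

# One dtype per column
toy_dtypes = toy_df.dtypes
print(toy_dtypes)
```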


Checking for missing values:
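
One common way to do this is to count the missing entries per column. The sketch below uses a made-up frame with one deliberately missing value (toy_df is hypothetical). On the real data the equivalent call would be df.isna().sum().

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value in 'ram'
toy_df = pd.DataFrame({
    'ram': [2549, np.nan, 2603],
    'blue': [0, 1, 1],
})

# Number of missing entries per column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```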



Finding out which variables are binary (only have 2 unique values):

df.apply(lambda col: col.nunique(), axis=0).sort_values()


Defining the binary variables:

bin_vars = ['blue', 'touch_screen', 'dual_sim', 'four_g', 'three_g', 'wifi']
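
The same list could also be derived instead of typed out. Here is a sketch on a made-up frame, assuming "binary" simply means exactly two distinct values. The names toy_df and derived_bin_vars are hypothetical.

```python
import pandas as pd

# Toy frame: two binary columns, one numeric column, and the target
toy_df = pd.DataFrame({
    'blue': [0, 1, 1, 0],
    'wifi': [1, 1, 0, 0],
    'ram': [2549, 2631, 2603, 2769],
    'price_range': [1, 2, 2, 3],
})

# Keep the columns that have exactly two unique values
n_unique = toy_df.nunique()
derived_bin_vars = n_unique[n_unique == 2].index.tolist()
print(derived_bin_vars)
```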

Train/Test Split

The data consist of 2000 rows, which is not a lot. So I’m just going to split the data into two sets: the training set and the test set. To choose the best model, later on I’m going to use K-fold cross-validation.

df_train, df_test = train_test_split(df, test_size=0.3, random_state=112)

Exploratory Data Analysis

Correlation Heatmap

sns.heatmap(df_train.corr())  # assumed call; the original plotting line is missing here
plt.title('Correlation Heatmap')

Correlation heatmap

Target Variable Distribution

sns.countplot(df_train, y='price_range')
plt.title('Target Variable Distribution')

Target variable distribution

Observation: the target variable seems to be well-balanced.
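
The balance can also be checked numerically rather than by eye. A small sketch with invented labels (toy_target is a hypothetical stand-in for df_train['price_range']):

```python
import pandas as pd

# Invented labels standing in for the real target column
toy_target = pd.Series([0, 1, 2, 3, 0, 1, 2, 3], name='price_range')

# Share of rows per class; a perfectly balanced 4-class target gives 0.25 each
class_shares = toy_target.value_counts(normalize=True).sort_index()
print(class_shares)
```

With four balanced classes, a baseline that always predicts a single class should land near 25% accuracy, which is roughly what the dummy classifier scores later on.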

Numerical Variable Distributions

df_long = df_train.melt(id_vars='price_range')
is_bin_vars = df_long['variable'].isin(bin_vars)
df_long_num = df_long[~is_bin_vars]
df_long_cat = df_long[is_bin_vars]

g = sns.FacetGrid(df_long_num, col='variable', hue='price_range', col_wrap=3, sharex=False, sharey=False)
g.map_dataframe(sns.kdeplot, x='value')

Numerical variable kde plot

sns.catplot(df_long_num, kind='box',
            col='variable', col_wrap=3,
            x='price_range', y='value',
            sharex=False, sharey=False)

Numerical variable box plot

Observation: only the ram variable shows a clear, roughly linear relationship with the price range (more expensive phones tend to have more RAM).
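
To put a number on what the box plot suggests, one option is to compare the mean RAM per price range and compute the correlation. This is only a sketch on invented data (toy_df is hypothetical), not the figures from the real dataset.

```python
import pandas as pd

# Invented data in which RAM rises with the price range
toy_df = pd.DataFrame({
    'price_range': [0, 0, 1, 1, 2, 2, 3, 3],
    'ram': [500, 900, 1500, 1900, 2400, 2800, 3300, 3700],
})

# Mean RAM per price range: it rises steadily if the trend is monotonic
ram_by_price = toy_df.groupby('price_range')['ram'].mean()
print(ram_by_price)

# Pearson correlation as a one-number summary of the linear trend
ram_corr = toy_df['ram'].corr(toy_df['price_range'])
print(ram_corr)
```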

Binary Variable Distributions

g = sns.catplot(df_long_cat, kind='count',
            col='variable', col_wrap=3,
            y='value',
            sharex=False, sharey=False)
g.fig.suptitle('Count Plot of Binary Variables')

Binary variables count plot

sns.catplot(df_long_cat, kind='count',
            col='variable', col_wrap=3,
            y='value', hue='price_range',
            sharex=False, sharey=False)

Binary variables count plot with price as hue

EDA Conclusions

X/Y Split

def split_X_y(df: pd.DataFrame, target_col: str, feature_cols: list[str]) -> tuple[pd.DataFrame, pd.Series]:
    # Return the feature matrix and the target vector
    return df[feature_cols], df[target_col]

target_col = 'price_range'
feature_cols = df_train.columns.drop(target_col)

X_train, y_train = split_X_y(df_train, target_col, feature_cols)
X_test, y_test = split_X_y(df_test, target_col, feature_cols)

print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)


Train shape: (1400, 20)
Test shape: (600, 20)


In this section, I’m going to train several machine learning models.

For preprocessing, I’m only going to standardize the values, and I’m going to make use of the make_pipeline function.
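
One detail of make_pipeline that matters later: it names each step automatically after the lowercased class name. That is where the 'logisticregression__C' key in the grid search comes from. A quick sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated from the class names, in lowercase
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```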

Moreover, since there’s no separate validation set, I’m going to use k-fold cross-validation on the training set to estimate the “out-of-sample” performance.
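
Here is a minimal sketch of that idea on synthetic data. X_toy and y_toy are hypothetical stand-ins for X_train and y_train, and max_iter=1000 is my own choice to avoid convergence warnings. Passing the whole pipeline to cross_val_score means the scaler is refit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the real training set
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each of the 5 folds is held out once; the scaler and the model are
# refit on the other 4 folds, so the held-out fold stays unseen
fold_scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='accuracy')
print(fold_scores.mean())
```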


models = {
    'Dummy Classifier': DummyClassifier(),
    'Logistic Regression': LogisticRegression(random_state=519),
    'Decision Tree': DecisionTreeClassifier(random_state=882),
    'Random Forest': RandomForestClassifier(random_state=511),
    'Bagging': BaggingClassifier(random_state=112),
    'Ada Boost': AdaBoostClassifier(random_state=919, algorithm='SAMME'),
    'Gradient Boosting': GradientBoostingClassifier(random_state=114),
    'LDA': LinearDiscriminantAnalysis(),
    'SVC': SVC(),
}

model_pipelines = {}
model_scores = {}

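for model_name, model in models.items():
    # Standardize the features, then apply the classifier
    model_pipeline = make_pipeline(StandardScaler(), model)
    # 5-fold cross-validated accuracy on the training set
    scores = cross_val_score(
        model_pipeline, X_train, y_train,
        cv=5, scoring='accuracy'
    )
    # Keep a copy fitted on the full training set
    model_pipelines[model_name] = model_pipeline.fit(X_train, y_train)
    model_scores[model_name] = scores.mean()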


Let’s compare the out-of-sample performance of each model:

model_scores_df = (pd.Series(model_scores)
                   .sort_values()
                   .reset_index()
                   .set_axis(['Model', 'Score'], axis=1))
model_scores_df


                 Model     Score
0     Dummy Classifier  0.254286
1            Ada Boost  0.488571
2        Decision Tree  0.816429
3                  SVC  0.862857
4              Bagging  0.865000
5        Random Forest  0.873571
6    Gradient Boosting  0.892857
7                  LDA  0.942143
8  Logistic Regression  0.946429

Based on the results above, logistic regression performs best, with a cross-validated accuracy of 94.6%. Next, I’m going to tune its regularization hyperparameter C using grid search.

Hyperparameter Tuning


lr_grid = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    param_grid={
        'logisticregression__C': np.logspace(-5, 3, 20),
    },
).fit(X_train, y_train)

print('Best model parameters:', lr_grid.best_params_)
print('Best model score:', lr_grid.best_score_)


Best model parameters: {'logisticregression__C': 54.555947811685144}
Best model score: 0.9657142857142856

As you can see, by running a grid search we’ve managed to improve the cross-validated accuracy from 94.6% to 96.6%.
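
It can be worth checking where the winning value sits inside the search grid, since a best value at either edge would suggest widening the range. The sketch below rebuilds the same grid and locates the reported C. The names c_grid, best_c and best_idx are mine.

```python
import numpy as np

# Same grid as in the search: 20 log-spaced values from 1e-5 to 1e3
c_grid = np.logspace(-5, 3, 20)

# Best C as reported by the grid search output
best_c = 54.555947811685144

# Position of the closest grid value (0-based)
best_idx = int(np.argmin(np.abs(c_grid - best_c)))
print(best_idx, c_grid[best_idx])
```

The best value is the 17th of the 20 candidates, so it lies inside the grid rather than on its boundary.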

Model Evaluation

Now let’s evaluate the model on the test data.

best_model = lr_grid.best_estimator_

y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_test_pred)

print('Test accuracy:', test_accuracy)


Test accuracy: 0.965

test_confusion_matrix = ConfusionMatrixDisplay(
    confusion_matrix(y_test, y_test_pred),
)
test_confusion_matrix.plot()
plt.title('Confusion Matrix')

Confusion matrix

print(classification_report(y_test, y_test_pred))


              precision    recall  f1-score   support

           0       0.97      0.98      0.98       148
           1       0.94      0.96      0.95       156
           2       0.98      0.92      0.95       144
           3       0.97      0.99      0.98       152

    accuracy                           0.96       600
   macro avg       0.97      0.96      0.96       600
weighted avg       0.97      0.96      0.96       600

Overall, our model seems to work well on the test data, achieving 96% accuracy and 96% average F1 score.
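
For readers unfamiliar with the two average rows, here is how they relate to the per-class numbers. The sketch recomputes both averages from the rounded F1 scores and supports printed in the report, so the results are approximate.

```python
# Per-class F1 scores and supports, copied from the report above
f1_scores = [0.98, 0.95, 0.95, 0.98]
supports = [148, 156, 144, 152]

# Macro average: plain mean over the classes
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: mean weighted by each class's support
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 3), round(weighted_f1, 3))
```

Because the classes are nearly balanced, the two averages come out almost identical.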

It is also worth noting that every misclassified case is only one notch away from the true price range (which is good):

y_test_diff = (y_test_pred - y_test)
y_test_diff.value_counts()


 0    579
-1     12
 1      9
Name: count, dtype: int64
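
As a quick sanity check, these counts can be tied back to the test accuracy reported earlier. The sketch below uses only the three counts printed above. The variable names are mine.

```python
# Counts of (predicted - true), copied from the output above
diff_counts = {0: 579, -1: 12, 1: 9}

n_total = sum(diff_counts.values())
n_correct = diff_counts[0]

# Largest distance between a predicted and a true price range
max_abs_error = max(abs(d) for d in diff_counts)

print('Accuracy:', n_correct / n_total)
print('Largest miss (in price ranges):', max_abs_error)
```

The 21 errors split into 12 predictions one range too low and 9 one range too high, so the model does not lean strongly in either direction.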


That’s it for this post. I hope you learned something new today.