python - Linear regression analysis with string/categorical features (variables)?

Question

Welcome To Ask or Share your Answers For Others

python - Linear regression analysis with string/categorical features (variables)?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Linear regression analysis with string/categorical features (variables)?

Regression algorithms seem to be working on features represented as numbers. For example:

This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.

But now I want to do a regression analysis on data that contain categorical features:

There are 5 features: District, Condition, Material, Security, Type

How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? I mean if I have to create some encoding rules and according to that rules transform all data to numeric values.

Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:50:14+0000

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

Usually there are three possibilities:

One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Use something like group means for categorical data (e. g. mean prices for city districts).

You have to be carefull to not infuse information you do not have in the application case.

One hot encoding

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

E. g.

idx color
0   blue
1   green
2   green
3   red

to

idx blue green red
0   1    0     0
1   0    1     0
2   0    1     0
3   0    0     1

This can easily be done with pandas:

import pandas as pd

data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data))

will result in:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1

Numbers for ordinal data

Create a mapping of your sortable categories, e. g. old < renovated < new → 0, 1, 2

This is also possible with pandas:

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])

Result:

0    0
1    2
2    2
3    1
Name: q, dtype: int8

Using categorical data for groupby operations

You could use the mean for each category over past (known events).

Say you have a DataFrame with the last known mean prices for cities:

prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})

print(data.merge(mean_price, on='city', how='left'))

Result:

  city  price
0    A      1
1    B      2
2    C      3
3    A      1
4    B      2
5    A      1

Categories

python - Linear regression analysis with string/categorical features (variables)?

python - Linear regression analysis with string/categorical features (variables)?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

One hot encoding

Numbers for ordinal data

Using categorical data for groupby operations

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags