Summary 1

1. Importing libraries

Import pandas, numpy, seaborn, and matplotlib.

Alias seaborn as sns and matplotlib.pyplot as plt.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

2. Loading the data

Load the cart_abandon.csv file.
Read the file and store it in a DataFrame named abandon_df.

abandon_df = pd.read_csv('datasets/cart_abandon.csv')
abandon_df

pd.read_csv
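A quick look right after loading tends to make the later steps easier. The sketch below reads a small in-memory CSV (made-up values standing in for the real cart_abandon.csv, which is not reproduced here) and checks the shape and the number of missing values per column. The name demo_df is only for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for datasets/cart_abandon.csv (values are made up).
csv_text = """user_type,time_on_site,cart_value
guest,12.5,30.0
registered,,45.5
guest,8.0,
"""

# read_csv accepts a path or any file-like object.
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.shape)           # (rows, columns)
print(demo_df.isnull().sum())  # missing values per column
```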

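3. Inspecting the data

We want to analyze how the time spent on the website (time_on_site) differs by membership type (user_type).

Visualize this with Seaborn's boxplot() and choose the correct interpretation based on the result.
Store the answer in the variable '답04'.

x = user_type
y = time_on_site

1) registered users have a higher mean and median than guest users, and more outliers.
2) guest users stay slightly longer than registered users, with both a higher mean and a higher median.
3) The two groups have the same median, but guest has many outliers, so only its mean is higher.
4) guest has no outliers, and outliers have lowered the median for registered.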

abandon_df.groupby('user_type')['time_on_site'].mean()

sns.boxplot(data=abandon_df, x='user_type', y='time_on_site')
답04 = 2

.groupby( 'column' )['column']

sns.boxplot( data = df, x = ' ' , y = ' ' )
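The answer choices mention the mean, the median, and outliers, so it can help to compute all three as numbers next to the plot. Below is a minimal sketch on made-up data (not the real abandon_df). The helper count_outliers is a hypothetical name; it applies the 1.5 * IQR rule that box plots use for their whiskers by default.

```python
import pandas as pd

# Made-up data: the real abandon_df is not reproduced here.
demo_df = pd.DataFrame({
    "user_type": ["guest"] * 5 + ["registered"] * 5,
    "time_on_site": [10, 12, 14, 16, 100, 8, 9, 10, 11, 12],
})

# Mean and median side by side, one row per group.
summary = demo_df.groupby("user_type")["time_on_site"].agg(["mean", "median"])
print(summary)

def count_outliers(values: pd.Series) -> int:
    """Count points outside the 1.5 * IQR whiskers of a default box plot."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

outliers = demo_df.groupby("user_type")["time_on_site"].apply(count_outliers)
print(outliers)
```

In this toy data the single large value pulls the guest mean far above its median, which is the kind of gap the answer choices ask about.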

4. Handling missing values

Fill the missing values in the 'time_on_site' column with the median.
Fill the missing values in the 'cart_value' column with the mean.

Use inplace = True.

abandon_df['time_on_site'].fillna(abandon_df['time_on_site'].median(), inplace=True)
abandon_df['cart_value'].fillna(abandon_df['cart_value'].mean(), inplace=True)

df [ 'column' ].fillna (fill value, inplace = True)

Fill value: df [ 'column' ].mean( ) or .median( )

Watch the parentheses / watch inplace=True
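The exam asks for inplace=True, so the code above follows that. As far as I know, recent pandas releases warn when an inplace method is called on a selected column like this, and with copy-on-write the original DataFrame may not be modified at all. A sketch of the assign-back form on made-up data (demo_df is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
demo_df = pd.DataFrame({
    "time_on_site": [10.0, np.nan, 30.0, 100.0],
    "cart_value": [20.0, 40.0, np.nan, 60.0],
})

# Assign the filled column back instead of calling inplace=True on a selection.
demo_df["time_on_site"] = demo_df["time_on_site"].fillna(demo_df["time_on_site"].median())
demo_df["cart_value"] = demo_df["cart_value"].fillna(demo_df["cart_value"].mean())

print(demo_df)
```

mean() and median() skip missing values by default, so the fill value is computed from the existing values only.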

5. Encoding categorical variables

Convert the categorical columns with get_dummies.
Use the drop_first=True option.
Store the processed data in the variable 'incoding_df'.

abandon_df["prev_purchase"] = abandon_df["prev_purchase"].map({"Y": 1, "N": 0})
abandon_df["discount_applied"] = abandon_df["discount_applied"].map({"Y": 1, "N": 0})
abandon_df["user_type"] = abandon_df["user_type"].map({"registered": 1, "guest": 0})

cat_col = ['device', 'country', 'payment_method']

.map( { "value1" : replacement1 , "value2" : replacement2 } )   Note the {curly braces}, the colon, and the 'quotes'.

incoding_df = pd.get_dummies(data=abandon_df, columns=cat_col, drop_first=True)
incoding_df

pd.get_dummies( data = , columns = , drop_first = True )
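To see what the two encoding steps produce, here is a small sketch on made-up data (the column values are assumptions, not taken from the real file). map replaces each value through the dictionary, and get_dummies with drop_first=True drops the first category in sorted order.

```python
import pandas as pd

# Made-up data for illustration only.
demo_df = pd.DataFrame({
    "prev_purchase": ["Y", "N", "Y"],
    "device": ["mobile", "desktop", "tablet"],
})

# Two-valued column: map each label to 0/1 directly.
demo_df["prev_purchase"] = demo_df["prev_purchase"].map({"Y": 1, "N": 0})

# Multi-valued column: one-hot encode, dropping the first category ("desktop").
encoded = pd.get_dummies(data=demo_df, columns=["device"], drop_first=True)
print(encoded.columns.tolist())
```

A value that is missing from the map dictionary becomes NaN, so it is worth checking the unique values of a column before mapping it.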

6. Removing unneeded columns

Delete the 'num_items' column.
Use axis=1, inplace=True.

incoding_df.drop(columns=['num_items'], axis=1, inplace=True)

df. drop (columns = [ ' ' ], axis = 1 , inplace = True)
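The exam wording asks for axis=1, but columns= already says which axis to drop from. A short sketch on made-up data showing that the two spellings give the same result (demo_df is a hypothetical name):

```python
import pandas as pd

# Made-up data for illustration only.
demo_df = pd.DataFrame({"num_items": [1, 2], "cart_value": [10.0, 20.0]})

# Spelling 1: name the columns explicitly.
by_columns = demo_df.drop(columns=["num_items"])

# Spelling 2: pass labels and say they are on axis 1 (columns).
by_axis = demo_df.drop(["num_items"], axis=1)

print(by_columns.equals(by_axis))
```

Without inplace=True, drop returns a new DataFrame and leaves demo_df unchanged.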

7. Removing outliers

'time_on_site' is the variable for the time spent on the site.
Looking at its values, some of them are negative.

- Delete the rows where 'time_on_site' is below 0.
- Store the result as filtering_df and reset the index with reset_index().
- Use drop=True, inplace=True in reset_index().

filtering_df = incoding_df[incoding_df['time_on_site'] >= 0]
filtering_df.reset_index(drop=True, inplace=True)
filtering_df

filtering_df = incoding_df.drop(incoding_df[incoding_df['time_on_site'] < 0].index, axis=0)
filtering_df.reset_index(drop=True, inplace=True)
filtering_df

df. drop ( df[ condition ].index , axis = 0 )

condition: df [ 'column' ] < 0

.drop (usually used with columns = , but here it takes the form df[ condition ].index, axis = 0)

.index takes no parentheses

.reset_index( drop = True, inplace = True)
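The two approaches above (keeping rows with a boolean mask, or dropping rows by index) should give the same result here. A minimal check on made-up data (demo_df is a hypothetical name):

```python
import pandas as pd

# Made-up data for illustration only.
demo_df = pd.DataFrame({"time_on_site": [5.0, -1.0, 0.0, -3.5, 12.0]})

# Approach 1: keep the rows that satisfy the condition.
kept_by_mask = demo_df[demo_df["time_on_site"] >= 0].reset_index(drop=True)

# Approach 2: find the index labels of the bad rows, then drop them.
negative_rows = demo_df[demo_df["time_on_site"] < 0].index
kept_by_drop = demo_df.drop(negative_rows, axis=0).reset_index(drop=True)

print(kept_by_mask.equals(kept_by_drop))
```

The two differ only when the column still has missing values: NaN fails both comparisons, so the mask drops those rows while the drop approach keeps them. Here the missing values were already filled in step 4.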

8. Splitting features / target

Separate the features (X) and the target (y) for model training.

X = everything except 'cart_abandon_yn'
y = 'cart_abandon_yn'

X = filtering_df.drop(columns='cart_abandon_yn',axis=1)
y = filtering_df['cart_abandon_yn']

df. drop ( columns = ' column1 ' , axis = 1 )

df. drop ( columns = [ 'several columns' ], axis = 1 )

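9. Splitting into train / test (evaluation) sets

Split the data for model training and evaluation.
- Import train_test_split to split the dataset.
- Training set = X_train, y_train
- Validation set = X_valid, y_valid
- random_state=42
- The ratio of training to validation data is 8:2.
- Use the stratify option so that the classes in y are distributed evenly between the training and validation sets.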

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_valid, y_train, y_valid = train_test_split (X, y, test_size = , random_state = , stratify = y )

sklearn.model_selection
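To see what stratify=y does, the sketch below splits an imbalanced made-up target (80 zeros, 20 ones) and compares the share of class 1 on both sides. The names X_demo, y_demo, X_tr, X_va, y_tr, and y_va are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% of them in class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the original class ratio.
print(y_tr.mean(), y_va.mean())
```

Without stratify the ratio in the small validation set can drift from the original by chance, which matters most when one class is rare.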

10. Standardization

Standardize X_train and X_valid with StandardScaler.

- Apply StandardScaler.
- Apply fit_transform to X_train and store the result in the variable X_train.
- Apply transform to X_valid and store the result in the variable X_test.

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_valid)

sklearn.preprocessing

train = fit_transform

valid (validation) = apply transform only
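The reason for the split between fit_transform and transform is that the mean and standard deviation are learned from the training data only and then reused as-is on the validation data. A small sketch with made-up numbers (ss_demo, X_tr, and X_va are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only.
X_tr = np.array([[0.0], [2.0], [4.0]])
X_va = np.array([[2.0], [6.0]])

ss_demo = StandardScaler()
X_tr_scaled = ss_demo.fit_transform(X_tr)  # learns mean/std from X_tr, then scales it
X_va_scaled = ss_demo.transform(X_va)      # reuses the statistics learned from X_tr

print(ss_demo.mean_, ss_demo.scale_)
print(X_va_scaled.ravel())
```

The validation value 2.0 maps to 0 because 2.0 is the training mean, not because of anything in the validation data itself.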

11. Random forest classification

Train a RandomForestClassifier.

- Store the random forest model in the variable rfc.
- n_estimators=100, max_depth=7, random_state=42
- Train on the scaled data.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42)
rfc.fit(X_train,y_train)

sklearn.ensemble

Machine learning: fit (X_train, y_train)
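Steps 9 to 11 chained together on synthetic data look roughly like the sketch below. The data comes from make_classification rather than the real cart_abandon.csv, and the names ending in _demo (plus X_tr, X_va, y_tr, y_va, pred) are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale: fit on the training part only, then reuse on the validation part.
ss_demo = StandardScaler()
X_tr = ss_demo.fit_transform(X_tr)
X_va = ss_demo.transform(X_va)

# Same hyperparameters as in the exam task.
rfc_demo = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42)
rfc_demo.fit(X_tr, y_tr)

pred = rfc_demo.predict(X_va)
print(pred.shape)
```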

12. XGBoost classification

Train an XGBClassifier.

- Store the XGBoost model in the variable xgbc.
- n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42
- Train on the scaled data.

from xgboost import XGBClassifier

xgbc = XGBClassifier(n_estimators = 100, max_depth= 5, learning_rate = 0.1, random_state = 42)
xgbc.fit(X_train, y_train)

from xgboost import XGBClassifier

Machine learning: fit (X_train, y_train)

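13. Comparing model performance (accuracy, f1 score)

Use the previously trained models to make predictions on the test data.
Compare the trained models by accuracy and f1 score.

- Predict the validation data with the random forest's predict and store the predictions in the variable rfc_predict.
- Predict the validation data with XGBoost's predict and store the predictions in the variable xgbc_predict.
- Print the accuracy and f1-score of both models using accuracy_score and f1_score.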

from sklearn.metrics import accuracy_score, f1_score

rfc_predict = rfc.predict(X_test)   
xgbc_predict = xgbc.predict(X_test) 

print('rfc accuracy ', accuracy_score(y_valid, rfc_predict))
print('rfc f1-score ', f1_score(y_valid, rfc_predict))

print('xgbc accuracy ', accuracy_score(y_valid, xgbc_predict))
print('xgbc f1-score ', f1_score(y_valid, xgbc_predict))

from sklearn.metrics import accuracy_score, f1_score

Machine learning:

fit (X_train, y_train)

predict (X_test)

accuracy/f1_score(y_valid, predict)
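Accuracy and f1 score can tell different stories when the classes are imbalanced, which is why the task asks for both. A tiny hand-checkable sketch with made-up labels (y_true and y_pred are hypothetical, not model output):

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 3 positives out of 10, and a model that finds only one of them.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)  # share of labels predicted correctly
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall for class 1
print(acc, f1)
```

Here 8 of 10 labels are right, so accuracy is 0.8, but precision is 1 and recall is 1/3, which gives an f1 score of 0.5.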

14. Designing and training a deep learning model

Design and train a deep learning model.

- Dense(64, relu) → Dense(32, relu) → Dense(2, softmax)
- optimizer: adam
- loss: categorical_crossentropy
- metric: accuracy
- epochs=50, batch_size=128

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

tf.random.set_seed(42)
y_train = to_categorical(y_train, num_classes=2)
y_valid = to_categorical(y_valid, num_classes=2)

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=128,
          validation_data=(X_test, y_valid))  # X_test is the scaled X_valid from step 10

Deep learning:

model = Sequential()
model.add
model.compile ( optimizer, loss, metrics=[''] )
model.fit (X_train, y_train, epochs, batch size, validation_data = (X_val, y_val) )

Definition: validation_data is a data set the model uses to evaluate its performance separately during training.

It is distinct from the training data, and the model parameters (weights) are not updated on the validation data.

At the end of each epoch, validation accuracy/loss is reported alongside train accuracy/loss. It is used to compare performance and as a criterion for early stopping.
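This section one-hot encodes the labels with to_categorical and trains with categorical_crossentropy. The NumPy-only sketch below shows what those two pieces compute, without needing TensorFlow. The probabilities in probs are made up and stand in for the output of a Dense(2, softmax) layer; y_demo, y_onehot, loss, and pred_class are hypothetical names.

```python
import numpy as np

# Made-up integer labels.
y_demo = np.array([0, 1, 1])

# Same one-hot layout that to_categorical(y_demo, num_classes=2) produces.
y_onehot = np.eye(2)[y_demo]

# Hypothetical softmax outputs (each row sums to 1).
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# Categorical cross-entropy: mean of -log(probability given to the true class).
loss = -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

# Turn probabilities back into class labels.
pred_class = probs.argmax(axis=1)

print(y_onehot)
print(loss, pred_class)
```

Since y_train and y_valid are overwritten with their one-hot versions in this step, any later call to accuracy_score or f1_score would need the original integer labels, for example by taking argmax(axis=1) of the one-hot array.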

Source: a workbook I bought with my own money ([이패스코리아] 2025 이패스 AI능력시험 AICE Associate)
