Summary 5

1. Importing libraries

Import pandas as `pd`, numpy as `np`, seaborn as `sns`, and matplotlib.pyplot as `plt`.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

2. Loading the data

- Read the file and store it in a DataFrame named churn_df.
- File name: employee_churn.csv

churn_df = pd.read_csv('employee_churn.csv')
churn_df

pd.read_csv

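3. Inspecting the data

We want to visualize how work_life_balance (work-life balance satisfaction) differs by department (department name) and overtime (overtime work).
- Draw the chart with seaborn's barplot().
- x='department', y='work_life_balance', hue='overtime'
- Read the four-option multiple-choice question and store the answer in the variable 답03.

Which of the following best describes the comparison of mean work_life_balance by department and overtime status?
1. In every department, employees who do not work overtime have higher work_life_balance.
2. In the finance department, employees who work overtime have higher work_life_balance.
3. In the engineering department, work_life_balance barely differs by overtime status.
4. In the sales department, employees who work overtime have higher work_life_balance.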

sns.barplot(data=churn_df, x= 'department', y='work_life_balance', hue='overtime')
답03 = 2

Mind the quotes around the column names.
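By default, seaborn's barplot draws the mean of y for each x/hue group, so the same comparison can also be read from a grouped mean table. A minimal sketch, assuming a made-up toy frame (the values below are hypothetical and are not taken from employee_churn.csv):

```python
import pandas as pd

# Hypothetical toy data; the real churn_df comes from employee_churn.csv
toy_df = pd.DataFrame({
    "department": ["finance", "finance", "finance", "sales", "sales", "sales"],
    "overtime": ["Y", "N", "N", "Y", "Y", "N"],
    "work_life_balance": [4, 2, 3, 1, 2, 4],
})

# Mean of y per (x, hue) group: these are the bar heights barplot would draw
bar_heights = (
    toy_df.groupby(["department", "overtime"])["work_life_balance"]
    .mean()
    .unstack("overtime")
)
print(bar_heights)
```

Reading the table row by row answers the same question as the chart: compare the Y and N columns within each department.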

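4. Finding and handling missing values

We want to find the columns that contain missing values and drop the rows where they occur.

- Write code that finds the missing values, and store the names of the columns that contain them in the variable 답04 (e.g. 답04 = 'overtime, attrition').
- Drop the rows that contain missing values and store the result in the variable pre_df.

- Reset the index of pre_df with reset_index().
- drop=True, inplace=True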

churn_df.isna().sum()

답04 = 'years_experience, salary'
pre_df = churn_df.dropna(subset=['years_experience', 'salary'])
pre_df.reset_index(drop=True, inplace=True)
pre_df

To drop rows based on missing values in specific columns only, you must use the dropna(subset=[...]) form.

Always verify that the rows were actually dropped, for example with pre_df.isna().sum().
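The difference between dropna() and dropna(subset=[...]) is easiest to see on a small frame. A minimal sketch, assuming hypothetical toy values and a made-up memo column that stands in for any column we do not want to filter on:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real employee_churn.csv
toy_df = pd.DataFrame({
    "years_experience": [1.0, np.nan, 3.0, 4.0],
    "salary": [100.0, 200.0, np.nan, 400.0],
    "memo": [None, "a", "b", "c"],  # made-up column we do not filter on
})

# Only rows missing years_experience or salary are dropped;
# the missing memo value in the first row is ignored
cleaned = toy_df.dropna(subset=["years_experience", "salary"]).reset_index(drop=True)

print(len(toy_df.dropna()))  # rows left when every column must be present
print(len(cleaned))          # rows left when only the subset must be present
print(cleaned[["years_experience", "salary"]].isna().sum())  # verification step
```

After reset_index(drop=True) the index runs from 0 again and the old index is not kept as a column.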

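5. Inspecting the data with groupby

We want to analyze the mean job satisfaction (job_satisfaction) of employees by department and find the department with the highest satisfaction.
- Use groupby to compute the mean job satisfaction of each department and find the department with the highest value.
- Read the question and store the department with the highest satisfaction in the variable 답05.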

1. Sales
2. HR
3. Engineering
4. Marketing
5. Engineering

pre_df.groupby('department')['job_satisfaction'].mean()
답05 = 4

For 답05, just pick the number of the matching option.
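Instead of scanning the groupby output by eye, the top group can be picked out with idxmax(). A minimal sketch on hypothetical toy values (not the real data):

```python
import pandas as pd

# Hypothetical toy data, not the real employee_churn.csv
toy_df = pd.DataFrame({
    "department": ["hr", "hr", "sales", "sales", "marketing"],
    "job_satisfaction": [3, 4, 2, 3, 5],
})

# Mean job satisfaction per department
dept_mean = toy_df.groupby("department")["job_satisfaction"].mean()

# Sorting makes the ranking easy to read; idxmax returns the top label
print(dept_mean.sort_values(ascending=False))
top_department = dept_mean.idxmax()
print(top_department)
```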

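6. Inspecting the data with crosstab

We want to analyze the attrition rate by department.
To do so, use the pandas crosstab() function to compute the proportion of each attrition value per department.

- Check the attrition and retention rates of each department as proportions.
- Store the department with the highest attrition rate in the variable 답06 (e.g. 답06 = 'hr').

- normalize='index'
- Use round() to print values to two decimal places.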

print(pd.crosstab(pre_df['department'], pre_df['attrition'], normalize='index').round(2))
pd.crosstab(index=pre_df['department'], columns=pre_df['attrition'], normalize='index').round(2)

답06 = 'engineering'

The positional form and the keyword form (index=, columns=) above are equivalent.

pd.crosstab: a function for quickly viewing counts or proportions between two categorical variables
- normalize='index': divides each row by its row total, so groups can be compared by proportion rather than by raw count
Why you have to write pre_df['col']: crosstab is a top-level pandas function, not a DataFrame method
- Top-level pandas functions
pd.crosstab(series1, series2)
pd.concat([df1, df2, ...])
pd.cut(series, bins=...)
pd.merge(df1, df2, on='key')
pd.pivot(df, index=..., columns=...)
- DataFrame methods
df.groupby('col')
df.sort_values('col')
df.pivot_table(index='col', columns='col2')
df.agg({'col': 'mean'})
df.drop(['col1', 'col2'], axis=1)

Output:

attrition       0     1
department             
engineering  0.61  0.39
finance      0.67  0.33
hr           0.66  0.34
marketing    0.69  0.31
sales        0.73  0.27
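To see what normalize='index' does, it helps to compare raw counts with row-normalized proportions on a tiny example. A minimal sketch, assuming hypothetical toy values (not the real data):

```python
import pandas as pd

# Hypothetical toy data, not the real employee_churn.csv
toy_df = pd.DataFrame({
    "department": ["hr", "hr", "hr", "hr", "sales", "sales"],
    "attrition": [0, 0, 0, 1, 0, 1],
})

# Raw counts per (department, attrition) pair
counts = pd.crosstab(toy_df["department"], toy_df["attrition"])

# Each row divided by its own total, so every row sums to 1
rates = pd.crosstab(
    toy_df["department"], toy_df["attrition"], normalize="index"
).round(2)

# Column 1 holds the attrition rate; idxmax gives the department with the highest one
top = rates[1].idxmax()
print(counts)
print(rates)
print(top)
```

In this toy frame hr has more rows overall, but sales has the higher rate, which is why the comparison is done on proportions.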

7. Encoding categorical variables

Encode the categorical variables.
- One-hot encode the department variable.
- Use np.where to change overtime from Y to 1 and from N to 0.

- Run the cell below to encode education_level.

edu_map = {
    'high_school': 0,
    'bachelor': 1,
    'master': 2,
    'phd': 3
}

pre_df['education_level'] = pre_df['education_level'].map(edu_map)

- Use the drop_first=True option for one-hot encoding.
- Store the processed data in the variable encoding_df.

encoding_df = pd.get_dummies(data=pre_df, columns=['department'], drop_first=True)
encoding_df['overtime'] = np.where(encoding_df['overtime'] == 'Y', 1, 0)
encoding_df

np.where(condition, value if true, value if false)

Be careful which DataFrame name you assign to!
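The effect of drop_first=True and np.where can be checked on a few rows. A minimal sketch, assuming hypothetical toy values (not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real employee_churn.csv
toy_df = pd.DataFrame({
    "department": ["finance", "hr", "sales", "hr"],
    "overtime": ["Y", "N", "N", "Y"],
})

# drop_first=True drops the first category (finance here),
# so a finance row is the one where every dummy column is 0/False
encoded = pd.get_dummies(data=toy_df, columns=["department"], drop_first=True)

# Assign back to the encoded frame, not to the source frame
encoded["overtime"] = np.where(encoded["overtime"] == "Y", 1, 0)

print(encoded.columns.tolist())
print(encoded)
```

With three departments only two dummy columns remain, because the dropped category is implied when the others are all zero.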

8. Separating features and target

Separate X and y for model training.
- X = every column except attrition
- y = attrition

X = encoding_df.drop(columns='attrition')
y = encoding_df['attrition']

9. Splitting into train and test data

Split the data for model training and evaluation.

- Import train_test_split to split the dataset.
- Training set = X_train, y_train
- Validation set = X_valid, y_valid
- random_state=42
- Use an 8:2 ratio of training data to validation data.

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
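An 8:2 split corresponds to test_size=0.2, which can be confirmed from the resulting shapes. A minimal sketch on hypothetical toy arrays (the names X_toy and y_toy are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# test_size=0.2 keeps 80% for training and 20% for validation;
# a fixed random_state makes the split reproducible
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_va.shape)
```

The return order is always X train, X validation, y train, y validation, so the variable names on the left must follow that order.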

10. Standardizing the data

Scale the data to improve model training performance.
- Scale the feature data with StandardScaler.
- Apply fit_transform to X_train and store the result in X_train.
- Apply transform to X_valid and store the result in X_test.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_valid)
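The reason for fit_transform on the training data and transform on the validation data is that the mean and standard deviation are learned from the training data only and then reused. A minimal sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a single feature
train = np.array([[1.0], [2.0], [3.0]])
valid = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales
valid_scaled = scaler.transform(valid)      # reuses the train mean/std

print(scaler.mean_)          # mean learned from the training data
print(train_scaled.ravel())
print(valid_scaled.ravel())
```

The validation value 2.0 maps to 0 because it equals the training mean, even though it is not the mean of the validation data itself.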

11. Training a LightGBM model

Train a LightGBM model.

- Train a model that predicts attrition using LGBMClassifier.
- Store the LGBM model in the variable lgbmc.
- n_estimators=100, max_depth=6, learning_rate=0.01, random_state=10
- Use the scaled data for training.

from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier(n_estimators=100, max_depth=6, learning_rate=0.01, random_state=10)
lgbmc.fit(X_train, y_train)

Note: LGBMClassifier is imported from lightgbm, not from sklearn, and the class is not named LightGBMClassifier.

12. Training an XGBoost model

Train an XGBoost model.
- Run the cell below to import XGBClassifier.

- Train a model that predicts attrition using XGBClassifier.
- Store the XGBoost model in the variable xgbc.
- n_estimators=200, max_depth=8, learning_rate=0.1, random_state=10
- Use the scaled data for training.

from xgboost import XGBClassifier

xgbc = XGBClassifier(n_estimators = 200, max_depth = 8, learning_rate = 0.1, random_state = 10)
xgbc.fit(X_train, y_train)

Note: XGBClassifier is imported from xgboost, not from sklearn.

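13. Comparing model performance

Compare the predictive performance of the models.
- Print accuracy_score and f1_score.
- Predict the validation data with LightGBM's predict and store the predictions in the variable lgbmc_predict.
- Predict the validation data with XGBoost's predict and store the predictions in the variable xgbc_predict.
- Print the accuracy and F1 score of each model.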

from sklearn.metrics import accuracy_score, f1_score

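# Predictions (the "answer sheet"). The models were trained on scaled data,
# so predict on the scaled validation features stored in X_test.
lgbmc_predict = lgbmc.predict(X_test)
xgbc_predict = xgbc.predict(X_test)

# Ground truth vs. predictions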
print("LightGBM")
print("Accuracy:", accuracy_score(y_valid, lgbmc_predict))
print("F1 Score:", f1_score(y_valid, lgbmc_predict))

print("XGBoost")
print("Accuracy:", accuracy_score(y_valid, xgbc_predict))
print("F1 Score:", f1_score(y_valid, xgbc_predict))
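Accuracy and F1 can disagree noticeably when the positive class is the minority, which is why both are printed. A minimal sketch on hypothetical labels (made up for illustration, not model output):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 3 positives, 5 negatives
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Accuracy counts every correct prediction, including the easy negatives
acc = accuracy_score(y_true, y_pred)

# F1 combines precision and recall for the positive class (label 1)
f1 = f1_score(y_true, y_pred)

print("Accuracy:", acc)
print("F1 Score:", f1)
```

Here 5 of 8 predictions are correct, but only 1 of the 3 positives is found, so F1 comes out well below accuracy. Both functions take the true labels first and the predictions second.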

14. Designing and training a deep learning model

Predict attrition with a deep learning model.
Model architecture
  - Dense(64, selu) → BatchNormalization
  - Dense(32, selu) → BatchNormalization
  - Dense(16, selu)
  - Dense(2, softmax)
optimizer: adam, loss: categorical_crossentropy, metric: accuracy
epochs=50, batch_size=32

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.utils import to_categorical

tf.random.set_seed(10)
y_train = to_categorical(y_train, num_classes=2)
y_valid = to_categorical(y_valid, num_classes=2)

# 1. Build the model as a stack of layers
model = Sequential([
    Dense(64, activation='selu'),
    BatchNormalization(),
    Dense(32, activation='selu'),
    BatchNormalization(),
    Dense(16, activation='selu'),
    Dense(2, activation='softmax')
])

# 2. Compile (the string 'adam' avoids needing a separate Adam import)
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# 3. Train
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32
)

Reference

| | Classification (binary, single-vector labels) | Classification (binary, one-hot labels) | Regression |
|---|---|---|---|
| Final layer units | 1 | 2 | Match the number of values to predict: 1 for a single continuous value, one unit per output for multi-output regression |
| Activation | sigmoid | softmax | Usually none (None or linear), since the output is a continuous value |
| Loss | binary_crossentropy | categorical_crossentropy | Mean squared error (MSE), mean absolute error (MAE), etc. |
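The two classification setups in the table describe the same binary problem in two label layouts. A minimal NumPy sketch, assuming hypothetical labels and predicted probabilities, that computes both losses by hand (it does not call Keras, and np.eye is used here only to mimic the one-hot layout that to_categorical produces):

```python
import numpy as np

# Hypothetical labels and predicted probabilities of class 1
y = np.array([1, 0, 1])
p_pos = np.array([0.9, 0.2, 0.6])

# Setup 1: one sigmoid unit with binary cross-entropy
bce = -np.mean(y * np.log(p_pos) + (1 - y) * np.log(1 - p_pos))

# Setup 2: two softmax units with categorical cross-entropy on one-hot labels
y_onehot = np.eye(2)[y]                           # rows like [0, 1] or [1, 0]
p_softmax = np.column_stack([1 - p_pos, p_pos])   # [P(class 0), P(class 1)]
cce = -np.mean(np.sum(y_onehot * np.log(p_softmax), axis=1))

print(bce, cce)
```

Both values match because each loss reduces to the negative log of the probability assigned to the true class. What matters is pairing the label layout with the matching output layer and loss.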

Source: a workbook I bought with my own money ([이패스코리아] 2025 이패스 AI능력시험 AICE Associate)
