Data Preprocessing¶
- I. Missing Value
- II. Filtering
- III. 데이터프레임 합치기
- IV. 그룹 연산
- V. pivot_table( )
- VI. Multi-Index
- VII. etc
I. Missing Value¶
1) 실습용 'titanic' 데이터셋¶
In [1]:
import seaborn as sns
TD = sns.load_dataset('titanic')
- 'titanic' Dataset Information
- Seaborn에는 Pandas 기능 상속돼있음
In [2]:
TD.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 survived 891 non-null int64 1 pclass 891 non-null int64 2 sex 891 non-null object 3 age 714 non-null float64 4 sibsp 891 non-null int64 5 parch 891 non-null int64 6 fare 891 non-null float64 7 embarked 889 non-null object 8 class 891 non-null category 9 who 891 non-null object 10 adult_male 891 non-null bool 11 deck 203 non-null category 12 embark_town 889 non-null object 13 alive 891 non-null object 14 alone 891 non-null bool dtypes: bool(2), category(2), float64(2), int64(4), object(5) memory usage: 80.7+ KB
2) 결측치 확인¶
- 'age' 및 'deck' 열(Column)에서 결측치(NaN) 확인
In [3]:
TD.head(10)
Out[3]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.0750 | S | Third | child | False | NaN | Southampton | no | False |
8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | S | Third | woman | False | NaN | Southampton | yes | False |
9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C | Second | child | False | NaN | Cherbourg | yes | False |
- .value_counts(dropna = False)
- 결측치(NaN)를 포함하여 빈도분석 결과 출력 (default: True)
In [4]:
TD['deck'].value_counts(dropna = False)
Out[4]:
NaN 688 C 59 B 47 D 33 E 32 A 15 F 13 G 4 Name: deck, dtype: int64
- .isnull( )
- 결측치(NaN)를 'True'로 출력
In [5]:
TD.head(10).isnull()
Out[5]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
5 | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False |
6 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
8 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
9 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
- 각 열(Column)별로 결측치(NaN) 개수 확인
- .isnull( ).sum(axis = 0)
- axis = 0: 행(Row) default 생략가능
- axis = 1: 열(Column)
In [6]:
TD.isnull().sum(axis = 0)
Out[6]:
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64
- 각 행(Row)별로 결측치(NaN) 개수 확인
- .isnull( ).sum(axis = 1)
In [7]:
TD.isnull().sum(axis = 1).value_counts()
Out[7]:
1 549 0 182 2 160 dtype: int64
- .notnull( )
- 결측치(NaN)를 'False'로 출력
In [8]:
TD.head(10).notnull()
Out[8]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
1 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
2 | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
3 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
4 | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
5 | True | True | True | False | True | True | True | True | True | True | True | False | True | True | True |
6 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
7 | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
8 | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
9 | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
(1) 결측치 막대 그래프 (missing number 패키지 이용)¶
In [9]:
import missingno as msno
msno.bar(TD,
figsize = (15, 7),
color = (0.2, 0.2, 0.8));
(2) 결측치 Matrix - Data sheet 어느 부분에 결측치 있는지 보여줌¶
In [10]:
msno.matrix(TD,
figsize = (15, 7),
color = (0.8, 0.2, 0.2));
3) 결측치 삭제¶
- 각 열(Column)별로 측정값(Non-NaN) 개수 확인
In [11]:
TD.notnull().sum(axis = 0)
Out[11]:
survived 891 pclass 891 sex 891 age 714 sibsp 891 parch 891 fare 891 embarked 889 class 891 who 891 adult_male 891 deck 203 embark_town 889 alive 891 alone 891 dtype: int64
- 300개 이하 측정값(Non-Null)이 있는 열(Column) 삭제
- .dropna(thresh = 300, axis = 1)
- 'deck' 열 삭제
- 15 -> 14
In [12]:
TD.dropna(thresh = 300, axis = 1).shape
Out[12]:
(891, 14)
- 'age'행(Column) 기준으로 결측치가 있는 행(Row) 삭제
- .dropna(subset = ['age'], how = 'any', axis = 0)
- how = 'all' : 모든 값이 결측치인 경우 삭제
In [13]:
TD.shape
Out[13]:
(891, 15)
- 891 -> 714
In [14]:
TD.dropna(subset = ['age'], how = 'any', axis = 0).shape
Out[14]:
(714, 15)
- 891 -> 182
In [15]:
TD.dropna(subset = ['age', 'embarked', 'deck', 'embark_town'], how = 'any', axis = 0).shape
Out[15]:
(182, 15)
- 결측치(NaN) 확인
In [16]:
TD['age'][4:7]
Out[16]:
4 35.0 5 NaN 6 54.0 Name: age, dtype: float64
- 평균값으로 치환
- 함수 적용 후 결과가 반영된 데이터프레임 반환: inplace = True
In [17]:
TD['age'].fillna(int(TD['age'].mean(axis = 0)), inplace = True)
- NaN -> 29.0
In [18]:
TD['age'][4:7]
Out[18]:
4 35.0 5 29.0 6 54.0 Name: age, dtype: float64
- 명목형 데이터 치환
- 'embark_town'의 결측치를 최빈값으로 치환
- .fillna(most_freq, inplace = True)
- 결측치(NaN) 확인
In [19]:
TD['embark_town'][828:831]
Out[19]:
828 Queenstown 829 NaN 830 Cherbourg Name: embark_town, dtype: object
- 최빈값 확인
In [20]:
TD['embark_town'].value_counts(dropna = True)
Out[20]:
Southampton 644 Cherbourg 168 Queenstown 77 Name: embark_town, dtype: int64
In [21]:
most_freq = TD['embark_town'].value_counts(dropna = True).idxmax()
most_freq
Out[21]:
'Southampton'
In [22]:
type(most_freq)
Out[22]:
str
- 최빈값으로 치환
In [23]:
TD['embark_town'].fillna(most_freq, inplace = True)
- NaN -> Southampton
In [24]:
TD['embark_town'][828:831]
Out[24]:
828 Queenstown 829 Southampton 830 Cherbourg Name: embark_town, dtype: object
- 결측치 치환 with 'ffill' (foward fill)
- 이전 데이터포인트로 치환
- .fillna(method = 'ffill', inplace = True)
In [25]:
TD = sns.load_dataset('titanic')
TD['embark_town'][828:831]
Out[25]:
828 Queenstown 829 NaN 830 Cherbourg Name: embark_town, dtype: object
- method = 'ffill'
In [26]:
TD['embark_town'].fillna(method = 'ffill', inplace = True)
TD['embark_town'][828:831]
Out[26]:
828 Queenstown 829 Queenstown 830 Cherbourg Name: embark_town, dtype: object
- 결측치 치환 with 'bfill' (backward fill)
- 다음 데이터포인트로 치환
- .fillna(method = 'bfill', inplace = True)
In [27]:
TD = sns.load_dataset('titanic')
TD['embark_town'][828:831]
Out[27]:
828 Queenstown 829 NaN 830 Cherbourg Name: embark_town, dtype: object
- method = 'bfill'
In [28]:
TD['embark_town'].fillna(method = 'bfill', inplace = True)
TD['embark_town'][828:831]
Out[28]:
828 Queenstown 829 Cherbourg 830 Cherbourg Name: embark_town, dtype: object
II. Filtering¶
1) 실습용 'titanic' 데이터셋¶
In [29]:
import seaborn as sns
TD = sns.load_dataset('titanic')
TD.head(3)
Out[29]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
test
2) 'age'가 10살 이상이면서 20살 미만¶
- (TD.age >= 10) & (TD.age < 20)
In [30]:
Filter_1 = (TD.age >= 10) & (TD.age < 20)
TD.loc[Filter_1, :].head()
Out[30]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C | Second | child | False | NaN | Cherbourg | yes | False |
14 | 0 | 3 | female | 14.0 | 0 | 0 | 7.8542 | S | Third | child | False | NaN | Southampton | no | True |
22 | 1 | 3 | female | 15.0 | 0 | 0 | 8.0292 | Q | Third | child | False | NaN | Queenstown | yes | True |
27 | 0 | 1 | male | 19.0 | 3 | 2 | 263.0000 | S | First | man | True | C | Southampton | no | False |
38 | 0 | 3 | female | 18.0 | 2 | 0 | 18.0000 | S | Third | woman | False | NaN | Southampton | no | False |
3) 'age'가 10살 미만이면서 'sex'이 여자¶
- (TD.age < 10) & (TD.sex == 'female')
In [31]:
Filter_2 = (TD.age < 10) & (TD.sex == 'female')
TD.loc[Filter_2, :].head()
Out[31]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | G | Southampton | yes | False |
24 | 0 | 3 | female | 8.0 | 3 | 1 | 21.0750 | S | Third | child | False | NaN | Southampton | no | False |
43 | 1 | 2 | female | 3.0 | 1 | 2 | 41.5792 | C | Second | child | False | NaN | Cherbourg | yes | False |
58 | 1 | 2 | female | 5.0 | 1 | 2 | 27.7500 | S | Second | child | False | NaN | Southampton | yes | False |
119 | 0 | 3 | female | 2.0 | 4 | 2 | 31.2750 | S | Third | child | False | NaN | Southampton | no | False |
In [32]:
Filter_3 = (TD.age < 10) | (TD.age >= 60)
TD.loc[Filter_3, ['age','sex','alone']].head()
Out[32]:
age | sex | alone | |
---|---|---|---|
7 | 2.0 | male | False |
10 | 4.0 | female | False |
16 | 2.0 | male | False |
24 | 8.0 | female | False |
33 | 66.0 | male | True |
5) 'sibsp'에 3 또는 4 또는 5를 포함¶
- (TD.sibsp == 3) | (TD.sibsp == 4) | (TD.sibsp == 5)
In [33]:
Filter_4 = (TD.sibsp == 3) | (TD.sibsp == 4) | (TD.sibsp == 5)
TD.loc[Filter_4, :].head(6)
Out[33]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.0750 | S | Third | child | False | NaN | Southampton | no | False |
16 | 0 | 3 | male | 2.0 | 4 | 1 | 29.1250 | Q | Third | child | False | NaN | Queenstown | no | False |
24 | 0 | 3 | female | 8.0 | 3 | 1 | 21.0750 | S | Third | child | False | NaN | Southampton | no | False |
27 | 0 | 1 | male | 19.0 | 3 | 2 | 263.0000 | S | First | man | True | C | Southampton | no | False |
50 | 0 | 3 | male | 7.0 | 4 | 1 | 39.6875 | S | Third | child | False | NaN | Southampton | no | False |
59 | 0 | 3 | male | 11.0 | 5 | 2 | 46.9000 | S | Third | child | False | NaN | Southampton | no | False |
III. 데이터프레임 합치기¶
In [79]:
import pandas as pd
pd.concat([TD[20:23],TD[40:50]], axis=0)
Out[79]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
20 | 35.0 | male | Second | 26.0000 | 0 |
21 | 34.0 | male | Second | 13.0000 | 1 |
22 | 15.0 | female | Third | 8.0292 | 1 |
40 | 40.0 | female | Third | 9.4750 | 0 |
41 | 27.0 | female | Second | 21.0000 | 0 |
42 | NaN | male | Third | 7.8958 | 0 |
43 | 3.0 | female | Second | 41.5792 | 1 |
44 | 19.0 | female | Third | 7.8792 | 1 |
45 | NaN | male | Third | 8.0500 | 0 |
46 | NaN | male | Third | 15.5000 | 0 |
47 | NaN | female | Third | 7.7500 | 1 |
48 | NaN | male | Third | 21.6792 | 0 |
49 | 18.0 | female | Third | 17.8000 | 0 |
1) 데이터프레임 TB1, TB2, TB3, TB4 생성¶
- TB1
In [80]:
TB1 = pd.DataFrame({'Name': ['송태섭', '최유정', '이한나', '김소혜'],
'Gender': ['남자', '여자', '여자', '여자'],
'Age': ['21', '23', '20', '23']})
TB1
Out[80]:
Name | Gender | Age | |
---|---|---|---|
0 | 송태섭 | 남자 | 21 |
1 | 최유정 | 여자 | 23 |
2 | 이한나 | 여자 | 20 |
3 | 김소혜 | 여자 | 23 |
- TB2
In [92]:
TB2 = pd.DataFrame({'Name': ['송태섭', '최유정', '이한나', '김소혜'],
'Gender': ['남자', '여자', '여자', '여자'],
'Height': [179.1, 177.1, 167.9, 176.1],
'Weight': [63.9, 54.9, 50.2, 53.5]})
TB2
Out[92]:
Name | Gender | Height | Weight | |
---|---|---|---|---|
0 | 송태섭 | 남자 | 179.1 | 63.9 |
1 | 최유정 | 여자 | 177.1 | 54.9 |
2 | 이한나 | 여자 | 167.9 | 50.2 |
3 | 김소혜 | 여자 | 176.1 | 53.5 |
- TB3
In [82]:
TB3 = pd.DataFrame({'Name': ['서태웅', '정대만'],
'Gender': ['남자', '남자'],
'Age': ['24', '24']})
TB3
Out[82]:
Name | Gender | Age | |
---|---|---|---|
0 | 서태웅 | 남자 | 24 |
1 | 정대만 | 남자 | 24 |
- TB4
In [83]:
TB4 = pd.DataFrame({'Grade': [3, 1, 1, 3],
'Picture': ['무', '유', '무', '무']})
TB4
Out[83]:
Grade | Picture | |
---|---|---|
0 | 3 | 무 |
1 | 1 | 유 |
2 | 1 | 무 |
3 | 3 | 무 |
In [38]:
pd.concat([TB1, TB2], axis = 0)
Out[38]:
Name | Gender | Age | Height | Weight | |
---|---|---|---|---|---|
0 | 송태섭 | 남자 | 21 | NaN | NaN |
1 | 최유정 | 여자 | 23 | NaN | NaN |
2 | 이한나 | 여자 | 20 | NaN | NaN |
3 | 김소혜 | 여자 | 23 | NaN | NaN |
0 | 송태섭 | 남자 | NaN | 179.1 | 63.9 |
1 | 최유정 | 여자 | NaN | 177.1 | 54.9 |
2 | 이한나 | 여자 | NaN | 167.9 | 50.2 |
3 | 김소혜 | 여자 | NaN | 176.1 | 53.5 |
- 행기준: axis = 0
- TB1 & TB3
- ignore_index = True
- Label 새로 구성
In [84]:
pd.concat([TB1, TB3], axis = 0, ignore_index = True)
Out[84]:
Name | Gender | Age | |
---|---|---|---|
0 | 송태섭 | 남자 | 21 |
1 | 최유정 | 여자 | 23 |
2 | 이한나 | 여자 | 20 |
3 | 김소혜 | 여자 | 23 |
4 | 서태웅 | 남자 | 24 |
5 | 정대만 | 남자 | 24 |
- 열기준: axis = 1
- TB1 & TB2
In [85]:
pd.concat([TB1, TB2], axis = 1)
Out[85]:
Name | Gender | Age | Name | Gender | Height | Weight | |
---|---|---|---|---|---|---|---|
0 | 송태섭 | 남자 | 21 | 송태섭 | 남자 | 179.1 | 63.9 |
1 | 최유정 | 여자 | 23 | 최유정 | 여자 | 177.1 | 54.9 |
2 | 이한나 | 여자 | 20 | 이한나 | 여자 | 167.9 | 50.2 |
3 | 김소혜 | 여자 | 23 | 김소혜 | 여자 | 176.1 | 53.5 |
- 열기준: axis = 1
- TB1 & TB4
In [89]:
pd.concat([TB1, TB4], axis = 1)
Out[89]:
Name | Gender | Age | Grade | Picture | |
---|---|---|---|---|---|
0 | 송태섭 | 남자 | 21 | 3 | 무 |
1 | 최유정 | 여자 | 23 | 1 | 유 |
2 | 이한나 | 여자 | 20 | 1 | 무 |
3 | 김소혜 | 여자 | 23 | 3 | 무 |
In [93]:
pd.merge(TB1, TB2, on = ['Name', 'Gender'])
Out[93]:
Name | Gender | Age | Height | Weight | |
---|---|---|---|---|---|
0 | 송태섭 | 남자 | 21 | 179.1 | 63.9 |
1 | 최유정 | 여자 | 23 | 177.1 | 54.9 |
2 | 이한나 | 여자 | 20 | 167.9 | 50.2 |
3 | 김소혜 | 여자 | 23 | 176.1 | 53.5 |
IV. 그룹 연산¶
1) 실습용 'titanic' 데이터셋¶
In [94]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
TD = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
TD.head()
Out[94]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
1 | 38.0 | female | First | 71.2833 | 1 |
2 | 26.0 | female | Third | 7.9250 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
4 | 35.0 | male | Third | 8.0500 | 0 |
2) groupby( ) - 'class' 기준¶
- 'class' 기준의 DataFrameGroupBy 객체 생성
In [96]:
TD['class'].value_counts()
Out[96]:
Third 491 First 216 Second 184 Name: class, dtype: int64
In [97]:
TD['class'].nunique()
Out[97]:
3
In [95]:
grouped = TD.groupby(['class'])
grouped
Out[95]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdcab6ba980>
- 'First' 키 그룹 정보 확인
- .get_group('First')
In [98]:
grouped.get_group('First').head(3)
Out[98]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
1 | 38.0 | female | First | 71.2833 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
6 | 54.0 | male | First | 51.8625 | 0 |
- groupby 결과 확인(3개 그룹)
- 'First', 'Second', 'Third' 키별 3줄씩 출력
- .get_group('Key_Name')
In [99]:
for key in ['First', 'Second', 'Third']:
print(grouped.get_group(key).head(3))
print('\n')
age sex class fare survived 1 38.0 female First 71.2833 1 3 35.0 female First 53.1000 1 6 54.0 male First 51.8625 0 age sex class fare survived 9 14.0 female Second 30.0708 1 15 55.0 female Second 16.0000 1 17 NaN male Second 13.0000 1 age sex class fare survived 0 22.0 male Third 7.250 0 2 26.0 female Third 7.925 1 4 35.0 male Third 8.050 0
In [110]:
for key, group in grouped:
print('* key :', key)
print('* number :', len(group))
print(group.head(3))
print('\n')
* key : First * number : 216 age sex class fare survived 1 38.0 female First 71.2833 1 3 35.0 female First 53.1000 1 6 54.0 male First 51.8625 0 * key : Second * number : 184 age sex class fare survived 9 14.0 female Second 30.0708 1 15 55.0 female Second 16.0000 1 17 NaN male Second 13.0000 1 * key : Third * number : 491 age sex class fare survived 0 22.0 male Third 7.250 0 2 26.0 female Third 7.925 1 4 35.0 male Third 8.050 0
<ipython-input-110-cc76b7a3294d>:1: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning. for key, group in grouped:
- 3개 그룹별 평균('age', 'fare', 'survived')
In [100]:
grouped.mean()
<ipython-input-100-eca914d15133>:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. grouped.mean()
Out[100]:
age | fare | survived | |
---|---|---|---|
class | |||
First | 38.233441 | 84.154687 | 0.629630 |
Second | 29.877630 | 20.662183 | 0.472826 |
Third | 25.140620 | 13.675550 | 0.242363 |
3) groupby( ) - 'class' & 'sex' 기준¶
- 두 개 키(Key) 사용하여 DataFrameGoupBy 객체 생성
- 'class', 'sex' 키 적용
In [101]:
grouped_TWO = TD.groupby(['class', 'sex'])
grouped_TWO
Out[101]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdcab6b8c70>
- ('First', 'female') 키 그룹 정보 확인
- .get_group(('First', 'female'))
In [49]:
grouped_TWO.get_group(('First', 'female')).head(3)
Out[49]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
1 | 38.0 | female | First | 71.2833 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
11 | 58.0 | female | First | 26.5500 | 1 |
- groupby 결과 확인(6개 그룹)
In [50]:
for key, group in grouped_TWO:
print('* key :', key)
print('* number :', len(group))
print(group.head(3))
print('\n')
* key : ('First', 'female') * number : 94 age sex class fare survived 1 38.0 female First 71.2833 1 3 35.0 female First 53.1000 1 11 58.0 female First 26.5500 1 * key : ('First', 'male') * number : 122 age sex class fare survived 6 54.0 male First 51.8625 0 23 28.0 male First 35.5000 1 27 19.0 male First 263.0000 0 * key : ('Second', 'female') * number : 76 age sex class fare survived 9 14.0 female Second 30.0708 1 15 55.0 female Second 16.0000 1 41 27.0 female Second 21.0000 0 * key : ('Second', 'male') * number : 108 age sex class fare survived 17 NaN male Second 13.0 1 20 35.0 male Second 26.0 0 21 34.0 male Second 13.0 1 * key : ('Third', 'female') * number : 144 age sex class fare survived 2 26.0 female Third 7.9250 1 8 27.0 female Third 11.1333 1 10 4.0 female Third 16.7000 1 * key : ('Third', 'male') * number : 347 age sex class fare survived 0 22.0 male Third 7.2500 0 4 35.0 male Third 8.0500 0 5 NaN male Third 8.4583 0
- 6개 그룹별 평균('age', 'fare', 'survived')
In [102]:
grouped_TWO.mean()
Out[102]:
age | fare | survived | ||
---|---|---|---|---|
class | sex | |||
First | female | 34.611765 | 106.125798 | 0.968085 |
male | 41.281386 | 67.226127 | 0.368852 | |
Second | female | 28.722973 | 21.970121 | 0.921053 |
male | 30.740707 | 19.741782 | 0.157407 | |
Third | female | 21.750000 | 16.118810 | 0.500000 |
male | 26.507589 | 12.661633 | 0.135447 |
- grouped
In [103]:
grouped.agg(['mean', 'std'])
<ipython-input-103-36eab24aa6aa>:1: FutureWarning: ['sex'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning. grouped.agg(['mean', 'std'])
Out[103]:
age | fare | survived | ||||
---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | |
class | ||||||
First | 38.233441 | 14.802856 | 84.154687 | 78.380373 | 0.629630 | 0.484026 |
Second | 29.877630 | 14.001077 | 20.662183 | 13.417399 | 0.472826 | 0.500623 |
Third | 25.140620 | 12.495398 | 13.675550 | 11.778142 | 0.242363 | 0.428949 |
- grouped_TWO
In [107]:
grouped_TWO.agg(['mean', 'std'])
Out[107]:
age | fare | survived | |||||
---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | ||
class | sex | ||||||
First | female | 34.611765 | 13.612052 | 106.125798 | 74.259988 | 0.968085 | 0.176716 |
male | 41.281386 | 15.139570 | 67.226127 | 77.548021 | 0.368852 | 0.484484 | |
Second | female | 28.722973 | 12.872702 | 21.970121 | 10.891796 | 0.921053 | 0.271448 |
male | 30.740707 | 14.793894 | 19.741782 | 14.922235 | 0.157407 | 0.365882 | |
Third | female | 21.750000 | 12.729964 | 16.118810 | 11.690314 | 0.500000 | 0.501745 |
male | 26.507589 | 12.159514 | 12.661633 | 11.681696 | 0.135447 | 0.342694 |
- fare 열에만 적용
In [108]:
grouped.fare.agg(['min', 'max'])
Out[108]:
min | max | |
---|---|---|
class | ||
First | 0.0 | 512.3292 |
Second | 0.0 | 73.5000 |
Third | 0.0 | 69.5500 |
- fare 및 age 열에 각각 다른 함수 적용
In [109]:
grouped.agg({'fare' : ['min', 'max'], 'age' : ['mean', 'std']})
Out[109]:
fare | age | |||
---|---|---|---|---|
min | max | mean | std | |
class | ||||
First | 0.0 | 512.3292 | 38.233441 | 14.802856 |
Second | 0.0 | 73.5000 | 29.877630 | 14.001077 |
Third | 0.0 | 69.5500 | 25.140620 | 12.495398 |
In [111]:
grouped.filter(lambda x : len(x) >= 200).head()
Out[111]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
1 | 38.0 | female | First | 71.2833 | 1 |
2 | 26.0 | female | Third | 7.9250 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
4 | 35.0 | male | Third | 8.0500 | 0 |
- 그룹별 데이터 개수 확인
In [116]:
grouped.apply(len)
Out[116]:
class First 216 Second 184 Third 491 dtype: int64
- 'age' 열 평균이 30보다 작은 그룹의 결과만 필터링
- 'Second', 'Third'
In [117]:
grouped.filter(lambda x: x.age.mean() < 30).tail()
Out[117]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
884 | 25.0 | male | Third | 7.050 | 0 |
885 | 39.0 | female | Third | 29.125 | 0 |
886 | 27.0 | male | Second | 13.000 | 0 |
888 | NaN | female | Third | 23.450 | 0 |
890 | 32.0 | male | Third | 7.750 | 0 |
- 그룹별 'age' 열의 평균
In [118]:
grouped.age.mean()
Out[118]:
class First 38.233441 Second 29.877630 Third 25.140620 Name: age, dtype: float64
V. pivot_table( )¶
1) 실습용 'titanic' 데이터셋¶
In [119]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
TD = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
TD.head(3)
Out[119]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
1 | 38.0 | female | First | 71.2833 | 1 |
2 | 26.0 | female | Third | 7.9250 | 1 |
2) pivot_table( ) 구성요소¶
- index : 행 인덱스
- column : 열 인덱스
- values : 데이터
- aggfunc : 적용 함수
In [120]:
TD_1 = pd.pivot_table(TD,
index = 'class',
columns = 'sex',
values = 'age',
aggfunc = 'mean')
TD_1
Out[120]:
sex | female | male |
---|---|---|
class | ||
First | 34.611765 | 41.281386 |
Second | 28.722973 | 30.740707 |
Third | 21.750000 | 26.507589 |
3) 두개의 적용 함수¶
In [121]:
TD_2 = pd.pivot_table(TD,
index = 'class',
columns = 'sex',
values = 'survived',
aggfunc = ['mean', 'sum'])
TD_2
Out[121]:
mean | sum | |||
---|---|---|---|---|
sex | female | male | female | male |
class | ||||
First | 0.968085 | 0.368852 | 91 | 45 |
Second | 0.921053 | 0.157407 | 70 | 17 |
Third | 0.500000 | 0.135447 | 72 | 47 |
4) 다중 인덱스, 다중 데이터, 다중 함수¶
In [124]:
TD_3 = pd.pivot_table(TD,
index = ['class', 'sex'],
columns = 'survived',
values = ['age','fare'],
aggfunc = {'age' : ['mean', 'std'], 'fare' : ['min', 'max']})
TD_3
Out[124]:
age | fare | ||||||||
---|---|---|---|---|---|---|---|---|---|
mean | std | max | min | ||||||
survived | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | |
class | sex | ||||||||
First | female | 25.666667 | 34.939024 | 24.006943 | 13.223014 | 151.55 | 512.3292 | 28.7125 | 25.9292 |
male | 44.581967 | 36.248000 | 14.457749 | 14.936744 | 263.00 | 512.3292 | 0.0000 | 26.2875 | |
Second | female | 36.000000 | 28.080882 | 12.915107 | 12.764693 | 26.00 | 65.0000 | 10.5000 | 10.5000 |
male | 33.369048 | 16.022000 | 12.158125 | 19.547122 | 73.50 | 39.0000 | 0.0000 | 10.5000 | |
Third | female | 23.818182 | 19.329787 | 12.833465 | 12.303246 | 69.55 | 31.3875 | 6.7500 | 7.2250 |
male | 27.255814 | 22.274211 | 12.135707 | 11.555786 | 69.55 | 56.4958 | 0.0000 | 0.0000 |
In [126]:
type(TD_3)
Out[126]:
pandas.core.frame.DataFrame
VI. Multi-Index¶
- .xs( ): Cross Section
1) 행 멀티인덱스¶
- names = ['class', 'sex']
In [125]:
TD_3.index
Out[125]:
MultiIndex([( 'First', 'female'), ( 'First', 'male'), ('Second', 'female'), ('Second', 'male'), ( 'Third', 'female'), ( 'Third', 'male')], names=['class', 'sex'])
- 행 멀티인덱스: 객실 등급이 일등실
- level = 'class'
In [127]:
TD_3.xs('First', level = 'class', axis = 0)
Out[127]:
age | fare | |||||||
---|---|---|---|---|---|---|---|---|
mean | std | max | min | |||||
survived | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
sex | ||||||||
female | 25.666667 | 34.939024 | 24.006943 | 13.223014 | 151.55 | 512.3292 | 28.7125 | 25.9292 |
male | 44.581967 | 36.248000 | 14.457749 | 14.936744 | 263.00 | 512.3292 | 0.0000 | 26.2875 |
- 행 멀티인덱스: 성별이 남자
- level = 'sex'
In [130]:
TD_3.xs('male', level = 'sex', axis = 0)
Out[130]:
age | fare | |||||||
---|---|---|---|---|---|---|---|---|
mean | std | max | min | |||||
survived | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
class | ||||||||
First | 44.581967 | 36.248000 | 14.457749 | 14.936744 | 263.00 | 512.3292 | 0.0 | 26.2875 |
Second | 33.369048 | 16.022000 | 12.158125 | 19.547122 | 73.50 | 39.0000 | 0.0 | 10.5000 |
Third | 27.255814 | 22.274211 | 12.135707 | 11.555786 | 69.55 | 56.4958 | 0.0 | 0.0000 |
- 행 멀티인덱스: 객실등급이 일등실이면서 성별이 남자
- level = ['class', 'sex']
In [134]:
TD_3.xs(('First', 'male'), level = ['class', 'sex'], axis = 0)
Out[134]:
age | fare | ||||||||
---|---|---|---|---|---|---|---|---|---|
mean | std | max | min | ||||||
survived | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | |
class | sex | ||||||||
First | male | 44.581967 | 36.248 | 14.457749 | 14.936744 | 263.0 | 512.3292 | 0.0 | 26.2875 |
2) 열 멀티인덱스¶
- names = [None, None, 'survived']
- names = [0, 1, 2]
In [135]:
TD_3.columns
Out[135]:
MultiIndex([( 'age', 'mean', 0), ( 'age', 'mean', 1), ( 'age', 'std', 0), ( 'age', 'std', 1), ('fare', 'max', 0), ('fare', 'max', 1), ('fare', 'min', 0), ('fare', 'min', 1)], names=[None, None, 'survived'])
- .set_names( )
- names: ['Header', 'Fuction', 'Survived']
In [138]:
TD_3.columns.set_names(['Header', 'Function', 'Survived'], inplace = True)
In [139]:
TD_3.columns
Out[139]:
MultiIndex([( 'age', 'mean', 0), ( 'age', 'mean', 1), ( 'age', 'std', 0), ( 'age', 'std', 1), ('fare', 'max', 0), ('fare', 'max', 1), ('fare', 'min', 0), ('fare', 'min', 1)], names=['Header', 'Function', 'Survived'])
In [140]:
TD_3
Out[140]:
Header | age | fare | |||||||
---|---|---|---|---|---|---|---|---|---|
Function | mean | std | max | min | |||||
Survived | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | |
class | sex | ||||||||
First | female | 25.666667 | 34.939024 | 24.006943 | 13.223014 | 151.55 | 512.3292 | 28.7125 | 25.9292 |
male | 44.581967 | 36.248000 | 14.457749 | 14.936744 | 263.00 | 512.3292 | 0.0000 | 26.2875 | |
Second | female | 36.000000 | 28.080882 | 12.915107 | 12.764693 | 26.00 | 65.0000 | 10.5000 | 10.5000 |
male | 33.369048 | 16.022000 | 12.158125 | 19.547122 | 73.50 | 39.0000 | 0.0000 | 10.5000 | |
Third | female | 23.818182 | 19.329787 | 12.833465 | 12.303246 | 69.55 | 31.3875 | 6.7500 | 7.2250 |
male | 27.255814 | 22.274211 | 12.135707 | 11.555786 | 69.55 | 56.4958 | 0.0000 | 0.0000 |
- .set_levels( )
- Level(Survived): ['Dead', 'Alive']
In [141]:
TD_3.columns.set_levels(['Dead', 'Alive'], level = 2, inplace = True)
<ipython-input-141-ace6f01fc354>:1: FutureWarning: inplace is deprecated and will be removed in a future version. TD_3.columns.set_levels(['Dead', 'Alive'], level = 2, inplace = True)
In [142]:
TD_3.columns
Out[142]:
MultiIndex([( 'age', 'mean', 'Dead'), ( 'age', 'mean', 'Alive'), ( 'age', 'std', 'Dead'), ( 'age', 'std', 'Alive'), ('fare', 'max', 'Dead'), ('fare', 'max', 'Alive'), ('fare', 'min', 'Dead'), ('fare', 'min', 'Alive')], names=['Header', 'Function', 'Survived'])
In [143]:
TD_3
Out[143]:
Header | age | fare | |||||||
---|---|---|---|---|---|---|---|---|---|
Function | mean | std | max | min | |||||
Survived | Dead | Alive | Dead | Alive | Dead | Alive | Dead | Alive | |
class | sex | ||||||||
First | female | 25.666667 | 34.939024 | 24.006943 | 13.223014 | 151.55 | 512.3292 | 28.7125 | 25.9292 |
male | 44.581967 | 36.248000 | 14.457749 | 14.936744 | 263.00 | 512.3292 | 0.0000 | 26.2875 | |
Second | female | 36.000000 | 28.080882 | 12.915107 | 12.764693 | 26.00 | 65.0000 | 10.5000 | 10.5000 |
male | 33.369048 | 16.022000 | 12.158125 | 19.547122 | 73.50 | 39.0000 | 0.0000 | 10.5000 | |
Third | female | 23.818182 | 19.329787 | 12.833465 | 12.303246 | 69.55 | 31.3875 | 6.7500 | 7.2250 |
male | 27.255814 | 22.274211 | 12.135707 | 11.555786 | 69.55 | 56.4958 | 0.0000 | 0.0000 |
- 열 멀티인덱스: 평균 나이
In [144]:
TD_3.xs('mean', level = 'Function', axis = 1)
Out[144]:
Header | age | ||
---|---|---|---|
Survived | Dead | Alive | |
class | sex | ||
First | female | 25.666667 | 34.939024 |
male | 44.581967 | 36.248000 | |
Second | female | 36.000000 | 28.080882 |
male | 33.369048 | 16.022000 | |
Third | female | 23.818182 | 19.329787 |
male | 27.255814 | 22.274211 |
- 열 멀티인덱스: 생존자 정보
In [145]:
TD_3.xs('Alive', level = 'Survived', axis = 1)
Out[145]:
Header | age | fare | |||
---|---|---|---|---|---|
Function | mean | std | max | min | |
class | sex | ||||
First | female | 34.939024 | 13.223014 | 512.3292 | 25.9292 |
male | 36.248000 | 14.936744 | 512.3292 | 26.2875 | |
Second | female | 28.080882 | 12.764693 | 65.0000 | 10.5000 |
male | 16.022000 | 19.547122 | 39.0000 | 10.5000 | |
Third | female | 19.329787 | 12.303246 | 31.3875 | 7.2250 |
male | 22.274211 | 11.555786 | 56.4958 | 0.0000 |
- 열 멀티인덱스: 최저 요금
In [147]:
TD_3.xs(('fare', 'min'), level = ['Header', 'Function'], axis = 1)
Out[147]:
Survived | Dead | Alive | |
---|---|---|---|
class | sex | ||
First | female | 28.7125 | 25.9292 |
male | 0.0000 | 26.2875 | |
Second | female | 10.5000 | 10.5000 |
male | 0.0000 | 10.5000 | |
Third | female | 6.7500 | 7.2250 |
male | 0.0000 | 0.0000 |
- 열 멀티인덱스: 사망자의 평균 나이
In [148]:
TD_3.xs(('age', 'mean', 'Dead'), level = ['Header', 'Function', 'Survived'], axis = 1)
Out[148]:
Header | age | |
---|---|---|
Function | mean | |
Survived | Dead | |
class | sex | |
First | female | 25.666667 |
male | 44.581967 | |
Second | female | 36.000000 |
male | 33.369048 | |
Third | female | 23.818182 |
male | 27.255814 |
VII. etc¶
1) 실습용 'titanic' 데이터셋¶
In [149]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
TD = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
TD.head()
Out[149]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
1 | 38.0 | female | First | 71.2833 | 1 |
2 | 26.0 | female | Third | 7.9250 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
4 | 35.0 | male | Third | 8.0500 | 0 |
2) .value_counts( )¶
- Series
In [150]:
TD['sex'].value_counts()
Out[150]:
male 577 female 314 Name: sex, dtype: int64
- DataFrame
In [151]:
TD[['sex', 'class']].value_counts()
Out[151]:
sex class male Third 347 female Third 144 male First 122 Second 108 female First 94 Second 76 dtype: int64
3) .nunique( ) / .unique()¶
- Series
In [152]:
TD['sex'].nunique()
Out[152]:
2
In [154]:
TD['sex'].unique()
Out[154]:
array(['male', 'female'], dtype=object)
- DataFrame
In [153]:
TD[['sex', 'class']].nunique()
Out[153]:
sex 2 class 3 dtype: int64
In [157]:
#TD[['sex', 'class']].unique()
4) .replace( )¶
- Series
In [158]:
TD.loc[[0, 4], :]
Out[158]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.25 | 0 |
4 | 35.0 | male | Third | 8.05 | 0 |
In [159]:
TD['sex'] = TD['sex'].replace('male', 'MAN')
In [160]:
TD.loc[TD['sex'] == 'MAN', :][:2]
Out[160]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | MAN | Third | 7.25 | 0 |
4 | 35.0 | MAN | Third | 8.05 | 0 |
- DataFrame
In [161]:
TD.loc[[1, 9, 0], :]
Out[161]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
1 | 38.0 | female | First | 71.2833 | 1 |
9 | 14.0 | female | Second | 30.0708 | 1 |
0 | 22.0 | MAN | Third | 7.2500 | 0 |
In [162]:
TD[['class']] = TD[['class']].replace({'First':'1st', 'Second':'2nd', 'Third':'3rd'})
In [163]:
TD.loc[[1, 9, 0], :]
Out[163]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
1 | 38.0 | female | 1st | 71.2833 | 1 |
9 | 14.0 | female | 2nd | 30.0708 | 1 |
0 | 22.0 | MAN | 3rd | 7.2500 | 0 |
- Missing Value
In [164]:
TD.loc[[5], :]
Out[164]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
5 | NaN | MAN | 3rd | 8.4583 | 0 |
In [165]:
import numpy as np
TD[['age']] = TD[['age']].replace(np.nan, int(TD.age.mean()))
In [166]:
TD.loc[[5], :]
Out[166]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
5 | 29.0 | MAN | 3rd | 8.4583 | 0 |
'# Coding > 데이터 분석을 위한 Python' 카테고리의 다른 글
Python 기술 통계 (0) | 2023.10.02 |
---|---|
Python 씨본 (0) | 2023.10.02 |
Python 판다스 (0) | 2023.10.02 |
Python 넘파이 (0) | 2023.10.02 |
Python 클래스 & 패키지 (0) | 2023.10.02 |