Capital Inflow and Outflow Forecasting Competition (Part 3)

This post covers rule-based time-series forecasting. It follows the Datawhale study material on capital inflow and outflow prediction.

First, import the required packages:

import pandas as pd
import sklearn as skr
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil.relativedelta import relativedelta

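Next, we wrap the time-feature extraction from the earlier posts in a function. It derives easier-to-use fields such as day, month, year, week, and weekday from the raw date column: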

def add_timestamp(data: pd.DataFrame, time_index: str = 'report_date') -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left untouched
    data_balance = data.copy()
    data_balance['date'] = pd.to_datetime(data_balance[time_index], format="%Y%m%d")
    data_balance['day'] = data_balance['date'].dt.day
    data_balance['month'] = data_balance['date'].dt.month
    data_balance['year'] = data_balance['date'].dt.year
    # Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
    data_balance['week'] = data_balance['date'].dt.isocalendar().week.astype(int)
    data_balance['weekday'] = data_balance['date'].dt.weekday  # Monday=0 ... Sunday=6
    return data_balance.reset_index(drop=True)
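To make the extracted fields concrete, here is a minimal sketch on a hypothetical three-row sample (the real table has many more rows and columns). It repeats the same pandas calls outside the function so it can run on its own; the `sample` frame and its values are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical sample: three consecutive report dates stored as integers
sample = pd.DataFrame({'report_date': [20140301, 20140302, 20140303]})

# Parse the yyyymmdd values and derive calendar features
sample['date'] = pd.to_datetime(sample['report_date'].astype(str), format='%Y%m%d')
sample['day'] = sample['date'].dt.day
sample['month'] = sample['date'].dt.month
sample['week'] = sample['date'].dt.isocalendar().week.astype(int)  # ISO week number
sample['weekday'] = sample['date'].dt.weekday  # Monday=0 ... Sunday=6

print(sample[['date', 'day', 'week', 'weekday']])
```

2014-03-01 was a Saturday, so the weekday column reads 5, 6, 0, and the ISO week number moves from 9 to 10 on the Monday.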

Define a load_data function that reads the file and passes it through the function above, returning a DataFrame with the time features already extracted:

def load_data(path: str = 'user_balance_table.csv') -> pd.DataFrame:
    # Read the raw balance table and attach the time features
    data_balance = pd.read_csv(path)
    data_balance = add_timestamp(data_balance)
    return data_balance.reset_index(drop=True)

The daily totals of total_purchase_amt and total_redeem_amt used earlier are also wrapped in a function:

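def get_total_balance(data: pd.DataFrame, date: str = '2014-03-31') -> pd.DataFrame:
    df_tmp = data.copy()
    # Sum purchases and redemptions over all users for each day.
    # Selecting several columns requires a list; the tuple form fails in current pandas.
    df_tmp = df_tmp.groupby(['date'])[['total_purchase_amt', 'total_redeem_amt']].sum()
    df_tmp.reset_index(inplace=True)
    # Keep only the days on or after the given start date
    return df_tmp[(df_tmp['date'] >= date)].reset_index(drop=True)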
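As a quick illustration of the aggregation step, the sketch below sums per-user records into daily totals and then filters by a start date. The `records` frame is a made-up example, not competition data.

```python
import pandas as pd

# Hypothetical per-user records: two users on 2014-03-01, one on 2014-03-02
records = pd.DataFrame({
    'date': pd.to_datetime(['2014-03-01', '2014-03-01', '2014-03-02']),
    'total_purchase_amt': [100, 50, 30],
    'total_redeem_amt': [20, 10, 5],
})

# One row per day, summed over all users
daily = (records.groupby('date')[['total_purchase_amt', 'total_redeem_amt']]
         .sum()
         .reset_index())

# Keep only the days on or after the chosen start date
recent = daily[daily['date'] >= '2014-03-02'].reset_index(drop=True)

print(daily)
print(recent)
```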

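Generating the test set is wrapped in a function as well. It appends one placeholder row per future day, with the amounts left as NaN: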

def generate_test_data(data: pd.DataFrame) -> pd.DataFrame:
    total_balance = data.copy()
    start = datetime.datetime(2014, 9, 1)
    testdata = []
    # One placeholder row per day from 2014-09-01 through 2014-10-14
    while start < datetime.datetime(2014, 10, 15):
        temp = [start, np.nan, np.nan]
        testdata.append(temp)
        start += datetime.timedelta(days=1)
    testdata = pd.DataFrame(testdata)
    testdata.columns = total_balance.columns

    # Append the placeholder rows below the historical data
    total_balance = pd.concat([total_balance, testdata], axis=0)
    return total_balance.reset_index(drop=True)
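The same placeholder rows can also be built without an explicit loop. The following is only a sketch of that alternative, assuming the same three columns and the same date range as the function above; the two-row `history` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical tail of the historical daily totals
history = pd.DataFrame({
    'date': pd.to_datetime(['2014-08-30', '2014-08-31']),
    'total_purchase_amt': [1.0, 2.0],
    'total_redeem_amt': [3.0, 4.0],
})

# One row per future day, with the amounts still unknown (NaN)
future = pd.DataFrame({
    'date': pd.date_range('2014-09-01', '2014-10-14', freq='D'),
    'total_purchase_amt': np.nan,
    'total_redeem_amt': np.nan,
})

combined = pd.concat([history, future], ignore_index=True)
print(combined.tail())
```

pd.date_range includes both endpoints, so this covers the 44 days from 2014-09-01 through 2014-10-14.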

Finally, define a function that loads the user profile table:

def load_user_information(path: str = 'user_profile_table.csv') -> pd.DataFrame:
    return pd.read_csv(path)

With these helpers in place, we can load and preprocess the data the same way as in Part 1 of this series:

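# Load the raw table; load_data already attaches the time features
balance_data = load_data('Dataset/user_balance_table.csv')
# Redundant after load_data, but harmless
balance_data = add_timestamp(balance_data)
# Daily totals from 2014-03-01 onward
total_balance = get_total_balance(balance_data, date='2014-03-01')
# Append NaN placeholder rows for the days to predict
total_balance = generate_test_data(total_balance)
# Recompute the time features, this time from the 'date' column
total_balance = add_timestamp(total_balance, 'date')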

Create a deep copy of the data:

data = total_balance.copy()

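Now comes the key step: defining the function that produces the rule-based time-series prediction. It combines a weekday factor with a day-of-month base value: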

def generate_base(df: pd.DataFrame, month_index: int) -> pd.DataFrame:
    # Select the training window: 2014-03-01 up to (not including) the target month
    total_balance = df.copy()
    total_balance = total_balance[['date', 'total_purchase_amt', 'total_redeem_amt']]
    # Compare against pd.Timestamp; comparing a datetime64 column with datetime.date fails in current pandas
    total_balance = total_balance[(total_balance['date'] >= pd.Timestamp(2014, 3, 1)) & (total_balance['date'] < pd.Timestamp(2014, month_index, 1))].copy()

    # Add time features
    total_balance['weekday'] = total_balance['date'].dt.weekday
    total_balance['day'] = total_balance['date'].dt.day
    total_balance['week'] = total_balance['date'].dt.isocalendar().week.astype(int)
    total_balance['month'] = total_balance['date'].dt.month

    # Weekday factor: mean of each weekday divided by the overall mean
    mean_of_each_weekday = total_balance[['weekday', 'total_purchase_amt', 'total_redeem_amt']].groupby('weekday', as_index=False).mean()
    for name in ['total_purchase_amt', 'total_redeem_amt']:
        mean_of_each_weekday = mean_of_each_weekday.rename(columns={name: name + '_weekdaymean'})
    mean_of_each_weekday['total_purchase_amt_weekdaymean'] /= np.mean(total_balance['total_purchase_amt'])
    mean_of_each_weekday['total_redeem_amt_weekdaymean'] /= np.mean(total_balance['total_redeem_amt'])

    # Merge the weekday factors back into the training data
    total_balance = pd.merge(total_balance, mean_of_each_weekday, on='weekday', how='left')

    # Count how often each weekday falls on each day of the month (1 to 31)
    weekday_count = total_balance[['day', 'weekday', 'date']].groupby(['day', 'weekday'], as_index=False).count()
    weekday_count = pd.merge(weekday_count, mean_of_each_weekday, on='weekday')

    # Weight the weekday factors by those frequencies to get a day-of-month factor
    n_months = len(np.unique(total_balance['month']))
    weekday_count['total_purchase_amt_weekdaymean'] *= weekday_count['date'] / n_months
    weekday_count['total_redeem_amt_weekdaymean'] *= weekday_count['date'] / n_months
    day_rate = weekday_count.drop(['weekday', 'date'], axis=1).groupby('day', as_index=False).sum()

    # Divide the mean of each day of the month by its day factor to remove the weekday effect; the result is the base
    day_mean = total_balance[['day', 'total_purchase_amt', 'total_redeem_amt']].groupby('day', as_index=False).mean()
    day_pre = pd.merge(day_mean, day_rate, on='day', how='left')
    day_pre['total_purchase_amt'] /= day_pre['total_purchase_amt_weekdaymean']
    day_pre['total_redeem_amt'] /= day_pre['total_redeem_amt_weekdaymean']

    # Build the dates of the target month; days_in_month handles 28-, 30- and 31-day months
    month_start = pd.Timestamp(2014, month_index, 1)
    dates = pd.date_range(month_start, periods=month_start.days_in_month, freq='D')
    day_pre = pd.merge(pd.DataFrame({'date': dates, 'day': dates.day}), day_pre, on='day', how='inner')

    # Final prediction: base multiplied by the weekday factor of each target date
    day_pre['weekday'] = day_pre['date'].dt.weekday
    day_pre = day_pre[['date', 'weekday', 'total_purchase_amt', 'total_redeem_amt']]
    day_pre = pd.merge(day_pre, mean_of_each_weekday, on='weekday')
    day_pre['total_purchase_amt'] *= day_pre['total_purchase_amt_weekdaymean']
    day_pre['total_redeem_amt'] *= day_pre['total_redeem_amt_weekdaymean']

    day_pre = day_pre.sort_values('date')[['date', 'total_purchase_amt', 'total_redeem_amt']]
    return day_pre
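The weekday factor at the heart of this rule may be easier to see on synthetic data. The sketch below is a simplified illustration, not the full method: it assumes a made-up series with 100 on working days and 50 on weekends, and it uses the overall mean as the base instead of the day-of-month base computed above.

```python
import numpy as np
import pandas as pd

# Synthetic series: four full weeks starting on Monday 2014-03-03
dates = pd.date_range('2014-03-03', periods=28, freq='D')
amounts = np.where(dates.weekday < 5, 100.0, 50.0)  # working days vs weekends
df = pd.DataFrame({'date': dates, 'total_purchase_amt': amounts})
df['weekday'] = df['date'].dt.weekday

# Weekday factor = mean of that weekday / overall mean
overall_mean = df['total_purchase_amt'].mean()
weekday_factor = df.groupby('weekday')['total_purchase_amt'].mean() / overall_mean

# Simplified forecast: a flat base scaled by the weekday factor
base = overall_mean
forecast = base * weekday_factor

print(weekday_factor.round(4))
print(forecast)
```

Because every weekday appears equally often here, the factors average to exactly 1 and the forecast reproduces the 100/50 pattern. In real data the weekdays are spread unevenly over the days of the month, which is why the function above also weights the factors by how often each weekday falls on each day.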

The results are admittedly mediocre; we will optimize further in later posts.

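  • Copyright notice: Unless otherwise stated, all articles on this blog are copyrighted by the author. Please credit the source when reposting!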

  • Copyrights © 2020 chenk