Data Visualization

Posted on 2019-11-02 | Edited on 2019-11-24

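Histogram

A histogram is a chart that visualizes how data is distributed over continuous intervals or specific time periods, and it is widely used in statistics. Put simply, a histogram describes the frequency distribution of a data set: for example, split ages into 17 groups ("0-5, 5-10, ..., 80-85") and count how China's population is distributed across them. A histogram helps us see the shape of the distribution, such as the approximate location of the mode and the median, and whether the data has gaps or outliers.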

NumPy

histogram

numpy.histogram(a, bins=10, range=None, normed=None, weights=None, density=None)

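Parameters

Parameter	Meaning
a	Input data, an array-like (for example a list)
bins	int: number of equal-width bins (default 10); list: a sequence of bin edges; string: the name of a method used to compute the bin width
range	The (lower, upper) range covered by the bins; defaults to (a.min(), a.max())
normed	Deprecated; use density instead
weights	A weight for each value in a; each value contributes its weight to the bin instead of 1. When density is True the weights are normalized
density	If True, returns the value of the probability density function in each bin instead of a count, normalized so that the integral over the range is 1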

Return values

Name	Meaning
hist	The count in each bin
bin_edges	The list of bin edges
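A minimal sketch of how the two return values relate, and of what density=True does. The ages array is made up for illustration; only numpy.histogram itself comes from the notes above.

```python
import numpy as np

# Hypothetical ages, chosen only for illustration
ages = np.array([1, 3, 7, 12, 18, 18, 25, 33, 41, 58])

# Explicit edges give the bins [0, 20), [20, 40), [40, 60]
# (the last bin also includes its right edge)
hist, bin_edges = np.histogram(ages, bins=[0, 20, 40, 60])
print(hist)       # one count per bin
print(bin_edges)  # always one more edge than there are bins

# With density=True the values are scaled so that
# sum(density * bin_width) equals 1
density, _ = np.histogram(ages, bins=[0, 20, 40, 60], density=True)
area = float(np.sum(density * np.diff(bin_edges)))
print(area)
```

Because density divides by both the sample size and the bin width, the values are comparable across bins of different widths.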

bincount

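This function is a simplified variant: it counts how often each non-negative integer value occurs and cannot use custom bin widths, so it is rarely the right tool for a histogram.

numpy.bincount(x, weights=None, minlength=0)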

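Parameters

Parameter	Meaning
x	Input data, a one-dimensional array-like of non-negative integers
weights	A weight for each value in x; each value contributes its weight instead of 1
minlength	The minimum number of bins in the output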

Return value

Name	Meaning
hist	The count in each bin
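A short sketch of the three parameters, using a made-up input array. Index i of the result holds the count (or total weight) of the value i.

```python
import numpy as np

# Hypothetical input: small non-negative integers
x = np.array([0, 1, 1, 3, 3, 3])

counts = np.bincount(x)               # one slot per integer from 0 to x.max()
padded = np.bincount(x, minlength=6)  # padded with zeros up to 6 slots

# With weights, each occurrence adds its weight instead of 1
weights = np.array([0.5, 1.0, 1.0, 2.0, 2.0, 2.0])
weighted = np.bincount(x, weights=weights)

print(counts, padded, weighted)
```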

pandas

DataFrame.plot.hist()

Parameters

https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L504-L1533

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html#pandas.DataFrame.hist

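Parameter	Meaning
column	The column(s) to plot
by	The column to group by
grid	Whether to show grid lines
xlabelsize	Font size of the x-axis tick labels
xrot	Rotation of the x-axis tick labels
ylabelsize	Font size of the y-axis tick labels
yrot	Rotation of the y-axis tick labels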

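x	Label for the x axis, usually the name of a column in the DataFrame; defaults to None
y	Column(s) to draw on the y axis. If omitted, all numeric columns of the DataFrame are plotted and non-numeric columns are skipped
kind	Chart type; defaults to a line chart. Options: 'line' (line chart), 'bar' (vertical bar chart), 'barh' (horizontal bar chart), 'hist' (histogram), 'box' (box plot), 'kde' (kernel density estimate), 'area' (area chart), 'pie' (pie chart), 'scatter' (scatter plot), 'hexbin' (short for hexagonal binning; similar to a heat map, it shows how many points fall in each region, with regular hexagons as the regions)
ax	A matplotlib axes object (think of it as a subplot). With multiple subplots (created with matplotlib's subplots() or add_subplot()), this selects the subplot to draw on. Defaults to None
subplots	Whether to draw each column separately; defaults to False. When True, each column is drawn in its own subplot
sharex	Only applies when subplots is True: whether all subplots share the same x axis. Defaults to True when ax is None, and to False when ax is given, in which case each subplot keeps its own x-axis labels
sharey	Only applies when subplots is True: whether the subplots share the y axis; defaults to False
layout	When subplots is True, a (rows, columns) tuple that sets the subplot grid
figsize	A tuple that sets the figure size
use_index	Whether to use the DataFrame's index as the x-axis ticks; defaults to True. It is ignored when x is not None, and it also appears to have no effect when the index is non-numeric (strings, datetime, and so on)
title	The chart title
grid	Whether to show grid lines; defaults to False
legend	Whether to show the legend; defaults to True. The legend entries are the column names
style	Line style; defaults to a solid line
xticks	Tick values on the x axis; must be a numeric sequence
yticks	Same as xticks, for the y axis
xlim	Range of values shown on the x axis, as a tuple
ylim	Same as xlim, for the y axis
rot	Rotation angle of the tick labels (x ticks for vertical plots, y ticks for horizontal plots)
fontsize	Font size of the x/y tick labels
colormap	Colors used for the plot, given as a matplotlib colormap name (string) or a colormap object
colorbar	Whether to show a color bar; only relevant for plots that have one, such as 'scatter' and 'hexbin'
position	Bar charts only. A value in [0, 1] that sets the relative alignment of the bars: 0 aligns to the left end, 1 to the right end
sort_columns	Whether to sort the column names to determine the plotting order
secondary_y	Whether to plot on the secondary y axis (normally the one on the right); defaults to False. A list or tuple may also be passed to choose which columns go on the secondary axis
mark_right	When a secondary y axis is used, whether to append "(right)" to the legend labels of the columns drawn on it; defaults to True
**kwds	Any additional keyword arguments are passed through to the underlying matplotlib plotting method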
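A minimal sketch of a few of these parameters in use. The data is randomly generated for illustration, the column name principal_payable is borrowed from the examples later in this post, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs

import numpy as np
import pandas as pd

# Made-up data: 500 normally distributed amounts
rng = np.random.default_rng(0)
df = pd.DataFrame({"principal_payable": rng.normal(1000, 200, size=500)})

# kind, grid, title and figsize are described in the table above;
# bins is forwarded to the histogram plot through **kwds
ax = df["principal_payable"].plot(
    kind="hist",
    bins=10,
    grid=True,
    title="principal_payable",
    figsize=(6, 4),
)

n_bars = len(ax.patches)  # one rectangle per bin
print(n_bars, ax.get_title())
```

The call returns a matplotlib axes object, so anything not covered by the pandas parameters can still be adjusted on ax afterwards.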

seaborn

A wrapper layer built on top of matplotlib.

pandas + pyecharts + eplot + numpy

from eplot import eplot

df['principal_payable'].eplot.hist(bins=10)

https://pyecharts.org/#/zh-cn/intro

https://pyecharts.herokuapp.com/

plotly

Cufflinks is a third-party wrapper around plotly.
plotly express is the official wrapper and plays the same role as Cufflinks.
Dash: built on Flask, Plotly.js, and React.js.

import plotly.express as px
fig = px.histogram(s, x="principal_payable", marginal="rug", hover_data=s.columns, nbins=20)
fig.show()

https://plot.ly/python/distplot/

Map Data Visualization

plotly

Baidu Maps

Amap (Gaode Maps)

pandas cheat sheet

Posted on 2019-10-28 | Edited on 2019-11-20

import numpy as np
import pandas as pd

Reading files

pd.read_csv("d.csv", sep=',', index_col=False)
excelsource = pd.read_excel('111.xlsx', index_col=None)

Months

pd.date_range(start='2018/01/01', end='2018/07/01', freq='M')
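With freq='M' the range consists of month-end dates. A small sketch, assuming month-start dates are wanted instead, uses the 'MS' alias and formats the result as year-month labels:

```python
import pandas as pd

# 'MS' = month start; both endpoints are included when they fall on a month start
month_starts = pd.date_range(start="2018-01-01", end="2018-07-01", freq="MS")

# Turn the timestamps into "YYYY-MM" strings
labels = month_starts.strftime("%Y-%m").tolist()
print(labels)
```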

Exporting data

result.to_csv("out.txt")

Maximum number of rows displayed

pd.set_option('display.max_rows', 20)

Precision

with pd.option_context('display.precision', 10):
    print(result)

Display width

pd.set_option('display.width', 300)

Conditionally modifying a column

result.loc[result['sum'] <= 0, 'res'] = result['AMT']
result.loc[result['sum'] > 0, 'res'] = result['AMT'] - result['sum']
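The two conditional assignments can also be collapsed into a single vectorized expression. A minimal sketch with numpy.where, on made-up values (the column names follow the snippet above):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
result = pd.DataFrame({"AMT": [100, 100, 100], "sum": [-5, 0, 30]})

# Where sum <= 0 keep AMT, otherwise subtract sum from AMT
result["res"] = np.where(
    result["sum"] <= 0,
    result["AMT"],
    result["AMT"] - result["sum"],
)
print(result)
```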

union

pd.concat([df1,df2])

join

pd.merge(result, thismonthdf, on="ACCOUNT_ID", how='left')

pd.merge(excelsource, dbsource, left_on="passId", right_on='pass_id', how='left')

groupby

dftemp.groupby(dftemp['LOAN_ID']).agg({"PRODUCT_ID": 'first', "AMT": np.sum}).reset_index()

If reset_index would insert a column whose name already exists, pass drop=True:

aaa.reset_index(drop=True)
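A small worked example of the groupby/agg pattern, on made-up rows. It uses the string aliases 'first' and 'sum' for the aggregations, which is an assumption about preferred style rather than something the original notes prescribe.

```python
import pandas as pd

# Hypothetical loan rows
dftemp = pd.DataFrame({
    "LOAN_ID": [1, 1, 2],
    "PRODUCT_ID": ["p1", "p1", "p2"],
    "AMT": [10.0, 5.0, 7.0],
})

# One row per loan: first product id, total amount
summary = (
    dftemp.groupby("LOAN_ID")
    .agg({"PRODUCT_ID": "first", "AMT": "sum"})
    .reset_index()  # turn the group key back into an ordinary column
)
print(summary)
```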

sort and groupby

data.sort_values(by='occurtime', ascending=False).groupby(['bid']).first()
data.sort_values(by='occurtime', ascending=False).groupby(['bid']).head(1)
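The two variants are not interchangeable: first() returns the first non-null value of each column per group and is indexed by the group key, while head(1) returns the first original row of each group unchanged. A sketch on made-up data that shows the difference:

```python
import pandas as pd

# Hypothetical events; the newest row for "a" has a missing value
data = pd.DataFrame({
    "bid": ["a", "a", "b"],
    "occurtime": ["2019-01-02", "2019-01-01", "2019-01-03"],
    "value": [None, 1.0, 2.0],
})

ordered = data.sort_values(by="occurtime", ascending=False)

latest_first = ordered.groupby("bid").first()  # skips nulls column by column
latest_head = ordered.groupby("bid").head(1)   # keeps whole rows, nulls included

print(latest_first)
print(latest_head)
```

When the goal is "the latest row per key", head(1) is usually the safer choice because it never mixes values from different rows.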

Comparing data differences

fcore['test'] = 0
result = pd.merge(asset, fcore, on=['LOAN_ID', 'REQUEST_ID'], how='left')
result[result['test'] != 0]  # rows of asset with no match in fcore (test is NaN)

result = pd.merge(ALL, BLACK, on=['key'], how='outer')
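An alternative to adding a marker column by hand is the indicator parameter of pd.merge, which adds a _merge column telling where each row came from. A minimal sketch on made-up frames (the names asset and fcore follow the snippet above):

```python
import pandas as pd

# Hypothetical frames with partially overlapping keys
asset = pd.DataFrame({"LOAN_ID": [1, 2, 3]})
fcore = pd.DataFrame({"LOAN_ID": [2, 3, 4]})

merged = pd.merge(asset, fcore, on="LOAN_ID", how="outer", indicator=True)

# _merge is one of "left_only", "right_only", "both"
only_in_asset = merged.loc[merged["_merge"] == "left_only", "LOAN_ID"].tolist()
only_in_fcore = merged.loc[merged["_merge"] == "right_only", "LOAN_ID"].tolist()
print(only_in_asset, only_in_fcore)
```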

Deduplication

data21.drop_duplicates()

Set difference

subset names the columns used to detect duplicates; keep=False drops every duplicated row outright instead of keeping one distinct copy.

WHITE = pd.concat([huaihai, BLACK, BLACK, GREY, GREY]).drop_duplicates(subset=['key'], keep=False)
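Why the excluded frames are listed twice: with keep=False every key that occurs more than once is dropped, so duplicating the blacklist guarantees that its keys are removed even when they never appear in the main frame. A small sketch with hypothetical frames, using pd.concat:

```python
import pandas as pd

# Hypothetical data: key 2 is blacklisted, key 5 exists only in the blacklist
all_keys = pd.DataFrame({"key": [1, 2, 3, 4]})
black = pd.DataFrame({"key": [2, 5]})

white = pd.concat([all_keys, black, black]).drop_duplicates(
    subset=["key"], keep=False
)
print(white["key"].tolist())
```

Listing black only once would leave key 5 in the result, because it would then occur exactly once. The trick assumes the main frame has no duplicate keys of its own, since those would be dropped as well.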

File output

pd.DataFrame(result[result['res'] > 0], columns=['CUSTOMER_ID', 'ACCOUNT_ID', 'FINANCING_SOURCE', 'res']).to_csv("./credit/out_" + stopDate.strftime("%Y-%m-%d") + ".tsv", sep='\t')

filter

df2[df2['E'].isin(['two', 'four'])]
bacdata[(bacdata['CHARGES'] < 0) | (bacdata['INTEREST'] < 0)]
b[(b['complete_day'] >= '2019-02-27') & (b['complete_day'] < '2019-02-28') & (b['timepoint'].astype(int) < 2)]

Filtering null data

temp[temp['1'].isnull()]
temp[temp['1'].notnull()]

check if exists

1094961539513730115 in fcoreData['LOAN_ID'].values
fcoreData['LOAN_ID'].values.tolist()

Iterating over rows

def _map(data, exp):
    for index, row in data.iterrows():  # get each row's index and contents
        for col_name in data.columns:
            # row is a copy, so write the result back through data.loc
            data.loc[index, col_name] = exp(row[col_name])
    return data

# Modify while iterating
for i, trial in dfTrials.iterrows():
    dfTrials.loc[i, "response"] = "answer {}".format(trial["no"])
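For a simple per-row transformation like this one, a column-wise expression avoids the Python-level loop entirely. A sketch on a made-up frame that reuses the dfTrials, no and response names from the loop:

```python
import pandas as pd

# Hypothetical trials table
dfTrials = pd.DataFrame({"no": [1, 2, 3]})

# Build the whole column at once instead of row by row
dfTrials["response"] = "answer " + dfTrials["no"].astype(str)
print(dfTrials)
```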

https://blog.csdn.net/ls13552912394/article/details/79349809

Modifying a value in a column
b.iloc[2, b.columns.get_loc('complete_day')] = '2019-02-27'

Summing several columns; axis=1 applies the function to each row
b.apply(lambda x: int(x['timepoint']) + int(x['branch_type']), axis=1)

# Date and time
from datetime import datetime
today = datetime.strptime(datetime.today().strftime("%Y-%m-%d"), '%Y-%m-%d')
datetime.combine(row['CLEAR_DATE'], datetime.min.time())
datetime(2016, 9, 3).date()
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%Y-%m-%d %H:%M:%S.%f')
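The strftime/strptime round trip and datetime.combine both produce midnight of the given day. A standard-library sketch on a fixed date that shows they agree:

```python
from datetime import date, datetime

d = date(2016, 9, 3)

# Attach the earliest time of day (00:00:00) to a date
midnight = datetime.combine(d, datetime.min.time())

# Formatting to a date-only string and parsing it back also truncates to midnight
roundtrip = datetime.strptime(d.strftime("%Y-%m-%d"), "%Y-%m-%d")

print(midnight, roundtrip)
```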

math.isnan()

Plotting

%matplotlib inline
result.plot()

Type conversion

df['col2'] = df['col2'].astype('int')
print('-----------')
print(df.dtypes)
df['col2'] = df['col2'].astype('float64')
print('-----------')
print(df.dtypes)

Precision issues

555.55*100=55554.999999999
so the result has to be rounded:
details['PRINCIPAL'] = details['PRINCIPAL'].round()
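The same effect can be reproduced with the standard library alone. A minimal sketch: rounding recovers the intended integer, and decimal.Decimal avoids the binary floating-point error altogether when the input is available as a string.

```python
from decimal import Decimal

product = 555.55 * 100  # binary floating point, may be slightly off 55555
rounded = round(product)  # nearest integer

# Decimal arithmetic on the string form is exact
exact = Decimal("555.55") * 100

print(product, rounded, exact)
```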

df.index.name = 'foo'

hadoop

Posted on 2019-09-25 | Edited on 2019-11-11

Hadoop installation and configuration

~ tar -zxvf hadoop-x.y.z.tar.gz

~ vim ~/.zshrc
export HADOOP_HOME=/Users/xxx/usr/hadoop
export HADOOP_CONF_DIR=/Users/xxx/workspace/service/hadoop/conf
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

~ hadoop version
Hadoop 3.2.1

es cheat sheet

Posted on 2019-05-01 | Edited on 2019-05-08

es cheat sheet
http://elasticsearch-cheatsheet.jolicode.com

exploring cluster

cluster health

# Cluster health
curl -X GET "localhost:9200/_cat/health?v"
# List nodes
curl -X GET "localhost:9200/_cat/nodes?v"

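Introduction to BitTorrent

Posted on 2019-03-19 | Edited on 2019-03-20

History

2001: Bram Cohen, BitTorrent

2003: Red Hat Linux 9 was released and the servers were overwhelmed; 21150 GB of data was exchanged in 3 days
In China: NetAnts, FlashGet

BitTorrent is maintained by the BitTorrent Community Forum; the related protocol documents are called BEPs (BitTorrent Enhancement Proposals)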

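Microeconomics notes

Posted on 2019-02-09 | Edited on 2019-10-07

Chapter 1: Ten Principles of Economics

Scarcity: the limited nature of society's resources.
Economics: the study of how society manages its scarce resources.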

postgresql ha

Posted on 2019-01-11 | Edited on 2019-02-09

postgresql basics

Add pgsql/bin to the PATH environment variable

pg_hba.conf configuration

See: https://www.cnblogs.com/flying-tiger/p/5983588.html

TYPE       DATABASE  USER  ADDRESS                 METHOD
local      database  user  auth-method  [auth-options]
host       database  user  address  auth-method  [auth-options]
hostssl    database  user  address  auth-method  [auth-options]
hostnossl  database  user  address  auth-method  [auth-options]
host       database  user  IP-address  IP-mask  auth-method  [auth-options]
hostssl    database  user  IP-address  IP-mask  auth-method  [auth-options]
hostnossl  database  user  IP-address  IP-mask  auth-method  [auth-options]

# Example
local   all             all                                     md5
host    all             all             172.22.30.190/32        md5

postgresql.conf configuration
