这里记录一些自己常用的python用法，以便自己快速开发。

pandas(Dataframe)

对于行/索引的操作

获取索引列表

indexList = df.index.tolist()

根据索引/行号获取数据

三种方式：

1.用ix，以切片的方式通过索引获取数据: df.ix[a:b] (获取df的索引a到索引b的数据，包括第b个索引对应的行)

2.用loc,以list的方式通过索引获取数据： df.loc[indexList]

3.通过索引获取数据有时有一定的局限性(如对drop过的数据进行操作，对merge后索引没重建的数据进行操作等等)，这时我们可以考虑以切片的方式通过行号获取数据： df[a:b] (获取df的a到b行的数据，不包括第b行)

注意该操作返回的是一个新的对象，而不是在原来对象上进行修改。

根据索引删除数据

使用drop: df.drop(indexList, axis=0)

注意该操作返回的是一个新的对象，而不是在原来对象上进行修改。

遍历索引/行

使用iterrows()进行遍历：

for index,row in df.iterrows():
    print(index)
    print(row)

重置索引

reindexed_df = df.reindex(index,columns)
其中，index为新的(行)索引，columns为新的列。

对于column的操作

获取列名

columnNames = df.columns.values.tolist()

根据条件筛选数据

单条件筛选： filteredDF = df[df['column_one'] > 5]
多条件筛选：filteredDF = df[(df['column_one']> 5) & (df['column_two']>5)]

替换操作

replace()函数

DataFrame.replace(
        to_replace=None,
        value=None,
        inplace=False,
        limit=None,
        regex=False, 
        method='pad',
        axis=None)

举个例子：

>>> df = 
        BrandName Specialty
        A          H
        B          I
        ABC        J
        D          K
        AB         L
 
df['BrandName'] = df['BrandName'].replace(['ABC', 'AB'], 'A')

>>> df = 
        BrandName Specialty
        A          H
        B          I
        A          J
        D          K
        A          L

numpy

拼接

列拼接

newData = np.column_stack((a,b))

行拼接

newData = np.row_stack((a,b))

shuffle一个数据集(ndaray类型)

法 I：

使用 np.random.shuffle()

法 II：

np.random.permutation() 对行索引进行一次全排列

参考链接

其他

文件与路径(os)

获取路径的文件名

import os
fileName = os.path.basename(path)

获取文件后缀名

使用 os.path.splitext(fileName)[0] 可获得文件名
使用 os.path.splitext(fileName)[1] 可获得文件后缀名

import os
file = "Hello.py"

# 获取前缀（文件名称）
assert os.path.splitext(file)[0] == "Hello"

# 获取后缀（文件类型）
assert os.path.splitext(file)[1] == ".py"
assert os.path.splitext(file)[1][1:] == "py"

遍历目录下的文件:

import os
files = os.listdir(path) #获取目录下的所有文件
for file in files:
    if not os.path.isdir(file):#判断是否是文件夹，不是文件夹才对文件进行处理
        ...

判断文件夹是否存在

import os
if not os.path.exists(path):
    os.mkdir(path)

python读取文件

使用readline()读取

readline() 逐行读取文本，结果是一个list

with open(file) as f:
    line = f.readline()
    while line:
        print line
        line = f.readline()

readline()的利端： 占用内存小，逐行读取
readline()的弊端： 由于是逐行读取，速度比较慢

使用readlines()读取

readlines() 一次性读取文本的所有内容，结果是一个list

with open(file) as f:
    lines = f.readlines():
for line in lines:
    print line

readlines()的利端： 一次性读取文本内容，速度比较快
readlines()的弊端： 随着文本的增大，占用内存会越来越多

实例：将UCI数据new_thyroid.txt处理成特征X与标签Y

原始数据：

1,107,10.1,2.2,0.9,2.7
1,113,9.9,3.1,2.0,5.9
1,127,12.9,2.4,1.4,0.6
1,109,5.3,1.6,1.4,1.5
1,105,7.3,1.5,1.5,-0.1
...
3,118,6.5,1.3,1.7,11.5
3,139,4.2,0.7,4.3,6.3
3,103,5.1,1.4,1.2,5.0
3,97,4.7,1.1,2.1,12.6
3,102,5.3,1.4,1.3,6.7

处理代码：

import numpy as np

file = "./data/new_thyroid/new-thyroid.txt"
x = []
y = []

with open(file) as f:
    line = f.readline()
    while line:
        temp = line.rstrip('\n').split(',')
        x = x + [temp[1:7]]
        y = y + [temp[0]]
        line = f.readline()

x = np.array(x)
y = np.array(y)

np.save("./data/new_thyroid/X.npy",x)
np.save("./data/new_thyroid/y.npy",y)