Published: 2021-07-01 04:22:00 · Category: Technical articles


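1. Using tensorflow_datasets

tensorflow_datasets is a very useful library that bundles a large number of ready-to-use datasets. With the library imported as tfds, running:

tfds.list_builders()

lists every dataset it provides.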

1.1 Loading a dataset

import tensorflow_datasets as tfds

# Load cats_vs_dogs and cut its 'train' split into 80% / 10% / 10%.
(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    shuffle_files=False,
    batch_size=None,
    with_info=True,
    as_supervised=True,
)

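Parameters:

Inputs:

name: the name of the dataset; the available names are returned by tfds.list_builders().

split: which parts of the dataset to load and how to slice them. If omitted, tfds.load returns every split the dataset ships with; cats_vs_dogs only provides a train split, so you would get all samples as the training set.

shuffle_files: whether to shuffle the input files.

batch_size: the batch size to apply. If None, the dataset is not batched and yields one sample at a time.

with_info: whether to also return the dataset's metadata.

as_supervised: if True, each element is a 2-tuple (input, label) instead of a FeaturesDict.

Outputs:

(raw_train, raw_validation, raw_test): the datasets produced by the split.

metadata: information about the dataset.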
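To make the three percent-based slices concrete, here is a small pure-Python sketch of how 'train[:80%]', 'train[80%:90%]' and 'train[90%:]' partition the samples. It is an illustration only: the helper percent_slices and the dataset size 1000 are hypothetical, and tensorflow_datasets applies its own rounding to percent boundaries, so real split sizes may differ by a sample.

```python
def percent_slices(num_examples, boundaries=(80, 90)):
    """Return (start, stop) index ranges for three percent-based splits."""
    # Integer arithmetic keeps the boundaries exact for this illustration.
    cut_a = num_examples * boundaries[0] // 100
    cut_b = num_examples * boundaries[1] // 100
    return [(0, cut_a), (cut_a, cut_b), (cut_b, num_examples)]


# 1000 is an illustrative size, not the real size of cats_vs_dogs.
ranges = percent_slices(1000)
sizes = [stop - start for start, stop in ranges]
print(ranges)
print(sizes)
```

The ranges are disjoint and together cover every sample, which is what makes them usable as training, validation, and test sets.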

1.2 Inspecting a few samples

for image, label in raw_train.take(2):
    print(image.shape)
    print(label)

"""
Output:
(262, 350, 3)
tf.Tensor(0, shape=(), dtype=int64)
(428, 500, 3)
tf.Tensor(1, shape=(), dtype=int64)
"""

To get the class name that a label stands for:

get_label_name = metadata.features['label'].int2str

for image, label in raw_train.take(2):
    print(image.shape)
    print(label)
    print(get_label_name(label))

'''
Output:
(262, 350, 3)
tf.Tensor(0, shape=(), dtype=int64)
cat
(428, 500, 3)
tf.Tensor(1, shape=(), dtype=int64)
dog
'''

1.3 Normalizing the samples

import tensorflow as tf

IMG_SIZE = 160  # All images will be resized to 160x160

def format_example(image, label):
    # Cast to float, rescale pixels from [0, 255] to [-1, 1], then resize.
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label

train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)
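The rescaling step (image / 127.5) - 1 maps 8-bit pixel values from [0, 255] to [-1, 1]. The following plain-Python check shows the arithmetic on three representative values; scale_pixel is a hypothetical helper used only for this illustration.

```python
def scale_pixel(value):
    """Apply the same arithmetic as format_example to a single pixel value."""
    return value / 127.5 - 1


# Darkest pixel, mid-gray, and brightest pixel.
for raw in (0, 127.5, 255):
    print(raw, '->', scale_pixel(raw))
```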

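Of course, the same operations could also be written as a plain Python loop:

for image, label in raw_train:
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))

But this is very slow: the loop walks through the entire dataset eagerly, one image at a time (and, as written, does not store the results), whereas map applies the function lazily as part of the input pipeline.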

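1.4 Shuffling and batching the samples

If the dataset was not shuffled or batched when it was loaded, this can be done afterwards.

BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = 1000

# Only the training set is shuffled.
train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
validation_batches = validation.batch(BATCH_SIZE)
test_batches = test.batch(BATCH_SIZE)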
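The buffer size passed to shuffle controls how thoroughly the data is mixed. The sketch below is a pure-Python imitation of a buffered shuffle, written to illustrate the idea rather than to reproduce TensorFlow's implementation: the buffer is filled first, then a random element is emitted and replaced by the next incoming element. The function buffered_shuffle and its seed parameter are hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


result = list(buffered_shuffle(range(20), buffer_size=4, seed=0))
print(result)
```

With a buffer of 4, the element emitted at position i can only come from the first i + 4 inputs, so the shuffle is local. A buffer at least as large as the dataset gives a full shuffle, which is why section 2.5 uses the number of samples as the buffer size.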

1.5 Inspecting the final training samples

for image_batch, label_batch in train_batches.take(1):
    print(image_batch.shape)
    print(label_batch.shape)

'''
Output:
(32, 160, 160, 3)
(32,)
'''

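2. Using an existing CSV file as a dataset

2.2 Standardizing the data

# dataset_ holds the data read from the CSV file; later steps index it as a 2-D NumPy array.
# Standardize every column to zero mean and unit standard deviation.
data_mean = dataset_.mean(axis=0)
data_std = dataset_.std(axis=0)
dataset_ = (dataset_ - data_mean) / data_std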
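As a worked example of the column-wise standardization above, here is a dependency-free sketch using the statistics module. The helper standardize_columns and the sample rows are made up for illustration. It uses the population standard deviation (pstdev), which matches NumPy's default for std.

```python
import statistics


def standardize_columns(rows):
    """Return rows with every column shifted to mean 0 and scaled to std 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation, like NumPy's default ddof=0.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical data: three samples, two columns.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print(row)
```

Two things are worth keeping in mind. A constant column has a standard deviation of zero and would cause a division by zero. And since the label is later taken from the last column of the same array, the code in this section appears to standardize the label column as well, which may or may not be intended.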

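2.3 Splitting into training and test sets

Because this dataset does not come pre-split into training and test sets, we split it here with sklearn.

from sklearn.model_selection import train_test_split

# Hold out 10% of the samples as the test set.
train, test = train_test_split(dataset_, test_size=0.1)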

2.4 Separating features and labels

import numpy as np

# The last column is the label; all other columns are features.
train_x = train[:, :-1].astype(np.float32)
train_y = train[:, -1].astype(np.float32)
test_x = test[:, :-1].astype(np.float32)
test_y = test[:, -1].astype(np.float32)

2.5 Slicing into a Dataset

import tensorflow as tf

dataset_train = tf.data.Dataset.from_tensor_slices((train_x, train_y)).shuffle(train_y.shape[0]).batch(32)
dataset_test = tf.data.Dataset.from_tensor_slices((test_x, test_y)).shuffle(test_y.shape[0]).batch(32)

Feed these into a model and training can begin.

3. Using tf.keras.datasets

3.1 Loading the dataset

import tensorflow as tf

(x, y), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

3.2 Normalizing the features

Because the features here are images, dividing by 255 is enough.

def preprocess(x, y):
    # Scale pixels to [0, 1] and cast labels to int32.
    x = tf.cast(x, dtype=tf.float32) / 255.0
    y = tf.cast(y, dtype=tf.int32)
    return x, y

3.3 Slicing into a Dataset

batchsz = 128

db = tf.data.Dataset.from_tensor_slices((x, y))
db = db.map(preprocess).shuffle(10000).batch(batchsz)

db_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))
db_test = db_test.map(preprocess).batch(batchsz)
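It can help to know how many batches a pipeline like this produces. Fashion-MNIST has 60,000 training and 10,000 test images, and batch keeps the final, smaller batch by default. The following sketch works out the batch sizes in plain Python; batch_sizes is a hypothetical helper.

```python
def batch_sizes(num_samples, batch_size):
    """List the size of every batch when the last partial batch is kept."""
    full, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


train_batches = batch_sizes(60000, 128)
test_batches = batch_sizes(10000, 128)
print(len(train_batches), train_batches[-1])
print(len(test_batches), test_batches[-1])
```

So one epoch over db consists of 469 steps, the last of which holds 96 images instead of 128.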

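Feed these into a model and training can begin.

4. Dataset objects

4.1 Converting a DataFrame to a Dataset

# 'target' is the label column: pop removes it from the dataframe and returns it as labels.
labels = dataframe.pop('target')
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))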

4.2 Converting an array to a Dataset

# Build a data pipeline from NumPy arrays
import tensorflow as tf
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

ds1 = tf.data.Dataset.from_tensor_slices((iris["data"], iris["target"]))
for features, label in ds1.take(5):
    print(features, label)

'''
Output:
tf.Tensor([5.1 3.5 1.4 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([4.9 3.  1.4 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([4.7 3.2 1.3 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([4.6 3.1 1.5 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor([5.  3.6 1.4 0.2], shape=(4,), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
'''

4.3 Loading data from CSV files into a Dataset

ds4 = tf.data.experimental.make_csv_dataset(
    file_pattern=["../A.csv", "../B.csv"],
    batch_size=3,
    label_name="Survived",
    na_value="",
    num_epochs=1,
    ignore_errors=True,
)

for data, label in ds4.take(2):
    print(data, label)

4.4 Creating the Dataset

batch_size = 5  # A small batch size, used for demonstration
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Each of these datasets yields its features as a dictionary keyed by column name.
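The helper df_to_dataset is called above but not defined in this article. The following is one possible sketch of it, built on the conversion from section 4.1 and assuming the label column is named 'target'. The signature (shuffle, batch_size) is inferred from the calls above, and the tiny demo DataFrame is made up for illustration.

```python
import pandas as pd
import tensorflow as tf


def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    """Turn a DataFrame with a 'target' column into a batched tf.data.Dataset."""
    dataframe = dataframe.copy()  # leave the caller's DataFrame untouched
    labels = dataframe.pop('target').to_numpy()
    # One array per column, so every element is a dict of features.
    features = {name: column.to_numpy() for name, column in dataframe.items()}
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(labels))
    return ds.batch(batch_size)


# Hypothetical data, only to show the structure of one batch.
demo_df = pd.DataFrame({'age': [61, 59, 58], 'sex': [1, 0, 1], 'target': [1, 0, 1]})
demo_ds = df_to_dataset(demo_df, shuffle=False, batch_size=2)
first_features, first_labels = next(iter(demo_ds))
print(sorted(first_features.keys()))
print(first_labels.numpy())
```

Copying the DataFrame first matters because pop modifies it in place, and the same DataFrame may be needed again later.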

The contents of a dataset can be inspected as follows:

for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of ages:', feature_batch['age'])
    print('A batch of targets:', label_batch)

'''
Output:
Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
A batch of ages: tf.Tensor([61 59 58 42 40], shape=(5,), dtype=int32)
A batch of targets: tf.Tensor([1 1 0 1 0], shape=(5,), dtype=int32)
'''

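5. Images

We use the horse2zebra dataset as an example. It contains four folders: the horse training set, the zebra training set, the horse test set, and the zebra test set. Each training set holds a little over 1,000 color images of shape (256, 256, 3), mixed with a few grayscale images that are removed in code later on.

5.1 Loading the files

import tensorflow as tf

# PATH ends with a separator so that PATH + 'trainA/*.jpg' forms a valid pattern.
PATH = 'C:\\Users\\kzb\\'

train_horses = tf.data.Dataset.list_files(PATH + 'trainA/*.jpg')
train_zebras = tf.data.Dataset.list_files(PATH + 'trainB/*.jpg')
test_horses = tf.data.Dataset.list_files(PATH + 'testA/*.jpg')
test_zebras = tf.data.Dataset.list_files(PATH + 'testB/*.jpg')

At this point each dataset only contains the file paths, as strings.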

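5.2 Converting the images to the required type

def load(image_file):
    # Read the file, decode the JPEG, and convert the pixels to float32.
    image = tf.io.read_file(image_file)
    image = tf.image.decode_jpeg(image)
    image = tf.cast(image, tf.float32)
    return image

Display one image to check the result:

import matplotlib.pyplot as plt

img = load(PATH + 'trainB/n02391049_2.jpg')
# Scale to [0, 1] so that matplotlib can display the float image.
plt.figure()
plt.imshow(img / 255.0)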

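5.3 Deleting the grayscale images from the dataset

import os

# Grayscale JPEGs decode to a single channel, so their shape is not (256, 256, 3).
# Note that this permanently deletes the offending files from disk.
for dirname, _, filenames in os.walk(PATH + 'trainB'):
    for filename in filenames:
        img = load(os.path.join(dirname, filename))
        if img.shape != (256, 256, 3):
            print(filename)
            print(img.shape)
            os.remove(os.path.join(dirname, filename))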

5.4 Adding batch and shuffle

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Decode in parallel, cache the decoded images, shuffle, and use a batch size of 1.
train_horses = train_horses.map(
    load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)

train_zebras = train_zebras.map(
    load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)

test_horses = test_horses.map(
    load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)

test_zebras = test_zebras.map(
    load, num_parallel_calls=AUTOTUNE).cache().shuffle(1000).batch(1)

Feed these into a model and training can begin.

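6. Downloading a dataset from its official site with wget.download

We use the hot dog dataset as an example.

6.1 Downloading the dataset

import os
import wget

# Destination path for the downloaded archive.
data = os.getcwd() + '/data'
base_url = 'https://apache-mxnet.s3-accelerate.amazonaws.com/'
wget.download(
    base_url + 'gluon/dataset/hotdog.zip',
    data)

Here, os.getcwd() returns the current working directory, that is, the folder the script was launched from, which is not necessarily the folder containing the .py file. wget.download(url, out) downloads the archive at url to out: if out is an existing directory, the file is saved inside it; otherwise out is used as the file name.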
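Since the download path depends on os.getcwd(), it is worth seeing that this value follows the process's working directory rather than the location of the script. The sketch below demonstrates this with a temporary directory; all variable names are illustrative.

```python
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as scratch_dir:
    os.chdir(scratch_dir)
    try:
        # getcwd now reports the temporary directory, not the script's folder.
        cwd_inside = os.getcwd()
        download_target = os.path.join(os.getcwd(), 'data')
    finally:
        # Always return to the starting directory before the temp dir is removed.
        os.chdir(original_dir)

print(cwd_inside)
print(download_target)
```

If the archive must always land next to the script, a path derived from the script's own location is a more predictable choice than the working directory.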

6.2 Extracting the archive

import zipfile

# data is the path the archive was downloaded to in 6.1.
with zipfile.ZipFile(data, 'r') as z:
    z.extractall(os.getcwd())
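The extraction step can be tried without downloading anything. This self-contained sketch builds a tiny archive in a temporary directory and extracts it again. The archive deliberately has no .zip extension, like the file named data above, and the member name is a placeholder.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # zipfile identifies archives by content, so the file name needs no .zip suffix.
    archive_path = os.path.join(workdir, 'data')
    with zipfile.ZipFile(archive_path, 'w') as z:
        z.writestr('hotdog/train/example.txt', 'placeholder')

    with zipfile.ZipFile(archive_path, 'r') as z:
        members = z.namelist()
        z.extractall(workdir)  # recreates the folder structure under workdir

    extracted_ok = os.path.isfile(
        os.path.join(workdir, 'hotdog', 'train', 'example.txt'))

print(members)
print(extracted_ok)
```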

6.3 Reading the image files

We use a tf.keras.preprocessing.image.ImageDataGenerator instance to read all image files in the training and test sets (the code below shares one instance between the two). All images are resized to a height and width of 224 pixels, and the values of the three RGB (red, green, blue) channels are rescaled from [0, 255] to [0, 1].

import pathlib
import numpy as np
import tensorflow as tf

train_dir = 'hotdog/train'
test_dir = 'hotdog/test'

# Count the images: one subfolder per class, JPEG files inside.
train_dir = pathlib.Path(train_dir)
train_count = len(list(train_dir.glob('*/*.jpg')))
test_dir = pathlib.Path(test_dir)
test_count = len(list(test_dir.glob('*/*.jpg')))

# Class names are the subfolder names, ignoring the license file and hidden entries.
CLASS_NAMES = np.array([item.name for item in train_dir.glob('*') if item.name != 'LICENSE.txt' and item.name[0] != '.'])
CLASS_NAMES

image_generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)  # rescale to [0, 1]

BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224

train_data_gen = image_generator.flow_from_directory(directory=str(train_dir),
                                                     batch_size=BATCH_SIZE,
                                                     target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                     shuffle=True,
                                                     classes=list(CLASS_NAMES))

test_data_gen = image_generator.flow_from_directory(directory=str(test_dir),
                                                    batch_size=BATCH_SIZE,
                                                    target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                    shuffle=True,
                                                    classes=list(CLASS_NAMES))
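The class discovery above relies on a "one subfolder per class" layout. The sketch below recreates such a layout in a temporary directory and applies the same glob patterns, so the counting logic can be checked in isolation. The class names class_a and class_b and the file counts are placeholders, and the sketch filters with is_dir() rather than by file name, which is a small variation on the article's approach.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as root:
    train_dir = pathlib.Path(root) / 'train'
    # Build a placeholder layout: train/<class_name>/<n>.jpg
    for class_name, count in (('class_a', 3), ('class_b', 2)):
        class_dir = train_dir / class_name
        class_dir.mkdir(parents=True)
        for i in range(count):
            (class_dir / f'{i}.jpg').touch()
    (train_dir / 'LICENSE.txt').touch()

    # Subfolders are classes; plain files such as LICENSE.txt are skipped.
    class_names = sorted(item.name for item in train_dir.glob('*') if item.is_dir())
    images_per_class = {
        name: len(list((train_dir / name).glob('*.jpg'))) for name in class_names
    }
    total_images = len(list(train_dir.glob('*/*.jpg')))

print(class_names)
print(images_per_class)
print(total_images)
```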

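7. Text

Use tf.data.TextLineDataset to load text files. TextLineDataset is typically used to build a dataset from a text file in which each line is one sample. This suits most line-based text data (for example, poetry, novels, or error logs).

7.2 Locating the text files

7.2.1 Downloading the dataset

If you are using your own dataset, this step can be skipped.

Here we use three different English translations of the same work (Homer's Iliad) as an example. Each line of text is a sample's feature, and the translator is its label.

import os
import tensorflow as tf

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# Download each file; get_file returns the local path of the downloaded copy.
for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

7.2.2 Finding the directory

parent_dir = os.path.dirname(text_dir)
parent_dir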

7.3 Building the dataset

7.3.1 Creating a separate dataset for each class

def labeler(example, index):
    # Pair each line with the integer label of its file.
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

tf.data.TextLineDataset() takes a file path and returns a dataset in which every element corresponds to one line of the file.

For example:

a = tf.data.TextLineDataset(os.path.join(parent_dir, 'cowper.txt'))
for each in a.take(5):
    print(each)

'''
Output:
tf.Tensor(b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;", shape=(), dtype=string)
tf.Tensor(b'His wrath pernicious, who ten thousand woes', shape=(), dtype=string)
tf.Tensor(b"Caused to Achaia's host, sent many a soul", shape=(), dtype=string)
tf.Tensor(b'Illustrious into Ades premature,', shape=(), dtype=string)
tf.Tensor(b'And Heroes gave (so stood the will of Jove)', shape=(), dtype=string)
'''

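We then map the dataset through the labeler function to attach a label to every line:

b = a.map(lambda ex: labeler(ex, 0))
for each in b.take(5):
    print(each)
# Prints five (text, label) tuples: each text is a string tensor holding one line,
# and each label is an int64 scalar tensor with value 0.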

7.3.2 Merging the three datasets into one

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

7.3.3 Shuffling the dataset

BUFFER_SIZE = 50000

# reshuffle_each_iteration=False keeps the same order on every pass over the data.
all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

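We can print the first 5 elements of the dataset:

for ex in all_labeled_data.take(5):
    print(ex)
# Prints five (text, label) tuples drawn from the shuffled dataset.

Samples from all three classes are now contained in the one dataset.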

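7.4 Encoding the text as numbers

7.4.1 Building the vocabulary

import os
import tensorflow_datasets as tfds

# Newer tensorflow_datasets releases expose these text utilities as tfds.deprecated.text.
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
# df is assumed to be a pandas DataFrame with a "text" column (it is not defined in this section).
# For the tf.data pipeline above, iterate over all_labeled_data and tokenize text_tensor.numpy() instead.
for text in df["text"]:
    some_tokens = tokenizer.tokenize(text)
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

'''
Output:
10000
'''

Here, tokenizer = tfds.features.text.Tokenizer() instantiates a tokenizer, and tokenizer.tokenize splits a sentence into individual words. For example:

for text_tensor, _ in all_labeled_data.take(1):
    print(text_tensor)
    print(text_tensor.numpy())
    print(tokenizer.tokenize(text_tensor.numpy()))

tf.Tensor(b"Uprear'd, a wonder even in eyes divine.", shape=(), dtype=string)
b"Uprear'd, a wonder even in eyes divine."
['Uprear', 'd', 'a', 'wonder', 'even', 'in', 'eyes', 'divine']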
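The tokenizer output above shows that punctuation, including the apostrophe in "Uprear'd", acts as a separator. A rough standard-library approximation of this behavior is to keep only runs of word characters. The function simple_tokenize below is a hypothetical stand-in written for illustration, not the tensorflow_datasets implementation.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters, dropping punctuation."""
    if isinstance(text, bytes):
        # Dataset elements arrive as bytes after calling .numpy().
        text = text.decode('utf-8')
    return re.findall(r"\w+", text)


tokens = simple_tokenize(b"Uprear'd, a wonder even in eyes divine.")
print(tokens)
```

For this sample line the result matches the token list printed above.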

7.4.2 Building the encoder

encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

Let's try it on one sample:

example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

encoded_example = encoder.encode(example_text)
print(encoded_example)

'''
Output:
b'I mean to pound his flesh, and smash his bones.'
[1677, 9644, 1762, 15465, 12945, 9225, 13806, 5555, 12945, 4829]
'''

Next, we wrap the encoder in a function for later use:

def encode(text_tensor, label):
    # text_tensor must be an eager tensor so that .numpy() is available.
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label
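To illustrate what a token-to-ID encoder does, here is a toy version in plain Python. It is a sketch under assumptions, not the TokenTextEncoder implementation: the class ToyTokenEncoder is hypothetical, it assigns IDs in sorted order starting at 1 so that 0 stays free for padding, and it maps unknown words to one extra ID. The real encoder's ID assignment will differ.

```python
import re


class ToyTokenEncoder:
    """Map each known token to a positive integer ID; 0 is left for padding."""

    def __init__(self, vocabulary):
        self.tokens = sorted(vocabulary)
        # IDs start at 1 so that 0 can be used as a padding value later.
        self.token_to_id = {token: i + 1 for i, token in enumerate(self.tokens)}
        # Assumption for this sketch: all unknown tokens share one extra ID.
        self.oov_id = len(self.tokens) + 1

    def encode(self, text):
        words = re.findall(r"\w+", text)
        return [self.token_to_id.get(word, self.oov_id) for word in words]


toy_encoder = ToyTokenEncoder({'his', 'bones', 'flesh'})
encoded = toy_encoder.encode('his flesh, his bones, unseen')
print(encoded)
```

Repeated words receive the same ID, which is also visible in the real output above, where "his" is encoded as 12945 both times.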

7.4.3 Encoding all samples

def encode_map_fn(text, label):
    # py_function doesn't set the shape of the returned tensors.
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))

    # `tf.data.Datasets` work best if all components have a shape set,
    # so set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

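Here we used the function tf.py_function(func, inp, Tout, name=None):

Purpose: wraps a Python function so that ordinary Python code can interact with TensorFlow.

Parameters:
- func: the user-defined Python function.
- inp: the arguments for that function, written as a list such as [tensor1, tensor2, tensor3], where every element is a Tensor object.
- Tout: describes the return values of the Python function:
  - when Tout is a list, such as [tf.string, tf.int64, tf.float32], the function has three return values, that is, it returns three tensors whose element types match the list entries;
  - when Tout is a single value, such as tf.int64, the function returns one list of integers or one integer tensor;
  - when Tout is empty, the function has no return value.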

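Note: if encode is passed to dataset.map directly, without wrapping it in tf.py_function, the program fails with:

AttributeError: 'Tensor' object has no attribute 'numpy'

This is because dataset.map(function) passes Tensor objects, not EagerTensor objects, to function. In other words, the text_tensor and label arguments of encode(text_tensor, label) above are plain Tensor instances, which have no numpy() method. One way to think about it:

Calling dataset.map does not run function on every sample in advance. It only tells the dataset that each sample must go through function before it is used, so function is applied when the dataset is iterated. The arguments text_tensor and label are merely "containers" that get filled with data during iteration, and even once they are filled they are still Tensor objects rather than EagerTensor objects, which is why the error above occurs.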
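The difference can be reproduced with a much smaller function than encode. In the sketch below, to_upper is a hypothetical function that needs .numpy(), just like encode does. Passing it to map directly fails while the pipeline is being built, whereas wrapping it in tf.py_function works. The names to_upper, map_fn, and direct_error are illustrative only.

```python
import tensorflow as tf


def to_upper(text_tensor):
    # Needs an eager tensor: .numpy() returns the underlying bytes.
    return text_tensor.numpy().upper()


def map_fn(text):
    # py_function runs to_upper eagerly, so .numpy() is available inside it.
    result = tf.py_function(to_upper, inp=[text], Tout=tf.string)
    result.set_shape([])  # py_function does not set the output shape
    return result


ds = tf.data.Dataset.from_tensor_slices(["iliad", "odyssey"])

# Wrapped version: works.
upper = [t.numpy() for t in ds.map(map_fn)]
print(upper)

# Direct version: map receives symbolic tensors, which have no .numpy().
try:
    ds.map(to_upper)
    direct_error = None
except AttributeError as err:
    direct_error = err
print(type(direct_error).__name__)
```

The price of tf.py_function is that the wrapped code runs as ordinary Python, so it does not benefit from graph optimizations and the output shapes have to be set by hand.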

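At this point, the samples in the final dataset have been converted from text into numeric vectors:

for i in all_encoded_data.take(5):
    print(i)
# Prints five (encoded_text, label) tuples: each encoded_text is an int64 tensor of token IDs,
# and each label is an int64 scalar tensor.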

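7.5 Splitting the dataset into test and training sets

BATCH_SIZE = 64
TAKE_SIZE = 5000

# Everything after the first TAKE_SIZE samples is used for training.
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, ((None, ), ()))

# The first TAKE_SIZE samples are used for testing.
test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, ((None, ), ()))

tf.data.Dataset.take and tf.data.Dataset.skip are used to create a smaller test dataset and a larger training dataset. take(TAKE_SIZE) uses the first TAKE_SIZE samples as the test set; skip(TAKE_SIZE) skips those samples and uses the remaining (total number of samples - TAKE_SIZE) samples as the training set.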
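The take/skip split can be pictured with ordinary Python iterables. In this sketch, take and skip are hypothetical helpers built on itertools.islice, and twelve integers stand in for the encoded samples.

```python
from itertools import islice


def take(items, n):
    """Return the first n items."""
    return islice(items, n)


def skip(items, n):
    """Return everything after the first n items."""
    return islice(items, n, None)


samples = list(range(12))  # stand-in for the encoded samples
TAKE_SIZE = 5

test_part = list(take(samples, TAKE_SIZE))
train_part = list(skip(samples, TAKE_SIZE))
print(test_part)
print(train_part)
```

The two parts are disjoint only if the underlying order is the same on every pass. That is why the earlier shuffle uses reshuffle_each_iteration=False: with reshuffling enabled, take and skip would each see a different order and the test and training sets could overlap.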

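Before the dataset is passed to a model, it needs to be batched. Typically, every sample in a batch must have the same size and format, but the samples in this dataset are not all the same size (lines of text contain different numbers of words). We therefore use tf.data.Dataset.padded_batch (rather than batch) to pad the samples to the same size. With the shape set to (None, ), it finds the number of words in the longest sample of each batch and pads all other samples in that batch with zeros up to this length.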
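The padding rule described above, where each batch is padded to the length of its own longest sample, can be sketched in a few lines of plain Python. The function padded_batches and the sample sequences are made up for illustration.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]


# Hypothetical encoded lines of different lengths.
encoded_lines = [[5, 2], [7], [1, 4, 9, 3], [8, 6, 2]]
batches = list(padded_batches(encoded_lines, batch_size=2))
for batch in batches:
    print(batch)
```

Note that the two batches end up with different widths (2 and 4): padding is decided per batch, not for the dataset as a whole.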

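sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]
# Evaluates to a (text, label) pair: the first encoded sample of the batch, padded with zeros, and its label.

Because padding introduces a new token (zero), the vocabulary size grows by one.

vocab_size += 1

For training, train_data can then be fed directly into a word embedding layer. For details on training, see the article "Tensorflow2.0之文本分类确定文章译者" (TensorFlow 2.0: text classification to identify an article's translator).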

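8. Converting labels to numbers

[Figure: original data]

import pandas as pd

# pd.Categorical collects the distinct values (categories) of a list-like column.
# It returns a Categorical object, i.e. the original data with category information attached.
# The categories and the integer code of each row are available through .categories and .codes.
df.airline_sentiment = pd.Categorical(df.airline_sentiment).codes
df

[Figure: data after converting the labels to numbers]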
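The mapping that .codes produces can be imitated without pandas: for string labels, the categories are the sorted distinct values, and each code is the position of the value in that list. The sketch below uses a hypothetical helper, categorical_codes, and made-up sentiment values.

```python
def categorical_codes(values):
    """Return (categories, codes) where each code indexes into categories."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return categories, [index[value] for value in values]


# Hypothetical label column.
sentiments = ['neutral', 'positive', 'negative', 'neutral']
categories, codes = categorical_codes(sentiments)
print(categories)
print(codes)
```

One difference to keep in mind is that pandas assigns the code -1 to missing values, which this sketch does not handle.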

Reposted from: https://mtyjkh.blog.csdn.net/article/details/88350306
