Load CSV Data From URL in TensorFlow

In this tutorial, we are going to learn how to load CSV data from a URL in TensorFlow with Python so that we can use it for our task.

Before we go further, let me briefly explain what a CSV data file is…

CSV stands for comma-separated values. It stores data in a tabular format, uses the .csv extension, and can be opened with tools such as Microsoft Excel or Google Sheets.
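For example, a CSV file is just plain text with one record per line and commas between the fields. Here is a minimal sketch using Python’s built-in csv module (the file name people.csv and its contents are made up purely for illustration):

import csv

# Write a tiny example file: a header row followed by two records.
with open("people.csv", "w", newline="") as f:
  writer = csv.writer(f)
  writer.writerow(["name", "age", "city"])
  writer.writerow(["Alice", 30, "London"])
  writer.writerow(["Bob", 25, "Paris"])

# Read it back: each row comes out as a list of strings.
with open("people.csv") as f:
  for row in csv.reader(f):
    print(row)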

Why a CSV file?
CSV files can store large amounts of data and are used for many different kinds of business purposes. A machine learning project needs a large amount of data, and that data is often stored in and accessed through CSV files.

What we need for this module…

  • NumPy
  • TensorFlow
  • Python

Let’s start by importing the dependencies. The functools module provides tools for functions that act on or return other functions; we’ll need it later on.

from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf

Downloading the data from the URLs:

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
np.set_printoptions(precision=3, suppress=True)
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 3us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 1us/step

(The datasets are the Titanic passenger datasets hosted by TensorFlow.)
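If you want to see what the raw file looks like before TensorFlow parses it, a quick sketch like this (plain Python, assuming the download above succeeded) prints the header row and the first couple of records:

# Print the first three lines of the downloaded training file:
# the header row followed by two data records.
with open(train_file_path) as f:
  for _ in range(3):
    print(f.readline().rstrip())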

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

We’ll use the column named “survived” as the label: it takes the values 0 or 1, it is the value we want to predict in this module, and we’ll keep coming back to it.
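If you are curious how this label is distributed across the training file, a quick look with pandas works as an optional aside (pandas is also used near the end of this post):

import pandas as pd

# Count how many passengers in train.csv survived (1) or did not (0).
print(pd.read_csv(train_file_path)['survived'].value_counts())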

 

def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5, 
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
  return dataset

Now we’ll read our data files and create our datasets.

The ‘get_dataset’ function takes a file path as input and returns a dataset. ‘**kwargs’ lets us pass any additional arguments through to make_csv_dataset in the future, and ‘batch_size’ is the number of records combined into a single batch.
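Because ‘**kwargs’ is forwarded straight to make_csv_dataset, any of its other options can be passed through get_dataset. For example (an illustrative call using the shuffle_seed option, not a step of this tutorial’s pipeline):

# Any extra keyword argument is forwarded to make_csv_dataset,
# e.g. fixing the shuffle seed so batches come out in a repeatable order.
seeded_dataset = get_dataset(train_file_path, shuffle_seed=42)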

 

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

The ‘show_batch’ function below prints one batch from a dataset, so we can take a look at what we stored in ‘raw_train_data’.

def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))
show_batch(raw_train_data)
sex                 : [b'male' b'male' b'male' b'male' b'female']
age                 : [51. 44. 37. 24. 39.]
n_siblings_spouses  : [0 1 2 0 1]
parch               : [0 0 0 0 1]
fare                : [ 7.054 26.     7.925  8.05  79.65 ]
class               : [b'Third' b'Second' b'Third' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'E']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'y' b'n' b'n' b'y' b'n']

Here, as we can see, TensorFlow picks the column names up from the first line of the CSV file. If your file does not contain a header row, pass the column names to make_csv_dataset as a list of strings through the column_names argument:

CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

show_batch(temp_dataset)
sex                 : [b'male' b'male' b'female' b'male' b'male']
age                 : [28. 28. 45. 80. 23.]
n_siblings_spouses  : [0 0 0 0 0]
parch               : [0 0 1 0 0]
fare                : [56.496  8.05  14.454 30.     7.854]
class               : [b'Third' b'Third' b'Third' b'First' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'A' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'y' b'y' b'n' b'y' b'y']

 

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                           select_columns=SELECT_COLUMNS,
                           column_defaults = DEFAULTS)

show_batch(temp_dataset)
age                 : [28. 28. 60. 46. 19.]
n_siblings_spouses  : [0. 0. 0. 0. 0.]
parch               : [0. 0. 0. 0. 0.]
fare                : [ 7.879 12.35  26.55  79.2   10.171]

The select_columns argument lets us load only the columns we choose, while column_defaults provides a default value for each selected column (which also determines its type).

Since every selected column is numeric, we can pack them all into a single tensor with tf.stack(). The pack() function below does this for one element (one batch) of the dataset:

def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label

This simple preprocessor is all we need: applying pack() to every element of the dataset with map() packs the numeric data into a single column.

packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())
[[28.     0.     0.     7.229]
 [28.     0.     0.     7.733]
 [22.     0.     1.    55.   ]
 [28.     0.     2.     7.75 ]
 [39.     0.     0.    13.   ]]

[0 1 1 0 0]
example_batch, labels_batch = next(iter(temp_dataset))  # take one batch of examples and labels

class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis=-1)
    features['numeric'] = numeric_features

    return features, labels
NUMERIC_FEATURES = ['age','n_siblings_spouses','parch', 'fare']

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))
show_batch(packed_train_data)
sex                 : [b'male' b'male' b'male' b'male' b'female']
class               : [b'Second' b'Third' b'Third' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'n' b'y' b'n' b'y' b'n']
numeric             : [[ 24.     2.     0.    73.5 ]
 [ 28.     0.     0.     7.55]
 [ 11.     5.     2.    46.9 ]
 [ 28.     0.     0.     8.05]
 [ 28.     1.     0.   133.65]]

Now we’ll normalize the numeric data. Normalization rescales each numeric column so that the resulting values are centered around 0, typically by subtracting the column’s mean and dividing by its standard deviation. First, let’s look at those statistics with pandas:

import pandas as pd
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc
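The per-column mean and standard deviation from describe() can then be used to scale the packed ‘numeric’ column. Below is a minimal sketch of that step, assuming the usual mean/std normalization; the normalize_numeric_data helper is introduced here for illustration, and this is also where the functools import from the beginning becomes useful:

MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])

def normalize_numeric_data(data, mean, std):
  # Center each numeric feature at 0 and scale by its standard deviation.
  return (data - mean) / std

# Bind the dataset statistics into the function so it can be used later
# wherever a one-argument normalizer function is expected.
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

The resulting normalizer takes a batch of packed numeric features and returns the scaled values.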

 

This is how we can load a CSV file from a URL in TensorFlow with Python.
