Google Cloud Storage


About

This page is a child of: Google


Google Cloud Storage (GCS) is a cloud storage service provided by Google Cloud Platform (GCP) that offers object storage for live or archived data. It is highly scalable and secure, making it suitable for a wide range of applications including storing large unstructured data sets, archival and disaster recovery, and serving website content.

Features

  • High Durability: GCS offers high data durability through redundancy and replication.
  • Scalability: It seamlessly scales to handle large amounts of data.
  • Security: Provides robust security features including fine-grained access controls and encryption at rest and in transit.

Use Cases

  • Data Storage: For storing files, backups, and large datasets.
  • Data Archiving: Long-term archival of data, including integration with Google's Coldline storage for cost-effectiveness.
  • Static Website Hosting: Hosting static websites directly from storage buckets (see the sketch below).
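
For the static website use case, the sketch below shows one way to enable website serving on an existing bucket with the `google-cloud-storage` Python library (introduced in the next section); the key path, bucket name, and page names are hypothetical, and the bucket's objects must also be publicly readable for visitors to see them.

from google.cloud import storage

# Hypothetical key path and bucket name -- replace with your own.
client = storage.Client.from_service_account_json('account-key/storage-key.json')
bucket = client.get_bucket('my-website-bucket')

# Serve index.html as the main page and 404.html for missing objects.
bucket.configure_website(main_page_suffix='index.html', not_found_page='404.html')
bucket.patch()  # Push the website configuration to GCS.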

Interacting with GCS using Python

You can interact with GCS programmatically in many languages, including Python.

  • Google Cloud Client Library for Python: Use the `google-cloud-storage` library to interact with GCS.
  • Authentication: Typically done via service accounts. Securely manage and use credentials for GCP.
  • Operations: Common operations include creating and managing storage buckets, uploading and downloading files, and setting file metadata (a short sketch follows the setup steps below).

Installation and Setup

  • Install the library using pip: `pip install google-cloud-storage`.
  • Set up authentication by creating a service account in GCP and downloading the JSON key file.
  • Use the client library in Python scripts to interact with GCS.
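
The snippet below is a minimal sketch of these operations using the `google-cloud-storage` library, assuming it is installed and a service account JSON key has been downloaded; the key path, bucket name, and object name are placeholders to replace with your own.

from google.cloud import storage

# Placeholders -- replace with your own key path, bucket name, and object name.
SERVICE_KEY = 'account-key/storage-key.json'
BUCKET_NAME = 'my-example-bucket'

# Authenticate with the service account key and connect to GCS.
client = storage.Client.from_service_account_json(SERVICE_KEY)

# Create a bucket (bucket names are globally unique, so this fails if the name is taken).
bucket = client.create_bucket(BUCKET_NAME, location='US')

# List all buckets in the project.
for b in client.list_buckets():
    print(b.name)

# Set custom metadata on an existing object in the bucket.
blob = bucket.blob('some-existing-object.txt')
blob.metadata = {'source': 'example-upload'}
blob.patch()  # Pushes the metadata change to GCS.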


Code Examples: Downloading a File from GCS (Python)

See: Download files from Google Storage with Python script


Code file gcs_loader.py
Simple download.
# Simple script to download an object (file) from Google Cloud Storage (GCS).
#
# NOTE: To get this working you'll want to set up your own GCS account
# with a file to download and a service key to access it. Once set up, it
# will download to a "tmp/" folder on your local/running machine.
#
# INSTRUCTIONS:
# (1) To create a file to download:
#   * Sign into Google Cloud Console (https://console.cloud.google.com/)
#     (WARNING: May need to use a credit card to sign up for the free trial)
#   * Create a "New Project".
#   * On the left-hand side menu go to "Cloud Storage" > "Buckets".
#   * Click "+ Create" (bucket) and give it a name (eg: "noske-test-datasets").
#   * Click "Upload Files" and drag in any file to upload it (eg: "train-00000-of-00001.parquet").
#
# (2) To create a service account key:
#   * Sign into Google Cloud Console (https://console.cloud.google.com/)
#   * Choose the GCP project (drop-down list at the top of the console).
#   * On the left-hand side menu, go to "IAM & Admin" > "Service Accounts".
#   * Click: "Create Service Account", give it a name, description, and click "Create".
#   * Assign the necessary roles to the service account (e.g., "Storage Object Viewer"
#     or "Storage Admin" for accessing GCS objects). Click "Continue".
#   * In the "Keys" tab click "Add Key...", choose JSON, and it will download a .json key file.
#   * Move the .json into a code subdir (eg: "account-key/noske-test-gcp-storage-account-key.json").
#
# (3) Update the global constants below (BUCKET_NAME, OBJECT_NAME, SERVICE_KEY_ACCOUNT).
#
# (4) Setup project as:
#     account-key/    (copy your access key here)
#     tmp/            (starts empty)
#     gcs_loader.py   (this file)

BUCKET_NAME = 'noske-test-datasets'              # (1) Set to the name of your GCS bucket.
OBJECT_NAME = 'train-00000-of-00001.parquet'     # (1) Set to the name of your file in the bucket.
LOCAL_FILE_DIR = 'tmp'
LOCAL_FILE_PATH = LOCAL_FILE_DIR + '/' + OBJECT_NAME
SERVICE_KEY_ACCOUNT = 'account-key/noske-test-gcp-storage-account-key.json'  # (2) Set to your downloaded service key.

from google.cloud import storage

def download_gcs_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(f'Blob {source_blob_name} downloaded to {destination_file_name}.')

print(f'Attempting to download {BUCKET_NAME}/{OBJECT_NAME} from GCS to {LOCAL_FILE_PATH}... \n')
download_gcs_blob(BUCKET_NAME, OBJECT_NAME, LOCAL_FILE_PATH)


Code file gcs_path_loader.py
Downloading a [[HuggingFace]] Dataset from GCS.
# Simple script to download a dataset from a GCS path.
#
# For instance, a Google Cloud Storage (GCS) path looks like:
#   'gs://noske-test-datasets/subfolder/train.csv'
#
# Instructions: See `gcs_loader.py`.

GCS_PATH = 'gs://noske-test-datasets/subfolder/train.csv'   # Set to "gs://bucket-name/path/to/file".
FILETYPE = 'csv'                                            # Set to "csv" or "parquet" as appropriate.
TEMP_FILE_PATH = 'tmp/downloaded_file'
SERVICE_KEY_ACCOUNT = 'account-key/noske-test-gcp-storage-account-key.json'  # Set to your downloaded service key.

from datasets import Dataset, load_dataset
from google.cloud import storage

def load_dataset_from_gcs(path: str, filetype: str) -> Dataset:
    """Downloads a file from GCS... the path must be in "GCS path" format.

    NOTE: The `path` must look like: "gs://noske-test-datasets/subfolder/train.csv"
    and has some error checking for the path.
    """
    # Check path is as expected.
    assert filetype in ["csv", "parquet"], f"Filetype must be csv or parquet. Received: {filetype}"
    assert path.startswith("gs://"), f"Path must start with 'gs://'. Path must be in 'GCS path' format (eg: 'gs://bucket-name/path/to/file'). Received: {path}"
    end_bucket_idx = path.find('/', 5)
    assert end_bucket_idx > 0, f"Path must include a bucket name. Path must be in 'GCS path' format (eg: 'gs://bucket-name/path/to/file'). Received: {path}"
    gcs_bucket_name = path[5:end_bucket_idx]
    gcs_file_path = path[end_bucket_idx+1:]

    print(f"gcs_bucket_name = '{gcs_bucket_name}', gcs_file_path = '{gcs_file_path}'")

    # Connect to client.
    storage_client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
    
    # Download to a temp file.
    bucket = storage_client.bucket(gcs_bucket_name)
    blob = bucket.blob(gcs_file_path)
    blob.download_to_filename(TEMP_FILE_PATH)
    print(f"GCS file '{gcs_file_path}' in bucket '{gcs_bucket_name}' copied SUCCESFULLY to '{TEMP_FILE_PATH}'.")

    # Load dataset.
    return load_dataset(filetype, data_files=TEMP_FILE_PATH)

# Download dataset:
print(f'Attempting to download {FILETYPE} dataset from {GCS_PATH}... \n')
dataset = load_dataset_from_gcs(GCS_PATH, FILETYPE)
print(dataset)               # Print format of dataset   (eg: "DatasetDict({ train: Dataset({ features: ['id', 'review', ...], num_rows: 7 })})").
print(dataset['train'][-1])  # Print last row of dataset (eg: "{'id': 123, 'review': 'TEST 1', ...}")


Code Examples: Saving a File to GCS (Python)

Code file gcs_saver.py
Simple save to GCS program.
# Demo script to save an object (file) to a GCS path.
#
# NOTE: In this case we save a .parquet file, but it could be any file type.
# NOTE: As long as the bucket name is right, it will create subfolders
# as needed.
#
# For instance the resulting Google Cloud Storage (GCS) path is:
#   'gs://noske-test-datasets/savefolder/new-train.parquet'
#
# Instructions: See `gcs_loader.py`.

BUCKET_NAME = 'noske-test-datasets'                # Set to the name of your GCS bucket.
SAVE_OBJECT_PATH = 'savefolder/new-train.parquet'  # Set to the desired file path in the bucket.
TEMP_LOCAL_FILEPATH = 'tmp/newdata'
SERVICE_KEY_ACCOUNT = 'account-key/noske-test-gcp-storage-account-key.json'  # Set to your downloaded service key.

from google.cloud import storage
from datasets import Dataset

# Function to create a sample dataset and save it as a parquet file.
def save_sample_dataset_to_local(local_path):
    data = {"column1": [1, 2, 3], "column2": ["one", "two", "three"]}
    dataset = Dataset.from_dict(data)
    dataset.to_parquet(local_path)
    print(f"Temp local file created at '{local_path}'.")

# Function to upload a file to GCS.
def upload_to_gcs(local_path, bucket_name, save_path):
    """Uploads a file to the bucket."""
    # Authenticate using the service account key set in SERVICE_KEY_ACCOUNT above.
    storage_client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(save_path)

    if blob.exists():
        print(f"WARNING: The file {save_path} already exists in the bucket, so this is where you might have logic to allow or deny overwrite as it will overwrite by default.")

    blob.upload_from_filename(local_path)
    print(f"File '{local_path}' uploaded to '{save_path}'.")

# Create a sample dataset and save as a parquet file
save_sample_dataset_to_local(TEMP_LOCAL_FILEPATH)

# Upload the file to GCS
upload_to_gcs(TEMP_LOCAL_FILEPATH, BUCKET_NAME, SAVE_OBJECT_PATH)
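
# OPTIONAL sketch (not part of the original script): confirm the upload by
# fetching the object's metadata from GCS and printing its size.
client = storage.Client.from_service_account_json(SERVICE_KEY_ACCOUNT)
uploaded_blob = client.bucket(BUCKET_NAME).get_blob(SAVE_OBJECT_PATH)
print(f"Upload verified: {uploaded_blob.size} bytes." if uploaded_blob else "Upload NOT found!")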


External Links