2.2 Unstructured Data for Machine Learning

2.2.1 Image Data for Machine Learning

Unstructured data - images, text, audio - requires specialized preprocessing before it can be fed into ML models.


Why Not Tabular?

Traditional ML algorithms expect tabular input, but treating images as flat tabular data loses spatial information and creates extremely large feature vectors (a 1000x1000 grayscale image becomes a vector of one million values; three million with RGB channels). This is computationally expensive and degrades model performance.
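To see the scale problem concretely, here is a minimal sketch (synthetic data, NumPy assumed) of what tabular flattening does to a single image:

import numpy as np

image = np.random.rand(1000, 1000)  # one synthetic grayscale image
flat = image.reshape(-1)            # row-by-row flattening for tabular input
print(flat.shape)                   # (1000000,) - one feature per pixel, spatial layout lost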


Convolutional Neural Networks (CNNs)

The preferred approach is a CNN, where each layer identifies progressively more complex features - early layers detect edges and textures, while deeper layers combine them into higher-level structures such as shapes and objects. In practice, ML teams start with pre-trained CNN models and fine-tune them on their specific task and data.
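As an illustration, here is a minimal fine-tuning sketch in Keras; the MobileNetV2 backbone, the 150x150 input size, and the binary sigmoid head are illustrative choices, not a prescribed recipe:

import tensorflow as tf

# Pre-trained backbone (ImageNet weights) without its original classifier head
base = tf.keras.applications.MobileNetV2(
    input_shape=(150, 150, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze pre-trained layers; only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary head, e.g. cats vs. dogs
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])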


Image Augmentation

Preparing images for deep learning typically involves standardization steps plus augmentations that increase training data diversity:

Augmentation              Purpose
Resizing                  Standardize dimensions for batch processing
Scaling (normalization)   Scale pixel values to [0, 1] for faster convergence
Flipping / Rotating       Teach invariance to orientation
Cropping                  Focus on relevant regions
Brightness / Contrast     Teach robustness to lighting conditions

import tensorflow as tf
import tensorflow_datasets as tfds

# Load dataset
dataset = tfds.load("cats_vs_dogs", split="train", as_supervised=True)

def resize_normalize(image, label, image_size=150):
    """Resize to fixed dimensions and normalize pixel values to [0, 1]."""
    image = tf.image.resize(image, [image_size, image_size])
    image = image / 255.0
    return image, label

def augment(image, label):
    """Apply random augmentations for training data diversity."""
    image = tf.image.random_flip_left_right(image)
    # rot90 alone always rotates once; pick a random multiple of 90 degrees
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    image = tf.image.random_contrast(image, lower=0.2, upper=0.8)
    image = tf.image.random_brightness(image, max_delta=0.5)
    # Contrast/brightness shifts can leave [0, 1]; clip back into range
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label

# Preview the pipeline on a single example
image, label = next(iter(dataset))
image, label = resize_normalize(image, label)
image, label = augment(image, label)
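In practice these functions are mapped lazily over the whole tf.data pipeline rather than applied sample by sample; a minimal sketch (the batch size of 32 is arbitrary):

train_ds = (
    dataset
    .map(resize_normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)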

2.2.2 Text Preprocessing and Classification

NLP tasks span sentiment analysis, article classification, chatbots, spam detection, customer segmentation, and product recommendations. Despite advances in language models, text preprocessing remains important: raw text contains typos and irrelevant characters, training LLMs is expensive, and text features often need to be combined with categorical or numeric features.


Preprocessing Pipeline

Step                  Action                                                Example
1. Cleaning           Remove punctuation, extra spaces, special characters  "Hello!! World" → "Hello World"
2. Normalization      Lowercase, expand contractions, convert symbols       "I can't" → "i cannot"
3. Tokenization       Split into individual tokens (words, subwords)        "i cannot go" → ["i", "cannot", "go"]
4. Stop word removal  Filter common words with little meaning               ["i", "cannot", "go"] → ["cannot", "go"]
5. Lemmatization      Replace words with their base form (lemma)            "getting" → "get", "got" → "get"

import re
import unicodedata
import spacy

# Common contractions dictionary
CONTRACTION_MAP = {
    "can't": "cannot", "don't": "do not", "it's": "it is",
    "won't": "will not", "they're": "they are",
    # ... (full map omitted for brevity)
}

def preprocess_text(text, nlp, special_chars=None, remove_stopwords=False, lemmatize=False):
    """Full text preprocessing pipeline: clean, normalize, tokenize, optionally filter and lemmatize."""
    if special_chars is None:
        special_chars = ["~", "@", "#", "$", "%", "^", "&", "*"]

    # Remove special characters ([...] matches each listed character literally)
    pattern = "[" + "".join(re.escape(c) for c in special_chars) + "]"
    text = re.sub(pattern, "", text)

    # Normalize: lowercase, strip whitespace, remove accents
    text = text.lower().strip()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf8")

    # Expand contractions
    text = " ".join(
        CONTRACTION_MAP.get(word, word) for word in text.split()
    )

    # Tokenize, drop punctuation, optionally drop stop words and lemmatize
    doc = nlp(text)
    tokens = [
        token.lemma_ if lemmatize else token.text
        for token in doc
        if not token.is_punct and not (remove_stopwords and token.is_stop)
    ]

    return " ".join(tokens)

2.2.3 Text Vectorization and Embeddings

After preprocessing, text must be converted into numerical representations for ML algorithms.


Traditional Vectorization

Method         How it works                                       Limitations
Bag of Words   Vector of raw word counts per document             High-frequency words dominate; rare but meaningful words are underweighted
TF-IDF         TF (term frequency in document) × IDF (inverse     High-dimensional sparse vectors; ignores word proximity and context
               document frequency across corpus)

Both work well for smaller datasets where key words are significant to the task (e.g., sentiment analysis).
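For reference, scikit-learn's TfidfVectorizer (used below) computes, with its default smoothing, where $n$ is the number of documents and $\text{df}(t)$ the number of documents containing term $t$:

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \Big( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \Big)
\]

and then L2-normalizes each document vector.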

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Short example reviews (note: already stripped of stop words)
reviews = [
    "this wonderful price amount you get",
    "great product big amount",
    "I buy this my son his hair thin"
]

# Bag of Words
bow_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow_vectorizer.fit(reviews)
bow_vectors = bow_vectorizer.transform(reviews)
print(bow_vectorizer.get_feature_names_out())
print(bow_vectors.todense())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_vectorizer.fit(reviews)
tfidf_vectors = tfidf_vectorizer.transform(reviews)
print(tfidf_vectors.todense())

Word and Sentence Embeddings

A word embedding is a dense vector that captures a word's semantic meaning, learned from word co-occurrences in large corpora (e.g., word2vec, GloVe). However, classic word embeddings assign a single static vector per word, ignoring its context and position within a sentence.
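As a small illustration of pre-trained word vectors (this sketch assumes gensim and its downloader API, which fetches the model over the network; the GloVe model name is one of gensim's published options):

import gensim.downloader as api

# Small pre-trained GloVe vectors; downloaded on first use
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"].shape)                 # (50,) - one dense vector per word
print(glove.most_similar("king", topn=3))  # nearest neighbors in embedding space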

Sentence embeddings capture the semantic meaning of entire sentences in dense, relatively low-dimensional vectors. Pre-trained models are available from several sources:

Source         Models
Open source    SentenceTransformers (sbert.net)
Closed source  OpenAI, Anthropic, Google

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "this is a wonderful price for the amount you get",
    "great product big amount",
    "I bought this for my son",
]

# Encode sentences into 384-dimensional vectors
embeddings = embedder.encode(sentences)
print(embeddings.shape)  # (3, 384)

# Cosine similarity of the first sentence to the other two
sim = cosine_similarity(embeddings, embeddings)[0, 1:]
print(sim)  # approx. [0.51, 0.14]: sentence 0 is closer to 1 than to 2

Embeddings can be used as features for downstream ML algorithms, or directly for clustering and similarity search.
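For example, a minimal clustering sketch over the embeddings computed above (k=2 is an arbitrary choice for three sentences):

from sklearn.cluster import KMeans

# Group sentences by embedding proximity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)  # cluster assignment for each sentence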