2.2 Unstructured Data for Machine Learning

2.2.1 Image Data for Machine Learning

Unstructured data - images, text, audio - requires specialized preprocessing before it can be fed into ML models.


Why Not Tabular?

Traditional ML algorithms expect tabular input, but treating images as flat tabular data loses spatial information and creates extremely large feature vectors (a 1000x1000 grayscale image becomes a vector of one million values; three million with RGB channels). This is computationally expensive and degrades model performance.
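To see the scale problem concretely, here is a minimal sketch (synthetic data, NumPy assumed) of what tabular flattening does to a single image:

import numpy as np

image = np.random.rand(1000, 1000)  # one synthetic grayscale image
flat = image.reshape(-1)            # row-by-row flattening for tabular input
print(flat.shape)                   # (1000000,) - one feature per pixel, spatial layout lost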


Convolutional Neural Networks (CNNs)

The preferred approach is a CNN, where each layer identifies progressively more complex features - early layers detect edges and textures, while deeper layers combine them into higher-level structures such as shapes and objects. In practice, ML teams start with pre-trained CNN models and fine-tune them on their specific task and data.
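As an illustration, here is a minimal fine-tuning sketch in Keras; the MobileNetV2 backbone, the 150x150 input size, and the binary sigmoid head are illustrative choices, not a prescribed recipe:

import tensorflow as tf

# Pre-trained backbone (ImageNet weights) without its original classifier head
base = tf.keras.applications.MobileNetV2(
    input_shape=(150, 150, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze pre-trained layers; only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary head, e.g. cats vs. dogs
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])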


Image Augmentation

Preparing images for deep learning typically involves standardization steps plus augmentations that increase training data diversity:

Augmentation              Purpose
Resizing                  Standardize dimensions for batch processing
Scaling (normalization)   Scale pixel values to [0, 1] for faster convergence
Flipping / Rotating       Teach invariance to orientation
Cropping                  Focus on relevant regions
Brightness / Contrast     Teach robustness to lighting conditions

import tensorflow as tf
import tensorflow_datasets as tfds

# Load dataset
dataset = tfds.load("cats_vs_dogs", split="train", as_supervised=True)

def resize_normalize(image, label, image_size=150):
    """Resize to fixed dimensions and normalize pixel values to [0, 1]."""
    image = tf.image.resize(image, [image_size, image_size])
    image = image / 255.0
    return image, label

def augment(image, label):
    """Apply random augmentations for training data diversity."""
    image = tf.image.random_flip_left_right(image)
    # rot90 alone always rotates once; pick a random multiple of 90 degrees
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    image = tf.image.random_contrast(image, lower=0.2, upper=0.8)
    image = tf.image.random_brightness(image, max_delta=0.5)
    # Contrast/brightness shifts can leave [0, 1]; clip back into range
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label

# Preview the pipeline on a single example
image, label = next(iter(dataset))
image, label = resize_normalize(image, label)
image, label = augment(image, label)
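In practice these functions are mapped lazily over the whole tf.data pipeline rather than applied sample by sample; a minimal sketch (the batch size of 32 is arbitrary):

train_ds = (
    dataset
    .map(resize_normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)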

2.2.2 Text Preprocessing and Classification

NLP tasks span sentiment analysis, article classification, chatbots, spam detection, customer segmentation, and product recommendations. Despite advances in language models, text preprocessing remains important: raw text contains typos and irrelevant characters, training LLMs is expensive, and text features often need to be combined with categorical or numeric features.


Preprocessing Pipeline

Step                  Action                                                Example
1. Cleaning           Remove punctuation, extra spaces, special characters  "Hello!! World" → "Hello World"
2. Normalization      Lowercase, expand contractions, convert symbols       "I can't" → "i cannot"
3. Tokenization       Split into individual tokens (words, subwords)        "i cannot go" → ["i", "cannot", "go"]
4. Stop word removal  Filter common words with little meaning               ["i", "cannot", "go"] → ["cannot", "go"]
5. Lemmatization      Replace words with their base form (lemma)            "getting" → "get", "got" → "get"

import re
import unicodedata
import spacy

# Common contractions dictionary
CONTRACTION_MAP = {
    "can't": "cannot", "don't": "do not", "it's": "it is",
    "won't": "will not", "they're": "they are",
    # ... (full map omitted for brevity)
}

def preprocess_text(text, nlp, special_chars=None, remove_stopwords=False, lemmatize=False):
    """Full text preprocessing pipeline: clean, normalize, tokenize, optionally filter and lemmatize."""
    if special_chars is None:
        special_chars = ["~", "@", "#", "$", "%", "^", "&", "*"]

    # Remove special characters ([...] matches each listed character literally)
    pattern = "[" + "".join(re.escape(c) for c in special_chars) + "]"
    text = re.sub(pattern, "", text)

    # Normalize: lowercase, strip whitespace, remove accents
    text = text.lower().strip()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf8")

    # Expand contractions
    text = " ".join(
        CONTRACTION_MAP.get(word, word) for word in text.split()
    )

    # Tokenize, drop punctuation, optionally drop stop words and lemmatize
    doc = nlp(text)
    tokens = [
        token.lemma_ if lemmatize else token.text
        for token in doc
        if not token.is_punct and not (remove_stopwords and token.is_stop)
    ]

    return " ".join(tokens)

2.2.3 Text Vectorization and Embeddings

After preprocessing, text must be converted into numerical representations for ML algorithms.


Traditional Vectorization

Method         How it works                                       Limitations
Bag of Words   Vector of raw word counts per document             High-frequency words dominate; rare but meaningful words are underweighted
TF-IDF         TF (term frequency in document) × IDF (inverse     High-dimensional sparse vectors; ignores word proximity and context
               document frequency across corpus)

Both work well for smaller datasets where key words are significant to the task (e.g., sentiment analysis).
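For reference, scikit-learn's TfidfVectorizer (used below) computes, with its default smoothing, where $n$ is the number of documents and $\text{df}(t)$ the number of documents containing term $t$:

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \Big( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \Big)
\]

and then L2-normalizes each document vector.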

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Short example reviews (note: already stripped of stop words)
reviews = [
    "this wonderful price amount you get",
    "great product big amount",
    "I buy this my son his hair thin"
]

# Bag of Words
bow_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow_vectorizer.fit(reviews)
bow_vectors = bow_vectorizer.transform(reviews)
print(bow_vectorizer.get_feature_names_out())
print(bow_vectors.todense())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_vectorizer.fit(reviews)
tfidf_vectors = tfidf_vectorizer.transform(reviews)
print(tfidf_vectors.todense())

Word and Sentence Embeddings

A word embedding is a dense vector that captures a word's semantic meaning, learned from word co-occurrences in large corpora (e.g., word2vec, GloVe). However, classic word embeddings assign a single static vector per word, ignoring its context and position within a sentence.
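As a small illustration of pre-trained word vectors (this sketch assumes gensim and its downloader API, which fetches the model over the network; the GloVe model name is one of gensim's published options):

import gensim.downloader as api

# Small pre-trained GloVe vectors; downloaded on first use
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"].shape)                 # (50,) - one dense vector per word
print(glove.most_similar("king", topn=3))  # nearest neighbors in embedding space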

Sentence embeddings capture the semantic meaning of entire sentences in dense, relatively low-dimensional vectors. Pre-trained models are available from several sources:

Source         Models
Open source    SentenceTransformers (sbert.net)
Closed source  OpenAI, Anthropic, Google

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "this is a wonderful price for the amount you get",
    "great product big amount",
    "I bought this for my son",
]

# Encode sentences into 384-dimensional vectors
embeddings = embedder.encode(sentences)
print(embeddings.shape)  # (3, 384)

# Cosine similarity of the first sentence to the other two
sim = cosine_similarity(embeddings, embeddings)[0, 1:]
print(sim)  # approx. [0.51, 0.14]: sentence 0 is closer to 1 than to 2

Embeddings can be used as features for downstream ML algorithms, or directly for clustering and similarity search.
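For example, a minimal clustering sketch over the embeddings computed above (k=2 is an arbitrary choice for three sentences):

from sklearn.cluster import KMeans

# Group sentences by embedding proximity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)  # cluster assignment for each sentence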