
How to Webscrape Images from HTML: A Developer’s Guide (Without Getting Blocked 🚫)

Hey folks! 👋 Sharih Hassan back at it again—your code-cracking buddy from SharihHassan.com. Last time, we tackled how to prevent Angular from returning HTML when you wanted JSON. Today? We’re going rogue: how to webscrape images from HTML without ending up on a website’s naughty list. Buckle up, and let’s do this ethically (and hilariously).


Why Webscrape Images? (And Why It’s Like Walking a Tightrope 🎪)

Webscraping images is like sneaking cookies from the cookie jar 🍪—everyone does it, but you gotta be smooth. Maybe you’re building a meme archive, training an AI, or just really into cat pictures 🐈. But scrape carelessly, and you’ll get blocked faster than a pop-up ad in 2005.


Step 1: Inspect the HTML—Be a Detective 🕵️‍♀️

Right-click any webpage, hit “Inspect,” and you’ll see HTML tags laughing at you. Images usually live inside <img> tags with src attributes. Your mission? Extract those src URLs like a pro.

html

<!-- Example: A very serious cat image -->  
<img src="https://website.com/cat-in-a-suit.jpg" alt="Business Cat">  
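One gotcha before you move on: src values in raw HTML are not always full URLs like the one above. They can be relative paths, duplicates, or inline data: URIs. Here is a minimal sketch of a cleanup step, assuming you already have the raw attribute values in an array. The helper name normalizeImageUrls and the sample gallery URL are made up for illustration. (If you read img.src from the live DOM, as Puppeteer does later, the browser has already resolved it to an absolute URL.)

```javascript
// Hypothetical helper: turn raw src attribute values into absolute, unique URLs.
function normalizeImageUrls(rawSrcs, pageUrl) {
  const seen = new Set();
  for (const raw of rawSrcs) {
    const src = (raw || '').trim();
    // Skip empty values and inline data: URIs (there is nothing to fetch)
    if (!src || src.startsWith('data:')) continue;
    try {
      // new URL(relative, base) resolves relative paths against the page URL
      seen.add(new URL(src, pageUrl).href);
    } catch {
      // Ignore values that cannot be parsed as a URL
    }
  }
  // A Set keeps insertion order, so the first occurrence of each URL wins
  return [...seen];
}

// Assumed sample data: the page URL and src values are illustrative only
const urls = normalizeImageUrls(
  [
    '/img/cat-in-a-suit.jpg',
    'https://website.com/cat-in-a-suit.jpg',
    '/cat-in-a-suit.jpg', // same image as the line above, written relatively
    'thumbs/cat.png',
    '',
    'data:image/gif;base64,AAAA',
  ],
  'https://website.com/gallery/index.html'
);
console.log(urls);
```

The built-in URL class does the heavy lifting here, so there is no need to glue strings together by hand.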



Step 2: Use JavaScript (But Don’t Be a Script Kiddie)

For static sites, Cheerio (a lightweight jQuery-like HTML parser) is your BFF. For dynamic sites that build the page with JavaScript, Puppeteer (a library that drives a real headless Chrome browser) will pretend to be a human. Here’s a sneak attack:

JavaScript

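// Puppeteer example: Scrape images like a ninja
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so script-inserted images are in the DOM
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // img.src is the resolved absolute URL; drop <img> tags that have no src
    const images = await page.$$eval('img', imgs =>
      imgs.map(img => img.src).filter(Boolean)
    );

    console.log("🕶️ Here’s your loot:", images);
  } finally {
    // Close the browser even if navigation or scraping throws
    await browser.close();
  }
})();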

Pro Tip: Always check a site’s robots.txt first. Ignoring it is like ignoring a “Beware of Dog” sign 🐕.
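If you want your scraper to read the sign for you, here is a deliberately simplified sketch. It assumes you only care about the "User-agent: *" group and plain prefix Disallow rules; it ignores Allow lines, wildcards, and Crawl-delay, so treat it as a first check and not a full robots.txt parser. The function name isPathDisallowed and the sample file are made up for illustration.

```javascript
// Simplified robots.txt check (assumption: only "User-agent: *" groups and
// plain-prefix Disallow rules; no Allow precedence, wildcards, or Crawl-delay).
function isPathDisallowed(robotsTxt, path) {
  let inWildcardGroup = false; // are we inside a group that applies to "*"?
  let lastWasAgent = false;    // consecutive User-agent lines share one group

  for (const line of robotsTxt.split(/\r?\n/)) {
    const clean = line.split('#')[0].trim(); // strip comments
    const idx = clean.indexOf(':');
    if (idx === -1) continue; // blank or malformed line

    const field = clean.slice(0, idx).trim().toLowerCase();
    const value = clean.slice(idx + 1).trim();

    if (field === 'user-agent') {
      inWildcardGroup = (lastWasAgent && inWildcardGroup) || value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      // An empty Disallow value means "nothing is disallowed"
      if (field === 'disallow' && inWildcardGroup && value !== '' && path.startsWith(value)) {
        return true;
      }
    }
  }
  return false;
}

// Assumed sample file, for illustration only
const sampleRobots = `
User-agent: *
Disallow: /private/
Disallow: /tmp/   # scratch space

User-agent: BadBot
Disallow: /
`;

console.log(isPathDisallowed(sampleRobots, '/private/cat.jpg')); // true
console.log(isPathDisallowed(sampleRobots, '/images/cat.jpg'));  // false
```

In a real run you would load the text first, for example with fetch(new URL('/robots.txt', siteUrl)) on Node 18 or newer, and skip any page whose path comes back disallowed.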


Step 3: Filter & Download Responsibly 🌱

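Not all images are created equal. Filter out logos, icons, or anything with alt="annoying-ad" (to filter on alt text, grab the alt attribute alongside src when you collect the images). To download what’s left, fetch the bytes and write them to disk: Node.js’s built-in fetch plus the fs module, or Python’s requests, will do.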

JavaScript

// Keep cat pics, skip thumbnails (priorities, people)
// Note: a plain substring match on 'cat' also catches URLs like /category/
const catImages = images.filter(src =>
  src.includes('cat') && !src.includes('thumbnail')
);
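For the download half of this step, here is a minimal sketch in Node.js. It assumes Node 18 or newer (for the global fetch) and an output folder of your choosing. The helper names filenameFromUrl and downloadImage are hypothetical, not part of any library.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical helper: derive a safe local filename from an image URL.
function filenameFromUrl(imageUrl) {
  // pathname excludes the query string, so "?size=large" never reaches the filename
  const { pathname } = new URL(imageUrl);
  // basename drops directories; fall back to a generic name for paths like "/"
  const name = path.posix.basename(pathname) || 'image';
  // Replace anything outside a conservative character set
  return name.replace(/[^a-zA-Z0-9._-]/g, '_');
}

// Download one image into outDir. Assumes Node 18+ (global fetch).
async function downloadImage(imageUrl, outDir) {
  const res = await fetch(imageUrl);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${imageUrl}`);
  await fs.mkdir(outDir, { recursive: true });
  const target = path.join(outDir, filenameFromUrl(imageUrl));
  await fs.writeFile(target, Buffer.from(await res.arrayBuffer()));
  return target;
}

// Usage (inside an async function), with catImages from the filter above:
//   for (const src of catImages) {
//     console.log('saved', await downloadImage(src, './cats'));
//   }
```

Downloading one file at a time inside a plain for loop is intentional: it keeps the request rate low, which matters for the next step.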

Step 4: Avoid Getting Blocked (AKA Don’t Be a Greedy Goblin)

Websites hate scrapers that hit them 100 times/second. Add delays, rotate user agents, and use proxies. Or just… don’t be a bot 🤖.

JavaScript

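// Puppeteer with a polite delay
// (page.waitForTimeout() is gone from recent Puppeteer releases, so use a plain timer)
await new Promise(resolve => setTimeout(resolve, 3000)); // 3 seconds = good karma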
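A single pause is a start, but the delay really pays off when it sits between every request in a loop. Here is one way that could look, as a sketch: the helper name forEachPolitely and its baseDelayMs and jitterMs options are made up for this post, and the default numbers are a guess you should tune per site.

```javascript
// Resolve after ms milliseconds
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: run `task` on each item one at a time, pausing between
// items. The delay is jittered so requests do not land on a fixed beat.
async function forEachPolitely(items, task, { baseDelayMs = 3000, jitterMs = 1000 } = {}) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    // Awaiting here keeps things strictly sequential: one request in flight at a time
    results.push(await task(items[i]));
    // No need to wait after the final item
    if (i < items.length - 1) {
      await sleep(baseDelayMs + Math.random() * jitterMs);
    }
  }
  return results;
}

// Usage (inside an async function), assuming `pageUrls` is your own list:
//   await forEachPolitely(pageUrls, url => page.goto(url));
```

Running tasks sequentially is slower than firing everything with Promise.all, and that is the point: the site sees a trickle, not a flood.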

Step 5: Ethical Stuff (Because We’re Not Villains)

  • Respect copyrights: Don’t scrape Unsplash and sell the pics as NFTs.
  • Credit creators: If you use ’em, link back.
  • Rate limits: Treat websites like a buffet—take one plate at a time.

Final Thoughts: Scrape Smart, Not Hard 💡

Webscraping images is powerful, but with great power comes great “why is my IP banned?” energy. Use tools wisely, respect website rules, and maybe leave a virtual thank-you note 💌.

And hey, if you’re still wrestling with HTML in Angular, revisit my guide on preventing HTML responses in Angular. Because nobody wants JSON dressed as HTML!

Got scraping horror stories? Share ’em on the blog! Catch you in the next post.
Sharih Hassan ✌️