
I see. I'm still looking for a way to control the browser from the inside via a browser extension. Very tough problem to solve.


Yup. Lately, I've been doing it a completely different way (but still from the outside)... Using a Raspberry Pi as a fake keyboard and mouse. (Makes more sense in the context of mobile automation than desktop.)

What's good for security is generally bad for automation... and trying to automate from inside a heavily secured sandbox is... frustrating. It works a little bit (as Cypress folks more recently learned), but you can never get to 100% covering all the things you'd want to cover. Driving from the outside is easier... but still not easy!


Interesting, so you are emulating hardware inputs from the RPi.

How is it reading what's on the screen? Computer vision?


Not to make this an ad for my project, but I'm starting to document it more here: https://valetnet.dev/

The Raspberry Pi is configured to use the USB HID protocol to look and act like a mouse and keyboard when plugged into a phone. (Android and iOS now support mouse and keyboard input.) For video, we have two models:

- "Valet Link" uses an HDMI capture card (and a multi-port dongle) to pull the video signal directly from the phone if available. (This applies to all iPhones and high-end Samsung phones.)

- "Valet Vision" uses the Raspberry Pi V3 camera positioned 200mm above the phone to grab the video that way. Kinda crazy, but it works when HDMI output is not available. The whole thing is also enclosed in a black box so light from the environment doesn't affect the video capture.
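To make the "look and act like a keyboard" part concrete, here is a rough sketch of what sending keystrokes from the Pi could look like. It assumes the Pi is already set up as a USB HID gadget exposing a standard boot-keyboard device at /dev/hidg0; that path, and the names key_report and type_text, are illustrative assumptions and not necessarily how the Valet setup does it. A boot-keyboard report is 8 bytes: a modifier byte, a reserved byte, and up to six key usage IDs.

```python
# Sketch only: assumes a USB HID gadget at /dev/hidg0 (hypothetical setup)
# that uses the standard 8-byte boot-keyboard report layout.
KEY_A = 0x04           # HID usage ID for "a"; letters run consecutively to "z" (0x1D)
MOD_LEFT_SHIFT = 0x02  # modifier bit for the left Shift key
RELEASE_ALL = bytes(8)  # an all-zero report means "no keys pressed"


def key_report(char: str) -> bytes:
    """Build an 8-byte report that presses a single ASCII letter."""
    if len(char) != 1 or not char.isascii() or not char.isalpha():
        raise ValueError("this sketch only handles single ASCII letters")
    modifier = MOD_LEFT_SHIFT if char.isupper() else 0
    usage = KEY_A + ord(char.lower()) - ord("a")
    # byte 0: modifiers, byte 1: reserved, bytes 2-7: up to six pressed keys
    return bytes([modifier, 0, usage, 0, 0, 0, 0, 0])


def type_text(text: str, device_path: str = "/dev/hidg0") -> None:
    """Press and release each letter by writing reports to the gadget device."""
    with open(device_path, "wb", buffering=0) as hid:
        for ch in text:
            hid.write(key_report(ch))
            hid.write(RELEASE_ALL)
```

Digits, punctuation, and mouse movement need their own usage IDs and report layouts, so treat this as the smallest possible starting point.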

Then once we have an image, yes, you use whatever library you want to process and understand what's in the image. I currently use OpenCV and Tesseract (with Python). Could probably write a book about the lessons learned getting a "vision first" approach to automation working (as opposed to the lower-level Puppeteer/Playwright/Selenium/Appium way to do it).
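As a minimal sketch of the OpenCV + Tesseract step, assuming the pytesseract wrapper is what talks to Tesseract: OCR the frame, then turn the matching word's bounding box into an X,Y point you could move the mouse to. The function names and the min_conf threshold are made up for illustration; a real pipeline would likely need preprocessing (thresholding, scaling) before OCR is reliable.

```python
def find_text_center(ocr, target, min_conf=60.0):
    """Return the (x, y) center of the first OCR word equal to target, or None.

    `ocr` is the dict-of-lists shape returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    for i, word in enumerate(ocr["text"]):
        if word.strip() != target:
            continue
        # Confidence may arrive as an int, float, or string depending on version.
        if float(ocr["conf"][i]) < min_conf:
            continue
        x = ocr["left"][i] + ocr["width"][i] // 2
        y = ocr["top"][i] + ocr["height"][i] // 2
        return (x, y)
    return None


def locate_on_screenshot(path, target):
    """OCR an image file and return the center of `target`, or None."""
    # Third-party imports are kept local so find_text_center works without them.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ocr = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
    return find_text_center(ocr, target)
```

Something like locate_on_screenshot("calculator.png", "1") would be the classic-OCR version of the calculator test; whether Tesseract actually picks up a lone digit on a button depends heavily on the preprocessing.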


> Could probably write a book about the lessons learned getting a "vision first" approach to automation working

Ha, that would be splendid! Please do, maybe even a blog on valetnet.dev (lovely site btw; a demo or video would be nice).

I'm convinced vision first is the way to go. Despite people saying it's slow, the benefits are tremendous, as a lot of websites simply do not play nice with HTML, and I do not like having to inspect XHR to figure out APIs.

SikuliX was my last love affair with this approach, but eventually I lost interest in scraping and automation, so I'm pleased to see people still working on vision-first automation approaches.


Agreed on the need for a demo. #1 on the TODO list! If I know at least one person will read it, I might even do a blog, too! :)

The rise of multi-modal LLMs is making "vision first" plausible. However, my basic test is asking these models to find the X,Y screen coordinates of the number "1" on a screenshot of a calculator app. ChatGPT-4o still can't do it. Same with LLaVA 1.5 last I tried. But I'm sure it'll get there someday soon.

Yeah, SikuliX was dependent on old school "classic" OpenCV methods. No machine learning involved. To some extent those methods still work in highly constrained domains like UI automation... But I'm looking forward to sprinkling in some AI magic when it's ready.


You already have a fan! Feel free to contact me if you need more traffic; I'll be sure to spread the word.


I do this for https://browserflow.app (and the AI version in development at https://browserbot.ai) via the chrome.debugger API: https://developer.chrome.com/docs/extensions/reference/api/d...


I do a lot of quick manual scrapes via devtools.

You could try this:

Chrome web scraper extension - https://chromewebstore.google.com/detail/web-scraper-free-we...


Are you using native messaging? It lets an extension talk to a program running with full permissions on the computer, which could then use Puppeteer or the like. https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...

Seems like it wouldn't be that hard to sync the two, but the devil is in the details. Also, installing the native script is outside the purview of the WebExtension, so you need to have an installer.
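For a sense of how small the native side can be, here is a sketch of a native messaging host in Python. The wire format is the documented one: each message is UTF-8 JSON preceded by a 4-byte length in native byte order, exchanged over stdin/stdout. The function names and the echo reply are placeholders; the spot where it echoes is where a real host would hand the command to Puppeteer or similar. The host manifest and the installer step are not shown.

```python
import json
import struct
import sys


def read_message(stream):
    """Read one length-prefixed JSON message; return None at end of input."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("=I", header)  # 32-bit length, native byte order
    return json.loads(stream.read(length).decode("utf-8"))


def write_message(stream, message):
    """Write one JSON message with its 4-byte length prefix."""
    body = json.dumps(message).encode("utf-8")
    stream.write(struct.pack("=I", len(body)))
    stream.write(body)
    stream.flush()


def run_host(stdin=None, stdout=None):
    """Placeholder loop: echo every message back to the extension."""
    stdin = stdin or sys.stdin.buffer
    stdout = stdout or sys.stdout.buffer
    while True:
        message = read_message(stdin)
        if message is None:
            break  # the browser closed the pipe
        write_message(stdout, {"echo": message})
```

The host script would just call run_host() when launched by the browser. Nothing else should print to stdout, since any stray output corrupts the message stream.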


If it's a single file you could just make it a download.

There are also the newer file system APIs (though in Safari you'll be missing features and need to put some things in a Web Worker).



