Hacker News | past | comments | ask | show | jobs | submit | acrophobic's comments

You can also use VarSet[0], which I think is easier than a spreadsheet since you don't have to switch workbenches.

[0]: https://wiki.freecad.org/Std_VarSet


The downside of spreadsheets is that they can really slow your model down. Every cell change triggers a full recompute of the 3D model. VarSets are much faster, at the cost of a couple of spreadsheet features. So prefer VarSets over spreadsheets if you can.
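For anyone who prefers scripting, a VarSet can also be created from FreeCAD's Python console. A rough sketch (assumes FreeCAD 1.0+, which introduced the `App::VarSet` object; the parameter name `Width` is just an example):

```python
# Run inside the FreeCAD Python console (FreeCAD 1.0+).
import FreeCAD as App

doc = App.ActiveDocument
varset = doc.addObject("App::VarSet", "Params")

# Add an illustrative parameter (name and value are arbitrary).
varset.addProperty("App::PropertyLength", "Width")
varset.Width = 50.0

doc.recompute()
```

Other objects can then reference it in expressions, e.g. `Params.Width` in a sketch constraint.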


On the one hand it's clearly suboptimal for any change, even ones that nothing depends on, to trigger a recompute. But also it feels like there's something a bit broken with spreadsheet dependency resolution in the first place. I've never been able to nail down a test case, but models seem to go over a performance cliff at a certain point. Ordinarily I'd put it down to something being unavoidably quadratic, but I've had cases where I'm certain that the same model is radically slower after being reloaded off disk.


Did not know about this. How do you see all the properties?


Just click the VarSet in the tree view and it lists them in the properties pane.


> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.

[0]: https://github.com/go-shiori/go-readability

[1]: https://github.com/markusmobius/go-trafilatura
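For what it's worth, basic usage of go-readability looks roughly like this (a sketch based on the project's README; check the repo for the current API):

```go
package main

import (
	"fmt"
	"time"

	readability "github.com/go-shiori/go-readability"
)

func main() {
	// Fetch the page and extract the article in one call,
	// with a 30-second timeout.
	article, err := readability.FromURL("https://example.com/some-article", 30*time.Second)
	if err != nil {
		panic(err)
	}

	fmt.Println(article.Title)       // extracted title
	fmt.Println(article.TextContent) // plain-text content
}
```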


We've been active users of go-trafilatura and love it


this is what i came here to see, thanks!


Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was only 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.


That codebase definitely leaves much to be desired; I’ve already had to fork it for work in order to fix some bugs.

One such bug: take a language that uses commas instead of periods as decimal separators, like Dutch (I think), and a page with a lot of prices on it. It’ll think all the numbers are relevant text.
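To illustrate the failure mode (a toy sketch, not Readability's actual code): content-extraction heuristics often score a block partly by its comma count, which breaks when commas are decimal separators:

```go
package main

import (
	"fmt"
	"strings"
)

// naiveScore mimics a comma-counting text-density heuristic:
// more commas are assumed to mean more prose-like text.
func naiveScore(text string) int {
	return strings.Count(text, ",")
}

func main() {
	prose := "A short sentence, with one clause."
	prices := "€ 1.299,00 € 24,95 € 109,50 € 7,25" // Dutch-style decimal commas

	fmt.Println(naiveScore(prose))  // 1
	fmt.Println(naiveScore(prices)) // 4 — the price list outscores real prose
}
```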

And of course I tried to open a PR and get it merged, but they require tests, and of course the tests don’t work on the page I’m testing. It’s all a bit SNAFU, IMHO.


This seems to be https://github.com/mozilla/readability/pull/853#issuecomment... and I think their expectations are pretty reasonable.


Meh, maybe I'm standing too close to the problem, I don't know. It is always frustrating trying to use a tool and having it not work, though. I know it's free and all, but I feel like helping people make good contributions is paramount to maintaining the project and fixing bugs.

Clearly the comma thing is a bug; it's the apparent lack of interest in actually fixing it that is a bit disheartening, and why I think it is a deadish repo.


I don't know how you can interpret "we'd really like to make sure that the patch works and that we don't break it in the future" as "lack of wanting to fix it", but you do you.


I've been working on several web extractor projects, so I thought I could share some of my findings from working on them. Granted, it's been several months since I last touched them, so I might be forgetting some things.

There are several open source projects for extracting web content. However, there are three extractors that I've worked with that give good results:

- readability.js[1], the web extractor by Mozilla that's used in Firefox.

- dom-distiller[2], web extractor by Chromium team, written in Java.

- trafilatura[3], Python package by Adrien Barbaresi from BBAW[4].

First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want, either in a web page via a `script` tag or in a Node project.

Next, DomDistiller is the extractor used in Chromium. It's written in Java, with a whopping 14,000+ lines of code, and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.

Finally, Trafilatura is a Python package released under the GPLv3 license. Created in order to build text databases[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it came to work really well with other languages too. It's a bit slow compared to Readability.js, though.

All three work in a similar way: extract metadata, remove unneeded content, and finally return the cleaned-up result. Their differences (that I remember) are:

- In Readability, they insist on making no special rules for any website, while DomDistiller and Trafilatura make small exceptions for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.

- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.

- In DomDistiller, the metadata extraction is more thorough than in the others. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.

- Since DomDistiller is only usable within Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it is deemed unimportant. However, according to one study[6], this step doesn't really affect the extraction result.

- DomDistiller also has an experimental feature to find and extract the next page on sites that split an article into several partial pages.

- For Trafilatura, since it was created for collecting web corpora, its main strength is extracting the text and publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose sole purpose is extracting the publication or modification date of a web page.
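To give a feel for the easy case a tool like htmldate handles (a minimal sketch; the real library layers many fallbacks on top of this, and the meta tag here is just one common convention):

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// metaDateRe matches a <meta property="article:published_time" ...> tag,
// one of the most common ways pages declare a publication date.
var metaDateRe = regexp.MustCompile(
	`<meta\s+property="article:published_time"\s+content="([^"]+)"`)

// extractPublishedDate pulls and parses the date from that tag.
func extractPublishedDate(html string) (time.Time, error) {
	m := metaDateRe.FindStringSubmatch(html)
	if m == nil {
		return time.Time{}, fmt.Errorf("no published_time meta tag")
	}
	return time.Parse(time.RFC3339, m[1])
}

func main() {
	page := `<html><head>
<meta property="article:published_time" content="2023-05-17T08:30:00Z">
</head><body>...</body></html>`

	t, err := extractPublishedDate(page)
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Format("2006-01-02")) // 2023-05-17
}
```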

- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
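That last idea can be sketched as counting duplicate text blocks and dropping any that appear more than a threshold (a toy version, not Trafilatura's actual algorithm; the threshold is made up):

```go
package main

import "fmt"

// dropRepeated removes blocks whose exact text appears more than
// maxRepeats times — boilerplate like "Subscribe now!" tends to
// recur, while real article text doesn't.
func dropRepeated(blocks []string, maxRepeats int) []string {
	counts := make(map[string]int)
	for _, b := range blocks {
		counts[b]++
	}
	var kept []string
	for _, b := range blocks {
		if counts[b] <= maxRepeats {
			kept = append(kept, b)
		}
	}
	return kept
}

func main() {
	blocks := []string{
		"Article text.", "Subscribe now!", "More text.",
		"Subscribe now!", "Subscribe now!",
	}
	fmt.Println(dropRepeated(blocks, 2)) // [Article text. More text.]
}
```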

I've found a benchmark[8] that compares the performance of these extractors, and it found that Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it is intended for gathering web corpora, so it's really great at extracting text content, but IIRC it's not as good as Readability.js and DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature, though).

By the way, if you are using Go and need a web extractor, I already ported all three of them to Go[9][10][11], including their dependencies[12][13], so have fun with them.

[1]: https://github.com/mozilla/readability

[2]: https://github.com/chromium/dom-distiller

[3]: https://github.com/adbar/trafilatura

[4]: https://www.bbaw.de/en/

[5]: https://www.dwds.de/d/k-web

[6]: https://arxiv.org/abs/1811.03661

[7]: https://github.com/adbar/htmldate

[8]: https://github.com/scrapinghub/article-extraction-benchmark

[9]: https://github.com/go-shiori/go-readability

[10]: https://github.com/markusmobius/go-domdistiller

[11]: https://github.com/markusmobius/go-trafilatura

[12]: https://github.com/markusmobius/go-htmldate

[13]: https://github.com/markusmobius/go-dateparser


This is it! So it was onion, not potato. Thanks!


Besides the address bar, another issue for me is that now all images are lazy-loaded, even on websites that don't use JavaScript.

While I realize there are advantages to lazy-loading images, I've never liked it because it often makes the content shift, which is a bit annoying. However, on pages that use JS to lazy-load images, at least they usually put a placeholder image, so I know there will be an image there.

Unfortunately, since Firefox does it even for ordinary websites, I now often scroll past without realizing there are images, only to find that the paragraph I was reading suddenly jumped to the bottom.

I'm worried it will be permanent, especially since right now it's a bit hard to revert this feature.


It still defaults to eager loading. You have to explicitly add the loading="lazy" attribute for it to load lazily.
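For reference, the behavior is opt-in per image:

```html
<!-- Default: fetched eagerly, as always -->
<img src="photo.jpg" alt="A photo">

<!-- Only fetched when it nears the viewport -->
<img src="photo.jpg" alt="A photo" loading="lazy">
```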


Is there a flag to force it on most images, even the ones without the loading attribute?


As a developer, I just started using the loading="lazy" attribute. It has nothing to do with JavaScript; it depends on the developer, not the browser.


Unfortunately, I don't think I'm good enough to do that.

Besides, porting an app is IMHO a great way to learn a new language. It gives me a clear goal and lets me focus on coding instead of making design decisions.


Thank you very much.

To clarify, it won't download the full HTML, only the content part of the webpage. It works by using go-readability[0], which strips unnecessary elements from a webpage.

[0] https://github.com/RadhiFadlillah/go-readability/


Yeah, I plan to create add-ons for Chrome and Firefox. The REST API for saving bookmarks is already there, so I think it should be easy enough to do.

