As someone who wrote a lot of Perl in the late 90's and early 2000's and moved to Python for scientific work (simulations, etc.) as well as other tasks (web apps, etc), what I saw was:
- Perl actually was really popular for a while back then, especially in the Bioinformatics/Genomics space. It was all over that field, I think partially because it's really easy to think of a genome as just a text string of ATCGs and Perl was really convenient for manipulating text.
- I originally switched to Python for some projects because I had to do simple GUIs and visualization. Pygame and TKinter were much nicer to deal with than Perl's options. If you were just reading data in from one file and writing it out to another, Perl was fine, but the GUI toolkits were miserable.
- Numpy ("Numeric" at the time) was what really sealed the deal and probably pushed a lot of others to switch. Perl had PDL and similar but they weren't as fast or as easy to understand and use. Even if Perl is "Naturally far more efficient" than Python, it's still slow enough that for non-trivial calculations you would not want to use it directly. You have to use one of these other libraries where the actual number crunching is happening in highly tuned FORTRAN or C libraries. Python's options back at that time just leapfrogged Perl's and Numpy had really nice integration with Pygame, which made visualization in your app really smooth. Perl might've caught back up, but that was around when Perl6 was announced and sucked all the air out of the room.
I built and maintained the python stack at a financial firm that had a sizeable investment in perl and PDL, and iirc PDL was unreliable (crashy, not sure about other aspects of reliability) and I don't recall that it was making the transition to 64 bit in the mid/late 2000s at the speed numpy was. Do you remember any more detail from the perl side of the fence?
Not really. I vaguely remember trying out PDL and not getting very far. I was pretty comfortable writing C back then, so any time I'd run into performance issues in Perl, I'd usually just switch to C. Getting PDL working properly was painful enough that it didn't really seem worth spending a lot of time on when I could just write a small C program, especially for commandline data pipeline type situations. Once I started hitting projects with requirements for interactive GUIs though, keeping everything in the same language becomes more valuable and Python had the better complete package.
My first internship was re-writing a Perl program that parsed the output of `samtools mpileup`[0] into a C program that used htslib[1] to directly read the BAM file and extract the relevant data. This preprocessing (the Perl version) was the slowest part of the pipeline, even slower than the actual data analysis.
Yeah, I feel like that was a pretty common workflow. Someone would get a new algorithm working in Perl, test it on fairly small datasets, maybe publish their paper, then some poor chump would have to carefully rewrite it in C to make it actually useful for real data. I was at a university, so our chumps were grad students. I'm sure interns were the industry equivalent :)
oof I can't imagine converting perl string manipulation to C was fun. Did you try to replicate what the perl code was doing, or did you just redo the entire thing in a more C like way?
Data scientist here. Readable code is important in data science. People rely on our products to be based on solid numbers and logic, sometimes without any form of external validation. We can't just scribble line noise in a REPL until we get some output that looks vaguely reasonable. We need to be able to actually read the code and know that what it's doing makes sense.
So readability matters. And Python is one of the most readable languages out there. It remains relatively readable even as the code and data structure get very complex. Its syntax resembles math and pseudo-code more closely than other languages. It feels more like a tool of thought, not just a bunch of alien hieroglyphics you have to write to crudely and inefficiently express how you really think about the problem.
Perl, on the other hand, is one of the least readable languages for a general audience. It is not a tool of thought for non-experts. Its syntax is ugly, obscure, overly symbol-laden, and beloved only by gurus. The supposed "feature" of TIMTOWTDI just makes it more obscure.
The thing I can never get over with Perl is that it didn't even have a sane, idiomatic notation for functions until recently (and even today I suspect it's not widely adopted). There was no such thing as just defining a function f(a,b) of multiple named formal arguments. You had to use some special magic variable and people start talking about shift operators or some such nonsense.
This is the point at which I start saying "yes, all the language choices are Turing complete, but that doesn't mean they're all equally effective choices".
Having a lot of different bad ways of dealing with something is not an excuse for not having a single good way of dealing with something, which is kind of the point about why Perl has failed as a programming language.
Readability is really a weak argument.
I've seen a lot of unreadable code written by data scientists in Python. And a lot of readable Perl code. It's not about the language, it's about the best practices.
The concept behind sigils is very simple and once you've learned it, it's not that threatening. And when talking about data structures, it is a benefit to have it more structured and know that you are dealing with an array or hash or scalar.
In my opinion, Perl is not used for data science because it lacks (or people are not aware of) libraries like numpy, pandas, etc.
Unless you are hell-bent on writing obfuscated code, Python code is almost always readable. Even when I didn't know any Python or any other language, I could make an educated guess at what some Python program was doing. However, I have seen some Perl code and I was like wtf.
As for best practices, Python does insist on writing readable code a lot. Python people always nudge you to write readable code. Guido himself said he designed Python to be a readable language. In general, Python has the philosophy of one way of doing something, which contrasts with the Perl way, where you can accomplish a task in numerous ways. You can't be too clever with Python (a slightly exaggerated position). However, the same is not true of Perl.
I vehemently disagree. When choosing among languages today, performance is mostly good enough across all languages and readability becomes the most important consideration. It's "subjective", but human beings are subjects, and subjective things are important to them.
> I've seen a lot of unreadable code written by data scientists in Python. And a lot of readable Perl code. It's not about the language, it's about the best practices.
This is a bad and cliche argument. Yes, there is well-written code in languages that deprioritize readability and badly-written code in languages that highly prioritize readability. That doesn't mean that any differences in the inherent, built-in readability of X and Y are moot. And "best practices", which are not widely agreed upon or followed, absolutely do not suffice to paper over this.
> The concept behind sigils is very simple and once you've learned it, it's not that threatening.
"Once you've learned it" or "get used to it" is the biggest programming cliche in the world. People say this about all their favorite languages and language features. I'm sure I could complain about doing data science in Brainfuck and get someone in the comments who would say it's not threatening "once you get used to it".
Well sorry, but I don't want to get used to your favorite language's line noise. That's why I choose another language. That's why everyone else chooses another language.
> And when talking about data structures, it is a benefit to have it more structured and know that you are dealing with an array or hash or scalar.
You can keep track of types in Python with gradual typing. Most people don't bother, so evidently it's not that important in most cases, but you can do it.
Other than that, no, I don't need to fill my code with more line noise to constantly remind myself what type of container I'm dealing with. That's obvious enough from the methods I'm using to manipulate the container.
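For anyone who hasn't seen gradual typing, here is a minimal sketch of what it looks like. The names `Family`, `kid_names`, `kid_count` and the sample data are made up for illustration, it assumes Python 3.9+, and the annotations are only checked if you run an external checker such as mypy; the interpreter itself ignores them.

```python
from typing import TypedDict


class Family(TypedDict):
    # Hypothetical record type, only for illustration.
    lead: str
    kids: list[str]


def kid_names(tv: dict[str, Family], family: str) -> str:
    """Return the kids of one family as a comma-separated string."""
    return ", ".join(tv[family]["kids"])


# Annotations are optional: this untyped helper coexists with the typed one.
def kid_count(tv, family):
    return len(tv[family]["kids"])


TV: dict[str, Family] = {
    "simpsons": {"lead": "homer", "kids": ["bart", "lisa", "maggie"]},
}

print(kid_names(TV, "simpsons"))
print(kid_count(TV, "simpsons"))
```

The point of "gradual" is that you can annotate the parts where knowing the container type matters and leave the rest alone.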
> In my opinion, Perl is not used for data science because it lacks (or people are not aware of) libraries like numpy, pandas, etc.
No, this is after the horse has already left the barn. Those libraries were developed in Python because the authors of those libraries preferred to work in Python. The authors preferred to work in Python because of the legibility considerations I've already mentioned.
Python's design and adoption were both heavily driven by a vision of readability from the very earliest stages in the 1980s/90s. People got into Python because they liked how the code looked:
> A friend of mine who knows nearly all the widely used languages uses Python for most of his projects. He says the main reason is that he likes the way source code looks. That may seem a frivolous reason to choose one language over another. But it is not so frivolous as it sounds: when you program, you spend more time reading code than writing it. You push blobs of source code around the way a sculptor does blobs of clay. So a language that makes source code ugly is maddening to an exacting programmer, as clay full of lumps would be to a sculptor.
> I vehemently disagree. Readability is the most important consideration. It's "subjective", but most human "subjects" prefer Python to Perl on this dimension, and it's very important to them. That's why we are where we are today.
I absolutely agree that readability is the most important consideration. I just disagree that it depends on the language. As said before, there is a lot of write-only code written by data scientists in Python.
> Well sorry, but I don't want to get used to your favorite language's line noise. That's why I choose another language. That's why everyone else chooses another language.
You don't have to. It's called a programming language paradigm, not noise. You either learn it or not when studying a language. If you want to understand something without studying it, it's your choice.
I am proficient in both of these languages and can see the strengths and weaknesses in both of them.
> Those libraries were developed in Python because the authors of those libraries preferred to work in Python. The authors preferred to work in Python because of the legibility considerations I've already mentioned.
Perl was extensively used by the scientific community, but Python was the lucky one that was backed by Google in 2005. And that's ok. It is better than Perl for data science now, but not because of its syntax :)
We'll have to agree to disagree on most of this, which is fine, better than having a stupidity-driven argument. I'm sure the aspects of Perl that are weird and gross to me are great for some other people.
> Perl was extensively used by the scientific community, but Python was the lucky one that was backed by Google in 2005.
I'm curious about this. Can you point to any specific evidence that being "backed by Google in 2005" was a momentous event for Python? AFAIK Google is not a major player in the Python scientific computing / data analysis / ML community outside of TensorFlow, which was introduced after Python was already huge.
In 2005 Google hired Guido van Rossum, where he spent half of his time developing the Python language.
As you can see, before 2006, Python was a little less popular than Perl (9th vs 8th).
By the time Guido left Google (December 2012), Python had become the 4th most popular programming language.
I don't think Guido did anything special for the scientific/technical/data computing community after he came to Google. Python was on version 2.4 as of 2005, and that version would have been fine to develop against indefinitely. In fact, it's been a huge headache getting scientists to move to Python 3.
The sibling comment mentioned that there's probably no contribution from Guido regarding the meteoric rise of Python in data science, but the truth is really the opposite [1]. Guido was really instrumental in the suitability of Python for data science [2].
To quote from the magazine article on Python for Scientists and Engineers:
"During these early years, there was considerable interaction between the standard and scientific Python communities. In fact, Guido van Rossum, Python's Benevolent Dictator For Life (BDFL), was an active member of the matrix-sig. This close interaction resulted in Python gaining new features and syntax specifically needed by the scientific Python community. While there were miscellaneous changes, such as the addition of complex numbers, many changes focused on providing a more succinct and easier to read syntax for array manipulation. For instance, the parenthesis around tuples were made optional so that array elements could be accessed through, for example, a[0,1] instead of a[(0,1)]. The slice syntax gained a step argument— a[::2] instead of just a[:], for example—and an ellipsis operator, which is useful when dealing with multidimensional data structures."
Personally I'd much prefer the D language to replace Python for data science, due to its Python-like programming vibes but with more powerful real-time programming constructs, and its suitability for beginners or type A data scientists [3]. Python is already showing its limitations, especially for the B type of data scientists [4]. There have been recent discussions on the Dlang forum about potential improvements of D for data science, and there's also a recommendation for Walter to join such a SIG, as Guido did [5].
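The indexing syntax described in the quoted article is easy to poke at directly. A minimal sketch, where `Probe` is a hypothetical helper class (not part of any library) that just hands back whatever the subscript passes to `__getitem__`:

```python
class Probe:
    """Hypothetical helper: returns whatever key the subscript passes in."""

    def __getitem__(self, key):
        return key


a = Probe()

# a[0, 1] and a[(0, 1)] hand the same tuple to __getitem__.
print(a[0, 1] == a[(0, 1)])

# The slice step: a[::2] arrives as a slice object with step 2.
print(a[::2])

# The ellipsis can stand in for "all the remaining dimensions".
print(a[..., 0])
```

Array libraries build their multidimensional indexing on exactly these objects (tuples, slices and Ellipsis); the language only provides the notation.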
I'm sure Guido did things for the scientific community. My claim was that he probably didn't move the needle for them after coming to Google in 2005.
The other person claimed that Google magic somehow led to Python becoming dominant in the sci/numeric/data computing community, and that just strikes me as wildly implausible. Google has done negligible foundational work on scientific Python. No one in that community has ever cared what Google thinks.
Anyways, I'm curious why you think D has potential as a Data Science language? Why not a more popular language like Julia or Rust?
> performance is mostly good enough across all languages
Performance of applications written in Python is good enough only where there is a very low requirement for fast code (e.g. glue code / shell replacement, of which there is a lot and, as computers get faster, more and more) or where there are native code libraries (e.g. numpy). Of the popular languages, only Ruby is slower. I have high hopes for Julia (at least in the data science / numerical computing domain), which performs much better while preserving the readability and ease of use of Python.
As someone who had to interview for VLSI design positions (both sides of the table) back when Perl was popular, I used to take a printout of a small tool I developed in Perl with me to every interview (parse a VCD file and check the results).
This forced the interviewers into MY subset of Perl. That made a huge difference into how competent they thought I was with Perl.
Most of the interviewers gave up after we went through about 10-20 lines of code because they had never seen half of the stuff in my code--and I wasn't doing anything particularly obscure (decorate-sort-undecorate/Schwartzian transform would lose most of them).
I guess I'm showing my youth here but I never imagined that sorting on a computed key was once known by such an exalted title as Schwartzian Transform. This is basic, built-in functionality in Python.
It's not just that the key is computed, but that the key has an internal sort comparison which is way faster than a user code comparison. It was an example of the fact that "Big-Oh notation is nice, but constants sometimes matter."
Both sorts have the same O(n log n) number of operations performance. However, in Perl, using an internal comparison for integers was a significant constant faster than using a user specified comparison function--at least an order of magnitude and sometimes two. Consequently, the 2*O(n) to decorate and undecorate the list with integers to sort by got swamped by the 1/10*O(n log n) sort improvement.
I believe that this overhead eventually got reduced. However, I remember using this a LOT in Perl 4 and early Perl 5 days.
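For readers who never had to do this by hand, a rough sketch of the two approaches in Python. The word list and `sort_key` are made up for illustration; the decorate step also carries the original index so that ties never fall through to comparing the payload.

```python
words = ["banana", "fig", "cherry", "kiwi"]


# Hypothetical "expensive" key; here just the word length.
def sort_key(word):
    return len(word)


# Decorate-sort-undecorate, spelled out: the sort compares the
# precomputed (key, index, value) tuples with the built-in comparison.
decorated = [(sort_key(w), i, w) for i, w in enumerate(words)]
decorated.sort()
dsu_sorted = [w for _, _, w in decorated]

# Built-in equivalent: key= is computed once per element.
key_sorted = sorted(words, key=sort_key)

print(dsu_sorted)
print(key_sorted)
```

Both give the same order here; the `key=` form is just the same idea folded into the sort call.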
Sounds generally plausible to me. I learned all sorts of obscure tricks to make MATLAB run faster in my grad school days... I wouldn't be surprised if something like this was useful over there as well.
However, for data science, the biggest drawback vs Python is how you work with complex data structures. Because everything in Python is an object, (de)referencing things is generally straightforward, like:
It is not entirely about the syntax, for example, the code looks similar in Python (if I remember Perl correctly):
print(f"it turns out that {TV[family]['lead']} has ", end='')
print(len(TV[family]['kids']), " kids named ", end='')
print(*[kid.name for kid in TV[family]['kids']], sep=', ')
For me, it was the culture “there is _one_ obvious (for a Dutch) way to do it” vs. “many” in Perl and most people lack discipline/good taste to handle the [too much] freedom provided by Perl.
Somehow Python took the niche of “executable pseudocode”, “glue language” for me.
Minus the sigils and crazy wrapping, like: @{ $TV{$family}{kids} } to refer to the array, but a different sigil/form if you're talking about one element, etc.
Which may not look intimidating on its own, but it's leaving out % and the backslash \$reference syntax, $object->method and other things that get confusing.
You can't, for example really make a "list of hashes/dicts". Try it, and it just turns the keys/values into array elements. You have to specifically make a "list of references to hashes/dicts". So you do something like:
For printing strings, there is more than one way to do it in Python; differently readable for.
print("""It turns out that {} has {} kids named {}""".format(
    TV[family]['lead'],
    len(TV[family]['kids']),
    ', '.join([kid.name for kid in TV[family]['kids']])
))
It doesn't use the nifty f string formatter, but the English sentence part is a bit easier to read.
Some people don't like sigils on variable names, and I get that. Personally I find them useful.
As far as the perldsc page example, I think that comes down to having a better personal style that makes your code readable. I would make the top level variable a hash reference rather than a plain hash, and purposefully not insist on dereferencing multiple levels all on one line.
As someone pointed out in another comment, you can make the Python version look just as lousy by virtue of a lousy style.
My version of code doing the same task would be something like this:
my $familyData = $TV->{$family};
my $lead = $familyData->{lead};
my $kidsList = $familyData->{kids};
my $kidCount = scalar( @{ $kidsList } );
my $kidListStr = join (
", ",
map { $_->{name} } @{ $kidsList }
);
print "it turns out that $lead has ";
print "$kidCount kids named $kidListStr\n";
I used Perl for a lot of data processing, mostly text. At that time, our machine learning code was mostly C/C++. Perl slowly died before the big frameworks took off. It didn't compete with Python, to be honest. Of course, even if there had been a direct battle between Perl and Python, Python would probably have won anyway.
I do find Perl much easier for manipulating text files, file names, and the sorts of things where bash or awk are too tedious. Then there's a line where something like Python becomes more attractive. Perl is still a lot faster than Python for many tasks too...the Perl devs seem to spend more time optimizing certain paths. Or perhaps "everything is an object" in Python slows everything down?
Python is not a good language for low-level numerics the way FORTRAN is, but it has good facilities for wrapping foreign-language libraries (written in C, FORTRAN, CUDA, etc.) that do high-level operations. There is not just the foreign function interface but also the operator overloading that makes it possible to write something like
dataframe["A"] + dataframe["B"]
and get a Series. You can make things that look like numbers or arrays but actually do something different.
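A toy sketch of the mechanism, in pure Python. `Column` is a made-up stand-in and not the pandas API; a real library would hand the loop off to compiled code instead of doing it in the interpreter.

```python
class Column:
    """Toy stand-in for a data-frame column (not the pandas API)."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Element-wise addition; a real library would do this in C.
        if len(self.values) != len(other.values):
            raise ValueError("columns must have the same length")
        return Column(x + y for x, y in zip(self.values, other.values))

    def __repr__(self):
        return f"Column({self.values})"


frame = {"A": Column([1, 2, 3]), "B": Column([10, 20, 30])}
total = frame["A"] + frame["B"]
print(total)
```

The `+` is just a method call on the left operand, which is why a library can make it mean "add two whole columns" without any change to the language.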
Both Python and Perl made bold transitions, Python from 2 to 3 and Perl from 5 to 6.
The Python transition was difficult but ultimately successful. On the other hand, Perl rolled the dice and lost. In some alternate universe it might have been the other way around.
> Both Python and Perl made bold transitions, Python from 2 to 3 and Perl from 5 to 6.
Perl 6 (Raku) is effectively a different language. Porting from 5 to 6 would change most lines of code, in other words, a complete rewrite. To ease transition, Perl 6 made it so that both could coexist, but if you want to go v6, then it is all or nothing.
Python 2 and 3 are not that different. You can easily write code that is compatible with both, and porting is often not much more than running 2to3.py; most of it is replacing "print" with "print()" and converting between bytes and str.
Perl's strategy was to create a modern language inspired by Perl, and indeed, Perl 6 has quite a few features ahead of its time. And Raku (the new name for Perl 6) is still one of the most advanced languages today.
Python's strategy was to keep the same language but address a few pain points that required breaking compatibility, Unicode in particular.
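To make the bytes/str point concrete, a small Python 3 sketch (the sample string is arbitrary). Text and binary data are separate types, and mixing them is an error rather than an implicit conversion:

```python
# In Python 3, text (str) and binary data (bytes) are distinct types.
text = "caf\u00e9"  # four characters, the last one non-ASCII
data = text.encode("utf-8")

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Going back to text requires an explicit decode.
round_trip = data.decode("utf-8")
print(round_trip == text)

# Mixing the two raises instead of silently coercing.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

Most of the porting pain was finding the places where old code relied on the two being interchangeable.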
I'm a software developer who worked with research assistants and scientists for 10 years on large scale multinational projects at an Ivy League university. It's getting better these days, but at the time, those folks didn't have the time to learn proper software development - they had a ton of pressure to deliver results and publish papers.
Here's a bunch of things I was dealing with on a daily basis at the time:
* No revision control - sometimes files got deleted and research was lost
* Code that worked by accident, e.g. accessing the first element in an array using @array[$0] instead of @array[0], which worked only because $0 evaluates to "myscript.pl", a non-numeric string that Perl treats as 0 when it is used as an index, so @array["myscript.pl"] returns the first element in the array.
* Some people just liked their code left-indented (read: no indentation at all anywhere)
Most of the code was Perl and Matlab. Perl was hard to read, but it was good for text processing, not so much numerical processing, and Matlab was good for numerical processing, but it was slow and bloated.
When Python came along, with batteries included and with science-friendly libraries like numpy and scikit (so not just text processing, but proper numerical processing), you could get away with coding all your stuff in Python (though a lot of stuff is still Matlab). And on top of that it was easier to learn and to read than Perl, and it was fast enough. It was a no-brainer, so newcomers stopped learning Perl.
That said, Python's biggest contribution to this world, in my opinion, was forcing people to indent their goddamn code, because, as super-smart as these science guys generally were, they were also super-stubborn.
As for me, I moved from academia into the industry and it's so much nicer to work with other software developers. Sometimes I wonder if I should stop making some billionaire richer and go back to contributing to the scientific field, but, unfortunately, I have bills to pay. Maybe when I retire.
I got into Python because I outgrew Excel (and Origin) and started with Jupyter notebooks and Pandas. The Pandas functions pandas.read_excel and DataFrame.to_excel, combined with how visual a notebook is, made the transition very easy for me.
I do come across some perl every now and then but it looks like Bash on steroids to me (perhaps because of the $variables), and I never see any DataFrames or similar structure that looks familiar to me.
Idk, Perl never occurred to me and nobody ever recommended it. Is Perl good for data science? Do you have any examples? I never ran into "efficiency problems" with Python btw, I'm not using it at that scale, at what scale would I notice this? I have run out of ram at times, but that's usually when a dataset on disk is already larger than my ram. But then I usually find a tool to deal with the data anyway (for example pyosmium for super large osm/o5m files).
Edit: I feel that Python also gives me other nice things to get started with, things like Django and Snakemake. This also leads me to recommend Python to other people, it's a broad basis for a lot of stuff. That's why even though some people recommended R when I got started, I choose Python anyway. I have no regrets, unless you are going to blow my mind with some examples...
Perl is expressive but the code can be hard to read. In general, Python is readable because it enforces the indentation and other language design choices.
Python also got heavyweight libraries such as Numpy and Pandas which put it in the front. Perl does not have such well known libraries as far as I know.
Perl seems to have died down just about everywhere. I've not used it professionally or seen it in use in well over a decade. It's so expressive and has so many ways to do things that reading other people's Perl can be challenging. I think this contributed to its downfall. And "other people" can often mean "yourself in six months" too.
> And "other people" can often mean "yourself in six months" too.
Amen to this point. I am currently the single engineer in a small software biz and after the last engineer left I thought to myself "well, at least I won't have to decipher his code anymore"...needless to say, I've been shocked by how much time I can spend on a system only to revisit it within a year and have little to no memory of how or why I did something a certain way!
The last time I wrote production Perl code (2012) I stressed about keeping it readable, especially by anyone who would be learning Perl. The company was mostly C/C++ focused with maybe Python as a #3. I think Perl-isms can be foreign if you haven't seen them, so they should be avoided in any non-Perl focused company.
Looking now (it's open source) there's two keywords I have forgotten. It's a Perl binding to a C library using an old version of Swig. There's lots of Perl or C helper code to smooth out the bumps. It also looks like it hasn't been updated (bummer).
> Python also got heavyweight libraries such as Numpy
This. For the kind of tasks data science involves, you either write C++ or you need the equivalent of Numpy. No Numpy means no Pandas, which means no sane way to wrangle datasets.
For all the hate it gets, Numpy is a world-class library.
I love perl for regex scripts, where I need to quickly filter or transform a text file. I never liked it for other kinds of programming projects, for some reason to me it doesn’t feel as well suited to, say, numeric simulations.
> Naturally far more efficient
What does this mean exactly? One big reason for Python’s success in data science is numpy, which is far more efficient (especially on large data) than vanilla Python. I’m unaware of the state of the Perl ecosystem, does it have something similar?
> how come Perl lags far behind or gets dwarfed in data science by Python?
Perl seems to be lagging behind in general, no? Is there anything specific about data science where Perl should be shining?
> I’m unaware of the state of the Perl ecosystem, does it have something similar [to numpy]?
Perl has PDL (http://pdl.perl.org), which I think predates numpy and does the same kind of array manipulation.
As for the original question, Python's dominance for "data science" projects is just a matter of momentum at this point: it's where Google and others have put their money, and that is now affecting undergraduate and graduate courses. Originally, I think it was just that some people didn't like Perl's syntax.
Personally, I prefer to use a mix of Perl and Julia, each for its respective strengths, and find Python all-around mediocre.
Also data science requires a lot of intermediate/mockup data visualization and python puts matplotlib and scikit-learn right at your fingertips. Hard to find a match Perl side, but I might be wrong.
Circa 2000 I did a lot of unix scripting with Perl and also wrote cgi-scripts for the web with Perl.
In the cgi-script mode Perl had to start a new process and compile all your code for each request. There was "mod_perl" which was more efficient but frequently you struggled with memory leaks and other reliability problems.
PHP came out and then Apache Tomcat, web hosting systems for Ruby, etc. all of which had efficiency similar to mod_perl but easy and reliable environments to work in. Generally there were many modules in CPAN that were essential to web development (HTML escaping) for which bugs were not getting fixed and that added to the feeling that Perl was slipping behind.
As for more general scripting I think people found Python was better. Even though it is cross-platform, Perl has a strong UNIX feel to it. Python doesn't feel like it belongs to Windows, UNIX or any other environment, rather it feels comfortable anywhere.
Python has a very simple and consistent and unsurprising syntax compared to Perl. I think most programmers from other languages can look at Python and feel like they're reading pseudo-code that they actually understand for most operations. So for people coming from a mathematical or scientific background, it unlocks computing abilities without having to learn a bunch more knowledge from another domain. Add to that the success of libraries like NumPy and SciPy that round out the capabilities of the language itself and make many practical tasks very accessible, and it's just a nuke-from-orbit type situation for most other languages.
I mean, even the author of Learning Perl, said "sometimes Perl looks like line noise to the uninitiated, but to the seasoned Perl programmer, it looks like checksummed line noise with a mission in life."
Python can be easy to read, but it's filled with "gotchas" and occasionally infuriating syntactic "sugar". Don't get me wrong -- I like the language and the ecosystem is increasingly good -- but I find it plenty confusing enough at times. Perl's main footgun was "everything is a regex". Python's is "everything is an object". Of course, that's its main strength. A quick example:
I agree that a lot of the implicit type conversions are problematic, but Python's handling of that is hardly as unintuitive as things like a lot of Perl syntax, or JavaScript typing, etc. If that's the kind of impractical example you have to come up with to showcase this problem, honestly I think that's still pretty good, and I'd still call the implicit type conversions an overall win for the use case being asked about.
I think we could make better languages than Python that fix these issues. Like an implicit boolean -> int conversion does more harm than good, IMO (and it appears in your example). But I'm still not surprised that of the languages we have had when we have had them, that Python absolutely dominates this space.
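For anyone unfamiliar with the boolean -> int conversion, here is a small sketch of how it behaves in Python. This is my own illustration with made-up data, not the example from the parent comment.

```python
readings = [3, -1, 4, -1, 5]

# bool is a subclass of int, so True and False behave as 1 and 0.
print(isinstance(True, int))
print(True + True)

# Handy: count matches by summing booleans.
positives = sum(r > 0 for r in readings)
print(positives)

# Surprising: True and 1 are the same dict key.
lookup = {1: "one"}
lookup[True] = "true"
print(lookup)
```

The counting idiom is the convenient side; the dict-key collision is the kind of thing that makes people wish the conversion had to be explicit.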
You can write hard-to-read code in any language. You can't write easy-to-read code in any language. Python tends to make easy-to-read code easier to write than Perl does.
> You can't write easy-to-read code in any language.
I've seen (rarely) old shell scripts and even assembler made readable by copious use of comments. I think the claim should rather be that you can't write easy-to-read self-documenting code in all languages.
The issue is with data movement, in my opinion. Heavily pipelined command languages like shell and stack languages tend towards unreadable because they perform actions on arbitrary data with implicit transfers between functions, which can be confusing to the uninitiated, because you have to know every (not necessarily obvious) command and its parameters beforehand.
PDL is just awesome, but I feel that Perl is a programmer's programming language. If you come from Linux, shell scripting, C/C++, etc, you'll probably handle Perl well.
However, it might be a bit too much if you are from other fields and just want to get your sums in order.
I came from Linux, shell scripting and C/C++ and still, Perl and me just didn't get along. Only after I learned Common Lisp much later in life, Perl made more sense, but by then it was almost entirely replaced by Python and Ruby.
Python got numpy fairly early on. It is fast and powerful.
Then a whole ecosystem was built on top of numpy, with scipy, pandas, PIL, et cetera. Everything that uses n-dimensional arrays used numpy as their base and as a result all those things can be combined trivially. That's very powerful.
Then later came ipython, a much improved interactive shell, and Web based notebooks that are very useful for data science work.
That the language involved is Python isn't even important, imo. Numpy + ecosystem replaced Matlab. All Python has to be is be a better language than Matlab, and it is.
I love Perl too, but it can very quickly become a "write only" language. Which is fine for what I make of it.
One thing to consider is that scientists are not necessarily programmers. There is an overlap, especially in data science, but in my experience they tend to write terrible code. That is not to be dismissive; it is just a different skill set. There would be no need for professional programmers if scientists could do better, and vice versa.
And Python is very interesting in that it is actually difficult to write terrible code with it. Forced indentation and clean syntax certainly help, and it also heavily promotes the one "pythonic" way. Contrast that with Perl's "there is more than one way to do it" philosophy.
It is not that you can't make a huge mess with Python, but it only tends to happen at the intermediate level, like when you are starting to write libraries but have not yet reached mastery.
perl doesn’t have the numerical chops to keep up, and if it started to fix that now, it has 20 years of headwind to fight through for probably marginal gains. Good luck.
Perl is also a nasty language to work with. Incredibly ugly. Even if perl was the standard for data science, I would be looking to escape it at every opportunity. In fact, that's what I did when I worked in a sector that had tooling and network effects for perl. I tried to escape using perl every chance I could.
There's R which is weird in a way like Perl is weird.
I think R is worse but it is bundled with a very good stats library. R users aren't so interested in "programming in the large" or even the small so it is OK.
Other than the meaningful indentation Python strikes me as a generic and normal programming language. I'm quite amused at how Python and Java got pattern matching at about the same time with a broadly similar approach to "fit this new feature in this language so that it feels like it belongs and plays well with decades of existing practice".
Is Perl even close to being on a top ten list of languages that make sense for data science? Not to be nasty to the OP, but the question would maybe have made sense 10+ years ago. What's baffling now is seeing Perl as a real competitor to anything.
The Perl community is horrible, that is the real reason. Criticisms of Perl's readability and the other Perl language memes are easily debunked myths. The Perl language's semantics and logic are fantastic and easily readable after you LEARN it. The Perl interpreter is fantastic, performs great and has very helpful warnings and strictures.
However the community is untalented, they historically produce very bad code, and it all started with Matt's script archive. They also create horrible sites. PerlMonks is a good example: it looks horrible and its usability is horrible. The code posted by PerlMonks users is mostly very bad. CPAN, which is often pointed to as a positive, is just a mostly poorly documented repository of very bad code. There are very few usable modules on CPAN.
Perl community has also written some of the worst technical books ever.
Talent attracts talent, that is the reason Perl is dead for any kind of usage and Python is popular.
I'm not a programmer but I did get my start in tech because I installed Mandrake on my computer for fun in the 90s and couldn't figure out how to format the drive back so I could reinstall Windows. I was accidentally forced to learn Linux, and with that things like bash and Perl. I feel like people who customize their bashrcs, tweak their IP tables and are generally comfortable scripting are the people who gravitate towards Perl, but for developers and data science folks, Python is much more accessible in terms of finding material on how to apply the language to those types of problems. When I thought about learning some more data science a few years ago, the vast majority of the tutorials on YouTube etc. were for people shifting from apps like Excel to things like SQL and Python. I didn't see anything about Perl.
C is more efficient than both of them. For that matter, so are tons of other languages.
For that matter, I'm not sure what you mean when you say Perl is "naturally more efficient" than Python. There's nothing about Perl that makes it easy to run faster, and the ability to write incomprehensible one-liners is not a very satisfying measure of "efficiency".
As for why people choose Python over Perl, Perl is a pain in the ass in a lot of ways, and I say this having written thousands of lines of Perl back in the day. Dollar signs on variable names? Obviously bad code turning out to be syntactically correct but doing weird stuff, because of strange irregular legacy syntax rules? Library code being unreadable because of the aforementioned incomprehensible one-liners?
Python is bad enough about not finding errors until the code blows up in weird ways, but Perl is worse.
But it doesn't really matter. Computers increase in speed over time. Libraries allow speed sensitive sections to use more sane data types. As you pointed out, the Perl language's design decisions cause much worse problems -- hard to identify bugs.
Data science involves building many statistical models, visualizing complex data structures in many ways, and summarizing results into "pretty" figures.
In this respect, the statistical packages and plotting software in Python are better than Perl's. I would say R is even better than Python and Perl for certain statistical analyses and for quickly plotting complex data in different ways.
Perl might beat other languages in wrangling certain types of data like comma separated values or other 2D arrays in terms of writing expressive one-liners. Perhaps that is what you mean by "naturally efficient"? How much one can do in one line of code?
Not much more than an anecdote, but I found working with julia quite a lot of fun. It was not data science, but since julia is more or less aimed at data science, I'd predict that my positive experience means julia is a very good choice there too.
There is the "time to first plot" issue, and I found that the "efficient as C, easy as Python" motto really means "efficient as C XOR easy as Python", but all in all, it's very easy to write stuff very cleanly, and the path to efficiency is quite natural if you know where to look, and the metaprogramming makes it that much more powerful.
I used Perl for LAMP apps in the 90s. Perl lost webdev to PHP (some Ruby) and science stuff to Python. Fans of OOP/functional style programming went to Ruby.
System administration never fully replaced awk/sed/bash with Perl and the new wave was all configuration management, like chef and puppet.
Python was considered to be a "clean", algol-style language, so universities started teaching it twenty years ago. Only logical after teaching Pascal for decades. Students kept using Python, so now there are lots of data science projects around.
Actually, Perl was a quite natural choice for a while for some computational biology / bioinformatics workloads. In particular, defining scripts where you expect a sequence (like DNA) as input and a filtered or modified sequence as output allowed for processing pipelines that just flowed nicely: script1 | script2 | final_script.
It's been a while since I was in that field, but I suspect those kinds of low level operations are now heavily optimized in faster languages as the sequences to operate on became longer and the operations more complex.
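For readers who never saw that style, here is a rough sketch of one such pipeline stage, written in Python rather than Perl. Everything in it is a made-up illustration: the stage (drop short sequences, emit the reverse complement), the names `run_stage` and `reverse_complement`, and the `min_len` threshold are assumptions, not anything from a real tool.

```python
import io

# Hypothetical pipeline stage: keep sequences of at least min_len bases
# and write out their reverse complement, one sequence per line.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    # Swap each base for its complement, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]

def run_stage(lines, out, min_len=4):
    for line in lines:
        seq = line.strip()
        if len(seq) >= min_len:
            out.write(reverse_complement(seq) + "\n")

# In a real pipeline this would be run_stage(sys.stdin, sys.stdout);
# here in-memory buffers stand in for the pipes.
demo_in = io.StringIO("ATCG\nAT\nGGGTTT\n")
demo_out = io.StringIO()
run_stage(demo_in, demo_out)
result = demo_out.getvalue()
```

Because each stage only reads lines and writes lines, stages written this way chain with `|` exactly like the script1 | script2 | final_script example.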
In my opinion, as someone who worked a short time in perl, perl is so much harder to learn, especially coming from other programming languages (which most data scientists learn if they started from a degree in computer science/engineering).
One of the main issues I have with perl is the `there is more than one way to do it` slogan, which, for me, means that each person who writes perl writes practically a different language. This makes the bar for starting a new project much higher, even if you have used perl before.
Data science is primarily about communication, both of your analysis to other people, including laypeople, domain experts, other data scientists, ops folks ... and those same people communicating with you about their needs, important context, expectations of SLAs or constraints ...
The main barrier to good communication in data science is the translation between the real world goals, data, algorithms, assumptions and priors, scientific and statistical methods being applied, testing, the deployed model's performance and ongoing monitoring and management, etc .. and the actual code and artifacts produced to achieve those ends.
Python wins because there is a wide ecosystem around those translation efforts.
* Visualization and statistical profiling are first class citizens, probably the killer app in terms of communicating difficult mathematical concepts
* Easy to extend, very "framework-friendly", so you can "speak the same language" between data engineering, DS, analysts, MLOps folks
* Needless to say, network effects of a community used to Pythonic idioms
In general Perl is faster than Python. But Python has had a lot of performance enhancements when it comes to crunching numbers so Python is faster for 'data science' stuff.
I think efficiency is not that important. When I was doing research in China, we would use python to iterate quickly and visualize what we were doing. Once we were done, either performance was not really an issue, or we would just translate the code to C++ anyway.
I'm ignorant of Perl, so it is possible that this is off base, but:
If I were writing something where I actually cared about low-level performance (a 'let's see what we can get the compiler to inline and unroll' sort of code), I guess I'd
1) start by writing pseudocode
2) code it in C or Fortran.
The fact that there's a runnable version of pseudocode called Python means that often people will stop at the first step, realize computers are incredibly fast, and be happy enough with not writing the Fortran (just sprinkle in some NUMPY for the crunchy bits).
Lots of cases can be handled with large calls to heavily tuned libraries anyway, where most programmers won't beat the library in C or Fortran, let alone Perl.
Every time I look at an example I am turned off and my mind is in a way "boggled" ...
I just opened the tutorial on the Raku site I see this
say looks_like_number "foo";
I find this awful,
right from the beginning, the say turns me off, computers do not speak, at least I really hope the computer is not actually speaking when I type that. Then there are the weird rules of quoting versus not quoting ... I can't quite tell what the rules are
the example is unpalatable, why does the looks_like_number word have underscores? is that an odd variable name that needs underscores? but it looks like a message to the user, what a terrible choice either way!
Well, in a way this is funny. As far as I can see, that example is taken from the page describing the differences between Perl and Raku (https://docs.raku.org/language/5to6-nutshell#CPAN), and how you can use Perl inside Raku using the Inline::Perl5 module, and how you can load a Perl module that exports a "looks_like_number" subroutine, and use it from Raku.
The example for "looks_like_number" has probably been chosen because many experienced Perl programmers are familiar with it.
In Raku, you can use kebab-case, and if there would be such a routine in Raku, it would probably be called "looks-like-number", and return a Bool, as in either True or False.
"say" is also builtin in Perl, at least for the past 12 years or so. And if you are on a Mac and run "say this is your computer talking to you", it will talk to you.
In any case, the thing you appear to refer to, is in no way a tutorial, but rather documentation intended for a specific group of people: Perl programmers.
Because it was used by a newsroom in Lawrence, Kansas and they wrote a web framework in python called Django. Their journalistic culture led Django to have really great documentation, which led to an influx of more casual developers who had an expectation of great documentation to make up for the fact that their main expertise was elsewhere. This made libraries with good docs more successful, making it easier for university courses to choose python as an initial language. This made it easier for university labs to agree on python as a language. This led Travis Oliphant & friends to develop numpy etc.
Does Perl have mature equivalents for pandas and sklearn (setting aside what everyone else is saying about numpy)? The python ecosystem has a bunch of killer apps that make the workaday tasks of data science extremely ergonomic. R is similar with the tidyverse imo, but I don’t know of other languages with a comparable package landscape.
Quick addendum: data science != computer science, most data scientists learn coding on top of another skillset, not as their primary area of expertise, so things like under-the-hood efficiency are often second order concerns to learn-ability, ease of use and maintenance.
Perl doesn't have the tooling/library ecosystem Python does for Data Science. Additionally, a lot of Data Science people come from maths/stats, and Python is easier to begin with. In my experience most Data Scientists aren't full blown devs since you focus on different things (business aspects of what you do vs scalability).
Before Python, Matlab was the most widely used language for the same use case, but numpy/matplotlib essentially allows you to write Matlab in Python, and Python is much more ergonomic than Matlab for "business logic"
Perl was my first language in 2001. I now know 10 different languages and perl is the worst one to read and my second least favorite language I know right after php. I would rather eat my own vomit than ever use perl again.
Perl is what I am most proficient in, and have already completed various AI projects with, but my colleagues tell me it will be worth it to learn how to program in python, even though I will be set back in the short term.
This doesn’t answer your question, but Python has no equivalent for Perl Pie or the other inline terminal features, so for shell one-liners Perl is still heavily relevant because Python doesn’t offer the same functionality.
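For comparison, the nearest standard-library analogue to `perl -pi -e 's/foo/bar/' file` is probably `fileinput` with `inplace=True`. The sketch below is only an illustration: `substitute_in_place` and the temp-file demo are made-up names, and it is a small script rather than a one-liner, which rather supports the point above.

```python
import fileinput
import os
import re
import tempfile

def substitute_in_place(path, pattern, replacement):
    # With inplace=True, fileinput redirects standard output into the file
    # being read, so print() rewrites each line in place.
    with fileinput.input(files=[path], inplace=True) as stream:
        for line in stream:
            print(re.sub(pattern, replacement, line), end="")

# Demo on a throwaway temp file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo one\nfoo two\n")
    demo_path = tmp.name

substitute_in_place(demo_path, r"foo", "bar")
with open(demo_path) as fh:
    edited = fh.read()
os.remove(demo_path)
```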
Well, one reason could be that for the longest time perl couldn't distinguish an int from a long when doing computation. Perl is just not designed for any kind of math.
Yes, the problem was that to read Perl you had to know every way to express things, and Perl was much, much too flexible there. I wrote a bunch of Perl in the early 2000s and I hated it. Both Ruby and Python were far superior. Even Java was better (albeit with other caveats). I hated Perl. List vs. scalar context is one example of a "feature" in Perl that causes confusion in readers.
In science doing things in a different way is good, not bad. If everybody uses the same tool you kill the cross-pollination between people with different skills and innovation often comes from that.
Exploring Bioperl for genetics leads to several big rewards, but you need to want to go the extra mile and do it. And it is not easy. In the same way, exploring R in the past, when no one used it, was definitely beneficial to me. R is awesome.
Everything is readable if you find the right people to read it. People at the university just keep using the same thing as their boss, so you really can't always choose. Maybe Perl wizards were highly sought after and didn't remain in the university. Maybe it is just a question of different generations of programmers and Perl is a little older. Dunno.
I started using python for data science-esque tasks back in late undergrad/early grad school before python had really caught on for scientific computing (i.e. ~2003-2005, back before numpy proper, when numeric and/or numarray were the containers of choice).
At the time, I did actually use perl a fair bit for data munging. Perl is a lot nicer for anything that required lots of system calls and involved a lot of pure text processing, but that only goes so far. The standard pattern was pre-process ascii data in perl, write ascii or simple binary formats out to disk, invoke some system executable (often something written in fortran) on the file you've written out, read back in the output. Perl is definitely nicer than python for that workflow. However, that workflow has severe limitations. The roundtrip to disk / stdout / etc is pretty crappy for some things.
There really weren't good numeric data containers in perl, at least that I was aware of at the time. Even before numpy, there was numeric. Numpy/numeric focus on c-like in-memory arrays that can be semi-directly passed into / referenced from low-level libraries. That's huge -- suddenly it's easy to manipulate large numeric datasets in memory and _maintain memory efficiency_. No linked lists, very clear rules about what creates intermediate copies, etc. You then can pass these directly into C / Fortran routines without a copy in many cases. (Okay, that last part is non-trivial, especially at the time, but very possible.)
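The memory-layout point can be seen even without numpy. The sketch below uses the standard library's `array` and `memoryview` purely as stand-ins (an assumption made for illustration; Numeric/numpy arrays are far richer) to show a typed contiguous buffer, a slice that shares memory, and a slice that copies:

```python
from array import array

# A typed, contiguous buffer of C doubles rather than a list of boxed objects.
samples = array("d", [0.5, 1.5, 2.5, 3.5])

view = memoryview(samples)   # wraps the same buffer, no copy
window = view[1:3]           # slicing a memoryview does not copy either
window[0] = 10.0             # writes through to samples

copied = samples[1:3]        # slicing the array itself makes a copy
copied[1] = -1.0             # so this leaves samples untouched

after = samples.tolist()
```

Knowing which operations share the buffer and which ones copy is the "very clear rules about intermediate copies" idea in miniature.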
Then there's plotting. Folks forget just how interactive matplotlib is, and was from the very early days. From the perl side, I was using gnuplot/etc (and even more of a domain specific tool called GMT). That meant static figures. Matplotlib meant I got an interactive figure and something that I could easily embed in Tk to make quick GUIs.
I also used Matlab heavily at the time, but it was pretty difficult for the things that needed to interact with everything else (read: old F77 routines and proprietary domain-specific data processing tools). Licensing was also an issue, as there were a limited number of matlab licenses, and you couldn't reliably count on being able to check one out, especially for cron-esque jobs.
Python bridged the two. You had a matlab like environment, decent data munging ability, a good language, and also a good environment for building other tools. This was all possible in Perl, in principle, but the key tools weren't there in Perl, even almost 20 years ago. Basically Perl couldn't replace Matlab easily and Python could.
So why didn't they get built in Perl instead of Python initially? I suspect the short answer is operator overloading. Python is _really_ nice for that, and it's a very nice way of having flexible array manipulation syntax. Second to that is that Python is more readable, and readability matters in the long term.
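To make the operator overloading point concrete, here is a toy sketch of how Python's special methods give array-style syntax. `Vec` is a hypothetical class invented for this illustration and has nothing to do with how numpy is actually implemented:

```python
class Vec:
    """Toy 1-D array type showing arithmetic and slicing via special methods."""

    def __init__(self, values):
        self.values = list(values)

    def _coerce(self, other):
        # Accept another Vec, or broadcast a plain number to our length.
        if isinstance(other, Vec):
            return other.values
        return [other] * len(self.values)

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self.values, self._coerce(other)))

    def __mul__(self, other):
        return Vec(a * b for a, b in zip(self.values, self._coerce(other)))

    __radd__ = __add__   # lets `1 + vec` work
    __rmul__ = __mul__   # lets `2 * vec` work

    def __getitem__(self, key):
        picked = self.values[key]
        return Vec(picked) if isinstance(key, slice) else picked


x = Vec([1, 2, 3])
y = 2 * x + Vec([10, 20, 30])   # reads like the math it expresses
tail = y[1:]                    # slicing returns another Vec
```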
Also, don't discount how big of a deal having Tk support by default in python is, though. Yeah, sure, these days folks completely ignore desktop GUIs, but at the time web apps were pretty irrelevant. Desktop GUIs were everything. Being able to whip up a quick reusable gui data processing application that a random lab assistant or new grad student could easily use was/is a very big deal, and that was way easier in Python than most other things, especially at the time.