I think we need much better benchmarks to capture the real complexity of typical day-to-day development.
I gave it my typical CI bootstrapping task:
> Generate gitlab ci yaml file for a hybrid front-end/backend project. Frontend is under /frontend and is a node project, packaged with yarn, built with vite to the /backend/public folder. The backend is a python flask server built with poetry. The deployable artifact should be uploaded to a private pypi registry on pypi.example.com. Use best practices recommended by tool usage.
and it generated scripts with docker run commands [1]:
This feels more like "connect the dots", or a very rough sketch that might end up completely replaced. The commands themselves seem fine (yarn install && yarn build, poetry build && poetry publish), but the docker run wrappers would be better expressed simply as an "image:" attribute on each job. When I asked about that, I got a generic "why docker is useful" non-answer.
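To illustrate what I mean by the "image:" attribute: in GitLab CI each job can declare the container it runs in, which removes the docker run wrappers entirely. A minimal sketch (job names and image tags are my own, not what the model emitted):

```yaml
# Each job declares its own container image; no `docker run` wrapper needed.
build-frontend:
  image: node:20        # current LTS, rather than the node 14 the model picked
  script:
    - cd frontend
    - yarn install
    - yarn build

build-backend:
  image: python:3.12
  script:
    - pip install poetry   # poetry isn't in the base python image
    - cd backend
    - poetry build
```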
It also introduced a parallel build stage: the frontend and backend are built at the same time, even though my prompt deliberately included a serial dependency: the frontend output goes into the backend project. The parallel approach would of course be better if the pipeline correctly assembled the final artifact before uploading, but it doesn't. Somewhat surprisingly, the node install and poetry install could actually run in parallel as written, yet the generated code runs them serially.
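The serial dependency I had in mind can be made explicit in GitLab CI with "artifacts:" and "needs:": the frontend job publishes its vite output, and the backend job waits for it and downloads it before packaging. A sketch, again with illustrative job names:

```yaml
stages:
  - build
  - package

build-frontend:
  stage: build
  image: node:20
  script:
    - cd frontend && yarn install && yarn build   # vite configured to emit to ../backend/public
  artifacts:
    paths:
      - backend/public/

build-backend:
  stage: package
  image: python:3.12
  needs: ["build-frontend"]   # explicit ordering: fetches the frontend artifacts first
  script:
    - pip install poetry
    - cd backend && poetry build
```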
It also uses outdated versions of tools. Python 3.8 still seems OK and shows up in many online examples due to compatibility quirks with compiled libraries, but node 14 is more than 3 years old now; the current node LTS is 20.
For comparison, here's the chatgpt4 version [2]:
Not perfect, but it catches a lot more nuance:
- Uses python as the base image, but adds node to it (not a big fan of installing tools during the build, but at least it took care of that set-up)
- Took care of passing the artefacts built by the frontend, and explicitly navigates to the correct directories (cd frontend ; ... ; cd ../backend)
- The --no-dev flag passed to `poetry install` is a great touch
- Added "artifacts:" for a good troubleshooting experience
- Gave the job an "only: main" qualifier, so it at least considered a branching strategy
- Disabled virtualenv creation in poetry. I'm not a fan, but it makes sense on CI
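Taken together, the practices in that list sketch out roughly this kind of job. To be clear, this is my own reconstruction, not either model's output: the image tag, the apt-based node install, and the `pypi-private` repository alias are all illustrative assumptions (the alias would need `poetry config repositories.pypi-private https://pypi.example.com` plus credentials configured separately):

```yaml
build-and-publish:
  image: python:3.12                 # python base image, node bolted on below
  only:
    - main                           # branch qualifier, as in the generated file
  script:
    # node added onto the python base image (the "installing tools during build" caveat)
    - apt-get update && apt-get install -y nodejs npm
    - npm install -g yarn
    - pip install poetry
    - poetry config virtualenvs.create false      # no virtualenv on CI
    - cd frontend && yarn install && yarn build && cd ../backend
    - poetry install --no-dev                     # skip dev dependencies
    - poetry build
    - poetry publish --repository pypi-private    # illustrative alias for pypi.example.com
  artifacts:
    paths:
      - backend/dist/                # keep the built wheel/sdist for troubleshooting
```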
I would typically add even more complexity to that file (for example, using commitizen for releases), and only with gpt4 do I feel confident it won't fall apart completely.
EDIT: Yes, gpt4 did ok-ish with releases. When I pointed out some flaws, it responded with:
> You're correct on both counts, and I appreciate your attention to detail.
Links:
- [1] https://www.phind.com/agent?cache=clsye0lmt0019lg08bg09l2cf
- [2] https://chat.openai.com/share/67d50b56-3b68-4873-aa56-20f634...