Setting Up a Jetson Orin Nano for Local LLM Inference (2026)

What we're building

In this tutorial, we'll set up a Jetson Orin Nano board as a local inference engine for large language models (LLMs). The result is a cost-effective way to run LLMs on our own premises without relying on cloud services, which suits anyone who wants to keep control of their data rather than send it to a third party.

The end result will be a compact, energy-efficient device that runs LLMs locally for applications such as natural language processing (NLP), text classification, and sentiment analysis. With this setup, we can expect inference speeds of around 10-20 tokens per second, while using approximately 2GB of memory and drawing around 15W of power. Note that the Jetson Orin Nano has no dedicated VRAM: the CPU and GPU share one pool of system memory.

What you need

Step-by-step

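Begin by writing an Ubuntu 22.04 image to the microSD card for your Jetson Orin Nano. On a Jetson this normally means NVIDIA's JetPack SD card image, which is based on Ubuntu 22.04 from JetPack 6 onward; substitute its filename for the image name below. Run the command from a separate Linux machine with the card inserted, and confirm the device name with lsblk first, because dd overwrites whatever of= points to.

sudo dd bs=4M if=ubuntu-22.04-lite.iso of=/dev/mmcblk0 status=progress conv=fsync

Expected output: dd prints the number of bytes copied when it finishes. Writing the image takes around 15 minutes.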
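Before flashing, it can be worth confirming that the downloaded image is intact. The sketch below is one way to do that, assuming the download page publishes a SHA-256 digest for the image; the function names are hypothetical and not part of any NVIDIA tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published value (assumed to be hex text)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `image_matches(Path("ubuntu-22.04-lite.iso"), published_digest)` with the digest copied from the download page returns `True` only when the file is byte-for-byte what was published, which is a cheap check to run before spending 15 minutes on a write.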

Once installed, update the package list and upgrade the installed packages:

sudo apt update && sudo apt full-upgrade -y

Expected output: A successful upgrade with no errors reported.

Install the NVIDIA JetPack SDK components:

sudo apt install nvidia-jetpack -y

Expected output: The installation process will take around 5 minutes.
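After the SDK installs, a quick way to see which Jetson Linux (L4T) release the board is running is to read `/etc/nv_tegra_release`. The sketch below assumes that file exists and that its first line looks like `# R36 (release), REVISION: 3.0, ...`, which is the layout typically seen on Jetson images; treat the pattern as an assumption and adjust it if your file differs.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed location and layout of the L4T release file on a Jetson image.
RELEASE_FILE = Path("/etc/nv_tegra_release")
_PATTERN = re.compile(r"R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)")


def parse_l4t_release(text: str) -> Optional[Tuple[int, str]]:
    """Extract (major release, revision) from release-file text, or None if absent."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def installed_l4t_release() -> Optional[Tuple[int, str]]:
    """Read the release file if present; returns None on non-Jetson machines."""
    if not RELEASE_FILE.exists():
        return None
    return parse_l4t_release(RELEASE_FILE.read_text())


if __name__ == "__main__":
    print(installed_l4t_release())
```

On a board flashed with an Ubuntu 22.04-based JetPack 6 image, the major release is expected to be 36; a `None` result suggests the script is not running on a Jetson image at all.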

Clone the LLM model repository and download the pre-trained model (the URLs below are placeholders; substitute your own):

git clone https://github.com/your-repo-name/llm-model.git && cd llm-model && wget https://your-model-url/model.pth

Expected output: A successful clone and download with no errors reported.

Compile and deploy the LLM model on the Jetson Orin Nano:

sudo ./build.sh && sudo ./deploy.sh

Expected output: The compilation process will take around 10 minutes, followed by a successful deployment.

Test the LLM inference engine:

sudo python3 -m infer --model /path/to/model.pth --input /path/to/input.txt

Expected output: A successful inference with no errors reported and expected results printed to the console.
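To put a number on the test run, throughput can be measured by timing one generation call. The sketch below is a minimal illustration: `generate` is a hypothetical callable standing in for whatever function your inference code exposes, assumed to take a prompt and return the generated tokens as a sequence.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[str], Sequence], prompt: str) -> float:
    """Time a single generation call and return generated tokens per second.

    `generate` is a placeholder for your own inference function; it is assumed
    to block until generation finishes and to return the generated tokens.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time was not positive; cannot compute a rate")
    return len(tokens) / elapsed
```

A single call includes model warm-up and prompt processing, so averaging over several prompts after one discarded warm-up run tends to give a steadier figure.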

Troubleshooting

### GPU Memory Error

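Cause: Not enough free memory on the Jetson Orin Nano. The GPU has no dedicated VRAM and shares system memory with the CPU.

Fix: Use a smaller or more heavily quantized model, or free up 1-2GB of memory by closing other applications.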
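A rough estimate of whether a model can fit is parameter count times bytes per weight, compared against the memory currently available. The sketch below covers the weights only (no context cache or runtime overhead), and the 7-billion-parameter model in the example is a hypothetical figure chosen for illustration.

```python
from pathlib import Path


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB (weights only, no overhead)."""
    return n_params * bits_per_weight / 8 / 2**30


def parse_mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo-style text (reported in kB) as GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 2**20
    raise ValueError("MemAvailable not found")


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at two precisions.
    print(f"4-bit : {weights_gib(7e9, 4):.2f} GiB")   # about 3.26 GiB
    print(f"16-bit: {weights_gib(7e9, 16):.2f} GiB")  # about 13.04 GiB
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        print(f"available now: {parse_mem_available_gib(meminfo.read_text()):.2f} GiB")
```

If the weights alone come close to the available figure, the model is unlikely to load once runtime buffers are added, so a smaller or more aggressively quantized model is the safer choice.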

### Model Not Found Error

Cause: Incorrect model path or file format.

Fix: Verify the model path and file format, and make sure the model is correctly deployed on the Jetson Orin Nano.
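The path and format checks described above can be sketched as a small helper that reports every problem it finds before inference is attempted. The `.pth` suffix matches the file used in this tutorial; the function name is hypothetical.

```python
from pathlib import Path
from typing import List


def check_model_path(path_str: str, expected_suffix: str = ".pth") -> List[str]:
    """Return a list of human-readable problems; an empty list means the path looks fine."""
    problems: List[str] = []
    path = Path(path_str).expanduser()
    if not path.exists():
        problems.append(f"{path} does not exist")
    elif not path.is_file():
        problems.append(f"{path} is not a regular file")
    else:
        if path.suffix != expected_suffix:
            problems.append(f"expected a {expected_suffix} file, got '{path.suffix}'")
        if path.stat().st_size == 0:
            problems.append(f"{path} is empty (the download may have failed)")
    return problems
```

This only checks the filesystem; it does not confirm that the file contents are a valid model checkpoint.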

### CUDA Version Mismatch Error

Cause: The CUDA version the inference code was built against does not match the CUDA version installed on the Jetson Orin Nano.

Fix: Rebuild the code against the installed CUDA version, or install a JetPack release that ships the CUDA version the code expects.
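To see which CUDA toolkit is actually installed, the output of `nvcc --version` can be parsed for its `release X.Y` field. The sketch below assumes `nvcc` is on the `PATH`; on some Jetson images it lives under `/usr/local/cuda/bin` and has to be added first. The helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from nvcc output such as '... release 12.2, V12.2.140'."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release() -> Optional[Tuple[int, int]]:
    """Run `nvcc --version` if nvcc is on PATH; return None when it is not found."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print(installed_cuda_release())
```

Comparing this tuple with the CUDA version your inference code was built for shows quickly whether a rebuild is needed.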

### System Not Supported Error

Cause: Incorrect Ubuntu version or architecture on the Jetson Orin Nano.

Fix: Install the correct Ubuntu version (22.04) and architecture (arm64) on the Jetson Orin Nano.

Performance and what to expect

Tokens per second: 10-20
Memory usage: 2GB (shared between CPU and GPU)
Power draw: 15W
Temperatures: Average temperature around 40°C (104°F), with a maximum temperature of 50°C (122°F)
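The temperature figures above can be checked on your own board by reading the kernel's thermal zones. The sketch below relies on the standard Linux sysfs layout, where each `/sys/class/thermal/thermal_zone*` directory has a `type` file and a `temp` file in millidegrees Celsius; which zones appear on a given Jetson image is not something this tutorial specifies.

```python
from pathlib import Path
from typing import Dict

THERMAL_ROOT = Path("/sys/class/thermal")


def millidegrees_to_celsius(raw: str) -> float:
    """Convert a sysfs temperature reading (millidegrees C as text) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(root: Path = THERMAL_ROOT) -> Dict[str, float]:
    """Map each readable thermal zone's type name to its temperature in degrees C."""
    readings: Dict[str, float] = {}
    for zone in sorted(root.glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            readings[name] = millidegrees_to_celsius((zone / "temp").read_text())
        except (OSError, ValueError):
            # Some zones cannot be read or report non-numeric values; skip them.
            continue
    return readings


if __name__ == "__main__":
    for zone_name, celsius in read_thermal_zones().items():
        print(f"{zone_name}: {celsius:.1f} C")
```

Sampling this during a long generation run, rather than at idle, gives a more honest picture of the maximum temperature.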

Keep in mind that these performance numbers are approximate and may vary depending on the specific LLM model used and the complexity of your inference tasks.
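To make the figures above more concrete, the short calculation below turns them into energy per token and time per response. It uses only the numbers quoted in this section; the 500-token response length is an arbitrary example.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token (watts are joules per second)."""
    return power_watts / tokens_per_second


def seconds_for_response(n_tokens: int, tokens_per_second: float) -> float:
    """Time needed to generate a response of n_tokens at a steady rate."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    for rate in (10.0, 20.0):
        energy = joules_per_token(15.0, rate)
        wait = seconds_for_response(500, rate)
        print(f"{rate:.0f} tok/s: {energy:.2f} J/token, {wait:.0f} s for 500 tokens")
    # 10 tok/s: 1.50 J/token, 50 s for 500 tokens
    # 20 tok/s: 0.75 J/token, 25 s for 500 tokens
```

In other words, at the quoted 15W a 500-token answer costs somewhere between 375 and 750 joules and takes 25 to 50 seconds, which is a useful baseline when comparing against your own measurements.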

Common questions

Q: Can I use this setup for other AI applications? A: Yes, the Jetson Orin Nano is a versatile platform that can be used for various AI applications beyond LLM inference. However, you may need to adjust the setup accordingly and consider different hardware requirements.

Q: Is there a limit to the size of the LLM model I can use? A: Yes. The Jetson Orin Nano has at most 8GB of memory, shared between the CPU and GPU, and it cannot be expanded, so the model, its runtime buffers, and the operating system all have to fit in that budget. Choose a small or quantized model and optimize it for efficient inference.

Q: Can I scale this setup for larger AI workloads? A: Yes, you can scale this setup by adding more Jetson Orin Nano boards or using a distributed computing architecture. However, you'll need to consider factors such as power consumption, thermal management, and networking infrastructure.

The verdict

Setting up a Jetson Orin Nano for local LLM inference is a cost-effective way to keep control over your AI workloads. It is not suited to large-scale industrial applications, but it's a good choice for hobbyists, researchers, and small businesses looking to develop AI-powered products.

If you're new to AI development or want to explore other AI applications, I recommend starting with a more straightforward project, such as computer vision or robotics. However, if you're interested in LLM inference specifically, this setup is an excellent starting point for building your own local AI engine.
