Loading Data
Loading data using Readers into Documents
Before you can start indexing your documents, you need to load them into memory.
A reader is a module that loads data from a file into a Document object.
To install readers call:
We offer readers for different file formats.
import { CSVReader } from '@llamaindex/readers/csv';
import { DocxReader } from '@llamaindex/readers/docx';
import { HTMLReader } from '@llamaindex/readers/html';
import { ImageReader } from '@llamaindex/readers/image';
import { JSONReader } from '@llamaindex/readers/json';
import { MarkdownReader } from '@llamaindex/readers/markdown';
import { ObsidianReader } from '@llamaindex/readers/obsidian';
import { PDFReader } from '@llamaindex/readers/pdf';
import { TextFileReader } from '@llamaindex/readers/text';
SimpleDirectoryReader
LlamaIndex.TS supports easy loading of files from folders using the SimpleDirectoryReader class.
It is a simple reader that reads all files from a directory and its subdirectories and delegates the actual reading to the reader specified in the fileExtToReader map.
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("../data");
documents.forEach((doc) => {
  console.log(`document (${doc.id_}):`, doc.getText());
});
Currently, the following readers are mapped to specific file types:
- TextFileReader: .txt
- PDFReader: .pdf
- CSVReader: .csv
- MarkdownReader: .md
- DocxReader: .docx
- HTMLReader: .htm, .html
- ImageReader: .jpg, .jpeg, .png, .gif
You can modify the reader in three different ways:
- overrideReader overrides the reader for all file types, including unsupported ones.
- fileExtToReader maps a reader to a specific file type. It can override the reader for existing file types or add support for new file types.
- defaultReader sets a fallback reader for files with unsupported extensions. By default it is TextFileReader.
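The way these three options interact can be sketched without the library. The following is a minimal illustration, not the actual SimpleDirectoryReader implementation: SketchReader, SketchOptions, fileExtension, and pickReader are hypothetical names, and the precedence (overrideReader first, then fileExtToReader, then defaultReader) is an assumption inferred from the descriptions above.

```typescript
// Hypothetical stand-in for a reader; real readers load files into Document objects.
interface SketchReader {
  name: string;
}

// Mirrors the three option names from the text; the shape itself is an assumption.
interface SketchOptions {
  overrideReader?: SketchReader;
  fileExtToReader?: Record<string, SketchReader>;
  defaultReader?: SketchReader;
}

// Returns the lower-cased extension without the dot, or "" if there is none.
function fileExtension(filePath: string): string {
  const base = filePath.split("/").pop() ?? "";
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot + 1).toLowerCase() : "";
}

// Assumed precedence: overrideReader, then fileExtToReader, then defaultReader.
function pickReader(
  filePath: string,
  options: SketchOptions,
): SketchReader | undefined {
  if (options.overrideReader) {
    return options.overrideReader;
  }
  const ext = fileExtension(filePath);
  const map = options.fileExtToReader ?? {};
  if (Object.prototype.hasOwnProperty.call(map, ext)) {
    return map[ext];
  }
  // May be undefined, in which case the caller has no reader for this file.
  return options.defaultReader;
}

const options: SketchOptions = {
  fileExtToReader: { pdf: { name: "PDFReader" } },
  defaultReader: { name: "TextFileReader" },
};

console.log(pickReader("../data/report.pdf", options)?.name); // PDFReader
console.log(pickReader("../data/notes.xyz", options)?.name); // TextFileReader
```

In this sketch an unknown extension such as .xyz falls through to the fallback reader, while setting overrideReader would bypass the extension map entirely.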
SimpleDirectoryReader supports up to 9 concurrent requests. Use the numWorkers option to set the number of concurrent requests. By default, numWorkers is 1, which means files are read sequentially.
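To illustrate what a bounded number of concurrent requests means, here is a minimal worker-pool sketch. It is not the library's implementation: mapWithWorkers and fakeLoad are hypothetical helpers, and rejecting values outside 1 to 9 with a RangeError is an assumption about how the documented limit might be enforced.

```typescript
// Hypothetical helper, not part of @llamaindex/readers.
// Runs `task` over `items` with at most `numWorkers` tasks in flight.
async function mapWithWorkers<T, R>(
  items: readonly T[],
  numWorkers: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  // Assumption: out-of-range values are rejected rather than clamped.
  if (!Number.isInteger(numWorkers) || numWorkers < 1 || numWorkers > 9) {
    throw new RangeError("numWorkers must be an integer between 1 and 9");
  }

  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      // Claiming happens synchronously, so no two workers take the same item.
      const index = next++;
      results[index] = await task(items[index] as T);
    }
  }

  // Never start more workers than there are items.
  const workerCount = Math.min(numWorkers, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Simulated file load that stands in for a real reader call.
async function fakeLoad(path: string): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  return `contents of ${path}`;
}

mapWithWorkers(["a.txt", "b.txt", "c.txt"], 2, fakeLoad).then((docs) => {
  console.log(docs.length); // 3
});
```

With numWorkers set to 1 the loop degenerates to sequential processing, which matches the default described above. Results keep the order of the input list regardless of which worker finishes first.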
Example
import {
  FILE_EXT_TO_READER,
  SimpleDirectoryReader,
} from "@llamaindex/readers/directory";
import { TextFileReader } from "@llamaindex/readers/text";
import type { Document, Metadata } from "llamaindex";
import { FileReader } from "llamaindex";

// Custom reader for a file type that has no built-in reader.
class ZipReader extends FileReader {
  loadDataAsContent(fileContent: Uint8Array): Promise<Document<Metadata>[]> {
    throw new Error("Implement me");
  }
}

const reader = new SimpleDirectoryReader();
const documents = await reader.loadData({
  directoryPath: "../data",
  // Fallback for files with unsupported extensions.
  defaultReader: new TextFileReader(),
  // Keep the built-in mappings and add support for .zip files.
  fileExtToReader: {
    ...FILE_EXT_TO_READER,
    zip: new ZipReader(),
  },
});

documents.forEach((doc) => {
  console.log(`document (${doc.id_}):`, doc.getText());
});
Tips when using in non-Node.js environments
When using @llamaindex/readers in a non-Node.js environment (such as Vercel Edge or Cloudflare Workers), some classes are not exported from the top-level entry file.
The reason is that some classes, such as PDFReader, are only compatible with the Node.js runtime because they use Node.js-specific APIs (like fs, child_process, and crypto).
If you need any of those classes, you have to import them directly through their file path in the package.
Because the PDFReader does not work with the Edge runtime, here's how to use the SimpleDirectoryReader with the LlamaParseReader to load PDFs:
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";
import { LlamaParseReader } from "@llamaindex/cloud";
export const DATA_DIR = "./data";
export async function getDocuments() {
  const reader = new SimpleDirectoryReader();
  // Load PDFs using LlamaParseReader
  return await reader.loadData({
    directoryPath: DATA_DIR,
    fileExtToReader: {
      pdf: new LlamaParseReader({ resultType: "markdown" }),
    },
  });
}
Note: Reader classes have to be added explicitly to the fileExtToReader map in the Edge version of the SimpleDirectoryReader.
You'll find a complete example with LlamaIndexTS here: https://github.com/run-llama/create_llama_projects/tree/main/nextjs-edge-llamaparse
Load file natively using Node.js Customization Hooks
We provide a helper utility that lets you import a file directly in a Node.js script.
node --import @llamaindex/readers/node ./script.js
import csv from './path/to/data.csv';
const text = csv.getText();