
Okay, folks, here’s the deal. This project shows how to pair OpenAI with a knowledge base or other documents. The cool part? We can run semantic searches and craft prompts that shape the LLM’s response just the way we like.

This project contains a Streamlit Chat interface and a Luigi ETL Pipeline that processes and stores documents into a Weaviate Vectorstore instance.

The ETL pipeline performs several tasks: converting Jupyter notebooks and Python scripts to Markdown format, cleaning the code blocks in the Markdown files, removing unnecessary files and directories, and uploading the processed data to a Weaviate Instance.
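To make the conversion step concrete, here is a minimal sketch of turning a Python script into a Markdown file. It is an illustration only, not the project's actual code: the function names `python_to_markdown` and `convert_file` are hypothetical, and the assumption is that conversion simply wraps the source in a fenced code block under a heading.

```python
from pathlib import Path


def python_to_markdown(source: str, title: str) -> str:
    """Wrap Python source in a fenced code block under a Markdown heading."""
    fence = "`" * 3  # built at runtime so the fence never appears literally here
    body = source.rstrip("\n")
    return f"# {title}\n\n{fence}python\n{body}\n{fence}\n"


def convert_file(py_path: Path) -> Path:
    """Write a .md file next to py_path and return the new path (hypothetical helper)."""
    md_path = py_path.with_suffix(".md")
    text = py_path.read_text(encoding="utf-8")
    md_path.write_text(python_to_markdown(text, py_path.name), encoding="utf-8")
    return md_path
```

Calling `convert_file(Path("example.py"))` would write `example.md` alongside the script; the original `.py` file is left in place for a later cleanup step to remove.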


Before you start, ensure that you have Python 3.11 or later installed.

  1. Clone the repository:
git clone

cd etl-pipeline-for-langchain-docs
  2. Set up the project by installing the dependencies with the setup directive. This creates a virtual environment and installs the packages specified in the pyproject.toml file.
make setup
make add

make run-etl-daemon

Environment Variables (.env)

Before running the application, set the following environment variables:


Running the ETL Pipeline

You can start the ETL pipeline by running the run-etl-orchestrator directive. This command starts the Luigi pipeline with the orchestrator task.

cd etl

poetry run python -m luigi --module orchestrator Orchestrator

To monitor the progress of the ETL pipeline, open the Luigi visualizer with the run-etl-visualizer directive. This command opens the visualizer interface in a web browser at http://localhost:8082.

make run-etl-visualizer

Luigi Tasks (ETL Orchestrator)


The ETL pipeline consists of several Luigi tasks:

  • BuildDataLake: Downloads data from a repository and builds a data lake.
  • ConvertIPYNBtoMyst: Converts Jupyter notebooks to MyST format.
  • ConvertPytoMd: Converts Python files to Markdown format.
  • DeleteNonMdFiles: Deletes files that are not in Markdown format.
  • RemoveEmptyDirectories: Removes empty directories.
  • CleanupCode: Cleans up the code in the Markdown files.
  • RemoveJupyterText: Removes any Jupyter-specific text from the Markdown files.
  • Upsert: Uploads the processed data to a Weaviate Vectorstore instance.

These tasks are orchestrated by the Orchestrator task (the task name passed to Luigi in the command above) in the script.
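The two cleanup tasks in the list above (DeleteNonMdFiles and RemoveEmptyDirectories) can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the project's Luigi implementation: the function names are hypothetical, and the assumption is that "Markdown" means files with a `.md` suffix.

```python
from pathlib import Path


def delete_non_markdown(root: Path) -> int:
    """Delete every file under root whose suffix is not .md; return the count."""
    removed = 0
    # Materialize the listing first so deletions do not disturb the traversal.
    for path in list(root.rglob("*")):
        if path.is_file() and path.suffix.lower() != ".md":
            path.unlink()
            removed += 1
    return removed


def remove_empty_directories(root: Path) -> int:
    """Remove empty directories under root, deepest first; return the count."""
    removed = 0
    # Deepest directories go first, so a parent that only held
    # empty children is itself empty by the time it is checked.
    dirs = sorted(
        (p for p in root.rglob("*") if p.is_dir()),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    for directory in dirs:
        if not any(directory.iterdir()):
            directory.rmdir()
            removed += 1
    return removed
```

Running the file deletion before the directory sweep matters: a directory that only contained non-Markdown files becomes empty after the first step and is then picked up by the second.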

Running the Chat Interface

You can run the chat interface using the run-chat directive. This command starts the Streamlit application with the script.

make run-chat


Weaviate Configuration

The configuration for the Weaviate class and schema is defined in the script. The Weaviate class is named “Document”, and it uses the “text2vec-openai” vectorizer and the “generative-openai” module with the “gpt-4” model. The class has a single property “content” of data type “text”.
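As a rough sketch, the class described above could be written as a Python dictionary in Weaviate's schema format. The values come from the description; the exact layout in the project's script is an assumption and may differ.

```python
import json

# Class definition mirroring the description above. The key names follow
# Weaviate's schema format; the project's own script may be organized differently.
document_class = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "generative-openai": {"model": "gpt-4"},
    },
    "properties": [
        {"name": "content", "dataType": ["text"]},
    ],
}

# Serialize for inspection or for sending to the Weaviate REST API.
schema_json = json.dumps(document_class, indent=2)
```

Assuming the project uses the v3 Python client, a dictionary like this would typically be passed to `client.schema.create_class(document_class)`.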


