| PolarSPARC |
Evaluation :: LLM Thinking & Tooling using GLM-4.7-Flash
| Bhaskar S | 02/01/2026 |
Overview
In the following sections, we will evaluate the LLM model GLM-4.7-Flash on its Thinking and Tooling capabilities using a set of test use-cases.
The evaluation will be performed on an Ubuntu 24.04 LTS based Linux desktop with a decent 8-core/16-thread CPU, 64 GB of system memory, and an NVidia GPU with 16 GB of video memory.
Before proceeding further, ensure that Ollama is installed and set up on the desktop (see instructions).
In addition, ensure that the Python 3.1x programming language as well as the Jupyter Notebook package are installed and set up on the desktop.
Assuming the IP address of the Linux desktop is 192.168.1.25, start the Ollama platform by executing the following command in a terminal window:
$ docker run --rm --name ollama --gpus=all -p 192.168.1.25:11434:11434 -v $HOME/.ollama:/root/.ollama ollama/ollama:0.15.2
Open a new terminal window and execute the following docker command to download the GLM 4.7 Flash model:
$ docker exec -it ollama ollama run glm-4.7-flash:latest
Once the model is downloaded successfully, exit the interactive prompt by entering the following command:
>>> /bye
In the following sections, we will evaluate the model on each of the test cases.
Initialization & Setup
Install the necessary Python modules by executing the following command:
$ pip install python-dotenv langchain langchain-core langchain-ollama ollama
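Optionally, one can verify that the Ollama platform (and the downloaded model) is reachable from Python before wiring up LangChain. The following is a minimal sketch that assumes the host IP (192.168.1.25) and port from the setup above:

```python
# Minimal connectivity check against the Ollama platform started earlier;
# the host/port below assume the setup from the previous section
from ollama import Client

client = Client(host='http://192.168.1.25:11434')

# List the models available on the Ollama platform (glm-4.7-flash:latest should appear)
print(client.list())
```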
Create a file called llm-evals.env with the following environment variables defined:
LLM_TEMPERATURE=0.2
LLM_TOP_P=0.3
OLLAMA_MODEL='ollama:glm-4.7-flash:latest'
OLLAMA_BASE_URL='http://192.168.1.25:11434'
To load the environment variables and assign them to Python variables, execute the following code snippet:
from dotenv import dotenv_values
config = dotenv_values('llm-evals.env')
llm_temperature = float(config.get('LLM_TEMPERATURE'))
llm_top_p = float(config.get('LLM_TOP_P'))
ollama_model = config.get('OLLAMA_MODEL')
ollama_base_url = config.get('OLLAMA_BASE_URL')
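As an optional safeguard, one can fail fast if any of the expected variables is missing from the env file. The following is a small sketch using the config object loaded above:

```python
# Optional: fail fast if any expected variable is missing from llm-evals.env
required = ('LLM_TEMPERATURE', 'LLM_TOP_P', 'OLLAMA_MODEL', 'OLLAMA_BASE_URL')
missing = [key for key in required if not config.get(key)]
if missing:
    raise ValueError(f'Missing environment variable(s): {missing}')
```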
To initialize a client instance of Ollama running the desired LLM model glm-4.7-flash:latest, execute the following code snippet:
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model
from langchain_core.runnables import RunnableConfig
ollama_chat_llm = init_chat_model(ollama_model,
                                  base_url=ollama_base_url,
                                  temperature=llm_temperature,
                                  top_p=llm_top_p)

system_prompt = 'You are a helpful virtual assistant.'

tools_agent = (create_agent(ollama_chat_llm,
                            system_prompt=system_prompt)
               .with_config(RunnableConfig(recursion_limit=3)))
Notice the use of the model provider prefix 'ollama:' along with the model name. This prefix indicates which model provider (Ollama in this case) init_chat_model should load and initialize.
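For reference, the provider can also be selected explicitly via the model_provider argument instead of the prefix. The following is a minimal sketch that reuses the variables loaded earlier:

```python
# Alternative to the 'ollama:' prefix - name the provider explicitly;
# reuses ollama_base_url, llm_temperature, and llm_top_p from the earlier snippet
from langchain.chat_models import init_chat_model

ollama_chat_llm = init_chat_model('glm-4.7-flash:latest',
                                  model_provider='ollama',
                                  base_url=ollama_base_url,
                                  temperature=llm_temperature,
                                  top_p=llm_top_p)
```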
Test Case - 1 :: Solve a Valid Logical Puzzle
In this test case, we will evaluate the thinking and reasoning ability of the selected LLM model by prompting a valid logical puzzle.
Execute the following Python code snippet for the result:
from langchain_core.messages import AIMessage
question = """
Four people have to cross a bridge at night and they will use a torchlight. The batteries in the torchlight only lasts 15 minutes. Alice can cross in one minute, Ben in two minutes, Cindy in five minutes, and Don in eight minutes. At any point in time, only two people can cross the bridge. However, when two people cross the bridge, they have to go at the slower person’s pace. How do they all get across the bridge in 15 minutes? Attempt to solve in no more than 5 iterations. Else, say The problem is not solvable!
"""
result = tools_agent.invoke(
    {'messages': [
        {
            'role': 'user',
            'content': question
        }
    ]}
)

agent_messages = [msg for msg in result['messages'] if isinstance(msg, AIMessage)]
if agent_messages:
    print(agent_messages[-1].content)
The following was the typical output:
Here is the solution:

1. **Alice and Ben** cross the bridge. Time taken: 2 minutes.
2. **Alice** returns with the torch. Time taken: 1 minute. (Total: 3 minutes)
3. **Cindy and Don** cross the bridge. Time taken: 8 minutes. (Total: 11 minutes)
4. **Ben** returns with the torch. Time taken: 2 minutes. (Total: 13 minutes)
5. **Alice and Ben** cross the bridge. Time taken: 2 minutes. (Total: 15 minutes)

All four people have crossed the bridge in exactly 15 minutes.
Well Done - the LLM model was able to solve the logical puzzle successfully !!!
Test Case - 2 :: Solve an Invalid Logical Puzzle
In this test case, we will evaluate the thinking and reasoning ability of the selected LLM model by prompting an invalid logical puzzle.
Execute the following Python code snippet for the result:
from langchain_core.messages import AIMessage
question = """
Four people have to cross a bridge at night and they will use a torchlight. The batteries in the torchlight only lasts 15 minutes. Alice can cross in two minutes, Ben in three minutes, Cindy in five minutes, and Don in eight minutes. At any point in time, only two people can cross the bridge. However, when two people cross the bridge, they have to go at the slower person’s pace. How do they all get across the bridge in 15 minutes? Attempt to solve in no more than 5 iterations. Else, say The problem is not solvable!
"""
result = tools_agent.invoke(
    {'messages': [
        {
            'role': 'user',
            'content': question
        }
    ]}
)

agent_messages = [msg for msg in result['messages'] if isinstance(msg, AIMessage)]
if agent_messages:
    print(agent_messages[-1].content)
The following was the typical output:
The problem is not solvable!
WOW - the LLM model was able to successfully indicate that the logical puzzle is not solvable !!!
Test Case - 3 :: Generate Python Code
In this test case, we will evaluate the code generation capability of the selected LLM model by prompting for the generation of Python code that exposes two REST based endpoints, with the second endpoint being protected by an api token.
Execute the following Python code snippet for the result:
from langchain_core.messages import AIMessage
question = """
Generate Python code that exposes two REST based endpoints - /hello and /secure-hello. The second endpoint /secure-hello must be protected by an api token. Also, create a standalone Python utility for generating api tokens. Do not make it complicated - keep it clean and simple.
"""
result = tools_agent.invoke(
    {'messages': [
        {
            'role': 'user',
            'content': question
        }
    ]}
)

agent_messages = [msg for msg in result['messages'] if isinstance(msg, AIMessage)]
if agent_messages:
    print(agent_messages[-1].content)
The following was the typical output:
Here is a clean and simple solution using the **Flask** framework.
### Prerequisites
You will need to install Flask:
```bash
pip install flask
```
### 1. The Token Generator Utility
Save this code as `token_generator.py`. This script generates a secure random token and saves it to a file named `.api_token`.
```python
import secrets
import os

def generate_token():
    # Generate a 32-byte (64 hex characters) random token
    token = secrets.token_hex(32)
    # Save the token to a file
    filename = '.api_token'
    with open(filename, 'w') as f:
        f.write(token)
    print(f"Token generated and saved to {filename}")
    print(f"Token value: {token}")

if __name__ == "__main__":
    generate_token()
```
### 2. The Flask Application
Save this code as `app.py`. It reads the token from the file created by the utility and protects the `/secure-hello` endpoint.
```python
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

def get_valid_token():
    """Reads the token from the .api_token file."""
    token_path = '.api_token'
    if not os.path.exists(token_path):
        return None
    with open(token_path, 'r') as f:
        return f.read().strip()

@app.route('/hello')
def hello():
    """Public endpoint."""
    return jsonify({"message": "Hello, World!"})

@app.route('/secure-hello')
def secure_hello():
    """Protected endpoint requiring an API token."""
    # Get the token from the 'X-API-KEY' header
    provided_token = request.headers.get('X-API-KEY')
    valid_token = get_valid_token()
    # Check if token exists and matches
    if valid_token and provided_token == valid_token:
        return jsonify({"message": "Hello, Secure World!"})
    else:
        return jsonify({"error": "Unauthorized"}), 401

if __name__ == '__main__':
    app.run(debug=True, port=5000)
```
### How to Run It
1. **Generate a Token:**
Run the utility script first to create the token file.
```bash
python token_generator.py
```
2. **Start the Server:**
Run the Flask application.
```bash
python app.py
```
3. **Test the Endpoints:**
* **Test Public Endpoint:**
```bash
curl http://127.0.0.1:5000/hello
```
*Response:* `{"message": "Hello, World!"}`
* **Test Protected Endpoint (Without Token):**
```bash
curl http://127.0.0.1:5000/secure-hello
```
*Response:* `{"error": "Unauthorized"}` (Status 401)
* **Test Protected Endpoint (With Token):**
```bash
curl -H "X-API-KEY: " http://127.0.0.1:5000/secure-hello
```
*Response:* `{"message": "Hello, Secure World!"}`
Not bad - the LLM model was able to successfully generate the Python code for the given task along with the usage steps !!!
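As a quick verification of the generated code, the two endpoints can also be exercised from Python using only the standard library. The following is a minimal sketch that assumes app.py is running locally on port 5000 and that token_generator.py has already created the .api_token file:

```python
# Exercise the generated Flask endpoints; assumes app.py is running on port 5000
# and the .api_token file was created by token_generator.py
import json
import urllib.request

BASE_URL = 'http://127.0.0.1:5000'

# Public endpoint - no token required
with urllib.request.urlopen(f'{BASE_URL}/hello') as resp:
    print(json.loads(resp.read()))

# Protected endpoint - pass the generated token via the X-API-KEY header
with open('.api_token') as f:
    token = f.read().strip()

req = urllib.request.Request(f'{BASE_URL}/secure-hello', headers={'X-API-KEY': token})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```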
Test Case - 4 :: Invoke User Supplied Tool
In this test case, we will evaluate the tool calling ability of the selected LLM model by prompting to find and display specific host system information.
Execute the following Python code snippet for the result:
import subprocess

from langchain_core.messages import ToolMessage
from langchain.tools import tool

@tool
def execute_shell_command(command: str) -> str:
    """Tool to execute shell commands

    Args:
        command (str): Shell command to execute

    Returns:
        str: Output of the command execution
    """
    print(f'Executing shell command: {command}')
    try:
        output = subprocess.run(command, shell=True, check=True, text=True, capture_output=True)
        if output.returncode != 0:
            return f'Error executing shell command - {command}'
        return output.stdout
    except subprocess.CalledProcessError as e:
        print(e)
        return f'Error executing shell command - {command}'

tools = [execute_shell_command]
question = """
Determine all the ip address(es) of the primary network interface on this Linux system!
You have access to the following tools:
{tools}
You can use these provided tools to answer the question, if necessary.
Only display the network interface and its ip address in the output - nothing more.
"""
tools_agent = create_agent(ollama_chat_llm,
                           tools,
                           system_prompt=system_prompt)

result = tools_agent.invoke(
    {'messages': [
        {
            'role': 'user',
            'content': question
        }
    ]}
)

tool_messages = [msg for msg in result['messages'] if isinstance(msg, ToolMessage)]
if tool_messages:
    print(tool_messages[-1].content)
The following was the typical output:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enp42s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether aa:7c:aa:17:aa:5a brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.25/24 brd 192.168.1.255 scope global noprefixroute enp42s0
       valid_lft forever preferred_lft forever
    inet6 aaaa::a85e:aaaa:5db3:2919/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: wlo1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether bb:c2:bb:ee:bb:73 brd ff:ff:ff:ff:ff:ff
    altname wlp41s0
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether a6:69:1b:90:41:60 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::a469:1bff:fe90:4160/64 scope link
       valid_lft forever preferred_lft forever
9: vethdc87142@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 56:c1:27:50:7e:1b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::bc8a:d6ff:fe04:d9d6/64 scope link
       valid_lft forever preferred_lft forever
Not Quite - the LLM model was NOT able to follow the specific instructions !!!
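As an aside, when debugging such tool calling runs, the user supplied tool can also be invoked directly (outside the agent) to confirm it behaves as expected. The following is a minimal sketch using the LangChain tool interface; the shell command shown is just an illustrative choice:

```python
# Invoke the user supplied tool directly (outside the agent) to verify it works;
# the 'ip -brief address show' command here is just an illustrative choice
print(execute_shell_command.invoke({'command': 'ip -brief address show'}))
```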
On average (for the given host system specs), the response from the model takes anywhere between 50 and 75 seconds !!!
Overall, the GLM-4.7-Flash model has performed very well !!!